One reason to use the Textract program is its built-in
validation. The program ensures all of the following:
- All characters are printable, accents have been stripped, and
tabs have been converted to blanks.
- Every < symbol is matched by a corresponding >. There are no
orphan < or > symbols, no nested tags (tags within tags), and
no two < or > in sequence.
- The following tags are always recognized and accepted: the six levels of headings
<H1> through <H6>, paragraph <P>, lists <UL> and <OL>, table <TABLE>,
table row <TR>, table cell <TD>, blockquote (for indenting) <BLOCKQUOTE>, bold <B>,
italics <I>, underline <U>, and line break <BR>. A matching ending tag
is required for every tag except the line break.
- Additional tags are recognized if they are listed in the file HTMLTags.txt.
HTMLTags.txt is a self-documented file that you may edit to specify
which tags you wish retained from HTML input.
- All tags that require ending tags are closed in LIFO (last in,
first out) order: the most recently opened tag is closed first.
- The program warns if a blank line in input is followed by content that
starts in lower case. This may signal a need to remove
extraneous blank lines from input. It is a warning, not an error.
- Headings may contain no tags except links.
- When moving to a more junior heading (a higher heading number),
intermediate levels are not skipped. For example, an <H4> may not
be the next heading after an <H2>.
- If a <TABLE> occurs, text may be located only within table cells,
and table cells may occur only within table rows. Each table cell starts with
<TD> and ends with </TD>. Each table row starts with
<TR> and ends with </TR>. There should be the same number of
table cells within each table row. The only other tags allowed within a table
are bold, italics, and underline, in LIFO order and each finishing within the same
cell in which it starts.
- Lists start with either <UL> (unordered: a bullet before each list item) or <OL>
(ordered: sequential numbers before the successive list items). All text in a list must be
within a list item, which starts with <LI> and ends with </LI>. Indented lists may
occur within lists, up to five levels altogether. The only other tags allowed within a list
are bold, italics, and underline, in LIFO order, each finishing within the same
list item in which it starts.
- Use of ampersand codes is restricted to the following:
  - Variations of the ampersand codes for < and > (ampersand followed by lt; or gt;)
    are permitted anywhere that text content is allowed.
  - The symbol for a non-breaking space (ampersand followed by nbsp;) is permitted
    only between <TD> and </TD> to preserve an otherwise empty cell.
  - All other ampersand codes are reduced to normal print form. In
    other words, any other occurrence of '&' is a literal
    ampersand and not part of an ampersand code.
- Lengths: Headings must be under 300 characters.
Paragraphs may be any length.
- Every heading is followed either by a more junior heading
or by text. This ensures that every heading has child text.
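
To make two of the rules above concrete (ending tags in LIFO order, and no skipped heading levels), here is a minimal sketch in Python. It is not Textract's own code. The function name check_markup, the regular expression, and the wording of the messages are all assumptions made for illustration, and the sketch ignores the table, list, and ampersand-code rules.

```python
import re

# Hypothetical sketch; these names are illustrative, not part of Textract.
VOID_TAGS = {"BR"}  # the line break is the only tag with no ending tag
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")
HEADING_RE = re.compile(r"H[1-6]")


def check_markup(text):
    """Return a list of problem descriptions; an empty list means both checks passed."""
    problems = []
    stack = []      # currently open tags, most recently opened last
    last_level = 0  # level of the previous heading, 0 if none seen yet
    for match in TAG_RE.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2).upper()
        if name in VOID_TAGS:
            continue
        if not closing:
            if HEADING_RE.fullmatch(name):
                level = int(name[1])
                # Going deeper by more than one level skips an intermediate level.
                if last_level and level > last_level + 1:
                    problems.append(f"<{name}> skips a level after H{last_level}")
                last_level = level
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()  # closes the most recently opened tag: LIFO order holds
        else:
            problems.append(f"</{name}> is out of LIFO order")
    problems.extend(f"<{name}> is never closed" for name in stack)
    return problems


print(check_markup("<H2>A</H2><P>t</P><H4>B</H4><P>t</P>"))
# prints ['<H4> skips a level after H2']
```

A stack is the natural fit here because LIFO order means an ending tag is valid only when it matches the tag on top of the stack. Returning to a more senior heading (for example, <H2> after <H3>) is accepted, since the rule restricts only steps toward more junior headings.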