
The essentials

If you want to search a collection of text using Words Close Together, ensure that your collection fits the following specs. That's not as difficult as may appear; there are preparation tools that can do most or all of the work for you.

Limit the text to printable ASCII characters.

The computer keyboard that is standard in North America has four primary rows of keys. The top row consists of numeric digits and punctuation characters. The other rows begin with "QWERTY...", "ASDF...", and "ZXCV..."; each row has more punctuation characters at the right end. All these characters are printable; touch one of these keys in a word processor, and something shows on the screen. Printable characters are text. The space bar and the "Enter" key are also considered text characters.

In short, if you can type it in one keystroke, you can use it.

If you are saving text using a word processor, be sure to save it as unformatted text.

Change tabs to blanks.

The Words Close Together search engine presents its results in the form of HTML (HyperText Markup Language) pages. Tabs and multiple spaces are ignored by web browsers, which have their own methods of formatting text. The search engine produces lists of search results in brief, expanded individual results, as well as tables of contents and pages for reading. Here too tabs and extra spaces don't help. So change each tab to a blank. (The Textract program does this for you automatically.)
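
If you are not using Textract, the tab rule is easy to apply yourself. Here is a minimal sketch in Python; the function name tabs_to_blanks is our own for illustration, and it is not a description of how Textract does the job.

```python
def tabs_to_blanks(text):
    """Replace every tab character with a single blank (space)."""
    return text.replace("\t", " ")


if __name__ == "__main__":
    # Each tab becomes exactly one blank; existing blanks are left alone.
    print(repr(tabs_to_blanks("name\tvalue\t\tnote")))
```

The sketch swaps one tab for one blank and does not collapse runs of blanks, since browsers ignore the extras anyway.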

Substitute unaccented for accented characters.

English has no acutes, graves, cedillas, umlauts, or other accents. That's good for typists, typesetters, and all of us computer slaves. But there is a downside. Many English-speaking computer users don't know how to input an accented word, and those who do know have to put up with very roundabout ways of inserting the accented letters when using software designed for North American markets.

Therefore we decided to focus early releases of our search engine on languages that do not use accents. In case you don't speak Latin, Words Close Together also works well with English.

Words with accented letters creep into English text. For now, simply replace them with the unaccented form -- letter 'e' instead of e-acute, etc. Some recent word processors provide an option when saving a file as text to substitute regular letters for special characters (for example, an ordinary letter 'u' instead of a 'u' with an umlaut). The substitution feature usually works well. Some other programs (such as Textract) also make these substitutions for you.
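
If your word processor has no such option, the substitution can be sketched in a few lines of Python. This is one possible approach, assuming the text is available as a Unicode string: decompose each accented letter into a base letter plus accent marks, then drop the marks. The function name strip_accents is ours, and this is not how Textract is known to work.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their unaccented base letters."""
    # NFD splits a letter such as e-acute into 'e' plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining accent marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # \u00e9 is e-acute, \u00fc is u-umlaut.
    print(strip_accents("caf\u00e9 \u00fcber"))  # prints: cafe uber
```

Letters that are not a base letter plus an accent (for example the German sharp s) pass through unchanged and still need a manual replacement.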

Rework orphan greater than, less than symbols.

The less than symbol (<) and the greater than symbol (>) have special meaning in the HyperText Markup Language. To avoid confusion, when one of these symbols occurs in text not as part of an HTML tag, convert it to a special ampersand format -- an ampersand followed by either "gt;" for greater than, or "lt;" for less than.
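
Telling an orphan symbol from a real tag is the tricky part. The sketch below, in Python, makes a simplifying assumption of our own: a tag is a less than symbol, an optional slash, a letter, and then everything up to the next greater than symbol. Anything else is treated as an orphan. The names TAG and escape_orphans are hypothetical.

```python
import re

# Assumption: a tag looks like <b>, </b>, or <a href="...">.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def _escape(fragment):
    """Convert every < and > in a tag-free fragment to ampersand format."""
    return fragment.replace("<", "&lt;").replace(">", "&gt;")


def escape_orphans(text):
    """Escape < and > that are not part of something that looks like a tag."""
    parts = []
    pos = 0
    for match in TAG.finditer(text):
        parts.append(_escape(text[pos:match.start()]))  # text before the tag
        parts.append(match.group(0))                    # the tag, untouched
        pos = match.end()
    parts.append(_escape(text[pos:]))                   # text after the last tag
    return "".join(parts)


if __name__ == "__main__":
    print(escape_orphans("if a < b then <b>bold</b> and c > d"))
```

Text such as "x <y and z> w" would be mistaken for a tag under this assumption, so check the results by eye on unusual material.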

Replace special symbols with words.

A letter c with a circle around it looks neat when typeset. Most of us recognize it as a claim of copyright. Unfortunately, that's difficult to search for. In your collections of text, change the copyright and other such symbols to words: circle (C) to "Copyright", circle (R) to "Registered", superscript "TM" to "Trademark", etc.
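
A small lookup table is enough for this kind of replacement. The Python sketch below covers only the three symbols named above; the table name SYMBOL_WORDS and the function spell_out_symbols are our own, and you would extend the table for other symbols in your text.

```python
# The three symbols named in the text, written as escapes so that this
# example itself stays within printable ASCII.
SYMBOL_WORDS = {
    "\u00a9": "Copyright",   # circle (C)
    "\u00ae": "Registered",  # circle (R)
    "\u2122": "Trademark",   # superscript TM
}


def spell_out_symbols(text):
    """Replace each known symbol with its word equivalent."""
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    return text


if __name__ == "__main__":
    print(spell_out_symbols("\u00a9 2001 Example Corp"))
```

The sketch does not add blanks around the inserted word, so a symbol typed directly after a name will run together with it.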

When you are finished replacing accented characters and special symbol characters, there should be no bytes in the text collection with values higher than 126 (hexadecimal 0x7E).
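
To confirm that nothing slipped through, you can scan the raw bytes of a file. The following Python sketch reports the position and value of every byte above 126; the function name high_bytes is an assumption of ours, as is reading the whole file into memory.

```python
def high_bytes(data, limit=0x7E):
    """Return (offset, value) pairs for every byte greater than the limit."""
    return [(offset, value) for offset, value in enumerate(data) if value > limit]


if __name__ == "__main__":
    # In practice: data = open("collection.txt", "rb").read()
    # (the file name is only an example)
    sample = "caf\u00e9".encode("latin-1")
    for offset, value in high_bytes(sample):
        print(f"byte {value:#04x} at offset {offset}")
```

An empty result means the collection passes this particular check.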

Rework ampersand codes.

The ampersand ("and sign" or '&') has a special meaning in HyperText Markup Language. Unfortunately, it is also used commonly in English corporate names such as AT&T. To avoid confusion, translate all ampersand codes to their text equivalents, except for the codes described above that stand in for orphan greater than and less than symbols. If you use the Textract program, you will find it has a support file, Ampersand.txt, which you can refine as needed to handle replacements of ampersand codes.
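
For those working without Textract, here is a rough Python sketch of the idea. The replacement table is a hypothetical stand-in with illustrative entries of our choosing; it does not reflect the contents or format of Ampersand.txt. The codes for less than and greater than are deliberately left in place.

```python
import re

# Hypothetical table; the entries are illustrative assumptions only.
REPLACEMENTS = {
    "&nbsp;": " ",
    "&quot;": '"',
    "&copy;": "Copyright",
}

# Codes that must survive, per the orphan symbol rule.
KEEP = {"&lt;", "&gt;"}

# An ampersand code here means '&', some letters, then ';'.
ENTITY = re.compile(r"&[A-Za-z]+;")


def rework_ampersand_codes(text):
    """Replace known ampersand codes; leave &lt;, &gt; and unknown codes."""
    def swap(match):
        code = match.group(0)
        if code in KEEP:
            return code
        return REPLACEMENTS.get(code, code)
    return ENTITY.sub(swap, text)


if __name__ == "__main__":
    print(rework_ampersand_codes("a&nbsp;b &lt; c &copy; 2001"))
```

A bare ampersand, as in AT&T, is not touched by this sketch; how to spell it out is a choice for you to make in your own table.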

Keep words under 64 characters long.

The average length of words in continuous English text is about 7.3 letters. Even made-up words (example: supercalifragilisticexpialidocious) rarely get past 34 letters. A limit of some sort is appropriate. It has been set at 63 characters. Hope that's enough!
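
A quick way to find offenders is to list every word longer than 63 characters. This Python sketch assumes that words are separated by blanks and line breaks, which may differ from how the indexing software actually splits text; the names MAX_WORD_LENGTH and overlong_words are ours.

```python
MAX_WORD_LENGTH = 63  # the limit stated in the text


def overlong_words(text, limit=MAX_WORD_LENGTH):
    """Return the whitespace-separated words that exceed the limit."""
    return [word for word in text.split() if len(word) > limit]


if __name__ == "__main__":
    sample = "supercalifragilisticexpialidocious is fine, " + "x" * 64 + " is not"
    print(overlong_words(sample))
```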

Observe limit of 3.2 million words per index.

Behind the scenes, the Words Close Together indexing software divides text into blocks of 100 words each. The rationale is that words more than 100 words apart are not particularly meaningful, whereas words take on more meaning as they are found closer together. In order to facilitate search across many collections of text at a time, we have set the limit at a very geeky number -- 32,768 (2 to the 15th power) blocks of 100 words within one collection. Only if every heading and every other group of words fills its block with a full 100 words can a collection reach the absolute maximum: 32,768 times 100 equals 3,276,800 words.
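
The arithmetic can be checked with a short Python sketch. It assumes every block is completely full, so the block count it gives is a lower bound; a real collection with short headings or partly filled blocks will need more. The names used are ours, not the indexing software's.

```python
WORDS_PER_BLOCK = 100
MAX_BLOCKS = 32_768  # 2 ** 15
MAX_WORDS = WORDS_PER_BLOCK * MAX_BLOCKS  # 3,276,800


def blocks_needed(word_count):
    """Fewest blocks that can hold word_count words (all blocks full)."""
    return (word_count + WORDS_PER_BLOCK - 1) // WORDS_PER_BLOCK


def fits_in_one_index(word_count):
    """True if the collection could fit, assuming perfectly full blocks."""
    return blocks_needed(word_count) <= MAX_BLOCKS


if __name__ == "__main__":
    print(MAX_WORDS)                     # prints: 3276800
    print(blocks_needed(250))            # prints: 3
    print(fits_in_one_index(3_276_801))  # prints: False
```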


 
 
Proximity Search .com Search technologies by Marpex, Inc.