Tools for Preparing Searchable Text

HTML Converters

An HTML converter is a computer program that reads the content of a particular type of file and outputs a HyperText Markup Language (HTML) equivalent of that content. Many thousands of such products are available on the Internet, for almost any file format that you can think of (PDF = Portable Document Format, RTF = Rich Text Format, DOC = Microsoft Word, etc.). These programs vary tremendously in features, quality of output, flexibility, and price. HTML converters are useful here as an interim step toward creating a Words Close Together searchable data set. Even if the quality of the HTML produced by a converter is poor, the Textract program can usually bring the text into shape for indexing and search.

Example: Suppose you or the people you serve are frustrated by what passes as search capability in the help system of a large computer program. You search for a combination of four words. You are presented with 388 hits, but you find that the second, third, and fourth hits are each missing one of the words you requested. (This is not dreamed up! If you want to know the search term and the identity of the guilty program, ask. The tragedy is that this was from a company that prides itself on its search capability.) In our case, the solution to this frustration was to purchase a copy of the ABC Amber HxS Converter v. 2.01, which extracted the text content as HTML from the HXS file format. That gave us the base from which to build a searchable set of helps.
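The failure described in the example can be checked mechanically once the help text has been extracted. The following is a minimal sketch, not the method used by any particular search product: `words_in` and `missing_words` are hypothetical helper names, and the definition of a "word" (runs of letters, digits, and apostrophes, ignoring case) is an assumption made for illustration.

```python
import re

# Assumption: a "word" is a run of letters, digits, or apostrophes.
WORD_PATTERN = re.compile(r"[a-z0-9']+")


def words_in(text):
    """Return the set of lower-cased words found in text."""
    return set(WORD_PATTERN.findall(text.lower()))


def missing_words(query, hit_text):
    """Return the query words, in query order, that hit_text does not contain."""
    present = words_in(hit_text)
    return [word for word in WORD_PATTERN.findall(query.lower())
            if word not in present]


if __name__ == "__main__":
    query = "save file as text"
    hit = "Open the file as text"
    print(missing_words(query, hit))
```

A hit for an all-words search should give an empty list. Any hit that gives a non-empty list is one of the faulty results described above.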

If you have large quantities of content that you wish to be made searchable, and they are in a format from which it is difficult otherwise to extract text, consider purchasing an HTML converter.

Text Extractors

Text is the common factor in all full text search. People want to be able to find content. The various computer programs that enable people to input and maintain their data normally wrap that data within all manner of codes, signals, and what have you that make the data easy to show in specialized ways. Therefore, in order to access content that is held in a variety of different formats, people turn to text extractors as a means to bring their data to one common format -- straight text.

Reverse engineering most computer programs is prohibited. Typically, it is a breach of contract, violating the terms of the End User License Agreement (EULA), that stultifying mountain of legal-sounding words that few people take the time to read. Reverse engineering the files produced by those computer programs appears to be seen in an entirely different light. The software producer owns the program; however, the software producer does not own the files that the user creates with the program. Based on this distinction, an entire industry has sprung up, offering text extraction from hundreds of different file formats. The more professional text extraction programs tend to be pricey. There is, however, a growing number of free, open source programs that do the job ... sometimes well, sometimes poorly.

The drawback with text extractors is that you can lose information that would make the presentation clearer. Poorer-quality text extractors run paragraphs together, mess up the treatment of hyphens, and substitute a single hyphen for a long dash. Whether to use a text extractor or an HTML converter depends on your data and the format from which you are trying to derive searchable content.
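To make the paragraph problem concrete, here is a minimal sketch of a text extractor for HTML input, built on Python's standard `html.parser` module. It is an illustration only and is not how Textract or any commercial extractor works. The class name `ParagraphTextExtractor`, the function `extract_text`, and the choice of block-level tags are assumptions. The sketch keeps paragraphs apart by starting a new paragraph at each block-level tag, and it leaves hyphens and dashes exactly as the source has them.

```python
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (an illustrative, incomplete set).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose content is never shown to the reader.
SKIP_TAGS = {"script", "style"}


class ParagraphTextExtractor(HTMLParser):
    """Collect visible text, starting a new paragraph at each block tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &mdash; into characters.
        super().__init__(convert_charrefs=True)
        self._paragraphs = [[]]
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._paragraphs.append([])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._paragraphs.append([])

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._paragraphs[-1].append(data)

    def text(self):
        # Collapse whitespace inside each paragraph and drop empty paragraphs.
        cleaned = (" ".join("".join(parts).split()) for parts in self._paragraphs)
        return "\n\n".join(p for p in cleaned if p)


def extract_text(html):
    """Return the visible text of html with one blank line between paragraphs."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>First para&mdash;with a dash.</p><p>Second   para.</p>"
    print(extract_text(sample))
```

An extractor that ignored the block tags would return the heading and both paragraphs as a single run of text, which is the "paragraphs run together" fault described above.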

Custom Filters

Information technology departments with programming resources may prefer to write custom software to filter the organization's data from its various file formats. The upside factors are control, automated handling of large quantities of data, and one-stop processing. The downside: cost in time and/or money.
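As a rough illustration of what one-stop processing might look like, the sketch below walks a directory, picks a filter function by file extension, and writes a plain-text copy of every file it knows how to handle. Everything here is hypothetical: the `FILTERS` registry, the function names, and the two sample formats (plain text and CSV) are assumptions chosen because the Python standard library can read them without extra software. A real in-house filter would register its own formats.

```python
import csv
from pathlib import Path


def filter_txt(path):
    """Plain text needs no conversion, only decoding."""
    return path.read_text(encoding="utf-8", errors="replace")


def filter_csv(path):
    """Turn each CSV row into one line of space-separated cell text."""
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        return "\n".join(" ".join(row) for row in csv.reader(handle))


# Hypothetical registry: file extension -> filter function.
FILTERS = {".txt": filter_txt, ".csv": filter_csv}


def filter_tree(source_dir, output_dir):
    """Write a .txt copy of every supported file; return the files skipped.

    Assumes output_dir is not inside source_dir and that file names are
    unique across subdirectories (later files would overwrite earlier ones).
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    skipped = []
    for path in sorted(source_dir.rglob("*")):
        if not path.is_file():
            continue
        handler = FILTERS.get(path.suffix.lower())
        if handler is None:
            skipped.append(path)  # no filter registered for this format
            continue
        target = output_dir / (path.name + ".txt")
        target.write_text(handler(path), encoding="utf-8")
    return skipped
```

Returning the skipped files rather than silently ignoring them gives the department a running list of formats that still need a filter.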

An organization that collects its data in one consistent layout over an extended period of time might be justified in having an outsider supply custom software to prepare its data. The key word is "consistent". If you publish a periodical for which each new editor feels compelled to change the layout, that might be a problem.

Word Processors

Word processors (WordPerfect, Microsoft Word, etc.) can be used effectively to prepare many kinds of data. If you are willing to learn a bit about basic HTML tags, you can type them in yourself to build in headings, lists, tables, and other formatting. One precaution with word processors: always save the content as plain text, and never as the processor's own HTML export or in its preferred format.


There is another way, possibly simpler, possibly more effective ... the Textract program.


Proximity Search .com Search technologies by Marpex, Inc.