The Textract Program
What is it?
The Textract program started out as a generalized text extractor and formatter that could handle a broad range of file types, and put text into an ideal format for creating the searchable data sets that are at the heart of Words Close Together search technology.
Reality check!
After a few months, reality set in:
- There is an untold number of different file layouts.
- Their number is growing.
- Within any one format we found that there are many different ways of achieving the same appearance -- witness the different styles of headings within any collection of web pages.
- Many of the more common file formats are proprietary; suitable documentation of their peculiarities is hard to find.
- Writing text extractors is not rocket science; it's not that hard. But it takes more time than we care to use up.
- We can serve best by advancing our specialty, which is search.
Outcome: We now know that attempting to develop expertise in text extraction is a distraction from our primary mission.
Textract is useful ...
At the same time, the Textract program makes it trivially simple to set up a wide variety of new content to make it ready or near-ready for indexing and search:
- Straight text can be made searchable within seconds.
- The most commonly used source of all, HTML files (as in web pages), is generally handled quite well.
- Rich Text Format (.RTF) files are first passed through a public HTML converter; Textract then takes the result the rest of the way to indexing-ready text. Most Rich Text Format comes through fairly well.
- Email in DBX and EML formats usually yields good text.
- Early WordPerfect files also yield reasonably consistent text.
- Adobe Acrobat PDF files which are not security protected yield text through a public HTML converter program called by Textract.
- Microsoft Word files from versions 1997 through 2003 generally come out quite well, including tables, lists, etc. It is best if headings are reviewed prior to indexing.
- Microsoft Word prior to 1997 seems to have used a variety of formats. Most such files yield reasonable text.
- Microsoft PowerPoint files 1997 through 2003 yield fairly good text, although some slides tend to be repeated. Pre-1997 PowerPoint is okay.
- Microsoft Excel files -- Text is extracted, but its only value is to tell you that various words are present in the file. None of the row headings, column headings, or words within cells are "near" each other in any meaningful way.
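To make the per-format handling above concrete, here is a minimal sketch in Python of how an extractor might choose a handler by file extension. It is an illustration only and not Textract's actual design: the names `EXTRACTORS`, `extract_plain`, `extract_html`, and `extract_text` are hypothetical, only plain text and HTML are covered, and UTF-8 input is assumed.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collects visible text from HTML, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_plain(raw: bytes) -> str:
    """Straight text needs decoding only (UTF-8 is an assumption)."""
    return raw.decode("utf-8", errors="replace")


def extract_html(raw: bytes) -> str:
    """Strip markup and return the visible text as one space-joined string."""
    parser = _TextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical dispatch table: one handler per supported extension.
EXTRACTORS = {
    ".txt": extract_plain,
    ".htm": extract_html,
    ".html": extract_html,
}


def extract_text(filename: str, raw: bytes) -> str:
    """Pick a handler by extension; fail loudly for unsupported formats."""
    suffix = Path(filename).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"no extractor registered for {suffix!r}") from None
    return extractor(raw)
```

A table of handlers keeps each format's quirks in its own function, so a format that depends on an external converter could be added as one more entry without touching the others.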
Increasing its usefulness ... Heading toward open source
We believe that Words Close Together search technology makes a significant contribution to more meaningful search in computerized text. We are making some of that technology free, in order that it may serve the needs of the widest possible constituency. Specifically, the level one reader (collections up to 3.2 million words) is free in both desktop and Internet server editions. Indexing capability for level one is being offered at the lowest price consistent with our ongoing ability to serve.
We have also decided to make the Textract program free. Please be aware that it has limitations, and includes no guarantees of any kind whatsoever. In the near term (through 2007) we are making source code available selectively. We hope to learn more of the implications of making the source code widely available under either a GNU license or the Apache Software Foundation license.
Note on patents
Since March 2004 Words Close Together search technology has been patent pending. The patent application is all about compression indexing. A "compression index" is simply a technical way of saying "searchable data set".
There is a clear boundary between:
- preparing text, and
- indexing that prepared text suitably for a particular search engine.
Technically, the Textract program could be used to prepare text for any search engine whatsoever. If we decide to commit the text extraction / preparation / formatting aspects of our work to open source, we hope in turn that open source programmers will see value in applying communal efforts to improving it. Time will tell.
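The separation between preparing text and indexing it can be sketched as two independent stages. The Python below is an assumption-laden illustration: `prepare_text`, `build_index`, and `words_near` are hypothetical names, and the index is an ordinary positional word index standing in for whatever a given search engine uses. It is not the compression indexing described in the patent application.

```python
import re
from collections import defaultdict
from typing import Dict, List


def prepare_text(raw: str) -> List[str]:
    """Stage 1 (engine-neutral): lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", raw.lower())


def build_index(words: List[str]) -> Dict[str, List[int]]:
    """Stage 2 (engine-specific): map each word to the positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(words):
        index[word].append(position)
    return dict(index)


def words_near(index: Dict[str, List[int]], first: str, second: str, max_gap: int) -> bool:
    """Return True if the two words occur within max_gap positions of each other."""
    return any(
        abs(p - q) <= max_gap
        for p in index.get(first, [])
        for q in index.get(second, [])
    )


words = prepare_text("Straight text can be made searchable within seconds.")
index = build_index(words)
print(words_near(index, "text", "searchable", 4))    # True: positions 1 and 5
print(words_near(index, "straight", "seconds", 3))   # False: positions 0 and 7
```

Because stage 1 returns nothing more than a list of words, a different engine could replace `build_index` without any change to the preparation step.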
For now, if you or your organization have C++ programming capabilities and wish to strengthen aspects of the Textract program, we are open to that possibility.
Downloading Textract
As of January 2009 this program continues to evolve. If you would like an interim executable (Windows XP), please contact us. In the subject line, include the word "Textract".
Website is for ... preparing searchable text ... using tools ... Proximity Search .com Search technologies by Marpex, Inc.