
Background and History of the Project (1994-1997)

 

In 1994, the new computer consultant to the humanities at Princeton University, Dr. Peter Batke, proposed a solution to the "search engine" dilemma. Given the project's history in the DOS environment, Prof. Cohen's ongoing work with the NB indexing option, and the problem of the machine hanging on large files, it made sense to look for a DOS-based tool of roughly the same vintage that was adept at handling right-to-left Hebrew text and could also handle a corpus of more than 10 MB.

WordCruncher 4.5 has the advantage that it indexes large corpora in relatively small memory regions. WordCruncher separates the function of "indexing" from the function of "retrieval." "Indexing" is done only once for a given corpus and can take a considerable amount of time, depending on the size of the corpus and the speed of the machine. An indexing job of all the Geniza transcriptions may take 2 hours to run on a 1985-vintage IBM PC XT; the same job would run in 8 minutes on a modern Pentium. "Retrieval" is based on the files created during the "indexing" phase and is nearly instantaneous, largely independent of the speed of the machine or the size of a file. Thus, "retrieval" on an IBM PC XT is practically as fast as on a modern Pentium, because retrieval uses only "index-searching," which requires no CPU-intensive processing.
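The split between a slow, one-time "indexing" pass and a fast "retrieval" step can be illustrated with a minimal inverted-index sketch in Python. This is not WordCruncher's actual algorithm or file format, which are not described here; the function names `build_index` and `retrieve` and the two sample "documents" are hypothetical and serve only as illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """One-time pass over the whole corpus (the slow "indexing" phase).

    Maps each word to a list of (document id, word position) pairs.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((doc_id, position))
    return dict(index)


def retrieve(index, word):
    """The fast "retrieval" phase: a single lookup in the prebuilt index.

    No document text is scanned at this point.
    """
    return index.get(word.lower(), [])


# Hypothetical sample texts, not taken from the Geniza corpus.
documents = {
    "doc1": "the merchant sent a letter to the judge",
    "doc2": "a letter from the judge",
}

index = build_index(documents)      # done once per corpus
hits = retrieve(index, "letter")    # repeated as often as needed
print(hits)
```

Because `retrieve` only consults the prebuilt table, its cost depends on the lookup rather than on rereading the corpus, which is the property the paragraph above attributes to "index-searching."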

Another advantage of WordCruncher is that the index files can be distributed via FTP or on a CD-ROM independently of the retrieval program. The retrieval program will work on any PC, no matter how old, provided it has enough disk space to hold the files.

WordCruncher was designed as a sophisticated indexing, search, and retrieval program for MS-DOS-based computers, before the era of large RAM and the Windows interface. It was developed at Brigham Young University for large textual corpora such as the collected works of Shakespeare, the Bible, and the Book of Mormon. WordCruncher was also designed to handle files in Hebrew characters and to interface with the Duke Language Toolkit.

Furthermore, and very importantly, WordCruncher has an interactive feature (a "word-wheel") which displays all the words in the database alphabetically. Since the spelling of the Judaeo-Arabic Geniza documents is very inconsistent, this feature makes it possible for a user to search for and retrieve terms or passages that might otherwise be missed. At the same time, the interactive feature reveals keywords that the user might not expect to find in the corpus, thus considerably increasing the usefulness of the database.
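The idea behind the "word-wheel" can be sketched as a sorted list of every distinct word in the index, which the user browses from a chosen starting point. The Python sketch below is an assumption-laden illustration, not WordCruncher's implementation: the names `build_word_wheel` and `browse` are hypothetical, the transliterated sample words are invented placeholders rather than forms from the corpus, and Python's default code-point ordering stands in for whatever collation the real program uses.

```python
import bisect


def build_word_wheel(index_words):
    """Return every distinct word once, in sorted order."""
    return sorted(set(index_words))


def browse(wheel, prefix, limit=10):
    """List up to `limit` neighbouring entries that start with `prefix`.

    bisect_left jumps to the first entry that is >= prefix, so variant
    spellings that share the prefix appear next to each other.
    """
    start = bisect.bisect_left(wheel, prefix)
    matches = []
    for word in wheel[start:]:
        if not word.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(word)
    return matches


# Invented placeholder spellings, for illustration only.
wheel = build_word_wheel(
    ["dirham", "dinar", "letter", "dinnar", "dinar", "dinars"]
)
print(browse(wheel, "din"))
```

Seeing adjacent entries side by side is what lets a user notice inconsistent spellings of the same term and include them in a search.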

The revival of the project in 1994 was greatly aided by a grant of equipment and student labor from the Department of Near Eastern Studies. The work plan for the revival drew heavily on experience Dr. Batke had gained working with text archives he had built at Duke University and at Johns Hopkins. His work as one of the designers of the Duke Language Toolkit made it easy for him to grasp the rather involved procedures for entering and editing texts with right-to-left display and Hebrew screen and printer fonts. This is a rather esoteric area of DOS computing.

Starting in the fall of 1994, Dr. Batke designed a work plan to 1. demonstrate the technical feasibility of right-to-left indexing with WordCruncher; 2. consolidate the individual transcription files into large files encompassing entire collections; 3. design a markup that would retrieve the crucial shelf-information of a document and would be flexible enough not to preclude other markup schemes.

A series of consultations between Prof. Cohen and Dr. Batke yielded a working prototype of some 200 "documents" in the provisional corpus that are housed in the Bodleian Library, Oxford.

On the basis of this working prototype, the Department of Near Eastern Studies allocated $6,000 to purchase two new 486 PCs and to hire a post-doc with expertise in the Geniza to assemble the collections.

In June 1995, Dr. Hassan Khalilieh was hired to start work on the corpus. His tasks included marking up the shelf-information, standardizing the "document descriptions," and separating the Hebrew text from the English descriptions and formatting information. In addition, Dr. Khalilieh assigned each "document" a general category based on its content.

At present (June 1996), all 2300 texts are being coded for indexing in WordCruncher. Additional proofreading will then be done using the word-wheel. Soon thereafter, a CD-ROM will be prepared, including the provisional corpus plus the search-and-retrieval software. After trials at Princeton, the package will be made available to scholars and libraries. As the corpus is enlarged in subsequent years (subject to adequate funding), the CD-ROM will be updated periodically.

Looking down the road, another desideratum involves creating digitized images of the actual documents so users can compare transcriptions with originals on their monitors, and even recommend corrections to be incorporated into the database. A parallel goal entails making retrieval of text and of digitized images accessible on the World Wide Web.