Special Problems of Electronically Searching the "Documentary Geniza"

1. Background

From the OED: genizah geni.za. Also geniza. Pl. genizoth. Heb., lit., a hiding, hiding-place, f. ganaz to set aside, hide. A store-room or repository for damaged, discarded, or heretical books and papers and sacred relics, attached to many synagogues; also, the contents of a genizah.

I will assume that anyone reading this document is familiar with the role of the Geniza in Jewish life and is well aware of the contents of the Geniza of the Ben Ezra Synagogue in Cairo. I will also assume a clear understanding of the difference between the "Documentary Geniza" and other Geniza documents from Ben Ezra of a rabbinical or aesthetic nature. On our site we work only with the "Documentary Geniza," that is to say, documents from everyday life: legal documents, personal letters, lists of building materials, bills of lading, business correspondence and the like. In short, the "Documentary Geniza" covers everything that is not a religious work or a work of art. I will also assume that it is well known that most documents are written in Judeo-Arabic - though many are in Hebrew or Aramaic - using the Hebrew alphabet with liberal intermixing of Hebrew words in the Arabic text; and, most importantly for our problem of searching the corpus, that spelling is not standardized, neither for names, nor for places, nor for commonplace words of the household.

I should add, finally, that the exploration of these documents is still at an early stage. The pioneering work of Shelomo Dov Goitein [A Mediterranean Society: The Jewish Communities of the Arab World as Portrayed in the Documents of the Cairo Geniza (1967-1993)] can be considered the first systematic attempt to get at the history manifest in these documents. The work at Princeton University to put the "Documentary Geniza" on-line, under the leadership of Prof. Mark Cohen [Under Crescent and Cross: The Jews in the Middle Ages (1994)], aims to make electronic tools for the study of these texts available to the community of scholars.

The first step in the process was to create transcriptions of the published texts. To date about 3000 documents are in our archive (a preliminary list of our transcriptions is available). Our plans are to continue adding transcriptions from Prof. Goitein's unpublished notes, from original paleography done at Princeton, and from the work of Geniza scholars from around the world.

The second step is to make electronic searching of the files possible. Recent advances in the web and in the easy access to Hebrew fonts across various computing platforms have made that possible. Collateral opportunities arise from an indexed/searchable text archive: the possibility of creating wordlists, and beyond wordlists, creating lists of names and places as well as ordering the semantic information into appropriate categories. Essentially we hope to build the foundation for a dictionary/encyclopedia of medieval life as reflected in the "Documentary Geniza."
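The wordlist idea can be made concrete with a minimal sketch. It assumes the transcriptions are available as plain text keyed by shelfmark and that splitting on whitespace is an adequate tokenizer; the function name `build_wordlist` and the sample data are hypothetical and are not taken from the actual Princeton software.

```python
from collections import Counter

def build_wordlist(transcriptions):
    """Return (word, count) pairs sorted by word.

    `transcriptions` maps a shelfmark to its transcribed text.
    Tokens are split on whitespace only and no spelling is normalized,
    so every variant spelling and every fragment keeps its own entry.
    """
    counts = Counter()
    for text in transcriptions.values():
        counts.update(text.split())
    # sorted() orders by Unicode code point, which is only an
    # approximation of a proper alphabetical order.
    return sorted(counts.items())

# Hypothetical transliterated sample, not real Geniza text.
sample = {
    "doc-A": "abu ibrahim ben avraham",
    "doc-B": "abu avraham",
}

for word, count in build_wordlist(sample):
    print(count, word)
```

Because nothing is normalized, "ibrahim" and "avraham" remain separate entries; grouping them is left to the scholar reading the list.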

A third step is the linking of images of the original documents to our transcriptions. This will give scholars a chance to check the transcriptions and gather additional clues from the shape of the writing and the appearance of the document.

We hope that by following this simple plan we shall be able to finish the exacting philological presentation of these documents at some time in the not too distant future and turn them over to historians with various specializations - other than Judeo-Arabic philology - to write about life in the Mediterranean in the early Middle Ages.

2. List-based Searching and the Geniza

Given the nature of the texts of the "Documentary Geniza" (discarded household documents and other scraps of paper), and given that they have no standardized bibliographical provenance, generally no date, no recognized author, no clear place of authorship, and given that there is no orthographic standardization, "searching" takes on the meaning of looking for something lost or concealed and not the current library usage of "finding" something a bibliographer has put in the correct place. This applies especially to the word fragments that appear throughout our transcriptions caused by holes in the original.

Thus it became obvious that the standard search-engine practice of typing keywords would not work for our documents. However, using a list of the lexical material of the documents as the central input device for the search engine seemed a workable strategy for overcoming several problems. We solve the technical problem of typing right to left in Hebrew characters on various computer platforms. We solve the problem of making the user aware of word fragments. And finally we solve the perennial problem with electronic searching - what to type into the little window. This is the skeleton hanging in the closet of most search engines attached to commercial electronic texts. Yes, there is a search button; yes, there is a complex form for submitting boolean queries; but no - I generally don't frame questions in that form - who but a designer of query systems does?

By using alphabetized wordlists we remove that nagging question and let scholars find the thematic material that really interests them. They browse the list to find likely candidates, and they rely on their knowledge of the material to construct lists of items to search in relatively short order. Let me try to make this problem clear by means of an anecdote. Imagine a patisserie in Paris which consists of a bare room with a little window in the back, a cash register and a keyboard. To get your pastries you have to type, without an error: "2 mille feuille et 2 pain au chocolat" and press [enter]. If you did not make a mistake, the little window will open and your pastries will be handed to the clerk, who will take your money. The only unrealistic aspect of this anecdote, in analogy to commercial search engines, is that generally your institution will already have paid handsomely for more than all the pastries you could type in for a year. But don't worry about the "mille feuille;" the collective wisdom of pastry chefs since the time of Vercingetorix has developed a technique of arranging the various pastries in a vitrine facing the customers. The customer can just point and say "two of these and two of these" without necessarily having to be able to spell the names of those gorgeous little things. This is the approach we have taken with the "Geniza Browser."

Electronic search engines, be they in Hebrew or in English, have one thing in common: they depend on standard orthography and on the notion that a keyword will find the appropriate information. Thus if I am looking for a "Bed and Breakfast" place in Colorado, I can try various strategies: type "Bed and Breakfast" into Alta Vista and see if anything turns up about Colorado, or go to Yahoo, find Colorado and type in "Bed and Breakfast" there. In each case, I know what I am looking for and I expect someone to have arranged the information so that I can get it with a couple of keystrokes.

One could say the same applies for searching the plays of Shakespeare, although here the case is weaker. Those who provide "searchable" literary texts often assume detailed knowledge of the text. Or, to put it uncharitably, the implementers of search engines - some costing several thousand dollars a year - define their task as providing only the opportunity and the technical means to search: "See, it found the word 'unhelpful' in the entire corpus. The search obviously works. G'day." A case in point: I may know that I am interested in the use of "money" in Shakespeare's plays, and I will be able to get some hits even by typing in "money" - but I will never know if I have done an exhaustive search until I have organized all the ducats and doubloons alongside the generic gold and silver. And that requires inspecting every one of the 16,000 unique lexical items to be found in the 47 plays. And unless I perform that task with the integrity and exactness expected of philological work, I will never discover that there is a radically different approach to money in the comedies and the histories, not an unexpected result.

Here most search engines will abandon the user, even expensive commercial products with scholarly content which are sold or leased to universities. The Patrologia Latina, the poetry databases and the various editions of Shakespeare, for example, provide only a single-line search window for packing elaborate queries. (There are some notable exceptions to this sad state of affairs, which means the situation is not without hope for improvement.) While the boolean capabilities of the engines are generally impressive, they are of little help to a scholar and even less help to the student. The assumption may be that you should take a printed concordance off the shelf, construct a query string and then type that into the little one-line window. Let's assume I am from Elizabethan England and I want to spend a quiet few weeks in the mountains, but I have never heard of either Colorado or a "Bed and Breakfast." At that point our modern searching technology, with all its pragmatic assumptions, will leave me in the lurch, staying at the Pink Flamingo in Newark.

Perhaps we have not clearly differentiated "searching" well-ordered data, such as an electronic card-catalog, from the exploration of semantic fields in a work of literature; or, as in our case, from the exploration of hand-written documents of everyday life - with holes and smudges - rescued from the dustbin of history by an absence of moisture in Cairo, that want to tell us about life a thousand years ago. The embarrassingly weak theoretical basis for much of contemporary text markup jargon is probably the culprit and also explains why most practicing literary scholars have not embraced this technology. It may also be that the commercial publishers simply cannot find room in their budgets for developing really useful search interfaces. It should be obvious to all but the professional SGML consultants that the technology happily used by insurance companies to track customers and by the Federal Government to process grant applications will not easily serve scholars working with difficult texts.

Searching, then, in all but the most trivial cases, becomes a matter of systematic study of a semantic category and of the morphology of a language; the wordlists we provide as part of the "Geniza Browser" give the scholar that opportunity. Only after all the members of a semantic field and all morphological forms of those words have been identified does it make sense to submit a query to the search engine. Typing random words into a search window is a reprehensible practice, not worthy of high-school seniors desperate for ideas for a paper on Moby Dick, and it has given the whole field of electronic text research a bad name.

Fortunately there are some search engines that employ a complete morphological analysis before doing a query - so we are not without examples of how to do it right. Also, there are some PC- and Mac-based search engines that use word-wheels to allow relatively quick access to all the lexical strings arranged in alphabetical order, and then allow an easy way to submit all the selected words as a batch to the engine. Our search engine is patterned after the word-wheel model, with the one exception that there is nothing easy about our interface.

At present we depend on a primitive - but effective - cut and paste technique to construct the query string. While the mouse-swipes and cut and paste keystrokes seem strange at first, they quickly become routine. In addition, the notion of editing the search text-box and keeping queries in a word-processing format allows and even encourages repeated iterations of queries that can be refined.

At present there is no facility for submitting multiple lists with boolean operators. While this is a favorite with the logic buffs, it does not do much for the textual scholar. At some point, if it becomes obvious from the work of practicing Geniza scholars that this would be a real enhancement, the additional programming and interface design can be undertaken. This test-bed is just a first step. The big innovation is that a researcher (trained in this field) will be able to start - empirically - from a list of the extant words in the documents. One will be able to find all the spellings of a name in its Arabic and Hebrew forms; one will be able to go through the lists and, based on knowledge of the language and the vocabulary of the time, extract strings concealed in morphological structures.
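A batch query of this kind, a single list of selected spellings with no boolean operators, can be sketched as a lookup in an inverted index. This is an illustration under stated assumptions and not the engine behind the "Geniza Browser": the names `build_index` and `get_docs`, the whitespace tokenizer, the exact-match lookup and the sample data are all hypothetical.

```python
from collections import defaultdict

def build_index(transcriptions):
    """Map each word to the set of shelfmarks whose text contains it."""
    index = defaultdict(set)
    for shelfmark, text in transcriptions.items():
        for token in text.split():
            index[token].add(shelfmark)
    return index

def get_docs(index, words):
    """Return shelfmarks containing ANY of the words, sorted by shelfmark."""
    hits = set()
    for word in words:
        # .get() avoids creating empty entries for unknown words.
        hits |= index.get(word, set())
    return sorted(hits)

# Hypothetical transliterated data; shelfmarks and texts are invented.
docs = {
    "doc-C": "avraham ben yusuf",
    "doc-A": "abu ibrahim",
    "doc-B": "abu yusuf",
}
index = build_index(docs)
print(get_docs(index, ["ibrahim", "avraham"]))  # ['doc-A', 'doc-C']
```

The hits for both spellings come back as one list ordered by shelfmark, so the variants are intermixed rather than grouped by spelling.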

3. Sample Queries

1. Ibrahim and Avraham are easy since they are both under "aleph" next to each other. Go to the "alephs," scroll down half a screen and highlight both lines. When the lines are highlighted execute the copy function. Then set your cursor in the textbox by clicking in the box and execute the paste function. Your textbox should look like this:

14 םיהרבא
58 םהרבא

Click on the "getdocs" button and you will get all 72 occurrences of "Abraham." NB. they will be intermixed based on the shelfmark.

2. If you are looking for "abu" you will also want to add ve-Abu to the query. First, go to the "alephs" and pick up the 91 "abu's" (91 ובא); make sure the line is flush in the upper left-hand corner of the text box. Then go to the "vavs" and pick up the 5 ve-Abu's (5 ובאו). Your textbox should look like this:

91 ובא
5 ובאו

Click on the "getdocs" button and you will get all 96 occurrences of "abu." NB. they will be intermixed based on the shelfmark.

3. The same principle applies to searching "ben" and "ibn." Your textbox should look like this:

233 ןב
33 ןבא

4. A variation of this technique is searching out the same word in Arabic and Hebrew. For example, ocean or sea: yam and bahr. Your textbox should look like this:

12 םי
2 דחב
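The textbox contents in the samples above follow a simple pattern: one entry per line, a frequency count followed by the word. A minimal sketch of reading such a textbox and totalling the expected occurrences is given here. It assumes that a single space separates count and word and that blank lines can be ignored; how the actual "getdocs" program parses its input is not documented here, and the function name `parse_textbox` is hypothetical. The Hebrew strings are copied as they appear in the samples.

```python
def parse_textbox(text):
    """Parse lines of the form '<count> <word>' into (count, word) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from pasting
        count, word = line.split(maxsplit=1)
        entries.append((int(count), word))
    return entries

# The "abu" / "ve-Abu" query from sample 2, pasted as in the textbox.
textbox = """
91 ובא
5 ובאו
"""

entries = parse_textbox(textbox)
total = sum(count for count, _ in entries)
print(total)  # 96
```

The total agrees with the 96 occurrences reported for the "abu" query, which gives a quick check that no line was lost while cutting and pasting.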

4. Conclusion

One final note of caution - this tool has been specifically designed for scholars of the "Documentary Geniza" from the Ben Ezra Synagogue. It is not intended to be a commercial product and should not be approached with the expectation of a lot of "whiz-bang." Rather, it is intended to facilitate the excruciating task of collecting references from many documents. It is not intended to present you with any interpretation of the text - that is still up to you, the scholar. Success in using the search engine will depend on your detailed knowledge of the vocabulary of the texts. However, if you have studied the vocabulary of a set of documents, you will be able to find documents with similar themes by collecting the vocabulary into the text box and then inspecting the hitlist in painstaking detail. As cumbersome as the interface might seem, it is still superior to the present method of collecting notes from a reading of the documents. We hope to put the word lists of our transcribed documents at the center of attention. We hope it will be possible for Geniza scholars to increase the number and the range of documents they look at. Eventually we hope to use Goitein's index as the basis of the searching, yet there are still many hundreds of hours of scholarship and programming before that goal can be attained. For all its shortcomings, this tool will speed the process.