1. Background
From the OED: "genizah. Also geniza.
Pl. genizoth. Heb., lit., a hiding, hiding-place,
f. ganaz to set aside, hide. A store-room or
repository for damaged, discarded, or heretical
books and papers and sacred relics, attached to
many synagogues; also, the contents of a genizah."
I will assume that anyone reading this document is
familiar with the role of the
Geniza in Jewish
life and is well aware of the contents of the
Geniza of the
Ben Ezra Synagogue in Cairo. I will
also assume a clear understanding of the
difference between the "Documentary Geniza" and
other Geniza documents from Ben Ezra of a
rabbinical or aesthetic nature. On our site we
work only with the "Documentary Geniza" - that is
to say, documents from everyday life: legal documents,
personal letters, lists of building materials,
bills of lading, business correspondence and the like.
In short, the "Documentary Geniza" covers everything
that is not a religious work or a work of art. I
will also assume that it is well known that most
documents are written in Judeo-Arabic
- though many are in Hebrew or Aramaic -
using the
Hebrew alphabet with liberal intermixing of Hebrew
words in the Arabic text; and, most importantly
for our problem of searching the corpus, spelling
is not standardized: not for names, not for places,
and not for commonplace words of the household.
I should add, finally, that the exploration of these
documents is still at an early stage. The
pioneering work of
Shelomo Dov Goitein [A Mediterranean
Society: The Jewish Communities of the Arab World
as Portrayed in the Documents of the Cairo
Geniza (1967-1993)] can be considered the first systematic
attempt to get at the history manifest in these
documents. The work to put the
"Documentary Geniza" on-line, under the leadership
of Prof. Mark Cohen [Under Crescent and Cross: The Jews in the Middle Ages (1994)]
of Princeton University, aims
to make electronic tools for the study of these
texts available to the community of scholars.
The first step in the process was to create
transcriptions of the published texts. To date
about 3000 documents are in our
archive (a preliminary list of our
transcriptions is available). Our plans are to continue
adding transcriptions, from Prof. Goitein's
unpublished notes, from original paleography done
at Princeton, as well as from the work of Geniza
scholars from around the world.
The second step is to make electronic searching of
the files possible. Recent advances in the web and
in the easy access to Hebrew fonts across various
computing platforms have made that possible.
Collateral opportunities arise from an
indexed/searchable text archive: the possibility
of creating wordlists, and beyond wordlists,
creating lists of names and places as well as
ordering the semantic information into appropriate
categories. Essentially we hope to build the
foundation for a dictionary/encyclopedia of
medieval life as reflected in the "Documentary
Geniza."
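To make the idea of a wordlist drawn from an indexed archive concrete, here
is a minimal sketch in Python. It should not be read as the implementation
behind the Geniza Browser: the function names (build_index, wordlist), the
document ids and the transliterated placeholder tokens are all assumptions
made up for illustration. The sketch splits each transcription on whitespace,
records which documents contain each distinct form, and returns the forms in
sorted order.

```python
from collections import defaultdict


def build_index(transcriptions):
    """Map every distinct word form to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in transcriptions.items():
        for token in text.split():
            index[token].add(doc_id)
    return dict(index)


def wordlist(index):
    """Return the distinct word forms sorted by Unicode code point."""
    return sorted(index)


# Hypothetical toy data: transliterated placeholders, not real Geniza text.
transcriptions = {
    "doc1": "ktab ila abrhm",
    "doc2": "ktab min abrahim",
    "doc3": "dinar wa abrhm",
}

index = build_index(transcriptions)
words = wordlist(index)
```

Sorting by code point is only an approximation of alphabetical order for
Hebrew script, since the final letter forms have code points of their own;
a real wordlist would need a collation rule chosen by its editors.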
A third step is the linking of images of the
original documents to our transcriptions. This
will give scholars a chance to check the
transcriptions and gather additional clues from
the shape of the writing and the appearance of the
document.
We hope that by following this simple plan we
shall be able to finish the exacting philological
presentation of these documents at some time in
the not too distant future and turn them over to
historians with various specializations - other
than Judeo-Arabic philology - to write about life
in the Mediterranean in the early Middle Ages.
2. List-based Searching and the Geniza
Given the nature of the texts of the "Documentary Geniza" (discarded
household documents and other scraps of paper), and given
that they have no standardized
bibliographical provenance, generally no date, no
recognized author, no clear place of authorship,
and given that there is no orthographic
standardization, "searching" takes on the meaning
of looking for something lost or concealed and not
the current library usage of "finding" something a
bibliographer has put in the correct place. This applies especially
to the word fragments, caused by holes in the originals,
that appear throughout our transcriptions.
Thus it became obvious that the standard search-engine
practice of typing keywords would not work for our documents.
However, using a list of
the lexical material of the documents as the central
input device for the search engine
seemed a workable strategy
for overcoming several problems. We solve the technical
problem of typing right to left in Hebrew characters
on various computer platforms. We solve the problem of making
the user aware of word fragments. And finally we solve
the perennial problem with electronic searching -
what to type into the little window. This is the
skeleton hanging in the closet of most
search engines attached to commercial electronic
texts. Yes, there is a search button; yes, there is
a complex form for submitting boolean queries; but
no - I generally don't frame questions in that form -
who but a designer of query systems does?
By using alphabetized wordlists we remove that nagging question
and let scholars find the thematic material that really interests them.
They browse the list to find likely candidates and they rely on their knowledge of
the material to construct lists of items to search in relatively short order.
Let me try to make this problem clear by means of an anecdote.
Imagine a patisserie in Paris which consists of a bare room
with a little window in the back, a cash register and a keyboard.
To get your pastries you have to type, without an error: "2 mille-feuilles
et 2 pains au chocolat" and press [enter].
If you did not make a mistake, the little window will
open and your pastries will be handed to the clerk, who will take your money.
The only unrealistic aspect of this anecdote,
as an analogy to commercial search engines, is that generally your institution
will already have paid handsomely for more than all the pastries you could type in
for a year. But don't worry about the "mille-feuille";
the collective wisdom of pastry chefs since the time of
Vercingetorix has developed a technique of arranging the
various pastries in a vitrine facing the customers.
The customer can just point and say - two of these
and two of those - without necessarily having to be
able to spell the names of those gorgeous little things.
This is the approach we have taken with the "Geniza Browser."
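The "point at the vitrine" idea can be sketched in a few lines of Python.
This is an illustration under stated assumptions, not the code of the
"Geniza Browser": the function name search_selected, the shape of the index
(word form mapped to a set of document ids) and the placeholder tokens are
all hypothetical. The user never types a word; the forms picked from the
displayed list are looked up as a batch.

```python
def search_selected(index, selected):
    """Return {doc_id: [matched forms]} for every document that contains
    at least one of the word forms the user picked from the list."""
    hits = {}
    for form in selected:
        # Forms missing from the index simply contribute no hits.
        for doc_id in index.get(form, ()):
            hits.setdefault(doc_id, []).append(form)
    return hits


# Hypothetical index: transliterated placeholders, not real Geniza data.
index = {
    "abrahim": {"doc2"},
    "abrhm": {"doc1", "doc3"},
    "dinar": {"doc3"},
}

# The user points at two spellings in the list instead of typing them.
selected = ["abrahim", "abrhm"]
hits = search_selected(index, selected)
```

Because the selection comes from the list itself, every selected form is
guaranteed to occur somewhere in the archive, and each hit records which
spelling produced it.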
Electronic search engines, be they in Hebrew or in
English, have one thing in common: they depend on
standard orthography and the notion that a keyword
will find the appropriate information. Thus if I
am looking for a "Bed and Breakfast" place in
Colorado, I can try various strategies: type "Bed
and Breakfast" into Alta Vista and see if anything
turns up about Colorado. Or I can go to Yahoo,
find Colorado and type in "Bed and Breakfast"
there. In each case, I know what I am looking for
and I expect someone to have arranged the
information so that I can get it with a couple of
keystrokes.
One could say the same applies to searching the
plays of Shakespeare, although here the case is
weaker. Those who provide "searchable" literary
texts often assume detailed knowledge of the text.
Or, to put it uncharitably, the implementers of
search engines - some costing several thousand
dollars a year - define their task as providing
only the opportunity and the technical means to
search: "See, it found the word 'unhelpful' in the
entire corpus. The search obviously works. G'day."
A case in point: I may know that I am interested in
the use of "money" in Shakespeare's plays, and I will be
able to get some hits even by typing in "money" -
but I will never know if I have done an exhaustive
search until I have organized all the ducats and
doubloons alongside the generic gold
and silver. And that requires inspecting every one
of the 16,000 unique lexical items to be found in
the 47 plays. And unless I perform that task
with the integrity and exactness expected of
philological work, I will never discover that
there is a radically different approach to money in the comedies
than in the histories, not an unexpected result.
Here most search engines will abandon the user,
even expensive commercial products with scholarly
content which are sold or leased to
universities. The Patrologia Latina,
the Poetry databases and the various editions of
Shakespeare, for example, provide only a single-line
search window for packing elaborate queries.
(There are some notable exceptions to this sad state
of affairs, which means the situation is not without
hope for improvement.)
While the boolean capabilities of the engines are
generally impressive,
they are of little help to a scholar and even less help
to the student.
The assumption may be that you should take a
printed concordance off the shelf, construct a
query string and then type that into the little
one-line window. Let's assume I am from
Elizabethan England and I want to spend a quiet
few weeks in the mountains, but I have never heard of either
Colorado or a "Bed and Breakfast." At that point
our modern searching technology, with all its
pragmatic assumptions, will leave me in the lurch,
staying at the Pink Flamingo in Newark.
Perhaps we have not clearly differentiated
"searching" well-ordered data such as an
electronic card-catalog, from the exploration of
semantic fields in a work of literature; or, as in
our case, the exploration of hand-written documents
of everyday life -
with holes and smudges -
rescued from the dustbin of history by an absence
of moisture in Cairo, that want to tell us about life a
thousand years ago. The embarrassingly weak
theoretical basis for much of contemporary text
markup jargon is probably the culprit and also
explains why most practicing literary scholars
have not embraced this technology.
It may also be that the commercial publishers
simply cannot find room in their budgets for developing
really useful search interfaces.
It should be
obvious to all but the professional SGML
consultants that the technology
happily used by insurance
companies to track customers and by the Federal
Government to process grant applications will not
easily serve scholars working with difficult texts.
Searching then, in all but the most trivial cases,
becomes a matter of systematic study of a semantic
category and of the morphology of a language;
the wordlists we provide as part of the "Geniza Browser"
give the scholar that opportunity. Only
after all the members of a semantic field and all
morphological forms of those words have been
identified does it make sense to submit a query to
the search engine. Typing random
words into a search window is a reprehensible
practice, not worthy even of high-school seniors desperate for
ideas for a paper on Moby Dick, and it has
given the whole field of electronic text research
a bad name.
Fortunately there are some search engines that
employ a complete morphological analysis before
doing a query - so we are not without examples of
how to do it right. Also, there are some PC- and
Mac-based search engines that use word-wheels to
allow relatively quick access to all the lexical
strings arranged in alphabetical order and then
allow an easy way to submit all the selected words
as a batch to the engine. Our search engine is
patterned after the word-wheel model, with the one
exception that there is nothing easy about our
interface.
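For readers who have not met a word-wheel, the core of the idea can be
sketched as a binary search into the sorted wordlist. This is a hedged
illustration and not a description of any particular product or of our own
engine; the function name wheel_window, the size parameter and the
placeholder entries are assumptions.

```python
from bisect import bisect_left


def wheel_window(sorted_words, typed, size=5):
    """Return up to `size` consecutive entries of the sorted wordlist,
    starting at the first entry that sorts at or after `typed`."""
    start = bisect_left(sorted_words, typed)
    return sorted_words[start:start + size]


# Hypothetical wordlist, already sorted; placeholders, not real entries.
sorted_words = ["abrahim", "abrhm", "dinar", "ila", "ktab", "min", "wa"]

# Typing a few characters spins the wheel to the nearest entries.
near_abr = wheel_window(sorted_words, "abr", size=2)
near_d = wheel_window(sorted_words, "d", size=3)
```

The user sees neighbouring forms side by side, which is exactly where
variant spellings and related forms tend to cluster in an alphabetical list.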
At present we depend on a primitive - but
effective - cut and paste
technique to construct the query string.
While the mouse-swipes and cut and paste
keystrokes seem strange at first, they quickly
become routine. In addition, the notion of
editing the search text-box and keeping
queries in a word-processing format allows
and even encourages repeated iterations of
queries that can be refined.
At present
there is no facility for submitting multiple lists
with boolean operators. While this is a favorite
with the logic buffs, it does not do much for
the textual scholar. At some point, if it becomes obvious
from the work of practicing Geniza scholars that this
would be a real enhancement, the additional programming
and interface design can be undertaken.
This test-bed is just a first step. The big innovation is
that a researcher (trained in this field) will be
able to start - empirically - from a list of the
extant words in the documents. One will be able to
find all the spellings of a name in its Arabic and
Hebrew forms; one will be able to go through the
lists and, based on knowledge of the language and
the vocabulary of the time, extract strings
concealed in morphological structures.
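One simple aid for this kind of extraction is a scan of the wordlist for
every entry that contains a given string, whether as a whole word, inside a
longer form, or as part of a fragment. The sketch below is an assumption
about how such a helper might look, not a feature of the present test-bed;
the name forms_containing and the placeholder tokens (including the
prefixed forms) are invented for illustration.

```python
def forms_containing(words, fragment):
    """Return every entry of the wordlist that contains `fragment` anywhere."""
    return [w for w in words if fragment in w]


# Hypothetical wordlist: transliterated placeholders, not real entries.
words = ["abrhm", "dinar", "labrhm", "wabrhm"]

# Candidate forms in which the string is embedded in a longer word.
matches = forms_containing(words, "abrhm")
```

The scan only proposes candidates; deciding which of them really are forms
of the same name or word remains the work of a researcher who knows the
language.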