Putting a novel on the electronic Workbench

Let us consider a more controversial use of the same tool, one which crosses the line from advanced word processing routines towards the realm of "method." And let us start out with that innocent empiricism that displays all the data sources first - before the act of conjecture turns into an idea. For example, in Figure 1 below we see the frequency wordlist of Hermann Broch's Der Tod des Vergil (The Death of Vergil; hereafter TDV).

If you are of a generation of Germanists who were trained after the Second World War and who became fascinated with Broch's TDV as an epic pinnacle, the order of the 200 words below will stimulate an amazing aesthetic/intellectual cascade effect. For the rest of you - who have not struggled with TDV - I am afraid this whole example will not mean much; however, a few words of explanation may be in order. It is precisely the linguistic difficulty that obscures a tremendous promise - in the sense of potential. TDV promises no less than a firsthand account of the German Götterdämmerung of the thirties and forties from a lofty neo-classical perspective, but a neoclassicism that understands the proletariat, that understands power and the "state" and that understands human transcendence.

On one level there is aesthetic transcendence - obviously - but there is also a mystical transcendence of life on the edge of death. [Vergil expires within hours of his arrival in Brundisium and after one final interview with his and all Romans' political and cultural master - Augustus Caesar, the physical incarnation and embodiment of all spirit that emanates from the Mediterranean.] It is the transcendence through death in Book 4. The actual death of Broch and the consequent lack of further explanations have added to the mystery and fascination.

Every encounter with TDV has a mysterious dimension - several hours with the book and a brandy - an hour reading passages to a friend - weeks spent trying to nail down chronologies or space relationships, not to speak of days spent figuring out what others may have thought.

Through all that - the tantalizingly impenetrable fog of Broch's language defeats all systematic analysis - that is - until you see the list below for the first time. Poof goes the fog. Suddenly, all sorts of questions can be asked...

Figure 1. Sorted wordlist for Hermann Broch's Tod des Vergil.
5296 und
4809 der
4292 die
2651 das
2324 es
2295 in
1936 war
1847 des
1786 zu
1779 er
1592 sich
1544 nicht
1543 den
1443 dem
1324 sie
1138 von
1102 ist
940 ein
928 als
927 du
860 mit
810 ich
806 im
785 noch
781 so
780 daß
719 wie
683 auch
594 auf
593 hatte
562 um
561 an
559 mehr
546 aus
522 zur
514 ihm
484 da
471 zum
470 für
468 sein
457 nur
401 nichts
393 oh
386 doch
383 eine
361 ihn
350 wird
348 aber
336 vor
333 selber
327 wurde
321 werden
317 über
311 dies
309 ihr
297 immer
294 denn
285 wieder
279 nun
275 ja
270 vom
269 dir
267 oder
264 wir
254 einem
242 ihrer
242 alles
240 was
237 ins
236 sehr
236 ohne
235 waren
235 einer
234 wenn
233 seiner
232 dich
231 durch
230 nach
220 selbst
217 gewesen
216 hat
215 seine
214 mir
203 augustus
200 zeit
196 schon
190 vergil
185 wirklichkeit
180 sind
179 mich
174 menschen
174 ihre
169 weil
167 hier
167 alle
166 man
165 hast
163 erkenntnis
156 wäre
154 plotius
152 kein
151 nacht
150 wissen
149 unter
149 dort
148 dein
146 keine
146 hätte
146 cäsar
145 stimme
143 mein
142 lucius
141 all
139 leben
139 haben
139 am
138 uns
138 ihnen
137 dieses
137 damit
137 bis
136 jetzt
135 diese
135 dennoch
135 bloß
134 trotzdem
133 dieser
132 seinem
131 ob
131 geworden
131 eines
130 habe
130 etwas
129 zwischen
126 einen
122 bist
121 irdischen
120 dann
119 sogar
114 dessen
113 muß
113 kaum
112 schönheit
111 wohl
110 niemals
110 aller
109 tod
109 einmal
108 will
108 liebe
107 plotia
107 allein
105 weiter
105 nein
105 mußte
103 schicksal
103 deine
103 also
102 vielleicht
102 seinen
102 sagte
100 sprache
100 knabe
100 bei

[NOTE: Hermann Broch, Der Tod des Vergil, (Rhein Verlag: Zurich) 1952 and Hermann Broch, The Death of Vergil, (Pantheon Books: New York) 1945. The 1945 English translation and the 1952 German edition were put into electronic form by means of a Kurzweil Data Entry Machine at the Duke University Humanities Computing Facility.]

A glance at the alphabetical wordlist as well as the sorted frequency list can yield tremendous insight into a text. For example, there are 145,822 words in the original German edition; of those, 21,572 are unique. To rephrase that, there are 21,572 different words in the text; in the table above you see those that occur 100 times or more. Figure 5, below, shows the frequencies for the English translation, which contains 165,491 words, of which 13,332 are unique. Even a cursory comparison of the two lists yields amazing insights. There are roughly 20,000 more words in the English translation than in the original. Yet there are many more (8,000+) unique German words than English words.

This is best seen in tabular format:

Figure 2. TDV vs. DOV.
Text             Total # of Wds.   Total # of unique Wds.
German (TDV)     145,822           21,572
English (DOV)    165,491           13,332
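
The counts in Figure 2 require nothing more elaborate than splitting a text into words and tallying them. A minimal sketch in Python follows. It is not the program used to produce the figures; the tokenizing rule (lowercase, runs of letters only) and the names tokenize and wordlist are assumptions made purely for illustration, and a different rule would give somewhat different totals.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed rule: lowercase the text and keep runs of letters
    # (umlauts and ß count as letters); digits and punctuation are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())


def wordlist(text):
    """Return (total words, unique words, ranked list of (count, word))."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    unique = len(counts)
    # Highest frequency first; ties are listed alphabetically.
    ranked = sorted(((n, w) for w, n in counts.items()),
                    key=lambda pair: (-pair[0], pair[1]))
    return total, unique, ranked


if __name__ == "__main__":
    # Toy sentence, not a quotation from the novel.
    total, unique, ranked = wordlist("Und die Nacht und der Tod und die Zeit")
    print(total, unique)
    for count, word in ranked:
        print(count, word)

    # Ratio of unique words to total words, using the totals in Figure 2.
    print(round(21572 / 145822, 3), round(13332 / 165491, 3))
```

Applied to the published totals, the ratio of unique words to total words is about 0.15 for the German text and about 0.08 for the English translation, which restates the disparity in a single number.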

Figure 3. Chapter Breakdown.
                 BK 1   BK 2   BK 3   BK 4
Total German     b1     b2     b3     b4
Unique German    b1     b2     b3     b4
Total English    b1     b2     b3     b4
Unique English   b1     b2     b3     b4

This seems to indicate that German has many more compound forms and many more inflected forms than English. However, the counts for the coordinating conjunctions "und" and "and" are 5296 and 5339 respectively. Given the wide disparity in the total number of words and in the number of unique words, a difference of only 43 for a coordinating conjunction seems puzzling. A closer look at the breakdown of chapter frequencies shows that the numbers stay very close in each chapter. Further investigation leads to preliminary conclusions about the use of and-pairs to anchor a translation.

Figure 4. "and" in English and German.
                 BK 1   BK 2   BK 3   BK 4
und (TDV)        b1     b2     b3     b4
and (DOV)        b1     b2     b3     b4
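
A book-by-book comparison of this kind could be tabulated along the following lines. This is only a sketch: it assumes that each book of the original and of the translation is available as a separate string, and the function names and the toy sentences are invented for illustration and are not taken from TDV or DOV.

```python
import re
from collections import Counter


def count_per_book(books, word):
    """Count how often `word` occurs in each book (one string per book)."""
    word = word.lower()
    # Counter returns 0 for a word that never occurs in a book.
    return [Counter(re.findall(r"[^\W\d_]+", book.lower()))[word]
            for book in books]


def compare(original_books, translated_books, source_word, target_word):
    """Return one (source count, target count, difference) triple per book."""
    source = count_per_book(original_books, source_word)
    target = count_per_book(translated_books, target_word)
    return [(s, t, t - s) for s, t in zip(source, target)]


if __name__ == "__main__":
    # Toy stand-ins for two "books" in each language.
    german = ["Nacht und Tod und Zeit.", "Er und sie."]
    english = ["Night and death and time.", "He and she, and so on."]
    for row in compare(german, english, "und", "and"):
        print(row)
```

A run of small differences across all four books, rather than large differences that happen to cancel, would be the pattern consistent with the overall gap of 43 (5339 - 5296).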

Figure 5. Sorted wordlist for the Death of Vergil translation.
13231 the
6588 of
5339 and
4397 to
3165 in
2706 a
2523 it
2484 was
2048 that
1786 he
1466 as
1430 had
1358 for
1274 you
1239 his
1223 not
1220 with
1141 which
1128 this
1071 by
1058 is
1056 be
1033 from
868 into
841 ?
803 i
796 all
790 on
755 no
747 but
726 its
714 him
709 one
653 have
619 so
564 there
557 were
551 more
543 been
540 at
537 an
532 even
522 they
497 only
496 their
447 who
438 oh
434 your
432 are
415 or
408 out
392 them
391 if
383 time
374 now
371 again
355 itself
352 still
348 yet
320 could
318 my
315 nothing
311 like
308 me
300 would
298 without
297 will
290 though
284 what
280 death
274 over
263 reality
260 through
256 must
255 himself
254 up
253 we
252 being
246 than
246 back
236 human
232 life
232 because
225 own
223 other
222 did
220 do
218 has
214 light
213 night
211 perception
207 very
197 augustus
196 beyond
195 too
184 longer
183 voice
182 virgil
181 every
179 within
179 just
178 here
178 earthly
176 once
174 come
173 when
173 became
170 also
169 boy
168 toward
165 then
165 before
163 came
162 gods
161 might
161 her
158 man
156 well
156 should
156 never
156 become
155 such
152 yes
152 plotius
152 knowledge
150 world
150 people
148 these
148 caesar
148 beauty
146 any
146 almost
146 after
144 fate
141 lucius
140 most
140 down
139 way
137 work
137 order
136 said
136 about
132 last
132 first
132 dream
132 between
131 where
131 may
131 love
131 always
130 able
128 off
127 state
127 memory
127 how
127 both
126 she
124 our
124 hand
124 creation
121 truth
121 existence
121 earth
121 breath
121 art
120 slave
119 same
115 soul
115 name
115 already
114 while
114 us
114 something
114 away
114 although
113 remained
112 upon
112 am
111 everything
108 plotia
108 nor
108 darkness
108 above
105 shall
104 seemed
104 know
104 held
103 however
103 far
102 past
102 face
101 symbol
101 new
101 great
101 each
100 much

A quick glance at the whole wordlist shows some predictable but also some surprising items. For example, there are 74 instances of Grenze. Some quick detective work shows that Günter Grass' Die Blechtrommel [NOTE: Günter Grass, Die Blechtrommel, (xxx:XXX, dddd).], a work roughly contemporary to Broch's Death of Vergil, has only three references. Without any idea of proving anything, or even of knowing what this particular feature might mean, let us investigate some of the grammatical variants and compounds. A search for the string "grenz" surrounded by wild cards, "*grenz*", will yield the following list:

Figure 6. Frequencies for "grenze" in Tod des Vergil.
2 begrenzt
2 begrenzte
4 begrenzten
2 begrenztheit
1 begrenzung
1 gewölbegrenzen
74 grenze
31 grenzen
6 grenzenlos
2 grenzenlose
6 grenzenlosen
1 grenzenloser
5 grenzenlosigkeit
1 grenzentrückten
1 grenzerkenntnis
1 grenzfernen
1 grenzgleichgewicht
2 grenzjenseitigkeit
1 grenznachbarn
2 grenzraum
1 grenzraumes
1 grenzspiel
1 grenzt
1 grenzüberschreitung
1 grenzüberwachsend
1 grenzumschlossen
1 grenzzustand
1 kippgrenze
1 klarheitsgrenzen
1 nachtgrenzen
1 raumesgrenze
1 raumesgrenzen
2 reichsgrenzen
1 sphärengrenze
1 sphärengrenzen
3 traumgrenze
4 unbegrenzt
1 unbegrenztheit
1 unendlichkeitsgrenze
1 unwirklichkeitsgrenzen
1 waldgrenze
1 weltengrenzen
1 wendegrenze
1 wirklichkeitsgrenze
2 zeitengrenze
1 zeitgrenze
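
A wild card search of this kind can be sketched in a few lines of Python with the standard fnmatch module. The sketch assumes the wordlist is already held as a table of word frequencies; the function name wildcard_search is hypothetical, and the sample table contains only a handful of entries copied from the lists above, not the full wordlist.

```python
from fnmatch import fnmatchcase


def wildcard_search(frequencies, pattern):
    """Return (count, word) pairs for every word matching the wild card
    pattern, in alphabetical order of the word."""
    return [(count, word)
            for word, count in sorted(frequencies.items())
            if fnmatchcase(word, pattern)]


if __name__ == "__main__":
    # A few entries taken from the frequency lists above.
    sample = {
        "grenze": 74,
        "grenzen": 31,
        "begrenzt": 2,
        "traumgrenze": 3,
        "zeit": 200,
        "nacht": 151,
    }
    for count, word in wildcard_search(sample, "*grenz*"):
        print(count, word)
```

The asterisk on either side is what brings in prefixed forms such as "begrenzt" and compounds such as "traumgrenze" alongside the simple noun.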

We shall consider the implications of such data for a quite specific and detailed interpretation of Tod des Vergil in the theoretical section of this presentation. At present we would like only to clear some intellectual space, to create a positive climate for this sort of computer magic.

The pulley forever changed the method of erecting buildings; the gun forever changed warfare. More subtle but equally far-reaching effects on civilization came from the opportunities in media management created by the pencil and eraser or the loose-leaf binder. Perhaps electronic indexing of texts will have an analogous influence on the way we interpret and analyze texts. Today we can say with some confidence that the computer will be involved in all phases of life, especially the intellectual life, and not just confined to a technological, scientific-engineering sphere.

But let us give the concept of a "tool" in the study of literature a historical dimension within the debates of the time. For example, the tools of the early 19th century philologists, paleography and Indo-European linguistics, must have seemed bizarre to the majority of conservative, rationalist teachers of the classics of that time. Similarly, today the tools of "Ideologiekritik," the explication of cultural bias,

[NOTE: Or perhaps one should speak of explicating texts of various periods in terms of cultural and political determinants. See: Christa Bürger, Textanalyse als Ideologiekritik. Zur Rezeption zeitgenössischer Unterhaltungsliteratur (Frankfurt a.M.: Athenäum, 1973). Bürger is an early practitioner of "Ideologiekritik" who traces her roots to the discussion of "Kulturindustrie" by Adorno and Horkheimer in the sixties, p. 55. The focus on "popular literature" (read: Trivialliteratur, as the genre was generally recognized at the time before the contemporary canon crisis) is logical since one is not studying "eternal aesthetic or moral values" whispered into human consciousness by muses, but the products of a semi-industrialized production process.]

seem bizarre to the practitioners of close readings and the discovery of aesthetic values. There is a natural tension of competing perspectives that seems to divide scholars along ideological lines. We hope that the implementation of "electronic tools" will not be seen as a last chance by some faction to defend the discipline from the evil of the "word-counting devils" who are chipping away at their view of the semiotic process. Rather we hope that electronic tools will be embraced by all factions as an opportunity to make more precise the gathering of data to support arguments, be they historical, aesthetic or gender-based.

Literal vs. Metaphoric Tools.

The question then is this: does an "electronic tool," consisting of the computer hardware and the computer software that allows the indexing of literary texts as described below (see section 1.4, below), have the same status as a "conceptual tool" such as "Ideologiekritik," "Reader Response" or "Deconstruction"? Yet the whole idea of a "conceptual tool" is actually a metaphoric use of the word "tool"; in this case "tool" really means a strategy of interpretation, an analysis paradigm, by definition one of many and often incompatible attempts to explain a phenomenon of text art. In speaking of the computer and electronic indexing as a "tool" we use the word in the most pedestrian and literal sense, the sense of "shovel" and "bin," to get words and phrases separated out from the stream of text for the purpose of further analysis.

Thus we would like to carve out for the computer and the indexing of literary texts a role more fundamental than most analysis paradigms would claim. Full text research has a "pre-analysis" data gathering function. It is possible that many competing and succeeding analysis paradigms would ALL use computers and indexed texts. These arguments can be considered an internal question, to be addressed by literary critics of various stripes. We will try to focus on describing the actual "tool" and not be sidetracked by what we may want to prove by it. The actual tool, however, the computer and the theories that make it work, are quite external questions that demand attention to other fields of study.

Stealing from Computer Science and Linguistics.

Like any trespassers worth their salt, humanists on technological/scientific grounds must appropriate (i.e. steal) some of the sacred knowledge of both computing and linguistics to drag back to our own cave. From the field of computing we will appropriate "string handling," "indexing" and the concept of "interface." For example, an interface designed according to modern standards of microcomputer word processing allows a novice user to extract quotations, i.e. indexed strings, from a text of several megabytes based on the intersection of two or more lists of words.
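
To make the idea of an "intersection of two or more lists of words" concrete, here is a minimal sketch in Python. It treats a quotation as a sentence and keeps every sentence that contains at least one word from each list. Both the sentence-splitting rule and the function names are assumptions made for illustration; an actual indexing program would work from a prebuilt index rather than rescanning the text.

```python
import re


def sentences(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def quotations(text, *word_lists):
    """Return the sentences that contain at least one word from EVERY list."""
    wanted = [{w.lower() for w in words} for words in word_lists]
    hits = []
    for sentence in sentences(text):
        present = set(re.findall(r"[^\W\d_]+", sentence.lower()))
        # A non-empty intersection with each list is required.
        if all(present & group for group in wanted):
            hits.append(sentence)
    return hits


if __name__ == "__main__":
    # Toy text, not a quotation from any of the works discussed.
    text = "The night was dark. Death came at night. Beauty is truth."
    for quote in quotations(text, ["night", "darkness"], ["death", "fate"]):
        print(quote)
```

Each list plays the role of a semantic category chosen by the reader; adding a third list simply narrows the set of quotations further.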

We will develop two points in our discussion of the relevance of computing to literature, texts and literary scholars:

DOS word processing has been widely recognized as a "Trojan horse" to bring computers to non-technological areas. However, we would like to supplement this commonplace by adding a corollary - familiarity or virtuosity with the DOS word processing interface puts the capability to do powerful and sophisticated computing into the hands of relative computer neophytes. Often the humanists who master word processing think no more of that skill than they do about mastering typewriting. This is a vast underestimation of the potential for computer-aided research within easy grasp.

[NOTE: The only prerequisite for this type of work is to expand the range of activity of a user of word processing. A computer literate humanist already has a word processor and is familiar with the concept of "text," i.e. word processor files, on-line. The only additional concepts are to replace the word processor with an indexing program and to replace the word processing files with the "full" text of the primary works to be studied.]

The capacity to handle "strings" (defined as a sequence of letters - a word or a phrase) is a very fundamental and easy function for a computer. "Indexing" strings puts words into an alphabetical order within the computer that allows rapid access to all occurrences of the words. Thus, for example, the word "butterfly" occurs 14 times in Lord Jim. In the index of the string "butterfly," fourteen pointers indicate the location of the fourteen references in the text in the memory of the computer. At the touch of a single key, [Return], all fourteen references can be located and retrieved without any elaborate search in a time frame of milliseconds. Thus even inexpensive ($1,000) and computationally slow microcomputers can do very rapid retrieval of strings.
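
The index of pointers described here can be sketched as a table that maps each word to the positions at which it occurs. The Python below is an illustration under stated assumptions, not a description of any particular indexing program: pointers are taken to be character offsets into the text, the function names are hypothetical, and the sample text is a toy sentence rather than a passage from Lord Jim.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lowercased word to the list of character offsets
    (the "pointers") at which it begins in the text."""
    index = defaultdict(list)
    for match in re.finditer(r"[^\W\d_]+", text):
        index[match.group().lower()].append(match.start())
    return index


def retrieve(text, index, word, width=20):
    """Follow every pointer for `word` and return the surrounding context.
    No search of the text takes place here, only direct slicing."""
    pointers = index.get(word.lower(), [])
    return [text[max(0, p - width): p + len(word) + width] for p in pointers]


if __name__ == "__main__":
    text = "A butterfly rose. The butterfly settled."
    index = build_index(text)
    print(sorted(index))          # the alphabetical wordlist
    print(index["butterfly"])     # the pointers for one word
    for context in retrieve(text, index, "butterfly"):
        print(context)
```

The work of scanning the text is done once, when the index is built; every later retrieval only follows pointers, which is why even a slow machine can answer quickly.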

From the field of linguistics we shall appropriate semantics, since defining semantic categories is an important prerequisite to deriving useful information from a full text data base. In addition, we will appropriate the notion of syntactic structures to disambiguate strings into the appropriate semantic category and to achieve a "lemmatized" text in which the retrieval of a "root" form will bring out all inflected, regular and irregular forms.

[NOTE: For an introduction to some of the questions, discussions and vocabulary of this area of computational linguistics see: Graeme Hirst, Semantic Interpretation and the Resolution of Ambiguity (Cambridge: Cambridge University Press, 1987). The book treats both lexical and structural disambiguation, gives a summary of representative projects and sketches the direction of the discipline.]
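
What retrieval by "root" form might look like can be sketched as follows. The lemma table here is a tiny, hand-made and purely hypothetical stand-in for a real lemmatizer; the frequencies are copied from Figures 1 and 6, and the function names are invented for illustration.

```python
# Hypothetical hand-made lemma table: inflected form -> root form.
# A real lemmatized text would need a full lexicon and syntactic analysis.
LEMMAS = {
    "grenzen": "grenze",
    "war": "sein",
    "waren": "sein",
    "ist": "sein",
    "sind": "sein",
    "bist": "sein",
    "gewesen": "sein",
}


def lemma_of(word):
    # Words missing from the table are treated as their own root.
    return LEMMAS.get(word, word)


def forms_for(root, frequencies):
    """List, alphabetically, every attested form that reduces to `root`."""
    return sorted(w for w in frequencies if lemma_of(w) == root)


def lemma_frequency(root, frequencies):
    """Add up the frequencies of all forms that reduce to `root`."""
    return sum(frequencies[w] for w in forms_for(root, frequencies))


if __name__ == "__main__":
    # Frequencies taken from the wordlists above.
    freq = {"grenze": 74, "grenzen": 31, "war": 1936, "ist": 1102, "zeit": 200}
    print(forms_for("grenze", freq), lemma_frequency("grenze", freq))
    print(forms_for("sein", freq), lemma_frequency("sein", freq))
```

A table of this kind cannot decide whether a given "sein" is the verb or the possessive; that is precisely the disambiguation for which syntactic structure is needed.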

We shall have more to say in the body of the book about the relationship between linguistics (computational and otherwise) and literary studies, which should be exploited for the creation of parsing tools for humanists so that the historical and aesthetic side of language is not lost entirely in the quest for the universal phrase structure grammar.