The Philological Workstation BAMBI (Better Access to Manuscripts and Browsing of Images)

Abstract

This paper presents the results of BAMBI (Better Access to Manuscripts and Browsing of Images), European project LIB-3114 for Digital Libraries. The project has produced a hypermedia system that allows historians, and more particularly codicologists and philologists, to read and transcribe manuscripts, write annotations, and navigate between the words of the transcription and the matching region of the digitized image of the manuscript. After an introduction to the objectives of the project and to related work, the second part describes the functions and the design of the Philological Workstation. The third part describes how the international standard HyTime (Hypermedia/Time-based Structured Language) has been used as a modelling language to describe work on manuscripts (description, transcription, annotations, links, ...). Finally, the architecture of the BAMBI workstation is presented.

1 Introduction

The transfer of texts and images onto digital media opens up an interesting set of possibilities for those who, in various capacities, are concerned with the conservation of written documents, such as library staff, as well as for anyone carrying out studies in the field of philology.

Several products (commercial or research products) already exist for helping philologists to consult or to work with manuscripts.

MEDIUM (IRHT, Paris) and PLAO (Poste de Lecture Assistée par Ordinateur) are databases of medieval manuscripts. With these databases, the user can find the different copies of a manuscript and access the translations and comments. These systems, however, support consultation only: the user cannot enrich the database with comments.
GRAFOS (OAK, Madrid) and the EAMMS project (Hill Monastic Manuscript Library and Vatican Film Library) allow philologists to consult, transcribe and index the manuscripts.
COLLATE 2 (Oxford University Centre for Humanities Computing) aims to help scholars in the preparation of a critical edition based on many manuscripts.
ICARUS (Historical Archive of the Comune of Genoa) is a system for scanning and optical storage of archive documents, with indexing of the document contents.
The KLEIO system (Max Planck Institut of Göttingen) performs integrated management of images and description data.
OPERA (IRISA, Rennes) (André and Richy 1994) allows the consulting of manuscripts, transcription and indexing.

The OPERA project applies the concept of the structured document and, using the Grif formatter (André and Richy 1994), handles indexes as hypertext links. Other systems, generally simple database management systems, have been described elsewhere.

Building on a project carried out at the Institute for Computational Linguistics in Pisa, a workstation for philologists, papyrologists, epigraphists and other scholars working with ancient texts has been developed within the EU Libraries Programme. The project, which ended in April 1998, was named 'Better Access to Manuscripts and Browsing of Images (BAMBI)'. The Consortium was formed by A.C.T.A. (coordinator, Florence), Central National Library (BNR, Rome), National Research Council (ILC-CNR, Pisa), Pisa Research Consortium (CPR, Pisa), INSA (Lyon) and Max Planck Institut fuer Europ. Rechtsgeschichte (Frankfurt a. M.).

The BAMBI project is aimed at two categories of users. The first consists of general library users who wish to examine manuscript sources. The second consists of professional students of texts: philologists or critical editors of classical or medieval works that are handwritten on material supports of various types (paper, papyrus, stone). It therefore includes students of ancient texts such as papyrologists, epigraphists, palaeographers and codicologists: in short, all those who are interested in studying, annotating or transcribing the text contained in digitized and accessible manuscript documents.

The BAMBI Workstation (Fig. 1) makes it easy to:

look up an image archive with digital representations of the source documents on a high-resolution monitor
transcribe, annotate and index the text presented in the images
view the transcribed version and the Index Locorum (see section 2.3) in a window adjacent to the display of the source document
automatically match each word of the transcription, of the Index Locorum and of the annotations with the portion of the source document image in which the word is found
export information on manuscripts in the form of SGML/HyTime-formatted text.

Figure 1. BAMBI workstation

The BAMBI project allows large specialized databases of ancient manuscripts to be built up, by giving each owner of manuscript content the opportunity to create its own documentary database.

Moreover, users can enrich the information associated with the images by entering a transcription (if none exists), annotations, etc., at several levels: the manuscript or the individual image. The use of hypermedia ensures user-friendliness and ease of navigation. Links between images and texts constitute an innovative function of this application. All these elements give BAMBI its originality with respect to existing products and contribute to creating strong interest in the user population.

2 Description of the philological workstation

2.1 Search of a manuscript

The philological workstation offers a number of search tools that can speed up the selection of documents. Three methods of retrieval have been implemented: selection from an ordered list, multi-criteria search, and the use of keywords.

The ordered list option corresponds to an exhaustive list of all documents contained in the database. Different criteria can be applied to control the order of occurrences: alphabetical order for titles, increasing or decreasing order for dates, grouping by types, etc.

Searches can be performed on the following criteria, alone or in combination: date of creation, language, place of origin, etc. (Fig. 2)

Each text is associated with a certain number of keywords that describe the document. These keywords are in several languages (German, English, French, Italian or Spanish). This enables historians to make highly sophisticated searches through the inclusion of logical operators (AND, OR, NOT) and special characters (*, ?, #).

Figure 2. Selection of manuscripts

2.2 Transcription of manuscripts

Philologists can transcribe the text contained in the digitized image of the manuscript (if no transcription exists yet), following transcription criteria based on the palaeographic model (Bischoff 1992; Pratesi 1979; Tognetti 1982).

Recall that transcribing a manuscript means recording the pronunciation of a given language by means of the sign system of a conversion language. In the era under consideration, the early and high Middle Ages, abbreviations can be divided into the following types: syllabic abbreviation (omission and elision of letters), abbreviation by suspension (for example, the names of jurists: ac. = Accurcius, bul. = Bulgarus, ...), abbreviation by contraction (with endings written on the line), and the use of special signs.

The transcription of a manuscript for BAMBI is a non-automatic operation that must adhere to a number of rules and conventions in order for it to be correctly interpreted and used by the application. The rules are:

round brackets '(' and ')' are used to enclose characters that in the original are represented by signs of abbreviation written above or below other characters and which therefore do not modify the spelling;
angle brackets '<' and '>' are used to enclose integrated characters for which there is no equivalent in the original, or initials of large size;
square brackets '[' and ']' are used to enclose characters that correspond to a single or graphically modified sign in the image.

These rules are given interactively to the user when transcribing with the BAMBI Workstation. The transcription can be exported to a file in RTF or SGML (see section 3), allowing it to be reused either by standard word-processing programs or by commercially available bibliographical software.
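The round-bracket convention can be illustrated with a short sketch. It is not taken from the BAMBI software: the function names are hypothetical, and only the round-bracket rule is handled, using the word 'I(tem)' that appears in the sample document discussed in section 3.3.

```python
import re

# Matches one round-bracket group, e.g. "(tem)" in "I(tem)".
# Assumption: bracket groups are not nested.
ABBREVIATION = re.compile(r"\(([^()]*)\)")


def expanded_form(token: str) -> str:
    """Reading with the abbreviation resolved: brackets removed, content kept."""
    return ABBREVIATION.sub(r"\1", token)


def written_form(token: str) -> str:
    """Letters actually written out on the page: bracketed content dropped."""
    return ABBREVIATION.sub("", token)


if __name__ == "__main__":
    print(expanded_form("I(tem)"))  # Item
    print(written_form("I(tem)"))   # I
```

Keeping both readings available from a single transcription string is what makes the convention useful: the expanded form can feed an index, while the written form reflects what the image shows.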

2.3 Indexing of transcriptions

When the text of the transcription is complete, the indexing tool generates an index verborum and an index locorum (Fig. 3). The index verborum contains all the words appearing in the transcription (without the bracket characters () and [], which are automatically stripped away) as well as the words corrected by the user (with the text variant function), which are identified by an asterisk.

Figure 3. Index verborum and index locorum

Each entry in the index is followed by the number of times it occurs in the manuscript, as well as in the page under examination.

The use of more than one script in the same manuscript (Greek and Latin for example) requires the creation of an index locorum for each alphabet.

The index locorum allows the positions in which each word occurs in the manuscript to be displayed. The reference to a given word takes the form of a list containing the page number, column number, line number and word number.
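As a minimal sketch of the two indexes, the following Python fragment builds a word count (index verborum) and a list of positions (index locorum) from a nested page/column/line/word structure. The data layout, the function name and the sample words are assumptions made for illustration; user variants and the per-alphabet split are left out.

```python
from collections import defaultdict

# Round and square brackets are stripped before indexing, as described above.
BRACKETS = str.maketrans("", "", "()[]")


def build_indexes(pages):
    """pages maps a page id to a list of columns; a column is a list of
    lines; a line is a list of transcribed tokens (hypothetical layout)."""
    verborum = defaultdict(int)   # word -> number of occurrences
    locorum = defaultdict(list)   # word -> [(page, column, line, word no.)]
    for page, columns in pages.items():
        for col_no, lines in enumerate(columns, start=1):
            for line_no, tokens in enumerate(lines, start=1):
                for word_no, token in enumerate(tokens, start=1):
                    word = token.translate(BRACKETS)
                    verborum[word] += 1
                    locorum[word].append((page, col_no, line_no, word_no))
    return dict(verborum), dict(locorum)


if __name__ == "__main__":
    # Made-up sample text: one page, one column, two lines.
    sample = {"c4r": [[["I(tem)", "a", "di"], ["I(tem)", "partimo"]]]}
    verborum, locorum = build_indexes(sample)
    print(verborum["Item"])  # 2
    print(locorum["Item"])   # [('c4r', 1, 1, 1), ('c4r', 1, 2, 1)]
```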

The indexing technique used is full-text indexing (see section 4).

2.4 Annotations on manuscript transcription

Annotations can be added to the historian's work on manuscripts. They make it possible, on the one hand, to attach personal comments to a group of words in the text and, on the other hand, to add corrections or synonyms.

Thus an annotation is made up of two distinct fields: one for free comments or critical apparatus (Fig. 4), and another for variants, synonyms or corrections of syntax. This distinction between the annotations is necessary so that the synonyms can be included in the manuscript's index verborum, which is useful for research purposes.

Figure 4. Free comments

2.5 Word-image concordance

2.5.1 Automatic column and line recognition

The original document, or a photographic (even microfilmed) copy of it, is optically scanned. The digital image appears in a screen window at a very high definition to ensure legibility even after considerable enlargement. A second window placed alongside the first one contains the scholar's transcription of the document. New text is entered, or text already transcribed is corrected, in this window (Fig. 5).

Figure 5. Image-to-text matching

The digital representation of the image is automatically processed to identify, from the distribution of the main values in the histogram, those parts which can reasonably be taken as background and those which can be interpreted as writing.

Columns are identified automatically when the text is laid out in more than one column. The image processing looks for features that enable the machine to distinguish the areas containing text from the margins, by analysing the histogram of the distribution of 'grey-level' values. The point at which the program finds a clear-cut separation between black and white tonalities, scanning the image from top to bottom, is taken as a column partition and is marked with a line on the screen so that its position can be checked.

The same method is used for the third type of processing, automatic line identification. The program no longer works vertically but horizontally, and analyses the histogram to check for a clear separation of the 'grey-level' values. Once the lines are identified, the program can count and number them progressively.
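The idea of separating bands of writing from background along one axis can be sketched as a projection profile. The fragment below is an illustration only, not the BAMBI algorithm: it assumes a fixed grey-level threshold of 128 (the paper derives the separation from the histogram instead), and the function names are hypothetical.

```python
def ink_profile(rows, threshold=128):
    """Number of dark pixels in each row of a grey-level image
    (a list of rows of values 0..255). The threshold is an assumption."""
    return [sum(1 for px in row if px < threshold) for row in rows]


def find_bands(profile, min_ink=1):
    """Return (start, end) pairs, end exclusive, of consecutive profile
    entries that contain at least min_ink dark pixels."""
    bands, start = [], None
    for i, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = i
        elif ink < min_ink and start is not None:
            bands.append((start, i))
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


if __name__ == "__main__":
    W, B = 255, 0
    image = [
        [W, W, W, W],
        [B, W, B, W],
        [B, B, W, W],
        [W, W, W, W],
        [W, B, B, W],
        [W, W, W, W],
    ]
    # Text lines: bands of inked rows.
    print(find_bands(ink_profile(image)))                          # [(1, 3), (4, 5)]
    # Columns: the same analysis on the transposed image.
    print(find_bands(ink_profile([list(c) for c in zip(*image)])))  # [(0, 3)]
```

Running the same two functions on the transposed image gives the band structure along the other axis, which mirrors the way the paper reuses one method for both column and line identification.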

2.5.2 Word-image concordance

A final processing step, possible only if a transcription of the text is available, matches each word of the text with the portion of the image in which it is written. The system first examines the 'grey-level' values of each line and evaluates them along the vertical axis in order to identify the separations between words. It then exploits the transcription, which gives the exact number of words in each line, to control the segmentation of the line image into word zones.
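One plausible way to let the word count control the segmentation is to keep only the widest gaps in the line's ink profile. The sketch below assumes this strategy; the source does not specify how the count is applied, and the function name and sample profile are hypothetical.

```python
def split_line_into_words(col_profile, n_words):
    """Split one text line into n_words zones.

    col_profile holds the amount of ink in each pixel column of the line.
    The n_words - 1 widest blank gaps are taken as word separators
    (an assumed strategy). Returns (start, end) pairs, end exclusive.
    """
    # 1. Collect runs of inked columns.
    runs, start = [], None
    for x, ink in enumerate(col_profile):
        if ink > 0 and start is None:
            start = x
        elif ink == 0 and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(col_profile)))
    if not runs:
        return []

    # 2. Measure the blank gap that follows each run except the last.
    gaps = [(runs[i + 1][0] - runs[i][1], i) for i in range(len(runs) - 1)]

    # 3. Keep the widest gaps as separators, in left-to-right order.
    widest = sorted(gaps, reverse=True)[:max(n_words - 1, 0)]
    cuts = sorted(i for _, i in widest)

    # 4. Merge the runs lying between two separators into one word zone.
    words, first = [], 0
    for i in cuts:
        words.append((runs[first][0], runs[i][1]))
        first = i + 1
    words.append((runs[first][0], runs[-1][1]))
    return words


if __name__ == "__main__":
    profile = [0, 3, 3, 0, 2, 0, 0, 0, 4, 4, 0, 0, 0, 0, 1, 0]
    print(split_line_into_words(profile, 3))  # [(1, 5), (8, 10), (14, 15)]
    print(split_line_into_words(profile, 2))  # [(1, 10), (14, 15)]
```

With the same profile, changing the expected word count changes the zones, which is how an edit to the transcription can yield a different word-image link.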

The higher the resolution at which the image is digitized, the more precisely each region, or word, can be identified.

Automatic segmentation of the image checked by the transcription presents a further advantage: without the operator's intervention, it adapts to the changes carried out in the text. Any second thoughts regarding word division or different interpretation produce a different word-image link.

The module shows the text and the segmented image in two separate windows on the screen, and the operator can check and correct any mistakes. The software examines the text transcription produced by the philologist in a document stored in RTF format. An 'ideal' graphical region is built around the words of the transcribed text, taking into account the diacritical symbols for which the size of the image and that of the transcription are different. In this way, the system is able to maintain an equivalence between the real image (obtained on the basis of the digitized document) and the virtual image (realized on the basis of the transcription).

Each region produced by the system is assigned an address which maintains its relation with the corresponding word of the transcription. In other words, the image is organized as a mosaic in which each rectangular area is associated with a pointer to the relevant word of the transcribed text.

The segmentation of the image controlled by the textual transcription is automatically updated when the transcription is modified. The module shows the text and the segmented image in two separate windows on the screen (Fig. 5) so that the operator can check for mistakes, which can be corrected with a mouse click. The segmentation is then recorded, and from this moment on a number of queries can be made:

by selecting a word in the transcription (or in the index locorum), a window appears showing the word highlighted in the image. The system supplies both the reference to the selected locus and all the loci in which that word can be read. Selecting any other reference immediately displays the corresponding parts of text and image;
by selecting a word in the image, a window appears which shows the transcription of the word in the text. Here too, the system suggests the references of the other loci in which the word appears.
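The two queries above rest on the mosaic of rectangles, each pointing to a word. A minimal sketch of such a structure follows; the class, the function names, the coordinates and the identifiers are all hypothetical and do not describe the actual BAMBI tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One rectangle of the mosaic with a pointer to a transcribed word."""
    x1: int
    y1: int
    x2: int
    y2: int
    word_id: str


def word_at(regions, x, y):
    """Image-to-text query: id of the word whose rectangle contains (x, y)."""
    for region in regions:
        if region.x1 <= x <= region.x2 and region.y1 <= y <= region.y2:
            return region.word_id
    return None


def other_loci(words, word_id):
    """Ids of the other places where the same word can be read."""
    text = words[word_id]
    return [wid for wid, t in words.items() if t == text and wid != word_id]


if __name__ == "__main__":
    regions = [
        Region(10, 10, 50, 30, "T1_1_1"),
        Region(60, 10, 120, 30, "T1_1_2"),
        Region(10, 40, 50, 60, "T1_2_1"),
    ]
    words = {"T1_1_1": "Item", "T1_1_2": "partimo", "T1_2_1": "Item"}
    clicked = word_at(regions, 20, 15)
    print(clicked, words[clicked])     # T1_1_1 Item
    print(other_loci(words, clicked))  # ['T1_2_1']
```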

2.5.3 Manual correction of the results of the automatic concordance

The algorithm that matches word and image, although powerful, cannot guarantee the correct identification of every block of words. Certain peculiarities of the manuscript (stains, words that are joined together, drawings, illuminations, etc.) require manual intervention in order to be correctly analysed (Fig. 6). Once automatic matching has been completed, the user is free to alter the result so as to correct these errors. The corrections are stored and applied each time the algorithm is used again.

Figure 6. Manual correction of segmentation

The data relevant to the texts and the images are encoded in different ways: the original encoding uses RTF for text and JPEG for compressed images. RTF was chosen because it allows multiple character fonts in a single text file. Compared with raw text, it allows, for instance, Greek and Latin characters to be transcribed on the same page, which was one of the specifications from the CNR of Pisa. This format is also, by definition, richer (while not taking too much storage space) and is more 'open' for further development of the project. The text is stored as RTF on disk, but the text itself is not stored in the database; the database contains only links to files stored either on the CD-ROM (for the original texts) or in the user's personal storage system (hard drive, shared directory, etc.) as defined in the 'set-up window'.

Information contained in the BAMBI database has been modelled according to SGML and HyTime.

3 Modelling of the manuscript database with HyTime

One significant utility provided in the BAMBI system is a filter for the transparent export of text (including image-to-text matching data) in the form of SGML/HyTime-formatted text. Saving and exporting data (i.e. encoded text containing tagged references to bookmarks, annotations and the position of each word inside the bitmap) in an SGML-compliant format is a useful feature, especially in view of the growing implementation of the international Text Encoding Initiative for the standardised (through SGML and HyTime) encoding of classical text in machine-readable form.

A custom BAMBI document type definition (DTD) has been developed, which is able to handle the most common tags implemented in BAMBI: manuscript description, annotations, links, bookmarks, words in Greek, text-to-image matching pairs, etc. This DTD has been written in SGML using some of the advanced features of HyTime for representing links.

After introducing HyTime and showing how it can be used as a modelling language to describe work on manuscripts, we present the relevant parts of the HyTime model and show that the model can serve as a basis for implementation.

3.1 HyTime

HyTime (Hypermedia/Time-based Structured Language) (ISO 1992; DeRose and Durand 1994; Newcomb et al. 1991) is an international standard which defines the structure of multimedia and hypermedia documents. HyTime is an extension of SGML (ISO 1986; VanHerwijnen 1990) mainly addressing the problems of:

locating data of any type, by using a standard notation that is independent of the processing system and of the data itself
describing links within and between documents
structuring contents
describing relations between temporal and spatial events occurring in documents.

HyTime is characterized by its meta-definition capabilities. It consists of a DTD which defines a set of SGML element types, each having a semantic meaning. These elements are called architectural forms. Architectural forms (AFs) can be compared with abstract classes in object-oriented programming. Their specification is expressed by a narrative text combined with a formal definition and associated programs. The set of specifications of architectural forms authorized by HyTime is gathered under the name of meta-DTD because it represents the DTD model of the HyTime application. An architectural form therefore corresponds to the definition of a meta-element. It represents a basic information structure. An element inherited from this meta-element can then be used in any DTD. It can be specialized (content model, attributes) following rules defined by the HyTime standard. These architectural forms are then linked by relations such as generalization/specialization, analogous to those encountered in object models.

Figure 7 shows an example of a hyperdocument which includes a very simple link (it corresponds to an annotation, a free comment, in BAMBI). This type of link corresponds to the architectural form clink. Figure 8.1 shows the architectural form clink; Fig. 8.2 outlines the derivation of an element model 'lien' in the DTD of the hyperdocument. The element model 'lien' is instantiated in Fig. 8.3.

Figure 7. Hyperdocument example containing a contextual link between two nodes

Figure 8. HyTime description of the hyperdocument presented in Fig. 7

3.2 BAMBI document type definition

This section describes the HyTime conforming DTD for BAMBI (see Fig. 9.).

<!-- DTD for a class of document exported from BAMBI project -- >       

<!ENTITY % doctype "MANUSCRI"                                           >

        <!-- Document STRUCTURE -->
<!--      ELEMENTS              MIN     CONTENT (EXCEPTIONS) --         >

<!ELEMENT %doctype;             - -     (InfoManu, Pages*, Fin?)        >

<!ELEMENT InfoManu              - -
        (UserName,Title,Author,Library,Incipit,Material,Date,Size,
        Languages,Handwriting,Bookmark*)                >
<!ELEMENT UserName              - -     (#PCDATA)                       >
<!ELEMENT Title                 - -     (#PCDATA)                       >
<!ELEMENT Author                - -     (#PCDATA)                       >
<!ELEMENT Library               - -     (#PCDATA)                       >
<!ELEMENT Incipit               - -     (#PCDATA)                       >
<!ELEMENT Material              - -     (#PCDATA)                       >
<!ELEMENT Date                  - -     (#PCDATA)                       >
<!ELEMENT Size                  - -     (#PCDATA)                       >
<!ELEMENT Languages             - -     (#PCDATA)                       >
<!ELEMENT Handwriting           - -     (#PCDATA)                       >

<!ELEMENT (Bookmark | Fin)      - -     (#PCDATA)                       >

                <!-- Page STRUCTURE -->
<!ELEMENT Pages                 - -     (Image,Transcri)+               >

<!ELEMENT Image                 - -     (CoorMots*) +graphic            >

<!ENTITY % CoordXY      "(X1,Y1,X2,Y2)"                                 >
<!ELEMENT CoorMots              - -     (%CoordXY;)                     >

<!ELEMENT (X1,Y1,X2,Y2)         - -     (#PCDATA)                       >

<!ENTITY % Annot "(Annot1|Annot2|Annot3|Annot4|Annot5|Annot6)"          >

<!ELEMENT Transcri              - -     (Curpage,(Column,Ligne,Mots+,(%Annot;)*))*>
<!ELEMENT Curpage               - -     (#PCDATA)                       >
<!ELEMENT Column                - -     (#PCDATA)                       >
<!ATTLIST Column        NumCol  CDATA   #REQUIRED                       >
<!ELEMENT Ligne                 - -     (#PCDATA)                       >
<!ATTLIST Ligne         NumLine CDATA   #REQUIRED                       >
<!ELEMENT Mots                  - -     (#PCDATA|Mots*)                 >
<!ATTLIST Mots          Police  CDATA   #IMPLIED                        >


<!ELEMENT (Annot1|Annot2|Annot3|Annot4|Annot5|Annot6)
                                - -     (#PCDATA)           >
 
                <!-- Attribute definition Lists -->

        <!--    Entity-name     contents -->
<!ENTITY        MAP1            "<X1> <!USEMAP MAP-INX1>"               >
<!ENTITY        MAP2            "</X1> <Y1> <!USEMAP MAP-INY1>"         >
<!ENTITY        MAP3            "</Y1> <X2> <!USEMAP MAP-INX2>"         >
<!ENTITY        MAP4            "</X2> <Y2> <!USEMAP MAP-INY2>"         >
<!ENTITY        MAP5            "</Y2> </CoorMots>"                     >

        <!--    Mapname         delimiter       Entity-name -->
<!SHORTREF      MAP-X1          "("             MAP1                  >
<!SHORTREF      MAP-INX1        ","             MAP2                    >
<!SHORTREF      MAP-INY1        ","             MAP3                    >
<!SHORTREF      MAP-INX2        ","             MAP4                    >
<!SHORTREF      MAP-INY2        ")"             MAP5                    >


        <!--    Mapname         element -->
<!USEMAP        MAP-X1          CoorMots                            >

Figure 9. BAMBI DTD

A manuscript is composed of a description (InfoManu, in the DTD) and of pages (Pages, in the DTD). The element Pages is composed of an image (element Image) and a transcription (element Transcri).

Each word in the image is described by its coordinates X1, Y1, X2, Y2 (element CoorMots). X1 and Y1 are the x- and y-coordinates of the upper-left corner of the rectangle around the word in the image; X2 and Y2 are the x- and y-coordinates of its lower-right corner.
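The short-reference maps declared at the end of the DTD let the four coordinates be written compactly as '(x1,y1,x2,y2)' inside CoorMots: an SGML parser replaces '(' by the start-tag of X1, each comma by the end-tag of one coordinate and the start-tag of the next, and ')' by the end-tags of Y2 and CoorMots. The sketch below only imitates the net effect of this expansion with a regular expression; it is not an SGML parser, it omits the whitespace contained in the MAP entities, and the function name and coordinate values are hypothetical.

```python
import re

# One compact coordinate group directly after the CoorMots start-tag.
COMPACT = re.compile(r"<CoorMots>\(([^,()]*),([^,()]*),([^,()]*),([^,()]*)\)")


def expand_coormots(text: str) -> str:
    """Rewrite '<CoorMots>(x1,y1,x2,y2)' as fully tagged markup."""
    def tagged(match):
        x1, y1, x2, y2 = (value.strip() for value in match.groups())
        return (f"<CoorMots><X1>{x1}</X1><Y1>{y1}</Y1>"
                f"<X2>{x2}</X2><Y2>{y2}</Y2></CoorMots>")
    return COMPACT.sub(tagged, text)


if __name__ == "__main__":
    print(expand_coormots("<CoorMots>(205,75,333,144)"))
    # <CoorMots><X1>205</X1><Y1>75</Y1><X2>333</X2><Y2>144</Y2></CoorMots>
```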

Each word in the transcription is located by the line and column numbers where it appears. Moreover, if an annotation exists on the word (or on a group of words) it is specified with the ENTITY Annot.

3.3 Instantiation of the BAMBI DTD

The information on a manuscript is defined as an instance of the BAMBI DTD. For example, the document c4r.sgm is the instance of the BAMBI DTD describing carta 4 recto of the manuscript 'Diario del viaggio in Terra Santa 1559'. The most important parts of this document are described below.

Manuscript description

The model used to describe the manuscript is the computerized register for the cataloguing of manuscripts recently developed by the Istituto Centrale per il Catalogo Unico (ICCU) of the Italian libraries, i.e. identification of the manuscript, location, call mark, support, date, type of script, heading. The HyTime code in Fig. 10 corresponds to the description (Element InfoManu) of the manuscript 'Diario del viaggio in Terra Santa 1559'.

<INFOMANU>
<USERNAME>Mario</USERNAME>
<TITLE>Diario del viaggio in Terra Santa 1559</TITLE>
<AUTHOR>Luca da Gubbio</AUTHOR>
<LIBRARY>1</LIBRARY>
<INCIPIT>Unknown</INCIPIT>
<MATERIAL>Cartaceo</MATERIAL>
<DATE>Sec. XVI 2° Meta</DATE>
<SIZE>CC  98</SIZE>
<HANDWRITING>8</HANDWRITING>
<BOOKMARK> Diario del viaggio in Terra Santa 1559 : c4r</BOOKMARK>
<BOOKMARK> Diario del viaggio in Terra Santa 1559 : c5r</BOOKMARK>
</INFOMANU>

Figure 10. Manuscript description

Link between part of image and part of text

As can be seen from the portions of HyTime code in Fig. 11 (instance of the BAMBI DTD), a link from a part of the image to the corresponding part of the text in the transcription is defined using the architectural form hotspot. The link from a word in the transcription to the corresponding part of the image is defined using the architectural form link. Figure 11 gives an example of these mechanisms:

<IMAGE>
<HYLOC>
<HOTSPOT ID=H1_1_1 GRAPHIC=Image5 REF=T1_1_1 RX="205,02" RY="75,64" RW="128,52" RH="69,54">
.....
</HYLOC>
</IMAGE>
<TRANSCRI>
<CURPAGE>c4r</CURPAGE>
<COLUMN NumCol=1>
<LINE Numline=1>
<LINK ID=T1_1_1 LINKEND=H1_1_1>I(tem)</LINK>
...............
</LINE>
</COLUMN>
</TRANSCRI>

Figure 11. Link between part of image and part of text

For example, the word 'I(tem)' in the transcription is linked with the corresponding part of the image using the architectural form link, and the part of the image is linked with the word (in the text) using the architectural form hotspot. RX, RY, RW and RH specify the coordinates of the part of the image corresponding to the word I(tem). The parentheses signify that 'I' appears in the manuscript image and that it is an abbreviation of Item.

4 BAMBI architecture

The philological workstation architecture (Fig. 12.) is composed of the following elements: the BAMBI database, the BAMBI application, the HyTime application, the HyTime engine and the SGML parser.

Figure 12. BAMBI architecture with HyTime modules

4.1 BAMBI database

A storage structure is required in BAMBI at several levels, the most obvious of which is the data repository layer, where the images of the manuscripts and the transcription texts are kept. The BAMBI system offers complete data management capabilities, enabling the treatment of collections of hundreds of manuscripts from several different locations, each with hundreds of pages and a corresponding number of transcriptions. The navigation functions offered by the user interface, which transparently expose a hierarchical structure, must be based on a solid relational database management engine (here Microsoft Access 7.0), either local or placed on a LAN or Web server. Another fundamental block in the structure relates to the image-to-text matching mechanism, which was only able to operate at run-time, with no provision for permanent storage of matching patterns. The present schedule envisages the creation of a specific structure inside the RDBMS engine, through which text-to-image correspondences will (optionally) be made permanent and saved on disk. At the same time, a convenient facility is available to implement the matching structure in the form of transcription text encoded with SGML-compliant tags. Another element, which produces an index of the words encountered in each manuscript and allows the user to browse through the index verborum and to perform word searches, is also based on a dedicated RDBMS structure.

The database (Fig. 13) is composed of 17 tables: manuscript, page, link, keyword, user, sharing, annotation, bookmark, shortcut, exception, path, index, alphabet, storage, microfilm, library and script. The main ones are:

Manuscript table, lists the characteristics of a manuscript: title, author, incipit, library, ...
Link table. When a word is 'clicked' in the transcription or image, it is the link and exception tables that are used, and not the matching algorithm. This solution represents the best possible compromise between speed of execution and size of the database.
Exception table, contains information added manually by the user which allows the matching algorithm to remove certain ambiguities. At present three types of exceptions have been identified (M=mask, I=illumination and C=correction). It should be noted that this table is not modified by the matching algorithm, unlike the Link table.
Index table, contains an exhaustive list of the words in each manuscript (including the variants provided by the user) for a given set of characters (Greek or Latin). This table will be used for searches that count the number of elements meeting certain criteria. The number of occurrences of a word in a manuscript will be obtained by calculation.
Alphabet table, contains all the alphabets of the languages used in the manuscripts stored in the database, together with the collation order of each of their characters. This makes it possible to provide a properly ordered index whatever the language.
Script table, contains the names of the types of scripts used in the manuscripts.
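The role of the Alphabet table can be illustrated by a table-driven sort: each character is ranked by its position in a stored order, and words are compared rank by rank. This is a sketch under assumptions, not the BAMBI schema: the order string, the function name and the sample words are chosen for illustration only.

```python
# Assumed collation order for a Greek alphabet entry; the final sigma is
# simply placed next to sigma.
GREEK_ORDER = "αβγδεζηθικλμνξοπρσςτυφχψω"


def sort_index(words, order):
    """Sort words by the position of each character in the given order.
    Characters missing from the order are ranked after all known ones."""
    rank = {ch: i for i, ch in enumerate(order)}
    unknown = len(rank)
    return sorted(words, key=lambda w: [rank.get(ch, unknown) for ch in w])


if __name__ == "__main__":
    print(sort_index(["λογος", "αρχη", "θεος"], GREEK_ORDER))
    # ['αρχη', 'θεος', 'λογος']
```

Because the order is data rather than code, adding a further script would only require a new order string.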

Figure 13. Database architecture

4.2 BAMBI application

The implementation of the BAMBI functions is based on the following main processing steps:

text processing
text-to-image matching
SGML/HyTime filtering

4.2.1 Text processing

The BAMBI software can handle only two types of text: Latin and Greek. This was part of the specifications given by the CNR of Pisa. The specific treatment of Greek words is based on the font used: only words set in the font specified in the setup window are considered to be Greek. This allows language recognition while avoiding heavy dictionary processing for multi-language documents. This is an architectural limitation that could easily be changed if needed: the specifications asked for only two character sets, and they were followed strictly at design time. It does not mean that this version of the software cannot handle other character sets, but for the moment the number of extra sets (in addition to Latin) is limited to one. This extra character set is called 'greek' in the software, but it is of course possible to use a hieroglyphic font instead, for example, to work on Egyptian papyri.

No existing software is used for indexing; everything was created from scratch. Words are stored according to their alphabet (for each word, the algorithm checks 'does this word use the alphabet defined in the setup window as the Greek alphabet?'). The whole manuscript is read and an internal structure counts the definitions one after the other. Greek words must use the font specified in the setup window (see below). To be sorted properly, this font must respect alphabetic order, i.e. its ASCII codes must follow the alphabetic order of the script and not necessarily the classic Latin order used in many fonts.
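The font-based test can be sketched as follows. The input is assumed to be a list of (word, font name) pairs already extracted from the RTF file; the font names, the function name and the sample words are hypothetical. Sorting uses plain character codes, which is why the Greek font must lay out its characters in alphabetic order.

```python
def split_by_alphabet(runs, greek_font):
    """Separate words into Latin and 'greek' lists according to their font.

    runs is a list of (word, font_name) pairs. A word counts as 'greek'
    only if it is set in the font chosen in the setup window.
    Each list is sorted by character code.
    """
    latin, greek = [], []
    for word, font in runs:
        (greek if font == greek_font else latin).append(word)
    return sorted(latin), sorted(greek)


if __name__ == "__main__":
    runs = [("Item", "Times"), ("logos", "GreekFont"),
            ("arche", "GreekFont"), ("dies", "Times")]
    latin, greek = split_by_alphabet(runs, greek_font="GreekFont")
    print(latin)  # ['Item', 'dies']
    print(greek)  # ['arche', 'logos']
```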

4.2.2 Text-to-image matching

The method used for the text-to-image matching has already been described in section 2.5. The segmentation algorithm uses a histogram scheme based on grey-levels in the image. The only use of the textual representation is the length/number of words in each line. The algorithm is given the name of the text file and even handles the file access process.

4.2.3 SGML filtering

We recall that the aim of the SGML filter is to export the text of a manuscript in an SGML format. The information to be modelled is stored in the database: the manuscript title, the author name, the library name, the Incipit, the material, the date, ..., the number of pages, the number of lines, and the graphical coordinates of the rectangular area around a word in the image.

SGML is used in the software for its export capability, not as an internal representation. The SGML file is generated automatically by a simple algorithm: the input data, stored in the BAMBI database, are wrapped in the tags specified in the BAMBI DTD (defined in section 3) during SGML file generation. Consequently, the SGML file is not stored in the database but created from it.

More exactly, for each field, the SGML filter algorithm:

writes the opening tag of the field in the SGML file being generated, for example <TAG>
reads the information corresponding to the field from the database
writes the information to the file
writes the end tag in the SGML file, for example </TAG>.
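The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the BAMBI filter: the database is replaced by a plain dictionary, the tag names TITLE, AUTHOR and LIBRARY are hypothetical (the real ones are those of the BAMBI DTD in section 3), and the escaping of markup characters is an addition that the description above does not mention.

```python
import io

# Hypothetical (tag, database key) pairs; the real tags come from the BAMBI DTD.
FIELDS = [("TITLE", "title"), ("AUTHOR", "author"), ("LIBRARY", "library")]


def escape_sgml(text):
    """Replace markup characters so field content cannot be read as tags.

    Assumption: this step is not part of the description in the text.
    """
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def export_record(record, fields, out):
    """Write one database record to the file-like object out as tagged fields."""
    for tag, key in fields:
        out.write(f"<{tag}>")                  # 1. opening tag
        value = str(record[key])               # 2. read the field from the "database"
        out.write(escape_sgml(value))          # 3. write the information
        out.write(f"</{tag}>\n")               # 4. end tag
```

A possible use, writing to an in-memory buffer instead of a file:

```python
buffer = io.StringIO()
export_record(
    {"title": "Diario del viaggio in Terra Santa", "author": "(unknown)", "library": "(unknown)"},
    FIELDS,
    buffer,
)
print(buffer.getvalue())
```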

4.2.4 SGML/HyTime modules

The HyTime application is the visible part of the HyTime production chain (Fig. 14). It is a program which manages hyperdocuments and determines how a document must be presented, translating the HyTime concepts into a presentation format (Buford et al. 1994). While HyTime engines and parsers are designed to serve all applications, each HyTime application must be developed individually. In the HyTime processing chain, the application controls all interactions with the user. In practice, when an SGML or HyTime document is processed, the application uses the HyTime engine (Newcomb et al. 1991). Figure 14 shows the HyTime presentation of the manuscript Diario del viaggio in Terra Santa 1559.
A HyTime engine (Buford et al. 1994; Koegel et al. 1993) is a program (or portion of a program, or a combination of programs) that recognizes HyTime constructs in documents and processes them independently of the application (ISO 1992). The development of the SGML/HyTime filtering is based on a HyTime engine called Synex Viewport, designed by AB Synex Viewport with the Royal Institute of Technology of Stockholm, Sweden. This HyTime engine is supplied as a C++ library with a clear and easy-to-use interface. It allows SGML documents to be parsed and visualized.
An SGML parser is a syntactic analyser. It checks that the document is consistent with its DTD.

Figure 14. HyTime application (link between part of image and part of text)

4.3 BAMBI software

The software is divided between two media: a permanent CD-ROM and a rewritable medium, the local hard disk of the workstation (Fig. 15). The CD-ROM can be shared by several workstations linked in a network. It contains:

images of manuscripts in JPEG 1200x1850 format in 256 shades of grey;
images of manuscripts in the reduced 180x180 format for preview;
transcription files in the ASCII format;
update files (database tables).

The local hard disk contains three types of data for the BAMBI software:

The executable part of the program as well as its drivers and initialisation and parametrisation files;
The database, as described in the previous paragraph;
The personal files of users, containing images of manuscripts and transcriptions that they want to include in the database even though they do not appear on the CD-ROM.

Figure 15. BAMBI software

The original prototype of the BAMBI software tools abstracted away from the data handling and storage structures. These are, however, essential for the future use of the system in real-world contexts.

5 Conclusion and perspectives

The BAMBI initiative has explored issues relating to the digitisation of microfilms of manuscript materials and has developed advanced tools for the analysis and exploitation of digital images, encompassing visual browsing, automatic text-to-image matching, text indexing, annotation, hyperlinking and text export in SGML/HyTime formats. The project has successfully striven to match the needs of libraries (extensive digitisation and accessibility/exploitation of valuable materials) with the requirements of scholars and students in the humanities (philologists, historians, etc.). Most of them (from the Scuola Normale Superiore di Pisa, the Università di Genova and the Università di Bari in particular) have experimented with the Philological Workstation, and they reacted favourably when the results of the BAMBI project were presented in Rome in May 1997. At the time of writing, a decision had been taken to commercialise the BAMBI Workstation by making the CD-ROM available on the market at the beginning of 1998.

The longer-term aim of the BAMBI project is the creation of several components of a comprehensive digital workstation solution for the preservation and study of manuscripts in digital form, with provision for handwriting recognition, electronic restoration and Web-based collaborative philological work. A new project will deal with a number of issues left unanswered by the initial project:

more comprehensive, standards-based tools for the description of manuscripts
better image processing routines for electronic enhancement of microfilm images and the preservation of document consistency and authenticity
OCR tools for the automation of the transcription process
a comprehensive solution for the management of text variants
tools based on image processing facilities and linguistic (statistical) facilities for electronic restoration of missing text elements
a client-server model for collaborative work based on Web servers
a thorough survey of the technical and legal framework for the development of widespread, multi-source services offering digital versions of library materials and tools for their use.

We have already experimented with Java and ActiveX approaches for generating the Web version of BAMBI. The proposed distributed application is based on an open architecture composed of one or several servers and client workstations, each with specific access rights according to the user profile (philologist, librarian, scholar). We have identified three kinds of access:

philological application information access: limited to the presentation of the philological application functionalities available on the Web
partial access: permits only the consultation of manuscripts stored in the manuscript database (images or transcriptions)
full access: allows clients to use all the functionalities of the philological application.

We have chosen the ActiveX solution. Like Java applets, ActiveX controls are self-contained pieces of functionality that run inside some kind of container. As with Java applets, a Web browser is a good choice for that container, allowing ActiveX controls to be embedded in Web pages and downloaded on demand. This solution has been chosen because it allows us to reuse the code developed previously. Reusing this code has involved transforming the main forms of the BAMBI application into ActiveX controls which will be directly loaded in an HTML page.