Chinese Collections in the Digital Library: introduction to a special issue

Brian Bruya
Special issue editor, University of Hawai'i, USA

A floppy disk can store about 500,000 Chinese characters--compare that to an ancient bamboo strip measuring about the same in area but maxing out at about 28 characters. Text processing has come a long way. Still, some of those bamboo strips have been dug out of the ground and are still legible after 23 centuries. What are the chances of a floppy making it through two millennia in the dirt, or of there being a device around that could then decipher it? There is still much to be done in the field of text processing, and the articles here, with their emphasis on technical achievement and XML deployment, are samples of some recent work that applies advanced computing techniques to pre-modern Chinese texts.

In September 2001, I embarked on a fact-finding trip to a number of projects in Japan, Taiwan and Hong Kong that are digitizing vast amounts of the pre-modern Chinese corpus for electronic dissemination. As managing editor of the NSF-funded Shuhai Wenyuan project at the University of Hawai'i, where we are putting some Chinese classics online with a number of interpretive tools, I wanted to see how others were coping with the challenges and requirements of this type of endeavor. The field of digital libraries is still so new, and the materials that each project handles so diverse, that there are no developed standards for how to manage the editorial and technical aspects of a large digitization and dissemination project, and doubly so for projects handling non-alphabetic languages.

The articles in this special issue of JoDI are a direct result of that trip. Originally I had planned to write an article summarizing and comparing each of the projects, but Cliff McKnight, Editor-in-Chief of JoDI, thought readers would prefer the voices of the projects themselves, in an issue devoted entirely to the subject. As a result, we have articles ranging from handling non-alphabetic languages to XML utilization to coordinating multiple databases of diverse topics into a single digital library.

Christian Wittern's work with the Chinese Buddhist Electronic Text Association (CBETA) is a textbook example of how to run a digitization and dissemination project. Currently the Chair of the TEI Working Group on Character Encoding, Wittern has long understood the desirability of developing a project that adheres to accepted standards. Accordingly, he had the entire CBETA corpus produced in XML in conformance with the TEI guidelines. He offers an overview of CBETA, details of his work producing the XML texts, and an insightful section on the difficulties that prevent the production of an 'ideal' version of the CBETA texts.

The Chinese University of Hong Kong's Chinese Ancient Texts (CHANT) project has put an enormous amount of effort into producing a sterling set of critical primary resources. Envisioned over a decade ago, the project set out not only to produce definitive paper concordances of pre-modern Chinese texts but also to maintain them electronically for electronic distribution. This prescience has led to a Web site that will soon offer over 1100 titles comprising over 31 million Chinese characters, each title rigorously collated and meticulously proofread. Che Wah Ho's overview includes discussions of the challenges of maintaining accuracy and of producing distinct databases from source media as diverse as oracle bones, bronzes, and excavated bamboo strips.

Michel Mohr's article moves us into the world of Zen. Having taken over the work of Urs App and Christian Wittern at Hanazono University's International Research Institute for Zen Buddhism, Mohr discusses his efforts to provide unique identifiers for Zen texts and personages. Faced with several languages spanning more than 13 centuries, the task is daunting. Mohr's analysis of the need for ISBN-like numbers for pre-modern works applies equally well to textual traditions in all languages, and his provocative coda criticizes XML and other forms of electronic classification as belonging to an outdated, pre-electronic mode of thinking.

Charles Muller developed a dictionary of pre-modern Chinese terms that, while technically complex, relied solely on Word macros for its functionality. Faced with unmanageable complexity and increasing instability, he accepted the assistance of Christian Wittern and Michael Beddow (a seasoned programmer and a current member of Wittern's TEI working group) in converting his database to XML and transforming it with XSLT, making use of XLink and XPointer technologies. Muller and Beddow each offer their view of this successful collaboration, including the creation of search functions and the advantages of using XML over straight HTML.

Before the Chinese invented paper some 2000 years ago, they had been writing on tortoise plastrons, ox bones, silk, and strips of bamboo. The writings of the philosopher Hui Shi from this time are said to have filled five carts, and the First Emperor of China reviewed 100 pounds of documents per day. East Asian text handling has, indeed, come a long way. In these articles, we find that the authors are working not only at the cutting edge of sinology but at the cutting edge of humanities computing.