Issues for East Asian text databases

By Ellen McGill, Librarian, Yenching Library, Harvard University
Contact: John Wong, Harvard University [jwong338 (at)]

1. OCR and quality control:
Although OCR works increasingly well for recently published texts, especially those created on computers, error rates remain comparatively high for historical texts because of unusual characters, paper deterioration, variations in woodblock printing (leaving aside manuscripts), and so on. For East Asian texts these problems persist well into the 20th century because of that century's script reforms and the low quality of paper used (especially in China between the 1920s and 1950s, and in Japan and Korea during wartime).

I believe that the most respected Chinese scholarly text databases (e.g. Digital Heritage’s Si ku quan shu, Chinese University of Hong Kong’s CHANT, Academia Sinica’s full text databases, etc.) have all been created through manual entry (i.e. the texts were retyped as well as scanned), with quality checking done by graduate students and scholars. A certain amount of government funding has been available in Taiwan, Japan, and Korea to support historical text digitization projects, which makes this kind of labor possible, though increasingly difficult. For example, recent digitization projects at Academia Sinica and the Million Book Project only scan historical texts rather than OCRing them. When we discussed OCR possibilities with Academia Sinica staff a year and a half ago, they felt that error rates were still far too high for OCR to be useful for historical texts on any large scale, although it is an area they are watching closely.
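
As a rough illustration of what an OCR "error rate" can mean in practice, the sketch below computes a character error rate: the edit distance between a hand-checked reference transcription and the OCR output, divided by the length of the reference. This is one common way to measure OCR quality, not necessarily the measure used by any of the projects named above; the function names and the sample strings are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the checked reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Illustrative sample: the OCR output has one wrong character out of four.
sample_rate = character_error_rate("天地玄黃", "天地玄黄")
```

A figure like this only has meaning against a carefully proofread sample, which is the kind of manual checking by graduate students and scholars described above.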

The proliferation of image-only databases makes comprehensive metadata creation even more important, but of course this requires fairly specialized subject knowledge among staff.

2. Script input and matching:
The input method editors (IMEs) packaged with Windows (by far the most widely used system in East Asia), Macintosh, etc. do not contain many historical or variant characters. For historical text databases to be searchable, it may be necessary either to create a dictionary function that matches currently used characters to variants, so that people can search with the standard IMEs, or to develop a plug-in or adopt a different input method editor, which may introduce compatibility and support problems (especially relevant when an institution is considering subscription or purchase).
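
The dictionary approach can be sketched roughly as follows: both the query and the stored text are folded to a single "standard" form before comparison, so a character typed with a stock IME also finds its variants. This is a minimal sketch, not how any particular database implements it. The three-entry table, the choice of which form counts as standard, and the function names are all assumptions for illustration; a real table would be far larger and would need to be curated by subject specialists.

```python
# Hypothetical variant table: variant form -> form typable with a standard IME.
VARIANT_TO_STANDARD = {
    "爲": "為",
    "眞": "真",
    "敎": "教",
}


def normalize(text: str) -> str:
    """Replace each known variant character with its standard form."""
    return "".join(VARIANT_TO_STANDARD.get(ch, ch) for ch in text)


def search(query: str, documents: dict) -> list:
    """Return the ids of documents whose folded text contains the folded query."""
    folded_query = normalize(query)
    return [doc_id for doc_id, text in documents.items()
            if folded_query in normalize(text)]


# Illustrative corpus: the first two texts use variant characters.
documents = {"a": "以德爲本", "b": "眞理", "c": "天下"}
hits = search("為", documents)  # the query is typed with the standard character
```

Folding everything to one form keeps the search box compatible with standard IMEs, at the cost of no longer distinguishing variants in results; a database that needs that distinction would have to keep the original text alongside the folded copy.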

3. Different sizes and formats of materials:
Besides fragility, text formats and sizes may introduce problems for scanning. Scrolls, rubbings, etc., come in large sizes and may have to be photographed rather than scanned, then stitched together in Photoshop or another program. This adds significantly to cost and may require special facilities. Paper was in widespread use much earlier in East Asia than in Europe, so a larger proportion of historical texts are on paper and may be more fragile than texts on parchment, vellum, and so on. Stitchbound volumes often have to be unstitched prior to scanning, so facilities for restitching are required. There is also some debate as to whether the bindings themselves have value as historical objects, since removing them typically means replacing them with modern thread, etc. Because of script directionality, page turners may need to go right to left rather than left to right, and so on.

4. Need for networking and compatibility issues:
With East Asian databases it still seems to be more common for networking of discs or physical installation of a hard drive on a server to be required. Although systems are increasingly standardized internationally, with growing multilingual capacity, installation can still be a problem outside the environment where a database was created. Windows and IE are overwhelmingly dominant in East Asia, and databases are often designed specifically for Windows/IE, so they may not run properly on other systems and browsers. This affects all databases, not specifically historical ones, but it compounds the need for technical and institutional support mentioned above for special plug-ins and fonts.