Digital Libraries and Data Enhancement

This page was written by Steven J. DeRose around 1998, and was last updated on 2003-04-12.

Data enhancement in Digital Libraries

These are some unpolished thoughts re. how to search huge digital libraries better: specifically, the tradeoff between making your data smarter, and making your search tools smarter.

As in any other domain, the amount of information eventually reaches a point where tools cannot just get proportionately bigger anymore: the tools themselves must change. The Web is already at this point, and Digital Libraries are rapidly moving toward it.

As Digital Libraries approach the size of paper ones, we must create new and extraordinarily effective ways to navigate them, and to find the information we need without being deluged by irrelevance. This requires that computer systems be able to find documents accurately given our inevitably vague and prosaic specifications.

There are 3 sources of knowledge that a system can draw on to determine relevance:

These sources of information are normally combined through logical, stochastic, cluster, or other methods, to determine what the user wants. Much work has been done in information retrieval (IR) on how to process the information sources. Much has also been done on the encoding of metadata on documents, but in several almost separate guises: cataloging data, document markup, and document schemata.

The first type has a great deal of experience and standardization to inform us. The new issues, as I see them, involve one basic change the Web has introduced: physical objects and logical objects no longer correspond very well. Any logical object of non-trivial size (like, say, a book) must be broken into many fragments for practical Web delivery, because of bandwidth limitations (browser performance constraints also come in, since all the Web browsers use batch formatters -- my patents deal largely with non-batch formatting techniques).

When you break up a document into many physical parts, you run into problems that do not come up much in monographic cataloging. The problems are touched upon when catalogers deal with monographic series, collected works in which the parts must be cataloged, and particularly in archives, personal papers, and manuscripts (hi, Steve). Cataloging for the Web must come to terms with the problem of objects that not only are not in hand, but are not singular.

The second type of metadata, document markup, has enormous potential for enhancing search and retrieval, particularly in terms of precision. Precision is arguably the most needed improvement in Web searching. There is seldom any problem getting enough hits for your Web search; the problem is getting hits you care about, or finding them in a maze of twisty little hits, all irrelevant. If you're putting a catalog online, you sure better be able to tag your PRICEs as such, and be able to search on that.
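As a rough illustration of the precision point, here is a minimal Python sketch that searches a tiny catalog two ways: for a token anywhere in the text, and for the same token only inside PRICE. The catalog data and the ITEM, NAME, and DESC element names are invented for the example; only PRICE comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; ITEM, NAME, and DESC are invented element names.
CATALOG = """
<CATALOG>
  <ITEM><NAME>Desk lamp</NAME><PRICE>25</PRICE><DESC>Uses a 40 watt bulb.</DESC></ITEM>
  <ITEM><NAME>Bulb, pack of 25</NAME><PRICE>40</PRICE><DESC>Long-life bulbs.</DESC></ITEM>
  <ITEM><NAME>Floor lamp</NAME><PRICE>60</PRICE><DESC>Ships in 25 days.</DESC></ITEM>
</CATALOG>
"""

root = ET.fromstring(CATALOG)

def names_by_text(root, token):
    """Unstructured search: the token may occur anywhere inside the item."""
    return [item.findtext("NAME") for item in root.iter("ITEM")
            if any(token in text.split() for text in item.itertext())]

def names_by_price(root, token):
    """Structured search: the token must be the content of PRICE."""
    return [item.findtext("NAME") for item in root.iter("ITEM")
            if item.findtext("PRICE") == token]

text_hits = names_by_text(root, "25")    # every item mentions "25" somewhere
price_hits = names_by_price(root, "25")  # only one item actually costs 25

print(text_hits)
print(price_hits)
```

Both searches have the same recall for the item priced at 25; only the tagged search leaves out the two irrelevant hits.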

Large-scale document systems such as those used in the software and high-tech industries (where I put a lot of design effort for DynaText) run into this problem at a huge scale. But so does any system with special data structures that should be searched in a database-like manner. Text searching without structure seems absurd -- by analogy, who would buy a database system where you could search for particular numeric values, but couldn't specify what field they must occur in? Enhanced text has not less structure available than databases, but more: it's just that it is a hierarchical and linked structure, not amenable to flat processing, and so needs more sophisticated query languages and tools. The issue of querying in this kind of data (variously called "highly structured", "semi-structured", or "unstructured" depending on disciplinary or political factors) is growing quickly in importance (see, for example, the focus on it in a recent issue of the International Journal of Digital Libraries).
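To make the hierarchical point concrete, the following Python sketch constrains a word search by where the match sits in the tree, not just by a flat field name. The document, its element names, and the `type` values are all invented for illustration; the path syntax is the limited XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; every element name and attribute value is invented.
BOOK = """
<book>
  <chapter type="tutorial">
    <title>Getting started</title>
    <sec><title>Indexing</title><p>Build the index before any search.</p></sec>
    <sec><title>Search basics</title><p>Type a word.</p></sec>
  </chapter>
  <chapter type="reference">
    <title>Search options</title>
    <sec><title>Search syntax</title><p>Operators and wildcards.</p></sec>
  </chapter>
</book>
"""

root = ET.fromstring(BOOK)

def titles_matching(root, path, word):
    """Return the text of elements on `path` whose text contains `word`."""
    return [el.text for el in root.findall(path)
            if word.lower() in el.text.lower()]

# Flat search: any element whose own text mentions the word.
flat_hits = [el.tag for el in root.iter()
             if el.text and "search" in el.text.lower()]

# Context-sensitive searches: same word, different places in the hierarchy.
tutorial_sections = titles_matching(
    root, ".//chapter[@type='tutorial']/sec/title", "search")
chapter_titles = titles_matching(root, "./chapter/title", "search")

print(len(flat_hits), tutorial_sections, chapter_titles)
```

The flat search returns four hits of mixed kinds, while each path-qualified search returns exactly one, which is the sort of distinction a field-only query cannot express.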

The third type of metadata, document schemata, is what SGML DTDs have traditionally done; they are much like relational database schemas, but have additional capabilities required to perspicuously describe tree structures: non-RDB-like structures where many levels of information occur, fields recur without limit in most contexts, and the order of "fields and records" matters a lot (this final characteristic of documents makes it awfully hard to optimize relational representations of ordered hierarchies, since it violates an underlying assumption of the relational model).
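The ordering problem can be shown in a few lines of Python. This is only a sketch under simplifying assumptions: rows are modeled as an unordered set, as relations are, and a parent is identified by its tag name rather than by a real key. The `poem` and `line` names and the sample text are invented.

```python
import xml.etree.ElementTree as ET

def rows_without_order(root):
    """Shred a tree into an unordered set of (parent, child, text) rows."""
    return {(parent.tag, child.tag, child.text)
            for parent in root.iter() for child in parent}

def rows_with_order(root):
    """Same, but with an explicit sibling-position column added."""
    return {(parent.tag, position, child.tag, child.text)
            for parent in root.iter()
            for position, child in enumerate(parent)}

a = ET.fromstring(
    "<poem><line>Roses are red</line><line>Violets are blue</line></poem>")
b = ET.fromstring(
    "<poem><line>Violets are blue</line><line>Roses are red</line></poem>")

same_when_unordered = rows_without_order(a) == rows_without_order(b)
same_when_ordered = rows_with_order(a) == rows_with_order(b)

print(same_when_unordered, same_when_ordered)
```

Two different poems become indistinguishable once shredded into unordered rows; order survives only if it is stored as extra data, and that position column must then be maintained on every insertion.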

SGML DTDs, although they have crucial features beyond most other schema systems (such as for dealing with hierarchy as a first-class phenomenon), also lack some needed features. For example, SGML does not provide much for identifying widely-needed atomic datatypes such as integers and real numbers, boolean values, dates, times, and the like. This is pretty straightforward, and needs to be addressed. Simply enumerating a set of standard types will cover a large percentage of needs, and a way to point to external named types can accommodate the rest. HyTime has some facilities for doing this at a lexical level; XML-data provides a reasonable proposed set of atomic types, though they need further integration with central XML notions of element, attribute, entity, and content.
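One way to picture "a set of standard types plus a way to point to external named types" is the Python sketch below. The type names, the lexical rules chosen for each type, and the `percent` external type are all assumptions made for the example; they are not taken from HyTime or XML-data.

```python
import re
from datetime import date

def is_iso_date(s):
    """True if s parses as an ISO calendar date such as 2003-04-12."""
    try:
        date.fromisoformat(s)
        return True
    except ValueError:
        return False

# Hypothetical built-in set of atomic types, each with a lexical check.
STANDARD_TYPES = {
    "integer": lambda s: re.fullmatch(r"[+-]?[0-9]+", s) is not None,
    "real": lambda s: re.fullmatch(
        r"[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?", s) is not None,
    "boolean": lambda s: s in ("true", "false"),
    "date": is_iso_date,
}

def check(type_name, value, external_types=None):
    """Validate value against a standard type, else an external named type."""
    checker = STANDARD_TYPES.get(type_name)
    if checker is None:
        checker = (external_types or {}).get(type_name)
    if checker is None:
        raise KeyError("unknown type: " + type_name)
    return checker(value)

# An external named type, defined outside the standard set (hypothetical).
EXTERNAL = {
    "percent": lambda s: check("integer", s) and 0 <= int(s) <= 100,
}

print(check("integer", "42"), check("date", "2003-04-12"),
      check("percent", "185", EXTERNAL))
```

The standard table covers the common cases; anything else is looked up by name in a table supplied from outside, which is the "pointer to an external type" half of the idea.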

Metadata standardization

There are several proposals for standardizing one or all of these metadata types, using XML syntax. What remains needful is a clear delineation of the three (and perhaps others), an analysis of the requirements under each (which I doubt will be totally compatible), and then a careful design of tool(s) appropriate to each task.

Cataloging data needs tools much like RDBs, unless the disaggregation problem described above changes that.

Markup structures already have an effective standard representation: XML.

Schemas for XML are partly covered by DTDs, though there is sentiment for reducing the syntax for DTDs to XML itself: using a reserved tag-set defining elements such as <element-declaration>, <attribute-declaration>, and so on. This makes sense to me, since I see no particular advantage to having

<!ELEMENT chapter - - (title, abstract?, (sec+ | poem+), refs)>
<!ATTLIST chapter 
   id       ID      #REQUIRED
   type     NAME    #IMPLIED>

as opposed to

<ELEMENT name="chapter">
   <MODEL NOTATION='expr'>(title, abstract?, (sec+ | poem+), refs)</MODEL>
   <ATTR name='id' type='ID' status='REQUIRED'/>
   <ATTR name='type' type='NAME' status='IMPLIED'/>
</ELEMENT>
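One practical consequence of the XML form is that an ordinary XML parser can read the declaration with no DTD-specific code. The Python sketch below assumes the declaration is completed into well-formed XML: the closing of MODEL and ELEMENT, and the empty-element form of ATTR, are assumptions made here rather than part of any actual proposal.

```python
import xml.etree.ElementTree as ET

# The chapter declaration in XML form, completed so that it is well-formed.
DECL = """
<ELEMENT name="chapter">
   <MODEL NOTATION="expr">(title, abstract?, (sec+ | poem+), refs)</MODEL>
   <ATTR name="id" type="ID" status="REQUIRED"/>
   <ATTR name="type" type="NAME" status="IMPLIED"/>
</ELEMENT>
"""

decl = ET.fromstring(DECL)

element_name = decl.get("name")
content_model = decl.findtext("MODEL")
# Map each declared attribute name to its (type, status) pair.
attributes = {attr.get("name"): (attr.get("type"), attr.get("status"))
              for attr in decl.findall("ATTR")}

print(element_name, content_model, attributes)
```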

There is room for some simple improvements here, such as teasing apart the functions of semantic validation of IDs and their syntactic type (which are both crammed into the single token "ID" in SGML DTDs); there is also arguably need for adding the atomic-datatyping capabilities well-known from programming and database systems. This is also not hard, particularly in XML.
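As a sketch of what teasing the two functions apart could look like, the Python below checks the syntactic type of `id` values and their document-wide uniqueness as two independent passes. The name pattern is a deliberately simplified assumption, not the SGML or XML name production, and the sample document is invented.

```python
import re
import xml.etree.ElementTree as ET

# Simplified name syntax (an assumption): a letter, then name characters.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")

def syntax_errors(root, attr="id"):
    """Syntactic check only: values that are not well-formed names."""
    bad = []
    for el in root.iter():
        value = el.get(attr)
        if value is not None and not NAME_RE.fullmatch(value):
            bad.append(value)
    return bad

def uniqueness_errors(root, attr="id"):
    """Semantic check only: values that repeat within the document."""
    seen, duplicates = set(), []
    for el in root.iter():
        value = el.get(attr)
        if value is None:
            continue
        if value in seen:
            duplicates.append(value)
        seen.add(value)
    return duplicates

doc = ET.fromstring(
    '<book><chapter id="c1"/><chapter id="c1"/><chapter id="2nd"/></book>')

bad_syntax = syntax_errors(doc)
repeated = uniqueness_errors(doc)

print(bad_syntax, repeated)
```

With the checks separated, a schema could ask for either one alone, for instance unique values that are not names, or name-typed values that need not be unique.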

The astute reader will have noticed that I have not reduced the content model expression to tags, although it is obviously possible and conceptually trivial. I would not recommend actually doing so, because it seems to me to use the wrong tool for the job: tag structures are designed for dividing up fairly substantial chunks of data (that is, bigger than single tokens!), and for cases where you need to attach property information to many chunks (via attributes). Expression languages, such as content models and the tantalizingly similar regular expressions, are designed as compact, perspicuous ways to express combinatorial and sequence constraints on single characters and tokens. To my mind, that is precisely what we are trying to do in content models, and applying the other tool is inappropriate. I think any proposal for doing this should make a specific case for why, and should show realistic examples side-by-side in both expression form and markup-tree form, so readers can easily compare.
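The kinship between content models and regular expressions can be made concrete with a short Python sketch that translates the chapter model into a regular expression over element-name tokens and then checks sequences of child elements against it. It is a simplification under stated assumptions: it handles only the `,` `|` `?` `+` `*` operators and grouping, not SGML's `&` connector or `#PCDATA`, and the `;` token terminator is an arbitrary choice.

```python
import re

def model_to_regex(model):
    """Translate a content model into a regex over 'name;' tokens."""
    # Wrap every element name as one token group, e.g. title -> (?:title;)
    pattern = re.sub(r"[A-Za-z][\w.-]*",
                     lambda m: "(?:" + re.escape(m.group(0)) + ";)",
                     model)
    # A comma means plain sequence, so it disappears, as does whitespace.
    pattern = re.sub(r"[,\s]", "", pattern)
    return re.compile(pattern)

def is_valid(child_names, model):
    """True if the sequence of child element names satisfies the model."""
    tokens = "".join(name + ";" for name in child_names)
    return model_to_regex(model).fullmatch(tokens) is not None

CHAPTER = "(title, abstract?, (sec+ | poem+), refs)"

results = [
    is_valid(["title", "sec", "sec", "refs"], CHAPTER),
    is_valid(["title", "abstract", "poem", "refs"], CHAPTER),
    is_valid(["title", "sec", "poem", "refs"], CHAPTER),  # mixes sec and poem
    is_valid(["title", "refs"], CHAPTER),                 # no sec or poem
]

print(results)
```

The translation is a dozen lines because the two notations already agree on grouping and occurrence operators; the same model spelled out as a tree of tags would need an element for every token and operator.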
