Perspectives on querying TEI-annotated data

Pre-conference workshop,

TEI Conference and Members Meeting 2013

Date: October 1, 2013 (Tuesday)

Topic and rationale

The TEI provides mechanisms to richly annotate a variety of digital resources used in the Humanities. Many Humanities scholars typically treat annotations as processing instructions, used for visualization or for transformation into other formats. However, a major aim of TEI annotation is to enrich the data with the results of scholarly effort. It is therefore essential to be able to retrieve the various pieces of information efficiently and in a structured way. This, in turn, requires query languages that are accessible and user-friendly, yet at the same time reasonably powerful.

Naturally, XQuery and XSLT provide access to all the information expressed in annotations. However, it should be borne in mind that, despite the warm feeling of power that a good command of XQuery or XSLT offers to the researcher, not everyone is able to exploit their full capacity. Learning either of these Turing-complete programming languages requires an amount of time and devotion that not every scholar or student is able to allocate for this purpose. As with natural languages, one benefits greatly from long-term exposure and repetition. These conditions, however, characterize the work of programmers or IT personnel rather than that of most literary scholars or students, who may benefit greatly from more specialized query languages that sit at least one level of abstraction above XSLT or XQuery and that offer user-friendliness instead of ultimate power and versatility.
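The contrast can be illustrated with a minimal sketch in Python, using only the standard library. The sample document, the function name find_annotated and the keywords "person" and "place" are assumptions made purely for illustration; they do not correspond to any existing TEI query language or to any system presented at the workshop. The first query requires knowledge of the TEI namespace and of path syntax, while the second hides both behind a small vocabulary.

```python
import xml.etree.ElementTree as ET

# The TEI namespace URI; the prefix "tei" is a local choice.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A tiny, invented TEI fragment used only for this illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName>Ada Lovelace</persName> wrote to
       <persName>Charles Babbage</persName> from
       <placeName>London</placeName>.</p>
  </body></text>
</TEI>"""

# Hypothetical mapping from user-facing keywords to TEI element names.
KEYWORDS = {"person": "persName", "place": "placeName"}


def find_annotated(root, kind):
    """Hypothetical simplified query: return the text of every element
    annotated as the given kind ('person' or 'place')."""
    element = KEYWORDS[kind]
    return [
        "".join(node.itertext()).strip()
        for node in root.iterfind(f".//tei:{element}", TEI_NS)
    ]


root = ET.fromstring(SAMPLE)

# General-purpose route: the user must know the namespace and the path syntax.
people_xpath = [
    node.text for node in root.findall(".//{http://www.tei-c.org/ns/1.0}persName")
]

# Simplified route: the same information, requested with a single keyword.
people_simple = find_annotated(root, "person")
places_simple = find_annotated(root, "place")
```

Both routes return the same names here; the difference lies only in how much XML-specific knowledge the person asking the question needs, at the cost of the wrapper covering far fewer cases than a general-purpose language.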

The world of Digital Humanities – arguably the central focus of the TEI – has long since expanded beyond simply working with electronic text in the word processor of the day. DH specialists gather, curate, and query various sorts of textual data, from plain text via semi-structured XML to records in relational databases. The nature of the objects of research varies as well: they come, among others, as single texts with sometimes very complex internal structure, bundles of base documents with hierarchies of annotations and all kinds of interrelationships among them, parallel multilingual data (e.g. original works and their translations) or scattered prosopographic fragments. Much of that can nowadays be wrapped in a TEI envelope.

Given the above issues, it is natural to wonder whether the strategy typically advocated in the work of the TEI Council and often voiced on TEI-L – namely, that TEI data is best handled by general-purpose XML-oriented tools (to which XQuery and XSLT belong) – should carry over to the task of retrieval from richly annotated data, especially if that retrieval is to be made available to an average scholar or student. More precisely, the question is whether it would be better to offer scholars and students a language tied more tightly to the TEI data model, and whether such a query language could address the entire TEI universe of objects in a uniform manner.

Over the last decade, considerable effort has gone into creating efficient and user-friendly query systems within corpus linguistics, but knowledge of them spreads very slowly outside this field. Conversely, corpus linguists are often unaware of the specific issues and needs involved in querying digital texts used outside linguistics.

The workshop therefore aims to build common ground for sharing experiences among researchers dealing with various aspects and forms of TEI-annotated digital text. The presentations will address the impact that experience with querying richly annotated linguistic corpora can have on other fields within Digital Humanities, and will discuss specific TEI-related problems that arise when dealing with queries.

The invited contributions as well as the panel discussion are expected to address, among others, the following range of issues:

query languages and query environments;
queries dealing with a variety of text objects in a variety of TEI-annotated structures;
enhancement of user-friendliness by, e.g., hiding the potential complexity under a simple set of agreed symbols or by the use of a graphical user interface;
a common query language to extend over the range of objects defined by the TEI data model.

This workshop is meant to bring together, on the one hand, corpus linguists and computer scientists, who will present their suggestions or reflections on the possibility of creating a Corpus Query Lingua Franca for Humanists, and, on the other, TEI practitioners themselves, who will present both concrete tasks that combine textual and non-textual data in a novel manner and theoretical challenges that a modern query system for Digital Humanists should tackle.

List of presentations

Peter Bouda (Centro Interdisciplinar de Documentação Linguística e Social) "Querying GrAF data in linguistic analysis" [abstract] [slides-odp]
Øyvind Eide, Vemund Olstad (Unit for Digital Documentation, University of Oslo) "TEI for Interactive Concordances: The New Menota Search System" [abstract] [slides-ppt] [slides-odp]
Serge Heiden (ICAR Research Lab – Lyon University and CNRS, France) "Exploiting TEI-annotated data with TXM" [abstract] [slides-pdf]
Thomas Krause, Carolin Odebrecht, Amir Zeldes, Florian Zipser (Humboldt-Universität zu Berlin) "Unary TEI Elements and the Token Based Corpus" [abstract] [slides-pdf]
Piotr Pęzik (University of Lodz) "Indexed graph databases for querying rich TEI annotation" [abstract] [slides-pdf]
Laurent Romary (INRIA & HUB-IDSL) "Data models and the (blind ?) query of lexical resources" [abstract] [slides-pptx]
Dirk Roorda (DANS) "System for HEBrew Text: ANnotations for Queries and Markup" [abstract] [slides-online]
Thomas Schmidt (IDS Mannheim) "Querying Spoken Language Corpora" [abstract] [slides-pptx]

Panel discussion

Topic: "How could the TEI community benefit from TEI-specific query solutions? What should they look like?"

The invited panelists are: Alexander Geyken, Wendell Piez [zip], Dirk Roorda, and Martin Wynne [transcript].

Organizing Committee

Piotr Bański (1, 2), Marc Kupietz (1), Andreas Witt (1)

1. Institut für Deutsche Sprache, Mannheim

2. Institute of English Studies, University of Warsaw

{banski,kupietz,witt} @ ids-mannheim.de

Thanks!

The organisers would like to thank the speakers, all of whom also acted as one another's non-anonymous peer reviewers and were thus able to influence and improve the final shape of the abstracts and the talks.