Creating extremely large corpora no longer appears to be a challenge. With the constantly growing amount of born-digital text – whether available on the web or only on the servers of publishing companies – and with the increasing number of printed texts digitized by public institutions or technological giants such as Google, we may safely expect the upper limits of text collections to keep increasing for years to come. Although some of this was already true 20 years ago, we have the strong impression that the challenge has now shifted to dealing with large amounts of primary data and much larger amounts of annotation data.
On the one hand, the new challenges demand the discovery of new (corpus-)linguistic methodologies that can make use of extremely large corpora, e.g. in order to investigate rare phenomena involving multiple lexical items, to find and represent fine-grained sub-regularities, or to investigate variation within and across language domains; this involves, for example, new methods for structuring search results (to cope with false positives) and visualization techniques that facilitate the interpretation of results and the abduction of new hypotheses. On the other hand, some fundamental technical methods and strategies call for re-evaluation. These include, for example, the efficient and sustainable curation of the data, the management of collections that span multiple volumes or are distributed across several centres, innovative corpus architectures that maximise the usefulness of the data, and techniques for searching and analysing the data efficiently.
The third edition of CMLC will accompany Corpus Linguistics 2015 in Lancaster, and will be held on the 20th of July 2015. This half-day workshop will gather leading researchers in the fields of language resource creation and corpus linguistics, in order to provide a platform for an intensive exchange of expertise, results and ideas.
Programme
Session A (9:30–10:40)
Introduction
Michal Křen, "Recent Developments in the Czech National Corpus" [slides]
The paper gives an overview of the current status of the Czech National Corpus project. It covers all the important activities carried out within the research infrastructure framework: compilation of a variety of different corpora (most prominently written, spoken, parallel and diachronic), morphological and syntactic annotation, development of tools for internal data processing and workflow management, development of user applications, and provision of user services. Finally, an outline of future plans is presented.
Dan Tufiş, Verginica Barbu Mititelu, Elena Irimia, Stefan Dumitrescu, Tiberiu Boros, Horia Nicolai Teodorescu, "CoRoLa Starts Blooming – An update on the Reference Corpus of Contemporary Romanian Language" [slides]
This article reports on the ongoing CoRoLa project, which aims to create a reference corpus of contemporary Romanian, open for free online use by researchers in linguistics and language processing, teachers of Romanian, and students. We are investing serious effort in persuading owners of IPR on relevant language data to join us and contribute selections of their text and speech repositories to the project. The project is coordinated by two computer science institutes, but benefits from the cooperation and advice of professional linguists. We foresee a corpus of more than 500 million word forms, including about 300 hours of oral texts. The corpus (covering all functional styles of the language) will be pre-processed and annotated at several levels, and documented with standardized metadata.
Poster presentations and coffee break (10:40–11:30)
Piotr Bański, Joachim Bingel, Nils Diewald, Elena Frick, Michael Hanl, Marc Kupietz, Eliza Margaretha, Andreas Witt, "KorAP – an open-source corpus-query platform for the analysis of very large multiply annotated corpora" [poster as PDF]
We present KorAP, the new open-source analysis platform for large corpora, a deliverable of a project concluded in June 2015 at the Institut für Deutsche Sprache in Mannheim. We give an overview of the background of the project, its goals, and the architecture of the system, including the way it handles richly annotated textual data and facilitates the use of virtual collections, as well as the way it implements ISO CQLF (Corpus Query Lingua Franca, a nascent standard of ISO TC37 SC4, for which KorAP provides a reference implementation).
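To make the platform idea concrete, here is a minimal sketch of how a client might query a corpus search web service of this kind over HTTP. The base URL, parameter names and response shape are illustrative assumptions for this sketch, not KorAP's documented API.

    import requests  # third-party HTTP library

    # Hypothetical search endpoint; not KorAP's actual URL.
    BASE_URL = "https://korap.example.org/api/search"

    params = {
        "q": "[tt/pos=ADJA] Korpus",  # query over a part-of-speech annotation layer
        "ql": "poliqarp",             # name of the query language used for "q"
    }

    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    # The response structure below is likewise an assumption.
    for match in response.json().get("matches", []):
        print(match.get("snippet"))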
Hanno Biber, Evelyn Breiteneder, "Large Corpora and Big Data. New Challenges for Corpus Linguistics"
The "AAC – Austrian Academy Corpus" is a German language digital text corpus of more than 500 million tokens. This historical text corpus is annotated in XML formats and constitutes a large text source for research into various linguistic areas. Several of the research questions relevant for corpus linguistics are also determined by latest developments in the fields of big data research so that new challenges for corpus linguistics have to be faced. The AAC has a primary research aim to develop language resources for computational philology and the careful study of texts by making use of corpus research methodologies. Large digital text corpora need to be structured in a systematic way for these purposes. Corpus based digital text studies and similar analytical procedures are among other parameters also determined by the descriptive and visual potential of information representation in various formats. The digital representation systems of linguistic data need to take the specific design issues into account for the processes of creating, generating and analyzing large corpora and related structures of information by transforming and interpreting the language data.
Sebastian Buschjäger, Lukas Pfahler, Katharina Morik, "Discovering Subtle Word Relations in Large German Corpora"
With an increasing amount of text data available, it is possible to automatically extract a variety of information about language. One way to obtain knowledge about subtle relations and analogies between words is to observe words that are used in the same context. Recently, Mikolov et al. proposed a method to efficiently compute Euclidean word representations that seem to capture subtle relations and analogies between words in the English language. We demonstrate that this method also captures analogies in the German language. Furthermore, we show that we can transfer information extracted from large non-annotated corpora into small annotated corpora, which are then, in turn, used for training NLP systems.
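The analogy arithmetic in question can be sketched with the gensim library; the toy corpus below is an assumption standing in for a large German corpus and is far too small to yield meaningful vectors.

    from gensim.models import Word2Vec  # pip install gensim

    # Placeholder corpus: a list of tokenised sentences.
    sentences = [
        ["der", "König", "und", "die", "Königin"],
        ["der", "Mann", "und", "die", "Frau"],
    ]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

    # "König" - "Mann" + "Frau" should land near "Königin" in a model
    # trained on sufficient data.
    print(model.wv.most_similar(positive=["König", "Frau"],
                                negative=["Mann"], topn=3))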
Johannes Graën, Simon Clematide, "Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora"
The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, provided the data can be exploited properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly interconnected data are stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches; queries in the format of generalised treebank query languages are translated automatically into SQL queries, as sketched below.
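A minimal sketch (using SQLite rather than the authors' production database) of how aligned, annotated tokens might be stored relationally; the schema is an illustrative assumption, not the project's actual design.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE token (
            id      INTEGER PRIMARY KEY,
            lang    TEXT NOT NULL,      -- e.g. 'de', 'en'
            sent_id INTEGER NOT NULL,   -- sentence the token belongs to
            form    TEXT NOT NULL,
            pos     TEXT                -- part-of-speech annotation
        );
        CREATE TABLE alignment (
            src_id INTEGER REFERENCES token(id),
            trg_id INTEGER REFERENCES token(id)
        );
    """)
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?)", [
        (1, "de", 1, "Haus", "NN"),
        (2, "en", 1, "house", "NN"),
    ])
    conn.execute("INSERT INTO alignment VALUES (1, 2)")

    # A treebank-style query such as "German nouns and their English
    # counterparts" then translates into a plain SQL join:
    print(conn.execute("""
        SELECT s.form, t.form
        FROM alignment a
        JOIN token s ON s.id = a.src_id AND s.lang = 'de' AND s.pos = 'NN'
        JOIN token t ON t.id = a.trg_id AND t.lang = 'en'
    """).fetchall())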
Session B (11:30–13:00)
Stefan Evert, Andrew Hardie, "Ziggurat: A new data model and indexing format for large annotated text corpora"
The IMS Open Corpus Workbench (CWB) software currently uses a simple tabular data model with proven limitations. We outline and justify the need for a new data model to underlie the next major version of CWB. This data model, dubbed Ziggurat, defines a series of data-layer types to represent different structures and relations within an annotated corpus; each such layer may contain variables of different types. Ziggurat will allow us to gradually extend and enhance CWB's existing CQP syntax for corpus queries, and will also make possible more radical departures, relative not only to the current version of CWB but also to other contemporary corpus-analysis software.
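An illustrative sketch of such a layered corpus data model, in the spirit of the Ziggurat description above; the class and field names here are assumptions made for this example, not the Ziggurat specification.

    from dataclasses import dataclass, field

    @dataclass
    class Layer:
        name: str                    # e.g. "token", "sentence", "dependency"
        variables: dict = field(default_factory=dict)  # variable -> values

    @dataclass
    class Corpus:
        layers: dict = field(default_factory=dict)

        def add_layer(self, layer: "Layer") -> None:
            self.layers[layer.name] = layer

    corpus = Corpus()
    # A primary layer holding token-level variables of different types ...
    corpus.add_layer(Layer("token", {"form": ["Large", "corpora"],
                                     "pos": ["ADJ", "NOUN"]}))
    # ... and a structural layer whose spans point into the token layer.
    corpus.add_layer(Layer("sentence", {"span": [(0, 1)]}))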
Roland Schäfer, "Processing and querying large web corpora with the COW14 architecture" [slides]
In this paper, I present the COW14 tool chain, which comprises the web corpus creation tool texrex, wrappers for existing linguistic annotation tools, and the online query software Colibri2. Through detailed descriptions of the implementation and systematic evaluations of the software's performance on different types of systems, I show that the COW14 architecture is capable of handling the creation of corpora of at least 100 billion tokens. I also introduce our running demo system, which currently serves corpora of up to roughly 20 billion tokens in Dutch, English, French, German, Spanish, and Swedish.
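As a generic illustration of one typical step in web-corpus construction, the sketch below removes exact duplicate documents via content hashing; it merely exemplifies the kind of processing such a tool chain performs and is not texrex's actual algorithm.

    import hashlib

    def deduplicate(documents):
        """Yield each document once, skipping exact repeats."""
        seen = set()
        for doc in documents:
            digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc

    docs = ["Ein Beispieltext.", "Ein Beispieltext.", "Ein anderer Text."]
    print(list(deduplicate(docs)))  # two unique documents remain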
Jochen Tiepmar, "Release of the MySQL-based implementation of the CTS protocol" [slides]
In the project "A Library of a Billion Words", we needed an implementation of the CTS protocol capable of handling a text collection containing at least one billion words. Because the existing solutions did not work at this scale or were still in development, I started an implementation of the CTS protocol using methods that MySQL provides. Last year we published a paper that introduced a prototype with the core functionalities, but one not yet compliant with the CTS specifications. The purpose of this paper is to describe and evaluate the MySQL-based implementation now that it fulfils version 5.0 rc.1 of the specifications, and to mark it as finished and ready to use.
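For readers unfamiliar with CTS (Canonical Text Services), here is a sketch of how a client addresses a passage under the protocol: texts and passages are identified by CTS URNs, and requests are plain HTTP GET parameters. The base URL is a placeholder; the request and parameter names follow the CTS 5.0 specification.

    from urllib.parse import urlencode

    BASE_URL = "https://cts.example.org/api/cts"  # placeholder endpoint

    # Iliad, Book 1, Line 1 -- a canonical CTS URN example
    urn = "urn:cts:greekLit:tlg0012.tlg001:1.1"

    request_url = BASE_URL + "?" + urlencode({
        "request": "GetPassage",  # others include GetCapabilities, GetValidReff
        "urn": urn,
    })
    print(request_url)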
Closing remarks
Organizing Committee
Institut für Deutsche Sprache, Mannheim
Piotr Bański, Marc Kupietz, Harald Lüngen, Andreas Witt
Institute for Corpus Linguistics and Text Technology, Vienna
Hanno Biber, Evelyn Breiteneder
Programme Committee
- Damir Ćavar (Indiana University, Bloomington)
- Isabella Chiari (Sapienza University of Rome)
- Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
- Václav Cvrček (Charles University Prague)
- Mark Davies (Brigham Young University)
- Tomaž Erjavec (Jožef Stefan Institute)
- Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
- Andrew Hardie (Lancaster University)
- Serge Heiden (ENS de Lyon)
- Nancy Ide (Vassar College)
- Miloš Jakubíček (Lexical Computing Ltd.)
- Adam Kilgarriff (Lexical Computing Ltd.)
- Krister Lindén (University of Helsinki)
- Martin Mueller (Northwestern University)
- Nelleke Oostdijk (Radboud University Nijmegen)
- Christian-Emil Smith Ore (University of Oslo)
- Piotr Pęzik (University of Łódź)
- Uwe Quasthoff (Leipzig University)
- Paul Rayson (Lancaster University)
- Laurent Romary (INRIA, DARIAH)
- Roland Schäfer (FU Berlin)
- Serge Sharoff (University of Leeds)
- Mária Simková (Slovak Academy of Sciences)
- Jörg Tiedemann (Uppsala University)
- Dan Tufiş (Romanian Academy, Bucharest)
- Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)
Homepage
The CMLC homepage is located at http://corpora.ids-mannheim.de/cmlc.html