Challenges in the management of large corpora (CMLC-2)

Workshop in conjunction with LREC 2014 

26-31 May, Reykjavik

Workshop date: Saturday, 31 May 2014 (afternoon)

Workshop programme

14:00 – 14:10 – Introduction

14:10 – 14:30

Marc Kupietz, Harald Lüngen, Piotr Bański and Cyril Belica,

Maximizing the Potential of Very Large Corpora: 50 Years of Big Language Data at IDS Mannheim

14:30 – 15:00

Adam Kilgarriff, Pavel Rychlý and Miloš Jakubíček,

Effective Corpus Virtualization

15:00 – 15:30

Dirk Goldhahn, Steffen Remus, Uwe Quasthoff and Chris Biemann,

Top-Level Domain Crawling for Producing Comprehensive Monolingual Corpora from the Web

15:30 – 16:00

Vincent Vandeghinste and Liesbeth Augustinus,

Making a large treebank searchable online. The SONAR case.

16:00 – 16:30 Coffee break

16:30 – 17:00

John Vidler, Andrew Scott, Paul Rayson, John Mariani and Laurence Anthony,

Dealing With Big Data Outside Of The Cloud: GPU Accelerated Sort

17:00 – 17:30

Jordi Porta,

From several hundred million words to near one thousand million words: Scaling up a corpus indexer and a search engine with MapReduce

17:30 – 17:50

Hanno Biber and Evelyn Breiteneder,

Text Corpora for Text Studies. About the foundations of the AAC-Austrian Academy Corpus

17:50 – 18:00 – Closing remarks

Workshop description

We live in an age where the well-known maxim that “the only thing better than data is more data” no longer sets unattainable goals. Creating extremely large corpora is no longer a challenge, given the proven methods behind, e.g., the Web-as-Corpus approach or the use of Google's n-gram collection. Instead, the challenge has shifted towards dealing with large amounts of primary data and much larger amounts of annotation data. On the one hand, this challenge concerns finding new (corpus-)linguistic methodologies that can make use of such extremely large corpora, e.g. in order to investigate rare phenomena involving multiple lexical items, to find and represent fine-grained sub-regularities, or to investigate variations within and across language domains. On the other hand, some fundamental technical methods and strategies are being called into question. These include, e.g., successful curation of data, management of collections that span multiple volumes or that are distributed across several centres, methods to clean the data of non-linguistic intrusions or duplicates, as well as automatic annotation methods and innovative corpus architectures that maximise the usefulness of data or allow it to be searched and analysed efficiently. Among the new tasks are also collaborative manual annotation and methods to manage it, as well as new challenges for the statistical analysis of such data and metadata.

Motivation and Topics of interest

The second LREC workshop on “Challenges in the management of large corpora” aims to gather leading researchers in the fields of Language Resource creation and Corpus Linguistics for an intensive exchange of expertise, results and ideas. In accordance with this LREC’s hot topic, “Big Data”, contributions concerned with national corpora, reference corpora and other very large corpora are particularly welcome.

The half-day workshop will be wrapped up with a discussion of common challenges, ideas for possible solutions and potential co-operation.

We invite submissions dealing with:

Venue

The half-day workshop will take place at the Conference venue, the Harpa Conference Centre, in the afternoon session of Saturday, 31 May 2014.

Organizing Committee

The workshop is co-organized by the following institutions:

Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Harald Lüngen, Andreas Witt

Institute for Corpus Linguistics and Text Technology, Vienna

Evelyn Breiteneder, Hanno Biber, Karlheinz Mörth

Programme committee:

Workshop homepage: http://corpora.ids-mannheim.de/cmlc.html