CMLC-2014

Challenges in the management of large corpora (CMLC-2)

Workshop in conjunction with LREC 2014

26-31 May, Reykjavik

Workshop date: Saturday, 31 May 2014 (afternoon)

Workshop programme

14:00–14:10 Introduction

14:10–14:30

Marc Kupietz, Harald Lüngen, Piotr Bański and Cyril Belica,

Maximizing the Potential of Very Large Corpora: 50 Years of Big Language Data at IDS Mannheim

14:30–15:00

Adam Kilgarriff, Pavel Rychlý and Miloš Jakubíček,

Effective Corpus Virtualization

15:00–15:30

Dirk Goldhahn, Steffen Remus, Uwe Quasthoff and Chris Biemann,

Top-Level Domain Crawling for Producing Comprehensive Monolingual Corpora from the Web

15:30–16:00

Vincent Vandeghinste and Liesbeth Augustinus,

Making a large treebank searchable online. The SONAR case.

16:00–16:30 Coffee break

16:30–17:00

John Vidler, Andrew Scott, Paul Rayson, John Mariani and Laurence Anthony,

Dealing With Big Data Outside Of The Cloud: GPU Accelerated Sort

17:00–17:30

Jordi Porta,

From several hundred million words to near one thousand million words: Scaling up a corpus indexer and a search engine with MapReduce

17:30–17:50

Hanno Biber and Evelyn Breiteneder,

Text Corpora for Text Studies. About the foundations of the AAC-Austrian Academy Corpus

17:50–18:00 Piotr Bański, Closing remarks

Workshop description

We live in an age in which the well-known maxim that “the only thing better than data is more data” no longer sets unattainable goals. Creating extremely large corpora is no longer a challenge, given the proven methods behind, e.g., the Web-as-Corpus approach or the use of Google's n-gram collection. Indeed, the challenge has now shifted towards dealing with large amounts of primary data and much larger amounts of annotation data. On the one hand, this challenge concerns finding new (corpus-)linguistic methodologies that can make use of such extremely large corpora, e.g. in order to investigate rare phenomena involving multiple lexical items, to find and represent fine-grained sub-regularities, or to investigate variations within and across language domains. On the other hand, some fundamental technical methods and strategies are being called into question. These include, e.g., successful curation of data, management of collections that span multiple volumes or that are distributed across several centres, methods to clean the data of non-linguistic intrusions or duplicates, as well as automatic annotation methods and innovative corpus architectures that maximise the usefulness of data or allow it to be searched and analysed efficiently. Among the new tasks are also collaborative manual annotation and methods to manage it, as well as new challenges for the statistical analysis of such data and metadata.

Motivation and Topics of interest

The second LREC workshop on “Challenges in the management of large corpora” aims to gather the leading researchers in the fields of Language Resource creation and Corpus Linguistics for an intensive exchange of expertise, results and ideas. In line with this LREC’s hot topic, “Big Data”, contributions concerned with national corpora, reference corpora and other very large corpora are particularly welcome.

The half-day workshop will be wrapped up with a discussion of the common challenges, ideas for possible solutions, and potential cooperation.

We invite submissions dealing with:

tools for all aspects of management of very large corpora,
evaluation and investigation of the properties of large corpora,
system and database architectures for very large semi-structured data sets,
heavily annotated corpora,
managing multiple and concurrent annotation layers,
use of annotation standards for large data sets,
issues of interoperability and tool-chaining,
crowdsourcing for large data sets,
quality control of annotations in large data sets,
dealing with corpora physically distributed over different locations,
efficient and scalable user interfaces,
effective querying of large corpora with multiple annotation layers,
“put the computation near the data” as a strategy for dealing with IPR restrictions,
open-source software and open-data corpora strategies,
other issues that arise in the context of management of large datasets.

Venue

The half-day workshop will take place at the conference venue, the Harpa Conference Centre, in the afternoon session of Saturday, 31 May 2014.

Organizing Committee

The workshop is co-organized by the following institutions:

Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Harald Lüngen, Andreas Witt

Institute for Corpus Linguistics and Text Technology, Vienna

Evelyn Breiteneder, Hanno Biber, Karlheinz Mörth

Programme committee:

Lars Borin (University of Gothenburg)
Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
Václav Cvrček (Charles University Prague)
Mark Davies (Brigham Young University)
Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
Andrew Hardie (University of Lancaster)
Nancy Ide (Vassar College)
Miloš Jakubíček (Lexical Computing Ltd.)
Adam Kilgarriff (Lexical Computing Ltd.)
Krister Lindén (University of Helsinki)
Jean-Luc Minel (Université Paris Ouest Nanterre La Défense)
Christian Emil Ore (University of Oslo)
Adam Przepiórkowski (Polish Academy of Sciences)
Uwe Quasthoff (Leipzig University)
Pavel Rychlý (Masaryk University Brno)
Roland Schäfer (FU Berlin)
Marko Tadić (University of Zagreb)
Dan Tufiş (Romanian Academy, Bucharest)
Tamás Váradi (Hungarian Academy of Sciences, Budapest)

Workshop homepage: http://corpora.ids-mannheim.de/cmlc.html