Challenges in the management of large corpora (CMLC-2)
Workshop in conjunction with LREC 2014
26-31 May, Reykjavik
Workshop date: Saturday, 31 May 2014 (afternoon)
Marc Kupietz, Harald Lüngen, Piotr Bański and Cyril Belica,
Maximizing the Potential of Very Large Corpora: 50 Years of Big Language Data at IDS Mannheim
Adam Kilgarriff, Pavel Rychlý and Miloš Jakubíček,
Effective Corpus Virtualization
Dirk Goldhahn, Steffen Remus, Uwe Quasthoff and Chris Biemann,
Top-Level Domain Crawling for Producing Comprehensive Monolingual Corpora from the Web
Vincent Vandeghinste and Liesbeth Augustinus,
Making a large treebank searchable online. The SONAR case.
16:00–16:30 Coffee break
John Vidler, Andrew Scott, Paul Rayson, John Mariani and Laurence Anthony,
Dealing With Big Data Outside Of The Cloud: GPU Accelerated Sort
From several hundred million words to near one thousand million words: Scaling up a corpus indexer and a search engine with MapReduce
Hanno Biber and Evelyn Breiteneder,
Text Corpora for Text Studies. About the foundations of the AAC-Austrian Academy Corpus
17:50–18:00 Piotr Bański, Closing remarks
We live in an age where the well-known maxim that “the only thing better than data is more data” no longer sets unattainable goals. Creating extremely large corpora is no longer a challenge, given the proven methods behind, e.g., the Web-as-Corpus approach or the use of Google's n-gram collection. Indeed, the challenge has now shifted towards dealing with large amounts of primary data and much larger amounts of annotation data. On the one hand, this challenge concerns finding new (corpus-)linguistic methodologies that can make use of such extremely large corpora, e.g. in order to investigate rare phenomena involving multiple lexical items, to find and represent fine-grained sub-regularities, or to investigate variations within and across language domains; on the other hand, some fundamental technical methods and strategies are being called into question. These include, e.g., successful curation of data, management of collections that span multiple volumes or that are distributed across several centres, methods to clean the data of non-linguistic intrusions or duplicates, as well as automatic annotation methods and innovative corpus architectures that maximise the usefulness of data or allow it to be searched and analysed efficiently. Among the new tasks are also collaborative manual annotation and methods to manage it, as well as new challenges for the statistical analysis of such data and metadata.
The second LREC workshop on “Challenges in the management of large corpora” aims to gather the leading researchers in the fields of Language Resource creation and Corpus Linguistics, in order to enable an intensive exchange of expertise, results and ideas. In accordance with this year's LREC hot topic, “Big Data”, contributions concerned with national corpora, reference corpora and other very large corpora are particularly welcome.
The half-day workshop will be wrapped up with a discussion of common challenges, ideas for possible solutions and potential cooperation.
We invite submissions dealing with:
The half-day workshop will take place at the conference venue, the Harpa Conference Centre, in the afternoon session of Saturday, 31 May 2014.
The workshop is co-organized by the following institutions:
Piotr Bański, Marc Kupietz, Harald Lüngen, Andreas Witt
Evelyn Breiteneder, Hanno Biber, Karlheinz Mörth