We live in an age in which the well-known maxim that “the only thing better than data is more data” no longer sets unattainable goals. Creating extremely large corpora is no longer a challenge in itself, thanks to proven methods such as the Web-as-Corpus approach or the use of Google's n-gram collection. The challenge has instead shifted towards handling the large amounts of primary data and the much larger amounts of annotation data. On the one hand, this calls for new (corpus-)linguistic methodologies that can exploit such extremely large corpora, e.g. to investigate rare phenomena involving multiple lexical items or to find and represent fine-grained sub-regularities; on the other hand, it calls some fundamental technical methods and strategies into question. These include the curation of the data, the management of collections that span multiple volumes or are distributed across several centres, methods for cleaning the data of non-linguistic intrusions and duplicates, as well as automatic annotation methods and innovative corpus architectures that maximise the usefulness of the data and allow it to be searched and analysed efficiently. Among the new tasks are also collaborative manual annotation and methods to manage it, as well as new challenges for the statistical analysis of such data and its metadata.
The half-day workshop on “Challenges in the management of large corpora” aims to bring together leading researchers in the fields of Language Resource creation and Corpus Linguistics for an intensive exchange of expertise, results and ideas.
The proceedings volume is available as:
14.00 – Opening
14.03 – 14.30 Keynote talk: Nancy Ide, Big, Clean, and Comprehensive – but is it Worth it?
14.30 – 15.00 Lars Bungum and Björn Gambäck, Efficient N-gram Language Modeling for Billion Word Web-Corpora
15.00 – 15.30 Hans Martin Lehmann and Gerold Schneider, Dependency Bank
15.30 – 16.00 Roman Schneider, Evaluating DBMS-based access strategies to very large multi-layer corpora
16.00 – 16.30 Coffee break
16.30 – 17.00 Hanno Biber and Evelyn Breiteneder, The AAC Container. Managing Text Resources for Text Studies
17.00 – 17.30 Damir Ćavar, Helen Aristar-Dry and Anthony Aristar, Large Mailing List Corpora: Management, Annotation and Repository
17.30 – 18.00 Ritesh Kumar, Pinkey Nainwani, Girish Nath Jha and Shiv Bhusan Kaushik, Creating and managing large annotated parallel corpora of Indian languages
18.00 – 18.30 Nelleke Oostdijk and Henk van den Heuvel, Introducing the CLARIN-NL Data Curation Service
18.30 – 19.00 Final discussion
The workshop will take place at the conference venue, the Lütfi Kirdar Istanbul Exhibition and Congress Centre. Further details will be available in due course from the conference homepage.
The workshop is co-organized by teams from the following three institutions:
Piotr Bański, Marc Kupietz, Andreas Witt
Helen Aristar-Dry, Anthony Aristar, Damir Ćavar
Serge Heiden
Programme committee:
Núria Bel (Universitat Pompeu Fabra)
Mark Davies (Brigham Young University)
Stefanie Dipper (Ruhr-Universität Bochum)
Tomaž Erjavec (Jožef Stefan Institute)
Stefan Evert (Technische Universität Darmstadt)
Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
Andrew Hardie (University of Lancaster)
Nancy Ide (Vassar College)
Sandra Kübler (Indiana University)
Martin Mueller (Northwestern University)
Mark Olsen (University of Chicago)
Adam Przepiórkowski (Polish Academy of Sciences, University of Warsaw)
Reinhard Rapp (Johannes Gutenberg-Universität Mainz, University of Leeds)
Laurent Romary (INRIA, Humboldt-Universität zu Berlin)
Pavel Straňák (Charles University in Prague)
Amir Zeldes (Humboldt-Universität zu Berlin)
Workshop homepage: http://corpora.ids-mannheim.de/cmlc.html