Creating extremely large corpora no longer appears to be a challenge. With the constantly growing amount of born-digital text – be it available on the web or only on the servers of publishing companies – and with the increasing number of printed texts digitized by public institutions or technological giants such as Google, we may safely expect the upper limits of text collections to keep increasing for years to come. Although some of this was already true 20 years ago, we have the strong impression that the challenge has now shifted to dealing with large amounts of primary data and much larger amounts of annotation data.
On the one hand, the new challenges demand the discovery of new (corpus-)linguistic methodologies that can make use of extremely large corpora, e.g. in order to investigate rare phenomena involving multiple lexical items, to find and represent fine-grained sub-regularities, or to investigate variations within and across language domains. This involves, for example, new methods for structuring search results (to cope with false positives) and visualization techniques that facilitate the interpretation of results or the abduction of new hypotheses. On the other hand, some fundamental technical methods and strategies call for re-evaluation. These include, for example, efficient and sustainable curation of the data, management of collections that span multiple volumes or that are distributed across several centres, innovative corpus architectures that maximise the usefulness of data, and techniques that allow the data to be searched and analyzed efficiently.
The third edition of CMLC will accompany Corpus Linguistics 2015 in Lancaster, and will be held on the 20th of July 2015. This half-day workshop will gather the leading researchers in the field of Language Resource creation and Corpus Linguistics, in order to provide a platform for an intensive exchange of expertise, results and ideas, in particular concerning the following topics:
- recent developments in ongoing web-as-corpus initiatives, national corpora, reference corpora, and other very large corpora
- evaluation and investigation of the properties of large corpora
- extraction, representation, and management of metadata
- virtualization / techniques for drawing and accessing stratified virtual corpora
- increasing the coverage of underrepresented strata
- legal issues including license models and license management
- acquisition and curation of large text archives from third parties
- legal and technological issues of corpora physically distributed over different locations
- system- and database architectures for very large semi-structured data sets
- heavily annotated corpora
- use of annotation standards for large data sets
- issues of interoperability and tool chaining
- interfaces for user-provided annotations
- quality control of annotations in large data sets
- efficient and scalable user interfaces
- effective querying of large corpora with multiple annotation layers
- effective techniques for analyzing corpus data
- strategies and techniques for maximizing recall and coping with large numbers of false positives
- visualization and other techniques that facilitate the linking between quantitative investigations and qualitative interpretations
- “put the computation near the data” as a strategy for dealing with IPR restrictions
- open-source software and open-data corpora strategies
- other issues that arise in the context of management of large datasets.
We invite extended abstracts (up to 4 pages standard size, references excluded) addressing some of the topics listed above.
A volume of proceedings is planned.
The home page of CMLC events is located at http://corpora.ids-mannheim.de/cmlc.html.
- 22 March 2015: deadline for extended abstract proposals
- 10 April 2015: notifications of acceptance
- 20 July 2015: CMLC-3
Abstract submission details
We invite extended anonymized abstracts (up to 4 pages standard size, references excluded) addressing some of the topics listed above.
Submission address: http://linguistlist.org/easyabs/cmlc-2015
If you experience difficulties with the submission system or need extra information, kindly contact us at banski@ids-mannheim.de.
Institut für Deutsche Sprache, Mannheim
Piotr Bański, Marc Kupietz, Harald Lüngen, Andreas Witt
Institute for Corpus Linguistics and Text Technology, Vienna
Hanno Biber, Evelyn Breiteneder
- Damir Ćavar (Indiana University, Bloomington)
- Isabella Chiari (Sapienza University of Rome)
- Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
- Václav Cvrček (Charles University Prague)
- Mark Davies (Brigham Young University)
- Tomaž Erjavec (Jožef Stefan Institute)
- Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
- Andrew Hardie (Lancaster University)
- Serge Heiden (ENS de Lyon)
- Nancy Ide (Vassar College)
- Miloš Jakubíček (Lexical Computing Ltd.)
- Adam Kilgarriff (Lexical Computing Ltd.)
- Krister Lindén (University of Helsinki)
- Martin Mueller (Northwestern University)
- Nelleke Oostdijk (Radboud University Nijmegen)
- Christian-Emil Smith Ore (University of Oslo)
- Piotr Pęzik (University of Łódź)
- Uwe Quasthoff (Leipzig University)
- Paul Rayson (Lancaster University)
- Laurent Romary (INRIA, DARIAH)
- Roland Schäfer (FU Berlin)
- Serge Sharoff (University of Leeds)
- Mária Simková (Slovak Academy of Sciences)
- Jörg Tiedemann (Uppsala University)
- Dan Tufiş (Romanian Academy, Bucharest)
- Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)