Creating extremely large corpora no longer appears to be a challenge. With the constantly growing amount of born-digital text – be it available on the web or only on the servers of publishing companies – and with the increasing number of printed texts digitized by public institutions or technological giants such as Google, we may safely expect the upper limits of text collections to keep increasing for years to come. Although some of this was already true 20 years ago, we have the strong impression that the challenge has now shifted to dealing with large amounts of primary data and much larger amounts of annotation data.
On the one hand, the new challenges demand new (corpus-)linguistic methodologies that can make use of extremely large corpora, e.g. to investigate rare phenomena involving multiple lexical items, to find and represent fine-grained sub-regularities, or to investigate variation within and across language domains. This may involve, for example, new methods for structuring search results (to cope with false positives) or visualization techniques that facilitate the interpretation of results or the abduction of new hypotheses. On the other hand, some fundamental technical methods and strategies call for re-evaluation. These include, e.g., efficient and sustainable curation of the data, management of collections that span multiple volumes or that are distributed across several centres, innovative corpus architectures that maximise the usefulness of data, and techniques that allow the data to be searched and analysed efficiently.
The third edition of CMLC will accompany Corpus Linguistics 2015 in Lancaster and will be held on 20 July 2015. This half-day workshop will gather leading researchers in the fields of language resource creation and corpus linguistics in order to provide a platform for an intensive exchange of expertise, results and ideas.
- Stefan Evert, Andrew Hardie, "A new data model and indexing format for large annotated text corpora"
- Michal Křen, "Recent Developments in the Czech National Corpus"
- Roland Schäfer, "Processing and querying large web corpora with the COW14 architecture"
- Jochen Tiepmar, "Release of the MySQL based CTS Implementation"
- Dan Tufiş, Verginica Barbu Mititelu, Elena Irimia, Stefan Dumitrescu, Tiberiu Boros, Horia Nicolai Teodorescu, "CoRoLa Starts Blooming – An update on the Reference Corpus of Contemporary Romanian Language"
- Piotr Bański, Joachim Bingel, Niels Diewald, Elena Frick, Michael Hanl, Marc Kupietz, Andreas Witt, "KorAP – an open-source corpus-query platform for the analysis of very large multiply annotated corpora"
- Hanno Biber, Evelyn Breiteneder, "Large Corpora and Big Data. New Challenges for Corpus Linguistics"
- Sebastian Buschjäger, Lukas Pfahler, Katharina Morik, "Discovering Subtle Word Relations in Large German Corpora"
- Johannes Graën, Simon Clematide, "Challenges in the Alignment, Management and Exploitation of Large and Richly Annotated Multi-Parallel Corpora"
Institut für Deutsche Sprache, Mannheim
Piotr Bański, Marc Kupietz, Harald Lüngen, Andreas Witt
Institute for Corpus Linguistics and Text Technology, Vienna
Hanno Biber, Evelyn Breiteneder
- Damir Ćavar (Indiana University, Bloomington)
- Isabella Chiari (Sapienza University of Rome)
- Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
- Václav Cvrček (Charles University Prague)
- Mark Davies (Brigham Young University)
- Tomaž Erjavec (Jožef Stefan Institute)
- Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
- Andrew Hardie (Lancaster University)
- Serge Heiden (ENS de Lyon)
- Nancy Ide (Vassar College)
- Miloš Jakubíček (Lexical Computing Ltd.)
- Adam Kilgarriff (Lexical Computing Ltd.)
- Krister Lindén (University of Helsinki)
- Martin Mueller (Northwestern University)
- Nelleke Oostdijk (Radboud University Nijmegen)
- Christian-Emil Smith Ore (University of Oslo)
- Piotr Pęzik (University of Łódź)
- Uwe Quasthoff (Leipzig University)
- Paul Rayson (Lancaster University)
- Laurent Romary (INRIA, DARIAH)
- Roland Schäfer (FU Berlin)
- Serge Sharoff (University of Leeds)
- Mária Simková (Slovak Academy of Sciences)
- Jörg Tiedemann (Uppsala University)
- Dan Tufiş (Romanian Academy, Bucharest)
- Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)