Creating very large corpora no longer appears to be a challenge. With the constantly growing amount of born-digital text – be it available on the web or only on the servers of publishing companies – and with the rising number of printed texts digitized by public institutions or technological giants such as Google, we may safely expect the upper limits of text collections to keep increasing for years to come. Although some of this was already true 20 years ago, we have a strong impression that the challenge has now shifted from an increase in terms of size to the effective and efficient processing of the large amounts of primary data and much larger amounts of annotation data.
On the one hand, some fundamental technical methods and strategies call for re-evaluation. These include, for example, efficient and sustainable curation of data, management of collections that span multiple volumes or that are distributed across several centres, innovative corpus architectures that maximize the usefulness of data, and techniques that allow for efficient search and analysis.
On the other hand, the new challenges require research into language-modelling methods and new corpus-linguistic methodologies that can make use of extremely large, structured datasets. These methodologies must re-address the tasks of investigating rare phenomena involving multiple lexical items, of finding and representing fine-grained sub-regularities, and of investigating variations within and across language domains. This should be accompanied by new methods to structure both content and search results, in order to, among other things, cope with false positives, assess data quality, or ensure interoperability. Also much needed are visualization techniques that facilitate the interpretation of results and the formulation of new hypotheses.
Due to the interest that the first CMLC meeting (at LREC 2012 in Istanbul) enjoyed, the workshop has become a recurring event. The second meeting took place at LREC again, in 2014 in Reykjavík; the third edition of CMLC was part of Corpus Linguistics 2015 in Lancaster. The fourth meeting took place in Portorož, Slovenia, as part of LREC 2016.
Topics of interest
CMLC aims at gathering experts in corpus linguistics as well as in language resource creation and curation, in order to provide for an intensive exchange of expertise, results and ideas. Some of the topics that this mixed community has found particularly interesting are listed below.
- recent developments in ongoing web-as-corpus initiatives, national corpora, reference corpora, and other very large corpora
- evaluation and investigation of the properties of large corpora
- extraction, representation, and management of metadata
- virtualization / techniques for drawing and accessing stratified virtual corpora
- increasing the coverage of underrepresented strata
- legal issues including license models and license management
- acquisition and curation of large text archives from third parties
- legal and technological issues of corpora physically distributed over different locations
- system and database architectures for very large semi-structured data sets
- heavily annotated corpora
- challenges of large multiparallel corpora
- use of annotation standards for large data sets
- issues of interoperability and tool chaining
- interfaces for user-provided annotations
- quality control of annotations in large data sets
- efficient and scalable user interfaces
- effective querying of large corpora with multiple annotation layers
- effective techniques for analyzing corpus data
- strategies and techniques for maximizing recall and coping with large numbers of false positives
- visualization and other techniques that facilitate the linking between quantitative investigations and qualitative interpretations
- “put the computation near the data” as a strategy for dealing with IPR restrictions
- open-source software and open-data corpora strategies
- other issues that arise in the context of management of large datasets
Identify, Describe and Share your LRs!
Describing your LRs in the LRE Map is now normal practice in the LREC submission procedure (introduced in 2010 and since adopted by other conferences). To continue the efforts on “Sharing LRs” (data, tools, web-services, etc.) initiated at LREC 2014, authors will have the possibility, when submitting a paper, to upload LRs to a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new “regular” feature for conferences in our field, thus contributing to the creation of a common repository where everyone can deposit and share data.
As scientific work requires accurate citations of referenced work, so that the community can understand the whole context and replicate the experiments conducted by other researchers, LREC 2016 endorses the need to uniquely identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.
The CMLC homepage is located at http://corpora.ids-mannheim.de/cmlc.html