Challenges in the Management of Large Corpora

Last updated: 2023-05-09

General information

Creating very large corpora no longer appears to be a challenge. With the constantly growing amount of born-digital text – be it available on the web or only on the servers of publishing companies – and with the rising number of printed texts digitised by public institutions or technological giants such as Google, we may safely expect the upper limits of text collections to keep increasing for years to come. Although some of this was already true 20 years ago, we have a strong impression that the challenge has now shifted from increasing the size of corpora to processing large amounts of primary data, and much larger amounts of annotation data, effectively and efficiently.

On the one hand, some fundamental technical methods and strategies call for re-evaluation. These include, for example, efficient and sustainable curation of data, management of collections that span multiple volumes or that are distributed across several centres, innovative corpus architectures that maximise the usefulness of data, and techniques that allow for efficient search and analysis.

On the other hand, the new challenges require research into language-modelling methods and new corpus-linguistic methodologies that can make use of extremely large, semi-structured datasets. These methodologies must re-address the tasks of investigating rare phenomena involving multiple lexical items, of finding and representing fine-grained sub-regularities, and of investigating variations within and across language domains. This should be accompanied by new methods to structure both content and search results, in order to, among other things, cope with false positives, assess data quality, or ensure interoperability. Visualisation techniques that facilitate the interpretation of results and the formulation of new hypotheses are another much-needed research goal.

Owing to the interest that the first CMLC meeting (held at LREC-2012 in Istanbul) enjoyed, the workshop became a recurring event. The second meeting took place at LREC again, in 2014 in Reykjavík; the third edition of CMLC was part of Corpus Linguistics 2015 in Lancaster. The fourth meeting took place in Portorož, Slovenia, as part of LREC-2016. CMLC-5 was combined with BigNLP-2017 and took place as part of the Corpus Linguistics conference in Birmingham. The sixth meeting took us to Japan (LREC-2018 in Miyazaki), and the seventh to Wales (CL 2019 in Cardiff). Due to the COVID-19 pandemic, the eighth event, scheduled to be co-located with LREC-2020 in Marseille, shared the fate of the conference and was cancelled at the post-review stage; we nevertheless chose to keep the event numbering for the sake of the proceedings volume. The ninth meeting, at CL 2021, organised by the University of Limerick, was fully virtual. The 10th meeting was held in hybrid mode, physically anchored in Marseille, at LREC-2022.

Topics of interest

CMLC aims to bring together experts in corpus linguistics and in language resource creation and curation, to enable an intensive exchange of expertise, results and ideas. Some of the topics that this mixed community has found particularly interesting are listed below.

Technical issues:
- Storage and retrieval solutions for big textual data corpora: primary data, metadata, and annotation data
- Scalable and efficient NLP tooling for annotating and analysing large datasets: distributed and GPGPU computing; using big data analysis frameworks (Hadoop, Spark, etc.) for language processing
- Dealing with streaming data (e.g. in social media) and rapidly changing corpora
Licensing, legal and privacy issues:
- Licensing models of open and closed data
- Coping with intellectual property restrictions
Linguistic content issues:
- Dealing with the variety of language: multilinguality, historical texts, user-generated content, etc.
- Integration of human computation (crowdsourcing) and automatic annotation
- Quality management of annotations
Exploitation issues:
- Query languages
- Analysis of very large corpora
- Innovative approaches to aggregation and visualisation of text analytics

Meetings and proceedings

The language of our meetings and proceedings is English.

CMLC-1: 22nd of May, 2012, in Istanbul (proceedings, gallery)
CMLC-2: 31st of May, 2014, in Reykjavík (proceedings, gallery)
CMLC-3: 20th of July, 2015, in Lancaster (proceedings, gallery)
CMLC-4: 28th of May, 2016, in Portorož (proceedings)
CMLC-5, in conjunction with BigNLP-2017: 24th of July, 2017, in Birmingham (proceedings)
CMLC-6: 7th of May, 2018, in Miyazaki (proceedings)
CMLC-7: 22nd of July, 2019, in Cardiff (proceedings)
CMLC-8: planned for May 2020, in Marseille, but cancelled due to COVID-19 (proceedings)
CMLC-9: 12th of July, 2021, virtually in Limerick (proceedings)
CMLC-10: June 2022, in Marseille (proceedings)

Homepage

The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html
