last updated: 2019-06-07 16:31 CEST

7th Workshop on the Challenges in the Management of Large Corpora

Where and when?

Cardiff, 22 July 2019, during the Corpus Linguistics 2019 conference.

Preliminary Programme

09.00 – 11.00     Session 1

11.20 – 13.00     Session 2


The upcoming CMLC meeting continues the successful series of “Challenges in the management of large corpora” events, previously hosted at LREC conferences, CL2015, and CL2017. As in the previous meetings, we wish to explore common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing, and data science.

Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitised, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.

A number of key themes and questions of interest to the contributing research communities emerge: (a) what can be done to deal with IPR and data protection issues? (b) what sampling techniques can we apply? (c) what quality issues should we be aware of? (d) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (e) what affordances do visualisation techniques offer for the exploratory analysis of corpora? (f) what kinds of APIs or other means of access would make the corpus data as widely usable as possible without violating legal restrictions? (g) how can we guarantee that corpus data remain available and usable in a sustainable way?


This year’s event will cover the whole range of the standard CMLC themes, with some new additions and hot topics:

In the tradition of CMLC, we invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster.

Important dates

Submission categories

We invite anonymised extended abstracts for oral presentations on the topics listed above (PDF, 1000-1500 words excluding references, font preferably 11 pt, line spacing 1.5).

CMLC has always reserved a track for national corpus project reports, and to this end, we invite poster proposals of 500-750 words. National project reports need not be anonymised. The number of poster slots is limited. If there is spare capacity in the poster session, we reserve the right to change the presentation format of accepted papers from oral presentation to poster. Such a change will not affect how the paper is presented in the proceedings.

Submissions are accepted exclusively through the EasyChair submission system, at https://easychair.org/conferences/?conf=cmlc7.


Online proceedings will be published before the meeting in a peer-reviewed, open-access volume.

Programme Committee

Organising Committee

Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Harald Lüngen

Berlin-Brandenburg Academy of Sciences

Adrien Barbaresi

Institute of Computational Linguistics, University of Zurich

Simon Clematide

Austrian Academy of Sciences, Vienna

Hanno Biber, Evelyn Breiteneder


The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html