last updated: 2022-01-10

10th Workshop on the Challenges in the Management of Large Corpora

When and where/how

CMLC-10 will be hosted by LREC 2022 (Marseille, 20-25 June 2022). We are hoping for a non-virtual event, but...

Important dates

Abstract submission

All submissions must use the LREC template and come via the START manager (URL to be published here).

We invite anonymised extended abstracts for 15- to 20-minute oral presentations on the workshop topics (see below). Format: PDF, 1000-1500 words excluding references (preferably up to 4 pages).

CMLC has always reserved a track for national corpus project reports, and to this end, we invite poster proposals of 500-750 words. National project reports need not be anonymised.

Workshop description

The upcoming CMLC meeting continues the successful series of "Challenges in the Management of Large Corpora" events, previously hosted at LREC (since 2012) and at Corpus Linguistics conferences (since 2015). As in the previous meetings, we wish to explore common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing, and data science. Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitized, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.

A number of key themes and questions emerge which are of interest to the contributing research communities: (a) What can be done to deal with IPR and data protection issues? (b) What sampling techniques can we apply? (c) What quality issues should we be aware of? (d) What infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (e) What affordances do visualization techniques offer for exploratory approaches to corpus analysis? (f) What kinds of APIs or other means of access would make the corpus data as widely usable as possible without violating legal restrictions? (g) How can we guarantee that corpus data remain available and sustainably usable?

Motivation and topics of interest

This year’s event will cover the entire range of the standard CMLC themes, with some new additions including some of LREC 2022’s focus topics:

In the tradition of CMLC, we invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster.

Share your LRs!

When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation ones). See the relevant ELRA page.

Programme Committee

To be announced soonish.

Organising Committee

Institut für Deutsche Sprache, Mannheim

Piotr Bański, 📩 Marc Kupietz, Harald Lüngen

Berlin-Brandenburg Academy of Sciences

📩 Adrien Barbaresi

Institute of Computational Linguistics, University of Zurich

Simon Clematide

Homepage

The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html