last updated: 2021-07-09

9th Workshop on the Challenges in the Management of Large Corpora

Special Topic: Design and Management of Research Software


The proceedings volume has been published at https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/10467

The versions available from there seem to have lost all hyperlinks.

When and where/how

CMLC-9 will take place on the 12th of July as an online event: a pre-conference workshop at the Corpus Linguistics 2021 conference, hosted by the University of Limerick, Ireland.

The Zoom link should have been sent to all registered participants. If you are registered for the workshop but have not received the link, please let us know.

Important dates


All times are listed in Irish Standard Time (UTC+1:00).

10.00 – 11.30     Session 1: Presentations

11.45 – 12.30     Session 2: Panel Discussion about Research Software Management

Confirmed participants:

Workshop description

The upcoming CMLC meeting continues the successful series of “Challenges in the Management of Large Corpora” events, previously hosted at the LREC (since 2012) and CL (since 2015) conferences. As in previous meetings, we wish to explore common areas of interest across a range of issues in linguistic research data and tool management, corpus linguistics, natural language processing, and data science, this time with a special focus on tools.

Linguistic research software and other topics of interest

To an even greater extent than in other disciplines, linguistic research data can hardly be used without the help of appropriate research software. As frequently noted at CMLC events, this often relates to the need for client/server approaches, as language data cannot usually be downloaded and processed on the home or lab PC, for legal and logistical reasons. Additionally, due to the complexity and high dimensionality of linguistic data and the unknown nature of the variation factors, specialised tools are needed on the way from raw data to their interpretation. These tools cannot be considered part of a general technical infrastructure.

From the reconstruction or transformation of the raw data onwards, for example in its tokenization, the linguistic assumptions and decisions (as well as errors) embodied in research tools influence observations, and possibly research results, as much as the research data itself, if data and tools can be treated separately at all. While approaches to the management of research data have been discussed quite broadly over the last 15 years, research tools have received at best marginal attention.

For this reason, CMLC-9 will focus on approaches to the design, development and management of research software, while not ignoring the other CMLC topics.

Programme Committee

Organising Committee


Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Harald Lüngen

Berlin-Brandenburg Academy of Sciences

Adrien Barbaresi

Institute of Computational Linguistics, University of Zurich

Simon Clematide


The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html