12th Workshop on the Challenges in the Management of Large Corpora
The next meeting of CMLC will be held on the morning of the 11th of May, as part of the LREC-2026 conference in Palma, Mallorca. It is going to be a hybrid event.
Accepted presentations
Paper presentations
- "Pop Lyrics Through Time: Challenges in Corpus-Based Modeling of Linguistic and Emotional Dynamics in German Pop Lyrics", by Roman Schneider
- "The Infrastructure Behind Latvian National Corpora Collection", by Roberts Dargis and Baiba Valkovska
- "Optimized for AI: Curating the Icelandic Gigaword Corpus for Stable LLM Training", by Jón Friðrik Daðason and Steinthor Steingrimsson
- "A large dataset representing Bulgarian, with the Bulgarian National Corpus as its core" (short paper), by Svetla Peneva Koeva and Ivelina Stoyanova
- "Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996–2024) for Language Modeling and Sociolinguistic Research" (short paper), by Matteo Rinaldi, Rossella Varvara and Viviana Patti
Poster presentations
Poster presentations are going to be preceded by a flash presentation session.
- "National Corpus Report: The German Medical Text Corpus (GeMTeX)", by Dawn Knight and Fernando Alva Manchego
- "From Corpus to Community: New NLP Tools for Welsh Language Research and Learning", by Witold Kieraś, Małgorzata Marciniak, Katarzyna Krasnowska-Kieraś and Marcin Woliński
- "The Corpus of Contemporary Polish: 2011-2020 Decade and Beyond", by Marko Tadić, Vanja Štefanec and Daša Farkaš
- "Swiss-AL: Language Data Platform for Applied Sciences", by Julia Krasselt, Philipp Dreesen, Dolores Lemmenmeier-Batinić, Sooyeon Geckeler, Klaus Rothenhäusler and Matthias Fluor
- "Managing Growth in a National Corpus: The Hungarian National Corpus 3.0 (MNSZ3)", by Noémi Ligeti-Nagy, Enikő Héja, Ágnes Bánfi, Flóra Földesi, Bence Sárossy, Boglárka Skrabák, Tamás Váradi and Gábor Prószéky
- "EuReCo, KorAP and DeReKo: Updates on Ingestion and Annotation Pipelines, Backend, Interfaces, Operation, and Corpora", by Marc Kupietz, Harald Lüngen, Nils Diewald, Eliza Margaretha Illig, Helge Stallkamp, Uyen-Nhu Tran and Rameela Yaddehige
- "The British National Corpus 1994 to 2026", by Martin Wynne and Megan Bushnell
- "CoRoLa version 2.0: corpus enrichment and a new annotation level", by Elena Irimia, Verginica Barbu Mititelu, Radu Ion, Vasile Pais, Maria Carp and Dan Ioan Tufis
- "Recent developments of the Bulgarian National Corpus", by Svetla Peneva Koeva and Ivelina Stoyanova
- "The Hellenic National Corpus: Present, Future", by Maria Gavriilidou and Nikolaos Sidiropoulos
- "General Regionally Annotated Corpus of Ukrainian: Recent Developments and Future Plans", by Maria Shvedova
- "Corpas Náisiúnta na Gaeilge 2022-2029: A Project Overview", by Mícheál J. Ó Meachair, Úna Bhreathnach, Kevin Scannell, Michal Mechura, Brian Ó Raghallaigh and Gearóid Ó Cleircín
- "Merimënga: A Manifest-First Pipeline for Reproducible Albanian Web Corpus Construction", by Besim Kabashi and Michael Ruppert
Important dates
- Deadline for the submission of camera-ready papers: the 30th of March 2026 (Monday)
- Meeting: the 11th of May, 8:50 am for the poster presenters, 9:00 am for the rest of the world
Final submissions
- A number of hard constraints are listed in the acceptance e-mails
- Among others, the final versions have to conform to LREC-2026 templates
- Also, make sure that at least one presenter is registered as physically present (how about meeting the early-bird deadline?)
- Consult the LREC FAQ for any additional issues concerning the preparation and presentation of the accepted submissions.
- If in doubt, please e-mail the Organisers (you have the alias in your e-mail)
Workshop description
As in the previous CMLC meetings, we wish to explore common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing, natural language generation, and data science.
Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitised, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.
A mixed blessing of the times is that many such texts, in mono- and multilingual arrangements, can now be created automatically by exploiting Large Language Models at various scales. That, on the one hand, makes it possible to inflate the amounts of data where normally data would be scarce: in under-resourced languages or language varieties, in specific genres or for intricate and rarely attested constructions. On the other hand, such procedures immediately raise concerns regarding the authenticity and quality of such data, casting doubt on the possibility of adequately (truthfully, verifiably, reproducibly) addressing the kind of research questions that provoked the rapid but tainted increase of the available data volumes in the first place. Similar doubts may be directed at the mass creation of secondary and tertiary data ordinarily crucial for linguistic research: apart from potential legal constraints on the use of the initial amounts of human-created data, new questions arise as to the legal status of the derived data, the ways to create, e.g., provenance metadata for the derived resources, and the level of trust regarding mass-produced grammatical (and other) annotation layers.
These new questions, as well as more traditional ones, underlie the list of topics that the management of large corpora (for any currently suitable definition of “large”) invokes or at least touches upon.
National corpus initiatives
CMLC has often invited reports on national corpus initiatives. Given that it's been a while since the last round, we are happy to host a little "What's the news?" session, with some of our veteran presenters as well as colleagues who have not yet introduced their national corpus projects.
Our poster session is, as usual, scheduled to overlap with the coffee break, to ensure an informal atmosphere and to make the most of the time slot available to us. A flash presentation session is planned for just before the poster session: ca. 3 minutes for the highlights.
LRE 2026 Map and the "Share your LRs!" initiative
When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation ones).
Programme Committee
- Laurence Anthony (Waseda University, Japan)
- Vladimír Benko (Slovak Academy of Sciences)
- Felix Bildhauer (IDS Mannheim)
- Mark Davies (English-Corpora.org)
- Nils Diewald (IDS Mannheim)
- Kaja Dobrovoljc (University of Ljubljana / Jožef Stefan Institute)
- Jarle Ebeling (University of Oslo)
- Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
- Andrew Hardie (Lancaster University, UK)
- Serge Heiden (ENS de Lyon)
- Ulrich Heid (University of Hildesheim)
- Nancy Ide (Vassar College / Brandeis University)
- Olha Kanishcheva (Heidelberg University)
- Gražina Korvel (Vilnius University)
- Natalia Kocyba (Samsung Poland)
- Michal Křen (Charles University, Prague)
- Anna Latusek (ICS PAS, Warsaw)
- Paul Rayson (Lancaster University)
- Laurent Romary (INRIA)
- Thomas Schmidt (University of Duisburg-Essen)
- Serge Sharoff (University of Leeds)
- Maria Shvedova (Kharkiv Polytechnic Institute / University of Jena)
- Irena Spasić (Cardiff University)
- Martin Wynne (University of Oxford)
Organising Committee
- Piotr Bański (IDS Mannheim)
- Dawn Knight (Cardiff University)
- Marc Kupietz (IDS Mannheim)
- Andreas Witt (IDS Mannheim)
- Alina Wróblewska (ICS PAS, Warsaw)
Homepage
The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html