Challenges in the Management of Large Corpora (CMLC-12)

Last updated: 2026-05-06

The twelfth meeting of CMLC will be held in the morning of the 11^th of May, as part of the LREC-2026 conference in Palma, Mallorca.

We are going to meet in Cabrera 1-2 (2nd floor of the venue) (also known as "room 7") at 9 a.m., leave for the (short) coffee break and the poster session (beginning shortly after 10:30), and come back in time for another paper presentation session, i.e., before 11:30.

Poster presenters are requested to arrive earlier, so that at 8:45 they can begin to set their posters up in the Menorca Hall on the 3rd floor (there is going to be no time for that after the first session).

Programme

Place: Cabrera 1-2 (2nd floor) for the presentations, Menorca Hall (3rd floor) for the poster session. Click the titles to see the abstracts.

09:00 Technical Setup / Welcome
09.10 – 09.40 Session A: short papers. Chair: Dawn Knight
- "Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996–2024) for Language Modeling and Sociolinguistic Research" Matteo Rinaldi, Rossella Varvara, Viviana Patti University of Turin, Dipartimento di Informatica We present TestiMole-Conversational a massive collection of discussion boards messages in the Italian language. The large size of the corpus, almost 30B word-tokens (1996–2024), brings challenges in the processing and curation of the resource, but it renders it an ideal dataset for native Italian Large Language Models’ pre-training. Furthermore, discussion boards’ messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also support investigations of language variation and social phenomena in digital communication.
- "IfGPT, a large dataset representing Bulgarian, with the Bulgarian National Corpus as its core" Svetla Peneva Koeva and Ivelina Stoyanova Institute for Bulgarian Language "Prof. Lyubomir Andreychin", Bulgarian Academy of Sciences The paper introduces the IfGPT dataset, which integrates several Bulgarian text collections, including the Bulgarian National Corpus, and applies cleaning, deduplication, and LLM-oriented metadata such as personally identifiable information and bias scores. The composition of the IfGPT dataset is presented, along with the unified metadata schema and metadata management in a graph database, enabling efficient querying and document selection for specific tasks. The main contributions are the integration of multiple Bulgarian text collections into a unified dataset, the development of a standardised metadata schema with graph-based organisation, and the provision of efficient metadata querying mechanisms to support LLM development.
09:45 – 10:25 Session B: Flash presentations of posters. Chairs: Alina Wróblewska, Piotr Bański
Each presentation team has exactly 150 seconds at their disposal (please rehearse; singing is allowed). The chairs are going to keep the time and switch the slides (those are going to be collected very soon, each team gets up to three). Team order matters.
1. "The Hellenic National Corpus: Present, Future" Maria Gavriilidou and Nikolaos Sidiropoulos Athena RC/ILSP The Hellenic National Corpus (HNC) is an integrated online environment offering access to standard Modern Greek language material and to related analysis tools. The HNC corpus has been developed in two main phases, and currently comprises over 97 million words exclusively of written language, sourced from printed resources or scraped from the internet. The material has been automatically lemmatized and morphologically annotated, while a subset of 100,000 words has been further manually corrected, in order to produce a freely downloadable error- free corpus. Through the dedicated platform, the users have access to concordances, morphological analysis of words and statistical information (frequency) at word, lemma, part of speech and n-gram levels. Future steps include the expansion of the material in both historical and coverage dimensions: the inclusion of material from older phases of the language is foreseen, as well as the addition of dialectal material besides standards language.
2. "Corpas Náisiúnta na Gaeilge 2022-2029: A Project Overview" Mícheál J. Ó Meachair¹, Úna Bhreathnach¹, Kevin Scannell², Michal Mechura³, Brian Ó Raghallaigh¹, Gearóid Ó Cleircín¹ ¹Fiontar & Scoil na Gaeilge, Dublin City University, ²Cadhan Aonair, ³Masaryk University and Dublin City University This paper reports the latest developments, planned works, and issues of the Corpas Náisiúnta na Gaeilge (henceforth: CNG, translation: the National Corpus of Irish) project, detailing the work that has been completed to date, current work, and planned future work. This report details the compilation of corpora, development of a project website and part-speech tagger, the challenges of expanding existing corpora, and the addition of historical and legal corpora. We also present the training and outreach activities of the project.
3. "General Regionally Annotated Corpus of Ukrainian: Recent Developments and Future Plans" Maria Shvedova National Technical University "Kharkiv Polytechnic Institute"; University of Jena The General Regionally Annotated Corpus of Ukrainian (GRAC) effectively serves as a national corpus. GRAC v.19 (2025) contains 2 billion tokens from over 800,000 texts (1816–2025). The corpus has multi-level annotations: rich metadata including regional tags, morphological annotation based on the VESUM dictionary, and partial semantic annotation. GRAC is the source of several derivative projects, including UD_Ukrainian_ParlaMint, ParaRook parallel corpora, Rada_Trees, and others.
4. "Recent developments of the Bulgarian National Corpus" Svetla Peneva Koeva and Ivelina Stoyanova Institute for Bulgarian Language "Prof. Lyubomir Andreychin", Bulgarian Academy of Sciences We present recent developments in the Bulgarian National Corpus, including data collection from various sources, cleaning of diverse datasets, enrichment with multimodal data, and extensive metadata, which resulted in the development of IfGPT, a large BulNC-based dataset. Typical methods for distributing the BulNC-based dataset are briefly described, with emphasis on effective searching within the metadata stored in a graph database.
5. "The British National Corpus 1994 to 2026" Martin Wynne and Megan Bushnell University of Oxford The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. It is one of the first generation of monolingual, synchronic, general, representative corpora of its size, and led the way for other national corpora. It was created by a consortium of academic partners and publishers, with funding from the Department of Trade and Industry in the UK. This posters reflects on a number of lessons learning in more than thirty years, in terms of corpus representativeness, modes of access to the corpus, licensing, and managing the transition from a contemporary synchronic corpus to a historical corpus.
6. "The Corpus of Contemporary Polish: 2011-2020 Decade and Beyond" Witold Kieraś, Małgorzata Marciniak, Katarzyna Krasnowska-Kieraś and Marcin Woliński Institute of Computer Science, Polish Academy of Sciences The aim of this poster is to present the Contemporary Corpus of Polish (KWJP), a new reference corpus spanning the period of 2011–2020. The KWJP complements the now discontinued National Corpus of Polish project (NKJP) by providing up-to-date linguistic data. It comprises a 100M-token balanced sub-corpus alongside a larger 1.5B-token unbalanced (opportunistic) corpus, consisting of books and periodicals not included in the balanced part. While the corpus contains almost exclusively copyrighted material and is accessible only via a web-based search engine, a representative 0.5M-token sample has been published as open-source data.
7. "Building the v4 of the Croatian National Corpus" Marko Tadić, Vanja Štefanec and Daša Farkaš University of Zagreb, Faculty of Humanities and Social Sciences It has been thirteen years since the release of the current version (v3) of the Croatian National Corpus (HNK). In terms of synchronicity in corpus linguistics, that many years may be considered quite some time. The preparatory phase for the composition of the new version of HNK (v4) has been going already for several years and in this paper we touch on several issues of concern. Apart of regular corpus parameters, e.g. text sources, text genres, coverage of language varieties, time span, we also discuss about metadata and linguistic annotation schemata. One of important technical prerequisites was the development of CorpRepo, a custom corpus data management system and file system, which enable us to do sustainable long-term maintenance of the data, and to produce newer versions of corpus more easily and more often. The selection of IPR-cleared data entails some restrictions and we give several examples of that kind of textual sources, but also discuss possible weaknesses of such approach to data selection. Regarding the linguistic annotation, the important shift is the decision to abandon the MulText East morphosyntacting descriptions and use solutions recommended by UD-initiative.
8. "Managing Growth in a National Corpus: The Hungarian National Corpus 3.0 (MNSZ3)" Noémi Ligeti-Nagy¹, Enikő Héja¹, Ágnes Bánfi¹, Flóra Földesi¹, Bence Sárossy¹, Boglárka Skrabák², Tamás Váradi¹, Gábor Prószéky¹ ¹ELTE Research Centre for Linguistics, ²ELTE Faculty of Humanities The third generation of the Hungarian National Corpus (MNSZ3) aims to provide a large-scale, curated, and well-described corpus resource needed for the sustainable digital presence of Hungarian. Building on the domain structure and proportions of MNSZ2 (v2.0.5; 1.04 billion running words), the project targets a substantial increase in scale while also strengthening the coverage and metadata description of Hungarian language use outside Hungary. MNSZ3 retains the six traditional domains of the earlier corpus—press, fiction, scientific, official, personal, and transcribed spoken language—and is planned to reach approximately 10 billion tokens. This paper presents the motivation and design principles of the project, outlines the practical decisions and procedures used in data collection and cleaning, and discusses the annotation strategy developed for large-scale processing. In planning the linguistic analysis, we build on the complementary strengths of HuSpaCy and e-magyar: HuSpaCy provides the unified and efficient UD-oriented processing backbone, while e-magyar (emMorph) is preserved as an explicit additional layer for morphology and lemmatisation.
9. "CoRoLa version 2.0: corpus enrichment and a new annotation level" Elena Irimia, Verginica Barbu Mititelu, Radu Ion, Vasile Pais, Maria Carp and Dan Ioan Tufis Research Institute for Artificial Intelligence, Romanian Academy The paper gives an overview of the recent developments in the enrichment of the reference Corpus of Contemporary Romanian (CoRoLa), within on-going international projects. Statistics of the newly acquired data, work methodology and work towards inclusion of a new annotation layer, the syntactic one, are detailed. We briefly present RODNA, an updated Romanian text processor with state-of-the-art performance on POS tagging, lemmatization and dependency parsing that will be used to populate the syntactic layer of CoRoLa.
10. "The German Medical Text Corpus: Early 2026 Update" Justin Hofenbitzer¹, Christina Lohr², Frank Meineke², Markus Löffler², Martin Boeker¹ ¹Technical University of Munich, ²IMISE, University of Leipzig Clinical text resources are a central component for the study of medical language, as well as the training and evaluation of large language models, chatbots, and artificial intelligence systems supporting clinical routines. With the German Medical Text Corpus (GeMTeX), we are currently working on the largest shareable clinical document dataset in German. The multi-centric project ensures diversity across different university hospitals, clinical domains, and text sorts. After a thorough de-identification process, the clinical texts are semantically annotated using Snomed CT, a language-independent, standardized medical ontology. While the corpus is still under active development, it is accessible upon request under controlled access conditions. As of February 2026, GeMTeX comprises more than 15k documents and 20M tokens. We refer researchers interested in the resource to visit https://kiinformatik. mri.tum.de/en/gemtex or reach out to us via gemtex.mi @ mh.tum.de.
11. "From Corpus to Community: New NLP Tools for Welsh Language Research and Learning" Dawn Knight and Fernando Alva Manchego Cardiff University Launched in 2020, CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – National Corpus of Contemporary Welsh) is the first large-scale corpus of the Welsh language to integrate spoken, written, and electronically mediated data, offering a comprehensive snapshot of contemporary Welsh use. Including contributions from over 2,000 speakers, the 11.2-million-word corpus represents the diversity of Wales's linguistic landscape. As a national resource, CorCenCC enables users to explore real world Welsh. Several tools and resources were developed through the CorCenCC project, including the CyTag POS tagger and CySemTag (adapted from Lancaster University's USAS semantic system), to enable the grammatical and semantic categorisation of the dataset. The team also built the pedagogic toolkit Y Tiwtiadur, to allow learners and teachers to access corpus-based examples and tasks. Additionally, Yr Amliadur provides curated frequency-based wordlists across modes and parts of speech, supporting linguistic analysis and vocabulary development. Since completing the corpus, the team has focused on extending its impact and reach, to ensure that the resources are maintained and sustained for future use; a challenge often faced when large-scale projects end. This poster profiles the tools and resources created from and inspired by CorCenCC and its associated tools and resources, as a means of supporting the democratisation of linguistic resources for minoritised language contexts.
12. "Swiss-AL: Language Data Platform for Applied Sciences" Julia Krasselt, Philipp Dreesen, Dolores Lemmenmeier-Batinić, Sooyeon Geckeler, Klaus Rothenhäusler and Matthias Fluor ZHAW This paper introduces Swiss-AL, a language data platform designed for the multilingual, comparative analysis of public discourse in Switzerland. Swiss-AL is an open research data resource providing browser-based access to a variety of corpora in all four of Switzerland's official languages. Corpora contain journalistic, organisational, and parliamentary discourse. The platform supports research in applied linguistics as well as neighbouring disciplines (e.g., social sciences, communication and media studies).
13. "EuReCo, KorAP and DeReKo: Updates on Ingestion and Annotation Pipelines, Backend, Interfaces, Operation, and Corpora" Marc Kupietz, Harald Lüngen, Nils Diewald, Eliza Margaretha Illig, Helge Stallkamp, Uyen-Nhu Tran and Rameela Yaddehige Leibniz Institute for the German Language (IDS) This paper reports on recent technical developments in the European Reference Corpus EuReCo and its current technical implementation based on the corpus search and analysis platform KorAP. We describe updates to the ingestion pipeline, including extensions to the TEI-to-KorAP-XML converter tei2korapxml and the KorAP tokenizer, as well as the newly introduced korapxmltool for annotation and index conversion. We further present Koral- Mapper, a service that enables cross-schema comparability of annotations and metadata at query time, and report on developments in the backend access control system Kustvakt, the web user interface Kalamar, API client libraries for R and Python that promote reproducibility and methodologically sound AI-assisted analysis, and containerized deployment. The corpora and languages currently represented in EuReCo are outlined, and the role of the German Reference Corpus DeReKo, including its metadata-driven virtual corpus design, predefined useful subcorpora, and I5/TEI encoding, is discussed in detail.
14. "Merimënga: A Manifest-First Pipeline for Reproducible Albanian Web Corpus Construction" Besim Kabashi¹ and Michael Ruppert² ¹Eberhard Karls Universitaet Tuebingen, ²Friedrich-Alexander-Universitaet Erlangen-Nuernberg We present Merimënga, a pipeline for reproducible Albanian web-corpus construction from Common Crawl. Rather than distributing a static text dump, we publish versioned manifests and append-only JSONL ledgers that make every retrieval and filtering decision replayable at record level. Records are addressed by (WARC filename, byte offset, byte length) and retrieved via HTTP range requests with checksum validation, enabling selective download, resumability, and exact re-materialization. On top of deterministic cleaning and deduplication, Merimënga supports teacher–student filtering: a large LLM labels a stratified sample; the resulting policy is distilled into a faster student model applied at corpus scale. The paper contributes (i) a reproducibility specification for web-corpus construction based on coordinate-addressed retrieval and decision ledgers, (ii) a concrete instantiation for Albanian with language-specific filtering, and (iii) an evaluation protocol for rerun equivalence and filter-stack ablation. Large-scale download and full-corpus filtering are ongoing; this submission focuses on methodology and auditable artifacts rather than final corpus statistics.
10:30 – 10:40 Coffee break (drifting towards the poster area on the 3rd floor)
10:40 – 11:25 Session C: Poster session in the Menorca Hall. Chair: Andreas Witt
For poster titles and authors, see session B. Please be back in room 7 before 11:30.
11:30 – 13:00 Session D: Long papers. Chair: Marc Kupietz
- "Pop Lyrics through Time: Challenges in Corpus-Based Modeling of Linguistic and Emotional Dynamics in German Pop Lyrics" Roman Schneider Leibniz Institute for the German Language (IDS) This paper presents a large-scale diachronic analysis of German pop lyrics based on a linguistically rich, TEI- encoded monitoring corpus. We describe multi-layer annotation and reproducible workflows for deriving higher- level features at scale, including lexical diversity indices, a pronoun-based subjectivity measure, modal particle density, and a length-normalized sentiment intensity score. Particular attention is paid to the development and evaluation of pipelines for two notoriously challenging phenomena: modal particles and sentiment. For modal particles, we build a manually curated gold standard and train sequence models whose performance we relate to inter-annotator agreement. For sentiment, we integrate a lexicon-based resource with a dedicated human annotation experiment to assess reliability and alignment with expert judgments. On this basis, we investigate how structural and affective features co-vary in the corpus and how they change over time, showing, among other trends, declining lexical diversity and sentiment intensity alongside a slight increase in first- and second-person pronouns. Beyond the empirical findings, the paper highlights practical challenges in managing culturally specific corpora, and makes evaluation materials available to support transparent, reusable corpus-based research on popular music and related domains.
- "The Infrastructure behind Latvian National Corpora Collection" Roberts Dargis and Baiba Valkovska Institute of Mathematics and Computer Science, University of Latvia The rapid advancement of digital humanities and Natural Language Processing (NLP) necessitates centralized access to high-quality, large-scale language resources. This paper presents the technical infrastructure and evolving ecosystem of Korpuss.lv, the central access platform for the Latvian National Corpora Collection (LNCC). The LNCC consolidates 42 corpora developed by 14 institutions, comprising 2.8 billion tokens of written and spoken Latvian across diverse genres and annotation layers. Korpuss.lv has evolved from a simple metadata index into a comprehensive digital infrastructure that enhances corpus discoverability, accessibility, and usability for researchers in linguistics, digital humanities, and natural language processing. The platform integrates noSketchEngine as its primary corpus analysis tool and extends its functionality with custom modules, including a metadata-driven Corpora Explorer, a client-side Federated Content Search system, and precomputed UD-based Word Sketches. The ecosystem is further supported by CLARIN DSpace repositories for persistent storage and citation management, as well as a federated academic authentication architecture built on SATOSA and Keycloak via the CLARIN Service Provider Federation. The paper outlines architectural decisions, integration strategies, and future development plans.
- "Optimized for AI: Curating the Icelandic Gigaword Corpus for Stable LLM Training" Jón Friðrik Daðason and Steinþór Steingrímsson The Árni Magnússon Institute for Icelandic Studies The Icelandic Gigaword Corpus (IGC) is a primary resource for Icelandic NLP, with its current version containing 2.7 billion words of curated text. The IGC is traditionally distributed in a TEI-XML format, a hierarchical structure that allows for rich linguistic annotation and metadata. However, this format introduces significant friction for modern machine learning workflows. Even high-quality curated corpora have been found to contain "unwanted" text sequences—such as fragmented lists or repetitive boilerplate that may trigger instabilities during training of large language models. In this paper, we present a new processing pipeline designed to optimize the IGC for AI development. We describe a filtering approach focusing on training stability, including fuzzy deduplication to reduce the risk of data leakage, with the aim to provide high-quality data for stable model convergence. Furthermore, we introduce a new JSONL distribution format that bridges the gap between TEI-XML and machine-actionable data, facilitating easier access and safer training for models aiming to work with Icelandic.
13:00 – 13:03 Roundup and thanks (remember to take down your posters)
Until 14:00 Lunch together? (more info on the spot; BYO€)

Workshop description

As in the previous CMLC meetings, we wish to explore common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing, natural language generation, and data science.

Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitised, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.

A mixed blessing of the times is that much of those texts, in mono- and multi-lingual arrangements can now be created automatically by exploiting Large Language Models at various scales. That, on the one hand, makes it possible to inflate the amounts of data where normally data would be scarce: in under-resourced languages or language varieties, in specific genres or for intricate and rarely attested constructions. On the other hand, such procedures immediately raise concerns regarding the authenticity and quality of such data, casting doubt on the possibility of adequately (truthfully, verifiably, reproducibly) addressing the kind of research questions that provoked the rapid but tainted increase of the available data volumes in the first place. Similar doubts may be directed at mass creation of secondary and tertiary data ordinarily crucial for linguistic research: apart from potential legal constraints on the use of the initial amounts of human-created data, new questions arise as to the legal status of the derived data, the ways to create e.g. provenance metadata of the derived resources, and the level of trust regarding mass-produced grammatical (and other) annotation layers.

These new as well as more traditional questions lie at the base of the list of topics that management of large corpora (for any currently suitable definition of “large”) invokes or at least strongly brushes against.

National corpus initiatives

CMLC has often invited reports on national corpus initiatives. Given that it's been a while since the last round, we are happy to host a little "What's the news?" session, with some of our veteran presenters as well as colleagues who have not yet introduced their national corpus projects.

LRE 2026 Map and the "Share your LRs!" initiative

When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of your research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and replicability of experiments (including evaluation ones).

Programme Committee

Laurence Anthony (Waseda University, Japan)
Vladimír Benko (Slovak Academy of Sciences)
Felix Bildhauer (IDS Mannheim)
Mark Davies (English-Corpora.org)
Nils Diewald (IDS Mannheim)
Kaja Dobrovoljc (University of Ljubljana / Jožef Stefan Institute)
Jarle Ebeling (University of Oslo)
Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
Andrew Hardie (Lancaster University, UK)
Serge Heiden (ENS de Lyon)
Ulrich Heid (University of Hildesheim)
Nancy Ide (Vassar College / Brandeis University)
Olha Kanishcheva (Heidelberg University)
Gražina Korvel (Vilnius University)
Natalia Kocyba (Samsung Poland)
Michal Křen (Charles University, Prague)
Anna Latusek (ICS PAS, Warsaw)
Paul Rayson (Lancaster University)
Laurent Romary (INRIA)
Thomas Schmidt (University of Duisburg-Essen)
Serge Sharoff (University of Leeds)
Maria Shvedova (Kharkiv Polytechnic Institute / University of Jena)
Irena Spasić (Cardiff University)
Martin Wynne (University of Oxford)

Organising Committee

[hover for the e-mail address]

📩 Piotr Bański (IDS Mannheim)
📩 Dawn Knight (Cardiff University)
📩 Marc Kupietz (IDS Mannheim)
📩 Andreas Witt (IDS Mannheim)
📩 Alina Wróblewska (ICS PAS, Warsaw)

Homepage

CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html

12th Workshop on the Challenges in the Management of Large Corpora