Challenges in the Management of Large Corpora (CMLC-10)

last updated: 2022-06-20

Proceedings

The proceedings volume has been published at http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/index.html

When and where/how

CMLC-10 is going to be hosted by LREC 2022 (Marseille, 20-25 June 2022). It will take place on the pre-conference workshop day, 20 June 2022, in the morning session in Room Lacydon.

The meeting is going to be hybrid. Remote attendees can join from the LREC 2022 Virtual Platform with their personal logins.

Important dates

Deadline for abstract submission: April 19, 2022 (midnight UTC)
Notification of acceptance: May 3, 2022
Deadline for the submission of camera-ready papers: May 23, 2022
Meeting: the morning session of June 20, 2022 (Monday) in Room Lacydon

Programme

09.00 – 10.30 Session 1. Chair: Marc Kupietz
- Technical Setup / Welcome
- Introduction
- [remote presentation] Vasile Pais, Maria Mitrofan, Verginica Barbu Mititelu, Elena Irimia, Roxana Micu and Carol Luca Gasan: Challenges in Creating a Representative Corpus of Romanian Micro-Blogging Text
- Modest von Korff: Exhaustive Indexing of PubMed Records with Medical Subject Headings
10.30 – 11.00 Coffee break
11.00 – 13.00 Session 2. Chair: Andreas Witt
- Luca Brigada Villa: UDeasy: a Tool for Querying Treebanks in CoNLL-U Format
- Nils Diewald: Matrix and Double-Array Representations for Efficient Finite State Tokenization
- Peter Fankhauser and Marc Kupietz: Count-Based and Predictive Language Models for Exploring DeReKo
- [remote presentation] Hanno Biber: “The word expired when that world awoke.” New Challenges for Research with Large Text Corpora and Corpus-Based Discourse Studies in Totalitarian Times

Workshop description

The upcoming CMLC meeting continues the successful series of Challenges in the Management of Large Corpora events, previously hosted at LREC (since 2012) and at Corpus Linguistics conferences (since 2015). As in the previous meetings, we wish to explore common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing, and data science. Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitized, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.

A number of key themes and questions emerge which are of interest to the contributing research communities: (a) What can be done to deal with IPR and data protection issues? (b) What sampling techniques can we apply? (c) What quality issues should we be aware of? (d) What infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (e) What affordances do visualization techniques offer for exploratory approaches to corpus analysis? (f) What kinds of APIs or other means of access would make the corpus data as widely usable as possible without interfering with legal restrictions? (g) How can we guarantee that corpus data remain available and sustainably usable?

Motivation and topics of interest

This year’s event will cover the entire range of the standard CMLC themes, with some new additions including some of LREC 2022’s focus topics:

Interoperability and accessibility
- Improving the accessibility of large corpora
- Interoperable APIs for query and analysis software
- Provision of multiple levels of access for different tasks
Machine/Deep Learning
- Data preparation for machine learning input
- Creation, curation, maintenance and dissemination of language models based on machine learning (including, for example, word embeddings and entire shallow and deep learning networks)
- Legal issues concerning language model distribution
Linguistic content challenges
- Dealing with the variety of language: multilinguality, historical texts, noisy OCR texts, user-generated content, etc.
- Diversity and inclusiveness in/of language resources
- Integration of human computation (crowdsourcing) and automatic annotation
- Quality management of annotations
- Dealing with different linguistic data types (corpora, facsimiles, experimental data, neuroimaging data, …)
Technical challenges
- Storage and retrieval solutions for big textual data corpora: primary data (potentially including facsimiles, etc.), metadata, and annotation data
- Scalable and efficient NLP tooling for annotating and analysing large datasets: distributed and GPGPU computing; using big data analysis frameworks for language processing
- Dealing with streaming data (e.g. social media) and rapidly changing corpora
- Environmental impact of big language data computing
- Engineering and management of research software
Exploitation challenges
- Legal and privacy issues
- Query languages, data models, and standardization
- Licensing models of open and closed data, coping with intellectual property restrictions
- Innovative approaches for aggregation and visualisation of text analytics

In the tradition of CMLC, we invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster.

Share your LRs!

When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of their research. Moreover, ELRA encourages all LREC authors to share the described LRs (data, tools, services, etc.) to enable their reuse and the replicability of experiments (including evaluation ones). See the relevant ELRA page.

Programme Committee

Laurence Anthony (Waseda University, Japan)
Vladimír Benko (Slovak Academy of Sciences)
Damir Ćavar (Indiana University, USA)
Nils Diewald (IDS Mannheim)
Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
Johannes Graën (University of Zurich, Switzerland)
Andrew Hardie (Lancaster University, UK)
Serge Heiden (ENS de Lyon/IHRIM, France)
Miloš Jakubíček (Lexical Computing Ltd.)
Paweł Kamocki (IDS Mannheim)
Natalia Kotsyba (Samsung Poland)
Dawn Knight (Cardiff University)
Michal Křen (Charles University, Prague)
Veronika Laippala (University of Turku)
Verena Lyding (EURAC Research, Italy)
Paul Rayson (Lancaster University, UK)
Laurent Romary (INRIA)
Jan-Oliver Rüdiger (IDS Mannheim)
Roman Schneider (IDS Mannheim, Germany)
Serge Sharoff (University of Leeds)
Irena Spasić (Cardiff University, UK)
Marko Tadić (University of Zagreb, Faculty of Humanities and Social Sciences)
Ludovic Tanguy (University of Toulouse)
Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)
Andreas Witt (IDS / Mannheim University)

Organising Committee

Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Harald Lüngen

Berlin-Brandenburg Academy of Sciences

Adrien Barbaresi

Institute of Computational Linguistics, University of Zurich

Simon Clematide

Homepage

The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html

10th Workshop on the Challenges in the Management of Large Corpora