Challenges in the Management of Large Corpora (CMLC-8)

last updated: 2020-07-24 11:37 CEST

News

The proceedings volume is available at https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/CMLC-8book.pdf.

When and where

Time: Saturday, 16th of May, 2020, in the morning session. Place: Marseille, as part of LREC-2020. (Room number t.b.a.)

Due to the COVID-19 pandemic, the 12th edition of the LREC conference, and consequently the 8th edition of the workshop, have been cancelled. We would like to thank the Authors and the Programme Committee for the work they have put into the papers and the selection process. That work is not lost: a peer-reviewed volume of proceedings is available from the LREC-2020 pages.

Accepted papers

Oral presentations
- Denis Arnold, Bernhard Fisseni, Pawel Kamocki, Oliver Schonefeld, Marc Kupietz and Thomas Schmidt, "Addressing Cha(lle)nges in Long-Term Archiving of Large Corpora"
- Peter Fankhauser, Bich-Ngoc Do and Marc Kupietz, "Evaluating a Dependency Parser on DeReKo"
- Murielle Popa-Fabre, Pedro Javier Ortiz Suárez, Benoît Sagot and Éric de la Clergerie, "French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus"
- Rosa Filgueira, Claire Grover, Beatrice Alex and Melissa Terras, "Geoparsing the historical Gazetteers of Scotland: accurately computing location in mass digitised texts"
- Vladimír Benko, "How Big Is Big Enough? Corpus-Based Frequency Lists for Language Identification"
- Markus Gärtner, "The Corpus Query Middleware of Tomorrow – A Proposal for a Hybrid Corpus Query Architecture"
- Elena Frick and Thomas Schmidt, "Using full text indices for querying spoken language data"
Poster presentations
- Hanno Biber, "Challenges for Making Use of a Large Text Corpus such as the ‘AAC – Austrian Academy Corpus’ for Digital Literary Studies"
- Michal Křen, "Czech National Corpus in 2020: Recent Development and Future Outlook"
- Andrei Scutelnicu, Catalina Maranduc and Dan Cristea, "The Syntactic Annotation Level in the Corpus of Contemporary Romanian Language"

Workshop description

Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitised, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.

A number of key themes and questions of interest to the contributing research communities emerge: (a) what can be done to deal with IPR and data protection issues? (b) what sampling techniques can we apply? (c) what quality issues should we be aware of? (d) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (e) what affordances do visualisation techniques offer for the exploratory analysis of corpora? (f) what kinds of APIs or other means of access would make the corpus data as widely usable as possible without interfering with legal restrictions? (g) how can we guarantee that corpus data remain available and usable in a sustainable way?

The CMLC workshop series invites papers dealing with challenges that arise in particular in connection with very large corpora, on topics such as: sampling approaches; web harvesting approaches; quality assessment; efficient solutions for storage, processing, querying and analysis; dimension reduction and exploratory data visualization; interfaces/APIs and other approaches to make the corpus data as widely usable as possible; sustainability and interoperability in general; intellectual property rights and licensing.

Topics

This year’s event will cover the whole range of the standard CMLC themes, with some new additions and adopting some of LREC 2020’s focus topics:

Interoperability and accessibility
- How to make corpora as accessible as possible
- Interoperable APIs for query and analysis software
- Provision of multiple levels of access for different tasks
Machine/Deep Learning
- Data preparation for machine learning input
- Creation, curation, maintenance and dissemination of language models based on machine learning (including, for example, word embeddings and entire shallow and deep learning networks)
- Legal issues concerning language model distribution
Linguistic content challenges
- Dealing with the variety of language: multilinguality, historical texts, noisy OCR texts, user-generated content, etc.
- Integration of human computation (crowdsourcing) and automatic annotation
- Quality management of annotations
- Dealing with different linguistic data types (corpora, facsimiles, experimental data, neuroimaging data, …)
Technical challenges
- Storage and retrieval solutions for big textual data corpora: primary data (potentially including facsimiles, etc.), metadata, and annotation data
- Scalable and efficient NLP tooling for annotating and analysing large datasets: distributed and GPGPU computing; using big data analysis frameworks for language processing
- Dealing with streaming data (e.g. social media) and rapidly changing corpora
- Environmental impact of big language data computing
Exploitation challenges
- Legal and privacy issues
- Query languages, data models, and standardization
- Licensing models of open and closed data, coping with intellectual property restrictions
- Innovative approaches for aggregation and visualisation of text analytics

In the tradition of CMLC, we invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster.

Final submissions

Final submissions MUST conform to the LREC template. Size: 4-8 pages, with a possible extra page for references.

Submission URL: https://www.softconf.com/lrec2020/CMLC-8/

Important dates

Deadline for abstract submission: 23 February 2020
Notification of acceptance: 12 March 2020
Deadline for the submission of camera-ready papers: 26 March 2020
Meeting: ... 16th May 2020, morning session

Identify, Describe and Share your LRs!

Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about “Sharing LRs” (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs in a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new “regular” feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.

As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2020 endorses the need to uniquely Identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.

Programme Committee

Names will be added as Programme Committee members confirm their participation.

Laurence Anthony (Waseda University, Japan)
Vladimír Benko (Slovak Academy of Sciences)
Felix Bildhauer (IDS Mannheim)
Sonja Bosch (University of South Africa)
Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
Damir Ćavar (Indiana University)
Tomaž Erjavec (Jožef Stefan Institute)
Stefan Evert (Friedrich-Alexander-Universität Nürnberg/Erlangen)
Johannes Graën (University of Gothenburg, Pompeu Fabra University)
Andrew Hardie (Lancaster University)
Serge Heiden (ENS de Lyon)
Miloš Jakubíček (Lexical Computing Ltd.)
Dawn Knight (Cardiff University)
Natalia Kotsyba (Samsung Poland)
Michal Křen (Charles University, Prague)
Sandra Kübler (Indiana University, Bloomington)
Gaël Lejeune (Sorbonne Université)
Paul Rayson (Lancaster University)
Martin Reynaert (Tilburg University)
Laurent Romary (INRIA)
Kevin Scannell (Saint-Louis University)
Roland Schäfer (FU Berlin)
Serge Sharoff (University of Leeds)
Irena Spasic (Cardiff University)
Marko Tadić (University of Zagreb, Faculty of Humanities and Social Sciences)
Ludovic Tanguy (University of Toulouse)
Dan Tufiş (Romanian Academy, Bucharest)
Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)

Organising Committee

Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Harald Lüngen

Berlin-Brandenburg Academy of Sciences

Adrien Barbaresi

Institute of Computational Linguistics, University of Zurich

Simon Clematide

Homepage

The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html

8th Workshop on the Challenges in the Management of Large Corpora