7th Workshop on the Challenges in the Management of Large Corpora
Proceedings
The CMLC-7 proceedings volume is now available.
Where and when?
Cardiff University, Main Building, the "small chemistry" hall, 22nd of July, 2019, 9.00-13.30 -- preceding the Corpus Linguistics 2019 conference (check the conference app for potential changes).
Programme
Please visit the CMLC-7 proceedings homepage to download the individual papers or the entire volume.
09.00 – 11.00 Session 1
- Johannes Graën, Tannon Kew, Anastassia Shaitarova and Martin Volk, "Modelling Large Parallel Corpora"
- Pedro Javier Ortiz Suárez, Benoît Sagot and Laurent Romary, "Asynchronous Pipelines for Processing Huge Corpora on Medium to Low Resource Infrastructures"
- Vladimír Benko, "Deduplication in Large Web Corpora"
- Mark Davies, "The best of both worlds: Multi-billion word “dynamic” corpora"
11.20 – 13.00 Session 2
- Andrew Hardie, "Managing complex and arbitrary corpus subsections at scale and at speed: from formalism to implementation within CQPweb"
- Adrien Barbaresi, "The Vast and the Focused: On the need for thematic web and blog corpora"
- Marc Kupietz, Eliza Margaretha, Nils Diewald, Harald Lüngen and Peter Fankhauser, "What's New in EuReCo? Interoperability, Comparable Corpora, Licensing"
Description
The upcoming CMLC meeting continues the successful series of “Challenges in the management of large corpora” events, previously hosted at LREC conferences, CL2015, and CL2017. As in the previous meetings, we wish to explore common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing, and data science.
Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitised, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media.
A number of key themes and questions of interest to the contributing research communities emerge: (a) what can be done to deal with IPR and data protection issues? (b) what sampling techniques can we apply? (c) what quality issues should we be aware of? (d) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (e) what affordances do visualisation techniques offer for exploratory approaches to corpus analysis? (f) what kinds of APIs or other means of access would make the corpus data as widely usable as possible without conflicting with legal restrictions? (g) how can we guarantee that corpus data remain available and usable in a sustainable way?
Topics
This year’s event will cover the whole range of the standard CMLC themes, with some new additions and hot topics:
- New opportunities and issues after GDPR, changes in national copyright legislations, and Brexit
- (automatic) anonymization of web-as-corpus genres
- publication of corpus-based language models
- provision of multiple levels of access for different tasks
- common access APIs
- research corpora/data and fair use
- political and sociological balance
- social media bubbles, hate speech and fake news
- proliferation of stereotypes via corpora and language models
- corpora as archives of the past: evolution in mentalities or laws, personality rights
- How to make corpora as accessible as possible despite big data issues, application heterogeneity, and IPR issues (continued from the CMLC 6 special topic)
- Societal and legal issues relevant for corpora and corpus studies
- Dealing with the variety of language: multilinguality, historical texts, user-generated content, etc.
- Integration of human computation (crowdsourcing) and automatic annotation
- Quality management of annotations
- Storage and retrieval solutions for big textual data corpora: primary data, metadata, and annotation data
- Scalable and efficient NLP tooling for annotating and analysing large datasets: distributed and GPGPU computing; using big data analysis frameworks for language processing
- Dealing with streaming data (e.g. social media) and rapidly changing corpora
- Legal and privacy issues
- Query languages, data models, and standardization
- Licensing models of open and closed data, coping with intellectual property restrictions
- Innovative approaches for aggregation and visualisation of text analytics
In the tradition of CMLC, we invite reports on national corpus initiatives; submitters of these reports should be prepared to present a poster.
Important dates
- NEW deadline for abstract submission: the 9th of May, midnight UTC
- Notification of acceptance: the 24th of May
- Deadline for the submission of camera-ready papers: the 20th of June
- Meeting: the 22nd of July, morning session
Submission categories
We invite anonymised extended abstracts for oral presentations on the topics listed above (PDF, 1000-1500 words excluding references, font preferably 11 pt, line spacing 1.5).
CMLC has always reserved a track for national corpus project reports, and to this end, we invite poster proposals of 500-750 words. National project reports need not be anonymised. The number of poster slots is limited. If there is spare capacity in the poster session, we reserve the right to change the presentation format of accepted papers from oral presentation to poster. Such a change will not affect how the paper is presented in the proceedings.
Submissions are accepted exclusively through the EasyChair submission system, at https://easychair.org/conferences/?conf=cmlc7.
Proceedings
The online proceedings were to be published before the meeting in a peer-reviewed, open-access volume. Due to various delays, we will offer the full proceedings volume only after the meeting. Please click on the titles of the presentations to access the submitted final versions of the papers.
Programme Committee
- Laurence Anthony (Waseda University, Japan)
- Vladimír Benko (Slovak Academy of Sciences)
- Felix Bildhauer (IDS Mannheim)
- Damir Ćavar (Indiana University, Bloomington)
- Mark Davies (BYU, USA)
- Tomaž Erjavec (Jožef Stefan Institute)
- Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
- Johannes Graën (University of Gothenburg, Pompeu Fabra University)
- Andrew Hardie (Lancaster University)
- Serge Heiden (ENS de Lyon)
- Miloš Jakubíček (Lexical Computing Ltd.)
- Michal Křen (Charles University, Prague)
- Sandra Kübler (Indiana University, Bloomington)
- Anke Lüdeling (HU Berlin)
- Piotr Pęzik (University of Łódź)
- Paul Rayson (Lancaster University)
- Martin Reynaert (Tilburg University)
- Laurent Romary (INRIA)
- Kevin Scannell (Saint Louis University)
- Roland Schäfer (FU Berlin)
- Roman Schneider (Justus-Liebig-Universität Gießen / IDS Mannheim)
- Serge Sharoff (University of Leeds)
- Marko Tadić (University of Zagreb, Faculty of Humanities and Social Sciences)
- Ludovic Tanguy (University of Toulouse)
- Dan Tufiş (Romanian Academy, Bucharest)
- Amir Zeldes (Georgetown University)
Organising Committee
Institut für Deutsche Sprache, Mannheim
Piotr Bański, Marc Kupietz, Harald Lüngen
Berlin-Brandenburg Academy of Sciences
Adrien Barbaresi
Institute of Computational Linguistics, University of Zurich
Simon Clematide
Austrian Academy of Sciences, Vienna
Hanno Biber, Evelyn Breiteneder
Homepage
The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html