
Challenges in the Management of Large Corpora 5 + Big Data and Natural Language Processing 2017

A joint meeting of the workshops on "Big Data and Natural Language Processing" and "Challenges in the Management of Large Corpora" will take place on the 24th of July, in Birmingham, as part of the Corpus Linguistics 2017 conference. Please bookmark this page for current information.


The proceedings volume has been published. The individual papers are linked from the programme below.

CMLC+BigNLP is going to be preceded by a guest Web-as-Corpus session, before the lunch break (see below for the list of accepted presentations). Registration at CMLC+BigNLP will be valid throughout.

Workshop description

The CMLC+BigNLP workshop is a joint initiative of two teams who have decided to join forces to organise an event co-located with Corpus Linguistics 2017 in Birmingham. The upcoming meeting continues the successful series of “Challenges in the Management of Large Corpora” events (previously hosted at LREC conferences and CL2015) and is at the same time the second event in the Big-NLP series, inaugurated last year at the IEEE Big Data 2016 conference. This year, we wish to explore together common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing and data science.

An increasing amount of text is available in digital format: more historical archives are being digitised, more publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media. The resulting large textual datasets are used across a number of disciplines to answer a wide range of research questions. For these datasets to be maximally useful, careful consideration needs to be given to their design, collection, cleaning, encoding, annotation, storage, retrieval and curation.

A number of key themes and questions of interest to the contributing research communities emerge: (a) is having more data always better? (b) is the full range of text types available online, and what quality issues should we be aware of? (c) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (d) what affordances do visualisation techniques offer for exploratory approaches to the analysis of corpora? (e) what are the key legal and ethical issues related to the use of large corpora?

Workshop Programme (24 July 2017)

Programme Committee

Joint Organising Committee

Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Harald Lüngen

Institute for Corpus Linguistics and Text Technology, Vienna

Hanno Biber, Evelyn Breiteneder

Institute of Computational Linguistics, Zurich

Simon Clematide

Lancaster University, UK

John Mariani, Paul Rayson

Sheffield University, UK

Mark Stevenson


It is the second time that CMLC has used the EasyAbs abstract submission system offered at no cost by the Linguist List. We gratefully acknowledge this service. Some of us have donated to the LL fund drive over the years, and we consider that money well spent.


This page is located at http://corpora.ids-mannheim.de/cmlc-2017.html

The time-limited workshop mailing address is: cmlc+bignlp at INACTIVE.ids-mannheim.de