
Challenges in the Management of Large Corpora 5 + Big Data and Natural Language Processing 2017

A joint meeting of the workshops on "Big Data and Natural Language Processing" and "Challenges in the Management of Large Corpora" will take place on the 24th of July, in Birmingham, as part of the Corpus Linguistics 2017 conference. Please bookmark this page for current information.

Workshop description

The CMLC+BigNLP workshop is a joint initiative of two teams who have decided to join forces to organise an event co-located with Corpus Linguistics 2017 in Birmingham. The upcoming meeting continues the successful series of “Challenges in the management of large corpora” events (previously hosted at LREC conferences and CL2015) and is at the same time the second event in the Big-NLP series, inaugurated last year at the IEEE Big Data 2016 conference. This year, we wish to explore together the areas of common interest across a range of issues in language resource management, corpus linguistics, natural language processing and data science.

An increasing amount of text is available in digital format: more historical archives are being digitised, more publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media. The resulting large textual datasets are used across a number of disciplines to answer a wide range of research questions. For these datasets to be maximally useful, careful consideration needs to be given to their design, collection, cleaning, encoding, annotation, storage, retrieval and curation.

A number of key themes and questions of interest to the contributing research communities emerge: (a) is having more data always better? (b) is the full range of text types available online, and what quality issues should we be aware of? (c) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (d) what affordances do visualisation techniques offer for the exploratory analysis of corpora? (e) what are the key legal and ethical issues related to the use of large corpora?

An open-access (CC BY-NC-ND) electronic volume of proceedings is planned.

Topics of interest

This year’s event focuses on the union of the standard topics of CMLC and Big-NLP:

Accepted submissions

Full-paper submissions

An open-access (CC BY-NC-ND) electronic volume of proceedings is planned, published by IDS Mannheim (compare the proceedings volume for CMLC-3).

The final papers should adhere to the ACL style (see towards the bottom of the linked page) and should not exceed 8 pages. Authors are expected to submit their papers by the 12th of June and to be prepared to make editorial corrections afterwards. Short abstracts of these papers will be published on the workshop home page.

Important dates

Programme Committee

Further names to be added as Programme Committee members confirm.

Joint Organising Committee

Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Harald Lüngen

Institute for Corpus Linguistics and Text Technology, Vienna

Hanno Biber, Evelyn Breiteneder

Institute of Computational Linguistics, Zurich

Simon Clematide

Lancaster University, UK

John Mariani, Paul Rayson

Sheffield University, UK

Mark Stevenson

Acknowledgements

This is the second time that CMLC has used the EasyAbs abstract submission system, offered at no cost by the Linguist List. We gratefully acknowledge this service. Some of us have donated to the Linguist List fund drive over the years, and we consider that money well spent.

Contact

This page is located at http://corpora.ids-mannheim.de/cmlc-2017.html

The time-limited workshop mailing address is: cmlc+bignlp at INACTIVE.ids-mannheim.de