Challenges in the Management of Large Corpora + Big Data and Natural Language Processing
A joint meeting of the workshops on "Big Data and Natural Language Processing" and "Challenges in the Management of Large Corpora" will take place on the 24th of July, in Birmingham, as part of the Corpus Linguistics 2017 conference. Please bookmark this page for more information.
The CMLC+BigNLP workshop is a joint initiative of two teams who have decided to join forces to organise an event co-located with Corpus Linguistics 2017 in Birmingham. The upcoming meeting continues the successful series of “Challenges in the management of large corpora” events (previously hosted at LREC conferences and CL2015) and is at the same time the second event in the Big-NLP series, inaugurated last year at the IEEE Big Data 2016 conference. This year, we wish to explore together common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing and data science.
An increasing amount of text is available in digital format: more historical archives are being digitised, more publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media. The resulting large textual datasets are used across a number of disciplines to answer a wide range of research questions. For these datasets to be maximally useful, careful consideration must be given to their design, collection, cleaning, encoding, annotation, storage, retrieval and curation.
A number of key themes and questions of interest to the contributing research communities emerge: (a) is having more data always better? (b) is the full range of text types available online, and what quality issues should we be aware of? (c) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (d) what affordances do visualisation techniques offer for the exploratory analysis of corpora? (e) what are the key legal and ethical issues related to the use of large corpora?
An open-access (CC BY-NC-ND) electronic volume of proceedings is planned.
Topics of interest
This year’s event focuses on the union of the standard topics of CMLC and Big-NLP:
- Storage and retrieval solutions for big textual data corpora: primary data, metadata, and annotation data
- Scalable and efficient NLP tooling for annotating and analysing large datasets: distributed and GPGPU computing; using big data analysis frameworks (Hadoop, Spark, etc.) for language processing
- Dealing with streaming data (e.g. social media) and rapidly changing corpora
Licensing, legal and privacy issues:
- Licensing models of open and closed data
- Coping with intellectual property restrictions
Linguistic content issues:
- Dealing with the variety of language: multilinguality, historical texts, user-generated content, etc.
- Integration of human computation (crowdsourcing) and automatic annotation
- Quality management of annotations
- Query languages
- Innovative approaches for aggregation and visualisation of text analytics
We invite anonymised extended abstracts for oral presentations on the topics listed above (PDF, 1000-1500 words excluding references, font preferably 11 pt, line spacing 1.5).
CMLC has always reserved a track for national corpus project reports, and to this end, we invite poster proposals of 500-750 words. National project reports need not be anonymised. The number of poster slots is limited. If there is spare capacity in the poster session, we reserve the right to change the presentation format of accepted papers from oral presentation to poster. Such a change will not affect how the paper is presented in the proceedings.
Submissions are accepted exclusively through the EasyAbs submission system, at http://linguistlist.org/easyabs/cmlc+bignlp.
- Submission deadline: 12th of March, midnight UTC
- Notification of acceptance: 18th of April
- Camera-ready papers due: 18th of June
- Workshop date: 24 July 2017, afternoon session
Further names to be added as Programme Committee members confirm.
- Laurence Anthony (Waseda University, Japan)
- Alistair Baron (Lancaster University, UK)
- Felix Bildhauer (IDS Mannheim)
- Damir Ćavar (Indiana University, Bloomington)
- Matt Coole (Lancaster University, UK)
- Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
- Tomaž Erjavec (Jožef Stefan Institute)
- Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
- Johannes Graën (University of Zurich)
- Andrew Hardie (Lancaster University)
- Serge Heiden (ENS de Lyon)
- Miloš Jakubíček (Lexical Computing Ltd.)
- Dawn Knight (Cardiff University, UK)
- Michal Křen (Charles University, Prague)
- Sandra Kübler (Indiana University, Bloomington)
- Jochen Leidner (Thomson Reuters, UK)
- Rao Muhammad Adeel Nawab (COMSATS, Pakistan)
- Piotr Pęzik (University of Łódź)
- Laura Irina Rusu (IBM Australia)
- Roland Schäfer (FU Berlin)
- Roman Schneider (IDS Mannheim)
- Gandhi Sivakumar (IBM Australia)
- Irena Spasić (Cardiff University, UK)
- Marko Tadić (University of Zagreb, Faculty of Humanities and Social Sciences)
- Dan Tufiş (Romanian Academy, Bucharest)
- Amir Zeldes (Georgetown University, USA)
Joint Organising Committee
Institut für Deutsche Sprache, Mannheim
Piotr Bański, Marc Kupietz, Harald Lüngen
Institute for Corpus Linguistics and Text Technology, Vienna
Hanno Biber, Evelyn Breiteneder
Institute of Computational Linguistics, Zurich
Lancaster University, UK
John Mariani, Paul Rayson
Sheffield University, UK
It is the second time that CMLC has used the EasyAbs abstract submission system offered at no cost by the Linguist List. We gratefully acknowledge this service. Some of us have donated to the LL fund drive over the years, and we consider that money well spent.
This page is located at http://corpora.ids-mannheim.de/cmlc-2017.html
The time-limited workshop mailing address is: cmlc+bignlp at INACTIVE.ids-mannheim.de