Challenges in the Management of Large Corpora 5 + Big Data and Natural Language Processing 2017
A joint meeting of the workshops on "Big Data and Natural Language Processing" and "Challenges in the Management of Large Corpora" will take place on the 24th of July, in Birmingham, as part of the Corpus Linguistics 2017 conference. Please bookmark this page for current information.
News
The proceedings volume has been published. The individual papers are linked from the programme below.
CMLC+BigNLP is going to be preceded by a guest Web-as-Corpus session, before the lunch break (see below for the list of accepted presentations). Registration at CMLC+BigNLP will be valid throughout.
Workshop description
The CMLC+BigNLP workshop is a joint initiative of two teams who have decided to join forces to organise an event co-located with Corpus Linguistics 2017 in Birmingham. The upcoming meeting continues the successful series of “Challenges in the management of large corpora” events (previously hosted at LREC conferences and CL2015) and is at the same time the second event in the Big-NLP series, inaugurated last year at the IEEE Big Data 2016 conference. This year, we wish to explore together common areas of interest across a range of issues in language resource management, corpus linguistics, natural language processing and data science.
An increasing amount of text is available in digital format: more historical archives are being digitised, more publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media. The resulting large textual datasets are used across a number of disciplines to answer a wide range of research questions. For these datasets to be maximally useful, careful consideration needs to be given to their design, collection, cleaning, encoding, annotation, storage, retrieval and curation.
A number of key themes and questions of interest to the contributing research communities emerge: (a) is having more data always better? (b) is the full range of text types available online and what quality issues should we be aware of? (c) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (d) what affordances do visualisation techniques offer for the exploratory analysis of corpora? (e) what are the key legal and ethical issues related to the use of large corpora?
Workshop Programme (24 July 2017)
WAC-XI guest session (11:00 – 12:30)
Convenors: Adrien Barbaresi (ICLTT Vienna), Felix Bildhauer (IDS Mannheim), Roland Schäfer (FU Berlin)
Chair: Stefan Evert (Friedrich-Alexander-Universität Erlangen-Nürnberg)
- 11:00 – 11:30: Edyta Jurkiewicz-Rohrbacher, Zrinka Kolaković, Björn Hansen – Web Corpora – the best possible solution for tracking rare phenomena in underresourced languages – clitics in Bosnian, Croatian and Serbian
- 11:30 – 12:00: Vladimir Benko – Are Web Corpora Inferior? The Case of Czech and Slovak
- 12:00 – 12:30: Vit Suchomel – Removing Spam from Web Corpora Through Supervised Learning Using FastText
Lunch (12:30 – 13:30)
Main CMLC+BigNLP session I: National Corpora Talks (13:30 – 15:40)
- 13:30 – 13:40: Welcome and introduction
- 13:40 – 14:00: Andreas Dittrich (Academiae Corpora, Austrian Academy of Sciences) – Intra-connecting a small exemplary literary corpus with semantic web technologies for exploratory literary studies
- 14:00 – 14:20: John Kirk (Dresden University of Technology) and Anna Čermáková (Charles University, Prague) – An International Comparable Corpus
- 14:20 – 14:40: Dawn Knight (Cardiff University), Tess Fitzpatrick (Swansea University), Steve Morris (Swansea University), Jeremy Evas (Cardiff University), Paul Rayson (Lancaster University), Irena Spasic (Cardiff University), Mark Stonelake (Swansea University), Enlli Môn Thomas (Bangor University), Steven Neale (Cardiff University), Jennifer Needs (Swansea University), Scott Piao (Lancaster University), Mair Rees (Swansea University), Gareth Watkins (Cardiff University), Laurence Anthony (Waseda University), Thomas Michael Cobb (University of Quebec at Montreal), Margaret Deuchar (University of Cambridge), Kevin Donnelly (Freelance), Michael McCarthy (University of Nottingham), Kevin Scannell (Saint Louis University) – Creating CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – The National Corpus of Contemporary Welsh)
- 14:40 – 15:00: Marc Kupietz (IDS Mannheim), Piotr Bański (IDS Mannheim), Andreas Witt (University of Cologne, IDS Mannheim), Dan Tufiş (Institute for Artificial Intelligence Mihai Drăgănescu, Bucharest), Dan Cristea (Romanian Academy, Institute for Computer Science - Iaşi, “Alexandru Ioan Cuza” University of Iaşi), Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences) – EuReCo – Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research in Europe
Coffee break (15:00 – 15:20)
- 15:20 – 15:40: Harald Lüngen and Marc Kupietz (IDS Mannheim) – CMC Corpora in DeReKo
Main CMLC+BigNLP session II: Technology Talks (15:40 – 16:40)
- 15:40 – 16:00: David McClure, Mark Algee-Hewitt, Douris Steele, Erik Fredner and Hannah Walser (Stanford University) – Organizing corpora at the Stanford Literary Lab
- 16:00 – 16:20: Radoslav Rábara, Pavel Rychlý and Ondřej Herman (Lexical Computing) – Accelerating Corpus Search Using Multiple Cores
- 16:20 – 16:40: John Vidler (School of Computing and Communications, Lancaster University) and Stephen Wattam (Department of Linguistics and English Language, Lancaster University) – Keeping Properties with the Data: CL-MetaHeaders – An Open Standard
Wrap-up discussion (16:40 – 17:00)
Programme Committee
- Laurence Anthony (Waseda University, Japan)
- Alistair Baron (Lancaster University, UK)
- Felix Bildhauer (IDS Mannheim)
- Damir Ćavar (Indiana University, Bloomington)
- Matt Coole (Lancaster University, UK)
- Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
- Tomaž Erjavec (Jožef Stefan Institute)
- Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
- Johannes Graën (University of Zurich)
- Andrew Hardie (Lancaster University)
- Serge Heiden (ENS de Lyon)
- Miloš Jakubíček (Lexical Computing Ltd.)
- Dawn Knight (Cardiff University, UK)
- Michal Křen (Charles University, Prague)
- Sandra Kübler (Indiana University, Bloomington)
- Jochen Leidner (Thomson Reuters, UK)
- Rao Muhammad Adeel Nawab (COMSATS, Pakistan)
- Piotr Pęzik (University of Łódź)
- Laura Irina Rusu (IBM Australia)
- Roland Schäfer (FU Berlin)
- Roman Schneider (IDS Mannheim)
- Gandhi Sivakumar (IBM Australia)
- Irena Spasić (Cardiff University, UK)
- Marko Tadić (University of Zagreb, Faculty of Humanities and Social Sciences)
- Dan Tufiş (Romanian Academy, Bucharest)
- Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)
- Andreas Witt (University of Cologne, IDS Mannheim, University of Heidelberg)
- Amir Zeldes (Georgetown University, USA)
Joint Organising Committee
Institut für Deutsche Sprache, Mannheim
Piotr Bański, Marc Kupietz, Harald Lüngen
Institute for Corpus Linguistics and Text Technology, Vienna
Hanno Biber, Evelyn Breiteneder
Institute of Computational Linguistics, Zurich
Simon Clematide
Lancaster University, UK
John Mariani, Paul Rayson
Sheffield University, UK
Mark Stevenson
Acknowledgements
It is the second time that CMLC has used the EasyAbs abstract submission system offered at no cost by the Linguist List. We gratefully acknowledge this service. Some of us have donated to the LL fund drive over the years, and we consider that money well spent.
Contact
This page is located at http://corpora.ids-mannheim.de/cmlc-2017.html
The time-limited workshop mailing address is: cmlc+bignlp at INACTIVE.ids-mannheim.de