Challenges in the Management of Large Corpora (CMLC-6)

last updated: 2018-12-14 15:41 CET

(May 7th, 2018, Miyazaki; part of the LREC-2018 workshop structure)

Special Topic

Interoperability of corpus query and analysis systems

Description

Large corpora require careful design, licensing, collecting, cleaning, encoding, annotation, management, storage, retrieval, analysis, and curation to unfold their potential for a wide range of research questions and users, across a number of disciplines. Apart from the usual CMLC topics that fall into these areas, CMLC-6 will have a special focus on corpus query and analysis systems and specifically on goals concerning their interoperability.

In the past five years, a new generation of corpus query engines that overcome limitations on the number of tokens and annotation layers has started to emerge at several different places. While there seems to be a consensus that no single corpus tool can fulfill the needs of all communities and that a degree of heterogeneity is required, the time seems ripe to discuss whether further, unrestricted divergence should be avoided in order to allow for some interoperability and reusability, and how this can be achieved. The two most prominent areas where interoperability seems highly desirable are query languages and software components for corpus analysis. The former issue is already partially addressed by the proposed ISO standard Corpus Query Lingua Franca (CQLF). Components for corpus analysis, on the other hand, should in an ideal world be exchangeable and reusable across different platforms, not only to avoid redundancies, but also to foster replicability and a canonization of methodology in NLP and corpus linguistics.

The 6th edition of the workshop will devote much of its time to these issues, including an expert panel discussion with representatives of tool development teams and power users.

Proceedings

The CMLC-6 proceedings volume is available from the LREC workshops page.

Location

Room "Tenran" on the 4th floor of the Phoenix Seagaia Conference Center.

Programme

09.00 – 10.30 Session 1: Management and Search

Christoph Kuras, Thomas Eckart, Uwe Quasthoff and Dirk Goldhahn, "Automation, Management and Improvement of Text Corpus Production"
Thomas Krause, Ulf Leser, Anke Lüdeling and Stephan Druskat, "Designing a Re-Usable and Embeddable Corpus Search Library"
Radoslav Rábara, Pavel Rychlý and Ondřej Herman, "Distributed Corpus Search"
Adrien Barbaresi and Antonio Ruiz Tinoco, "Using Elasticsearch for linguistic analysis of tweets in time and space"

10.30 – 11.00 Coffee Break

11.00 – 12.00 Session 2: Query and Interoperability

Marc Kupietz, Nils Diewald and Peter Fankhauser, "How to Get the Computation Near the Data: Improving Data Accessibility to, and Re-Usability of Analysis Functions in Corpus Query Platforms"
Roman Schneider, "Example-Based Querying for Specialist Corpora"
Paul Rayson, "Increasing Interoperability for Embedding Corpus Annotation Pipelines in Wmatrix and other corpus retrieval tools"

12.00 – 13.00 Panel Discussion: Interoperability and Extensibility of Analysis Components in Corpus Query Tools

What are the most promising / important use cases? What kind of interoperability do we need for them?
What could be a simple / feasible approach to interoperability (i.e. what data formats and interfaces could be used)? How far does such an approach get us?

Important dates

Deadline for abstract submission: 14th of January
Notification of acceptance: 12th of February
Deadline for the submission of camera-ready papers: 23rd of February
Meeting: 7th of May, morning session (Room "Tenran" on the 4th floor of the Phoenix Seagaia Conference Center)

Identify, Describe and Share your LRs!

Describing your LRs in the LRE Map is now a normal practice in the submission procedure of LREC (introduced in 2010 and adopted by other conferences). To continue the efforts initiated at LREC 2014 about “Sharing LRs” (data, tools, web-services, etc.), authors will have the possibility, when submitting a paper, to upload LRs in a special LREC repository. This effort of sharing LRs, linked to the LRE Map for their description, may become a new “regular” feature for conferences in our field, thus contributing to creating a common repository where everyone can deposit and share data.

As scientific work requires accurate citations of referenced work so as to allow the community to understand the whole context and also replicate the experiments conducted by other researchers, LREC 2018 endorses the need to uniquely identify LRs through the use of the International Standard Language Resource Number (ISLRN, www.islrn.org), a Persistent Unique Identifier to be assigned to each Language Resource. The assignment of ISLRNs to LRs cited in LREC papers will be offered at submission time.

Programme Committee

Vladimír Benko (Slovak Academy of Sciences)
Felix Bildhauer (IDS Mannheim)
Hennie Brugman (Meertens Institute, Amsterdam)
Steve Cassidy (Macquarie University)
Dan Cristea ("Alexandru Ioan Cuza" University of Iasi)
Damir Ćavar (Indiana University, Bloomington)
Tomaž Erjavec (Jožef Stefan Institute)
Stefan Evert (Friedrich-Alexander-Universität Erlangen-Nürnberg)
Alexander Geyken (Berlin-Brandenburgische Akademie der Wissenschaften)
Andrew Hardie (Lancaster University)
Serge Heiden (ENS de Lyon)
Nancy Ide (Vassar College)
Miloš Jakubíček (Lexical Computing Ltd.)
Dawn Knight (Cardiff University, UK)
Michal Křen (Charles University, Prague)
Sandra Kübler (Indiana University, Bloomington)
Krister Lindén (University of Helsinki)
Anke Lüdeling (HU Berlin)
Uwe Quasthoff (Leipzig University)
Paul Rayson (Lancaster University)
Martin Reynaert (Tilburg University)
Laurent Romary (INRIA)
Roland Schäfer (FU Berlin)
Roman Schneider (IDS Mannheim)
Serge Sharoff (University of Leeds)
Ludovic Tanguy (University of Toulouse)
Dan Tufiş (Romanian Academy, Bucharest)
Tamás Váradi (Research Institute for Linguistics, Hungarian Academy of Sciences)
Pavel Vondřička (Charles University, Prague)
Amir Zeldes (Georgetown University)

Organizing Committee

Institut für Deutsche Sprache, Mannheim

Piotr Bański, Marc Kupietz, Andreas Witt

Academiae Corpora, Austrian Academy of Sciences, Vienna

Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder

Institute of Computational Linguistics, University of Zurich

Simon Clematide

Friedrich-Alexander-Universität Erlangen-Nürnberg

Stefan Evert

Homepage

The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html