Marc Kupietz, Nils Diewald, Eliza Margaretha, Helge Stallkamp & Franck Bodmer
Leibniz Institute for the German Language (IDS)
Journée d'étude sur les outils d’exploration de corpus numériques, Paris 2022-06-17
central scientific institution for documenting and researching the German language in its present use and recent history
founded in 1964
is one of the 96 institutes of the Leibniz Association
financed jointly by the federal government and the state of Baden-Württemberg
227 employees (105 researchers + students + administration)
(10 of which in the Corpus Linguistics programme area)
closely cooperates with Mannheim and Heidelberg University
2.2M words on punch cards
(al-Wadi 1994) IDS-Corpora: ca. 20M words
construction started in 1964
aims to serve as an empirical basis for German linguistics
samples the current use of written German
since ~1956
is continually expanded
covers a broad range of text types
legally compliant through > 200 licence agreements
unlike other reference corpora, DeReKo does not strive for »balance«
because »balance« like »representativeness« depends on the research question and the targeted language domain
researchers themselves should be able to draw stratified sub-samples (»virtual corpora«) from DeReKo
that are as representative as possible with respect to their targeted language domain and research question
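The idea of drawing a stratified sub-sample can be sketched as follows. This is a minimal illustration only: the metadata fields `text_type` and `year`, the toy records, and the function `draw_virtual_corpus` are hypothetical and do not reflect DeReKo's actual metadata schema or any KorAP API.

```python
import random
from collections import defaultdict


def draw_virtual_corpus(docs, stratum_key, per_stratum,
                        predicate=lambda d: True, seed=42):
    """Return up to `per_stratum` randomly chosen documents per stratum.

    Only documents accepted by `predicate` are considered; a fixed seed
    keeps the sample reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        if predicate(doc):
            strata[doc[stratum_key]].append(doc)
    sample = []
    for key in sorted(strata):  # sorted for a deterministic order
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Invented toy metadata (hypothetical field names).
DOCS = [
    {"id": "d1", "text_type": "newspaper", "year": 1999},
    {"id": "d2", "text_type": "newspaper", "year": 2005},
    {"id": "d3", "text_type": "newspaper", "year": 2012},
    {"id": "d4", "text_type": "fiction", "year": 2003},
    {"id": "d5", "text_type": "fiction", "year": 2015},
    {"id": "d6", "text_type": "wikipedia", "year": 2015},
]

# One document per text type, restricted to texts from 2000 onwards.
vc = draw_virtual_corpus(DOCS, "text_type", per_stratum=1,
                         predicate=lambda d: d["year"] >= 2000)
print([d["id"] for d in vc])
```

Which strata and which predicate are appropriate depends on the research question, which is why the choice is left to the researcher.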
e.g.: linguistic annotations are just interpretations
in case of interpretations, allow multiple ›opinions‹
Corpus Search, Management and Analysis System
designed in 1994
EU-project MECOLB (with John Sinclair, …)
has > 40,000 users
very stable and lots of features, e.g. …
COSMAS II already designed in 1994
underlying database system, which manages gigabytes of data, is out of maintenance
limited to < 17G words (and less with annotations)
further developments increasingly expensive
market survey in 2009: there is no corpus platform that …
supports > 8G words
supports multiple, potentially concurrent annotation layers
is open source
conclusion:
build a new corpus platform
funding opportunity:
Risk R&D section of the “Leibniz Competition”
huge, fast growing corpora
> 10k users from all areas of linguistics
methodological fundamentals:
user definable virtual corpora
theory neutrality and distinction between observation & interpretation
already working but hardly extensible query platform
absolute necessity to act IPR- and licence-compliant in order not to lose good reputation and text donors
project proposal was successful in the Risk R&D section of the Leibniz Competition
10 PY knock-on funding 2011 – 2015
some additional funding (CLARIN, KobRA)
support by an increasing portion of the COSMAS-II project members
minimize the risk of total failure:
start with complementing COSMAS II
there will be no single jack-of-all-trades corpus tool!
rather many of them complementing each other
also: corpus-driven background, but focus on corpus-based features
core-sustainability for >20 years
realistic extensibility with important features
always meet the requirements of a scientific tool
support for an, in principle, unlimited amount of primary data and number of annotation levels
by horizontal scalability
(if the system gets too slow, just add another machine)
good support of user defined virtual corpora
easy integration of external developments
it's impossible to develop all desired features alone
ideally join forces with other corpus hosting institutions
support multiple query languages to reach out for different user communities
use “query rewriting” to handle fine-grained authorization
more efficient than filtering query hits
more transparent (traceable, replicable)
backends and frontends can be developed without paying attention to authorization
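The query-rewriting idea above can be sketched as follows: instead of filtering hits after the search, an access constraint is conjoined to the virtual corpus definition before the query reaches the backend, and the rewrite is recorded so that it stays traceable. The function name, the `availability` field and the exact constraint syntax are assumptions made for illustration, not KorAP's actual implementation.

```python
def rewrite_for_user(vc_query, allowed_licences):
    """Conjoin an access constraint to a virtual corpus definition.

    Returns the rewritten definition plus a record of what was changed,
    so the rewrite remains transparent to the user.
    """
    if not allowed_licences:
        raise ValueError("user has no access rights")
    # Hypothetical constraint syntax: one alternative per permitted licence.
    constraint = " | ".join(
        f"availability = {lic}" for lic in sorted(allowed_licences)
    )
    if vc_query.strip():
        rewritten = f"({vc_query}) & ({constraint})"
    else:
        rewritten = constraint
    record = {
        "original": vc_query,
        "rewritten": rewritten,
        "reason": "access restriction injected",
    }
    return rewritten, record


rewritten, record = rewrite_for_user("corpusSigle=WPE15", ["CC-BY-SA"])
print(rewritten)
```

Because the constraint is part of the query itself, the backend never has to know who is asking, and the rewritten query can be rerun later to replicate a result.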
All contributions and issue reports are extremely welcome!
currently best supported QL, very similar to QLs of:
differences (among others):
syntax for regular expressions and verbatim strings
span handling: CQP: "laufen" </base/s=s>;
vs. Poliqarp+: endsWith(<base/s=s>, laufen)
plain CQP in development
QL discussion spin-off:
ISO 24623 Corpus Query Lingua Franca CQLF (Bański et al. 2016, Evert et al. 2020)
using Query by Example
starting from »die ehrlichste, anständigste Anlageform« ??
incorrectly annotated!
starting from »der höchste Preis«
[marmot/m=degree:sup & marmot/p=ADJA]{2} [tt/p=NN]
apparently a larger number of false negatives is to be expected!
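How a query might be generalized from an example can be sketched as follows. Each token carries annotations from several foundries, keyed as `foundry/layer`, and the user picks which of them to keep. The annotation values below are invented for illustration, and the helper functions are hypothetical, not part of KorAP.

```python
def token_constraint(annotations, keep):
    """Build one token expression from the selected annotation keys."""
    parts = [f"{key}={annotations[key]}" for key in keep if key in annotations]
    return "[" + " & ".join(parts) + "]"


def query_from_example(tokens, keep_per_token):
    """Generalize an annotated example into a token-sequence query.

    Adjacent identical token expressions are collapsed into a
    repetition such as `[...]{2}`.
    """
    runs = []  # list of [expression, repetition count]
    for annotations, keep in zip(tokens, keep_per_token):
        expr = token_constraint(annotations, keep)
        if runs and runs[-1][0] == expr:
            runs[-1][1] += 1
        else:
            runs.append([expr, 1])
    return " ".join(e if n == 1 else f"{e}{{{n}}}" for e, n in runs)


# Invented annotations for "ehrlichste", "anständigste", "Anlageform"
# (the comma is left out here).
example = [
    {"marmot/m": "degree:sup", "marmot/p": "ADJA", "tt/p": "ADJA"},
    {"marmot/m": "degree:sup", "marmot/p": "ADJA", "tt/p": "ADJA"},
    {"marmot/p": "NN", "tt/p": "NN"},
]
keep = [["marmot/m", "marmot/p"], ["marmot/m", "marmot/p"], ["tt/p"]]

query = query_from_example(example, keep)
print(query)  # [marmot/m=degree:sup & marmot/p=ADJA]{2} [tt/p=NN]
```

A query built this way is only as good as the annotations of the starting example: if the example itself is mis-tagged, the generalized query inherits the error.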
the web user interface
accessible directly or via client libraries
user interface plugins
independent access by fully customized installations
new features by source code contributions
direct access to corpus data (without KorAP)
makes all backend functionality available
all query languages
virtual corpus definitions
complex query expressions
UI itself uses the API only
provides OAuth2 for authorized access to restricted data
offers access without authorization to non-copyrighted numerical data
accessible directly (➞ documentation)
and via client libraries for R and Python (Kupietz et al. 2020b)
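For readers who want to call the API directly, the following sketch only assembles a search request and does not send it. The base URL, the `search` path and the parameter names `q`, `ql`, `cq` and `count` are assumptions that should be checked against the API documentation; the helper functions are hypothetical. The `Authorization: Bearer` header is the usual way to present an OAuth2 access token.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base URL of the public KorAP instance; verify before use.
API_BASE = "https://korap.ids-mannheim.de/api/v1.0"


def build_search_url(query, query_language="poliqarp",
                     virtual_corpus=None, count=25, base=API_BASE):
    """Assemble a search URL (parameter names are assumptions)."""
    params = {"q": query, "ql": query_language, "count": count}
    if virtual_corpus:
        params["cq"] = virtual_corpus  # virtual corpus definition
    return f"{base}/search?{urlencode(params)}"


def bearer_headers(access_token):
    """HTTP headers for a request authorized with an OAuth2 access token."""
    return {"Authorization": f"Bearer {access_token}"}


url = build_search_url("[tt/p=NN]", virtual_corpus="corpusSigle=WPE15")
print(url)
print(parse_qs(urlsplit(url).query))
```

In practice the R and Python client libraries hide these details, so building URLs by hand is mainly useful for debugging or for languages without a client library.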
complex, multipart queries
applications where reproducibility and replicability are required (with varied query or corpus base)
providing features that are not yet supported by KorAP's backend or UI, currently e.g.
collocation analysis
aggregation of search results
aim: make programmatic use as easy as possible
in order to also pick up linguists who never coded before
and to support close links between quantitative analysis and qualitative interpretation
first install R client library, then
pip3 install KorAPClient
collocationAnalysis(
  "head",
  vc = "corpusSigle=WPE15",        # virtual corpus definition
  lemmatizeNodeQuery = FALSE,
  minOccur = 5,                    # minimum absolute number of observed co-occurrences
  leftContextSize = 3,             # size of the left context window
  rightContextSize = 3,            # size of the right context window
  topCollocatesLimit = 40,         # limit analysis to the n most frequent collocates in the sample
  searchHitsSampleLimit = 10000,   # limit the size of the search hits sample
  ignoreCollocateCase = TRUE,
  withinSpan = "base/s=s",         # KorAP span specification for collocations to be searched within
  exactFrequencies = TRUE,         # retrieve exact co-occurrence frequencies
  stopwords = STOPWORDS,           # words not to be considered as collocates
  maxRecurse = 1,                  # apply collocation analysis recursively maxRecurse times
  addExamples = TRUE,              # add found instances of collocations
  ...
)
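To make the window parameters concrete, here is a minimal, self-contained sketch of the counting step behind a collocation analysis. It works on invented, pre-tokenized sentences rather than on KorAP results; the function and its parameter names are hypothetical and only loosely mirror the options listed above (context sizes, stopwords, case folding of collocates, and the sentence as the span within which collocates are searched).

```python
from collections import Counter


def cooccurrences(sentences, node, left=3, right=3,
                  stopwords=frozenset(), ignore_collocate_case=True):
    """Count collocates of `node` within a window, sentence by sentence.

    The window never crosses a sentence boundary. `stopwords` are
    expected in lower case when case folding is switched on.
    """
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != node:
                continue
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            for collocate in window:
                if ignore_collocate_case:
                    collocate = collocate.lower()
                if collocate not in stopwords:
                    counts[collocate] += 1
    return counts


# Invented toy sentences.
SENTENCES = [
    ["She", "shook", "her", "head", "in", "disbelief"],
    ["The", "head", "of", "the", "department", "shook", "hands"],
]

counts = cooccurrences(SENTENCES, "head", left=3, right=3,
                       stopwords={"the", "of", "her", "in"})
print(counts.most_common())
```

In the second sentence "shook" lies four tokens to the right of the node and is therefore not counted, which shows how strongly the window size shapes the result.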
to support contrastive corpus linguistics
to test the idea of dynamically definable, virtual comparable corpora
to join forces in the development of linguistic research software
ongoing initiatives:
European Reference Corpus EuReCo
International Comparable Corpus ICC (Kirk et al. 2017, Čermáková et al. 2021).
WikiCorp (Poudat et al. i.p.)
open initiative founded in 2013 by IDS and the academies in Poland, Romania and Hungary.
goal and idea:
many dynamically definable comparable corpora based on existing large corpora, with:
<pune> în <NN> / CoRoLa

NN | logDice | EN (~DeepL) |
---|---|---|
pericol | 11,16 | Danger |
aplicare | 10,74 | Application |
mișcare | 10,63 | Move |
discuție | 10,07 | Discussion |
funcțiune | 9,97 | Function |
evidență | 9,64 | Highlight |
practică | 8,95 | Practice |
executare | 8,85 | Version |
scenă | 8,81 | Scene |
Vânzare | 8,51 | Sale |
circulație | 8,44 | Circulation |
valoare | 8,31 | Value |
slujba | 8,24 | Job |
lumină | 7,88 | Light |
vedere | 7,26 | View |
discuția | 7,11 | Discussion |
JOC | 7,10 | Game |
libertate | 7,04 | Freedom |
relație | 6,87 | Relationship |
balanță | 6,79 | Balance |
situația | 6,55 | Situation |
borcane | 6,48 | Glasses |
serviciul | 6,41 | Service |
umbră | 6,23 | Shadow |
legătură | 6,20 | Link |
primejdie | 6,13 | Emergency |
posesie | 6,03 | Possession |
față | 6,02 | Face |
in <NN> <setzen> / vc_drukola

<NN> | logDice |
---|---|
Gang | 10,84 |
Szene | 10,59 |
Brand | 10,12 |
Kenntnis | 9,55 |
Bewegung | 9,44 |
Verbindung | 9,16 |
Marsch | 9,07 |
Kraft | 8,41 |
Beziehung | 7,80 |
Umlauf | 7,70 |
Anführungszeichen | 7,40 |
Flammen | 6,59 |
Relation | 6,39 |
Untersuchungshaft | 6,38 |
Klammern | 6,12 |
Betrieb | 5,92 |
Stand | 5,90 |
Erstaunen | 5,75 |
Bezug | 5,51 |
Vollzug | 5,13 |
Anführungsstriche | 5,06 |
Gänsefüßchen | 4,74 |
Auslieferungshaft | 4,42 |
Parallele | 4,39 |
Vergleich | 4,38 |
Verkehr | 4,28 |
Pose | 4,15 |
Positur | 4,10 |
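
The two tables rank collocates by logDice. As a minimal sketch, assuming the standard definition logDice = 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency and f_x, f_y are the individual frequencies, the score can be computed as below. The frequencies in the example are invented, and the function is ours, not part of KorAP or its client libraries.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score; its theoretical maximum is 14."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented frequencies: the pair co-occurs 1,200 times, the two words
# occur 50,000 and 9,000 times overall in the (virtual) corpus.
score = log_dice(1200, 50000, 9000)
print(round(score, 2))
```

Since the score depends only on the three frequencies and not on corpus size, values from differently sized corpora are easier to compare than with many other association measures, although the corpus composition still affects the frequencies themselves.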
when comparing syntagmatic patterns, the corpus composition itself can play a greater role than comparability
dynamically definable corpora are key
corpus access via scripts seems essential
for reproducible results which are essential not only in the iterative corpus refinements
but for all analyses consisting of multiple steps
for custom-tailored visualizations
practical comparability through a common tool seems more essential than ›theoretical‹ comparability
simply follow the instructions at
https://github.com/KorAP/KorAP-Docker
supports web UI, API, TEI conversion, indexing
not yet included: authorization components
support for very large corpora
user definable virtual corpora
arbitrary number of morphosyntax, constituency and dependency annotations
arbitrary metadata
multiple, extensible query languages
query by example
client libraries for R and Python
sustainable standing FOSS project
with 3.5 RSE FTEs
horizontal scalability (Krawfish component)
sorting and aggregation of query results
sampling function for virtual corpus construction
collocation analysis
export/analysis of token frequency lists
very limited support for spoken or multi-modal corpora
no time axis, just one token axis
not primarily intended for personal installations
now very easy using KorAP-Docker
statistical functions not as easily optimizable as in other corpus tools
because of flexible virtual corpora, frequencies and association measures cannot be pre-calculated
punctuation, emoticons, emojis are not queryable
no query/display of possible metadata values
virtual corpora:
no overview over contents
no UI for management, export, persistence, …
import filters for other corpus formats not well supported
TEI P5, NoSketchEngine, CWB, TXM
(in descending order of support)
implement all missing features and fix all issues :)
focus: horizontal scaling component
not so important for corpora with < 10G tokens
but essential for DeReKo
and code base for all missing quantitative functionalities
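Why horizontal scaling and quantitative functions belong together can be sketched as follows: because virtual corpora are defined at query time, frequencies have to be counted on demand, and this counting can be split across shards and merged afterwards. The sketch uses threads in place of machines and invented toy documents; it is an illustration of the general scatter-and-merge pattern, not a description of the Krawfish component.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_shards(docs, n):
    """Distribute documents round-robin over n shards."""
    return [docs[i::n] for i in range(n)]


def count_on_shard(docs, in_virtual_corpus):
    """Token frequencies for the documents of one shard that are in the VC."""
    counts = Counter()
    for doc in docs:
        if in_virtual_corpus(doc):
            counts.update(doc["tokens"])
    return counts


def distributed_frequencies(docs, in_virtual_corpus, n_shards):
    """Count per shard in parallel, then merge the partial results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partials = pool.map(
            lambda shard: count_on_shard(shard, in_virtual_corpus),
            make_shards(docs, n_shards),
        )
        for partial in partials:
            total.update(partial)  # adds counts, does not overwrite
    return total


# Invented toy documents.
DOCS = [
    {"year": 2001, "tokens": ["in", "Gang", "setzen"]},
    {"year": 2010, "tokens": ["in", "Brand", "setzen"]},
    {"year": 2015, "tokens": ["in", "Gang", "setzen"]},
    {"year": 1995, "tokens": ["in", "Szene", "setzen"]},
]

since_2000 = lambda doc: doc["year"] >= 2000
freqs = distributed_frequencies(DOCS, since_2000, n_shards=3)
print(freqs["Gang"], freqs["Szene"])
```

The merged result is the same for any number of shards, so adding a machine changes only the speed, not the numbers.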
al-Wadi, Doris (1994):
COSMAS – Ein Computersystem für den Zugriff auf Textkorpora. Version R.1.3-1. Benutzerhandbuch. Mit einem Geleitwort von Prof. Dr. Gerhard Stickel. XII/278 S. - Mannheim: Institut für deutsche Sprache, 1994. ISBN: 3-922641-42-3
Bański, Piotr/Fischer, Peter M./Frick, Elena/Ketzan, Erik/Kupietz, Marc/Schnober, Carsten/Schonefeld, Oliver/Witt, Andreas (2012):
The New IDS Corpus Analysis Platform: Challenges and Prospects. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). Istanbul, Turkey, May 2012. S. 2905-2911 - European Language Resources Association (ELRA), 2012.
Bański, Piotr/Frick, Elena/Hanl, Michael/Kupietz, Marc/Schnober, Carsten/Witt, Andreas (2013):
Robust corpus architecture: a new look at virtual collections and data access. In: Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, 2013. 23-25
Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013):
KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. S. 586-587 - Poznań: Fundacja Uniwersytetu im. A., 2013.
Bański, Piotr/Frick, Elena/Witt, Andreas (2016):
Corpus Query Lingua Franca (CQLF). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia: European Language Resources Association (ELRA). 2804–2809.
Belica, Cyril/Kupietz, Marc/Witt, Andreas/Lüngen, Harald (2011):
The Morphosyntactic Annotation of DeReKo: Interpretation, Opportunities, and Pitfalls. In: Konopka, Marek/Kubczak, Jacqueline/Mair, Christian/Šticha, František/Waßner, Ulrich Hermann (eds.): Grammar and Corpora 2009. Third International Conference. Mannheim, 22.-24.9.2009.
Bodmer, Franck (1996):
Aspekte der Abfragekomponente von COSMAS-II. LDV-INFO. Informationsschrift der Arbeitsstelle Linguistische Datenverarbeitung, 8:112–122.
Brückner, Tobias (1989):
REFER. Benutzerhandbuch. - Mannheim: Institut für deutsche Sprache, 1989. ISBN: 3-922641-35-0
Čermáková, A., Jantunen, J., Jauhiainen, T., Kirk, J., Křen, M., Kupietz, M., & Uí Dhonnchadha, E. (2021):
The International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora. Research in Corpus Linguistics, 9(1), 89-103.
Cosma, Ruxandra/Kupietz, Marc (2019):
On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române.
Diewald, Nils/Barbu Mititelu, Verginica/Kupietz, Marc (2019):
The KorAP user interface. Accessing CoRoLa via KorAP. In: Cosma, Ruxandra/Kupietz, Marc (eds.): On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române, 2019. 265-277.
Evert, Stefan/Harlamov, Oleg/Heinrich, Philipp/Banski, Piotr (2020):
Corpus Query Lingua Franca Part II: Ontology. Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association: 3346–3352.
Fischer, Peter M. & Lang, Christian (2022):
"Kontrastierungs- und Visualisierungs-Tool" (KoViT). Unpublished software. Version 0.1.19. Institut für Deutsche Sprache, Projekt Grammatische Ressourcen.
Gray, Jim (2003):
Distributed Computing Economics. Technical Report MSR-TR-2003-24, Microsoft Research.
Janus, Daniel/Przepiórkowski, Adam (2007):
Poliqarp: An open source corpus indexer and search engine with syntactic extensions. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. Prague, Czech Republic: Association for Computational Linguistics. S. 85–88. https://www.aclweb.org/anthology/P07-2022.
Kirk, John/Čermáková, Anna (2017):
From ICE to ICC: The new International Comparable Corpus. In Bański et al. (eds.): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section
Kupietz, Marc/Diewald, Nils/Trawiński, Beata/Cosma, Ruxandra/Cristea, Dan/Tufiş, Dan/Váradi, Tamás/Wöllstein, Angelika (2020):
Recent developments in the European Reference Corpus EuReCo. In Sylviane Granger & Marie-Aude Lefer (eds) Translating and Comparing Languages: Corpus-based Insights. Corpora and Language in Use Proceedings 6, Louvain-la-Neuve: Presses universitaires de Louvain.
Kupietz, Marc/Diewald, Nils/Hanl, Michael/Margaretha, Eliza (2017):
Möglichkeiten der Erforschung grammatischer Variation mithilfe von KorAP, der neuen Korpusanalyseplattform des IDS. In: Konopka, Marek/Wöllstein, Angelika (Hrsg.), Grammatische Variation. Empirische Zugänge und theoretische Modellierung, Proceedings of the Methodenmesse im Rahmen der Jahrestagung des Instituts für Deutsche Sprache. De Gruyter, 9. März 2016, Mannheim, Germany, S. 319–329.
Kupietz, Marc/Witt, Andreas/Bański, Piotr/Tufiş, Dan/Cristea, Dan/Váradi, Tamás (2017b):
EuReCo – Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research. In: Bański, Piotr/Kupietz, Marc/Lüngen, Harald/Rayson, Paul/Biber, Hanno/Breiteneder, Evelyn/Clematide, Simon/Mariani, John/Stevenson, Mark/Sick, Theresa (Hrsg.): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 24 July 2017. Mannheim: Institut für Deutsche Sprache, 2017. 15-19.
Kupietz, Marc / Trawiński, Beata (forthcoming):
Neue Perspektiven für kontrastive Korpuslinguistik: Das Europäische Referenzkorpus EuReCo. In: Akten des XIV. Kongresses der Internationalen Vereinigung für Germanische Sprach- und Literaturwissenschaft (IVG). Peter Lang (to appear in 2022)
Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010):
The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta/Choukri, Khalid/Maegaard, Bente/Mariani, Joseph/Odijk, Jan/Piperidis, Stelios/Rosner, Mike/Tapias, Daniel (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). 1848-1854 - European Language Resources Association (ELRA)
Lüngen, Harald/Kupietz, Marc (2020):
IBK- und Social Media-Korpora am Leibniz-Institut für Deutsche Sprache. In: Marx, Konstanze/Lobin, Henning/Schmidt, Axel (eds.): Deutsch in Sozialen Medien. Interaktiv, multimodal, vielfältig. Jahrbuch des Instituts für Deutsche Sprache 2019. (= Jahrbuch des Instituts für Deutsche Sprache 2019). Berlin/Boston: de Gruyter. 319-344.
Margaretha, Eliza/Lüngen, Harald (2014):
Building Linguistic Corpora from Wikipedia Articles and Discussions. In: Beißwenger, Michael/Oostdijk, Nelleke/Storrer, Angelika/van den Heuvel, Henk (eds.): Building and Annotating Corpora of Computer-mediated Communication: Issues and Challenges at the Interface between Computational and Corpus Linguistics. S. 59-82 - Regensburg: GSCL, 2014.
Perkuhn, Rainer/Kupietz, Marc (2018):
Visualisierung als aufmerksamkeitsleitendes Instrument bei der Analyse sehr großer Korpora. In: Bubenhofer, Noah/Kupietz, Marc (eds.): Visualisierung sprachlicher Daten. Visual Linguistics – Praxis – Tools. Heidelberg: Heidelberg University Publishing, 2018. S. 63-90.
Poudat, Céline / Lüngen, Harald / Herzberg, Laura (eds.) (in preparation):
Wikipedia as Corpus. Linguistic corpus building, exploration and analysis. Benjamins
Teubert, Wolfgang/Belica, Cyril (2014):
Von der linguistischen Datenverarbeitung am IDS zur “Mannheimer Schule der Korpuslinguistik”. In: Institut für Deutsche Sprache (eds.): Ansichten und Einsichten. 50 Jahre Institut für Deutsche Sprache. Redaktion: Melanie Steinle, Franz Josef Berens. S. 298-319 - Mannheim: Institut für Deutsche Sprache, 2014.