Marc Kupietz, Nils Diewald, Eliza Margaretha, Helge Stallkamp & Franck Bodmer
Leibniz Institute for the German Language (IDS)

The Corpus Analysis Platform KorAP

its Philosophy, Features, Pros & Cons

Journée d'étude sur les outils d’exploration de corpus numériques, Paris 2022-06-17

Overview

Background, History & Philosophy
KorAP Goals & Approaches
Using KorAP
Using KorAP via Client Libraries
KorAP in contrastive Linguistics
Installing your own KorAP
Summary & Outlook

1. Background, History & Philosophy

Leibniz-Institut für Deutsche Sprache (IDS)

central scientific institution for the documentation and research of the German language in the present and recent history
founded in 1964
is one of the 96 institutes of the Leibniz Association
- financed jointly by the federal government and the state of Baden-Württemberg
227 employees (105 researchers + students + administration)
- (10 of which in the Corpus Linguistics programme area)
closely cooperates with Mannheim and Heidelberg University

Leibniz-Institut für Deutsche Sprache (IDS)
R5-building

Corpora and analysis tools at the IDS
long tradition (Teubert & Belica 2014)

1964

start of first corpus building projects

1969

first corpus published: Mannheimer Korpus I

2.2M words on punch cards

1983

1st query software REFER (Brückner 1989)

1992

1st analysis software online: COSMAS I

(al-Wadi 1994); IDS corpora: ca. 20M words

German Reference Corpus DeReKo
IDS corpora of contemporary written German (Kupietz et al. 2010, 2018)

construction started in 1964
aims to serve as an empirical basis for German linguistics
samples the current use of written German
- since ~1956
- is continually expanded
covers a broad range of text types
legally compliant through > 200 licence agreements

Primordial Sample Design
Most distinctive feature of DeReKo (Kupietz et al. 2010)

unlike other reference corpora, DeReKo does not strive for »balance«
- because »balance« like »representativeness« depends on the research question and the targeted language domain
researchers themselves should be able to draw stratified sub-samples (»virtual corpora«) from DeReKo
- that are as representative as possible with respect to their targeted language domain and research question

Distinction between Observation and Interpretation
Second most distinctive design principle (Belica et al. 2011)

e.g.: linguistic annotations are just interpretations
in case of interpretations, allow multiple ›opinions‹

DeReKo-growth since 2000
One more distinctive feature: Size

Big pile of corpus
on its own, not very useful

not readily interpretable linguistically
legally restricted, high-dimensional, opaquely structured, ...

If the data cannot move …
… pave ways to put the computation near the data (Gray 2003, Kupietz et al. 2010)

Research Tools that make DeReKo accessible
different tools for different purposes

Main general purpose platform: COSMAS II
http://cosmas2.ids-mannheim.de/ (Bodmer 1996)

Corpus Search, Management and Analysis System
designed in 1994
- EU-project MECOLB (with John Sinclair, …)
has > 40,000 users
very stable and lots of features, e.g. …

Morphological Assistant

Recursive Collocation Analysis
Which provides multi-unit syntagmatic patterns

Different result aggregations
By text type, topic domain, decade, place of publication, …

However

COSMAS II was designed back in 1994
underlying database, which manages gigabytes of data, is out of maintenance
limited to < 17G words (and less with annotations)
further developments increasingly expensive

What to do?

market survey in 2009: there is no corpus platform that …
- supports > 8G words
- supports multiple, potentially concurrent annotation layers
- is open source
conclusion:
- build a new corpus platform
funding opportunity:
- Risk R&D section of the “Leibniz Competition”

Background: Summary

huge, fast growing corpora
> 10k users from all areas of linguistics
methodological fundamentals:
- user definable virtual corpora
- theory neutrality and distinction between observation & interpretation
already working but hardly extensible query platform
absolute necessity to comply with IPR and licence terms, so as not to lose our good reputation and our text donors

2. KorAP Goals & Approaches

The KorAP project

project proposal was successful in the Risk-R&D-section of the Leibniz Competition
10 person-years (PY) of knock-on funding, 2011–2015
some additional funding (CLARIN, KobRA)
support by an increasing portion of the COSMAS-II project members

Main project aims

minimize the risk of total failure:
start with complementing COSMAS II
- there will be no single jack-of-all-trades corpus tool!
- rather many of them complementing each other
- also: corpus-driven background, but focus on corpus-based features
core-sustainability for >20 years
realistic extensibility with important features
always meet the requirements of a scientific tool

Sub-Goals and approaches
as of the project start (Bański et al. 2012)

support for an, in principle, unlimited amount of primary data and number of annotation levels
- by horizontal scalability
  (if the system gets too slow, just add another machine)
good support of user defined virtual corpora
easy integration of external developments
- it's impossible to develop all desired features alone
- ideally join forces with other corpus hosting institutions
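The idea behind horizontal scalability can be sketched as follows: the corpus is split into partitions, each partition answers the query on its own, and the partial results are merged. This is only a toy illustration under stated assumptions, not KorAP's actual implementation; the function names (`make_shards`, `count_hits`, `distributed_count`) are hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def make_shards(texts: List[str], n: int) -> List[List[str]]:
    """Distribute texts round-robin over n partitions (hypothetical helper)."""
    return [texts[i::n] for i in range(n)]


def count_hits(shard: List[str], term: str) -> int:
    """Count whitespace-separated tokens equal to `term` in one partition."""
    return sum(text.split().count(term) for text in shard)


def distributed_count(shards: List[List[str]], term: str) -> int:
    """Query every partition independently and add up the partial counts."""
    # Threads are a stand-in for machines; adding a shard adds a worker.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial_counts = list(pool.map(lambda s: count_hits(s, term), shards))
    return sum(partial_counts)


if __name__ == "__main__":
    corpus = ["das Haus", "das Auto das", "ein Haus"]
    for n in (1, 2, 3):
        print(n, distributed_count(make_shards(corpus, n), "das"))
```

Because counts are additive, the total does not depend on how many partitions are used. Operations such as sorting or sampling need a more elaborate merge step.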

General Architecture
Microservice Approach

Some more Central Ideas

support multiple query languages to reach out for different user communities
use “query rewriting” to handle fine-grained authorization
- more efficient than filtering query hits
- more transparent (traceable, replicable)
- backends and frontends can be developed without paying attention to authorization
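A minimal sketch of the query-rewriting idea, assuming a simplified request object: before a request reaches the backend, an access constraint is AND-ed onto the user's virtual corpus definition, and the change is recorded so that it stays traceable. The `Request` class, the `licence` field and the constraint syntax are hypothetical and do not reproduce KorAP's internal query representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    """A search request: a query plus a virtual corpus expression."""
    query: str
    corpus: str = ""                                    # user-chosen metadata constraint
    rewrites: List[str] = field(default_factory=list)   # log of injected changes


def rewrite_for_user(request: Request, allowed_licences: List[str]) -> Request:
    """Return a new request restricted to the licences the user may access."""
    if not allowed_licences:
        raise ValueError("user has no access rights; nothing to search")
    # Hypothetical constraint syntax: licence = A | licence = B
    constraint = " | ".join(f"licence = {lic}" for lic in sorted(allowed_licences))
    if request.corpus:
        corpus = f"({request.corpus}) & ({constraint})"
    else:
        corpus = constraint
    note = f"injected access constraint: {constraint}"
    return Request(request.query, corpus, request.rewrites + [note])


if __name__ == "__main__":
    original = Request(query="Optimierung", corpus="textType = Zeitung")
    rewritten = rewrite_for_user(original, ["public", "CC-BY-SA"])
    print(rewritten.corpus)
    print(rewritten.rewrites)
```

The backend only ever sees the rewritten request, so it never has to filter hits afterwards, and the rewrite log tells the user what was actually searched.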

Open source Project
BSD-licensed and available from https://github.com/KorAP and via our Gerrit code review

All contributions and issue reports are extremely welcome!

3. Using KorAP

Low-threshold entry UI
Information-on-demand approach (Diewald et al. 2020)

In principle, unlimited corpus size

Unbounded number of annotation layers
For DeReKo currently …

Definition of virtual corpora
➞ stratified sub-sampling based on text metadata (Bański et al. 2013)
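Conceptually, a virtual corpus is just the subset of texts whose metadata satisfy a condition. The following toy sketch shows this with made-up metadata records; the field names (`textType`, `pubYear`, `tokens`) and the helper functions are assumptions chosen for illustration, not KorAP's metadata schema or API.

```python
from typing import Callable, Dict, List

Text = Dict[str, object]

# Made-up text metadata for illustration only.
TEXTS: List[Text] = [
    {"sigle": "NEWS/A", "textType": "newspaper", "pubYear": 1998, "tokens": 1200},
    {"sigle": "NEWS/B", "textType": "newspaper", "pubYear": 2015, "tokens": 800},
    {"sigle": "FIC/A", "textType": "fiction", "pubYear": 2015, "tokens": 5000},
    {"sigle": "WIKI/A", "textType": "encyclopedia", "pubYear": 2015, "tokens": 3000},
]


def virtual_corpus(texts: List[Text], predicate: Callable[[Text], bool]) -> List[Text]:
    """Select all texts whose metadata satisfy the predicate."""
    return [t for t in texts if predicate(t)]


def size_in_tokens(texts: List[Text]) -> int:
    """Total number of tokens in a (virtual) corpus."""
    return sum(int(t["tokens"]) for t in texts)


if __name__ == "__main__":
    recent_news = virtual_corpus(
        TEXTS, lambda t: t["textType"] == "newspaper" and t["pubYear"] >= 2000
    )
    print([t["sigle"] for t in recent_news], size_in_tokens(recent_news))
```

Since the subset is only known at query time, its size has to be computed per request, which is also the denominator needed for relative frequencies.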

»Corpus by Match«
Definition of virtual corpora based on query hits (Kupietz et al. 2020)

Multiple Query Languages
(Bingel & Diewald 2015)

Poliqarp+
CQP dialect, developed for Polish National Corpus NKJP (Janus/Przepiórkowski 2007)

currently best supported QL, very similar to QLs of:
- IMS Open Corpus Workbench (CWB)
- (No)SketchEngine
differences (a.o.):
- syntax for regular expressions and verbatim strings
- span handling: CQP: "laufen" </base/s=s>;
  vs. Poliqarp+: endsWith(<base/s=s>, laufen)
plain CQP in development
QL discussion spin-off:
- ISO 24623 Corpus Query Lingua Franca CQLF (Bański et al. 2016, Evert et al. 2020)

Annotation queries supported by search assistant

»Query by Example/Match«
Learning complex queries without prior knowledge (Diewald et al. 2019)

Queries involving text-structural annotations
Example: Search ›Optimierung‹ only in sub-headings

Maximize Recall using concurrent annotations
Search for ›das‹ annotated as relative pronoun only by CoreNLP tools

Maximize Recall using concurrent annotations
Search for ›das‹ annotated as relative pronoun by CoreNLP OR TreeTagger

Search on morphological Annotations
Superlative adjective + superlative adjective + noun

using Query by Example
starting from »die ehrlichste, anständigste Anlageform« ??
- incorrectly annotated!
starting from »der höchste Preis«
[marmot/m=degree:sup & marmot/p=ADJA]{2} [tt/p=NN]
- apparently, a larger number of false negatives is to be expected!

Regular Expressions
Verb that starts with ›ver‹ or ›zer‹ and contains an ›ö‹ Umlaut (➞ Cheat-Sheet)
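The word-form part of this search can be tried out offline. The sketch below uses Python's `re` module, whose syntax may differ in details from the regular expressions accepted by KorAP's query languages; the restriction to verbs would come from a part-of-speech annotation and is not covered here.

```python
import re

# Word forms that start with "ver" or "zer" and contain an "ö" later on.
PATTERN = re.compile(r"(ver|zer).*ö.*")


def matches(word: str) -> bool:
    """True if the whole word form matches the pattern (case-sensitive)."""
    return PATTERN.fullmatch(word) is not None


if __name__ == "__main__":
    for word in ["verstören", "zerstören", "verhöhnen", "verlieren", "hören"]:
        print(word, matches(word))
```

Using `fullmatch` mirrors the usual behaviour of corpus query languages, where a regular expression has to match the entire token rather than a substring.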

Search on Constituency Annotations
using »spans«: NP that ends with postposed attributive adjective ›pur‹

Sentence Annotation, Negation, Focus
trying to find a postposed attributive adjective ›satt‹

Search on Dependency Annotations
Verb with ›Satz‹ as direct object (with ANNIS QL)

4. Using KorAP via Client Libraries

Goal: Make corpora as accessible as possible
by providing Multiple Levels of Access (Kupietz et al. 2022)

0

User Interface

the web user interface

1

Web Service API

accessible directly or via client libraries

2

Plugin

user interface plugins

3

Instance

independent access by fully customized installations

4

Open Source

new features by source code contributions

5

Corpus

direct access to corpus data (without KorAP)

Intended Properties of the levels
the higher the level, the cheaper and the more application instances (Kupietz et al. forthcoming)

Level 1: Web Service API

makes all backend functionality available
- all query languages
- virtual corpus definitions
- complex query expressions
- UI itself uses the API only
provides OAuth2 for authorized access to restricted data
offers access without authorization to numerical data that is not subject to copyright
accessible directly (➞ documentation)
and via client libraries for R and Python (Kupietz et al. 2020b)

API client libraries for R and Python
Application areas

complex, multipart queries
applications where reproducibility and replicability are required (with varied query or corpus base)
providing features that are not yet supported by KorAP's backend or UI, currently e.g.
- collocation analysis
- aggregation of search results
aim: make programmatic use as easy as possible
- in order to also pick up linguists who never coded before
- and to support close links between quantitative analysis and qualitative interpretation

Installation of KorAP’s R client library
in RStudio: Tools ➞ Install Packages ➞ RKorAPClient

Installation of KorAP’s Python client library

first install R client library, then
pip3 install KorAPClient

Basic Example: Frequency query
for ›Shop‹ and ›Boutique‹ in DeReKo newspapers' year slices ➞ data frame

Helper Functions for interactive Plots included
with CIs, and all data points linked to corresponding queries (Kupietz et al. 2017)

Example: Comparison between different VC
relative frequency of ›sozusagen‹ (= ›so to speak‹) in spoken FOLK corpus vs. DeReKo
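When frequencies from differently sized virtual corpora are compared, raw counts are normalized by corpus size, and a confidence interval shows how reliable each estimate is. The sketch below uses the Wilson score interval as one common choice; which interval the client libraries compute is not stated here, and the counts in the example are invented.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the proportion hits/total (z=1.96 ~ 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= hits <= total:
        raise ValueError("hits must lie between 0 and total")
    p = hits / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def per_million(proportion: float) -> float:
    """Convert a proportion into instances per million tokens."""
    return proportion * 1_000_000


if __name__ == "__main__":
    # Invented counts: (hits, corpus size in tokens) for two virtual corpora.
    for name, hits, size in [("spoken", 900, 2_000_000), ("written", 4_000, 400_000_000)]:
        low, high = wilson_interval(hits, size)
        print(name, per_million(hits / size), per_million(low), per_million(high))
```

Non-overlapping intervals are a first hint that a difference between two virtual corpora is not just sampling noise. They do not replace a closer look at the underlying hits.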

Build your own specialised UI with R & Shiny
Contrasting of orthographic variants with ›KoViT‹ (Fischer & Lang 2022)

Good starting point for other applications:
Run and modify one of the demos provided with the R-package

Collocation analysis
recursively applicable, with many options – but still slow

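library(RKorAPClient)

# Assumed setup (not shown on the slide): a connection to the English KorAP
# instance that hosts WPE15, and a user-defined stop word list.
kco <- new("KorAPConnection",
           KorAPUrl = "https://korap.ids-mannheim.de/instance/english")
STOPWORDS <- c("the", "a", "of", "and")  # placeholder list, adapt as needed

collocationAnalysis(
  kco,                           # KorAP connection (assumed, see above)
  "head",                        # node query
  vc = "corpusSigle=WPE15",      # virtual corpus definition
  lemmatizeNodeQuery = FALSE,    # search the node as given, without lemmatizing it
  minOccur = 5,                  # minimum absolute number of observed co-occurrences
  leftContextSize = 3,           # size of the left context window
  rightContextSize = 3,          # size of the right context window
  topCollocatesLimit = 40,       # limit analysis to the n most frequent collocates in the sample
  searchHitsSampleLimit = 10000, # limit the size of the search hits sample
  ignoreCollocateCase = TRUE,    # treat upper- and lower-case collocates alike
  withinSpan = "base/s=s",       # KorAP span specification for collocations to be searched within
  exactFrequencies = TRUE,       # retrieve exact co-occurrence frequencies
  stopwords = STOPWORDS,         # words not to be considered as collocates
  maxRecurse = 1,                # apply collocation analysis recursively maxRecurse times
  addExamples = TRUE             # add found instances of collocations
  # further options omitted
)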

Collocation analysis of ›head‹
based on English Wikipedia corpus WPE15

Diachronic Collocation Analysis
of ›Umwelt‹ (›environment‹)

5. KorAP in contrastive Linguistics

Contrastive applications are very relevant for KorAP
and vice-versa

to support contrastive corpus linguistics
to test the idea of dynamically definable, virtual comparable corpora
to join forces in the development of linguistic research software
ongoing initiatives:
- European Reference Corpus EuReCo
- International Comparable Corpus ICC (Kirk et al. 2017, Čermáková et al. 2021)
- WikiCorp (Poudat et al. i.p.)

European Reference Corpus EuReCo
(Kupietz et al. 2020, Kupietz et al. 2017b)

open initiative founded in 2013 by IDS and the academies in Poland, Romania and Hungary.
goal and idea:
- many dynamically definable comparable corpora based on existing large corpora, with:

Reference Corpus of Contemporary Romanian CoRoLa
https://korap.racai.ro/ (Cristea et al. 2019)

Comparable DE-RO-Corpus
➞ KorAP on DRuKoLa-VC ➞ KorAP on CoRoLa: https://korap.racai.ro/

Hungarian National Corpus HNC
https://korap.nlp.nytud.hu/ (Váradi 2002)

WIP: Polish National Corpus NKJP
currently only internally available http://10.0.10.58:8651/

French (WikiCorp)
only small Wikipedia sample: https://korap.ids-mannheim.de/instance/wikidemo

English
English Wikipedia Corpus WPE15: https://korap.ids-mannheim.de/instance/english

Czech
in development: Czech part of ICC

Current experiments: LVC comparison Romanian-German
using contrastive collocation analysis via RKorAPClient (Kupietz & Trawiński forthcoming)

<pune> în <NN> / CoRoLa
NN	logDice	EN (~DeepL)
pericol	11,16	Danger
aplicare	10,74	Application
mișcare	10,63	Move
discuție	10,07	Discussion
funcțiune	9,97	Function
evidență	9,64	Highlight
practică	8,95	Practice
executare	8,85	Version
scenă	8,81	Scene
Vânzare	8,51	Sale
circulație	8,44	Circulation
valoare	8,31	Value
slujba	8,24	Job
lumină	7,88	Light
vedere	7,26	View
discuția	7,11	Discussion
JOC	7,10	Game
libertate	7,04	Freedom
relație	6,87	Relationship
balanță	6,79	Balance
situația	6,55	Situation
borcane	6,48	Glasses
serviciul	6,41	Service
umbră	6,23	Shadow
legătură	6,20	Link
primejdie	6,13	Emergency
posesie	6,03	Possession
față	6,02	Face

in <NN> <setzen> / vc_drukola
<NN>	logDice
Gang	10,84
Szene	10,59
Brand	10,12
Kenntnis	9,55
Bewegung	9,44
Verbindung	9,16
Marsch	9,07
Kraft	8,41
Beziehung	7,80
Umlauf	7,70
Anführungszeichen	7,40
Flammen	6,59
Relation	6,39
Untersuchungshaft	6,38
Klammern	6,12
Betrieb	5,92
Stand	5,90
Erstaunen	5,75
Bezug	5,51
Vollzug	5,13
Anführungsstriche	5,06
Gänsefüßchen	4,74
Auslieferungshaft	4,42
Parallele	4,39
Vergleich	4,38
Verkehr	4,28
Pose	4,15
Positur	4,10
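The logDice scores in the two tables follow the usual definition of the measure: 14 plus the binary logarithm of the Dice coefficient of node and collocate. A minimal sketch with invented frequencies follows; the function name and the numbers are illustrative, not taken from the analyses above.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of node and collocate
    f_x:  frequency of the node
    f_y:  frequency of the collocate
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


if __name__ == "__main__":
    # Invented frequencies for illustration.
    print(round(log_dice(f_xy=1_200, f_x=50_000, f_y=8_000), 2))
```

The score has a theoretical maximum of 14, reached when the two words only ever occur together, and it does not depend on corpus size. This makes it convenient for comparing collocations across corpora of different sizes, such as CoRoLa and the DRuKoLa virtual corpus.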

Preliminary experiences
with contrastive DE-RO analyses using KorAP

when comparing syntagmatic patterns, the corpus composition itself can play a greater role than comparability
- dynamically definable corpora are key
corpus access via scripts seems essential
- for reproducible results which are essential not only in the iterative corpus refinements
- but for all analyses consisting of multiple steps
- for custom-tailored visualizations
practical comparability through a common tool seems more essential than ›theoretical‹ comparability

6. Installing your own KorAP

Now easy with KorAP-Docker

simply follow the instructions at
https://github.com/KorAP/KorAP-Docker
supports web UI, API, TEI conversion, indexing
not yet included: authorization-components

7. Summary & Outlook

KorAP Strengths

support for very large corpora
user definable virtual corpora
arbitrary number of morphosyntax, constituency and dependency annotations
arbitrary metadata
multiple, extensible query languages
query by example
client libraries for R and Python
sustainable standing FOSS project
- with 3.5 RSE FTEs

Important features not yet completed

horizontal scalability (Krawfish component)
- sorting and aggregation of query results
- sampling function for virtual corpus construction
- collocation analysis
- export/analysis of token frequency lists

General weaknesses

very limited support for spoken or multi-modal corpora
- no time axis, just one token axis
not primarily intended for personal installations
- now very easy using KorAP-Docker
statistical functions not as easily optimizable as in other corpus tools
- because of flexible virtual corpora, frequencies and association measures cannot be pre-calculated

Most important current issues

punctuation, emoticons, emojis are not queryable
no query/display of possible metadata values
virtual corpora:
- no overview of contents
- no UI for management, export, persistence, …
import filters for other corpus formats not well supported
- TEI P5, NoSketchEngine, CWB, TXM
  (in descending order of support)

Next steps

implement all missing features and fix all issues :)
focus: horizontal scaling component
- not so important for corpora with < 10G tokens
- but essential for DeReKo
- and code base for all missing quantitative functionalities

Thank you very much for your attention!

References

References I

al-Wadi, Doris (1994):

COSMAS – Ein Computersystem für den Zugriff auf Textkorpora. Version R.1.3-1. Benutzerhandbuch. Mit einem Geleitwort von Prof. Dr. Gerhard Stickel. XII/278 S. - Mannheim: Institut für deutsche Sprache, 1994. ISBN: 3-922641-42-3

Bański, Piotr/Fischer, Peter M./Frick, Elena/Ketzan, Erik/Kupietz, Marc/Schnober, Carsten/Schonefeld, Oliver/Witt, Andreas (2012):

The New IDS Corpus Analysis Platform: Challenges and Prospects. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). Istanbul, Turkey, May 2012. S. 2905-2911 - European Language Resources Association (ELRA), 2012.

Bański, Piotr/Frick, Elena/Hanl, Michael/Kupietz, Marc/Schnober, Carsten/Witt, Andreas (2013):

Robust corpus architecture: a new look at virtual collections and data access. In: Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, 2013. 23-25

Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013):

KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. S. 586-587 - Poznań: Fundacja Uniwersytetu im. A., 2013.

Bański, Piotr/Frick, Elena/Witt, Andreas (2016):

Corpus Query Lingua Franca (CQLF). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia: European Language Resources Association (ELRA). 2804–2809.

Belica, Cyril/Kupietz, Marc/Witt, Andreas/Lüngen, Harald (2011):

The Morphosyntactic Annotation of DeReKo: Interpretation, Opportunities, and Pitfalls. In: Konopka, Marek/Kubczak, Jacqueline/Mair, Christian/Šticha, František/Waßner, Ulrich Hermann (eds.): Grammar and Corpora 2009. Third International Conference. Mannheim, 22.-24.9.2009.

Bodmer, Franck (1996):

Aspekte der Abfragekomponente von COSMAS-II. LDV-INFO. Informationsschrift der Arbeitsstelle Linguistische Datenverarbeitung, 8:112–122.

Brückner, Tobias (1989):

REFER. Benutzerhandbuch. - Mannheim: Institut für deutsche Sprache, 1989. ISBN: 3-922641-35-0

Čermáková, A., Jantunen, J., Jauhiainen, T., Kirk, J., Křen, M., Kupietz, M., & Uí Dhonnchadha, E. (2021):

The International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora. Research in Corpus Linguistics, 9(1), 89-103.

Cosma, Ruxandra/Kupietz, Marc (2019):

On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române.

References II

Diewald, Nils/Barbu Mititelu, Verginica/Kupietz, Marc (2019):

The KorAP user interface. Accessing CoRoLa via KorAP. In: Cosma, Ruxandra/Kupietz, Marc (eds.): On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române, 2019. 265-277.

Evert, Stefan/Harlamov, Oleg/Heinrich, Philipp/Banski, Piotr (2020):

Corpus Query Lingua Franca Part II: Ontology. Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association: 3346–3352.

Fischer, Peter M. & Lang, Christian (2022):

"Kontrastierungs- und Visualisierungs-Tool" (KoViT). Unveröffentlichte Software. Version 0.1.19. Institut für Deutsche Sprache, Projekt Grammatische Ressourcen.

Gray, Jim (2003):

Distributed Computing Economics. Technical Report MSR-TR-2003-24, Microsoft Research.

Janus, Daniel/Przepiórkowski, Adam (2007):

Poliqarp: An open source corpus indexer and search engine with syntactic extensions. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. Prague, Czech Republic: Association for Computational Linguistics. S. 85–88. https://www.aclweb.org/anthology/P07-2022.

Kirk, John/Čermáková, Anna (2017):

From ICE to ICC: The new International Comparable Corpus. In Bański et al. (eds.): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section

Kupietz, Marc/Diewald, Nils/Trawiński, Beata/Cosma, Ruxandra/Cristea, Dan/Tufiş, Dan/Váradi, Tamás/Wöllstein, Angelika (2020):

Recent developments in the European Reference Corpus EuReCo. In Sylviane Granger & Marie-Aude Lefer (eds) Translating and Comparing Languages: Corpus-based Insights. Corpora and Language in Use Proceedings 6, Louvain-la-Neuve: Presses universitaires de Louvain,

Kupietz, Marc/Diewald, Nils/Hanl, Michael/Margaretha, Eliza (2017):

Möglichkeiten der Erforschung grammatischer Variation mithilfe von KorAP, der neuen Korpusanalyseplattform des IDS, In: Konopka, Marek/Wöllstein, Angelika (Hrsg.), Grammatische Variation. Empirische Zugänge und theoretische Modellierung, Proceedings of the Methodenmesse im Rahmen der Jahrestagung des Instituts für Deutsche Sprache. De Gruyter, 9. März 2016, Mannheim, Germany, S. 319–329.

Kupietz, Marc/Witt, Andreas/Bański, Piotr/Tufiş, Dan/Cristea, Dan/Váradi, Tamás (2017b):

EuReCo – Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research. In: Bański, Piotr/Kupietz, Marc/Lüngen, Harald/Rayson, Paul/Biber, Hanno/Breiteneder, Evelyn/Clematide, Simon/Mariani, John/Stevenson, Mark/Sick, Theresa (Hrsg.): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 24 July 2017. Mannheim: Institut für Deutsche Sprache, 2017. 15-19.

References III

Kupietz, Marc / Trawiński, Beata (forthcoming):

Neue Perspektiven für kontrastive Korpuslinguistik: Das Europäische Referenzkorpus EuReCo. In: Akten des XIV. Kongresses der Internationalen Vereinigung für Germanische Sprach- und Literaturwissenschaft (IVG). Peter Lang (to appear in 2022)

Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010):

The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta/Choukri, Khalid/Maegaard, Bente/Mariani, Joseph/Odijk, Jan/Piperidis, Stelios/Rosner, Mike/Tapias, Daniel (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). 1848-1854 - European Language Resources Association (ELRA)

Lüngen, Harald/Kupietz, Marc (2020):

IBK- und Social Media-Korpora am Leibniz-Institut für Deutsche Sprache. In: Marx, Konstanze/Lobin, Henning/Schmidt, Axel (eds.): Deutsch in Sozialen Medien. Interaktiv, multimodal, vielfältig. Jahrbuch des Instituts für Deutsche Sprache 2019. (= Jahrbuch des Instituts für Deutsche Sprache 2019). Berlin/Boston: de Gruyter. 319-344.

Margaretha, Eliza/Lüngen, Harald (2014):

Building Linguistic Corpora from Wikipedia Articles and Discussions. In: Beißwenger, Michael/Oostdijk, Nelleke/Storrer, Angelika/van den Heuvel, Henk (eds.): Building and Annotating Corpora of Computer-mediated Communication: Issues and Challenges at the Interface between Computational and Corpus Linguistics. S. 59-82 - Regensburg: GSCL, 2014.

Perkuhn, Rainer/Kupietz, Marc (2018):

Visualisierung als aufmerksamkeitsleitendes Instrument bei der Analyse sehr großer Korpora. In: Bubenhofer, Noah/Kupietz, Marc (eds.): Visualisierung sprachlicher Daten. Visual Linguistics – Praxis – Tools. Heidelberg: Heidelberg University Publishing, 2018. S. 63-90.

Poudat, Céline / Lüngen, Harald / Herzberg, Laura (eds.) (in preparation):

Wikipedia as Corpus. Linguistic corpus building, exploration and analysis. Benjamins

Teubert, Wolfgang/Belica, Cyril (2014):

Von der linguistischen Datenverarbeitung am IDS zur “Mannheimer Schule der Korpuslinguistik”. In: Institut für Deutsche Sprache (eds.): Ansichten und Einsichten. 50 Jahre Institut für Deutsche Sprache. Redaktion: Melanie Steinle, Franz Josef Berens. S. 298-319 - Mannheim: Institut für Deutsche Sprache, 2014.
