Marc Kupietz, Nils Diewald, Eliza Margaretha, Helge Stallkamp & Franck Bodmer
Leibniz-Institute for the German Language (IDS)

The Corpus Analysis Platform KorAP

its Philosophy, Features, Pros & Cons

Journée d'étude sur les outils d’exploration de corpus numériques, Paris 2022-06-17


  1. Background, History & Philosophy

  2. KorAP Goals & Approaches

  3. Using KorAP

  4. Using KorAP via Client Libraries

  5. KorAP in contrastive Linguistics

  6. Installing your own KorAP

  7. Summary & Outlook

1. Background, History & Philosophy

Leibniz-Institut für Deutsche Sprache (IDS)

  • central scientific institution for the documentation and research of the German language in the present and recent history

  • founded in 1964

  • is one of the 96 institutes of the Leibniz Association

    • financed jointly by the federal government and the state of Baden-Württemberg

  • 227 employees (105 researchers + students + administration)

    • (10 of which in the Corpus Linguistics programme area)

  • closely cooperates with Mannheim and Heidelberg University



Corpora and analysis tools at the IDS
long tradition (Teubert & Belica 2014)

start of first corpus building projects
First corpus published: Mannheimer Korpus I
  • 2.2M words on punch cards

1st query software REFER (Brückner 1989)
1st analysis software online: COSMAS I
  • (al-Wadi 1994) IDS corpora: ca. 20M words

German Reference Corpus DeReKo
IDS corpora of contemporary written German (Kupietz et al. 2010, 2018)

  • construction started in 1964

  • aims to serve as an empirical basis for German linguistics

  • samples the current use of written German

    • since ~1956

    • is continually expanded

  • covers a broad range of text types

  • legally compliant through > 200 licence agreements

Primordial Sample Design
Most distinctive feature of DeReKo (Kupietz et al. 2010)

  • unlike other reference corpora, DeReKo does not strive for »balance«

    • because »balance« like »representativeness« depends on the research question and the targeted language domain

  • researchers themselves should be able to draw stratified sub-samples (»virtual corpora«) from DeReKo

    • that are as representative as possible with respect to their targeted language domain and research question
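
A minimal Python sketch of drawing a stratified sub-sample ("virtual corpus") from text metadata. The metadata fields (`text_type`, `year`), the toy records and the function name `draw_virtual_corpus` are assumptions made for illustration; they are not DeReKo's metadata schema or part of any KorAP API.

```python
import random
from collections import defaultdict

def draw_virtual_corpus(texts, stratum_key, per_stratum, seed=42):
    """Draw up to `per_stratum` texts from each stratum defined by `stratum_key`."""
    rng = random.Random(seed)          # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for text in texts:
        strata[stratum_key(text)].append(text)
    sample = []
    for key in sorted(strata):         # sorted for a deterministic stratum order
        members = strata[key]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Toy metadata records (hypothetical fields and values).
texts = [
    {"id": f"T{i}", "text_type": tt, "year": 2000 + i % 3}
    for i, tt in enumerate(["newspaper"] * 6 + ["fiction"] * 3 + ["science"] * 1)
]

# Stratify by text type so that no single type dominates the sub-sample.
vc = draw_virtual_corpus(texts, lambda t: t["text_type"], per_stratum=2)
print([(t["id"], t["text_type"]) for t in vc])
```

Which strata are appropriate (text type, decade, region, …) depends on the research question, which is the reason the choice is left to the researcher.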

Distinction between Observation and Interpretation
Second most distinctive design principle (Belica et al. 2011)

  • e.g.: linguistic annotations are just interpretations

  • in case of interpretations, allow multiple ›opinions‹

DeReKo-growth since 2000
One more distinctive feature: Size

Big pile of corpus
on its own, not very useful


not readily interpretable linguistically
legally restricted, high-dimensional, opaquely structured, ...


If the data cannot move …
… pave ways to put the computation near the data (Gray 2003, Kupietz et al. 2010)


Research Tools that make DeReKo accessible
different tools for different purposes


Main general purpose platform: COSMAS II

  • Corpus Search, Management and Analysis System

  • designed in 1994

    • EU-project MECOLB (with John Sinclair, …)

  • has > 40,000 users

  • very stable and lots of features, e.g. …

Morphological Assistant


Recursive Collocation Analysis
Which provides multi-unit syntagmatic patterns


Different result aggregations
By text type, topic domain, decade, place of publication, …



  • COSMAS II already designed in 1994

  • underlying database managing gigabytes out of maintenance

  • limited to < 17G words (and less with annotations)

  • further developments increasingly expensive

What to do?

  • market survey in 2009: there is no corpus platform that …

    • supports > 8G words

    • supports multiple, potentially concurrent annotation layers

    • is open source

  • conclusion:

    • build a new corpus platform

  • funding opportunity:

    • Risk R&D section of the “Leibniz Competition”

Background: Summary

  • huge, fast growing corpora

  • > 10k users from all areas of linguistics

  • methodological fundamentals:

    • user definable virtual corpora

    • theory neutrality and distinction between observation & interpretation

  • already working but hardly extensible query platform

  • absolute necessity to act IPR- and licence-compliant in order not to lose good reputation and text donors

2. KorAP Goals & Approaches

The KorAP project

  • project proposal was successful in the Risk-R&D-section of the Leibniz Competition

  • 10 person-years (PY) of knock-on funding, 2011 – 2015

  • some additional funding (CLARIN, KobRA)

  • support by an increasing portion of the COSMAS-II project members

Main project aims

  • minimize the risk of total failure:

  • start with complementing COSMAS II

    • there will be no single jack of all trades corpus tool!

    • rather many of them complementing each other

    • also: corpus-driven background, but focus on corpus-based features

  • core-sustainability for >20 years

  • realistic extensibility with important features

  • always meet the requirements of a scientific tool

Sub-Goals and approaches
as of the project start (Bański et al 2012)

  • support for, in principle, unlimited amounts of primary data and annotation levels

    • by horizontal scalability
      (if the system gets too slow, just add another machine)

  • good support of user defined virtual corpora

  • easy integration of external developments

    • it's impossible to develop all desired features alone

    • ideally join forces with other corpus hosting institutions
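
The horizontal-scalability idea above can be sketched as a scatter-gather search over index partitions. This is only an illustration under stated assumptions: the shards are in-memory dictionaries, threads stand in for separate machines, and the names `search_shard` and `search_all` are hypothetical, not KorAP components.

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term):
    """Return (text_id, position) hits for `term` in one shard."""
    return [
        (text_id, pos)
        for text_id, tokens in shard.items()
        for pos, token in enumerate(tokens)
        if token == term
    ]

def search_all(shards, term):
    """Scatter the query to every shard in parallel, then gather and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = list(pool.map(lambda shard: search_shard(shard, term), shards))
    return sorted(hit for hits in partial_results for hit in hits)

# Two toy shards; adding a third shard would not change search_all().
shards = [
    {"A": ["der", "Hund"], "B": ["ein", "Hund"]},
    {"C": ["Hund", "und", "Katze"]},
]
print(search_all(shards, "Hund"))
```

Because every shard is searched independently, adding data means adding shards (and machines) rather than making a single index larger.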

General Architecture
Microservice Approach


Some more Central Ideas

  • support multiple query languages to reach out for different user communities

  • use “query rewriting” to handle fine-grained authorization

    • more efficient than filtering query hits

    • more transparent (traceable, replicable)

    • backends and frontends can be developed without paying attention to authorization
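
A minimal sketch of the query-rewriting idea, assuming a toy request format. The dictionary layout, the field name `licence`, the operators and the function names are hypothetical and do not reproduce KorAP's actual query protocol; the point is only that the corpus constraint is narrowed before the search runs and that the change is recorded.

```python
def rewrite_for_user(request, allowed_licences):
    """Return a copy of `request` whose corpus constraint is restricted to
    licences the user may access, plus a note documenting the rewrite."""
    restriction = {"field": "licence", "op": "in", "values": sorted(allowed_licences)}
    original = request.get("corpus")
    if original is None:
        corpus = restriction
    else:
        corpus = {"op": "and", "operands": [original, restriction]}
    rewritten = dict(request)
    rewritten["corpus"] = corpus
    # Recording the rewrite keeps the effective query traceable and replicable.
    rewritten["rewrites"] = request.get("rewrites", []) + [
        {"by": "authorization", "action": "restrict corpus", "added": restriction}
    ]
    return rewritten

def matches(text, constraint):
    """Evaluate a (toy) corpus constraint against one text's metadata."""
    if constraint["op"] == "and":
        return all(matches(text, c) for c in constraint["operands"])
    if constraint["op"] == "in":
        return text[constraint["field"]] in constraint["values"]
    if constraint["op"] == "eq":
        return text[constraint["field"]] == constraint["value"]
    raise ValueError(f"unknown operator: {constraint['op']}")

request = {"query": "Baum", "corpus": {"field": "text_type", "op": "eq", "value": "newspaper"}}
rewritten = rewrite_for_user(request, {"CC-BY"})

texts = [
    {"id": "T1", "text_type": "newspaper", "licence": "CC-BY"},
    {"id": "T2", "text_type": "newspaper", "licence": "restricted"},
    {"id": "T3", "text_type": "fiction", "licence": "CC-BY"},
]
print([t["id"] for t in texts if matches(t, rewritten["corpus"])])
```

The backend only ever evaluates the rewritten request, so it never touches texts the user may not see, and it needs no authorization logic of its own.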

Open source Project
BSD-licensed and available via our Gerrit code review


  • All contributions and issue reports are extremely welcome!

3. Using KorAP

Low-threshold entry UI
Information-on-demand approach (Diewald et al. 2020)

In principle, unlimited corpus size


Unbounded number of annotation layers
For DeReKo currently …

Definition of virtual corpora
➞ stratified sub-sampling based on text metadata (Bański et al 2013)


»Corpus by Match«
Definition of virtual corpora based on query hits (Kupietz et al. 2020)

Multiple Query Languages
(Bingel & Diewald 2015)


CQP dialect, developed for Polish National Corpus NKJP (Janus/Przepiórkowski 2007)

  • currently best supported QL, very similar to QLs of:

  • differences (a.o.):

    • syntax for regular expressions and verbatim strings

    • span handling: CQP: "laufen" </base/s=s>;
      vs. Poliqarp+: endsWith(<base/s=s>, laufen)

  • plain CQP in development

  • QL discussion spin-off:

    • ISO 24623 Corpus Query Lingua Franca CQLF (Bański et al. 2016, Evert et al. 2020)

Annotation queries supported by search assistant

»Query by Example/Match«
Learning complex queries without prior knowledge (Diewald et al. 2019)

Queries involving text-structural annotations


Maximize Recall using concurrent annotations
Search for ›das‹ annotated as relative pronoun only by CoreNLP tools


Maximize Recall using concurrent annotations
Search for ›das‹ annotated as relative pronoun by CoreNLP OR TreeTagger
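
Why concurrent annotations raise recall can be shown with a toy example. The token records, the layer keys `corenlp` and `tt`, and the use of the STTS tag `PRELS` for relative pronouns are assumptions made for illustration; this is plain Python, not a KorAP query.

```python
# Four occurrences of "das", each tagged independently by two taggers.
tokens = [
    {"word": "das", "pos": {"corenlp": "PRELS", "tt": "ART"}},
    {"word": "das", "pos": {"corenlp": "ART", "tt": "PRELS"}},
    {"word": "das", "pos": {"corenlp": "PRELS", "tt": "PRELS"}},
    {"word": "das", "pos": {"corenlp": "ART", "tt": "ART"}},
]

def find(tokens, word, tag, layers, require_all=False):
    """Indices of `word` tagged `tag` by any (or, optionally, all) of `layers`."""
    combine = all if require_all else any
    return [
        i for i, t in enumerate(tokens)
        if t["word"] == word and combine(t["pos"].get(layer) == tag for layer in layers)
    ]

only_corenlp = find(tokens, "das", "PRELS", ["corenlp"])
either = find(tokens, "das", "PRELS", ["corenlp", "tt"])
both = find(tokens, "das", "PRELS", ["corenlp", "tt"], require_all=True)
print(only_corenlp, either, both)
```

Combining layers with OR favours recall, while requiring agreement between layers favours precision.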


Search on morphological Annotations
Superlative adjective + superlative adjective + noun

Regular Expressions
Verb that starts with ›ver‹ or ›zer‹ and contains an ›ö‹ Umlaut (➞ Cheat-Sheet)
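
The word-form part of this search can be illustrated with Python's `re` module. This is a sketch only: KorAP's own regular-expression syntax may differ in detail, the candidate list is made up, and the restriction to verbs would come from a part-of-speech annotation rather than from the pattern.

```python
import re

# Starts with "ver" or "zer" and contains at least one "ö" later in the word.
PATTERN = re.compile(r"(?:ver|zer)\w*ö\w*")

candidates = ["verhöhnen", "zerstören", "verlaufen", "erhöhen", "versöhnt", "zerlegen"]
matching = [w for w in candidates if PATTERN.fullmatch(w)]
print(matching)  # ['verhöhnen', 'zerstören', 'versöhnt']
```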

Search on Constituency Annotations


Sentence Annotation, Negation, Focus
trying to find a postposed attributive adjective ›satt‹

Search on Dependency Annotations
Verb with „Satz“ as direct object (with Annis QL)

4. Using KorAP via Client Libraries

Goal: Make corpora as accessible as possible
by providing Multiple Levels of Access (Kupietz et al. 2022)

User Interface
  • the web user interface

Web Service API
  • accessible directly or via client libraries

  • user interface plugins

  • independent access by fully customized installations

Open Source
  • new features by source code contributions

  • direct access to corpus data (without KorAP)

Intended Properties of the levels
the higher the level, the cheaper and the more application instances (Kupietz et al. forthcoming)


Level 1: Web Service API

  • makes all backend functionality available

    • all query languages

    • virtual corpus definitions

    • complex query expressions

    • UI itself uses the API only

  • provides OAuth2 for authorized access to restricted data

  • offers access without authorization to non-copyrighted numerical data

  • accessible directly (➞ documentation)

  • and via client libraries for R and Python (Kupietz et al. 2020b)

API client libraries for R and Python
Application areas

  • complex, multipart queries

  • applications where reproducibility and replicability are required (with varied query or corpus base)

  • providing features that are not yet supported by KorAP's backend or UI, currently e.g.

    • collocation analysis

    • aggregation of search results

  • aim: make programmatic use as easy as possible

    • in order to also pick up linguists who never coded before

    • and to support close links between quantitative analysis and qualitative interpretation

Installation of KorAP’s R client library
in RStudio: Tools ➞ Install Packages ➞ RKorAPClient


Installation of KorAP’s Python client library

  • first install R client library, then

  • pip3 install KorAPClient

Basic Example: Frequency query
for ›Shop‹ and ›Boutique‹ in DeReKo newspapers' year slices ➞ data frame

Helper Functions for interactive Plots included
with CIs, and all data points linked to corresponding queries (Kupietz et al. 2017)
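
How a relative frequency with a confidence interval can be derived from raw counts is sketched below. The Wilson score interval is chosen here for illustration and is not necessarily the method used by the client library; the counts are invented and are not DeReKo figures.

```python
import math

def wilson_interval(hits, total, z=1.96):
    """Wilson score interval for the proportion hits/total (z = 1.96 for ~95 %)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = hits / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Hypothetical counts per year slice: (hits, tokens in the slice).
slices = {2000: (120, 40_000_000), 2010: (310, 55_000_000)}
for year, (hits, total) in slices.items():
    low, high = wilson_interval(hits, total)
    # Report instances per million tokens (ipm) with the interval bounds.
    print(year, f"{hits / total * 1e6:.2f} ipm", f"[{low * 1e6:.2f}, {high * 1e6:.2f}]")
```

Reporting the interval next to the point estimate helps to judge whether a difference between year slices could be due to sampling noise.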

Example: Comparison between different VC
relative frequency of ›sozusagen‹ (= so to speak) in spoken FOLK corpus vs. DeReKo

Build your own specialised UI with R & Shiny
Contrasting of orthographic variants with ›KoViT‹ (Fischer & Lang 2022)


Good starting point for other applications:
Run and modify one of the demos provided with the R-package


Collocation analysis
recursively applicable, with many options – but still slow

  vc = "corpusSigle=WPE15",      # virtual corpus definition
  lemmatizeNodeQuery = FALSE,    # search the node as given, not as a lemma
  minOccur = 5,                  # minimum absolute number of observed co-occurrences
  leftContextSize = 3,           # size of the left context window
  rightContextSize = 3,          # size of the right context window
  topCollocatesLimit = 40,       # limit analysis to the n most frequent collocates in the sample
  searchHitsSampleLimit = 10000, # limit the size of the search hits sample
  ignoreCollocateCase = TRUE,    # treat collocates case-insensitively
  withinSpan = "base/s=s",       # KorAP span specification for collocations to be searched within
  exactFrequencies = TRUE,       # retrieve exact co-occurrence frequencies
  stopwords = STOPWORDS,         # words not to be considered as collocates
  maxRecurse = 1,                # apply collocation analysis recursively maxRecurse times
  addExamples = TRUE,            # add found instances of collocations
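
What several of these options control can be sketched with a small window-based co-occurrence counter in Python. The function `count_collocates` and its parameter names are hypothetical and only mirror the idea of the context-window, minimum-occurrence, stopword and case options above; it is not the client library's implementation.

```python
from collections import Counter

def count_collocates(tokens, node, left=3, right=3, min_occur=1,
                     stopwords=frozenset(), ignore_case=True):
    """Count tokens occurring within `left`/`right` positions of each `node` hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Context window around the hit, excluding the node itself.
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        for collocate in window:
            key = collocate.lower() if ignore_case else collocate
            if key not in stopwords:       # stopwords are expected in lower case
                counts[key] += 1
    # Keep only collocates observed at least `min_occur` times.
    return {w: n for w, n in counts.items() if n >= min_occur}

tokens = "He shook his head and the Head of state shook his head".split()
result = count_collocates(tokens, "head", left=2, right=2, min_occur=2,
                          stopwords={"his", "and", "the"})
print(result)
```

On a real corpus the windows are taken from a sample of search hits, which is why the sample limit and the collocate limit trade accuracy for speed.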

Collocation analysis of ›head‹
based on English Wikipedia corpus WPE15

Diachronic Collocation Analysis
of ›Umwelt‹ (›environment‹)

5. KorAP in contrastive Linguistics

Contrastive applications are very relevant for KorAP
and vice-versa

  • to support contrastive corpus linguistics

  • to test the idea of dynamically definable, virtual comparable corpora

  • to join forces in the development of linguistic research software

  • ongoing initiatives:

    • European Reference Corpus EuReCo

    • International Comparable Corpus ICC (Kirk et al. 2017, Čermáková et al. 2021).

    • WikiCorp (Poudat et al. i.p.)

European Reference Corpus EuReCo
(Kupietz et al. 2020, Kupietz et al. 2017b)

  • open initiative founded in 2013 by IDS and the academies in Poland, Romania and Hungary.

  • goal and idea:

    • many dynamically definable comparable corpora based on existing large corpora

Reference Corpus of Contemporary Romanian CoRoLa (Cristea et al. 2019)


Comparable DE-RO-Corpus


Hungarian National Corpus HNC


WIP: Polish National Corpus NKJP
currently only internally available


French (WikiCorp)


English Wikipedia Corpus WPE15:


in development: Czech part of ICC


Current experiments: LVC comparison Romanian-German
using contrastive collocation analysis via RKorAPClient (Kupietz & Trawiński forthcoming)

<pune> în <NN> / CoRoLa

NN           logDice   EN (~DeepL)
pericol      11.16     Danger
aplicare     10.74     Application
mișcare      10.63     Move
discuție     10.07     Discussion
funcțiune     9.97     Function
evidență      9.64     Highlight
practică      8.95     Practice
executare     8.85     Version
scenă         8.81     Scene
Vânzare       8.51     Sale
circulație    8.44     Circulation
valoare       8.31     Value
slujba        8.24     Job
lumină        7.88     Light
vedere        7.26     View
discuția      7.11     Discussion
JOC           7.10     Game
libertate     7.04     Freedom
relație       6.87     Relationship
balanță       6.79     Balance
situația      6.55     Situation
borcane       6.48     Glasses
serviciul     6.41     Service
umbră         6.23     Shadow
legătură      6.20     Link
primejdie     6.13     Emergency
posesie       6.03     Possession
față          6.02     Face

in <NN> <setzen> / vc_drukola

<NN>                logDice
Gang                10.84
Szene               10.59
Brand               10.12
Kenntnis             9.55
Bewegung             9.44
Verbindung           9.16
Marsch               9.07
Kraft                8.41
Beziehung            7.80
Umlauf               7.70
Anführungszeichen    7.40
Flammen              6.59
Relation             6.39
Untersuchungshaft    6.38
Klammern             6.12
Betrieb              5.92
Stand                5.90
Erstaunen            5.75
Bezug                5.51
Vollzug              5.13
Anführungsstriche    5.06
Gänsefüßchen         4.74
Auslieferungshaft    4.42
Parallele            4.39
Vergleich            4.38
Verkehr              4.28
Pose                 4.15
Positur              4.10
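
For readers unfamiliar with the logDice scores listed above, the standard definition of the measure is logDice = 14 + log2(2·f(x,y) / (f(x) + f(y))). The sketch below computes it in Python; the counts are invented for illustration, and how the client library obtains the individual frequencies is not shown here.

```python
import math

def log_dice(f_xy, f_x, f_y):
    """logDice association score from co-occurrence and marginal frequencies."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Hypothetical counts: node frequency, collocate frequency, co-occurrences.
print(round(log_dice(f_xy=1_200, f_x=50_000, f_y=8_000), 2))
```

The score has a theoretical maximum of 14, reached when the two items only ever occur together, and it does not depend on corpus size, which makes values from corpora of different sizes easier to set side by side.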

Preliminary experiences
with contrastive DE-RO analyses using KorAP

  • when comparing syntagmatic patterns, the corpus composition itself can play a greater role than comparability

    • dynamically definable corpora are key

  • corpus access via scripts seems essential

    • for reproducible results which are essential not only in the iterative corpus refinements

    • but for all analyses consisting of multiple steps

    • for custom-tailored visualizations

  • practical comparability through a common tool seems more essential than ›theoretical‹ comparability

6. Installing your own KorAP

Now easy with KorAP-Docker

7. Summary & Outlook

KorAP Strengths

  • support for very large corpora

  • user definable virtual corpora

  • arbitrary number of morphosyntax, constituency and dependency annotations

  • arbitrary metadata

  • multiple, extensible query languages

  • query by example

  • client libraries for R and Python

  • sustainable standing FOSS project

    • with 3.5 RSE FTEs

Important Features not yet completed

  • horizontal scalability (Krawfish component)

    • sorting and aggregation of query results

    • sampling function for virtual corpus construction

    • collocation analysis

    • export/analysis of token frequency lists

General weaknesses

  • very limited support for spoken or multi-modal corpora

    • no time axis, just one token axis

  • not primarily intended for personal installations

    • now very easy using KorAP-Docker

  • statistical functions not as easily optimizable as in other corpus tools

    • because of flexible virtual corpora, frequencies and association measures cannot be pre-calculated

Most important current issues

  • punctuation, emoticons, emojis are not queryable

  • no query/display of possible metadata values

  • virtual corpora:

    • no overview over contents

    • no UI for management, export, persistence, …

  • import filters for other corpus formats not well supported

    • TEI P5, NoSketchEngine, CWB, TXM
      (in descending order of support)

Next steps

  • implement all missing features and fix all issues :)

  • focus: horizontal scaling component

    • not so important for corpora with < 10G tokens

    • but essential for DeReKo

    • and code base for all missing quantitative functionalities

Thank you very much for your attention!


References I

al-Wadi, Doris (1994):

COSMAS – Ein Computersystem für den Zugriff auf Textkorpora. Version R.1.3-1. Benutzerhandbuch. Mit einem Geleitwort von Prof. Dr. Gerhard Stickel. XII/278 S. - Mannheim: Institut für deutsche Sprache, 1994. ISBN: 3-922641-42-3

Bański, Piotr/Fischer, Peter M./Frick, Elena/Ketzan, Erik/Kupietz, Marc/Schnober, Carsten/Schonefeld, Oliver/Witt, Andreas (2012):

The New IDS Corpus Analysis Platform: Challenges and Prospects. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). Istanbul, Turkey, May 2012. S. 2905-2911 - European Language Resources Association (ELRA), 2012.

Bański, Piotr/Frick, Elena/Hanl, Michael/Kupietz, Marc/Schnober, Carsten/Witt, Andreas (2013):

Robust corpus architecture: a new look at virtual collections and data access. In: Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, 2013. 23-25

Bański, Piotr/Bingel, Joachim/Diewald, Nils/Frick, Elena/Hanl, Michael/Kupietz, Marc/Pęzik, Piotr/Schnober, Carsten/Witt, Andreas (2013):

KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Zygmunt/Uszkoreit, Hans (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. S. 586-587 - Poznań: Fundacja Uniwersytetu im. A., 2013.

Bański, Piotr/Frick, Elena/Witt, Andreas (2016):

Corpus Query Lingua Franca (CQLF). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia: European Language Resources Association (ELRA). 2804–2809.

Belica, Cyril/Kupietz, Marc/Witt, Andreas/Lüngen, Harald (2011):

The Morphosyntactic Annotation of DeReKo: Interpretation, Opportunities, and Pitfalls. In: Konopka, Marek/Kubczak, Jacqueline/Mair, Christian/Šticha, František/Waßner, Ulrich Hermann (eds.): Grammar and Corpora 2009. Third International Conference. Mannheim, 22.-24.9.2009.

Bodmer, Franck (1996):

Aspekte der Abfragekomponente von COSMAS-II. LDV-INFO. Informationsschrift der Arbeitsstelle Linguistische Datenverarbeitung, 8:112–122.

Brückner, Tobias (1989):

REFER. Benutzerhandbuch. - Mannheim: Institut für deutsche Sprache, 1989. ISBN: 3-922641-35-0

Čermáková, A., Jantunen, J., Jauhiainen, T., Kirk, J., Křen, M., Kupietz, M., & Uí Dhonnchadha, E. (2021):

The International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora. Research in Corpus Linguistics, 9(1), 89-103.

Cosma, Ruxandra/Kupietz, Marc (2019):

On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române.

References II

Diewald, Nils/Barbu Mititelu, Verginica/Kupietz, Marc (2019):

The KorAP user interface. Accessing CoRoLa via KorAP. In: Cosma, Ruxandra/Kupietz, Marc (eds.): On design, creation and use of the Reference Corpus of Contemporary Romanian and its analysis tools. CoRoLa, KorAP, DRuKoLA and EuReCo. (= Revue Roumaine de Linguistique 64(3)). Bucureşti: Editura Academiei Române, 2019. 265-277.

Evert, Stefan/Harlamov, Oleg/Heinrich, Philipp/Banski, Piotr (2020):

Corpus Query Lingua Franca Part II: Ontology. Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association: 3346–3352.

Fischer, Peter M. & Lang, Christian (2022):

"Kontrastierungs- und Visualisierungs-Tool" (KoViT). Unveröffentlichte Software. Version 0.1.19. Institut für Deutsche Sprache, Projekt Grammatische Ressourcen.

Gray, Jim (2003):

Distributed Computing Economics. Technical Report MSR-TR-2003-24, Microsoft Research.

Janus, Daniel/Przepiórkowski, Adam (2007):

Poliqarp: An open source corpus indexer and search engine with syntactic extensions. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. Prague, Czech Republic: Association for Computational Linguistics. S. 85–88.

Kirk, John/Čermáková, Anna (2017):

From ICE to ICC: The new International Comparable Corpus. In Bański et al. (eds.): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section

Kupietz, Marc/Diewald, Nils/Trawiński, Beata/Cosma, Ruxandra/Cristea, Dan/Tufiş, Dan/Váradi, Tamás/Wöllstein, Angelika (2020):

Recent developments in the European Reference Corpus EuReCo. In Sylviane Granger & Marie-Aude Lefer (eds) Translating and Comparing Languages: Corpus-based Insights. Corpora and Language in Use Proceedings 6, Louvain-la-Neuve: Presses universitaires de Louvain,

Kupietz, Marc/Diewald, Nils/Hanl, Michael/Margaretha, Eliza (2017):

Möglichkeiten der Erforschung grammatischer Variation mithilfe von KorAP, der neuen Korpusanalyseplattform des IDS, In: Konopka, Marek/Wöllstein, Angelika (Hrsg.), Grammatische Variation. Empirische Zugänge und theoretische Modellierung, Proceedings of the Methodenmesse im Rahmen der Jahrestagung des Instituts für Deutsche Sprache. De Gruyter, 9. März 2016, Mannheim, Germany, S. 319–329.

Kupietz, Marc/Witt, Andreas/Bański, Piotr/Tufiş, Dan/Cristea, Dan/Váradi, Tamás (2017b):

EuReCo – Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research. In: Bański, Piotr/Kupietz, Marc/Lüngen, Harald/Rayson, Paul/Biber, Hanno/Breiteneder, Evelyn/Clematide, Simon/Mariani, John/Stevenson, Mark/Sick, Theresa (Hrsg.): Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 24 July 2017. Mannheim: Institut für Deutsche Sprache, 2017. 15-19.

References III

Kupietz, Marc / Trawiński, Beata (forthcoming):

Neue Perspektiven für kontrastive Korpuslinguistik: Das Europäische Referenzkorpus EuReCo. In: Akten des XIV. Kongresses der Internationalen Vereinigung für Germanische Sprach- und Literaturwissenschaft (IVG). Peter Lang (to appear in 2022)

Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010):

The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta/Choukri, Khalid/Maegaard, Bente/Mariani, Joseph/Odijk, Jan/Piperidis, Stelios/Rosner, Mike/Tapias, Daniel (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). 1848-1854 - European Language Resources Association (ELRA)

Lüngen, Harald/Kupietz, Marc (2020):

IBK- und Social Media-Korpora am Leibniz-Institut für Deutsche Sprache. In: Marx, Konstanze/Lobin, Henning/Schmidt, Axel (eds.): Deutsch in Sozialen Medien. Interaktiv, multimodal, vielfältig. Jahrbuch des Instituts für Deutsche Sprache 2019. (= Jahrbuch des Instituts für Deutsche Sprache 2019). Berlin/Boston: de Gruyter. 319-344.

Margaretha, Eliza/Lüngen, Harald (2014):

Building Linguistic Corpora from Wikipedia Articles and Discussions. In: Beißwenger, Michael/Oostdijk, Nelleke/Storrer, Angelika/van den Heuvel, Henk (eds.): Building and Annotating Corpora of Computer-mediated Communication: Issues and Challenges at the Interface between Computational and Corpus Linguistics. S. 59-82 - Regensburg: GSCL, 2014.

Perkuhn, Rainer/Kupietz, Marc (2018):

Visualisierung als aufmerksamkeitsleitendes Instrument bei der Analyse sehr großer Korpora. In: Bubenhofer, Noah/Kupietz, Marc (eds.): Visualisierung sprachlicher Daten. Visual Linguistics – Praxis – Tools. Heidelberg: Heidelberg University Publishing, 2018. S. 63-90.

Poudat, Céline / Lüngen, Harald / Herzberg, Laura (eds.) (in preparation):

Wikipedia as Corpus. Linguistic corpus building, exploration and analysis. Benjamins

Teubert, Wolfgang/Belica, Cyril (2014):

Von der linguistischen Datenverarbeitung am IDS zur “Mannheimer Schule der Korpuslinguistik”. In: Institut für Deutsche Sprache (eds.): Ansichten und Einsichten. 50 Jahre Institut für Deutsche Sprache. Redaktion: Melanie Steinle, Franz Josef Berens. S. 298-319 - Mannheim: Institut für Deutsche Sprache, 2014.