DeutUng in the Context of the European Reference Corpus EuReCo

DeutUng final workshop, 16.12.2021, Mannheim / online

1. Introduction


  • importance of corpora for linguistic research, in single-language as well as cross-linguistic contexts

  • the number of linguistic studies based on corpus data is increasing

  • the number (and size) of corpora is growing

  • linguists are often faced with a choice between several different types of corpora

  • options available for cross-linguistic research →

Corpora for language comparison
2. Monolingual corpora


Monolingual corpora

  • texts in only one language

  • usually in the original language, therefore of high quality

  • usually linguistically annotated (language-specific)

  • examples of large national reference corpora:

    • DeReKo, ANC, BNC, CNC, NKJP, RNC, CoRoLa, HNC

  • frequently used in cross-linguistic research

    • Augustin (2017), Taborek (2018, 2020), GDE etc.

Methodological questions

  • To what extent are the results of studies based on monolingual corpora comparable?

    • The results are comparable at a meta-level (theoretical level / level of generalizations).

    • At the empirical level (data level) they are less comparable.

    • Reason: Diversity of monolingual corpora (Trawiński and Kupietz 2021)

Conclusion: Monolingual corpora

  • low matching with regard to size, text types, topics, etc. (also morphosyntactic annotation)

  • but high linguistic quality

3. Parallel corpora


Parallel corpora

  • Parallel corpora consist of original texts in one language (source language) and their translations in other languages (target languages).

  • Texts in all languages aligned at sentence level

  • partially linguistically annotated (mainly language-specific annotation)

  • large multilingual parallel corpora:

    • OPUS, Europarl, InterCorp

Advantages of parallel corpora

  • Parallel data: sequences of linguistic units (words, sentences) in two or more languages,

    • which are translation equivalents of each other and as such convey the same meaning

    • are used in the same contexts

    • occur in the same text types from the same time periods etc.

  • perfect basis for identifying functional equivalence between linguistic structures (James 1980, Chesterman 1998) → tertium comparationis

  • make it possible to gain insights into cross-linguistic similarities and differences that might be overlooked when working with monolingual corpora

Linguistic work
contrastive, typological, translational

  • contrastive: Johansson (2007), Altenberg and Granger (2002), Granger (2010), Languages in Contrast (International Journal for Contrastive Linguistics) etc.

  • typological: Cysouw and Wälchli (2007) etc.

  • translational: Granger et al. (2003) etc.

Problems with parallel corpora

  • relatively small size

    • the more languages, the smaller and less differentiated the corpus

  • unbalanced in terms of original texts and translations

  • specific properties of translations (a third code)

    • Laviosa (1998), Baker (1995), Teich (2003): shining through, normalization, simplification etc.

Conclusion: Parallel corpora

  • high comparability in terms of size and content (but not in terms of morphosyntactic annotation)

  • lower quality of the linguistic material

4. Comparable corpora


Need for comparable corpora

  • monolingual and parallel corpora alone are not suitable for finer-grained linguistic research

    • because they lack either comparability or linguistic quality

  • possible remedy:

    • use a combination of parallel and monolingual corpora

    • disadvantages:

      • (quantitative) findings not directly assessable

  • desideratum: comparable corpora of high quality

Comparable corpus
Definition (McEnery & Xiao 2007)

  • "a comparable corpus can be defined as a corpus containing components that are collected using the same sampling frame and similar balance and representativeness [...], e.g. the same proportions of the texts of the same genres in the same domains in a range of different languages in the same sampling period"

Actual or Practical comparability
is also relevant, we think

  • through rich metadata

    • which should be mappable among the corpora

  • through linguistic annotations

  • through appropriate (and common) analysis tools

    • which make all these possibilities usable in the first place

Available general comparable corpora
... that include German

  • currently only Aranea - family of comparable gigaword web corpora (Benko 2014)

    • great and readily usable via NSkE

    • but composition not controlled

  • in future: International Comparable Corpus (ICC)
    (Kirk et al. 2017, Čermáková et al. 2021).

    • approach complementary to EuReCo: using small corpora with controlled composition along the lines of the ICE (Greenbaum 1991)

    • ideally joined with EuReCo in the future

5. EuReCo


EuReCo - European Reference Corpus

  • open initiative founded in 2013 by the academies in Poland, Romania and Hungary and the IDS

  • pilot projects (Humboldt Research Group Linkage Programmes):

    • DRuKoLA: Romanian-German (2016-2018): CoRoLa (Tufiş et al. 2019).

    • DeutUng: Hungarian-German (2017-2021): HNC (Váradi 2002)

  • EuReCo is based on two core assumptions …

1. dedicated comparable corpora are economically infeasible

  • even monolingual universal corpora often cannot be permanently maintained / extended

  • dedicated multilingual comparable corpora would multiply the already unrealistic costs

  • dedicated comparable corpora cannot be built from scratch and maintained sustainably

EuReCo's approach

  • use the existing national and reference corpora

  • which are maintained and sometimes extended by sustainable institutions

  • instead of trying to build new ones

Expected benefits of the EuReCo approach

  • more economical, scalable and sustainable

    • especially since one can also benefit from ongoing and future extensions and improvements of these corpora

  • high linguistic quality and sufficient size to be expected in national corpora

2. general comparability is not achievable

  • corpora with reasonable size and diversity cannot be perfectly comparable in general

    • there will always be a criterion against which the corpora are not comparable

    • whether an unequal distribution with regard to such a variable is relevant depends on the specific research question

  • generally comparable corpora are not a reasonable goal

2b. General representativeness is not possible

  • more important: single language corpora cannot be generally representative

    • since the population (= the language) cannot be defined in general

    • whether a corpus is sufficiently representative depends on the research question and the language domain

  • generally comparable and representative corpora are not meaningful objectives

EuReCo: dynamically definable, virtual comparable corpora!
In analogy to DeReKo's ›primordial sample‹ approach (Kupietz 2016)

  • EuReCo users are invited to ...

  • use predefined (comparable) corpora or

  • define corpora themselves that are
    suitably representative and comparable
    with regard to their respective research question

Basic approach
(Cosma et al. 2016; cf. McEnery & Xiao 2007)

  • draw sub-corpora from the monolingual corpora

  • so that they have similar token distributions with respect to metadata variables like:

    • topic area

    • text type

    • publication date
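The basic approach can be illustrated with a minimal sketch in R. It assumes that per-domain token counts are already known for both monolingual corpora. The domain names, the counts, and the helper `matched_quota()` are hypothetical and only serve the illustration. The sketch uses the simplest possible matching rule (take the same number of tokens per domain from both corpora), which is not necessarily the procedure used in EuReCo.

```r
# Hypothetical token counts per topic domain (illustrative numbers only,
# not real statistics of any corpus mentioned here)
tokens_a <- c(politics = 500, culture = 300, science = 200)
tokens_b <- c(politics = 100, culture = 400, science = 100)

# Hypothetical helper: for each domain, both sub-corpora may contain at most
# as many tokens as the smaller of the two corpora offers in that domain.
matched_quota <- function(a, b) {
  stopifnot(identical(names(a), names(b)))
  pmin(a, b)
}

quota <- matched_quota(tokens_a, tokens_b)

# Share of each domain before and after matching
share <- function(x) round(x / sum(x), 2)
print(rbind(
  a_full  = share(tokens_a),
  b_full  = share(tokens_b),
  matched = share(quota)
))
```

Drawing `quota` tokens per domain from each corpus yields two sub-corpora with identical token distributions over the chosen variable. The same scheme extends to text type or publication date by cross-classifying the counts.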

Refinement: Iterative (and question-specific)
for gradual approximation to sufficient comparability

  1. begin as described above

  2. carry out comparative case studies

  3. if the findings appear to be artefacts of the comparability criteria, refine the mapping and start again at step 2

In principle possible with KorAP-VC-Builder
but not yet practicable due to missing downsampling function


Comparable CoRoLa/DeReKo corpus
based only on the token distribution with respect to topic domain (according to DeReKo's top-level taxonomy)

Composition by year of publication
was not controlled, but is also quite similar

Comparable corpus usable with KorAP


Also HNC fully queryable via KorAP – but so far without controlled comparability


6. Current work in progress

Comparison of syntagmatic patterns
and their contexts of use in German, Romanian & Hungarian

  • using collocation analysis

  • to investigate e.g. light verb constructions (cf. Taborek 2018)

    • depending on text-external variables

  • with manifold 'secondary' objectives:

    • investigate comparability criteria, consequences …

    • identify helpful extensions for KorAP

    • further develop the methodology to identify interesting syntagmatic patterns (Belica & Perkuhn 2015)

Collocation analysis with KorAP's R library
not yet supported in the UI, but very flexible with the R library (Kupietz et al. 2020b)

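library(RKorAPClient)
# Connections to the KorAP instances serving CoRoLa and DeReKo
corola <- new("KorAPConnection", KorAPUrl = "")
dereko <- new("KorAPConnection", verbose = TRUE)
vc_drukola <- "referTo drukola.20180909.1b_words"
# German: "in <noun> setzen", collocates taken from the noun slot left of the verb
in_NN_setzen <- collocationAnalysis(
  dereko,
  node = "focus(in [tt/p=NN] {[tt/l=setzen]})",
  vc = vc_drukola,
  leftContextSize = 1, # refers to {} in focus()
  rightContextSize = 0
)
# Romanian: "pune în <noun>", collocates taken from the noun slot right of "pune în"
pune_in_NN <- collocationAnalysis(
  corola,
  node = "focus({[drukola/l=pune] în} [drukola/p=noun])",
  leftContextSize = 0,
  rightContextSize = 1
)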

Example LVC comparison Romanian-German
Kupietz & Trawiński (forthcoming)

<pune> în <NN> / CoRoLa

NN            logDice   EN (~DeepL)
pericol       11.16     Danger
aplicare      10.74     Application
mișcare       10.63     Move
discuție      10.07     Discussion
funcțiune      9.97     Function
evidență       9.64     Highlight
practică       8.95     Practice
executare      8.85     Version
scenă          8.81     Scene
Vânzare        8.51     Sale
circulație     8.44     Circulation
valoare        8.31     Value
slujba         8.24     Job
lumină         7.88     Light
vedere         7.26     View
discuția       7.11     Discussion
JOC            7.10     Game
libertate      7.04     Freedom
relație        6.87     Relationship
balanță        6.79     Balance
situația       6.55     Situation
borcane        6.48     Glasses
serviciul      6.41     Service
umbră          6.23     Shadow
legătură       6.20     Link
primejdie      6.13     Emergency
posesie        6.03     Possession
față           6.02     Face

in <NN> <setzen> / vc_drukola

<NN>                logDice
Gang                10.84
Szene               10.59
Brand               10.12
Kenntnis             9.55
Bewegung             9.44
Verbindung           9.16
Marsch               9.07
Kraft                8.41
Beziehung            7.80
Umlauf               7.70
Anführungszeichen    7.40
Flammen              6.59
Relation             6.39
Untersuchungshaft    6.38
Klammern             6.12
Betrieb              5.92
Stand                5.90
Erstaunen            5.75
Bezug                5.51
Vollzug              5.13
Anführungsstriche    5.06
Gänsefüßchen         4.74
Auslieferungshaft    4.42
Parallele            4.39
Vergleich            4.38
Verkehr              4.28
Pose                 4.15
Positur              4.10
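
For readers unfamiliar with the association measure in these tables: logDice is commonly defined (following Rychlý 2008) as 14 + log2(2·f_xy / (f_x + f_y)). The following R sketch assumes this standard definition. The function name `log_dice()` and the frequencies in the usage line are hypothetical and not taken from CoRoLa or DeReKo.

```r
# logDice association score, assuming the standard definition:
#   14 + log2(2 * f_xy / (f_x + f_y))
# f_xy: co-occurrence frequency of node and collocate
# f_x, f_y: individual frequencies of node and collocate
log_dice <- function(f_xy, f_x, f_y) {
  14 + log2(2 * f_xy / (f_x + f_y))
}

# Hypothetical frequencies, for illustration only
print(log_dice(f_xy = 50, f_x = 1000, f_y = 200))
```

The score has a theoretical maximum of 14, reached when node and collocate only ever occur together, and it does not depend on corpus size, which makes it attractive for comparing collocates across corpora of different sizes.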

➞ Cohesion strengths depend strongly on domain
Collocate rankings of "pune în ..." in domain = law vs. domain ≠ law: ϱ(N=39) << 0.58

Domain = Law

pune în ...     logDice
pericol         11.79
mișcare         11.10
aplicare        10.76
funcțiune       10.58
discuție        10.54
executare        9.79
liberă           9.07
Vânzare          8.78
circulație       8.71
discuția         8.05
vedere           8.05
practică         8.01
întârziere       7.56
evidență         7.32
libertate        7.18
corespondență    7.16
posesie          6.86
vînzare          6.77
serviciul        6.73
valoare          6.69
echivalență      6.60
dezbaterea       6.29
sarcina          6.27
posesia          6.09
decbatere        6.00
plicuri          5.94
primejdie        5.87
comun            5.76

Domain ≠ Law

pune în ...     logDice
aplicare        11.09
evidență        10.92
pericol          9.96
practică         9.61
discuție         9.59
mișcare          9.54
scenă            9.41
valoare          9.25
funcțiune        8.87
circulație       8.70
slujba           8.69
Vânzare          8.49
lumină           8.38
situația         8.20
relație          7.75
JOC              7.59
balanță          7.16
libertate        7.14
gardă            7.02
primejdie        6.90
umbră            6.86
=                6.79
contact          6.69
dificultate      6.67
pagină           6.64
gând             6.54
legătură         6.47
față             6.46
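
How such a comparison of collocate rankings could be computed is sketched below in R. The sketch assumes that ϱ denotes Spearman's rank correlation, which the slides do not state explicitly. It uses only ten collocates that occur in both lists above, so it is not expected to reproduce the value reported for N=39. The data frame names are hypothetical.

```r
# logDice values copied from the two tables above, restricted to ten
# collocates that appear in both lists (a subset, not the full N = 39)
law <- data.frame(
  collocate = c("pericol", "mișcare", "aplicare", "funcțiune", "discuție",
                "circulație", "practică", "evidență", "libertate", "valoare"),
  logDice = c(11.79, 11.10, 10.76, 10.58, 10.54, 8.71, 8.01, 7.32, 7.18, 6.69)
)
other <- data.frame(
  collocate = c("aplicare", "evidență", "pericol", "practică", "discuție",
                "mișcare", "valoare", "funcțiune", "circulație", "libertate"),
  logDice = c(11.09, 10.92, 9.96, 9.61, 9.59, 9.54, 9.25, 8.87, 8.70, 7.14)
)

# Align the two lists on the collocate and correlate the rankings
both <- merge(law, other, by = "collocate", suffixes = c("_law", "_other"))
rho <- cor(both$logDice_law, both$logDice_other, method = "spearman")
print(rho)
```

A low coefficient indicates that the two domains order the same collocates quite differently, which is the point made on this slide.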

Collocation analysis with HNC
to identify LVCs with hoz (=bring) and nouns in the sublative or illative

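library(RKorAPClient)
library(dplyr) # for bind_rows()
# hnc: a KorAPConnection to the KorAP instance serving the HNC (definition not shown)
# Helper: collocation analysis on the HNC with fixed settings
hulvc <- function(q, leftContextSize, rightContextSize) {
  hnc %>%
    collocationAnalysis(
      q,
      leftContextSize = leftContextSize,
      rightContextSize = rightContextSize,
      exactFrequencies = TRUE,
      withinSpan = "",
      maxRecurse = 0,
      searchHitsSampleLimit = 1000,
      topCollocatesLimit = 30
    )
}
# Noun in sublative or illative directly left of hoz
hoz_left <- hulvc(
  'focus([hnc/p="FN.(SUB|ILL)"] {[hnc/l=hoz]})',
  leftContextSize = 1,
  rightContextSize = 0
)
# Noun in sublative or illative directly right of hoz
hoz_right <- hulvc(
  'focus({[hnc/l=hoz]} [hnc/p="FN.(SUB|ILL)"])',
  leftContextSize = 0,
  rightContextSize = 1
)
# Combine both result tables (further pipeline steps are not shown)
hoz <- hoz_left %>% bind_rows(hoz_right)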

Light Verb Constructions in Hungarian
hoz (=bring) with noun in sublative or illative – focus([hnc/p="FN.(SUB|ILL)"] {[hnc/l=hoz]})

Larger syntagmatic patterns by »recursive collocation analysis«
hoz (=bring) with recursion depth=2 (sorted by logDice)

Preliminary meta findings

  • when comparing syntagmatic patterns, the corpus composition itself can play a greater role than comparability

  • corpus access via scripts is essential

    • for reproducible results which are essential not only in the iterative corpus refinements

    • but for all analyses consisting of multiple steps

    • for custom-tailored visualizations

  • practical comparability is currently more essential than ›theoretical‹ comparability

Desirable Improvements

  • HNC-KorAP-integration: we need more metadata

    • to define comparable corpora

    • to detect dependencies on text external variables

    • but: considerable practical comparability is already achieved
      just by the HNC/KorAP integration

  • KorAP

    • access to token annotations should be supported by client libraries

      • lemma access essential for agglutinative languages

7. Distributed Comparable Corpora

KorAP Support for EuReCo

  • VC for virtual comparable corpora already mentioned

  • support for

    • arbitrary metadata

    • arbitrary annotations

      • token-bound, hierarchical, relational

    • different query languages (Bingel & Diewald 2015)

    • arbitrary licences (Bański et al. 2014)

  • But: Focus on DeReKo and limited resources

Levels of Access
Kupietz et al. (forthcoming)

User Interface
  • the web user interface (Diewald et al. 2019)

Web Service API
  • accessible directly or via client libraries (Kupietz et al. 2020b)

  • user interface plugins

  • independent access by fully customized installations

Open Source
  • new features by source code contributions

  • direct access to corpus data (without KorAP system)

Plugin Level


Plugin Level
Functional areas of the UI

  • search input

    • e.g. lemma expansion

  • definition of virtual corpora

    • e.g. corpus visualization

  • search results

    • e.g. export

  • individual matches

    • e.g. annotation visualizations

Next steps in KorAP development

  • extension and publication of core functionalities
    (currently only available via API client libraries)

    • e.g. sorting, grouping, distribution

  • establishing plugin interfaces

  • publication of plugins

8. Conclusion and outlook


  • we would like to have comparable corpora of high linguistic quality

  • EuReCo offers a realistic approach to achieve this

    • by virtually joining existing large corpora

    • by means of user-defined, dynamically constructable comparable corpora

    • by providing the required sustainable research platform

Next EuReCo steps

  • integrate further large corpora

    • we are currently working on the NKJP!

  • continue the KorAP development

    • especially also the client libraries

  • continue working on ICC and its integration

Thank you very much for your attention!


  • whether corpora are sufficiently comparable cannot be decided in general, but depends, among other things, on the question to be answered

  • dynamically definable virtual comparable corpora are a good approach to solving this problem

    • especially since this can also adjust the composition as a whole

  • most important for now seems to be practical comparability, supported by a uniform tool, mappable metadata and annotations

