This page demonstrates the use of Relative Entropy for analyzing language variation. Currently it covers the well investigated Brown/LOB family of corpora for a synchronic and diachronic comparison of registers in British English (1931, 1961, 1991) and in British vs. American English (1961, 1991).

The main (tentative) contributions of this work in progress are:

- Relative Entropy as a measure for distance and feature ranking: While relative entropy is a well-understood and widely used measure for comparing probability distributions (such as language models), it seems to be neglected in the current mainstream of corpus linguistics. We show that relative entropy provides an intuitive resolution of some objections to information-theoretic measures raised, e.g., by Kilgarriff [1].
- Term typicality and term significance are orthogonal: Current best practice in comparing corpora (and thereby language variation) seems to conflate term *typicality* with term *significance*. Our approach treats term typicality and significance as orthogonal assessments, using significance for term selection and typicality for term ranking. This overall approach can and should be generalized to other feature sets.
- Combining macro- and micro-analysis: The visualizations in themselves are standard: heatmaps for visualizing distance among subcorpora and registers, word clouds for visualizing terms and their contribution to the distance at hand, and (planned) concordances for inspecting terms in their context. It is the combination of these visualizations that largely facilitates exploring language variation at various levels of granularity (macro-analysis) and inspecting the main features contributing to variation (micro-analysis).

TODO: for a web page this is too much text.

LOB: 1931, 1961 provides a typical entry point for the LOB family of corpora. At the top, there are three heatmaps. The left heatmap visualizes the overall distance for all nine pairs of the three time snapshots of British English (1931, 1961, 1991) and serves for drilling down to particular pairs for closer inspection. The two drill-downs (1 for 1931, 2 for 1961) are displayed in the middle and right heatmaps, which visualize the distances among the individual registers (A-R) in 1931 and 1961. Distance colors range from greenish to reddish; the color keys to the left and to the right provide more detail. The main diagonals, which "contrast" a subcorpus with itself, are colored in grey.

Each heatmap also provides a term ranking for the currently selected pair of subcorpora, visualized by term clouds. The size of a term corresponds to its contribution to the distance; its color corresponds to its relative frequency in the selected (sub)corpus, ranging from blueish to reddish, as the color keys indicate. Both size and color are scaled logarithmically. Term clouds for the main diagonal show the term ranking for the selected subcorpus in comparison to the rest of the corpus; otherwise they show the term ranking for the selected pair of subcorpora.

The *colors* panel to the top right allows toggling between
two color schemes: *rbow* arguably visualizes relative
contrast more clearly, but *ryb* should also work for
readers with red-green color blindness.

The *p-value* panel to the middle right allows choosing
different levels of significance for the term rankings; the default is *0.05*.
Note that "significance" levels of 0.25 and 0.5 are highly unusual and
practically disregard significance.

Every selection setup gets a unique URL by means of a so-called fragment identifier for further reference.

The initial selection already reveals some interesting insights into the corpus at hand: As is to be expected, the distance between 1931 and 1991 (top right cell) is larger than the distance between 1931 and 1961 (top middle cell) and also larger than the distance between 1961 and 1991 (middle right cell). The same holds for the other direction: The distance between 1991 and 1931 is larger than the distances between 1961 and 1931 and between 1991 and 1961. Thus, with respect to the distances, the corpus for 1961 is clearly positioned between 1931 and 1991. Generally, the distances back in time are larger than the distances forward in time, i.e., encoding text from 1991 with the language model of 1931 requires more additional bits than the other way around (TODO: why is this so?).

The drill down heatmaps at the level of individual registers in 1931 and 1961 offer two main observations:

- The relative distance between registers remains fairly constant, i.e., registers that are distant in 1931, e.g., D (Religion), H (Miscellaneous), and J (Learned) vs. the fictional registers K through P, are also distant in 1961, and registers that are close, e.g., the fictional registers among each other, remain close in 1961. Incidentally, F (Popular Lore) and G (Belles-Lettres) have a fairly small distance to all other registers throughout - easy reading.
- With only a few exceptions the distances increase over time, i.e., registers in 1961 are more diversified than in 1931. This is even more evident when comparing 1931 with 1991. In 1991 the divide between non-fictional and fictional registers is very clear.

By inspecting the individual contribution of terms to the distance, one can get an idea about how language use changes over time:

The leftmost term cloud in Contrast 1931 vs. 1991 shows terms typical for 1931 as opposed to 1991, i.e., their relative frequencies have gone down over time. It is evident that in particular highly frequent terms ("the", "of", "which", "that", "is") are less frequent in 1991 than in 1931, again with 1961 in the middle (see Contrast 1931 vs. 1961 and Contrast 1961 vs. 1991). If highly frequent terms decrease, the overall distribution of the language model becomes smoother, which leads to a higher overall entropy of language. And indeed, the average entropy of registers increases over time from 7.64 in 1931 over 7.74 in 1961 to 7.91 in 1991 (for more information on how this entropy is calculated see the companion page Visualizing Surprise).

Contrast 1991 vs. 1931 shows terms that have increased over time. Compared to Contrast 1931 vs. 1991 these terms generally are from a slightly lower frequency band. "s" and "t" are among the top terms. Inspecting the corpus reveals that the increase of "s" (4719 in 1931 vs. 7640 in 1991) is mainly due to an increase of genitive "s", and partially due to the increasing contraction "it's" (283 in 1931 vs. 597 in 1991). Note that the increase of genitive "s" is mirrored by the decrease of "of" (see above). "t" and - to a much lesser extent - "d" are due to increasing contraction too, as in "didn't", "can't", "I'd", "she'd" ("ll" also increases slightly, but interestingly has its peak in 1961). "didn't" is so prominent that "didn" is among the top terms. Naturally, some terms typical for 1991 just reflect prominent subjects of discourse at the time ("thatcher", "european", "ec") and years. (TODO: maybe a statement about personal pronouns).

The middle and right term clouds allow for investigating the development of individual registers over time. For example, when comparing A (Press: Reportage) in 1991 vs. 1931, the terms "said", "says", and "told", typical for 1991, indicate an increase of reported speech, whereas the terms "the", "of", "and", typical for A in 1931, mirror the general trend of decreasing high-frequency terms over time. Other typical terms are mostly due to the different subjects of discourse in A 1931 vs. 1991. Comparing fictional registers over time (e.g. Fiction: Mystery and Detective (1991 vs. 1931)) can also offer interesting insights: The increase of female personal pronouns ("she", "her") in 1991 indicates an increase in female protagonists for this register (maybe also owing to a larger bias towards mystery), which is mirrored by the male personal pronouns ("he", "his") in 1931. The other high-frequency terms corroborate the observations on general language: "t", "d" for contraction in 1991, "that", "which", "of" in 1931. At the chosen significance level this is about it: the mere 24 documents in L do not provide more significantly typical terms. At lower "significance" levels (e.g. 0.25) many more terms appear, most of them proper nouns. This is not surprising, as fiction usually is about persons with names. Here, too, female proper names appear to occur more often in 1991 than in 1931. However, each individual proper name cannot be regarded as significantly typical here. Contrasting L: 1931 vs. 1961 and L: 1961 vs. 1991 shows that these changes have developed "gradually", i.e., both comparisons look similar to the overall contrast 1931 vs. 1991, though only at the rather low "significance" level of 0.25; there are too few data to provide insight at this rather fine level of distinction.

Within a time snapshot (selected from the main diagonal of the leftmost heatmap by clicking on a cell twice) it is also interesting to compare individual registers. For example, Miscellaneous, Learned (1961) shows the terms typical for Miscellaneous and Learned in contrast to all other registers in 1961. This mainly allows inspecting the commonalities between two registers in contrast to general language; in this case high-frequency terms such as "the", "of", "by", "in", etc. By contrasting these two registers with each other, as in Miscellaneous vs. Learned (1961), one can inspect their differences, which largely consist of (lower-frequency) domain-specific terms such as "company", "government" for Miscellaneous, and "equation", "theory" for Learned. By selecting different time snapshots in the left heatmap, these contrasts can be quickly inspected in another year.

Altogether the 3 subcorpora of the LOB family, split into 15 registers each, allow for (3*15)*(3*15) = 2025 different selections of contrast. Not all of these will provide interesting insights. As a general rule one should avoid varying more than one parameter, i.e., either keep the time fixed and vary the register for synchronic comparisons among registers, or keep the register fixed and vary the time for diachronic comparisons. This still leaves plenty of available contrasts: 3*15*15 (synchronic) + 6*15 (diachronic) = 765. For synchronic comparisons, it is usually more interesting to contrast registers with some expected commonalities, such as the press registers A, B, and C, the fictional registers K through P, or the (formal reporting) registers H and J.

If you run out of interesting contrasts in the LOB family, LOB, Brown offers exploring the combined LOB, Brown family for differences between British and American English in 1961 and 1991, ScaSciTex visualizes the Saarbrücken Corpus of Scientific Text, and Diskurs in der Weimarer Republik allows to explore word usage in several registers during the Weimar Republic.

TODO: Compare empirical "findings" with the vast literature on this corpus (e.g. [1]) and register analysis in general (e.g. Biber). So far it seems that measuring distance and ranking features based on relative entropy (+ t-test) largely agrees with common knowledge, which is good.

We use simple unigram language models, i.e., (sub)corpora are
represented as a vector of relative term frequencies. The language
models are smoothed with Jelinek-Mercer smoothing: *p(w) =
(1-lambda)*p'(w) + lambda*b(w)*, where *p'(w)* is the observed
probability (relative frequency) of the term in the subcorpus, and *b(w)*
is the observed probability of the term in the entire corpus:
BLOB+LOB+FLOB when comparing BLOB with LOB etc., and, e.g., LOB when
comparing individual registers in LOB. We use *lambda = 0.05*. For a
discussion of further smoothing methods for unigram language models see
e.g. [6].
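As a minimal sketch, the smoothing step can be written down in a few lines of Python. The function name `jelinek_mercer` and the toy corpora are illustrative, not part of the actual implementation; (sub)corpora are assumed to be plain term-count dictionaries.

```python
from collections import Counter

def jelinek_mercer(sub_counts, bg_counts, lam=0.05):
    """Smooth a subcorpus model with the background model:
    p(w) = (1 - lam) * p'(w) + lam * b(w)."""
    n_sub = sum(sub_counts.values())
    n_bg = sum(bg_counts.values())
    # iterate over the background vocabulary, so every term gets nonzero mass
    return {w: (1 - lam) * sub_counts.get(w, 0) / n_sub + lam * c / n_bg
            for w, c in bg_counts.items()}

sub = Counter("the cat sat on the mat".split())
bg = Counter("the cat sat on the mat the dog ran".split())
p = jelinek_mercer(sub, bg)
# p sums to 1, and "dog", unseen in the subcorpus, still has p(w) > 0
```

Smoothing matters here because relative entropy is undefined when a term has zero probability in one of the two models.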

Relative Entropy, also known as Kullback-Leibler Divergence, measures
the number of *additional* bits per term needed to encode a
message following a language model *P* by using an encoding
optimized for a language model *Q*. It is defined as follows:
*KLD(P||Q) = Sum_w p(w)*log_2(p(w)/q(w))*, where *p(w)*
is the probability of a term *w* in *P*, and *q(w)*
is the probability of *w* in *Q*.

To understand this definition, it is useful to break it down into its
constituents: *-log_2(p(w))* measures the number of bits
needed to encode *w* using an optimal encoding for *P*.
Terms with large *p(w)* need few bits, terms with small *p(w)* need
more. For example, with *p("the") ~= 0.0742* as in BLOB (1931), *"the"*
requires *3.75 bits*, whereas the less frequent term "certain"
with *p("certain") ~= 0.0003* requires *11.7 bits*. As
can easily be seen, *log_2(p(w)/q(w)) = log_2(p(w)) -
log_2(q(w))* then gives the number of additional bits needed for
encoding *w* when using an optimal encoding for *Q* as
opposed to *P*, if *p(w) > q(w)*, or the number of
bits "spared" if *p(w) < q(w)*. Finally, *p(w)*log_2(p(w)/q(w))*
weights the additional bits with *p(w)*, such that the sum over all *w*
indeed gives the *average* number of additional bits per term
needed. It can easily be shown that *KLD(P||Q) = 0* iff *P=Q*,
and that this is the minimum.
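The definition and the worked bit counts above can be checked with a small Python sketch (function and variable names are illustrative; distributions are assumed to be dicts over a shared, smoothed vocabulary):

```python
import math

def kld(p, q):
    """KLD(P||Q) in bits per term; p and q are dicts over the same
    smoothed (hence nonzero) vocabulary."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

# bit cost of a single term under an optimal encoding for P: -log2(p(w))
bits_the = -math.log2(0.0742)      # "the" in BLOB (1931): ~3.75 bits
bits_certain = -math.log2(0.0003)  # "certain": ~11.7 bits

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.8, "b": 0.2}
# kld(p, p) == 0.0 is the minimum; kld(p, q) > 0 for any q != p
```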

The individual contribution of a term *w* to the Relative
Entropy, *KLD_w(P||Q) = p(w)*log_2(p(w)/q(w))*, can be used for
assessing how "typical" the term is for *P* as opposed to *Q*,
and thus for term ranking, or more generally feature ranking. By
weighting the ratio of term probabilities measured by *log_2(p(w)/q(w))*
with the term probability *p(w)*, a bias towards
overemphasizing large ratios for low-frequency terms (as criticized
for a variant of this measure by Kilgarriff [1], see 2.4 below) is
avoided.
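A hypothetical term ranking along these lines, again assuming smoothed distributions as dicts (the function names are illustrative):

```python
import math

def kld_contributions(p, q):
    """Per-term contribution KLD_w(P||Q) = p(w) * log2(p(w)/q(w));
    large positive values mark terms typical for P as opposed to Q."""
    return {w: pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0}

def rank_terms(p, q, top=10):
    """Terms sorted by their contribution to KLD(P||Q), largest first."""
    contrib = kld_contributions(p, q)
    return sorted(contrib, key=contrib.get, reverse=True)[:top]

p = {"a": 0.7, "b": 0.2, "c": 0.1}
q = {"a": 0.5, "b": 0.3, "c": 0.2}
ranking = rank_terms(p, q)  # "a" contributes most to the distance
```

Note that the contributions sum up exactly to *KLD(P||Q)*, so the ranking decomposes the distance itself.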

As observed by Kilgarriff [3], Gries [4], and many others, even fairly large differences in term frequencies are not necessarily significant, especially if they vary a lot between documents. In the extreme, a term, for example a proper name, that frequently occurs in just one document of a corpus can still lead to a fairly large frequency for the entire corpus, especially if the corpus consists of fairly few documents (the number of documents per register in the LOB family ranges between 6 and 80; the overall number of documents per corpus is 500). Thus a large frequency ratio may well arise just by chance and not be representative of the corpus contrast at hand.

We use an unpaired t-test (more precisely, Welch's t-test, which deals
with unequal variances) to test an observed difference between the
relative frequencies of a term in two corpora. The t statistic is
defined as follows: *t = (mean_1 - mean_2)/sqrt(var_1/n_1 +
var_2/n_2)*, where *mean_i* is the mean probability (relative
document frequency) of a term in Corpus *i*, *var_i* is
its (empirical) variance, and *n_i* is the number of documents
in Corpus *i*. Given the t statistic and the degrees of
freedom (calculated from *var_i* and *n_i*, formula
omitted), one can calculate the so-called p-value, which gives the
probability of observing a difference at least as large as the one
between *mean_1* and *mean_2* just by chance. In statistics parlance,
we can reject the null hypothesis *H_0* that *mean_1 - mean_2 =
0* if the p-value is below a given significance level, typically 0.05
or even 0.01.
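A sketch of the t statistic and the (otherwise omitted) Welch-Satterthwaite degrees of freedom, assuming per-document relative frequencies as plain lists; `welch_t` is an illustrative name, and in practice the p-value would then be looked up in a t distribution, e.g. via a statistics library:

```python
import math

def welch_t(xs, ys):
    """Welch's t statistic and degrees of freedom for two samples
    (e.g. per-document relative frequencies of a term)."""
    n1, n2 = len(xs), len(ys)
    m1, m2 = sum(xs) / n1, sum(ys) / n2
    v1 = sum((x - m1) ** 2 for x in xs) / (n1 - 1)  # empirical variance
    v2 = sum((y - m2) ** 2 for y in ys) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df
```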

Even though sometimes conflated, term typicality (as, e.g., measured by
relative entropy) and term significance are orthogonal. Term
typicality asks whether an observed difference *matters*. Term
significance asks whether we have enough data to be sure that the
difference is not just due to chance. A difference can be significant
but not matter (relatively). An insignificant difference cannot be
judged for typicality. Thus we use the t-test as a filter for term
rankings: only terms with a difference at a (choosable) level of
significance are displayed.
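The filter-then-rank design can be summarized in one hypothetical function (the names and the precomputed `pvalues` mapping are illustrative, not the actual implementation):

```python
import math

def significant_typical_terms(p, q, pvalues, alpha=0.05):
    """Filter-then-rank: keep only terms whose frequency difference is
    significant (p-value below alpha), then rank the survivors by their
    contribution p(w)*log2(p(w)/q(w)) to the relative entropy.
    pvalues maps each term to a precomputed t-test p-value."""
    kept = [w for w in p if p[w] > 0 and pvalues.get(w, 1.0) < alpha]
    return sorted(kept, key=lambda w: p[w] * math.log2(p[w] / q[w]),
                  reverse=True)

p = {"said": 0.010, "the": 0.050, "zebra": 0.0001}
q = {"said": 0.004, "the": 0.060, "zebra": 0.00001}
pvalues = {"said": 0.001, "the": 0.20, "zebra": 0.30}
# at alpha = 0.05 only "said" is both significant and typical for p
```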

Note that the t-test requires a corpus consisting of *multiple*
documents, with document boundaries known. For small *n* (such
as in Register M, consisting of only 6 documents), the t statistic is
not very accurate, because the distribution of the document
frequencies of a particular term is then very likely not normal. However,
in this case there often exist very few significant differences
anyway. Dividing a single document into slices to increase the number
of samples does not help, as this would violate the underlying
assumption of independently drawn samples.

TODO: Gries [5] introduces the measure DP (deviation of proportions) for taking into account what he calls dispersion (aka variance). Among all other measures he discusses, the t-test is notably absent. Thus it would be good to compare t-test as a filter for relative entropy to his method of frequency adjustment.

Kilgarriff [1] discusses and analyses a variety of distance measures and term rankings, but notably does not take into account Relative Entropy. The two information-theoretic measures he discusses are Mutual Information and Cross Entropy:

Mutual information, as defined by Kilgarriff, only takes into account
the log part of relative entropy: *MI_w(P,Q) =
log_2(p(w)/q(w))*, where *p* is the distribution of subcorpus
*P*, and *q* is the distribution of the entire corpus
(including *P*). Kilgarriff notes that this definition of *MI*
overemphasizes low frequency terms. To see how Relative Entropy can
alleviate this problem it is illustrative to define it in terms of
the above definition of *MI*: *KLD_w(P||Q) =
p(w)*MI_w(P,Q)*. By multiplying *MI* with *p(w)* low
frequency terms receive an overall lower weight. In addition, it
seems more appropriate not to compare *P* with the
distribution of the entire corpus, but rather with the entire corpus
minus *P*.
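A tiny numeric example illustrates the damping effect of the *p(w)* weight (the probabilities are made up for illustration):

```python
import math

# a rare term with a large frequency ratio vs. a frequent term with a
# modest ratio
p = {"the": 0.06, "aardvark": 0.0004}
q = {"the": 0.05, "aardvark": 0.0001}

mi = {w: math.log2(p[w] / q[w]) for w in p}  # Kilgarriff-style MI
kld_w = {w: p[w] * mi[w] for w in p}         # contribution to KLD(P||Q)

# MI ranks the rare term first; the p(w)-weighted contribution does not
```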

Cross Entropy is defined as follows: *H(P||Q) = - Sum_w
p(w)*log_2(q(w))*. Kilgarriff uses so-called "Known Similarity
Corpora" (KSC) to compare Cross Entropy with the alternative measures
Spearman Rank Correlation and Chi-square, and shows that Chi-square
clearly outperforms the other two measures, with Cross Entropy
performing worst. To understand this result it is again illustrative
to express Cross Entropy in terms of Relative Entropy: *H(P||Q)
= KLD(P||Q)+H(P)*, which follows straightforwardly from the
definition of Relative Entropy: *KLD(P||Q) = H(P||Q) - H(P)*.
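This identity is easy to verify numerically (illustrative Python, distributions as dicts):

```python
import math

def entropy(p):
    return -sum(pw * math.log2(pw) for pw in p.values() if pw > 0)

def cross_entropy(p, q):
    return -sum(pw * math.log2(q[w]) for w, pw in p.items() if pw > 0)

def kld(p, q):
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

p = {"a": 0.7, "b": 0.3}
q = {"a": 0.4, "b": 0.6}
# H(P||Q) = KLD(P||Q) + H(P), up to floating point error
```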

The key problem with Cross Entropy *H(P||Q)* as a distance
measure among arbitrary pairs of distributions is that it heavily
depends on the Entropy of *P*, *H(P)=H(P||P)*. In
particular, when *P* is more uniformly distributed than *P'*,
there always exist distributions *Q* and *Q'* such that
*H(P||Q) > H(P'||Q')*, even though by other considerations
*H(P||Q) < H(P'||Q')* should hold.

The construction of KSCs very likely leads to such a situation: Given
two corpora with distributions *P* and *Q*, a KSC
consists of (e.g.) 11 subcorpora with the first one (*Q0P*)
sampled only from *P*, the second one (*Q1P*) sampled
with 90% from *P* and 10% from *Q*, and so on.
Essentially these corpora thus approximate simple weighted mixtures:
*QaP = (1-a/10)*P + a/10*Q*. By this construction, the distances for two
pairs of corpora *dist(QaP,QbP)* and *dist(QcP,QdP)*, *a
< b*, *c < d*, *b <= c* should follow the
simple rule *dist(QaP,QbP) < dist(QcP,QdP) iff b-a <
d-c*. As Kilgarriff points out, a perfect measure would respect this
rule. The problem with Cross Entropy is that by mixing corpora, the
underlying distribution will become more uniform, and thus the
Entropy will increase. Indeed, often the (self) Entropy of the
"middle" distribution *H(Q5P)* will be the maximum. Therefore,
it can easily happen that, e.g., *H(Q4P||Q5P) >
H(Q0P||Q2P)*, even though *5-4 < 2-0*, just because *H(Q4P)
> H(Q0P)*. *Relative* Entropy cancels out the effect of
(self) Entropy, and thus does not suffer from this problem.
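The entropy increase under mixing follows from the concavity of entropy, as a small sketch illustrates. The two toy distributions are symmetric by construction, so the maximum lands exactly in the middle; with very unequal *H(P)* and *H(Q)* it can shift towards the more uniform endpoint:

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p.values() if x > 0)

def mix(p, q, a):
    """The mixture (1 - a)*P + a*Q over the union vocabulary."""
    vocab = set(p) | set(q)
    return {w: (1 - a) * p.get(w, 0.0) + a * q.get(w, 0.0) for w in vocab}

p = {"a": 0.8, "b": 0.1, "c": 0.1}
q = {"a": 0.1, "b": 0.1, "c": 0.8}
entropies = [entropy(mix(p, q, a / 10)) for a in range(11)]
# the "middle" mixture Q5P has the highest entropy
```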

Kilgarriff [1] concludes his discussion on various term ranking
methods as follows: "Linguists have long made a distinction
approximating to the high/low frequency contrast: form words (or
‘grammar words’ or ‘closed class words’) vs. content words (or
‘lexical words’ or ‘open class words’). The relation between the
distinct linguistic behaviour, and the distinct statistical behaviour
of high-frequency words is obvious yet intriguing. It would not be
surprising if we cannot find a statistic which works well for both
high and medium-to-low frequency words. *It is far from clear
what a comparison of the distinctiveness of a very common word and a
rare word would mean.*" (my emphasis).

TODO: much improve the text above; it is illegible atm, and rendering of formulae is only a small part of the problem.

TODO: repeat Kilgarriff's evaluation including Relative Entropy. Unfortunately, Kilgarriff's KSCs are not available anymore, so we need to reconstruct them on the basis of the corpora at hand. (DONE: indeed Relative Entropy performs on par with chisquare and spearman rank on KSCs. Results to be detailed.)

The implementation of heatmaps heavily borrows from Joe Golike's excellent visualization of house-hunting habits at Trulia.

Term clouds are realized based on Jason Davies' implementation, with invaluable help (for a JavaScript novice) from Lars Ebert's tutorial (in German).

Both sources and this implementation make heavy use of Michael Bostock's JavaScript library for Data Driven Documents (D3).

Peter Fankhauser. fankhauser at ids-mannheim.de

[1] Adam Kilgarriff: Comparing Corpora. In: International Journal of Corpus Linguistics 6:1 (2001), pp. 97–133.

[2] Lars Hinrichs, Nicholas Smith, Birgit Waibel: Manual of information for the part-of-speech-tagged, post-edited ‘Brown’ corpora. In: ICAME Journal 34 (April 2010), pp. 189–231.

[3] Adam Kilgarriff: Language is never, ever, ever, random. In: Corpus Linguistics and Linguistic Theory 1-2 (2005), pp. 263–275.

[4] Stefan Th. Gries: Null-hypothesis significance testing of word frequencies: a follow-up on Kilgarriff. In: Corpus Linguistics and Linguistic Theory 1-2 (2005), pp. 277–294.

[5] Stefan Th. Gries: Dispersions and adjusted frequencies in corpora. In: International Journal of Corpus Linguistics 13:4 (2008), pp. 403–437.

[6] Chengxiang Zhai, John Lafferty: A study of smoothing methods for language models applied to information retrieval. In: ACM Transactions on Information Systems (TOIS) 22:2 (April 2004), pp. 179–214.