Exploring Contrast

This page demonstrates the use of Relative Entropy for analyzing language variation. Currently it covers the well-investigated Brown/LOB family of corpora for a synchronic and diachronic comparison of registers in British English (1931, 1961, 1991) and in British vs. American English (1961, 1991).

The main (tentative) contributions of this work in progress are:

  1. Relative Entropy as a measure for distance and feature ranking: While relative entropy is a well understood and widely used measure for comparing probability distributions (such as language models), it seems to be neglected in the current mainstream of corpus linguistics. We show that relative entropy provides an intuitive resolution of some objections to information theoretic measures raised, e.g., by Kilgarriff [1].
  2. Term typicality and term significance are orthogonal: Current best practice in comparing corpora (and thereby: language variation) seems to conflate term typicality with term significance. Our approach treats term typicality and significance as orthogonal assessments, using significance for term selection and typicality for term ranking. This overall approach can and should be generalized to other feature sets.
  3. Combining Macro- and Micro-Analysis: The visualizations in themselves are standard: Heatmaps for visualizing distance among subcorpora and registers, word clouds for visualizing terms and their contribution to the distance at hand, and (planned) concordances for inspecting terms in their context. It is the combination of these visualizations that largely facilitates exploring language variation at various levels of granularity (macro-analysis) and inspecting the main features contributing to variation (micro-analysis).

TODO: for a web page this is too much text.

1. A Guided Tour

1.1 Overview

LOB: 1931, 1961 provides a typical entry point for the LOB family of corpora. At the top, there are three heatmaps. The left heatmap visualizes the overall distance for all nine pairs of the three time snapshots of British English (1931, 1961, 1991), and serves for drilling down to particular pairs for closer inspection. The two drill-downs (1 for 1931 and 2 for 1961) are displayed in the middle and right heatmaps, which visualize the distances among the individual registers (A-R) in 1931 and 1961. Distance colors range from greenish to reddish; the color keys to the left and to the right provide more detail. The main diagonals, which "contrast" a subcorpus with itself, are colored in grey.

Each heatmap also provides a term ranking for the currently selected pair of subcorpora, visualized by term clouds. The size of a term corresponds to its contribution to the distance, and its color corresponds to its relative frequency in the selected (sub)corpus, ranging from bluish to reddish, as the color keys indicate. Both size and color are scaled logarithmically. Term clouds for the main diagonal show the term ranking for the selected subcorpus in comparison to the rest of the corpus; otherwise they show the term ranking for the selected pair of subcorpora.

The colors panel to the top right allows toggling between two color schemes: rbow arguably visualizes relative contrast more clearly, but ryb should also work for readers with red-green color blindness.

The p-value panel to the middle right allows choosing different levels of significance for the term rankings, 0.05 by default. Note that "significance" levels 0.25 and 0.5 are highly unusual and practically disregard significance.

Every selection setup gets a unique URL by means of a so-called fragment identifier for further reference.

1.2 Contrast over Time

The initial selection already reveals some interesting insights for the corpus at hand: As is to be expected, the distance between 1931 and 1991 (top right cell) is larger than the distance between 1931 and 1961 (top middle cell) and also larger than the distance between 1961 and 1991 (middle right cell). The same holds for the other direction: The distance between 1991 and 1931 is larger than the distances between 1961 and 1931 and between 1991 and 1961. Thus, with respect to the distances the corpus for 1961 is clearly positioned between 1931 and 1991. Generally, the distances back in time are larger than the distances forward in time, i.e., encoding text from 1991 with the language model of 1931 requires more additional bits than the other way around (TODO: why is this so?).

The drill down heatmaps at the level of individual registers in 1931 and 1961 offer two main observations:

  1. The relative distance between registers remains fairly constant, i.e., registers that are distant in 1931, e.g., D (Religion), H (Miscellaneous), and J (Learned) vs. the fictional registers K through P, are also distant in 1961, and registers that are close, e.g., the fictional registers among each other, remain close in 1961. Incidentally, F (Popular Lore) and G (Belles Lettres) have a fairly small distance to all other registers throughout - easy reading.
  2. With only a few exceptions the distances increase over time, i.e., registers in 1961 are more diversified than in 1931. This is even more evident when comparing 1931 with 1991. In 1991 the divide between non-fictional and fictional registers is very clear.

1.3 Language over Time

By inspecting the individual contribution of terms to the distance, one can get an idea about how language use changes over time:

The leftmost term cloud in Contrast 1931 vs. 1991 shows terms typical for 1931 as opposed to 1991, i.e., their relative frequencies have gone down over time. It is evident that in particular highly frequent terms ("the", "of", "which", "that", "is") are less frequent in 1991 than in 1931, again with 1961 in the middle (see Contrast 1931 vs. 1961 and Contrast 1961 vs. 1991). If highly frequent terms decrease, the overall distribution of the language model becomes smoother, which leads to a higher overall entropy of language. And indeed, the average entropy of registers increases over time from 7.64 in 1931 via 7.74 in 1961 to 7.91 in 1991 (for more information on how this entropy is calculated see the companion page Visualizing Surprise).

Contrast 1991 vs. 1931 shows terms that have increased over time. Compared to Contrast 1931 vs. 1991 these terms generally come from a slightly lower frequency band. "s" and "t" are among the top terms. Inspecting the corpus reveals that the increase of "s" (4719 in 1931 vs. 7640 in 1991) is mainly due to an increase of genitive "s", and partially due to the increasingly frequent contraction "it's" (283 in 1931 vs. 597 in 1991). Note that the increase of genitive "s" is mirrored by the decrease of "of" (see above). "t" and - to a much lesser extent - "d" are due to increasing contraction too, as in "didn't", "can't", "I'd", "she'd" ("ll" also increases slightly, but interestingly has its peak in 1961). "didn't" is so prominent that "didn" is among the top terms. Naturally, some terms typical for 1991 just reflect prominent subjects of discourse at the time ("thatcher", "european", "ec") as well as year numbers. (TODO: maybe a statement about personal pronouns).

The middle and right term clouds allow for investigating the development of individual registers over time. For example, when comparing A (Press: Reportage) in 1991 vs. A in 1931, the terms "said", "says", and "told" typical for 1991 indicate an increase of reported speech, whereas the terms "the", "of", "and" typical for A in 1931 mirror the general trend of decreasing high frequency terms over time. Other typical terms are mostly due to the different subjects of discourse in A 1931 vs. 1991. Comparing fictional registers over time (e.g. Fiction: Mystery and Detective (1991 vs. 1931)) can also offer interesting insights: The increase of female personal pronouns ("she", "her") in 1991 indicates an increase in female protagonists for this register (maybe also owing to a larger bias towards mystery), which is mirrored by the male personal pronouns ("he", "his") in 1931. The other high frequency terms corroborate the observations on general language: "t", "d" for contraction in 1991, "that", "which", "of" in 1931. At the chosen significance level this is about it. The mere 24 documents in L do not provide more significantly typical terms. At lower "significance" levels (e.g. 0.25) many more terms appear, most of them proper nouns. This is not surprising, as fiction usually is about persons with names. There, too, female proper names appear to occur more often in 1991 than in 1931. However, each individual proper name cannot be regarded as significantly typical here. Contrasting L: 1931 vs. 1961 and L: 1961 vs. 1991 shows that these changes have developed "gradually", i.e., both comparisons look similar to 1931 vs. 1991, though only at the rather low "significance" level of 0.25; there are too few data to provide insight at this rather fine level of distinction.

1.4 Contrast between Registers

Within a time snapshot (selected from the main diagonal of the leftmost heatmap by clicking on a cell twice) it is also interesting to compare individual registers. For example, Miscellaneous, Learned (1961) shows the terms typical for Miscellaneous and Learned in contrast to all other registers in 1961. This mainly allows inspecting the commonalities between two registers in contrast to general language; in this case high frequency terms such as "the", "of", "by", "in", etc. By contrasting these two registers with each other as in Miscellaneous vs. Learned (1961), one can inspect their differences, which largely consist of (lower frequency) domain specific terms such as "company", "government" for Miscellaneous, and "equation", "theory" for Learned. By selecting different time snapshots in the left heatmap, these contrasts can be quickly inspected in another year.

1.5 Explore Yourself

Altogether the 3 subcorpora of the LOB family, split into 15 registers each, allow for 3*3*15*15 = 2025 different selections of contrast. Not all of these will provide interesting insights. As a general rule one should avoid varying more than one parameter, i.e., either keep the time fixed and vary the register for synchronic comparisons among registers, or keep the register fixed and vary the time for diachronic comparisons. This still leaves plenty of available contrast: 3*15*15 (synchronic) + 6*15 (diachronic) = 765. For synchronic comparisons, it is usually more interesting to contrast registers with some expected commonalities, such as the press registers A, B, and C, the fictional registers K through P, or the (formal reporting) registers H and J, etc.

If you run out of interesting contrasts in the LOB family, LOB, Brown offers exploring the combined LOB, Brown family for differences between British and American English in 1961 and 1991, ScaSciTex visualizes the Saarbrücken Corpus of Scientific Text, and Diskurs in der Weimarer Republik allows exploring word usage in several registers during the Weimar Republic.

TODO: Compare empirical "findings" with the vast literature on this corpus (e.g. [1]) and register analysis in general (e.g. Biber). So far it seems that measuring distance and ranking features based on relative entropy (+ t-test) largely agrees with common knowledge, which is good.

2. Technical Background

2.1 Language Models

We use simple unigram language models, i.e., (sub)corpora are represented as a vector of relative term frequencies. The language models are smoothed with Jelinek-Mercer smoothing: p(w) = (1-lambda)*p'(w) + lambda*b(w), where p'(w) is the observed probability of the term in the subcorpus (relative frequency), and b(w) is the observed probability of the term in the entire corpus: BLOB+LOB+FLOB when comparing BLOB with LOB etc., and e.g. LOB when comparing individual registers within LOB. We set lambda = 0.05. For a discussion of further smoothing methods for unigram language models see e.g. [6].
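
As a minimal sketch, the following Python snippet shows how such a smoothed model can be derived from token lists. Only the formula and lambda = 0.05 are taken from the description above; the function names and data structures are illustrative, not the actual implementation.

    from collections import Counter

    LAMBDA = 0.05  # interpolation weight for the background model, as stated above

    def unigram_model(tokens):
        """Relative term frequencies p'(w) of a (sub)corpus given as a list of tokens."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def smooth(p_sub, p_background, lam=LAMBDA):
        """Jelinek-Mercer smoothing: p(w) = (1 - lambda) * p'(w) + lambda * b(w).
        The vocabulary is taken from the background model, so terms unseen in the
        subcorpus still receive a small probability lambda * b(w)."""
        return {w: (1 - lam) * p_sub.get(w, 0.0) + lam * b
                for w, b in p_background.items()}

    # Illustrative usage (the token lists are placeholders):
    # p_lob_a = smooth(unigram_model(tokens_lob_a), unigram_model(tokens_lob))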

2.2 Relative Entropy

Relative Entropy, also known as Kullback-Leibler Divergence, measures the number of additional bits per term needed to encode a message following a language model P by using an encoding optimized for a language model Q. It is defined as follows: KLD(P||Q) = Sum_w p(w)*log_2(p(w)/q(w)), where p(w) is the probability of a term w in P, and q(w) is the probability of w in Q.

To understand this definition, it is useful to break it down into its constituents: -log_2(p(w)) measures the number of bits needed to encode w using an optimal encoding for P. Terms with large p(w) need few bits, terms with small p(w) need more. For example, with p("the") ~= 0.0742 as in BLOB (1931), "the" requires 3.75 bits, whereas the less frequent term "certain" with p("certain") ~= 0.0003 requires 11.7 bits. As can easily be seen, log_2(p(w)/q(w)) = log_2(p(w)) - log_2(q(w)) then gives the number of additional bits needed for encoding w when using an optimal encoding for Q as opposed to P, if p(w) > q(w), or the number of bits saved if p(w) < q(w). Finally, p(w)*log_2(p(w)/q(w)) weights the additional bits with p(w), such that the sum over all w indeed gives the average number of additional bits per term. It can easily be shown that KLD(P||Q) = 0 iff P = Q, and that this is the minimum.

The individual contribution of a term w to the Relative Entropy, KLD_w(P||Q) = p(w)*log_2(p(w)/q(w)), can be used for assessing how "typical" the term is for P as opposed to Q, and thus for term ranking, or more generally feature ranking. By weighting the ratio of term probabilities measured by log_2(p(w)/q(w)) with the term probability p(w), a bias towards overemphasizing large ratios for low frequency terms (as criticized for a variant of this measure by Kilgarriff [1], see 2.4 below) is avoided.
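
As a sketch (assuming the smoothed models from the snippet above, so that both distributions share a vocabulary and contain no zero probabilities), both the overall distance and the induced term ranking can be computed along the following lines:

    from math import log2

    def kld(p, q):
        """Relative entropy KLD(P||Q) in bits per term."""
        return sum(pw * log2(pw / q[w]) for w, pw in p.items() if pw > 0)

    def kld_contributions(p, q):
        """Per-term contributions KLD_w(P||Q) = p(w)*log_2(p(w)/q(w)), sorted
        descending, i.e., terms most typical for P as opposed to Q come first."""
        contrib = {w: pw * log2(pw / q[w]) for w, pw in p.items() if pw > 0}
        return sorted(contrib.items(), key=lambda item: item[1], reverse=True)

    # Code length of a term under an optimal encoding for P,
    # e.g. -log2(0.0742) ~= 3.75 bits for "the" in BLOB (1931):
    def bits(pw):
        return -log2(pw)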

2.3 Testing Significance

As observed by Kilgarriff [3], Gries [4], and many others, even fairly large differences in term frequencies are not necessarily significant, especially if they vary a lot between documents. In the extreme, a term, for example a proper name, that occurs frequently in just one document of a corpus can still lead to a fairly large frequency for the entire corpus, especially if the corpus consists of fairly few documents (the number of documents per register in the LOB family ranges between 6 and 80, the overall number of documents per corpus is 500). Thus a large frequency ratio may well arise just by chance, and not be representative of the corpus contrast at hand.

We use an unpaired t-test (more precisely Welch's t-test, to deal with unequal variances) to test an observed difference between the relative frequencies of a term in two corpora. The t-statistic is defined as follows: t = (mean_1 - mean_2)/sqrt(var_1/n_1 + var_2/n_2), where mean_i is the mean probability (relative document frequency) of a term in Corpus i, var_i is its (empirical) variance, and n_i is the number of documents in Corpus i. Given the t-statistic and the degrees of freedom (calculated from var_i and n_i, formula omitted), one can calculate the so-called p-value, which gives the probability of observing a difference at least as large as the one at hand if the term's true mean frequencies were equal. In statistics parlance, we can reject the null hypothesis H_0 that mean_1 - mean_2 = 0 if the p-value is below a given significance level, typically 0.05 or even 0.01.
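
The following sketch computes the t-statistic and degrees of freedom directly from the formula above, and uses scipy's Welch variant (equal_var=False) for the p-value; the per-document relative frequencies of a term are assumed to be given as plain lists of floats, one value per document.

    from math import sqrt
    from statistics import mean, variance
    from scipy import stats

    def welch_t(freqs_1, freqs_2):
        """Welch's t-statistic and degrees of freedom for one term, given its
        per-document relative frequencies in two corpora."""
        m1, m2 = mean(freqs_1), mean(freqs_2)
        v1, v2 = variance(freqs_1), variance(freqs_2)   # empirical (sample) variances
        n1, n2 = len(freqs_1), len(freqs_2)
        t = (m1 - m2) / sqrt(v1 / n1 + v2 / n2)
        # Welch-Satterthwaite approximation of the degrees of freedom
        df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
        return t, df

    def p_value(freqs_1, freqs_2):
        """Two-sided p-value of the observed difference under Welch's t-test."""
        return stats.ttest_ind(freqs_1, freqs_2, equal_var=False).pvalue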

Even though sometimes conflated, term typicality (as e.g. measured by relative entropy) and term significance are orthogonal. Term typicality asks whether an observed difference matters. Term significance asks whether we have enough data to be confident that the difference is not due to chance. A difference can be significant but not matter (relatively). An insignificant difference cannot be judged for typicality. Thus we use the t-test as a filter for term rankings: only terms whose difference is significant at a (selectable) level are displayed.
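
Combined with the relative entropy contributions from above, the resulting term selection and ranking can be sketched as follows; kld_contributions and p_value are the illustrative helpers defined earlier, and per_doc_freqs_1/2 are hypothetical lookups from a term to its per-document relative frequencies.

    def typical_terms(p, q, per_doc_freqs_1, per_doc_freqs_2, alpha=0.05):
        """Rank terms by their contribution to KLD(P||Q), keeping only terms whose
        frequency difference between the two corpora is significant at level alpha."""
        return [(w, score) for w, score in kld_contributions(p, q)
                if p_value(per_doc_freqs_1[w], per_doc_freqs_2[w]) < alpha]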

Note that the t-test requires a corpus consisting of multiple documents, with document boundaries known. For small n (such as in Register M, consisting of only 6 documents), the t-statistic is not very reliable, because the distribution of the document frequencies of a particular term is then very likely not normal. However, in this case there often exist very few significant differences anyway. Dividing a single document into slices to increase the number of samples does not help, as this would violate the underlying assumption of independently drawn samples.

TODO: Gries [5] introduces the measure DP (deviation of proportions) for taking into account what he calls dispersion (aka variance). Among the measures he discusses, the t-test is notably absent. Thus it would be good to compare the t-test as a filter for relative entropy to his method of frequency adjustment.

2.4 Discussion

Kilgarriff [1] discusses and analyses a variety of distance measures and term rankings, but notably does not take into account Relative Entropy. The two information theoretic measures he discusses are Mutual Information and Cross Entropy:

Mutual information, as defined by Kilgarriff, only takes into account the log part of relative entropy: MI_w(P,Q) = log_2(p(w)/q(w)), where p is the distribution of subcorpus P, and q is the distribution of the entire corpus (including P). Kilgarriff notes that this definition of MI overemphasizes low frequency terms. To see how Relative Entropy can alleviate this problem, it is illustrative to define it in terms of the above definition of MI: KLD_w(P||Q) = p(w)*MI_w(P,Q). By multiplying MI with p(w), low frequency terms receive an overall lower weight. In addition, it seems more appropriate not to compare P with the distribution of the entire corpus, but rather with the distribution of the entire corpus minus P.

Cross Entropy is defined as follows: H(P||Q) = - Sum_w p(w)*log_2(q(w)). Kilgarriff uses so-called "Known Similarity Corpora" (KSC) to compare Cross Entropy with the alternative measures Spearman Rank Correlation and Chi-square, and shows that Chi-square clearly outperforms the other two measures, with Cross Entropy performing worst. To understand this result it is again illustrative to express Cross Entropy in terms of Relative Entropy: H(P||Q) = KLD(P||Q) + H(P), which follows straightforwardly from the definition of Relative Entropy: KLD(P||Q) = Sum_w p(w)*log_2(p(w)) - Sum_w p(w)*log_2(q(w)) = H(P||Q) - H(P).

The key problem with Cross Entropy H(P||Q) as a distance measure among arbitrary pairs of distributions is that it heavily depends on the Entropy of P, H(P)=H(P||P). In particular, when P is more uniformly distributed than P', there always exist distributions Q and Q' such that H(P||Q) > H(P'||Q'), even though by other considerations H(P||Q) < H(P'||Q') should hold.

The construction of KSCs very likely leads to such a situation: Given two corpora with distributions P and Q, a KSC consists of (e.g.) 11 subcorpora with the first one (Q0P) sampled only from P, the second one (Q1P) sampled with 90% from P and 10% from Q, and so on. Essentially these subcorpora thus approximate simple weighted mixtures: QaP ~= (1-a/10)*P + a/10*Q. By this construction, the distances for two pairs of corpora dist(QaP,QbP) and dist(QcP,QdP), a < b, c < d, b <= c, should follow the simple rule dist(QaP,QbP) < dist(QcP,QdP) iff b-a < d-c. As Kilgarriff points out, a perfect measure would respect this rule. The problem with Cross Entropy is that by mixing corpora, the underlying distribution becomes more uniform, and thus the Entropy increases. Indeed, often the (self) Entropy of the "middle" distribution H(Q5P) will be the maximum. Therefore, it can easily happen that, e.g., H(Q4P||Q5P) > H(Q0P||Q2P), even though 5-4 < 2-0, just because H(Q4P) > H(Q0P). Relative Entropy cancels out the effect of (self) Entropy, and thus does not suffer from this problem (see the toy sketch below).
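
The effect can be reproduced with a toy example. The following sketch uses two artificial two-term distributions and the mixture construction above; all numbers are purely illustrative and not taken from Kilgarriff's KSCs.

    from math import log2

    def mix(p, q, a):
        """KSC-style mixture (1 - a/10)*P + (a/10)*Q."""
        return [(1 - a / 10) * pi + (a / 10) * qi for pi, qi in zip(p, q)]

    def cross_entropy(p, q):
        return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def kld(p, q):
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    P, Q = [0.9, 0.1], [0.1, 0.9]          # P skewed one way, Q the other
    m0, m2, m4, m5 = (mix(P, Q, a) for a in (0, 2, 4, 5))

    # Cross entropy violates the rule: the pair with the smaller mixture gap (4,5)
    # appears more distant than the pair with the larger gap (0,2), because H(m4) > H(m0):
    print(cross_entropy(m4, m5), cross_entropy(m0, m2))   # ~1.00 > ~0.59
    # Relative entropy respects the rule:
    print(kld(m4, m5), kld(m0, m2))                        # ~0.02 < ~0.12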

Kilgarriff [1] concludes his discussion on various term ranking methods as follows: "Linguists have long made a distinction approximating to the high/low frequency contrast: form words (or ‘grammar words’ or ‘closed class words’) vs. content words (or ‘lexical words’ or ‘open class words’). The relation between the distinct linguistic behaviour, and the distinct statistical behaviour of high-frequency words is obvious yet intriguing. It would not be surprising if we cannot find a statistic which works well for both high and medium-to-low frequency words. It is far from clear what a comparison of the distinctiveness of a very common word and a rare word would mean." (my emphasis).

As illustrated above, Relative Entropy avoids a bias towards low or high frequency terms. Indeed, we often get a mixture of high and relatively low frequency terms typical for a subcorpus or register.

TODO: much improve the text above; it is illegible atm, and rendering of formulae is only a small part of the problem.

TODO: repeat Kilgarriff's evaluation including Relative Entropy. Unfortunately, Kilgarriff's KSCs are not available anymore, so we need to reconstruct them on the basis of the corpora at hand. (DONE: indeed Relative Entropy performs on par with Chi-square and Spearman rank correlation on KSCs. Results to be detailed.)

Acknowledgements

The implementation of heatmaps heavily borrows from Joe Golike's excellent visualization of house-hunting habits at Trulia.

Term clouds are realized based on Jason Davies' implementation, with invaluable help (for a JavaScript novice) from Lars Ebert's tutorial (in German).

Both sources and this implementation make heavy use of Michael Bostock's JavaScript library for Data-Driven Documents (D3).

Contact

Peter Fankhauser. fankhauser at ids-mannheim.de

References

[1] Adam Kilgarriff: Comparing Corpora. In: International Journal of Corpus Linguistics 6:1 (2001), pp. 97–133.

[2] Lars Hinrichs, Nicholas Smith, Birgit Waibel: Manual of information for the part-of-speech-tagged, post-edited ‘Brown’ corpora. In: ICAME Journal 34 (April 2010), pp. 189–231.

[3] Adam Kilgarriff: Language is never, ever, ever, random. In: Corpus Linguistics and Linguistic Theory 1-2 (2005), pp. 263–275.

[4] Stefan Th. Gries: Null-hypothesis significance testing of word frequencies: a follow-up on Kilgarriff. In: Corpus Linguistics and Linguistic Theory 1-2 (2005), pp. 277–294.

[5] Stefan Th. Gries: Dispersions and adjusted frequencies in corpora. In: International Journal of Corpus Linguistics 13:4 (2008), pp. 403–437.

[6] Chengxiang Zhai, John Lafferty: A study of smoothing methods for language models applied to information retrieval. In: ACM Transactions on Information Systems (TOIS) 22:2 (2004), pp. 179–214.