Measuring and Visualizing Diachronic Word Use

Summary

This visualization extends the original diachronic visualization with a variety of measures to analyse word use along the lines of the domain specific visualization. Currently the following measures are supported:

Corpus

As a corpus we use a portion of the Royal Society Corpus (release 5.1), ranging from 1665 to 1929. The corpus comprises 91.2 million tokens over about 462,000 types and has been split into 27 decades, with the number of tokens per decade ranging between 455,351 and 13,583,475. For a more detailed breakdown see Figure 1. The raw frequencies per decade are available in decFreq1929.zip and decFreq1929Pos.zip.

Figure 1: Number of tokens per decade

Word Embeddings

For computing word embeddings we use word2vec skipgrams (Mikolov et al. 2013) and structured skipgrams introduced by Ling et al. (2015). Whereas skipgrams represent the left/right usage context of a word as a bag of words, structured skipgrams represent each position in the context separately. For characterizing content words, skipgrams and structured skipgrams seem to fare equally well, but structured skipgrams do a better job of characterizing function words.
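The difference between the two context representations can be illustrated with a small sketch. This is not the word2vec training code. It only shows what each architecture "sees" as the context of a target word, and the function names and example sentences are hypothetical.

```python
def skipgram_contexts(tokens, i, window=5):
    """Bag-of-words context: neighbours of tokens[i], positions discarded."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return sorted(tokens[j] for j in range(lo, hi) if j != i)


def structured_contexts(tokens, i, window=5):
    """Position-aware context: each neighbour is paired with its offset."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [(j - i, tokens[j]) for j in range(lo, hi) if j != i]


# Two hypothetical phrases with the same words around "of" in a different order.
a = "the pressure of the gas".split()
b = "the gas of the pressure".split()

# As a bag of words, both contexts of "of" (index 2) are identical ...
same_bag = skipgram_contexts(a, 2, window=2) == skipgram_contexts(b, 2, window=2)
# ... whereas the position-specific contexts differ.
same_positions = structured_contexts(a, 2, window=2) == structured_contexts(b, 2, window=2)
print(same_bag, same_positions)
```

Because function words are largely distinguished by where their neighbours occur rather than by which neighbours occur, keeping the offsets plausibly explains why structured skipgrams characterize them better.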

For computing period specific word embeddings that are aligned with each other, we use two variants of the approach of Dubossarsky et al. (2015) and Fankhauser and Kupietz (2017). Training for the first period is either initialized randomly (noinit) or with embeddings for the complete corpus (init). All subsequent periods are initialized with the embeddings of their previous period. For the noinit option, embeddings for the complete corpus are initialized with embeddings for the last period.
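The order of initialization for the two variants can be sketched as a simple plan of which model is initialized from which. The function name and the labels "corpus" and "random" are hypothetical. That the corpus model is itself initialized randomly under option init is an assumption, since it is not stated explicitly above.

```python
def initialization_plan(periods, variant):
    """Return (model, initialized_from) pairs in training order.

    'corpus' stands for the embeddings of the complete corpus,
    'random' for random initialization.
    """
    if variant not in ("init", "noinit"):
        raise ValueError("variant must be 'init' or 'noinit'")
    plan = []
    if variant == "init":
        # Assumption: the corpus model itself starts from random vectors.
        plan.append(("corpus", "random"))
        previous = "corpus"
    else:
        previous = "random"
    for period in periods:
        # Every period continues training from its predecessor.
        plan.append((period, previous))
        previous = period
    if variant == "noinit":
        # The corpus model is trained last, starting from the last period.
        plan.append(("corpus", periods[-1]))
    return plan


decades = [1660, 1670, 1680]  # shortened list, for illustration only
print(initialization_plan(decades, "init"))
print(initialization_plan(decades, "noinit"))
```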

For words with enough support these two variants seem fairly equivalent. However, low frequency words can behave rather differently: With random initialization (noinit), low frequency words tend to be rather arbitrarily concentrated in the center of the semantic space for the first few periods. Corpus initialization (init) avoids this, but then the positioning of low frequency words may not really reflect their actual usage during the first few periods. Likewise, noinit may bias the representation of low frequency words for the complete corpus towards the representation of the last period. Moreover, noinit also leads to partially erratic movement in the semantic space over time, as evidenced by a larger average distance of word embeddings over time.

Altogether, option init, combined with carefully considering or filtering out low frequency words, is recommended as the default (see footnote 5).

Table 1 lists all variants of embeddings currently available. Column Init distinguishes between corpus initialization and random initialization. Column Train Counts says whether training uses period specific word counts (yes) or the word counts for the complete corpus (no). Using period specific word counts is the more sensible approach, because it properly anneals the learning rate alpha and draws negative samples according to period specific counts rather than corpus wide ones. The original diachronic visualizations, however, have used corpus wide counts, and are included here for comparability. Column Architecture distinguishes between structured skipgrams (recommended) and skipgrams. For the two recommended options there are also embeddings available with part-of-speech attached to words (Column PoS). For all visualizations the underlying embeddings are also available for download.

Rec  PoS  Init  Train Counts  Architecture  Visualization          Vectors
***  no   yes   yes           struct skip   rsc-init-tc1-t3        rsc-vectors-init-tc1-t3.zip
*    no   no    yes           struct skip   rsc-noinit-tc1-t3      rsc-vectors-noinit-tc1-t3.zip
***  yes  yes   yes           struct skip   rsc-pos-init-tc1-t3    rsc-vectors-pos-init-tc1-t3.zip
*    yes  no    yes           struct skip   rsc-pos-noinit-tc1-t3  rsc-vectors-pos-noinit-tc1-t3.zip
-    no   yes   no            struct skip   rsc-init-tc0-t3        rsc-vectors-init-tc0-t3.zip
-    no   no    no            struct skip   rsc-noinit-tc0-t3      rsc-vectors-noinit-tc0-t3.zip
-    no   yes   yes           skipgram      init-tc1-t1            rsc-vectors-init-tc1-t1.zip
-    no   no    yes           skipgram      rsc-noinit-tc1-t1      rsc-vectors-noinit-tc1-t1.zip
-    no   yes   no            skipgram      rsc-init-tc0-t1        rsc-vectors-init-tc0-t1.zip
-    no   no    no            skipgram      rsc-noinit-tc0-t1      rsc-vectors-noinit-tc0-t1.zip
Table 1: Embedding Variants

All embeddings use a hidden layer with dimension 100, a window size of +/-5, negative sampling 10, and a minimum frequency of 5 in the complete corpus. Period specific word embeddings have been trained with 25 iterations, corpus embeddings with 5.

In addition, there is also a visualization of the Spiegel/Zeit-Corpus available. This is based on embeddings with a hidden layer with dimension 200.

Visualization

The visualization consists of four main components:

(1) Bubble Chart

To the left, a bubble chart represents the color encoded semantic space of words, with the size of bubbles proportional to the square root of the relative frequency in the chosen period, and the color indicating the slope of the diachronic growth of a word or the period with its maximum relative frequency. Words can be clicked for further analysis.

(2) Line Chart

To the top right frequency change of individual words is represented by simple line charts showing the fitted 2nd order polynomials of the logit transformed relative frequencies. The line chart also doubles as a selector for individual periods.
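The curve fitting behind the line chart can be sketched as follows: relative frequencies are logit transformed, a 2nd order polynomial is fitted by least squares, and the fitted values are transformed back. This is a minimal pure-Python sketch, not the code of the visualization. The function names and the example frequencies are hypothetical.

```python
import math


def logit(p):
    """Map a relative frequency in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2; returns [a, b, c]."""
    # Power sums for the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


# Hypothetical relative frequencies of one word in five consecutive periods.
rel_freq = [2e-6, 3e-6, 5e-6, 9e-6, 1.2e-5]
periods = list(range(len(rel_freq)))
a, b, c = fit_quadratic(periods, [logit(p) for p in rel_freq])
# Back-transform the fitted curve to relative frequencies.
fitted = [1.0 / (1.0 + math.exp(-(a + b * x + c * x * x))) for x in periods]
print(fitted)
```

Fitting in logit space keeps the back-transformed curve strictly between 0 and 1, which a polynomial fitted directly to the relative frequencies would not guarantee.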

(3) Options

The top right also provides a number of visualization options. The first group of options applies to the bubble chart:

The second group of options applies to the line chart.

The last two buttons are for choosing the number of additional zoom levels to show for bubbles (just a convenience option for producing pretty pictures) and for choosing whether the word table should reload on demand or always (default: on demand). Clicking on the y-axis toggles between absolute and relative scaling.

(4) Word Table

The bottom right provides a word list for the chosen period, with the following columns:

The word list can be sorted on all columns, searched for words, and filtered by giving lower bounds. Filtering and sorting is in particular useful to focus on words with a minimum frequency per million, on words typical for a period (kld > 0), or for looking for words with a high slope in productivity. Clicking on a word zooms and centers the bubble chart on its position in the chosen period.

Macro Analysis

Measures over Time

Figure 2 breaks down productivity by (some) Part-of-Speech. As a general trend, paradigmatic productivity decreases consistently, whereas syntagmatic productivity increases. Nouns NN, verbs VV, and adjectives JJ behave very similarly in diachronic productivity; however, the overall trends are more pronounced for verbs. In particular, verbs start out with higher paradigmatic productivity than nouns and end up with a lower one. Adverbs RB generally have a lower paradigmatic and higher syntagmatic productivity than the average. Finally, determiners DT (like other function words) unsurprisingly have significantly higher syntagmatic productivity, and their diachronic development in paradigmatic productivity is much more erratic. Whereas Figure 2 (a,b) considers all words with minimum frequency 50 in the corpus (after discarding numbers etc.), Figure 2 (c,d) considers only words occurring in all decades, to account for the dominance of the corpus by later periods. As can be seen, the general trends in productivity over time remain. All values are computed as type based means. Averaging over each individual token occurrence corroborates the overall trend of decreasing paradigmatic productivity, whereas syntagmatic productivity per token also decreases slightly.

Figure 2: Productivity over Time by PoS. Panels: (a) paradigmatic, all words; (b) syntagmatic, all words; (c) paradigmatic, words in all decades; (d) syntagmatic, words in all decades.

Figure 3 compares productivity over time between nouns and individual verb classes. The overall trend of verbs decreasing more strongly than nouns in paradigmatic productivity is mainly due to participles (VVG present participle, VVN past participle) and past tense, VVD, which may sometimes be mixed up with VVN. Base form VV and present tense VVZ, VVP show a less pronounced difference to NN. Likewise, participles in general have higher syntagmatic productivity than nouns. In terms of overall increase of relative frequency, nouns stand out by more than doubling over time, but also participles increase by almost 75% (VVN) and 25% (VVG) over time.

Figure 3: Productivity and Frequency over Time: Nouns vs. Verbs. Panels: (a) paradigmatic productivity; (b) syntagmatic productivity; (c) frequency per million.

Correlations

With few exceptions, the introduced measures are largely orthogonal, i.e., they are basically uncorrelated. Table 2 (a) gives the Spearman rank correlations between the measures for the complete corpus. As is to be expected by construction (see footnote 1), there exists a weak positive correlation between frequency per million and typicality; in this case the Pearson correlation is much more pronounced (0.8). There also exists a weak positive correlation between frequency and syntagmatic productivity, i.e., higher frequency words (function words) tend to occur in more different contexts. Conversely, frequency and paradigmatic productivity are negatively correlated, and indeed low frequency words tend to have slightly more paradigmatic neighbours. These correlations are rather robust over different settings (with/without initialization, with/without PoS tagging).

       fpm      typ     pprod  sprod  dist
fpm    1.00     0.46    -0.48  0.43   -0.26
typ    0.46     1.00    -0.20  0.22   0.04
pprod  -0.48    -0.20   1.00   -0.26  0.08
sprod  0.43     0.22    -0.26  1.00   -0.02
dist   -0.26    0.04    0.08   -0.02  1.00
Mean   166.52   56.60   3.18   6.00   0.28
StD    1757.58  524.76  0.68   1.15   0.06
(a) all words in corpus

        fpm      dist   fpmΔ   pprodΔ  sprodΔ
fpm     1.00     -0.26  0.58   0.64    0.19
dist    -0.26    1.00   -0.09  -0.38   0.05
fpmΔ    0.58     -0.09  1.00   0.29    0.72
pprodΔ  0.64     -0.38  0.29   1.00    0.05
sprodΔ  0.19     0.05   0.72   0.05    1.00
Mean    166.52   0.28   -0.56  -2.63   2.38
StD     1757.58  0.06   2.83   2.27    2.05
(b) words in all decades over time
Table 2: Correlations, Means, Standard Deviations
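Why Spearman and Pearson correlations can differ as much as reported for frequency and typicality can be illustrated with a small sketch. With heavy-tailed values such as frequencies, a few very frequent words can dominate the Pearson correlation, whereas the Spearman rank correlation weighs all words equally. The data below are made up purely for illustration and are not taken from the corpus.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up values: four rare words and one extremely frequent word.
fpm = [1, 2, 3, 4, 1000]
typ = [4, 3, 2, 1, 900]
print(pearson(fpm, typ), spearman(fpm, typ))
```

In this constructed example the single frequent word pushes the Pearson correlation close to 1, while the rank correlation is 0.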

Table 2 (b) gives the Spearman rank correlations for the slopes of frequency, paradigmatic productivity, and syntagmatic productivity, taking into account only words that occur in all periods, and thus have rather reliable estimates for the slopes. The positive correlation of fpm (of the complete corpus) and fpmΔ just reflects the fact that the late periods have orders of magnitude more tokens than early periods. Thus words with high fpms also typically have high fpms during late periods, and thus a positive slope fpmΔ. However, the mean slope is roughly zero (larger standard deviation than mean). In contrast, paradigmatic entropy has a significantly negative mean slope, and syntagmatic entropy a significantly positive mean slope. This indicates a general trend of expressing more diverse syntagmatic contexts with fewer paradigmatic choices. Interestingly, the positive correlation between fpm and sprod also shows as a rather strong positive correlation of the slopes fpmΔ and sprodΔ, whereas the negative correlation between fpm and pprod does not translate to a negative correlation of the respective slopes. However, the slope pprodΔ is positively correlated with fpm, i.e., higher frequency words tend to increase in paradigmatic productivity over time. Finally, there is a weak negative correlation between pprodΔ and the (maximum) distance between a word in the complete corpus and in one of the individual periods, indicating that words with increasing paradigmatic productivity tend to be rather stable in their meaning.

(tbd) From previous analysis we know that close paradigmatic neighbours typically also have highly correlated fpmΔ. An open question then is, whether this also holds for productivity.

Micro Analysis

The visualization supports two main paths to finding possibly interesting diachronic developments in word use: Spot patterns and sort&filter. In the following we will illustrate these two by way of some example analyses. Unless explicitly stated otherwise all analyses are carried out on the embeddings with initialization.

Spot Patterns

The bubbles overview allows identifying and zooming to paradigmatic clusters typical for a period, or more precisely, clusters rising or falling fairly consistently. On the complete corpus (All) these clusters show as regions with a dominant color (bluish for falling, reddish for rising). Figure 4 (a) shows two example clusters, a rising one with chemical compounds, and a falling one with communicative verbs in present tense.

Because the corpus is dominated by late periods, falling clusters are less visible than rising clusters. These clusters can be more easily identified by selecting a particular period, preferably with option zoom: kld rather than zoom: freq clst to show the most typical words for a period. For example, Figure 4 (b) shows two falling paradigmatic clusters in 1680-89, one grammatical (of wh-adverbs) and one thematic (solar system).

Figure 4: Paradigmatic Clusters. Panels: (a) complete corpus; (b) 1680.

Sort&Filter

Sorting (usually in descending order) is a simple means to identify possibly characteristic words for a particular decade or in diachronic change. For example, Table 3 (a) lists the 10 most typical words in comparison with the complete corpus (kld) over time in steps of 50 years. The first three decades are dominated by (personal) pronouns and conjunctions indicating a personal reporting style (letters to the editor), the second three decades by rather generic thematic words (and symbols). The last column gives the average typicality of words in the complete corpus compared to individual periods. These clearly indicate the nominal style of the frequency-wise dominant later periods. Table 3 (b) shows that the transition from personal reporting style to nominal style is rather continuous: When comparing a decade with its immediately preceding decade, the most typical words in the first three decades overlap very much with the most typical words for the complete corpus. Again the second three decades, where the nominal style has settled in, are characterized by generic thematic words.

1670    1720      1770         1820          1870     1920    All
,       i         ,            i             ;        )       the
he      'd        i            needle        w.       (       ;
hath    it        ditto        observations  e.       is      of
that    's        it           inches        on       for     )
they    they      air          mean          8vo      vol.    (
:       pox       's           r             \amp     in      in
them    them      wind         magnets       n.       p.      is
and     or        quicksilver  grains        s        per     a
it      he        them         :             sea      values  on
some    aequator  will         distance      fathoms  the     and
(a) Decade vs. Corpus

1670   1720      1770   1820          1870   1920   All
,      ,         the    needle        ;      (      the
of     the       ,      ;             \amp   )      of
the    of        in     at            8vo    the    ,
and    to        is     mean          \apos  of     )
in     will      of     magnets       w.     and    (
;      be        a      time          s      a      ;
a      a         it     inches        e.     is     in
which  in        ;      obs.          \lt    for    is
to     force     and    nerve         on     golgi  vol.
that   aequator  ditto  observations  \gt    fig.   a
(b) Decade vs. Preceding Decade
Table 3: Typicality over Time
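The ranking by typicality used for Table 3 can be sketched directly from the definition in footnote 1. This is a minimal sketch on made-up counts, not the code of the visualization. The function names are hypothetical, the logarithm is taken to base 2 (bits), and skipping words unseen in the comparison corpus rather than smoothing them is an assumption.

```python
import math


def typicality(counts_1, counts_2):
    """Typ(x, D1, D2) = p(x|D1) * log2(p(x|D1) / p(x|D2)) for each x in D1."""
    n1, n2 = sum(counts_1.values()), sum(counts_2.values())
    scores = {}
    for word, c1 in counts_1.items():
        c2 = counts_2.get(word, 0)
        if c2 == 0:
            continue  # assumption: unseen words are skipped, not smoothed
        p1, p2 = c1 / n1, c2 / n2
        scores[word] = p1 * math.log2(p1 / p2)
    return scores


def most_typical(counts_1, counts_2, k=10):
    """The k words with the highest typicality, in descending order."""
    scores = typicality(counts_1, counts_2)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up unigram counts for a period and a comparison corpus.
period = {"he": 3, "the": 1}
corpus = {"he": 1, "the": 3}
print(typicality(period, corpus))
print(most_typical(period, corpus, k=1))
```

Summing the scores over all words gives the Kullback-Leibler divergence between the two language models. Words with a positive score are more probable in the period than in the comparison corpus, which is what the filter kld > 0 in the word table selects.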

Finally, Table 4 looks at (a) paradigmatic and (b) syntagmatic productivity. To have enough data points for estimating the slope in productivity, we consider only words occurring in at least 20 out of 27 decades, with a minimum frequency per million of 100, and we filter out non-alphabetic words (using the regular expression ^[A-Za-z]+$ in the search field for column word). In absolute terms, named entities (and numbers; see footnote 6) have the highest paradigmatic productivity (pprod↓). They tend to occur in rather regular contexts and they constitute a very open class. In contrast, closed-class function words (pprod↑, sorted in ascending order) have low paradigmatic productivity. In terms of change over time, adverbs used for construing discourse stand out with increasing paradigmatic productivity (pprodΔ↓). The list of words with decreasing paradigmatic productivity (pprodΔ↑) seems a bit arbitrary, but then just looking at the top 10 words does not always suffice.

pprod↓      pprodΔ↓       pprod↑  pprodΔ↑
june        consequently  of      contact
july        thus          the     distribution
usually     still         alone   to
april       rather        our     group
march       again         per     of
january     however       those   active
soon        probably      any     independent
apparently  completely    for     unit
entirely    even          its     you
completely  air           to      standard
(a) Paradigmatic Productivity

sprod↓  sprodΔ↓      sprod↑     sprodΔ↑
des     completely   however    however
de      previously   presence   hand
der     j            existence  said
j       rapidly      regarded   indeed
being   gas          regard     author
or      group        roy        course
were    arrangement  addition   think
was     der          top        do
et      average      bottom     due
of      apparently   royal      communicated
(b) Syntagmatic Productivity
Table 4: Productivity in Corpus and over Time
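Paradigmatic productivity as defined in footnote 2 can be sketched as follows. The input is a list of (cosine similarity, frequency) pairs for a word and its neighbours, including the word itself with similarity 1. The function name is hypothetical, and keeping the 25 most similar entries when more exceed the threshold is an assumption about how the limit of 25 words is applied.

```python
import math


def paradigmatic_productivity(neighbours, theta=0.6, max_words=25):
    """Entropy (in bits) over the close paradigmatic neighbourhood of a word.

    neighbours: (cosine, frequency) pairs, including the word itself
    with cosine 1.0.
    """
    # Keep neighbours above the similarity threshold, most similar first.
    close = sorted((n for n in neighbours if n[0] > theta), reverse=True)
    close = close[:max_words]
    # Each neighbour is weighted by cosine similarity times frequency.
    weights = [cos * freq for cos, freq in close]
    total = sum(weights)
    if total == 0:
        return 0.0
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)


# A word without close neighbours: only the word itself remains.
print(paradigmatic_productivity([(1.0, 100), (0.4, 50)]))
# A word in a dense, evenly distributed neighbourhood (made-up values).
print(paradigmatic_productivity([(1.0, 10)] + [(0.8, 10)] * 9))
```

Under these assumptions the measure is 0 for a word without close neighbours and reaches its maximum when 25 equally frequent entries all have similarity 1.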

The words with the highest syntagmatic productivity (Table 4 (b) sprod↓) are mostly function words, corroborating the analysis (DT) in Figure 2. The words with the lowest syntagmatic productivity sprod↑ appear to be words which occur mostly in rather fixed contexts (in existence, presence, with regard to, ...). Words with increasing syntagmatic productivity sprodΔ↓ comprise adverbs but also generic nouns with presumably many different modifiers (gas). Finally, words with decreasing syntagmatic productivity sprodΔ↑ indicate words increasingly being used in rather fixed contexts (e.g. due to, other hand, do not).
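Syntagmatic productivity as defined in footnote 3 can be sketched as the entropy over the immediate left and right neighbours of a word. This is a minimal sketch on a toy token sequence. The function name is hypothetical, and pooling left and right neighbours into one distribution as well as using base 2 for the logarithm are assumptions.

```python
import math
from collections import Counter


def syntagmatic_productivity(tokens, word, window=1):
    """Entropy (in bits) over the neighbours of `word` within +/- window."""
    contexts = Counter()
    for i, token in enumerate(tokens):
        if token != word:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Assumption: left and right neighbours are pooled.
        contexts.update(tokens[j] for j in range(lo, hi) if j != i)
    total = sum(contexts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in contexts.values())


# Toy example: "due" occurs in a more fixed context than "the".
tokens = "due to the gas due to the heat".split()
print(syntagmatic_productivity(tokens, "due"))
print(syntagmatic_productivity(tokens, "the"))
```

In the toy sequence, "due" is followed by "to" in both occurrences and thus gets a lower value than "the", in line with the observation that words in fixed contexts have low syntagmatic productivity.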

Further Reading

For more information see Bizzoni et al. 2020 and Teich et al. 2021.

Footnotes

  1. Typicality of a word \(x\) is defined as its contribution to the Kullback-Leibler Divergence between the (unigram) language model of a period (or more generally corpus) \(D_1\) to the language model of another period \(D_2\), also called relative entropy: \[Typ(x,D_1,D_2) = p(x|D_1) log(p(x|D_1)/p(x|D_2))\] This gives the number of bits lost when encoding \(x\) with an optimal encoding for \(D_2\) instead of \(D_1\). (see also: Contrastive Analysis)
  2. Paradigmatic Productivity is measured by the entropy over all close paradigmatic neighbours \(x_i\) of a word \(x\), including \(x\): \[ParProd(x)=-\sum_{cos(x_i,x) \gt \theta} p(x_i|C_x) log(p(x_i|C_x))\] \[\textrm{with}\:p(x_i|C_x) = \frac{cos(x_i,x) freq(x_i)}{\sum_{x_j}cos(x_j,x)freq(x_j)}\] i.e., \(p(x_i|C_x)\) is the conditional probability of word \(x_i\) in the close neighbourhood of word \(x\), weighted by the cosine similarity between \(x_i\) and \(x\) (max 25 words, cosine similarity \(\gt \theta = 0.6\)). For the chosen parameters this measure ranges between 0 (no neighbours) and \(log_2(25) = 4.64\) (all 25 neighbours with maximum similarity 1, uniformly distributed). Note that the term productivity is borrowed from analysis of word formation. Pexman et al. (2008) employ a closely related measure - number of paradigmatic neighbours - as an aspect of semantic richness.
  3. Syntagmatic Productivity is measured by the entropy over all syntagmatic neighbours of a word \(x\) within a window \(C_x\) of +/- 1: \[SynProd(x)=-\sum_{c_i \in C_x} p(c_i|x) log(p(c_i|x))\] \[\textrm{with}\: C_x = \{x_{-1}|x_{-1},x\} \cup \{x_1|x,x_1\}\] We have also looked at syntagmatic productivity in the window +/- 3, and the corresponding left and right only windows. All these windows result in productivities that are highly correlated (\(\gt 0.9\) Spearman and Pearson) with those of the chosen window.
  4. Other measures for ambiguity exist, such as curvature aka clustering coefficient (Dorow et al. 2005). Moreover, there exist more sophisticated approaches to analyse and detect ambiguity via word embeddings, for an overview see e.g. Van Landeghem (2016).
  5. We get rather convincing correlations \(\gt 0.7\) on all measures between init and noinit options for words with \(fpm(x) > 10\) in the complete corpus.
  6. Incidentally, foreign words occurring in stretches of text in foreign language also have an artificially high paradigmatic productivity: They are simply considered as one close knit neighbourhood of aliens.

Contact

Peter Fankhauser fankhauser at ids-mannheim.de

References