This visualization extends the original diachronic visualization with a variety of measures to analyse word use along the lines of the domain specific visualization. Currently the following measures are supported:
As a corpus we use a portion of the Royal Society Corpus (release 5.1), ranging from 1665 to 1929. The corpus comprises 91.2 million tokens over about 462,000 types and has been split into 27 decades, with the number of tokens per decade ranging between 455,351 and 13,583,475. For a more detailed breakdown see Figure 1. The raw frequencies per decade are available in decFreq1929.zip and decFreq1929Pos.zip.
For computing word embeddings we use word2vec skipgrams (Mikolov et al. 2013) and the structured skipgrams introduced by Ling et al. (2015). Whereas skipgrams represent the left/right usage context of a word as a bag of words, structured skipgrams represent each position in the context separately. For characterizing content words, skipgrams and structured skipgrams seem to fare equally well, but structured skipgrams do a better job of characterizing function words.
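The difference between the two context representations can be sketched as follows. This is only an illustration of what each architecture sees as context, not the training code; the function names and the example sentence are made up.

```python
def skipgram_context(tokens, i, window=5):
    """Bag-of-words context: neighbours within the window, positions discarded."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return sorted(tokens[j] for j in range(lo, hi) if j != i)


def structured_context(tokens, i, window=5):
    """Position-aware context: each neighbour is paired with its relative offset."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [(j - i, tokens[j]) for j in range(lo, hi) if j != i]


tokens = "the air of the vessel".split()
# Context of "of" (index 2) with a window of +/-2.
print(skipgram_context(tokens, 2, window=2))
print(structured_context(tokens, 2, window=2))
```

In the bag-of-words view the two occurrences of "the" are indistinguishable, whereas the structured view keeps them apart as offsets -2 and +1, which is the kind of positional information that matters for function words.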
For computing period specific word embeddings that are aligned with each other, we use two variants of the approach of Dubossarsky et al. (2015) and Fankhauser and Kupietz (2017). Training for the first period is either initialized randomly (noinit) or with embeddings for the complete corpus (init). All subsequent periods are initialized with the embeddings of their previous period. For the noinit option, embeddings for the complete corpus are initialized with embeddings for the last period.
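The two initialization chains can be written down as a simple training plan. The sketch below only records which model is initialized from which; the function name init_plan and the label "corpus" are hypothetical, and for the init variant it assumes that the corpus model itself starts from random weights, which is not stated above.

```python
def init_plan(periods, variant):
    """Return (model, initialized_from) pairs in training order.

    None stands for random initialization.
    """
    if variant not in ("init", "noinit"):
        raise ValueError("variant must be 'init' or 'noinit'")
    plan = []
    if variant == "init":
        # Assumption: the corpus model is trained first, from random weights.
        plan.append(("corpus", None))
        plan.append((periods[0], "corpus"))
    else:
        plan.append((periods[0], None))
    # Every later period starts from the embeddings of its predecessor.
    for prev, cur in zip(periods, periods[1:]):
        plan.append((cur, prev))
    if variant == "noinit":
        # The corpus model is initialized with the last period.
        plan.append(("corpus", periods[-1]))
    return plan


for step in init_plan(["1660", "1670", "1680"], "noinit"):
    print(step)
```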
For words with enough support these two variants seem fairly equivalent. However, low frequency words can behave rather differently: with random initialization (noinit), low frequency words tend to be rather arbitrarily concentrated in the center of the semantic space for the first few periods. Corpus initialization (init) avoids this, but then the positioning of low frequency words may not really reflect their actual usage during the first few periods. Likewise, noinit may bias the representation of low frequency words for the complete corpus towards the representation of the last period. Moreover, noinit also leads to partially erratic movement in the semantic space over time, as evidenced by a larger average distance of word embeddings over time.
Altogether, option init, combined with carefully considering or filtering out low frequency words, is recommended as the default.
Table 1 lists all variants of embeddings currently available. Column Init distinguishes between corpus initialization and random initialization. Column Train Counts says whether training uses period specific word counts (yes) or the word counts for the complete corpus (no). Using period specific word counts is the more sensible approach, because it properly anneals the learning rate alpha and draws negative samples according to period specific counts rather than corpus wide ones. The original diachronic visualizations, however, have used corpus wide counts, and are included here for comparability. Column Architecture distinguishes between structured skipgrams (recommended) and skipgrams. For the two recommended options there are also embeddings available with part-of-speech attached to words (Column PoS). For all visualizations the underlying embeddings are also available for download.
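Why the counts matter for the learning rate can be illustrated with the linear decay schedule of the reference word2vec implementation. The sketch below assumes that schedule and a starting rate of 0.025, neither of which is stated here, and the function name annealed_alpha is made up for illustration. It uses the token count of the smallest decade and the size of the complete corpus given above, with 25 iterations.

```python
def annealed_alpha(words_seen, train_words, iterations, start_alpha=0.025):
    """Linear learning-rate decay with a small floor (assumed word2vec schedule)."""
    alpha = start_alpha * (1 - words_seen / (iterations * train_words + 1))
    return max(alpha, start_alpha * 0.0001)


period_tokens = 455_351      # smallest decade
corpus_tokens = 91_200_000   # complete corpus (rounded)
iterations = 25
seen_at_end = iterations * period_tokens

# Learning rate at the end of training on the smallest decade, depending on
# which token count the schedule is based on.
print(annealed_alpha(seen_at_end, period_tokens, iterations))
print(annealed_alpha(seen_at_end, corpus_tokens, iterations))
```

Under these assumptions, a schedule based on the period specific count decays to the floor by the end of training, whereas a schedule based on the corpus wide count has hardly decayed at all for a small period.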
All embeddings use a hidden layer with dimension 100, a window size of +/-5, negative sampling 10, and a minimum frequency of 5 in the complete corpus. Period specific word embeddings have been trained with 25 iterations, corpus embeddings with 5.
In addition, there is also a visualization of the Spiegel/Zeit-Corpus available. This is based on embeddings with a hidden layer with dimension 200.
The visualization consists of four main components:
To the left, a bubble chart represents the color encoded semantic space of words, with the size of bubbles proportional to the square root of the relative frequency in the chosen period, and the color indicating the slope of the diachronic growth of a word or the period with its maximum relative frequency. Words can be clicked for further analysis.
To the top right frequency change of individual words is represented by simple line charts showing the fitted 2nd order polynomials of the logit transformed relative frequencies. The line chart also doubles as a selector for individual periods.
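A minimal sketch of such a fit, using only the standard library: the relative frequencies are logit transformed and a 2nd order polynomial is fitted by least squares via the normal equations. The function names and the example frequencies are invented for illustration, and how the visualization itself computes the fit is not specified here.

```python
import math


def logit(p):
    """Map a relative frequency in (0, 1) to the real line."""
    return math.log(p / (1 - p))


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2; returns [a, b, c]."""
    # Power sums for the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


# Invented relative frequencies of one word in five consecutive periods.
rel_freq = [0.00010, 0.00012, 0.00018, 0.00030, 0.00055]
periods = list(range(len(rel_freq)))
a, b, c = fit_quadratic(periods, [logit(p) for p in rel_freq])

# Back-transform the fitted curve to relative frequencies for plotting.
fitted = [1 / (1 + math.exp(-(a + b * x + c * x * x))) for x in periods]
print(fitted)
```

Fitting in logit space keeps the back-transformed curve strictly between 0 and 1.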
The top right also provides a number of visualization options. The first group of options applies to the bubble chart:
The last two buttons are for choosing the number of additional zoom levels to show for bubbles (just a convenience option for producing pretty pictures), and for choosing whether the word table should reload on demand or always (default: on demand). Clicking on the y-axis toggles between absolute and relative scaling.
The bottom right provides a word list for the chosen period, with the following columns:
The word list can be sorted on all columns, searched for words, and filtered by giving lower bounds. Filtering and sorting is in particular useful to focus on words with a minimum frequency per million, on words typical for a period (kld > 0), or for looking for words with a high slope in productivity. Clicking on a word zooms and centers the bubble chart on its position in the chosen period.
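The same kind of filtering and sorting can be reproduced offline on a downloaded word table. The sketch below assumes rows with the keys word, fpm and kld, treats the search as a regular expression and the lower bounds as inclusive; the function name filter_rows and the example values are invented.

```python
import re


def filter_rows(rows, pattern=None, lower_bounds=None, sort_by=None, descending=True):
    """Keep rows matching the word pattern and all lower bounds, then sort."""
    regex = re.compile(pattern) if pattern else None
    bounds = lower_bounds or {}
    kept = [
        r for r in rows
        if (regex is None or regex.search(r["word"]))
        and all(r[col] >= lo for col, lo in bounds.items())
    ]
    if sort_by:
        kept.sort(key=lambda r: r[sort_by], reverse=descending)
    return kept


# Invented example rows; real values come from the word list of a period.
rows = [
    {"word": "oxygen", "fpm": 250.0, "kld": 0.8},
    {"word": "whereof", "fpm": 120.0, "kld": -0.3},
    {"word": "1/2", "fpm": 400.0, "kld": 0.1},
    {"word": "phlogiston", "fpm": 12.0, "kld": 0.5},
]

# Alphabetic words with at least 100 occurrences per million, most typical first.
for row in filter_rows(rows, pattern=r"^[A-Za-z]+$", lower_bounds={"fpm": 100}, sort_by="kld"):
    print(row["word"], row["fpm"], row["kld"])
```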
Figure 2 breaks down productivity by (some) Part-of-Speech. As a general trend, paradigmatic productivity decreases consistently, whereas syntagmatic productivity increases. Nouns NN, verbs VV, and adjectives JJ behave very similarly in diachronic productivity; however, the overall trends are more pronounced for verbs. In particular, verbs start out with higher paradigmatic productivity than nouns and end up with a lower one. Adverbs RB generally have a lower paradigmatic and higher syntagmatic productivity than the average. Finally, determiners (like other function words) unsurprisingly have significantly higher syntagmatic productivity, and their diachronic development in paradigmatic productivity is much more erratic. Whereas Figure 2 (a,b) considers all words with minimum frequency 50 in the corpus (after discarding numbers etc.), Figure 2 (c,d) considers only words occurring in all decades, to account for the dominance of the corpus by later periods. As can be seen, the general trends in productivity over time remain. All values are computed as type based means. Averaging over each individual token occurrence corroborates the overall trend of decreasing paradigmatic productivity, whereas syntagmatic productivity per token also decreases slightly.
Figure 3 compares productivity over time between nouns and individual verb classes. The overall trend of verbs decreasing more strongly than nouns in paradigmatic productivity is mainly due to participles (VVG present participle, VVN past participle) and past tense, VVD, which may sometimes be mixed up with VVN. Base form VV and present tense VVZ, VVP show a less pronounced difference to NN. Likewise, participles in general have higher syntagmatic productivity than nouns. In terms of overall increase of relative frequency, nouns stand out by more than doubling over time, but also participles increase by almost 75% (VVN) and 25% (VVG) over time.
With few exceptions, the introduced measures are largely orthogonal, i.e., they are basically uncorrelated. Table 2 (a) gives the Spearman rank correlations between the measures for the complete corpus. As is to be expected by construction, there exists a weak positive correlation between frequency per million and typicality; in this case the Pearson correlation is much more pronounced (0.8). There also exists a weak positive correlation between frequency and syntagmatic productivity, i.e., higher frequency words (function words) tend to occur in more different contexts. Conversely, frequency and paradigmatic productivity are negatively correlated, and indeed low frequency words tend to have slightly more paradigmatic neighbours. These correlations are rather robust over different settings (with/without initialization, with/without PoS tagging).
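For reference, the Spearman rank correlation is the Pearson correlation of the ranks, with tied values receiving their average rank. The following standard-library sketch shows both coefficients side by side on invented numbers; it illustrates that the two can differ considerably on the same data, not the values reported in Table 2.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented, monotone but strongly skewed data.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 1000]
print(spearman(xs, ys), pearson(xs, ys))
```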
Table 2 (b) gives the Spearman rank correlations for the slopes of frequency, paradigmatic productivity, and syntagmatic productivity, taking only into account words that occur in all periods, and thus have rather reliable estimates for the slopes. The positive correlation of fpm (of the complete corpus) and fpmΔ just reflects the fact that the late periods have orders of magnitude more tokens than early periods. Thus words with high fpms also typically have high fpms during late periods, and thus a positive slope fpmΔ. However, the mean slope is roughly zero (larger standard deviation than mean). In contrast paradigmatic entropy has a significantly negative mean slope, and syntagmatic entropy a significantly positive mean slope. This indicates a general trend of expressing more diverse syntagmatic contexts with fewer paradigmatic choices. Interestingly, the positive correlation between fpm and sprod also shows as a rather strong positive correlation of the slopes fpmΔ and sprodΔ, whereas the negative correlation between fpm and pprod does not translate to a negative correlation of the respective slopes. However, the slope pprodΔ is positively correlated with fpm, i.e., higher frequency words tend to increase in paradigmatic productivity over time. Finally, there is a weak negative correlation between pprodΔ and the (maximum) distance of a word in the complete corpus and one of the individual periods, indicating that words with increasing paradigmatic productivity tend to be rather stable in their meaning.
(tbd) From previous analysis we know that close paradigmatic neighbours typically also have highly correlated fpmΔ. An open question then is, whether this also holds for productivity.
The visualization supports two main paths to finding possibly interesting diachronic developments in word use: Spot patterns and sort&filter. In the following we will illustrate these two by way of some example analyses. Unless explicitly stated otherwise all analyses are carried out on the embeddings with initialization.
The bubbles overview makes it possible to identify and zoom to paradigmatic clusters typical for a period, or more precisely, clusters rising or falling fairly consistently. On the complete corpus (All) these clusters show as regions with a dominant color (bluish for falling, reddish for rising). Figure 4 (a) shows two example clusters, a rising one with chemical compounds, and a falling one with communicative verbs in present tense.
Because the corpus is dominated by late periods, falling clusters are less visible than rising clusters. These clusters can be more easily identified by selecting a particular period, preferably with option zoom: kld rather than zoom: freq clst to show the most typical words for a period. For example, Figure 4 (b) shows two falling paradigmatic clusters in 1680-89, one grammatical (of wh-adverbs) and one thematic (solar system).
Sorting (usually in descending order) is a simple means to identify possibly characteristic words for a particular decade or in diachronic change. For example, Table 3 (a) lists the 10 most typical words in comparison with the complete corpus (kld) over time in steps of 50 years. The first three decades are dominated by (personal) pronouns and conjunctions indicating a personal reporting style (letters to the editor), the second three periods by rather generic thematic words (and symbols). The last column gives the average typicality of words in the complete corpus compared to individual periods. These clearly indicate the nominal style of the frequency wise dominant later periods. Table 3 (b) shows that the transition from personal reporting style to nominal style is rather continuous: when comparing a decade with its immediately preceding decade, the most typical words in the first three decades overlap very much with the most typical words for the complete corpus. Again the second three decades, where the nominal style has settled in, are characterized by generic thematic words.
Finally, Table 4 looks at (a) paradigmatic and (b) syntagmatic productivity. To have enough datapoints for estimating the slope in productivity, here we consider only words occurring in at least 20 out of 27 decades, with a minimum frequency per million of 100, and we filter out non-alphabetic words (by the regular expression ^[A-Za-z]+$ in the search field for column word). In absolute terms, named entities (and numbers) have the highest paradigmatic productivity (pprod↓). They tend to occur in rather regular contexts and they constitute a very open class. In contrast, closed class function words (pprod↑, sorted in ascending order) have low paradigmatic productivity. In terms of change over time (pprodΔ↓), adverbs used for construing discourse stand out with increasing paradigmatic productivity. The list of words with decreasing paradigmatic productivity seems a bit arbitrary though, but then just looking at the top 10 words does not always suffice.
The words with the highest syntagmatic productivity (Table 4 (b) sprod↓) are mostly function words, corroborating the analysis (DT) in Figure 2. The words with the lowest syntagmatic productivity (sprod↑) appear to be words which occur mostly in rather fixed contexts (in existence, presence, with regard to, ...). Words with increasing syntagmatic productivity (sprodΔ↓) comprise adverbs but also generic nouns with presumably many different modifiers (gas). Finally, words with decreasing syntagmatic productivity (sprodΔ↑) are words increasingly used in rather fixed contexts (e.g. due to, other hand, do not).
Peter Fankhauser fankhauser at ids-mannheim.de