Part II: searching and analyzing your corpus
In the previous post, we learned how to create, update and edit our corpus. In this second part we are going to fully exploit that work by using the analytical tools offered by HyperbaseWeb. To briefly recap: HyperbaseWeb is a piece of software developed by Laurent Vanni at the laboratory UMR 7320: Bases, Corpus, Langage. It is a free online tool that allows the user to easily manipulate textual corpora: it makes it possible to run complex searches, extract cooccurrence information, and perform statistical analyses presented through clear visualizations. For all of these functions, the user can download the underlying data. Here we will focus on four aspects: how to get general information about your corpus, how to search it, how to study cooccurrences, and how to analyze the distribution of linguistic features in the corpus.
1. Reading the corpus
If you click on the upper-left icon of the menu and select “Read”, you can obtain general information about the corpus. You can look for information about the corpus as a whole, or about one of its partitions (and here you can start to see the importance of manipulating the partitions). To navigate between these options, click on “The entire corpus” in the upper right corner of the page and select the partition you are interested in. You can change this at any moment, and this also works for the other functions of the website (searching, cooccurrences, etc.).
If you have “The entire corpus” activated, the Read page will give you information about the most frequent forms, lemmata and grammatical categories used (clicking on “frequence” switches to the least frequent ones). For instance, in our case, “the” is the most frequent form and lemma, and nouns the most frequent category. If you look further down the list of frequent lemmata, you will see that “people” is the first content word to appear: that is rather interesting!
If you select one of the partitions, you will get information about the words, lemmata and grammatical categories that are most specific to the partition when compared to the full corpus. Specific elements are those that appear in the partition more frequently than you would expect given their average frequency in the corpus. You can get the same information for under-represented features, i.e. those that appear less frequently than expected (click on “écart+”, which then turns into “écart-”). The “écart” column gives the specificity score, followed by the frequency in the whole corpus and in the partition. If you select the Queen, “Christmas” and “Commonwealth” are her most specific words: for the first term, this is clearly linked to her Christmas broadcasts! What about the under-represented words? For Her Majesty the Queen, they are a punctuation sign, “you” and “health”: quite intriguing!
At the top of the page you can also click on “Distance” and “Cooccurrences”. “Distance” represents the corpus as a tree with branches and leaves to highlight the distance between the partitions: here, since we are focused on the “authors” level, the tree obviously has only a few branches, and it is rather amusing to notice that the royal “couples” end up associated, as you can see in the image below.
Finally, the “Cooccurrences” tool displays the proximity/distance of words depending on how frequently they appear within the same span of text of a certain size. You can define that size, as well as the number of words to display, through the settings icon. The colors represent “clusters” of words, again established on the basis of their cooccurrences; you can decide how many clusters you want. In this example, some clear thematic clusters emerge.
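To make the idea of “appearing in the same span of text” concrete, here is a toy sketch in Python of span-based cooccurrence counting. This is only an illustration of the principle: HyperbaseWeb's actual implementation is not public, and the sliding-window choice and function name here are my own.

```python
# Toy sketch of span-based cooccurrence counting: two words "cooccur" when
# they fall inside the same window of `size` consecutive tokens.
from collections import Counter
from itertools import combinations

def window_cooccurrences(tokens, size):
    """Count, for every unordered word pair, the windows in which both occur."""
    pairs = Counter()
    for i in range(len(tokens) - size + 1):
        window = sorted(set(tokens[i:i + size]))   # dedupe within the window
        pairs.update(combinations(window, 2))
    return pairs

tokens = "the people of scotland greet the people".split()
counts = window_cooccurrences(tokens, 3)
# counts[("people", "the")] tells how often the two words share a 3-token span
```

Words whose pair counts are high relative to their individual frequencies end up close together in the visualization; the clustering is then computed on top of these counts.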
2. Searching the corpus
Now, suppose you want to find a specific form, lemma, grammatical code or sequence in your corpus. Well, good news: all of these are possible. If you search for a form, you just need to type it. If you want a lemma, search for it preceded by “LEM:”: for instance, LEM:see will return all the forms of “to see” (see, seen, saw, etc.). If you want an exact sequence, put it between inverted commas.
If you click on the top-right menu and open the “help” section, you will find the list of grammatical codes that can be searched. In this way, you can build combinations such as “JJ LEM:day”, meaning that you are searching for an adjective followed by “day” or “days” (results: “old days”, “dark days”, “last day”). If you want to leave a slot unspecified, you can use the “???” sequence: “the ??? NNS” returns “the Brazilian people”, “an entire generation”, etc., where NNS stands for plural and collective nouns. This is what the results page looks like.
If you click on the left icon with the book image, you directly access the text from which the result is taken.
You can play with the settings here as well: you can define how much context should appear, whether the words should be shown as forms, lemmata or grammatical categories, whether you want to filter for specific categories (perhaps you only want to see the verbs around your search term), and how many search results to display. Finally, you can decide whether the metadata should appear. When you use the upper right icon to download, all of these parameters are preserved in the resulting .csv file.
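Before moving on, the logic of these query patterns can be sketched in a few lines of Python, matched against a POS-tagged, lemmatized token list. This is a toy illustration only: HyperbaseWeb's internals are not public, and the tuple layout and function names below are my own; the tag codes (JJ, NNS) follow the Penn Treebank-style codes mentioned above.

```python
# Toy pattern search over tokens represented as (form, lemma, pos) tuples.
# Query parts: "???" matches anything, "LEM:x" matches on the lemma,
# an all-uppercase part matches the POS code, anything else matches the form.

def parse_query(query):
    """Turn a query like 'JJ LEM:day' into a list of token predicates."""
    preds = []
    for part in query.split():
        if part == "???":                          # wildcard slot
            preds.append(lambda tok: True)
        elif part.startswith("LEM:"):              # lemma constraint
            preds.append(lambda tok, l=part[4:]: tok[1] == l)
        elif part.isupper():                       # grammatical code
            preds.append(lambda tok, t=part: tok[2] == t)
        else:                                      # exact form
            preds.append(lambda tok, f=part: tok[0] == f)
    return preds

def search(tokens, query):
    """Return every token span matching the query."""
    preds = parse_query(query)
    n = len(preds)
    return [
        tokens[i:i + n]
        for i in range(len(tokens) - n + 1)
        if all(p(tok) for p, tok in zip(preds, tokens[i:i + n]))
    ]

corpus = [
    ("in", "in", "IN"), ("the", "the", "DT"), ("old", "old", "JJ"),
    ("days", "day", "NNS"), ("the", "the", "DT"),
    ("Brazilian", "brazilian", "JJ"), ("people", "people", "NNS"),
]

hits = search(corpus, "JJ LEM:day")   # matches the span "old days"
```

Running `search(corpus, "the ??? NNS")` on this toy corpus would likewise return two spans, mirroring the “the Brazilian people” example above.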
3. Studying cooccurrences
Via this option you can discover which words tend to appear frequently around a search term. What does “occurring frequently around a search term” mean? Well, let's suppose I want to know which words tend to appear in the same sentence as “people”. The program splits the corpus in two parts: the sentences containing “people” on the one hand, and those not containing it on the other. For each word appearing in the first group of sentences, it calculates whether the word is over-represented in this subset (the “people” subcorpus), and the words with the highest specificity scores are ranked as cooccurrents of “people”. In practice, this means that they appear in sentences with “people” more often than their overall frequency in the corpus would lead you to expect.
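The splitting-and-scoring procedure just described can be sketched as follows. This is a toy version: HyperbaseWeb ranks by a hypergeometric specificity score, whereas here a simple observed/expected ratio stands in for it, and the function name and sample data are mine.

```python
# Sketch of the cooccurrence principle: split the sentences on whether they
# contain the node word, then compare each word's observed count in that
# subset with what its overall corpus frequency would predict.
from collections import Counter

def cooccurrents(sentences, node):
    subset = [s for s in sentences if node in s]        # the "node" subcorpus
    sub_counts = Counter(w for s in subset for w in s if w != node)
    all_counts = Counter(w for s in sentences for w in s)
    sub_size = sum(sub_counts.values())
    total = sum(all_counts.values())
    scores = {
        w: obs / (all_counts[w] * sub_size / total)     # > 1: over-represented
        for w, obs in sub_counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

sentences = [
    ["people", "of", "scotland"],
    ["people", "and", "scotland"],
    ["the", "weather", "of", "today"],
    ["the", "weather", "and", "rain"],
]
ranking = cooccurrents(sentences, "people")   # "scotland" comes out on top
```

Words like “of” and “and”, which are spread evenly across all sentences, get scores close to 1, while “scotland”, concentrated in the “people” sentences, is ranked first.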
HyperbaseWeb also shows the terms that tend to cooccur with the pair formed by the search term and its first cooccurrent. This image illustrates the principle:
So apparently, when the English Royal family talks about people, Scotland is strongly involved!
Via the settings you can decide whether you are interested in the lemmata, forms or grammatical classes that tend to cooccur with your search term; whether you want to apply filters (for instance, considering only adjectives, nouns, etc.); and how to define the context: the sentence, the paragraph, or a window of a certain number of words, extending only to the left, only to the right, or symmetrically.
Here too you can both take a screenshot and download the numerical information as .csv with the download icon (top right).
4. Analyzing distributions
This section allows you to unravel the secrets of how words are distributed in your corpus. The key concept is one we have already introduced, the statistical specificity of a linguistic feature to a corpus partition, so let me dwell on it for a moment. Suppose every member of the Royal family accounted for 1/6 of the corpus (which is not the case, but it simplifies our task), and that half of the occurrences of the word “country” appeared in the Queen's speeches: this would hint at the term being particularly associated with the Queen. One way to measure this is to calculate the probability of the word appearing exactly that number of times in each partition: if the probability is high, the word is randomly distributed; if it is low, there is a statistical specificity.
A low probability can mean either that the word appears more frequently than expected (so it is characteristic of that partition) or that it appears less frequently than expected (so it is “avoided” in that partition). For those who are interested, the values given by HyperbaseWeb result from a probability calculation based on a hypergeometric distribution, then scaled via logarithms to match the scale of z-score values. This means that if a word gets a value higher than 2 (or 5 on smaller corpora such as ours), it is specific to that partition; if it gets a value lower than -2 (resp. -5), it is “deficient” in that partition.
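The hypergeometric model behind these values can be sketched in Python: given a corpus of T tokens in which a word occurs F times, how surprising is it to find the word f times in a partition of t tokens? HyperbaseWeb's exact log-scaling is not documented here, so converting the tail probability to a z-like value via the normal quantile is an assumption of this sketch, and the toy numbers are invented.

```python
# Hypergeometric specificity sketch. Binomial coefficients are computed in
# log space (via lgamma) to stay numerically safe on corpus-sized counts.
from math import lgamma, exp, fsum
from statistics import NormalDist

def log_comb(n, k):
    """log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hyper_pmf(T, F, t, k):
    """P(X = k): k hits in a draw of t tokens out of T, F of them 'hits'."""
    return exp(log_comb(F, k) + log_comb(T - F, t - k) - log_comb(T, t))

def specificity(T, F, t, f):
    """Positive score: over-represented in the partition; negative: 'deficient'."""
    p_over = fsum(hyper_pmf(T, F, t, k) for k in range(f, min(F, t) + 1))
    if p_over <= 0.5:                        # rare high count -> positive score
        return -NormalDist().inv_cdf(p_over)
    p_under = fsum(hyper_pmf(T, F, t, k) for k in range(0, f + 1))
    return NormalDist().inv_cdf(p_under)     # rare low count -> negative score

# Toy numbers: a 60 000-token corpus where "country" occurs 120 times, and a
# 10 000-token partition containing 60 of those occurrences (expected: 20).
score = specificity(60_000, 120, 10_000, 60)
```

With these toy numbers the word is observed three times more often than expected, so the score comes out well above the threshold of 2 mentioned above; with only 2 observed occurrences instead of 60, it would fall below -2.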
So, via the distribution tool (Histogram option), you can retrieve those values. Here the importance of partitions becomes particularly evident. Let's consider the distribution of the lemma “Europe” by author; the result is the following:
This might not seem particularly revealing: we just notice that the word is of some interest to the Queen and her husband. Now, let's make the same evaluation chronologically, by combining information on authors and years, as we saw in the previous post. Here (although this is not particularly visible in the screenshot), we see that the occurrences are strongly associated with the 2004 speeches of the Queen and Prince Philip. And we might wonder: what happened exactly that year? (The answer is to be found by searching the texts, but I will tell you: the 60th anniversary of Dunkerque, and a visit to France… not exactly a look towards the future 😉)
The parameters also allow you to simply display the count of each feature (frequency) or its ratio to the total number of words in the partition (relative frequency). If you are interested in a particular grammatical category and want to know its most frequent terms, you can use the “10/20/30 etc. most frequent” option, which will calculate the specificity/frequency/relative frequency of these terms. Again, you can screenshot or download the results as .csv.
In the upper right menu, two more options are displayed: the AFC and the branching analysis. We might want to look at the distribution of many features in the corpus at once, and thus get an overview of how the partitions are associated or distinguished on the basis of these features. This overview can be obtained with the AFC tool (factorial analysis of correspondences), which separates the texts along its axes according to their features. I will first propose an example and then comment on it.
The percentages given in the upper right corner correspond to the share of information carried by each axis. The first axis is the horizontal one, so the associations and oppositions along this dimension are the strongest: here we clearly see a generational divide, with the younger members talking about social themes, clearly distinguished from the more “national” approach of the older ones. The second level of analysis is the vertical axis, which somehow associates the two women (the Queen and Kate Middleton) as opposed to the two men (Prince William and Prince Philip). It is worth mentioning that the strength of an association/opposition depends on how far the words/texts are from the center of the graphic, along the dimension analyzed.
As we already saw in the previous post, the tree analysis allows you to display the same information about the proximity and distance of the partitions or of the words, without focusing on the other component. If we choose the columns (texts), we get a result very similar to the one we saw in the previous post.
I think we now have an overview of the main functionalities of HyperbaseWeb. If you want to play around, you might consider starting with one of the many corpora publicly available in the library. For English, there are Queen Elizabeth II's Christmas speeches (a subset of our corpus) and various US presidential speeches. For French, many, many more!