Part I: creating, uploading and editing a corpus
HyperbaseWeb is a software tool developed by Laurent Vanni at the laboratory UMR 7320: Bases, Corpus, Langage. It is free, runs online, and lets the user easily manipulate textual corpora. It can run complex searches, extract cooccurrence information, and perform statistical analyses whose results are returned as visualizations. For all of these functions, the user can download the data.
The user can either upload their own corpus or use those provided by the project: most of them are in French, but other languages (English, Latin, Italian…) are available as well. In this post, I will show step by step how to upload a corpus, and then how to extract as much information as possible from it.
Step 1: Create your own corpus
In this post I am not going to focus on how to effectively download material from the Web (scraping), because there are already plenty of good explanations of how to do it. In this GitHub repository, you can find both the Jupyter notebook I used to download the speeches of members of the English Royal Family from 1990 to 2021 (publicly available on the official website, under The Royal Household © Crown Copyright), and the files with the speeches. If you want to do the exercise, you just need to download the whole directory Royal_family_speeches, unzip it, and then follow the instructions in this post.
In general, if you have a folder with .txt files (or .csv, but this option still needs to be refined), consider when naming them that HyperbaseWeb can detect different levels of metadata, provided they are separated by an underscore (‘_’) in the file name. Here is a concrete example of how that works:
When uploading the file named ‘HerMajestyTheQueen_1990_Christmas Broadcast.txt’, HyperbaseWeb will distinguish three levels:
metadata0: Royal Member;
metadata1: year;
metadata2: title of the speech.
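The naming rule above can be sketched in a few lines of Python. This is only an illustration of the underscore convention, not HyperbaseWeb's actual code; the function name `split_metadata` is hypothetical.

```python
from pathlib import Path


def split_metadata(file_name: str) -> list[str]:
    """Split a file name into its underscore-separated metadata levels."""
    stem = Path(file_name).stem  # drop the ".txt" extension
    return stem.split("_")


levels = split_metadata("HerMajestyTheQueen_1990_Christmas Broadcast.txt")
for index, value in enumerate(levels):
    # Label each field the way the post names the levels (metadata0, ...).
    print(f"metadata{index}: {value}")
```

Running a check like this over your folder before uploading is a cheap way to spot files whose names would yield too few or too many levels.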
Once the corpus is created, you will be able to manipulate these three levels, as we will see in Step 3. Therefore, you can think about how you would like to split your corpus for your analysis and create the metadata accordingly. For a literary corpus you might want to consider:
DISCLAIMER: This Royal speeches corpus is not a particularly good one: it is rather small, and titles are cut short after 20 characters in order to avoid problems linked to the presence of special characters. However, since this is only for demonstration purposes, it shouldn’t be a major issue.
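As a rough sketch of how such file names could be produced, the snippet below strips special characters from a title, cuts it to 20 characters, and joins the three fields with underscores. It is an assumption about one reasonable way to do this, not necessarily what the notebook in the repository does; `build_file_name` and its parameters are hypothetical names.

```python
import re


def build_file_name(author: str, year: int, title: str, max_title_len: int = 20) -> str:
    """Build an 'author_year_title.txt' name with a short, plain title."""
    # Keep only ASCII letters, digits and spaces, so that no special
    # character (and no extra underscore) ends up in the title field.
    clean_title = re.sub(r"[^A-Za-z0-9 ]", "", title)
    clean_title = clean_title[:max_title_len].strip()
    # The author field keeps letters and digits only (spaces are removed).
    clean_author = re.sub(r"[^A-Za-z0-9]", "", author)
    return f"{clean_author}_{year}_{clean_title}.txt"


print(build_file_name("Her Majesty The Queen", 1990, "Christmas Broadcast 1990: a message"))
```

Removing underscores from the individual fields matters here, because any extra underscore would be read as one more metadata level.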
Step 2: Upload your corpus
Once you have gathered and structured your files, you can click on the “New Project” icon that you find on the homepage of the HyperbaseWeb website. There you are asked for information about the corpus (here: base) that you want to create. You can give your corpus a unique identifier and the name under which you want to share it. Then you can indicate the source of the corpus, the language and the visibility. If you set it to private, you will be given a password to access it. To share your private corpus, you just need to share this password and the link that will be sent to you upon creation.
The available options for language are: French, English, German, Italian, Portuguese, Spanish, Ancient French, Other, already lemmatized text. For most languages, if the corpus is not lemmatized, HyperbaseWeb will lemmatize it with TreeTagger, which also provides Part-of-Speech tagging.
If you want to work with Latin, for now you need to rely on the L.A.S.L.A. corpora offered by HyperbaseWeb. However, files should become accessible soon. For more information you can contact me, and please see this short guide for the conventions used.
In our case the page would look like this:
Once you click on “Load files”, and then “Add files”, you just need to select the files in the folders. Beware that once you have selected them, you also have to click on “Begin loading”, otherwise they will not be uploaded. When you have completed the task, you can click on the “Create” icon at the bottom, and proceed to the validation of the corpus.
As you can see, the software will detect the three levels of metadata. The page should look something like this:
Once you click on continue you will be asked to give your email and your password, and the adventure begins!
Step 3: Organize your corpus
When you enter your project section, the first thing you want to do is make sure that the corpus is structured as you wish. HyperbaseWeb allows you to manipulate it easily: if you click on the menu icon in the top left corner, you can access the ‘Edit’ section. As you can see, you can immediately delete some parts of the corpus if you want to restrict the results. By clicking on the name of a partition, you can rename it to something more readable. But HyperbaseWeb allows you to do much more.
As we saw in the previous section, we have three levels of metadata. Suppose that, instead of grouping the speeches by author, we want to group them by the year in which they were delivered, to get an overall idea of the chronological evolution. We can delete the partitions proposed automatically by the interface (authors) by clicking on the red cross next to “Move”. Then we can click on the “Add a partition” icon on the right, select meta1 (year level), select “All values” and confirm. This is how the screen should appear:
In the section about the study of the distribution of linguistic features, we will see how this is relevant. Of course, users can decide that they want only the most recent speeches and select only the last 10 years: every addition or removal is straightforward.
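Conceptually, partitioning on the year level amounts to grouping the files by their second underscore-separated field. The sketch below illustrates that idea outside the interface; it is not how HyperbaseWeb is implemented, and the file names and the helper `partition_by_level` are made up for the example.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical file names following the author_year_title convention.
file_names = [
    "HerMajestyTheQueen_1990_Christmas Broadcast.txt",
    "HerMajestyTheQueen_2015_Christmas Broadcast.txt",
    "PrinceHarry_2015_Example speech.txt",
]


def partition_by_level(names: list[str], level: int) -> dict[str, list[str]]:
    """Group file names by the value of one metadata level."""
    partitions: dict[str, list[str]] = defaultdict(list)
    for name in names:
        key = Path(name).stem.split("_")[level]
        partitions[key].append(name)
    return dict(partitions)


# Level 1 is the year; level 0 would give the partition by author.
by_year = partition_by_level(file_names, level=1)

# Keep only the last ten years of a 1990-2021 corpus.
recent = {year: names for year, names in by_year.items() if int(year) >= 2012}

for year in sorted(by_year):
    print(year, len(by_year[year]))
```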
HyperbaseWeb also provides additional tools. Suppose we still want to distinguish the speeches of the various authors, but we would also like to group them by year (all the 2005 speeches by the Queen would be grouped together, as opposed to the 2005 speeches by Prince Harry and the 2005 speeches by Kate Middleton). This can be useful if we want to compare speeches about a specific event, for instance. Then we can use the functions that allow us to manipulate metadata, namely “merge metadata”. Here we want to merge the information from level 0 (author) and level 1 (year): the resulting metadata level would be of the form “author_year”, and would lead to the combination we were looking for (HerMajestyTheQueen_2005 etc.).
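To make the “author_year” form concrete, here is a minimal sketch of what merging two metadata levels means for a single file name. It only mimics the result described above; `merge_levels` is a hypothetical helper and the file name is an example, not necessarily a file from the corpus.

```python
from pathlib import Path


def merge_levels(file_name: str, first: int, second: int) -> str:
    """Join two metadata levels of a file name into one 'a_b' key."""
    levels = Path(file_name).stem.split("_")
    return f"{levels[first]}_{levels[second]}"


# Merge level 0 (author) with level 1 (year).
print(merge_levels("HerMajestyTheQueen_2005_Christmas Broadcast.txt", 0, 1))
```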
If we want to group together some specific partitions, we can use the function “combine partitions”: in our case, for instance, if we start from the “authors” partition, we might want to group together the speeches of couples in order to contrast “royal pairs” with one another. The only caveat is that partitions have to be adjacent to be combined, which you can achieve by moving them with the arrows in the ‘Move’ column. So, for instance, if we want to analyze the speeches of Kate Middleton and Prince William together, we can easily do so, as the following screenshot demonstrates:
Finally, let’s suppose some speeches are shared by more than one family member: for instance, some speeches were delivered jointly by the Queen and her husband. We might want to focus on those common speeches. Then all we need to do is use the function “create an intersection of partitions”, and we will have a new partition containing only the common speeches.
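If each partition is viewed as a set of speech identifiers, the intersection is simply the set of identifiers present in both. The snippet below illustrates that reading with invented identifiers; it assumes nothing about how HyperbaseWeb stores partitions internally.

```python
# Hypothetical partitions, each viewed as a set of speech identifiers.
queen = {"speech_a", "speech_b", "speech_c"}
duke = {"speech_b", "speech_c", "speech_d"}

# The intersection keeps only the speeches that belong to both partitions.
joint = queen & duke
print(sorted(joint))
```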
If you have designed a specific partitioning of the corpus and you want to save it, you can use the “save a partitioning” option. It will be downloaded in .json format, so next time you just need to delete all default partitions and use the “load a partitioning” option to get back to the same partitioning.