An introduction to network analysis. Part II. What can I do with a network?

In the first part of this post, we learned what is a network, what features can it have, and how it can be represented. In this second part, we are going to introduce the concepts and operations used to “extract information” from a network. The slides that (more or less) correspond to this explanation can be downloaded here:

slides_network_analysis Download

The network can be analyzed from two perspectives: the global and local analysis. Global analysis provides information about the general structure of the network, while the local perspective gives information about the role of the single nodes within the network. We will start by the latter.

Local perspective

The most natural question you can ask yourself about a node in the network is: how many contact ties it have, i.e. how many edges are incident to that node? The answer is called the “degree“. If our network is directed (and you can read Part I if you are not sure of what does this mean), we can distinguish ‘in-degree’ (number of edges that directed to our node) and ‘out-degree’ (number of edges that start from our node). If our network is weighted we can sum the weights of the edges instead of simply counting them (weighted degree). Now I would just like to come back to the adjacency matrix (again, you should check Part I if you are not sure of what is an adjacency matrix), because it is a easy way to have a look at it. Here we see the adjacency matrix of the network represented on the right:

Adjacency matrix of undirected and unweighted graph

Since every element of the matrix indicates whether there is or not an edge between the row-node and column-node, to count all the edges coming out of a node, you just need to sum all the number on the its row: in this way you are adding up a one every time it is connected to another node. If the network was weighted, we would have had also other numbers and not just 0 and 1, but the operation would remain the same: summing along the row. Since the network in undirected, it is logical that all the edges that come out of a node are the same as those ‘entering’ the node: if we want to check that, we just have to add-up the values of the column of the node. Indeed this will count all the cases in which an edge starts from another node and arrives to our (to be honest, it could also start from the node itself and be a loop). Here, if we look at the node A, we see clearly that both the correspondent row and column have a total value of 5.

In the case of directed graphs, this symmetry doesn’t hold, as we see in the next example.

Adjacency matrix of directed and unweighted graph

Here you see that along the row of the node A we have a value of 5 (outgoing edged), while along the column a value of 1 (incoming edges)

This is already a very important indication of the role of a node in the network. However, if we have a look at the undirected network above, we see that the nodes A, I, and L (for instance) have the same degree of 5, but clearly occupy different kind of positions inside the network. How can we capture such differences? We can use different measures of centrality that take into account not only the number of connection of each node, but also their “position” in the network.

Centrality measures

Let’s imagine a situation that could be described by our network. Suppose you have a very authoritarian country and a very democratic one. They don’t speak much with each other, but each of them sends a representative to the other country. The node A is the representative of the democratic country in the authoritarian one, and can only speak with the 5 highest charges of the State, who cannot talk with anyone else: it is represented in the left part of the graph. Also the representative of the authoritarian country in the democratic one (node I) can only speak with the 5 highest charges of the State, but they are well embedded in their own government and communicate with a net of colleagues, as you see in the right part of the graph. The two representatives exchange messages through a set of intermediaries (nodes E, F, G).

Betweenness centrality

Some nodes play essential role in linking the other nodes, because they are on the shortest way between two otherwise disconnected nodes. Betweenness centrality of a node measures how many times a node is found on the shortest path between two other nodes. Following Wikipedia, here is a clear way of how to compute it:

The betweenness of a vertex v in a graph G:= (V,E), is computed as follows:

1.for each pair of vertices (s, t) compute the shortes paths between them

2.For each pair of vertices (s, t) determine the fraction of shortest paths that pass through the vertex in question (here, v)

3. Sum this fraction over all pairs of vertices (s, t)

Without surprise, if we calculate the betweenness centrality on our network shows very high values for the representatives and the intermediators, as you see in the image below, where the size is proportional to the score.

Betweenness centrality scores for the network

In the formula, C_B (v) indicates the betweenness centrality of v, σ_st indicates the number of shortest paths going from s to t, while σ_st (v) indicates the number of shortest paths going from s to t and passing from the node v.

Closeness centrality

Another way of looking at nodes is asking how close they are to the rest of the graph, in other words how long would it take from them to “reach’ the other nodes (however are the links defined). This is evaluated by “closeness centrality”, that is computed in the following way:

The closeness centrality of a vertex v in a graph G:= (V,E), is computed as follows:

1.For each node t present in the graph, you compute the shortest path from v to t, and thus calculate the distance.

2.You then add-up the inverse of the distance d of each of these nodes (1/d)

3.You can normalize with respect to the size of the graph

Closeness centrality shows immediately the difference between the two representatives: if the node I wants to “spread a word”, he will reach very quickly most of the nodes in the network. On the contrary, for the node A, there is a long way to go before reaching most of the other nodes. From this point of view the node I is very central, while the node A is peripherical. Again, the image gives a very clear idea of the situation.

Closeness centrality scores for the network.

In the formula, C(x) indicates the closeness centrality of x, and d(y, x) the shortest path, i.e. the distance, from y to x.

Eigenvector centrality

Finally, you might wonder not only how many connections has a node in the network, but how “influential” those are: in our case, for instance, it is clear that the node A has 5 connections, but that those are very isolated, so they will not make it easier for the node A to be in contact with other actors in the network. On the contrary, the node L has, as neighbors, many valuable allies, very well embedded in the network. Again, as Wikipedia says: relative scores are assigned to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. A high eigenvector score means that a node is connected to many nodes who themselves have high scores.

Computations here require an understanding of eigenvalues and eigenvectors…maybe that will be for another time! For now, let’s stick to visualizing this on our network:

Eigenvector centrality scores for the network

A similar mechanism is applied in Pagerank by Google, on directed graphs, to measure the influence of pages.

Community detection

Let’s move now to another question. It is clear that in our network there are nodes highly connected between themselves and only loosely connected to the rest of the network. It might useful in a number of situations to acknowledge the existence of these communities in a network: whether it is to find collaborative groups of people, or to identify thematic unities by looking at words frequently co-occurring, we can really improve our insight of the relations by determining how a network is best clustered. There are many techniques that are used, and let’s say that in general they attempt to partition the network so that the density of links inside a community is maximized and that in between communities in minimized. Here I applied the Louvain algorithm (via Gephi), and this is the result:

Clustering of the graph based on the Louvain algorithm

The results are speak for themselves in this case: the people communicating in the two countries form separate groups, and the intermerdiaries a third one.

The global level

Networks can have very different shapes: they can be very compact, very loose, the nodes can all have similar number of edges or the edges can be distributed in a very unbalanced way. All of these aspects can be measured in different ways.

Cliques and components

It is often interesting to acknowledge whether there are sets of nodes that are fully connected with each other, meaning that each node is linked to all the others, and to identify the largest possible set of nodes entirely connected in a network. It is a very difficult problem to solve and to compute, but it can be very interesting to identify such sets because they might reflect the existence of teams/alliances in situation of conflict etc. Another aspect that strongly characterize the network is the identification of connected components. In some reconstructed networks some nodes might not have any connection with the others, and they can be treated as separate networks. The images below illustrate what I just explained:

Fully connected graph, connected subgraph, connected components

Density

A fully-connected network is of course a very dense network: all the possible links exist. It is however in general interesting to measure the density of the graph, by calculating the ratio between the existing edges and the maximum number of edges (those of the ‘fully connected version’). The formula is the following

where m indicates the number of edges, and n the number of nodes.

Average degree and degree distribution

Very similar to the density is the average degree, where we calculate the number of edges per node. You divide the number of edges by the number of nodes, with the difference that for undirected graphs you multiply the number of edges by two.

Two graphs can have a similar average degree but a very different shape. The following two graphs illustrate this sentence:

Graphs with similar average degree but different distribution of the edges.

A way to capture this difference is looking at the degree distribution, i.e. for each possible degree looking at what percentage of nodes has exactly that degree. The following image shows the different degree distribution of the two nodes.

Degree distribution of the two graphs. On the vertical axis is represented the percentage of nodes having the degree specified on the horizontal axis.

Diameter

The diameter of a network wants to capture how “extended” it is, by calculating the shortest longest path in the network. You calculate all the shortest paths between two nodes and then select the highest value, and that will be the diameter. Again, the image will give a clear intuition of the interest of this measure:

Well, this was a very quick overlook on the most common terms and tools used in network analysis. There are many others, but once you get familiar with the area, it becomes rather quick and easy to understand what measurement should you use in what context.

Margherita Fantoli

An introduction to network analysis. Part II. What can I do with a network?