We previously introduced the concept of a graph to describe networks. To further analyze networks we will discuss network centrality and clustering.
Motivation
A common goal in the analysis of networks is to understand the relative importance of the nodes in the network.
For example:
in a social network, who are the most influential individuals?
on the Web, which pages are more informative?
in a road network, which intersections are most heavily used?
The key idea is that the structure of the network should give us some information about the relative importance of the nodes in the network.
Centrality and Clustering
To analyze networks we want to be able to classify:
important nodes, and
important groups of nodes.
Determining which nodes are important nodes leads to the notion of centrality.
Determining which groups of nodes are important leads to the notion of clustering.
In order to classify important nodes and groups of nodes, we will use graph theory and linear algebra.
Centrality
Do some nodes in the network have a special role? Are some nodes more important than others?
These are questions of centrality (or prestige).
Definition
Centrality in graph theory and network analysis refers to measures that identify the most important vertices (nodes) within a graph. These measures assign numbers or rankings to nodes based on their position and influence within the network. Centrality can help determine which nodes are most influential, critical for connectivity, or central to the flow of information.
We will study three basic notions of centrality:
Closeness Centrality: A central node is close to all others.
Betweenness Centrality: A central node is on many paths through the network.
Eigenvector Centrality: A central node is connected to other central nodes (sometimes called status centrality).
To introduce concepts here, we’ll look at a very famous dataset in the history of network analysis: Zachary’s karate club.
Karate Club Graph
The back story: from 1970 to 1972 the anthropologist Wayne Zachary studied the social relationships inside a university karate club.
While he was studying the club, a factional division led to a splitting of the club in two.
The club became split between those who rallied around the club president and those who rallied around the karate instructor.
You can read the story of the Karate club here. This dataset has become so famous that it has spawned its own academic traditions.
Here’s a view of the social network of the karate club.
Closeness Centrality
The closeness centrality of a node measures how near it is, on average, to every other node: it is the reciprocal of the node's average shortest-path distance to all other nodes in the network.
Large Closeness Centrality: Indicates that a node is, on average, close to all other nodes in the network. This means it can quickly interact with or reach other nodes. Nodes with high closeness centrality are often considered central or influential because they can efficiently spread information or resources throughout the network.
Small Closeness Centrality: Indicates that a node is, on average, farther away from all other nodes in the network. These nodes are less central and may take longer to interact with or reach other nodes. Nodes with low closeness centrality are often on the periphery of the network and are less influential in terms of spreading information or resources.
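To make the definition concrete, here is a minimal sketch (assuming NetworkX is installed; the node choice is arbitrary) that computes the closeness of one node of the karate club graph directly from its average shortest-path distance and compares the result to nx.closeness_centrality.

import networkx as nx

# Closeness of a single node, computed directly from the definition
Gk = nx.karate_club_graph()
node = 0  # an arbitrary node

# Average shortest-path distance from this node to all other nodes
dists = nx.shortest_path_length(Gk, source=node)
avg_dist = sum(dists.values()) / (len(Gk) - 1)

# Closeness is the reciprocal of the average distance
print(1 / avg_dist)
print(nx.closeness_centrality(Gk)[node])  # should agree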
Code
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

# Build the karate club graph and calculate closeness centrality
Gk = nx.karate_club_graph()
cent = list(nx.closeness_centrality(Gk).values())

# Set random seed for reproducibility
np.random.seed(9)

# Create figure and subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Draw the first subplot with circular layout
pos1 = nx.circular_layout(Gk)
nodes1 = nx.draw_networkx_nodes(Gk, pos=pos1, ax=ax1, node_color=cent, cmap=plt.cm.plasma)
nx.draw_networkx_edges(Gk, pos=pos1, ax=ax1)
ax1.set_title('Closeness Centrality')
ax1.axis('off')

# Draw the second subplot with spring layout
pos2 = nx.spring_layout(Gk)
nodes2 = nx.draw_networkx_nodes(Gk, pos=pos2, ax=ax2, node_color=cent, cmap=plt.cm.plasma)
nx.draw_networkx_edges(Gk, pos=pos2, ax=ax2)
ax2.set_title('Closeness Centrality')
ax2.axis('off')

# Add colorbar to the figure
cbar = fig.colorbar(nodes1, ax=[ax1, ax2], orientation='horizontal', fraction=0.05, pad=0.05)
cbar.set_label('Closeness Centrality')

# Show the plot
plt.show()
In this graph, most nodes are close to most other nodes.
However we can see that some nodes are slightly more central than others.
Code
plt.figure(figsize=(6, 4))
plt.hist(cent, bins=np.linspace(0, 1, 30))
plt.xlabel('Closeness Centrality', size=14)
plt.ylabel('Number of Nodes', size=14)
plt.title('Distribution of Closeness Centrality', size=16)
plt.show()
Betweenness Centrality
Alternatively we might be interested in the question, is the node on many paths?
If we picture the network as a conduit for information, then betweenness captures how important a node is to the communication process, or how much information passes through the node.
First, let’s consider the case in which there is only one shortest path between any pair of nodes.
Then, the betweenness centrality of node \(i\) is the number of shortest paths that pass through \(i\).
Mathematically we have
\[
\text{betweenness}(i) = \sum_{i \neq j \neq k \in V}
\begin{cases}
1 &\text{if path from }j\text{ to }k\text{ goes through }i, \\
0 &\text{otherwise}.
\end{cases}
\]
For a graph with \(n\) nodes, we can convert this to a value between 0 and 1 by dividing by the number of pairs that exclude \(i\), namely \({n-1 \choose 2} = (n-1)(n-2)/2\) (this is the normalization that NetworkX uses).
In a general graph, there may be multiple shortest paths between \(j\) and \(k\).
To handle this, we define:
\(\sigma(i \mid j,k)\) is the number of shortest paths between \(j\) and \(k\) that pass through \(i\), and
\(\sigma(j,k)\) is the total number of shortest paths between \(j\) and \(k\).
Then we define the dependency of \(i\) on the paths between \(j\) and \(k\) as
\[
\delta(i \mid j,k) = \frac{\sigma(i \mid j,k)}{\sigma(j,k)},
\]
and the betweenness centrality of \(i\) becomes the sum of these dependencies over all pairs:
\[
\text{betweenness}(i) = \sum_{i \neq j \neq k \in V} \frac{\sigma(i \mid j,k)}{\sigma(j,k)}.
\]
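As a quick sanity check of these definitions, here is a minimal sketch (assuming NetworkX) on a path graph, where the shortest paths are easy to count by hand.

# Path graph 0 - 1 - 2 - 3: the only pairs (j, k) whose shortest path passes
# through node 1 are (0, 2) and (0, 3), so its unnormalized betweenness is 2.
P = nx.path_graph(4)
print(nx.betweenness_centrality(P, normalized=False))

# By default NetworkX normalizes by the number of pairs excluding the node,
# (n-1)(n-2)/2, so node 1's value becomes 2/3.
print(nx.betweenness_centrality(P))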
Code
# Create the graph
Gk = nx.karate_club_graph()

# Calculate betweenness centrality
cent = list(nx.betweenness_centrality(Gk).values())

# Set random seed for reproducibility
np.random.seed(9)

# Create figure and subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Draw the first subplot with circular layout
pos1 = nx.circular_layout(Gk)
nodes1 = nx.draw_networkx_nodes(Gk, pos=pos1, ax=ax1, node_color=cent, cmap=plt.cm.plasma)
nx.draw_networkx_edges(Gk, pos=pos1, ax=ax1)
ax1.set_title('Betweenness Centrality')
ax1.axis('off')

# Draw the second subplot with spring layout
pos2 = nx.spring_layout(Gk)
nodes2 = nx.draw_networkx_nodes(Gk, pos=pos2, ax=ax2, node_color=cent, cmap=plt.cm.plasma)
nx.draw_networkx_edges(Gk, pos=pos2, ax=ax2)
ax2.set_title('Betweenness Centrality')
ax2.axis('off')

# Add colorbar to the figure
cbar = fig.colorbar(nodes1, ax=[ax1, ax2], orientation='horizontal', fraction=0.05, pad=0.05)
cbar.set_label('Betweenness Centrality')

# Show the plot
plt.show()
We start to see with this metric the importance of two or three key members of the karate club.
Note that many nodes will have a betweenness centrality of zero – no shortest paths go through them.
Code
plt.figure(figsize=(6, 4))
plt.hist(cent, bins=np.linspace(0, 1, 30))
plt.xlabel('Betweenness Centrality', size=14)
plt.ylabel('Number of Nodes', size=14)
plt.title('Distribution of Betweenness Centrality', size=16)
plt.show()
Adjacency Matrices
To define the next centrality, we need to represent the graph as a matrix.
Given an \(n\)-node undirected graph \(G = (V, E)\), the adjacency matrix \(A\) is defined as
\[
A_{ij} =
\begin{cases}
1 & \text{if } (i,j) \in E, \\
0 & \text{otherwise}.
\end{cases}
\]
An important way to think about adjacency matrices is that column \(j\) holds \(j\)’s neighbors.
The adjacency matrix is nonnegative and symmetric.
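Here is a small sketch (assuming NetworkX and NumPy; weight=None gives a 0/1 matrix) that builds the adjacency matrix of the karate club graph and checks the two properties just stated: column \(j\) lists \(j\)'s neighbors, and the matrix is symmetric.

import numpy as np

Gk = nx.karate_club_graph()
A = nx.to_numpy_array(Gk, weight=None)

# Column j holds j's neighbors: the nonzero rows of column j
j = 11
print(np.nonzero(A[:, j])[0])   # indices of j's neighbors
print(sorted(Gk.neighbors(j)))  # should be the same nodes

# The adjacency matrix of an undirected graph is symmetric
print(np.array_equal(A, A.T))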
Eigenvector Centrality
Eigenvector centrality is a measure used in network analysis to determine the influence of a node within a network. The centrality score of a node is proportional to the sum of the centrality scores of its neighbors. The main idea of eigenvector (or status) centrality is that high status nodes are connected to high status nodes.
Given a graph with \(n\) nodes, let \(A\) be the adjacency matrix of the graph, where \(A_{ij}\) is 1 if there is an edge from node \(i\) to node \(j\), and 0 otherwise.
The eigenvector centrality \(x_i\) of node \(i\) is defined as
\[
x_i = \frac{1}{\lambda} \sum_{j} A_{ij} x_j,
\]
where
\(x_i\) is the eigenvector centrality of node \(i\).
\(\lambda\) is a constant (an eigenvalue).
\(A_{ij}\) is the element of the adjacency matrix corresponding to the edge from node \(i\) to node \(j\).
The above equation says that the importance of node \(i\) is proportional to the sum of the importances of the nodes \(j\) to which \(i\) is connected.
In matrix form, this can be written as
\[
\mathbf{Ax} = \lambda \mathbf{x},
\]
where \(\mathbf{x}\) is the eigenvector corresponding to the eigenvalue \(\lambda\) of the adjacency matrix \(\mathbf{A}\). The eigenvector \(\mathbf{x}\) gives the centrality scores for all nodes in the network.
In general, there may be multiple eigenvalues \(\lambda\) for which the above equation holds. In practice, the largest eigenvalue is used because it captures the most significant mode of influence propagation in the network.
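As a quick sketch of this idea (assuming NumPy; the graph is the unweighted karate club network), we can compute the eigenvector of the adjacency matrix corresponding to its largest eigenvalue and read off the centrality scores from its entries.

A = nx.to_numpy_array(nx.karate_club_graph(), weight=None)

# Eigendecomposition of the symmetric adjacency matrix
w, V = np.linalg.eigh(A)

# eigh returns eigenvalues in ascending order; take the eigenvector for the
# largest eigenvalue, and flip its sign if needed so all entries are positive
x = V[:, -1]
x = x * np.sign(x.sum())

# Nodes ranked by eigenvector centrality (largest first)
print(np.argsort(x)[::-1][:5])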
Geometric Intuition
Eigenvector as a direction:
The eigenvector is a direction in the network space. Applying the adjacency matrix (representing connections between nodes) to this vector stretches it along that direction. The amount of stretch is determined by the eigenvalue.
High centrality, high stretch:
A node with a large eigenvector centrality corresponds to a large entry in this eigenvector. This indicates that it is connected to many other nodes that are themselves considered high-status.
No direction change:
The eigenvector doesn’t change its direction when transformed, only its magnitude. This means that a node with a high eigenvector centrality remains relatively central even when considering its connections to other central nodes.
Key Points on Eigenvector Centrality
Eigenvector: The vector \(\mathbf{x}\) represents the centrality scores.
Largest Eigenvalue: The centrality scores are derived from the eigenvector corresponding to the largest eigenvalue of the adjacency matrix.
Perron-Frobenius Theorem: For a connected graph, the largest eigenvalue is positive. In addition, there is a unique corresponding eigenvector (up to scaling), and its entries are all positive.
The Perron-Frobenius Theorem is an important theoretical tool when working with adjacency matrices, which are by definition nonnegative.
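As with the other measures, we can color the karate club graph by eigenvector centrality. The sketch below follows the same pattern as the earlier plots (it assumes the imports and the graph Gk defined above).

# Calculate eigenvector centrality
cent = list(nx.eigenvector_centrality(Gk).values())

# Set random seed for reproducibility
np.random.seed(9)

# Create figure and subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Circular layout
pos1 = nx.circular_layout(Gk)
nodes1 = nx.draw_networkx_nodes(Gk, pos=pos1, ax=ax1, node_color=cent, cmap=plt.cm.plasma)
nx.draw_networkx_edges(Gk, pos=pos1, ax=ax1)
ax1.set_title('Eigenvector Centrality')
ax1.axis('off')

# Spring layout
pos2 = nx.spring_layout(Gk)
nodes2 = nx.draw_networkx_nodes(Gk, pos=pos2, ax=ax2, node_color=cent, cmap=plt.cm.plasma)
nx.draw_networkx_edges(Gk, pos=pos2, ax=ax2)
ax2.set_title('Eigenvector Centrality')
ax2.axis('off')

# Add colorbar to the figure
cbar = fig.colorbar(nodes1, ax=[ax1, ax2], orientation='horizontal', fraction=0.05, pad=0.05)
cbar.set_label('Eigenvector Centrality')
plt.show()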
It appears that we have identified the teacher and club president as important nodes based on our centrality measures.
Clustering and Partitioning
We now turn to the question of finding important groups of nodes.
Why might we want to cluster graph nodes?
Assigning computations to processors in a parallel computer
Segmenting images (finding boundaries between objects)
Clustering words found together in documents, or documents with similar words
Divide and conquer algorithms
Circuit layout in VLSI
Community detection in social networks
Min \(s\)-\(t\) cut
We’ll start with a problem that is fundamental to many other problems in graph analysis. It involves finding the smallest set of edges that, if removed, would disconnect a specified source node \(s\) from a target node \(t\) in a graph.
The goal of a min \(s\)-\(t\) cut is to partition the graph into two disjoint subsets such that \(s\) is in one subset and \(t\) is in the other, and the total weight (or number) of the edges that need to be removed to achieve this separation is minimized.
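As a small sketch of the idea (assuming NetworkX; the graph and capacities below are made up purely for illustration), nx.minimum_cut returns both the value of a minimum \(s\)-\(t\) cut and the two sides of the resulting partition.

# A tiny capacitated graph, just for illustration
Gc = nx.DiGraph()
Gc.add_edge('s', 'a', capacity=4)
Gc.add_edge('s', 'b', capacity=4)
Gc.add_edge('a', 'b', capacity=1)
Gc.add_edge('a', 't', capacity=2)
Gc.add_edge('b', 't', capacity=3)

# Value of the minimum s-t cut, and the two sides of the partition
cut_value, (side_s, side_t) = nx.minimum_cut(Gc, 's', 't')
print(cut_value)          # here, the two edges into t must be removed
print(side_s, side_t)

For an unweighted, undirected graph like the karate club network, nx.minimum_edge_cut(G, s, t) returns the cut edges themselves; we use it below.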
Below is a figure depicting a schematic diagram of the railway network of the Western Soviet Union and Eastern European countries.
This is from a historical analysis paper of Alexander Schrijver in Math Programming, 91: 3, 2002. The idea was to determine a minimum cut, which would best halt the flow of materials in the railway network.
As mentioned, when Wayne Zachary studied the club, a conflict arose between the instructor and the president (nodes 0 and 33, respectively).
Zachary predicted the way the club would split based on an \(s\)-\(t\) min cut.
In fact, he correctly predicted every single member’s eventual association except for node 8.
Code
# G, pos, and cut_edges come from earlier cells; they are reconstructed here
# so this cell runs on its own
G = nx.karate_club_graph()
np.random.seed(9)
pos = nx.spring_layout(G)

# Edges in a minimum cut separating the instructor (node 0) from the president (node 33)
cut_edges = nx.minimum_edge_cut(G, 0, 33)

# Remove the cut edges and identify the two resulting components
Gcopy = G.copy()
Gcopy.remove_edges_from(cut_edges)
cc = nx.connected_components(Gcopy)
node_set = {node: i for i, s in enumerate(cc) for node in s}

# Color nodes by component
colors = ['dodgerblue', 'tomato']
node_colors = [colors[node_set[v]] for v in G.nodes()]

fig = plt.figure(figsize=(12, 6))
nx.draw_networkx(G, node_color=node_colors, pos=pos, with_labels=True, node_size=1000, font_size=16)
plt.axis('off')
plt.show()
Minimum Cuts
In partitioning a graph, we may not have any particular \(s\) and \(t\) in mind.
Rather, we may want to simply find the minimal way to disconnect the graph.
Clearly, we can do this using an \(s\)-\(t\) min cut, by simply trying all \(s\) and \(t\) pairs.
Let’s try this approach of finding the minimum \(s\)-\(t\) cut over all possibilities in the karate club graph.
Code
Gcopy = G.copy()
Gcopy.remove_edges_from(nx.minimum_edge_cut(G))
cc = nx.connected_components(Gcopy)
node_set = {node: i for i, s in enumerate(cc) for node in s}

colors = ['tomato', 'dodgerblue']
node_colors = [colors[node_set[v]] for v in G]

fig = plt.figure(figsize=(12, 6))
nx.draw_networkx(G, node_color=node_colors, pos=pos, with_labels=True, node_size=1000, font_size=16)
plt.axis('off')
plt.show()
This is in fact the minimum cut. Node 11 only has one edge to the rest of the graph, so the min cut is 1.
As this example shows, minimum cut is not, in general, a good approach for clustering or partitioning.
To get a more useful partition, we need to define a new goal. The new goal will be to find a balanced cut.
Balanced Cuts
The idea for avoiding the problem above is to normalize the cut by the size of the smaller of the two components. The problem above would be avoided because the smaller of the two components is just a single node.
Formally, for a set of nodes \(U \subset V\) with \(|U| \leq |V|/2\), define the cut ratio \(\alpha(U)\) as the number of edges with exactly one endpoint in \(U\), divided by \(|U|\); let \(\alpha(G)\) denote the minimum of \(\alpha(U)\) over all such \(U\). A good balanced cut is one whose ratio \(\alpha(U)\) is small.
Spectral graph theory is the use of linear algebra to study the properties of graphs.
To introduce spectral graph theory, we define some terms.
Let \(G\) be an undirected graph with \(n\) nodes. We define the \(n\times n\) matrix \(D\) as a diagonal matrix of node degrees, i.e., \(D = \text{diag}(d_1, d_2, d_3, \dots)\) where \(d_i\) is the degree of node \(i\).
Assuming \(G\) has adjacency matrix \(A\), we define the Laplacian of \(G\) as
\[
L = D - A.
\]
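Here is a small sketch (assuming NetworkX and NumPy, with weight=None so that edges are unweighted) verifying this definition on the karate club graph.

G = nx.karate_club_graph()
A = nx.to_numpy_array(G, weight=None)

# D is the diagonal matrix of node degrees
D = np.diag(A.sum(axis=1))

# The Laplacian is D - A; compare with NetworkX's built-in function
print(np.allclose(D - A, nx.laplacian_matrix(G, weight=None).todense()))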
Below we show the Laplacian matrix \(L\) for the karate club network as a heatmap.
Code
import seaborn as sns

L = nx.laplacian_matrix(nx.karate_club_graph()).todense()

plt.figure(figsize=(7, 7))
sns.heatmap(L, cmap=plt.cm.tab20)
plt.axis('equal')
plt.axis('off')
plt.show()
Now let us think about an \(n\)-component vector \(\mathbf{x} \in \mathbb{R}^n\) as an assignment of values to nodes in the graph \(G\).
For example, \(\mathbf{x}\) could encode node importance or strength or even a more concrete notion like temperature or altitude.
Here is an amazing fact about the Laplacian of \(G\): for any vector \(\mathbf{x} \in \mathbb{R}^n\),
\[
\mathbf{x}^T L \mathbf{x} = \sum_{(i,j) \in E} (x_i - x_j)^2.
\]
To see why, note that when we expand \(\mathbf{x}^T L \mathbf{x}\) using the values in \(L\), the sum contains the term \(d_i x_i^2\) for each \(i\), and two terms of \(-x_ix_j\) whenever \((i,j)\in E\). Grouping these terms edge by edge gives \(\sum_{(i,j)\in E} (x_i^2 - 2x_ix_j + x_j^2)\), which is the expression above.
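A quick numerical check of this identity (assuming NumPy; the Laplacian is recomputed here as an unweighted array so the snippet stands alone):

G = nx.karate_club_graph()
L = np.array(nx.laplacian_matrix(G, weight=None).todense()).astype(float)

# Assign a random value to each node
rng = np.random.default_rng(0)
x = rng.normal(size=L.shape[0])

# The quadratic form x^T L x ...
quad_form = x @ L @ x

# ... equals the sum of squared differences across the edges
edge_sum = sum((x[i] - x[j]) ** 2 for i, j in G.edges())

print(np.isclose(quad_form, edge_sum))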
The eigenvalues of \(L\) are real and nonnegative, and we order them \(0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_n\). If we think of \(x_i\) as a 1-D coordinate for node \(i\) in the graph, then choosing \(\mathbf{x} = \mathbf{w}_2\) (the eigenvector corresponding to \(\lambda_2\), known as the Fiedler vector) puts each node in the position that minimizes the sum of the squared stretching of the edges, among all unit-length assignments that are orthogonal to the all-ones vector (that is, excluding the trivial solution that places every node at the same point).
Recall from physics that the energy in a stretched spring is proportional to the square of its stretched length.
If we use the entries of \(\mathbf{w}_2\) to position the nodes of the graph along a single dimension, we obtain exactly the spring layout of nodes that we discussed in the last lecture – except that it is in one dimension only.
This is the basis for the spectral layout that we showed in the last lecture.
In the spectral layout, we use \(\mathbf{w}_2\) for the first dimension, and \(\mathbf{w}_3\) for the second dimension.
\(\mathbf{w}_3\) is the eigenvector corresponding to \(\lambda_3\), the next-smallest eigenvalue of \(L\).
What is the difference between the spectral layout and the spring layout?
In one dimension, they are the same, but in multiple dimensions, the spectral layout optimizes each dimension separately.
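As a sketch (assuming the NetworkX, NumPy, and Matplotlib imports from above), we can build this two-dimensional embedding directly from the eigenvectors of the unweighted Laplacian; up to rescaling, this is essentially what nx.spectral_layout computes.

G = nx.karate_club_graph()
L = np.array(nx.laplacian_matrix(G, weight=None).todense()).astype(float)

# Eigenvectors of L, ordered by increasing eigenvalue
w, V = np.linalg.eigh(L)

# Use w_2 and w_3 (columns 1 and 2) as the x and y coordinates of each node
pos = {v: (V[i, 1], V[i, 2]) for i, v in enumerate(G.nodes())}

plt.figure(figsize=(6, 6))
nx.draw_networkx(G, pos=pos, node_size=100, with_labels=False)
plt.title('Layout from the eigenvectors $w_2$ and $w_3$')
plt.axis('off')
plt.show()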
Spectral Partitioning
This leads to key ideas in node partitioning.
The basic idea is to partition nodes according to the Fiedler vector \(\mathbf{w}_2\).
This can be shown to have provably good performance for the balanced cut problem.
See Spectral and Algebraic Graph Theory by Daniel Spielman, Chapter 20, where it is proved that for every \(U \subset V\) with \(|U| \leq |V|/2\), \(\alpha(U) \geq \lambda_2 (1-s)\) where \(s = |U|/|V|\). In particular, \(\alpha(G) \geq \lambda_2/2.\)
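To see the bound in action, here is a sketch (assuming NetworkX and NumPy, and using the unweighted karate club graph) that compares the cut ratio of one candidate set, obtained from the sign of the Fiedler vector (one of the split rules listed next), with the quantity \(\lambda_2(1-s)\).

G = nx.karate_club_graph()
lam2 = nx.algebraic_connectivity(G, weight=None)  # lambda_2 of the Laplacian
f = nx.fiedler_vector(G, weight=None)

# Candidate set U: nodes with negative Fiedler entries; keep the smaller side
U = [v for v, fv in zip(G.nodes(), f) if fv < 0]
if len(U) > len(G) / 2:
    U = [v for v in G.nodes() if v not in U]

# Cut ratio alpha(U): edges leaving U, divided by |U|
cut_size = len(list(nx.edge_boundary(G, U)))
alpha_U = cut_size / len(U)

s = len(U) / len(G)
print(alpha_U, lam2 * (1 - s))  # alpha(U) should be at least lambda_2 * (1 - s)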
There are a number of options for how to split based on the Fiedler vector.
If \(\mathbf{w}_2\) is the Fiedler vector, then split nodes according to a value \(s\):
bisection: \(s\) is the median value in \(\mathbf{w}_2\)
ratio cut: \(s\) is the value that minimizes the cut ratio \(\alpha\)
sign: separate positive and negative values (\(s = 0\))
gap: separate according to the largest gap in the values of \(\mathbf{w}_2\)
Here is a spectral partitioning for the karate club graph.
Code
G = nx.karate_club_graph()
f = nx.fiedler_vector(G)

# Assign each node to a side based on the sign of its Fiedler vector entry
s = np.zeros(len(f), dtype='int')
s[f > 0] = 1

fig = plt.figure(figsize=(12, 6))
colors = ['tomato', 'dodgerblue']
np.random.seed(9)
pos = nx.spring_layout(G)
node_colors = [colors[s[v]] for v in G]
nx.draw_networkx(G, pos=pos, node_color=node_colors, with_labels=True, node_size=1000, font_size=16)
plt.axis('off')
plt.show()
Interestingly, this is almost the same as the \(s\)-\(t\) min cut based on the president and instructor.
Spectral Clustering
In many cases we would like to move beyond graph partitioning, to allow for clustering nodes into, say, \(k\) clusters.
The idea of spectral clustering takes the observations about the Fiedler vector and extends them to more than one dimension.
Let’s look again at the spectral layout of the football dataset.
Here we’ve labelled the nodes according to their conference, which we will think of as ground-truth labels.
Code
import matplotlib.patches as mpatches
import re

cmap = plt.cm.tab20

# data from http://www-personal.umich.edu/~mejn/netdata/
football = nx.readwrite.gml.read_gml('data/football.gml')

# map conference numbers to conference names
conf_name = {}
with open('data/football.txt', 'r') as fp:
    for line in fp:
        m = re.match(r'\s*(\d+)\s+=\s*([\w\s-]+)\s*\n', line)
        if m:
            conf_name[int(m.group(1))] = m.group(2)

# ground-truth conference label for each node
conf = [d['value'] for i, d in football.nodes.data()]

plt.figure(figsize=(12, 12))
nx.draw_networkx(football, pos=nx.spectral_layout(football), node_color=conf, with_labels=False, cmap=cmap)
plt.title('Conference Membership in Football Network')
patches = [mpatches.Patch(color=cmap(i/11), label=conf_name[i]) for i in range(12)]
plt.legend(handles=patches)
plt.axis('off')
plt.show()
Now, the key idea is that using the spectral layout, we have placed nodes into a Euclidean space.
This means we could use a standard clustering algorithm in that space.
Code
plt.figure(figsize=(12, 12))
nx.draw_networkx(football, pos=nx.spectral_layout(football), node_color=conf, with_labels=False, edgelist=[], cmap=cmap)
plt.title('Conference Membership in Football Network')
patches = [mpatches.Patch(color=cmap(i/11), label=conf_name[i]) for i in range(12)]
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.legend(handles=patches)
plt.show()
This plot shows that many clusters are well-separated in this space, but some still overlap.
To address this, we can use additional eigenvectors of the Laplacian, i.e., \(\mathbf{w}_4, \mathbf{w}_5, \dots\).
The idea of spectral clustering is:
use enough of the smallest eigenvectors of \(L\) to sufficiently “spread out” the nodes,
cluster the nodes in the Euclidean space created by this embedding.
More specifically, given a graph \(G\):
Compute \(L\), the Laplacian of \(G\)
Compute the eigenvectors corresponding to the \(d\) smallest eigenvalues of \(L\), excluding the eigenvector for the smallest eigenvalue (the constant, all-ones vector)
Let \(U \in \mathbb{R}^{n\times d}\) be the matrix containing the eigenvectors \(\mathbf{w}_2, \mathbf{w}_3, \dots, \mathbf{w}_{d+1}\) as columns
Let the position of each node \(i\) be the point in \(\mathbb{R}^d\) given by row \(i\) of \(U\)
Cluster the points into \(k\) clusters using \(k\)-means
Let’s explore the results of spectral clustering using \(d = 2\) dimensions.
Code
# Here is a complete example of spectral clustering

# The number of dimensions of the spectral layout
k = 2

# Obtain the graph
football

# Compute the eigenvectors of its Laplacian
L = nx.laplacian_matrix(football).todense()
w, v = np.linalg.eig(L)
v = np.array(v)

# scale each eigenvector by its eigenvalue
X = v @ np.diag(w)

# consider the eigenvectors in increasing order of their eigenvalues
w_order = np.argsort(w)
X = X[:, w_order]

# run kmeans using the k top eigenvectors as coordinates
from sklearn.cluster import KMeans
kmeans = KMeans(init='k-means++', n_clusters=11, n_init=10)
kmeans.fit_predict(X[:, 1:(k+1)])
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
error = kmeans.inertia_

# visualize the result
plt.figure(figsize=(14, 7))
ax1 = plt.subplot(121)
nx.draw_networkx(football, ax=ax1, pos=nx.spectral_layout(football), node_color=labels, with_labels=False, edgelist=[], cmap=cmap, node_size=100)
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.title(f'Spectral Clustering, 11 Clusters, dimension = {k}')
ax2 = plt.subplot(122)
nx.draw_networkx(football, ax=ax2, pos=nx.spectral_layout(football), node_color=conf, with_labels=False, edgelist=[], cmap=cmap, node_size=100)
plt.title('Conference Membership in Football Network')
patches = [mpatches.Patch(color=cmap(i/11), label=conf_name[i]) for i in range(12)]
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.legend(handles=patches, loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
This is pretty good, but we can see that in some cases the clustering is not able to separate clusters that overlap in the visualization.
Which makes sense, as for the case \(d = 2\), we are running \(k\)-means on the points just as we see them in the visualization.
Let’s try \(d = 3\). Now there will be another dimension available to the clustering, which we can’t see in the visualization.
Code
# Here is a complete example of spectral clustering

# The number of dimensions of the spectral layout
k = 3

# Compute the eigenvectors of its Laplacian
L = nx.laplacian_matrix(football).todense()
w, v = np.linalg.eig(L)
v = np.array(v)

# scale each eigenvector by its eigenvalue
X = v @ np.diag(w)

# consider the eigenvectors in increasing order of their eigenvalues
w_order = np.argsort(w)
X = X[:, w_order]

# run kmeans using the k top eigenvectors as coordinates
from sklearn.cluster import KMeans
kmeans = KMeans(init='k-means++', n_clusters=11, n_init=10)
kmeans.fit_predict(X[:, 1:(k+1)])
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
error = kmeans.inertia_

# visualize the result
plt.figure(figsize=(14, 7))
ax1 = plt.subplot(121)
nx.draw_networkx(football, ax=ax1, pos=nx.spectral_layout(football), node_color=labels, with_labels=False, edgelist=[], cmap=cmap, node_size=100)
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.title(f'Spectral Clustering, 11 Clusters, dimension = {k}')
ax2 = plt.subplot(122)
nx.draw_networkx(football, ax=ax2, pos=nx.spectral_layout(football), node_color=conf, with_labels=False, edgelist=[], cmap=cmap, node_size=100)
plt.title('Conference Membership in Football Network')
patches = [mpatches.Patch(color=cmap(i/11), label=conf_name[i]) for i in range(12)]
plt.tick_params(left=True, bottom=True, labelleft=True, labelbottom=True)
plt.legend(handles=patches, loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
We can see visually that using 3 dimensions is giving us a better clustering than 2 dimensions.
What happens as we increase the dimension further?
To evaluate this question, we can use the Adjusted Rand Index (ARI).
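The ARI compares two labelings of the same items while ignoring how the label values themselves are named; it equals 1 for identical partitions and is near 0 for unrelated ones. Here is a tiny sketch (assuming scikit-learn) of its behavior.

from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]

# The same partition with different label names: ARI = 1
print(adjusted_rand_score(truth, [1, 1, 1, 0, 0, 0]))

# One point assigned to the wrong cluster: ARI drops below 1
print(adjusted_rand_score(truth, [0, 0, 1, 1, 1, 1]))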
ARI Spectral Clustering
Code
import sklearn.metrics as metrics

# Compute the eigenvectors of its Laplacian
L = nx.laplacian_matrix(football).todense()
w, v = np.linalg.eig(L)
v = np.array(v)

# scale each eigenvector by its eigenvalue
X = v @ np.diag(w)

# consider the eigenvectors in increasing order of their eigenvalues
w_order = np.argsort(w)
X = X[:, w_order]

max_dimension = 15
ri = np.zeros(max_dimension - 1)
for k in range(1, max_dimension):
    # run kmeans using the k top eigenvectors as coordinates
    kmeans = KMeans(init='k-means++', n_clusters=11, n_init=10)
    kmeans.fit_predict(X[:, 1:(k+1)])
    ri[k - 1] = metrics.adjusted_rand_score(kmeans.labels_, conf)

plt.figure(figsize=(8, 6))
plt.plot(range(1, max_dimension), ri, 'o-')
plt.xlabel('Number of Dimensions (Eigenvectors)', size=14)
plt.title('Spectral Clustering of Football Network Compared to Known Labels', size=16)
plt.ylabel('Adjusted Rand Index', size=14)
plt.show()
Based on this plot, it appears that the football graph can be best described as approximately six-dimensional.
When we embed it in six dimensions and cluster there we get an extremely high Adjusted Rand Index.
Code
print(f'Maximum ARI is {np.max(ri):0.3f}, using {1+ np.argmax(ri)} dimensions for spectral embedding.')
Maximum ARI is 0.868, using 6 dimensions for spectral embedding.
Recap
Networks are present in many interesting applications.
We have covered the following topics
Graph representations
Clustering coefficients
Centrality measures
Clustering
In addition, we used the NetworkX package to visualize our networks using a variety of layouts, including the circular, spring, and spectral layouts.