Introduction to Networks

Code
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import matplotlib as mp
import sklearn
import networkx as nx
from IPython.display import Image, HTML

import laUtilities as ut

%matplotlib inline

Now we will encounter a new kind of data: networks, also called graphs.

(We’re going to use the two terms interchangeably).

Basic Definitions

A graph is a pair \((V, E)\) where \(V\) is a set of vertices and \(E\) is a set of vertex pairs called edges.

We’ll also use the term nodes for vertices,

and we’ll use the term links for edges.

We’ll distinguish between undirected graphs and directed graphs.

In an undirected graph, an edge \((u, v)\) is an unordered pair: \((u, v)\) and \((v, u)\) refer to the same edge.

In a directed graph, \((u, v)\) is an ordered pair, and it is different from \((v, u)\).

Code
G = nx.Graph()
G.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])
DG = nx.DiGraph()
DG.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])
pos = {
    1: [0, 0],
    2: [-1, -1],
    3: [1, -1],
    4: [1, 0],
    5: [2, 0]
}
fig = plt.figure(figsize = (12, 4))
ax1 = fig.add_subplot(121)
nx.draw_networkx(G, 
                 ax = ax1,
                 node_size=300, 
                 edge_color='k',
                 node_color = 'lightcoral', 
                 pos = nx.spring_layout(G, seed = 1),
                 with_labels=True, 
                 alpha=1, linewidths=2)
plt.axis('off')
plt.title('Undirected Graph', size = 16)
ax2 = fig.add_subplot(122)
nx.draw_networkx(DG, 
                 ax = ax2,
                 node_size=300, 
                 edge_color='k',
                 node_color = 'lightblue',
                 pos = pos,
                 with_labels=True, 
                 arrowsize = 25,
                 alpha=1, linewidths=2)
plt.title('Directed Graph', size = 16)
plt.axis('off');

A path in a graph from \(u\) to \(v\) is a sequence of edges, each sharing a node with the next, that starts at \(u\) and ends at \(v\).

In a directed graph, all of the edges in a path need to be oriented head-to-tail.

If there is a path from \(u\) to \(v\), we say that \(v\) is reachable from \(u\).
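
As a quick illustration, here is a minimal sketch of checking reachability with networkx, using the same directed graph as in the figure below (and the imports from the top of these notes):

Code
DG = nx.DiGraph()
DG.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])

# 5 is reachable from 1 via the directed path 1 -> 3 -> 5
print(nx.has_path(DG, 1, 5))       # True
print(nx.shortest_path(DG, 1, 5))  # [1, 3, 5]

# but 1 is not reachable from 5: no directed path leads back to 1
print(nx.has_path(DG, 5, 1))       # False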

Degree

The degree of a node is the number of edges that connect to it.

In a directed graph, we distinguish between:

* in-degree: the number of incoming edges to the node
* out-degree: the number of outgoing edges to the node

Code
G = nx.Graph()
G.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])
DG = nx.DiGraph()
DG.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])
pos = {
    1: [0, 0],
    2: [-1, -1],
    3: [1, -1],
    4: [1, 0],
    5: [2, 0]
}
fig = plt.figure(figsize = (12, 4))
ax1 = fig.add_subplot(121)
nx.draw_networkx(G, 
                 ax = ax1,
                 node_size=300, 
                 edge_color='k',
                 node_color = 'lightcoral', 
                 pos = nx.spring_layout(G, seed = 1),
                 with_labels=True, 
                 alpha=1, linewidths=2)
plt.axis('off')
plt.title('Undirected Graph', size = 16)
ax2 = fig.add_subplot(122)
nx.draw_networkx(DG, 
                 ax = ax2,
                 node_size=300, 
                 edge_color='k',
                 node_color = 'lightblue',
                 pos = pos,
                 with_labels=True, 
                 arrowsize = 25,
                 alpha=1, linewidths=2)
plt.title('Directed Graph', size = 16)
plt.axis('off');

The neighbors of a node are the nodes to which it is connected.

So the degree of a node is the number of neighbors it has.

Note that in a network with \(n\) nodes and \(e\) edges, the average node degree is \(2e/n\).

Make sure that this is obvious to you!
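
As a quick check, here is a sketch (on the example graphs above) that computes degrees, in-degrees, and out-degrees, and verifies the \(2e/n\) identity:

Code
G = nx.Graph()
G.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])
DG = nx.DiGraph()
DG.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])

print(dict(G.degree()))       # {1: 2, 2: 2, 3: 4, 4: 1, 5: 1}
print(dict(DG.in_degree()))   # {1: 0, 2: 1, 3: 2, 4: 1, 5: 1}
print(dict(DG.out_degree()))  # {1: 2, 2: 1, 3: 3, 4: 0, 5: 0}

# each edge contributes 2 to the total degree, so the average degree is 2e/n
n, e = G.number_of_nodes(), G.number_of_edges()
print(np.mean([d for _, d in G.degree()]), 2 * e / n)   # 2.0  2.0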

Connectivity

The first question to ask about a graph is: is it connected?

An undirected graph is connected if, for every pair of nodes \((u, v)\), \(u\) is reachable from \(v\).

We say a directed graph is connected when its undirected version is connected.

Code
G = nx.Graph()
G.add_edges_from([(1,2), (1,3), (2, 3)])
G.add_node(4)
G.add_node(5)
G2 = nx.Graph()
G2.add_edges_from([(1,2), (3, 1), (2, 3), (3, 4), (3, 5)])
pos = {
    1: [0, 0],
    2: [-1, -1],
    3: [1, -1],
    4: [1, 0],
    5: [2, 0]
}
fig = plt.figure(figsize = (12, 4))
ax1 = fig.add_subplot(121)
nx.draw_networkx(G, 
                 ax = ax1,
                 node_size=300, 
                 edge_color='k',
                 node_color = 'lightblue', 
                 pos = pos,
                 with_labels=True, 
                 alpha=1, linewidths=2)
plt.axis('off')
plt.title('Not Connected', size = 16)
ax2 = fig.add_subplot(122)
nx.draw_networkx(G2, 
                 ax = ax2,
                 node_size=300, 
                 edge_color='k',
                 node_color = 'lightcoral',
                 pos = pos,
                 with_labels=True, 
                 alpha=1, linewidths=2)
plt.title('Connected', size = 16)
plt.axis('off');

If the graph is not connected, it breaks into multiple connected components.

A connected component is a maximal subgraph that is connected.
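
For example, here is a sketch that lists the connected components of the "Not Connected" graph drawn above:

Code
G = nx.Graph()
G.add_edges_from([(1,2), (1,3), (2, 3)])
G.add_node(4)
G.add_node(5)

print(nx.is_connected(G))                # False
print(list(nx.connected_components(G)))  # [{1, 2, 3}, {4}, {5}]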

In a directed graph, we can also ask if it is strongly connected.

A directed graph is strongly connected if there is a (directed) path between any two nodes.

That is, any node is reachable from any other node.

Within a directed graph, only a subset of nodes may be strongly connected.

Such a subset is called a strongly connected component (SCC).

Code
G = nx.DiGraph()
G.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])
DG = nx.DiGraph()
DG.add_edges_from([(1,2), (3, 1), (2, 3), (3, 4), (3, 5)])
pos = {
    1: [0, 0],
    2: [-1, -1],
    3: [1, -1],
    4: [1, 0],
    5: [2, 0]
}
fig = plt.figure(figsize = (12, 4))
ax1 = fig.add_subplot(121)
nx.draw_networkx(G, 
                 ax = ax1,
                 node_size=300, 
                 edge_color='k',
                 node_color = 'lightblue', 
                 pos = pos,
                 with_labels=True, 
                 arrowsize = 25,
                 alpha=1, linewidths=2)
plt.axis('off')
plt.title('No SCC', size = 16)
ax2 = fig.add_subplot(122)
nx.draw_networkx(DG, 
                 ax = ax2,
                 node_size=300, 
                 edge_color='k',
                 node_color = ['lightcoral', 'lightcoral', 'lightcoral', 'lightblue', 'lightblue'],
                 pos = pos,
                 with_labels=True, 
                 arrowsize = 25,
                 alpha=1, linewidths=2)
plt.title('Has a SCC', size = 16)
plt.axis('off');
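
We can check this with networkx. Here is a sketch over the two directed graphs drawn above (the left one has no cycles, so every SCC is a single node; the right one contains the cycle 1 → 2 → 3 → 1):

Code
G = nx.DiGraph()
G.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])
DG = nx.DiGraph()
DG.add_edges_from([(1,2), (3, 1), (2, 3), (3, 4), (3, 5)])

# neither graph is strongly connected as a whole
print(nx.is_strongly_connected(G), nx.is_strongly_connected(DG))  # False False

# nontrivial strongly connected components
print([s for s in nx.strongly_connected_components(G) if len(s) > 1])   # []
print([s for s in nx.strongly_connected_components(DG) if len(s) > 1])  # [{1, 2, 3}]

# "connected" for a directed graph means its undirected version is connected
print(nx.is_weakly_connected(DG))  # True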

Characterizing Graphs

When you first start working with a network, there are some basic questions that are very helpful to ask.

It is helpful to characterize the degrees, components, and other structures in the graph.

Comparison Case: the \(G(n, p)\) Random Graph

In studying real-world graphs, it is helpful to have a case to compare against.

The most common comparison is the \(G(n, p)\) random graph. It is also called the Erdős–Rényi graph, after the two mathematicians who developed and studied it.

The \(G(n, p)\) random graph model is very simple:
* we start with a set of \(n\) nodes, and
* for each pair of nodes, we connect them with probability \(p\).

It is easy to see that in this graph, the expected node degree is \((n-1)p\), which is approximately \(np\) for large \(n\).

Code
n = 35
p = 0.15
er = nx.erdos_renyi_graph(n, p, seed = 0)
plt.figure(figsize = (12, 8))
nx.draw_networkx(er, node_size=45, 
                 edge_color='gray', 
                 pos = nx.spring_layout(er, seed = 1),
                 with_labels=False, alpha=1, linewidths=2)
plt.axis('off')
plt.title(f'$G(n, p)$ with $n$ = {n} and $p$ = {p:0.2f}', size = 16);

In this graph, the average degree is approximately \(np = 35 \cdot 0.15 = 5.25\).
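
We can verify this on the sample drawn above (the exact value depends on the random seed):

Code
n, p = 35, 0.15
er = nx.erdos_renyi_graph(n, p, seed = 0)
avg_degree = np.mean([d for _, d in er.degree()])
print(f'average degree: {avg_degree:0.2f} (compare to np = {n * p:0.2f})')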

As it turns out, most real-world graphs do not match the properties of \(G(n, p)\) graphs.

But it is useful to have a comparison to a “random” graph.

Degree Distributions

Understanding connectivity starts with asking about observed degrees in the graph.

This is captured in the degree distribution.

\[ P[D > x] = \text{probability that a node has degree greater than } x \]

We typically focus our attention on large values of \(x\) – nodes that are highly connected.
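
As a concrete sketch, the empirical version of \(P[D > x]\) can be computed directly from the observed degrees. Here we use a \(G(n, p)\) sample purely to show the mechanics:

Code
G_sample = nx.erdos_renyi_graph(1000, 0.01, seed = 0)
sample_degrees = np.array([d for _, d in G_sample.degree()])

# empirical P[D > x]: the fraction of nodes whose degree exceeds x
for x in [5, 10, 15, 20]:
    print(f'P[D > {x}] = {np.mean(sample_degrees > x):0.3f}')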

Power Law Degree Distributions

It’s common for a degree distribution to approximately follow a power-law.

The simplest power-law distribution is called the Pareto distribution:

\[ P[D > x] = k^{\alpha} x^{-\alpha}\;\;\; k \leq x,\; \;0 < \alpha \leq 2. \]

It takes on values in the range \([k, \infty)\).

Code
from scipy.stats import pareto
alpha = 1.3
x = np.linspace(pareto.ppf(0.005,alpha), pareto.ppf(0.995,alpha), 100)
plt.figure(figsize = (8, 6))
plt.plot(x, pareto.pdf(x,alpha),'b-', lw = 5, alpha = 0.6, label='pareto pdf')
plt.title(r'Pareto PDF.  $\alpha$ = {}'.format(alpha), size=16)
plt.xlabel('$x$', size=14)
plt.ylabel('$p(x)$', size=14);

We’ve seen distributions like this before when looking at recommender system data.

In a distribution like this, almost all values are very small,

but there is a non-negligible fraction of values that are very large.

This is like “grains of sand” mixed with a few “boulders.”

This is also referred to as “elephants and mice.”

What does this mean for node degree?

It means that

* most nodes have few neighbors, but
* an important small subset of nodes have many, many neighbors.

To capture such highly variable degree distributions, a common strategy is to plot them on log-log axes.

On log-log axes, a Pareto distribution appears as a straight line:

Code
from scipy.stats import pareto
alpha = 1.3
plt.figure(figsize = (8, 6))
x = np.linspace(pareto.ppf(0.005,alpha), pareto.ppf(0.995,alpha), 100)
plt.plot(np.log10(x), np.log10(pareto.pdf(x,alpha)),'b-', lw = 5, alpha = 0.6, label='pareto pdf')
plt.title(r'Pareto PDF.  $\alpha$ = {}'.format(alpha), size=16)
plt.xlabel(r'$\log_{10}(x)$', size=14)
plt.ylabel(r'$\log_{10}(p(x))$', size=14);

Power Law Degree Distributions are Ubiquitous

Figure

The networks shown are: (a) the collaboration network of mathematicians [182]; (b) citations between 1981 and 1997 to all papers cataloged by the Institute for Scientific Information [351]; (c) a 300 million vertex subset of the World Wide Web, circa 1999 [74]; (d) the Internet at the level of autonomous systems, April 1999 [86]; (e) the power grid of the western United States [416]; (f) the interaction network of proteins in the metabolism of the yeast S. Cerevisiae [212].

The structure and function of complex networks, M. E. J. Newman

https://arxiv.org/abs/cond-mat/0303516

Note that \(G(n, p)\) random graphs do not have power-law degree distributions.

Their degree distributions have exponential tails (like the power grid example above).

Clustering

The next important property of a network to understand is clustering.

In the context of networks, clustering refers to the tendency for groups of nodes to have higher connectivity within the group than the network-wide average.

The simplest measure of local clustering is the clustering coefficient.

The clustering coefficient is the answer to the question “are my friends themselves friends?”

In other words, do the neighbors of a node tend to be neighbors?

More specifically, clustering coefficient measures the probability that two of your neighbors are connected.

There are two ways one can measure this:

\[C^{(1)} = \frac{\sum_i \text{number of triangles that include }i}{\sum_i \text{number of pairs of neighbors of }i} \]

This is the ratio of the mean triangle count to mean neighbor pair count.

It is the probability that a random pair of neighbors are connected.

Code
G = nx.Graph()
G.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])
plt.figure(figsize = (6, 4))
nx.draw_networkx(G, node_size=300, 
                 edge_color='k', 
                 pos = nx.spring_layout(G, seed = 1),
                 with_labels=False, alpha=1, linewidths=2)
plt.axis('off');

\[C^{(1)} = \frac{3}{1 + 1 + 6} = \frac{3}{8} = 0.375\]

(The single triangle \(\{1, 2, 3\}\) is counted once for each of its three nodes; nodes 1 and 2 each have one pair of neighbors, while node 3 has four neighbors and hence six pairs.)

The second way to measure clustering coefficient is the mean of the ratios:

\[ C^{(2)} = \frac{1}{n} \sum_i \frac{\text{number of triangles that include }i}{\text{number of pairs of neighbors of }i} \]

This is the probability that neighbors are connected for a random node.

\[ C^{(2)} = \frac{1}{5} \left(1 + 1 + \frac{1}{6} + 0 + 0\right) = \frac{13}{30} \approx 0.433\]

(Nodes 4 and 5 have fewer than two neighbors, so by convention they contribute 0.)
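
On this example, these two measures match networkx's transitivity (for \(C^{(1)}\)) and average_clustering (for \(C^{(2)}\)); a quick check:

Code
G = nx.Graph()
G.add_edges_from([(1,2), (1,3), (2, 3), (3, 4), (3, 5)])

print(nx.transitivity(G))         # C^(1) = 0.375
print(nx.average_clustering(G))   # C^(2) = 0.4333...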

What is the clustering coefficient in a \(G(n, p)\) random graph?

This is easy: the probability that two of your neighbors are connected is the same as the probability that any two nodes are connected.

In other words, \(C^{(1)} = C^{(2)} = p\).
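
A quick sanity check of this claim on a modest \(G(n, p)\) sample (values will be close to \(p\), up to sampling noise):

Code
n, p = 500, 0.05
er = nx.erdos_renyi_graph(n, p, seed = 0)
print(nx.average_clustering(er))   # roughly 0.05
print(nx.transitivity(er))         # roughly 0.05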

Real World Graphs Tend to Show Large Amounts of Clustering

In practice one finds that real world graphs show strong clustering.

For example, consider a social network. Your friends are much more likely to be themselves friends than two randomly chosen people.

Figure

Clustering and Path Length: Small Worlds

The strong presence of clustering in networks leads to a question: how long is a typical shortest path between nodes?

The average shortest path length between nodes is one way to measure a network’s “diameter.”

Let’s consider a highly clustered graph:

Figure

In this graph, each node has four neighbors, and of the 6 pairs of neighbors, 3 are connected.

So the clustering coefficient is 0.5, quite high.

What is the average shortest path between nodes?

It is around \(n/8\): on average you go 1/4 of the way around the circle, in hops of 2.

So the path length grows linearly with \(n\).

If the number of nodes is large, the average path length is large.
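
We can check these claims on ring lattices of increasing size (networkx's watts_strogatz_graph with rewiring probability 0 builds this kind of graph):

Code
for n in [40, 80, 160]:
    ring = nx.watts_strogatz_graph(n, 4, 0)   # each node connected to its 4 nearest neighbors
    C = nx.average_clustering(ring)
    L = nx.average_shortest_path_length(ring)
    print(f'n = {n}: C = {C:0.2f}, L = {L:0.2f}, n/8 = {n/8:0.2f}')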

Now, we know that real-world social networks are highly clustered.

So, based on this model, we might assume that the average path length between two people in a large social network is going to be quite large.

Is this true? If you choose two people at random from the population of the United States (say), is the shortest path length between them long?

In 1967 the social psychologist Stanley Milgram set out to empirically answer this question.

Figure

Milgram picked 160 people at random in Omaha, Nebraska. (It helped that there used to be phone books.)

He asked them to get a letter to a particular person, a stockbroker in Boston.

The rules that he set were that they could only pass the letter between friends that were known on a first-name basis.

Surprisingly, 62 of the letters made it to the stockbroker!

More surprising was the fact that the average path length was 6.2 people!

This statistic became famous when John Guare wrote a play called Six Degrees of Separation.

Given what we know about clustering in social networks, this is quite surprising. How can we explain it?

The first clue comes from another classic social science paper, called The Strength of Weak Ties, by Mark Granovetter.

This is sometimes referred to as the most famous paper in sociology (60,000 citations!).

Granovetter interviewed people about how they found their jobs. He found that most people did not get a job through a close friend, but rather through a distant acquaintance.

This suggests that an important way that information travels in a social network is via the rare connections that exist outside of the local clustering of friendships.

This was all put on an experimental basis in another classic paper, by the social scientist Duncan Watts and the mathematician Steve Strogatz.

Figure Figure

In their paper Collective Dynamics of Small-World Networks, Watts and Strogatz performed an elegant experiment.

They asked: if we take a highly clustered network, and slightly perturb it (randomize it), what happens?

Specifically, they started with a network in which each node is connected to a fixed number of neighbors, and connections are made in a highly clustered way.

Figure

Then, with probability \(p\), rewire each edge: change it to connect to a random destination.

So, as \(p\) varies, what happens to:

* the average path length \(L(p)\), and
* the clustering coefficient \(C(p)\)?
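
Here is a rough sketch of that experiment using networkx's Watts-Strogatz generator, normalizing \(L(p)\) and \(C(p)\) by their values at \(p = 0\) as in the paper's figure (the sizes here are chosen small for speed, not to match the original setup):

Code
n, k = 500, 10
ps = np.logspace(-4, 0, 9)

ring = nx.connected_watts_strogatz_graph(n, k, 0, seed = 0)
L0 = nx.average_shortest_path_length(ring)
C0 = nx.average_clustering(ring)

for p in ps:
    ws = nx.connected_watts_strogatz_graph(n, k, p, seed = 0)
    L = nx.average_shortest_path_length(ws) / L0
    C = nx.average_clustering(ws) / C0
    print(f'p = {p:0.4f}: L(p)/L(0) = {L:0.2f}, C(p)/C(0) = {C:0.2f}')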

And here is the famous figure from that paper:

Figure

Notice the log scale on the \(p\) axis.

What Watts and Strogatz showed is that it only takes a small amount of long-range connections to dramatically shrink the average path length between nodes.

They showed that high clustering and short path lengths can coexist. They called networks with high clustering and short path lengths small world networks.

Once this phenomenon is understood, we see that it shows up repeatedly.

For example, consider the network of movie actors: two actors are connected if they appear in the same movie.

Thus we have the phenomenon of the six degrees of Kevin Bacon:

Figure

You can try your luck at The Oracle of Bacon.

For example, Elvis Presley:

Figure

What’s special about Kevin Bacon?

Not really anything.

Because movie co-appearance forms a small world network, most any path between two actors is a short one.

Implications of Small Worlds

Viruses spread rapidly in small worlds.

One of the goals of pandemic lockdowns is to keep people circulating only within their local social groups, eliminating the effect of long-range (weak-tie) connections.

This is the idea behind travel bans and mandatory quarantining after travel.

Here is a figure showing the effect on virus propagation of deleting most of the long-range edges from a small-world network:

Figure

Different curves correspond to when the lockdown is deployed (after how many deaths).

From Mitigating COVID-19 on a Small-World Network, Marvin Du, Scientific Reports Oct 2021.

Another question concerns how “shortcuts” arise.

Remember that node degree distributions are typically power-law.

So although most people have small acquaintance sets, a small subset of people are very highly connected.

Figure

These “high social capital” individuals play a big role in creating long-range, path-shortening connections in social networks.

Analyzing Graphs

When first starting to work with a network, understanding its properties can be difficult.

Clustering and the presence of high-degree nodes both contribute to the problem.

Nonetheless, there are a number of strategies that are used to “get to know” a new network dataset.

They include:

* visualize the network in a way that communicates as much insight as possible, and
* compute important metrics of the network.

Visualizing Networks

As an example, we’ll use the following network.

It records American football games between NCAA Div IA colleges in Fall 2000 (available here).

Each vertex represents a football team, which belongs to a specific conference (Big Ten, Conference USA, Pac-10, etc.).

An edge between two vertices \(v_1\) and \(v_2\) means that the two teams played each other; the weight of the edge (\(v_1\), \(v_2\)) is equal to the number of times they played each other.

(Data from M. Girvan and M. E. J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).)

Code
# data from http://www-personal.umich.edu/~mejn/netdata/                                
football = nx.readwrite.gml.read_gml('data/football.gml')
Code
print(f'The football network has {len(football.nodes())} nodes and {len(football.edges())} edges')
The football network has 115 nodes and 613 edges

To get a sense of what is unusual here, we can compare this network to a \(G(n, p)\) random network with the same number of nodes and edges:

Code
n = len(football.nodes())
e = len(football.edges())
p = e / ((n * (n-1))/2)
F_random = nx.erdos_renyi_graph(n, p, seed = 0)

One way to visualize is to use a circular layout, which keeps all the edges in the interior.

This can make things easier to see sometimes.

Code
plt.figure(figsize = (18, 8))
ax1 = plt.subplot(121)
nx.draw_networkx(football, ax = ax1,
                 node_size=35, 
                 edge_color='gray', 
                 pos = nx.circular_layout(football),
                 with_labels=False, alpha=.8, linewidths=2)
plt.axis('off')
plt.title('Title 1 Football -- Circular Layout', size = 16)
ax2 = plt.subplot(122)
nx.draw_networkx(F_random, ax = ax2,
                 node_size=35, 
                 edge_color='gray', 
                 pos = nx.circular_layout(F_random),
                 with_labels=False, alpha=.8, linewidths=2)
plt.axis('off')
plt.title('Same Density Random -- Circular Layout', size = 16);

There is some non-random structure here, but it’s not clear exactly what it is.

So we will turn to a more informative layout.

The standard networkx routine uses what is called a ‘spring’ layout.

Here is how 'spring' layout works:

* Each edge has a weight parameter (which could be 1 for all edges).
* The layout routine places
    * a spring of length 1/weight between the nodes of each edge, and
    * a repulsive force between each pair of nodes,
* and then lets the set of all forces reach its minimum energy state.

This is a kind of minimal distortion in a least-squares sense.

Code
plt.figure(figsize = (18, 8))
ax1 = plt.subplot(121)
nx.draw_networkx(football, ax = ax1,
                 node_size=35, 
                 edge_color='gray', 
                 pos = nx.spring_layout(football, seed = 0),
                 with_labels=False, alpha=.8, linewidths=2)
plt.axis('off')
plt.title('Title 1 Football -- Spring Layout', size = 16)
ax2 = plt.subplot(122)
nx.draw_networkx(F_random, ax = ax2,
                 node_size=35, 
                 edge_color='gray', 
                 pos = nx.spring_layout(F_random, seed = 0),
                 with_labels=False, alpha=.8, linewidths=2)
plt.axis('off')
plt.title('Same Density Random -- Spring Layout', size = 16);

Notice how the spring layout tends to bring clusters of densely connected nodes close to each other.

Finally, we can try the spectral layout.

(We will define the spectral layout in the next lecture.)

Code
plt.figure(figsize = (18, 8))
ax1 = plt.subplot(121)
nx.draw_networkx(football, ax = ax1,
                 node_size=35, 
                 edge_color='gray', 
                 pos = nx.spectral_layout(football),
                 with_labels=False, alpha=.8, linewidths=2)
plt.axis('off')
plt.title('Title 1 Football -- Spectral Layout', size = 16)
ax2 = plt.subplot(122)
nx.draw_networkx(F_random, ax = ax2,
                 node_size=35, 
                 edge_color='gray', 
                 pos = nx.spectral_layout(F_random),
                 with_labels=False, alpha=.8, linewidths=2)
plt.axis('off')
plt.title('Same Density Random -- Spectral Layout', size = 16);

With this layout, we can start to understand the structure of the network: we can see clusters of teams that correspond to conferences, and we can see that there are not too many high-degree nodes.

Characterizing Graphs

Another way to understand network-structured data is to look at important metrics.

For example, we can start with clustering coefficient:

Code
clustering_coefficient = nx.average_clustering(football)
print(f'The clustering coefficient of the Football network is {clustering_coefficient:0.3f}')
cc_random = nx.average_clustering(F_random)
print(f'The clustering coefficient for the equivalent random network is {cc_random:0.3f}')
The clustering coefficient of the Football network is 0.403
The clustering coefficient for the equivalent random network is 0.088

Another useful metric is diameter:

Code
print(f'The diameter of the Football network is {nx.diameter(football)}' +
      f' and the average shortest path length is {nx.average_shortest_path_length(football):0.3f}')
print(f'The diameter of the equivalent random network is {nx.diameter(F_random)}' +
      f' and the average shortest path length is {nx.average_shortest_path_length(F_random):0.3f}')
The diameter of the Football network is 4 and the average shortest path length is 2.508
The diameter of the equivalent random network is 4 and the average shortest path length is 2.215

The next property we can look at is the degree distribution:

Code
degree_freq = nx.degree_histogram(football)
degrees = np.array(range(len(degree_freq)))
Code
plt.figure(figsize = (8, 6))
plt.bar(degrees, degree_freq)
plt.xlabel('Degree', size = 14)
plt.ylabel('Number of Nodes', size = 14)
plt.title('Degree Distribution of Football Network', size = 16);

To get a sense of what is unusual here, we can again compare this to the equivalent random network:

Code
rand_degree_freq = nx.degree_histogram(F_random)
rand_degrees = range(len(rand_degree_freq))
plt.figure(figsize = (8, 6))
plt.bar(rand_degrees, rand_degree_freq, 0.425, label = 'Random')
plt.bar(degrees+0.5, degree_freq, 0.425, label = 'Actual')
plt.xlabel('Degree', size = 14)
plt.ylabel('Number of Nodes', size = 14)
plt.legend(loc = 'best', fontsize = 14)
plt.title('Degree Distribution of Football Network\nCompared to Random', size = 16);

We can see the evidence of scheduling of games in this distribution: a much larger number of teams plays 11 games than would occur by chance.
