Updated September 26, 2022 (bugfix)

The goal of this Python notebook is to download and prepare a suite of benchmark networks for some shortest path algorithms. We would like to experiment with some simple directed graphs with non-negative weights. We are especially interested in road networks.

The files are available on the Università di Roma website. They were created for the 9th DIMACS Implementation Challenge: Shortest Paths. This challenge dates back to 2006, but the files are still there. Here is the download web page: http://www.diag.uniroma1.it//challenge9/download.shtml

The networks correspond to different parts of the USA road network, with various region sizes. Here are the different network names, from smaller to larger:

  • NY : New York
  • BAY : Bay area
  • COL : Colorado
  • FLA : Florida
  • NW : North West
  • NE : North East
  • CAL : California
  • LKS : Great Lakes
  • E : Eastern region
  • W : Western region
  • CTR : Central region
  • USA : contiguous United States

There is an interesting warning on the download page:

Known issues: the data has numerous errors, in particular gaps in major highways and bridges. This may result in routes that are very different from real-life ones. One should take this into consideration when experimenting with the data.

This does not really matter to us because the plan is to implement a shortest path algorithm in Python/Cython and to compare various implementations, not to find realistic routes.

The networks are available as compressed text files. So here are the very simple steps followed in this notebook:

  1. Create a folder structure
  2. Download the compressed network files
  3. Uncompress them with gzip
  4. Create a function to load the edges into a Pandas dataframe
  5. Create a function to load the node coordinates into a Pandas dataframe
  6. Save the networks into parquet files
  7. Query the parquet files with DuckDB
  8. Plot the networks with Datashader

Imports

import gzip
import os
import shutil

import datashader as ds
from datashader.bundling import connect_edges
import datashader.transfer_functions as tf
import duckdb
import geopandas as gpd
import numpy as np
import pandas as pd
import wget

DATA_DIR_ROOT_PATH = "/home/francois/Data/Disk_1/"  # root data dir

Package versions:

Python    : 3.10.6
datashader: 0.14.2
duckdb    : 0.5.0
geopandas : 0.11.1
numpy     : 1.22.4
pandas    : 1.4.4
pyarrow   : 9.0.0
pygeos    : 0.13
wget      : 3.2
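
As a side note, here is one simple way to print such a version listing, using importlib.metadata from the standard library (a minimal sketch; the list above may have been produced differently):

import platform
from importlib.metadata import version

print(f"Python    : {platform.python_version()}")
for pkg in ["datashader", "duckdb", "geopandas", "numpy", "pandas", "pyarrow", "pygeos", "wget"]:
    print(f"{pkg:<10}: {version(pkg)}")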

Create a folder structure

We start by creating a main directory DIMACS_road_networks, and then one directory per network.

data_dir_path = os.path.join(DATA_DIR_ROOT_PATH, "DIMACS_road_networks")
if not os.path.exists(data_dir_path):
    os.makedirs(data_dir_path)
names = [
    "NY",
    "BAY",
    "COL",
    "FLA",
    "NW",
    "NE",
    "CAL",
    "LKS",
    "E",
    "W",
    "CTR",
    "USA",
]
network_dir_paths = {}
for name in names:
    network_dir_path = os.path.join(data_dir_path, name)
    if not os.path.exists(network_dir_path):
        os.makedirs(network_dir_path)
    network_dir_paths[name] = network_dir_path
!tree -d {data_dir_path}
/home/francois/Data/Disk_1/DIMACS_road_networks
├── BAY
├── CAL
├── COL
├── CTR
├── E
├── FLA
├── LKS
├── NE
├── NW
├── NY
├── USA
└── W

12 directories

Download the compressed network files

Three types of files are available from the DIMACS challenge web site:

  • Distance graph: edges with distance weight
  • Travel time graph: edges with travel time weight
  • Coordinates: node coordinates (latitude, longitude)

However, we are only going to download the coordinates and travel time graph files in the present notebook. We only require one type of edge weight in order to run shortest path algorithms. So between weight=distance and weight=travel_time, we chose the latter.

The file URLs follow a neat pattern:

  • travel time graph : http://www.diag.uniroma1.it//challenge9/data/USA-road-t/USA-road-t.XXX.gr.gz
  • coordinates : http://www.diag.uniroma1.it//challenge9/data/USA-road-d/USA-road-d.XXX.co.gz

where XXX is the network name. So we download each of these files with wget and save them in the respective network folder.

travel_time_graph_file_paths = {}
coordinates_file_paths = {}
for name in names:

    # travel time graph
    travel_time_graph_url = f"http://www.diag.uniroma1.it//challenge9/data/USA-road-t/USA-road-t.{name}.gr.gz"
    travel_time_graph_file_path = os.path.join(
        network_dir_paths[name], f"USA-road-t.{name}.gr.gz"
    )
    travel_time_graph_file_paths[name] = travel_time_graph_file_path

    # coordinates
    coordinates_url = f"http://www.diag.uniroma1.it//challenge9/data/USA-road-d/USA-road-d.{name}.co.gz"
    coordinates_file_path = os.path.join(
        network_dir_paths[name], f"USA-road-d.{name}.co.gz"
    )
    coordinates_file_paths[name] = coordinates_file_path

    if (not os.path.exists(travel_time_graph_file_path)) and (
        not os.path.exists(travel_time_graph_file_path.removesuffix(".gz"))
    ):
        wget.download(travel_time_graph_url, travel_time_graph_file_path)
    if (not os.path.exists(coordinates_file_path)) and (
        not os.path.exists(coordinates_file_path.removesuffix(".gz"))
    ):
        wget.download(coordinates_url, coordinates_file_path)

Uncompress the network files with gzip

We first create a small function that looks for a gzipped file. If found, the file is uncompressed and the archive is removed.

def extract_and_cleanup(file_path_gz):
    file_path = file_path_gz.split(".gz")[0]
    try:
        with gzip.open(file_path_gz, "rb") as f_in:
            with open(file_path, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
        os.remove(file_path_gz)
    except FileNotFoundError:
        pass
    return file_path

Now we call the above function extract_and_cleanup for each node and edge file:

for name in names:

    # travel time graph
    file_path_gz = travel_time_graph_file_paths[name]
    file_path = extract_and_cleanup(file_path_gz)
    travel_time_graph_file_paths[name] = file_path

    # coordinates
    file_path_gz = coordinates_file_paths[name]
    file_path = extract_and_cleanup(file_path_gz)
    coordinates_file_paths[name] = file_path

Create a function to load the edges into a Pandas dataframe

Let’s have a look at one of the edge files:

! head -n 10 {travel_time_graph_file_paths["NY"]}
c 9th DIMACS Implementation Challenge: Shortest Paths
c http://www.dis.uniroma1.it/~challenge9
c TIGER/Line graph USA-road-t.NY
c
p sp 264346 733846
c graph contains 264346 nodes and 733846 arcs
c
a 1 2 2008
a 2 1 2008
a 3 4 395

So we need to skip the header lines, starting either with c or p. Then, on each edge line, we have the letter a, the source node index, the target node index and the edge travel time. The edge weight has an int type here and we do not know the time unit. We assume that it corresponds to hundredths of seconds. This also does not matter for the comparison of various shortest path implementations.

So once the file is loaded with pandas.read_csv, we perform a few transformation steps:

  • set the weight column to float type, and convert the weight from hundredths of seconds to seconds
  • remove parallel edges by keeping a single edge with the min weight
  • remove loops, if there are any
  • shift the source and target indices by -1 in order to be 0-based

The point of removing parallel arcs is to get a simple directed graph, which we can represent with an adjacency matrix in a sparse format. Loops could be described in an adjacency matrix, but they would be completely useless for shortest paths, since the edge weights are non-negative.
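
As a small illustration of the parallel edge removal described above, here is what the groupby/min step does on a tiny hypothetical edge table (the indices and weights below are made up):

toy_df = pd.DataFrame(
    {
        "source": [0, 0, 1, 2],
        "target": [1, 1, 2, 2],
        "weight": [5.0, 3.0, 4.0, 1.0],
    }
)
toy_df = toy_df.groupby(["source", "target"], sort=False).min().reset_index()  # keep the min weight
toy_df = toy_df[toy_df["source"] != toy_df["target"]]  # drop the 2 -> 2 loop
toy_df

The two parallel 0 → 1 edges collapse into a single edge with weight 3.0, and the 2 → 2 loop is removed.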

def read_travel_time_graph(file_path):

    # read the header
    with open(file_path) as f:
        lines = f.readlines(10_000)
    header_size = 0
    for line in lines:
        header_size += 1
        if line.startswith("p"):
            # we read the edge count from the header
            edge_count = int(line.split(" ")[-1])
        elif line.startswith("a"):
            header_size -= 1
            break

    # read the data
    df = pd.read_csv(
        file_path,
        sep=" ",
        names=["a", "source", "target", "weight"],
        usecols=["source", "target", "weight"],
        skiprows=header_size,
    )

    assert len(df) == edge_count

    # data preparation and assertions
    assert not df.isna().any().any()  # no missing values
    df.weight = df.weight.astype(float)  # convert to float type
    df.weight *= 0.01  # convert from hundredths of seconds to seconds
    assert df.weight.min() >= 0.0  # make sure travel times are non-negative
    df = (
        df.groupby(["source", "target"], sort=False).min().reset_index()
    )  # remove parallel edges and keep the one with the smallest weight
    df = df[df["source"] != df["target"]]  # remove loops
    df[["source", "target"]] -= 1  # switch to 0-based indices

    return df

We can try read_travel_time_graph on the NY network file:

file_path = travel_time_graph_file_paths["NY"]
edges_df = read_travel_time_graph(file_path)
edges_df.head(3)
   source  target  weight
0       0       1   20.08
1       0      11   21.05
2       0    1362   60.70

Create a function to load the node coordinates into a Pandas dataframe

The node coordinate files are similar to the edge files:

! head -n 10 {coordinates_file_paths["NY"]}
c 9th DIMACS Implementation Challenge: Shortest Paths
c http://www.dis.uniroma1.it/~challenge9
c TIGER/Line nodes coords for graph USA-road-d.NY
c
p aux sp co 264346
c graph contains 264346 nodes
c
v 1 -73530767 41085396
v 2 -73530538 41086098
v 3 -73519366 41048796

We also need to skip the header lines, starting either with c or p. Then, on each node line, we have the letter v, the node index, the longitude and the latitude. From what I understand, the coordinates are expressed in WGS 84, but as int type, in millionths of degrees. So we divide the int longitude and latitude numbers by a million to obtain the coordinates in degrees.

def read_coords(file_path, epsg=4326):

    # read the header
    with open(file_path) as f:
        lines = f.readlines(10_000)
    header_size = 0
    for line in lines:
        header_size += 1
        if line.startswith("p"):
            vertex_count = int(line.split(" ")[-1])
        elif line.startswith("v"):
            header_size -= 1
            break

    # read the data
    df = pd.read_csv(
        file_path,
        sep=" ",
        names=["v", "id", "lng", "lat"],
        usecols=["id", "lng", "lat"],
        skiprows=header_size,
    )

    df["id"] -= 1  # 0-based indices
    df[["lng", "lat"]] /= 10**6  # convert the coordinates to degrees

    if epsg != 4326:

        gpd.options.use_pygeos = True

        # load the vertices into a geodataframe
        gdf = gpd.GeoDataFrame(
            df, geometry=gpd.points_from_xy(df.lng, df.lat), crs="EPSG:4326"  # WGS 84
        )
        gdf.drop(["lng", "lat"], axis=1, inplace=True)
        gdf.set_index("id", inplace=True)
        nodes_gs = gdf.geometry
        nodes_gs = nodes_gs.to_crs(epsg=epsg)
        nodes_gdf = nodes_gs.to_frame("geometry")
        nodes_gdf["x"] = nodes_gdf.geometry.x
        nodes_gdf["y"] = nodes_gdf.geometry.y
        nodes_df = nodes_gdf[["x", "y"]]

    else:

        nodes_df = df.rename(columns={"lng": "x", "lat": "y"})
        nodes_df.set_index("id", inplace=True)

    assert len(nodes_df) == vertex_count

    return nodes_df

Let’s run read_coords on the NY network file:

file_path = coordinates_file_paths["NY"]
nodes_df = read_coords(file_path)
nodes_df.head(3)
            x          y
id
0  -73.530767  41.085396
1  -73.530538  41.086098
2  -73.519366  41.048796

We added an argument epsg with a target CRS, for the case where we would like to use a CRS other than EPSG:4326. GeoPandas is used to transform the coordinates in that case.

nodes_df = read_coords(file_path, epsg=3857)
nodes_df.head(3)
               x             y
id
0  -8.185408e+06  5.024946e+06
1  -8.185382e+06  5.025049e+06
2  -8.184138e+06  5.019542e+06
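
As a side check, the same WGS 84 to Web Mercator conversion can also be done with pyproj directly, without going through a GeoDataFrame (a minimal sketch; pyproj is assumed to be available since GeoPandas depends on it):

from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
x, y = transformer.transform(-73.530767, 41.085396)  # first NY node (lng, lat)
print(x, y)

The result should match the first row of the table above.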

Save the networks into parquet files

Now we call both functions, read_travel_time_graph and read_coords, for all the networks and write the corresponding dataframes as parquet files.

parquet_tt_file_paths = {}
for name in names:
    file_path = travel_time_graph_file_paths[name]
    parquet_tt_file_path = file_path + ".parquet"
    parquet_tt_file_paths[name] = parquet_tt_file_path
    if not os.path.exists(parquet_tt_file_path):
        edges_df = read_travel_time_graph(file_path)
        edges_df.to_parquet(parquet_tt_file_path)
parquet_coord_file_paths = {}
for name in names:
    file_path = coordinates_file_paths[name]
    parquet_coord_file_path = file_path + ".parquet"
    if not os.path.exists(parquet_coord_file_path):
        nodes_df = read_coords(file_path)
        nodes_df.to_parquet(parquet_coord_file_path)
    parquet_coord_file_paths[name] = parquet_coord_file_path

We now have all the parquet files ready for use on the disk!

!tree -P '*.parquet' {data_dir_path}
/home/francois/Data/Disk_1/DIMACS_road_networks
├── BAY
│   ├── USA-road-d.BAY.co.parquet
│   └── USA-road-t.BAY.gr.parquet
├── CAL
│   ├── USA-road-d.CAL.co.parquet
│   └── USA-road-t.CAL.gr.parquet
├── COL
│   ├── USA-road-d.COL.co.parquet
│   └── USA-road-t.COL.gr.parquet
├── CTR
│   ├── USA-road-d.CTR.co.parquet
│   └── USA-road-t.CTR.gr.parquet
├── E
│   ├── USA-road-d.E.co.parquet
│   └── USA-road-t.E.gr.parquet
├── FLA
│   ├── USA-road-d.FLA.co.parquet
│   └── USA-road-t.FLA.gr.parquet
├── LKS
│   ├── USA-road-d.LKS.co.parquet
│   └── USA-road-t.LKS.gr.parquet
├── NE
│   ├── USA-road-d.NE.co.parquet
│   └── USA-road-t.NE.gr.parquet
├── NW
│   ├── USA-road-d.NW.co.parquet
│   └── USA-road-t.NW.gr.parquet
├── NY
│   ├── USA-road-d.NY.co.parquet
│   └── USA-road-t.NY.gr.parquet
├── USA
│   ├── USA-road-d.USA.co.parquet
│   └── USA-road-t.USA.gr.parquet
└── W
    ├── USA-road-d.W.co.parquet
    └── USA-road-t.W.gr.parquet

12 directories, 24 files

Query the parquet files with DuckDB

Although we could have done it earlier, when we still had the dataframes in memory, we are now going to perform some basic analysis of the networks. The motivation is to clearly separate the different steps.

We want to count the number of edges and vertices in each network, from the parquet files, without loading all the data into memory, in an efficient way. We are going to compute these network features using some SQL, with DuckDB.

%%time
network_info = []
for name in names:
    parquet_tt_file_path = parquet_tt_file_paths[name]
    query = f"""SELECT COUNT(*), MAX(source), MAX(target) FROM parquet_scan('{parquet_tt_file_path}')"""
    res = duckdb.query(query).fetchall()[0]
    edge_count = res[0]
    vertex_count = np.max(res[1:3]) + 1
    query = f"""
        WITH edges AS (SELECT
            source,
            target 
        FROM
            parquet_scan('{parquet_tt_file_path}')) 
        SELECT 
            COUNT(DISTINCT node) 
        FROM
            (     SELECT
                source AS node     
            FROM
                edges     
            UNION ALL     
                  SELECT
                target AS node     
            FROM
                edges       
        )"""
    used_vertices = duckdb.query(query).fetchone()[0]
    mean_degree = edge_count / used_vertices
    network_info.append(
        {
            "name": name,
            "vertex_count": vertex_count,
            "used_vertices": used_vertices,
            "edge_count": edge_count,
            "mean_degree": mean_degree,
        }
    )
CPU times: user 1min 16s, sys: 3.96 s, total: 1min 20s
Wall time: 43.6 s
network_info_df = pd.DataFrame(network_info).set_index("name")
network_info_df = network_info_df.sort_values(by=["edge_count"])
pd.set_option("display.precision", 2)
network_info_df
      vertex_count  used_vertices  edge_count  mean_degree
name
NY          264346         264346      730100         2.76
BAY         321270         321270      794830         2.47
COL         435666         435666     1042400         2.39
FLA        1070376        1070376     2687902         2.51
NW         1207945        1207945     2820774         2.34
NE         1524453        1524453     3868020         2.54
CAL        1890815        1890815     4630444         2.45
LKS        2758119        2758119     6794808         2.46
E          3598623        3598623     8708058         2.42
W          6262104        6262104    15119284         2.41
CTR       14081816       14081816    33866826         2.41
USA       23947347       23947347    57708624         2.41

We can observe that all the vertices are actually used in the graph. The mean degree does not vary too much, although it is a little larger in the NY area, which is densely populated.
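
To look a little closer at the degree structure, here is a small additional query (a sketch, not part of the run above) that computes the out-degree distribution of the NY network directly from its parquet file:

parquet_tt_file_path = parquet_tt_file_paths["NY"]
query = f"""
    SELECT out_degree, COUNT(*) AS vertex_count
    FROM (
        SELECT source, COUNT(*) AS out_degree
        FROM parquet_scan('{parquet_tt_file_path}')
        GROUP BY source
    )
    GROUP BY out_degree
    ORDER BY out_degree"""
duckdb.query(query).df()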

Plot the networks with Datashader

Finally, we are going to plot some of these networks using Datashader.

But before that, we also want to print a rough approximation of the network width and height. Because we have the coordinates expressed in WGS 84, let’s write a small function to compute an approximate distance in meters between two points, given their lat/lon coordinates. This is a straight implementation of the Haversine formula.

def haversine_distance(lon_1, lat_1, lon_2, lat_2):
    """Calculate the distance between two points using their latitude and longitude."""

    phi_1, phi_2 = np.radians(lat_1), np.radians(lat_2)
    delta_phi = np.radians(lat_2 - lat_1)
    delta_lambda = np.radians(lon_2 - lon_1)
    a = (
        np.sin(0.5 * delta_phi) ** 2
        + np.cos(phi_1) * np.cos(phi_2) * np.sin(0.5 * delta_lambda) ** 2
    )
    c = 2.0 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    R = 6.371e6  # approximate earth radius in meters
    d_m = R * c

    return d_m
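
As a quick sanity check of this function: two points on the equator, one degree of longitude apart, should be roughly 111 km away from each other.

d_km = 0.001 * haversine_distance(0.0, 0.0, 1.0, 0.0)
print(f"{d_km:.1f} km")  # about 111.2 km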

Now the following load_and_plot function loads the vertex and edge dataframes from the respective parquet files. Both are then used in Datashader’s connect_edges function. The Datashader part is strongly inspired by its documentation about networks.

def load_and_plot(
    name,
    parquet_coord_file_paths,
    parquet_tt_file_paths,
    network_info_df,
    plot_width=1200,
):

    # load the network
    parquet_coord_file_path = parquet_coord_file_paths[name]
    nodes_df = pd.read_parquet(parquet_coord_file_path)
    parquet_tt_file_path = parquet_tt_file_paths[name]
    edges_df = pd.read_parquet(parquet_tt_file_path, columns=["source", "target"])
    edges_df = edges_df.astype(np.uintc)

    # compute the network width and height
    xr = nodes_df.x.min(), nodes_df.x.max()
    yr = nodes_df.y.min(), nodes_df.y.max()
    width_km = 0.001 * haversine_distance(xr[0], yr[0], xr[1], yr[0])
    height_km = 0.001 * haversine_distance(xr[0], yr[0], xr[0], yr[1])
    print(f"Network : {name}")
    print(
        f"width : {int(round(width_km)):6d} km, height : {int(round(height_km)):6d} km"
    )
    vertex_count = network_info_df.loc[name, "vertex_count"]
    edge_count = network_info_df.loc[name, "edge_count"]
    print(f"vertex count : {vertex_count}, edge count : {edge_count}")

    # plot the network
    edges_ds = connect_edges(nodes_df, edges_df)
    plot_height = int(plot_width * (yr[1] - yr[0]) / (xr[1] - xr[0]))
    cvsopts = dict(plot_height=plot_height, plot_width=plot_width)
    canvas = ds.Canvas(x_range=xr, y_range=yr, **cvsopts)
    ep = tf.spread(
        tf.shade(canvas.line(edges_ds, "x", "y", agg=ds.count()), cmap=["#F03811"]),
        px=0,
    )

    return ep
load_and_plot("NY", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : NY
width :     85 km, height :    111 km
vertex count : 264346, edge count : 730100

NY

load_and_plot("BAY", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : BAY
width :    178 km, height :    222 km
vertex count : 321270, edge count : 794830

BAY

load_and_plot("COL", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : COL
width :    621 km, height :    445 km
vertex count : 435666, edge count : 1042400

COL

load_and_plot("FLA", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : FLA
width :    753 km, height :    680 km
vertex count : 1070376, edge count : 2687902

FLA

load_and_plot("NW", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : NW
width :    720 km, height :    779 km
vertex count : 1207945, edge count : 2820774

NW

load_and_plot("NE", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : NE
width :    520 km, height :    389 km
vertex count : 1524453, edge count : 3868020

NE

load_and_plot("CAL", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : CAL
width :    976 km, height :   1056 km
vertex count : 1890815, edge count : 4630444

CAL

load_and_plot("LKS", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : LKS
width :   1591 km, height :    827 km
vertex count : 2758119, edge count : 6794808

LKS

load_and_plot("E", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : E
width :   1116 km, height :   1543 km
vertex count : 3598623, edge count : 8708058

E

load_and_plot("W", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : W
width :   2422 km, height :   2331 km
vertex count : 6262104, edge count : 15119284

W

load_and_plot("CTR", parquet_coord_file_paths, parquet_tt_file_paths, network_info_df)
Network : CTR
width :   2114 km, height :   2669 km
vertex count : 14081816, edge count : 33866826

CTR

We note that it is hard to distinguish anything in the last plot because of the network’s large size, with a fixed plot size and a fixed edge width. Also, I couldn’t plot the largest one, USA, because of memory constraints.
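
For the USA network, a possible workaround (not tried here) would be to skip the edges entirely and shade only the nodes with Datashader’s Canvas.points, which avoids building the large edge-segment dataframe produced by connect_edges:

nodes_df = pd.read_parquet(parquet_coord_file_paths["USA"])
xr = nodes_df.x.min(), nodes_df.x.max()
yr = nodes_df.y.min(), nodes_df.y.max()
canvas = ds.Canvas(x_range=xr, y_range=yr, plot_width=1200, plot_height=800)
tf.shade(canvas.points(nodes_df, "x", "y", agg=ds.count()), cmap=["#F03811"])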

The networks are now ready to be used by some shortest path algorithms, which will be the subject of some future posts.