Main

Well, I love the datashader Python package and I am always happy to use it on interesting datasets. I recently came across a traffic injury database for French roads, which happens to include geographical coordinates. It comes from data.gouv.fr, the open platform for French public data. So in this notebook, we are going to plot all the georeferenced collisions that occurred over 14 years (2005 to 2018) in mainland France.

Disclaimer: this notebook was written during a lunch break (quick-and-dirty style).

Imports

import subprocess

import pandas as pd
import datashader as ds
from datashader import transfer_functions as tf
from colorcet import palette
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%load_ext watermark
%watermark -v
CPython 3.8.1
IPython 7.11.1
%watermark --iversions
matplotlib 3.1.2
pandas     1.0.0
datashader 0.10.0

Collect the CSV files

The coordinates can be found in the caracteristiques_20??.csv files. I did not try to scrape the web page but simply gathered the URLs for each year into a list:

files = [
    ['https://www.data.gouv.fr/fr/datasets/r/6eee0852-cbd7-447e-bd70-37c433029405', 2018],
    ['https://www.data.gouv.fr/fr/datasets/r/9a7d408b-dd72-4959-ae7d-c854ec505354', 2017],
    ['https://www.data.gouv.fr/fr/datasets/r/96aadc9f-0b55-4e9a-a70e-c627ed97e6f7', 2016],
    ['https://www.data.gouv.fr/fr/datasets/r/185fbdc7-d4c5-4522-888e-ac9550718f71', 2015],    
    ['https://www.data.gouv.fr/fr/datasets/r/85dfe8c6-589f-4e76-8a07-9f59e49ec10d', 2014],
    ['https://www.data.gouv.fr/fr/datasets/r/18b1a57a-57bf-4bf1-b9ee-dfa5a3154225', 2013],
    ['https://www.data.gouv.fr/fr/datasets/r/b2518ec1-6529-47bc-9d55-40e2effeb0e7', 2012],
    ['https://www.data.gouv.fr/fr/datasets/r/37991267-8a15-4a9d-9b1c-ff3e6bea3625', 2011],
    ['https://www.data.gouv.fr/fr/datasets/r/decdfe8c-38ff-4a06-b7fc-615785f2914d', 2010],
    ['https://www.data.gouv.fr/fr/datasets/r/fdfacdb9-f48e-4759-bae5-48d063216acb', 2009],
    ['https://www.data.gouv.fr/fr/datasets/r/722ebb99-c8b2-4635-bf8d-125dd280ee42', 2008],
    ['https://www.data.gouv.fr/fr/datasets/r/6fc7b169-4dfe-442c-8c28-8bd773aeddf8', 2007],
    ['https://www.data.gouv.fr/fr/datasets/r/fafa33cf-50cb-4092-a819-d5209f684089', 2006],
    ['https://www.data.gouv.fr/fr/datasets/r/a47866f7-ece1-4de8-8d31-3a1b4f477e08', 2005],
]

Let’s download these files:

%%time
file_paths = []
for item in files:
    file_path = f'./data/caracteristiques-{str(item[1])}.csv'
    file_paths.append(file_path)
    cmd = f'wget --output-document {file_path} {item[0]}'
    subprocess.call(cmd, shell=True)
CPU times: user 322 µs, sys: 100 ms, total: 100 ms
Wall time: 20.4 s
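
As a side note, the wget calls could be replaced with the standard library if wget is not installed. A minimal sketch with urllib (no error handling or retries):

import urllib.request

for url, year in files:
    # download each yearly CSV to the same ./data location
    urllib.request.urlretrieve(url, f'./data/caracteristiques-{year}.csv')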

Load the CSVs into a dataframe

If we look at the encoding of each file, we get a little surprise:

for file_path in file_paths:
    cmd = f'uchardet {file_path}'
    encoding = subprocess.check_output(cmd, shell=True).decode("utf-8")[:-1]
    print(encoding)
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
ISO-8859-1
UTF-8
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
WINDOWS-1252
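
Shelling out to uchardet requires the binary to be installed; a pure-Python alternative is the chardet package (assuming it is available). A minimal sketch:

import chardet

def detect_encoding(file_path, n_bytes=100_000):
    # let chardet guess the encoding from a prefix of the raw bytes
    with open(file_path, 'rb') as f:
        raw = f.read(n_bytes)
    return chardet.detect(raw)['encoding']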

Also, the separator may vary and is not always detected by pandas, so it is hard-coded for each file. The standard library's csv.Sniffer can often guess the delimiter from a sample; a minimal sketch (not used below):
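
import csv

def guess_sep(file_path, encoding, n_bytes=4096):
    # sniff the delimiter from the beginning of the file
    with open(file_path, encoding=encoding) as f:
        sample = f.read(n_bytes)
    return csv.Sniffer().sniff(sample, delimiters=',;\t').delimiter

Fortunately, all the loaded dataframes have the same column names, so we concatenate them all: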

dfs = []
for file_path in file_paths:
    cmd = f'uchardet {file_path}'
    encoding = subprocess.check_output(cmd, shell=True).decode('utf-8')[:-1]
    # the 2009 file is tab-separated, the others are comma-separated
    sep = '\t' if file_path.split('-')[-1].split('.')[0] == '2009' else ','
    dfs.append(pd.read_csv(file_path, encoding=encoding, sep=sep))
df = pd.concat(dfs, axis=0)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 958469 entries, 0 to 87025
Data columns (total 16 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   Num_Acc  958469 non-null  int64  
 1   an       958469 non-null  int64  
 2   mois     958469 non-null  int64  
 3   jour     958469 non-null  int64  
 4   hrmn     958469 non-null  int64  
 5   lum      958469 non-null  int64  
 6   agg      958469 non-null  int64  
 7   int      958469 non-null  int64  
 8   atm      958396 non-null  float64
 9   col      958450 non-null  float64
 10  com      958467 non-null  float64
 11  adr      816550 non-null  object 
 12  gps      480052 non-null  object 
 13  lat      471401 non-null  float64
 14  long     471397 non-null  object 
 15  dep      958469 non-null  int64  
dtypes: float64(4), int64(9), object(3)
memory usage: 124.3+ MB

Now we create a datetime index:

df.rename(columns={'an': 'year', 'mois': 'month', 'jour': 'day'}, inplace=True)
df.year = df.year.map(lambda x: int('20' + str(x).zfill(2)))
df['hour'] = df.hrmn.map(lambda x: int(str(x).zfill(4)[0:2]))
df['minute'] = df.hrmn.map(lambda x: int(str(x).zfill(4)[2:4]))
df.drop('hrmn', axis=1, inplace=True)
df['dt'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])
df.set_index('dt', inplace=True)
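
For reference, since an and hrmn are stored as plain integers (hrmn encodes 15:05 as 1505), the three map calls above could equivalently be written with vectorized integer arithmetic. A sketch:

df['year'] = df['year'] + 2000
df['hour'] = df['hrmn'] // 100
df['minute'] = df['hrmn'] % 100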

And we can plot the daily count of traffic injuries:

ax = df.index.to_frame().resample('D').count().plot(legend=False, figsize=(15, 7))
ax.set_xlabel('Date');
ax.set_ylabel('Daily count');
ax.set_title('Daily count of traffic injuries');

Daily count

Map of traffic injuries

Now we focus on the coordinate columns lat and long. We also select only the mainland France area (gps == 'M'):

coords = df[df.gps == 'M'][['lat', 'long']].dropna(how='any')
coords.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 456090 entries, 2018-01-24 15:05:00 to 2005-12-11 12:55:00
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   lat     456090 non-null  float64
 1   long    456090 non-null  object 
dtypes: float64(1), object(1)
memory usage: 10.4+ MB

It appears that the long column contains some numbers (a lot of zeros actually!) as well as some str values:

coords.long.value_counts()
0.0         97426
0           28975
-            3209
345699.0      162
228547.0      131
            ...  
247394.0        1
441588.0        1
247392.0        1
441590.0        1
241601.0        1
Name: long, Length: 205130, dtype: int64

So we remove the rows with the '-' placeholder and convert the long column to float:

coords = coords.loc[coords.long != '-'].copy(deep=True)
coords['long'] = coords['long'].astype(float)
coords.head(2)
                          lat      long
dt
2018-01-24 15:05:00  5055737.0  294992.0
2018-02-12 10:15:00  5052936.0  293151.0
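
A slightly more defensive alternative (a sketch, not what is used above) is pd.to_numeric with errors='coerce', which turns any non-numeric entry into NaN so it can be dropped:

coords['long'] = pd.to_numeric(coords['long'], errors='coerce')
coords = coords.dropna(subset=['long'])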

The coordinates are WGS84 decimal degrees stored as integers multiplied by 100000, so we divide by 100000 to get back to plain degrees:

coords /= 100000

We also filter the points with a bounding box around mainland France:

bbox = (-5., 39., 10., 54.)
coords = coords[
    (coords.long > bbox[0]) & (coords.lat > bbox[1])
    & (coords.long < bbox[2]) & (coords.lat < bbox[3])
]

And that’s it! We can finally call datashader:

cmap = palette['CET_L19']
bg_col = 'white'
size = 1200
cvs = ds.Canvas(plot_width=size, plot_height=size)
agg = cvs.points(coords, 'long', 'lat')
img = tf.shade(agg, cmap=cmap)
img = tf.set_background(img, bg_col)
img

White map
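
To save the rendered image to a PNG file, datashader's export_image utility can be used (the file name and directory here are just examples):

from datashader.utils import export_image

# writes ./img/injuries_map.png with the same white background
export_image(img, 'injuries_map', background=bg_col, export_path='./img')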

This confirms my intuition that driving in Paris is dangerous :)