Pizzerias Around the World
In November 2024, Foursquare released a new dataset called Foursquare Open Source Places. This dataset is a comprehensive global collection of Points of Interest (POIs), providing detailed information about venues, including their categories, attributes, and geospatial details.
In this post, we are going to play a little bit with this dataset by fetching and plotting all the pizzerias of the world.
Imports
import os
import contextily as cx
import datashader as ds
import duckdb
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import xyzservices.providers as xyz
from colorcet import palette
from datashader import transfer_functions as tf
PARQUET_FP = "./pizzerias.parquet" # file path
FS = (12, 7)
Querying and Visualizing Pizzerias
Using DuckDB, we analyze the FSQ OS Places dataset stored on Amazon s3 to extract worldwide pizzeria locations.
%%time
conn = duckdb.connect()
query = "SELECT COUNT(*) FROM 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-*.snappy.parquet';"
conn.sql(query)
CPU times: user 351 ms, sys: 25.7 ms, total: 376 ms
Wall time: 1.16 s
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 104511073 │
└──────────────┘
Inspired by a LinkedIn post about Belgian fritteries by
Julien Colot, we query pizzeria locations using the category pizzeria : ID 4bf58dd8d48988d1ca941735
. Given that the dataset organizes POI categories hierarchically, we filter the data using the list_contains
function to directly target the relevant category ID, avoiding the need to unnest the entire dataset. Although this method is painly slow, it is a one-time operation. We save the data to a Parquet file for further reuse:
if not os.path.isfile(PARQUET_FP):
query = f"""
COPY (
SELECT * FROM read_parquet('s3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-*.snappy.parquet') AS t
WHERE list_contains(t.fsq_category_ids, '4bf58dd8d48988d1ca941735')
) TO '{PARQUET_FP}';"""
conn.sql(query)
df = pd.read_parquet(PARQUET_FP)
conn.close()
CPU times: user 6min 31s, sys: 6min 6s, total: 12min 38s
Wall time: 40min 35s
df.shape
(650873, 24)
We have 650873 pizzerias listed, one pizzeria for each row.
df.head(3)
fsq_place_id | name | ... | fsq_category_labels | dt | |
---|---|---|---|---|---|
0 | 4b4a33e9f964a5207b7e26e3 | Pizza Vesuvio | ... | [Dining and Drinking > Restaurant > Pizzeria, ... | 2024-11-19 |
1 | 4d0ff14e38bb6ea88488c1aa | Pizzaville | ... | [Dining and Drinking > Restaurant > Pizzeria] | 2024-11-19 |
2 | 520a776111d2995bb605242b | Pizzaria jet pizza | ... | [Dining and Drinking > Restaurant > Pizzeria] | 2024-11-19 |
3 rows × 24 columns
Let’s have a look at the columns we got:
df.columns
Index(['fsq_place_id', 'name', 'latitude', 'longitude', 'address', 'locality',
'region', 'postcode', 'admin_region', 'post_town', 'po_box', 'country',
'date_created', 'date_refreshed', 'date_closed', 'tel', 'website',
'email', 'facebook_id', 'instagram', 'twitter', 'fsq_category_ids',
'fsq_category_labels', 'dt'],
dtype='object')
We see that there is a date_closed
column:
df.date_closed.isna().sum(axis=0)
590316
We select pizzerias that are still open, without a date_closed
value:
df = df.loc[df.date_closed.isna()]
We have 590316 open pizzerias. Let’s transform the pandas DataFrame into a geopandas GeoDataFrame:
gdf = gpd.GeoDataFrame(
df, geometry=gpd.points_from_xy(df.longitude, df.latitude), crs="EPSG:4326"
)
To get a quick overview of the spatial distribution of the pizzerias, we can visualize the GeoDataFrame:
ax = gdf.plot(markersize=4, alpha=0.5, figsize=FS)
ax.grid()
Plot all the pizzerias with Datashader
World view
We visualize the global distribution of pizzerias using Datashader. We start by filtering the dataset to exclude extreme latitudes, focusing on points between -75 and 80 degrees latitude.
df_world = df.loc[(df.latitude > -75) & (df.latitude < 80)]
cmap = palette["fire"]
bg_col = "black"
size_x, size_y = 1200, 600
cvs = ds.Canvas(plot_width=size_x, plot_height=size_y)
agg = cvs.points(df_world, "longitude", "latitude")
img = tf.shade(agg, cmap=cmap)
img = tf.set_background(img, bg_col)
img
A large number of pizzerias are located in Europe and the US, as well as in Uruguay and Argentina, regions with a significant Italian diaspora.
Europe view
bbox = -13.876174, 35.238452, 40.253716, 60.966935
df_europe = df.loc[
(df.longitude > bbox[0])
& (df.longitude < bbox[2])
& (df.latitude > bbox[1])
& (df.latitude < bbox[3])
]
size_x, size_y = 1200, 800
cvs = ds.Canvas(plot_width=size_x, plot_height=size_y)
agg = cvs.points(df_europe, "longitude", "latitude")
img = tf.shade(agg, cmap=cmap)
img = tf.set_background(img, bg_col)
img
Many pizzerias are concentrated along the Mediterranean and Adriatic coasts.
Italy
bbox = 6.492820,36.389024,18.663258,47.206285
df_italy = df.loc[
(df.longitude > bbox[0])
& (df.longitude < bbox[2])
& (df.latitude > bbox[1])
& (df.latitude < bbox[3])
]
size_x, size_y = 1000, 1200
cvs = ds.Canvas(plot_width=size_x, plot_height=size_y)
agg = cvs.points(df_italy, "longitude", "latitude")
img = tf.shade(agg, cmap=cmap)
img = tf.set_background(img, bg_col)
img
Major italian cities like Naples, Rome, Milan, and Turin stand out as pizzeria hotspots.
Lyon Croix-Rousse
Let’s examine my district, Lyon Croix-Rousse, to validate the dataset in an area where I am familiar with all the pizzerias.
bbox = 4.815526,45.765160,4.845303,45.780993
df_lyon = df.loc[
(df.longitude > bbox[0])
& (df.longitude < bbox[2])
& (df.latitude > bbox[1])
& (df.latitude < bbox[3])
]
df_lyon.head(3)
fsq_place_id | name | ... | fsq_category_labels | dt | |
---|---|---|---|---|---|
8822 | 4cc1df5b01fb236a265a9bba | Pizza Des Canuts | ... | [Dining and Drinking > Restaurant > Pizzeria] | 2024-11-19 |
11326 | 553f6623498e53d4de32f6de | Caffe San Remo | ... | [Dining and Drinking > Restaurant > Pizzeria] | 2024-11-19 |
18889 | de751b48f717441a2c7c1768 | Djama Badis | ... | [Dining and Drinking > Bar, Dining and Drinkin... | 2024-11-19 |
3 rows × 24 columns
gdf_lyon = gpd.GeoDataFrame(
df_lyon,
geometry=gpd.points_from_xy(df_lyon.longitude, df_lyon.latitude),
crs="EPSG:4326",
)
ax = gdf_lyon.plot(markersize=4, figsize=(12, 8))
cx.add_basemap(
ax,
source=xyz.CartoDB.VoyagerNoLabels,
crs=gdf_lyon.crs.to_string(),
alpha=0.8,
)
_ = plt.axis("off")
We can create an interactive plot with the GeoDataFrame .explore()
method:
gdf_lyon[["name", "address", "geometry"]].explore()
The dataset for Lyon Croix-Rousse appears reasonably accurate, but there are some discrepancies:
- Incorrect location: one pizzeria is mapped to the wrong place.
- Fictional entry: at least one pizzeria seems entirely fictional.
- Missing entries: two pizzerias are missing.
Overall, I estimate the dataset’s accuracy at around 90% for the downtown area. However, it’s possible that the data is more reliable in other regions, or countries.