In November 2024, Foursquare released a new dataset called Foursquare Open Source Places. This dataset is a comprehensive global collection of Points of Interest (POIs), providing detailed information about venues, including their categories, attributes, and geospatial details.

In this post, we are going to play a little bit with this dataset by fetching and plotting all the pizzerias of the world.

Imports

import os

import contextily as cx
import datashader as ds
import duckdb
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import xyzservices.providers as xyz
from colorcet import palette
from datashader import transfer_functions as tf

PARQUET_FP = "./pizzerias.parquet"  # file path
FS = (12, 7)

Querying and Visualizing Pizzerias

Using DuckDB, we analyze the FSQ OS Places dataset stored on Amazon s3 to extract worldwide pizzeria locations.

%%time
conn = duckdb.connect()
query = "SELECT COUNT(*) FROM 's3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-*.snappy.parquet';"
conn.sql(query)
CPU times: user 351 ms, sys: 25.7 ms, total: 376 ms
Wall time: 1.16 s

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    104511073 │
└──────────────┘

Inspired by a LinkedIn post about Belgian fritteries by Julien Colot, we query pizzeria locations using the category pizzeria : ID 4bf58dd8d48988d1ca941735. Given that the dataset organizes POI categories hierarchically, we filter the data using the list_contains function to directly target the relevant category ID, avoiding the need to unnest the entire dataset. Although this method is painly slow, it is a one-time operation. We save the data to a Parquet file for further reuse:

if not os.path.isfile(PARQUET_FP):
    query = f"""
    COPY (
      SELECT * FROM read_parquet('s3://fsq-os-places-us-east-1/release/dt=2024-11-19/places/parquet/places-*.snappy.parquet') AS t
      WHERE list_contains(t.fsq_category_ids, '4bf58dd8d48988d1ca941735')
    ) TO '{PARQUET_FP}';"""
    conn.sql(query)
df = pd.read_parquet(PARQUET_FP)
conn.close()
CPU times: user 6min 31s, sys: 6min 6s, total: 12min 38s
Wall time: 40min 35s
df.shape
(650873, 24)

We have 650873 pizzerias listed, one pizzeria for each row.

df.head(3)
fsq_place_id name ... fsq_category_labels dt
0 4b4a33e9f964a5207b7e26e3 Pizza Vesuvio ... [Dining and Drinking > Restaurant > Pizzeria, ... 2024-11-19
1 4d0ff14e38bb6ea88488c1aa Pizzaville ... [Dining and Drinking > Restaurant > Pizzeria] 2024-11-19
2 520a776111d2995bb605242b Pizzaria jet pizza ... [Dining and Drinking > Restaurant > Pizzeria] 2024-11-19

3 rows × 24 columns

Let’s have a look at the columns we got:

df.columns
Index(['fsq_place_id', 'name', 'latitude', 'longitude', 'address', 'locality',
       'region', 'postcode', 'admin_region', 'post_town', 'po_box', 'country',
       'date_created', 'date_refreshed', 'date_closed', 'tel', 'website',
       'email', 'facebook_id', 'instagram', 'twitter', 'fsq_category_ids',
       'fsq_category_labels', 'dt'],
      dtype='object')

We see that there is a date_closed column:

df.date_closed.isna().sum(axis=0)
590316

We select pizzerias that are still open, without a date_closed value:

df = df.loc[df.date_closed.isna()]

We have 590316 open pizzerias. Let’s transform the pandas DataFrame into a geopandas GeoDataFrame:

gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.longitude, df.latitude), crs="EPSG:4326"
)

To get a quick overview of the spatial distribution of the pizzerias, we can visualize the GeoDataFrame:

ax = gdf.plot(markersize=4, alpha=0.5, figsize=FS)
ax.grid()

quick

Plot all the pizzerias with Datashader

World view

We visualize the global distribution of pizzerias using Datashader. We start by filtering the dataset to exclude extreme latitudes, focusing on points between -75 and 80 degrees latitude.

df_world = df.loc[(df.latitude > -75) & (df.latitude < 80)]
cmap = palette["fire"]
bg_col = "black"
size_x, size_y = 1200, 600
cvs = ds.Canvas(plot_width=size_x, plot_height=size_y)
agg = cvs.points(df_world, "longitude", "latitude")
img = tf.shade(agg, cmap=cmap)
img = tf.set_background(img, bg_col)
img

World

A large number of pizzerias are located in Europe and the US, as well as in Uruguay and Argentina, regions with a significant Italian diaspora.

Europe view

bbox = -13.876174, 35.238452, 40.253716, 60.966935
df_europe = df.loc[
    (df.longitude > bbox[0])
    & (df.longitude < bbox[2])
    & (df.latitude > bbox[1])
    & (df.latitude < bbox[3])
]
size_x, size_y = 1200, 800
cvs = ds.Canvas(plot_width=size_x, plot_height=size_y)
agg = cvs.points(df_europe, "longitude", "latitude")
img = tf.shade(agg, cmap=cmap)
img = tf.set_background(img, bg_col)
img

Europe

Many pizzerias are concentrated along the Mediterranean and Adriatic coasts.

Italy

bbox = 6.492820,36.389024,18.663258,47.206285
df_italy = df.loc[
    (df.longitude > bbox[0])
    & (df.longitude < bbox[2])
    & (df.latitude > bbox[1])
    & (df.latitude < bbox[3])
]
size_x, size_y = 1000, 1200
cvs = ds.Canvas(plot_width=size_x, plot_height=size_y)
agg = cvs.points(df_italy, "longitude", "latitude")
img = tf.shade(agg, cmap=cmap)
img = tf.set_background(img, bg_col)
img

Italy

Major italian cities like Naples, Rome, Milan, and Turin stand out as pizzeria hotspots.

Lyon Croix-Rousse

Let’s examine my district, Lyon Croix-Rousse, to validate the dataset in an area where I am familiar with all the pizzerias.

bbox = 4.815526,45.765160,4.845303,45.780993
df_lyon = df.loc[
    (df.longitude > bbox[0])
    & (df.longitude < bbox[2])
    & (df.latitude > bbox[1])
    & (df.latitude < bbox[3])
]
df_lyon.head(3)
fsq_place_id name ... fsq_category_labels dt
8822 4cc1df5b01fb236a265a9bba Pizza Des Canuts ... [Dining and Drinking > Restaurant > Pizzeria] 2024-11-19
11326 553f6623498e53d4de32f6de Caffe San Remo ... [Dining and Drinking > Restaurant > Pizzeria] 2024-11-19
18889 de751b48f717441a2c7c1768 Djama Badis ... [Dining and Drinking > Bar, Dining and Drinkin... 2024-11-19

3 rows × 24 columns

gdf_lyon = gpd.GeoDataFrame(
    df_lyon,
    geometry=gpd.points_from_xy(df_lyon.longitude, df_lyon.latitude),
    crs="EPSG:4326",
)
ax = gdf_lyon.plot(markersize=4, figsize=(12, 8))
cx.add_basemap(
    ax,
    source=xyz.CartoDB.VoyagerNoLabels,
    crs=gdf_lyon.crs.to_string(),
    alpha=0.8,
)
_ = plt.axis("off")

Lyon

We can create an interactive plot with the GeoDataFrame .explore() method:

gdf_lyon[["name", "address", "geometry"]].explore()

Lyon

The dataset for Lyon Croix-Rousse appears reasonably accurate, but there are some discrepancies:

  • Incorrect location: one pizzeria is mapped to the wrong place.
  • Fictional entry: at least one pizzeria seems entirely fictional.
  • Missing entries: two pizzerias are missing.

Overall, I estimate the dataset’s accuracy at around 90% for the downtown area. However, it’s possible that the data is more reliable in other regions, or countries.