More than three years ago, we posted a comparative study about looping over Pandas data using a CPU. Because a lot has changed since 2018, this post is an update. For example, the Pandas tag version was 0.23.3 at that time; it is now 1.4.0. We also added some more options.

Here is a list of all the options tested in the following:

  • Pandas built-in vectorization
  • pandas.DataFrame.iterrows
  • pandas.DataFrame.apply
  • pandas.DataFrame.itertuples
  • numpy.apply_along_axis
  • numpy.vectorize
  • map
  • swifter
  • dask.dataframe.map_partitions
  • polars.DataFrame.apply
  • polars built-in vectorization
  • Numba
  • Numba parallel
  • Cython
  • Cython parallel

Introduction

Motivation

Row-wise function

The focus is on looping over the rows of a Pandas dataframe holding some numerical data. All the elements of the dataframe are of np.float64 data type. A function is applied to each row, taking the row elements as input, either as distinct scalar arguments, as an array, or as a Pandas Series. The computation returns a scalar value per row, so that the process eventually returns a numeric Pandas Series with the same index as the original dataframe. For this post, a toy function computing the determinant of a 3-by-3 symmetric real matrix is used:

\[\begin{equation*} M = \begin{pmatrix} m_{1,1} & m_{1,2} & m_{1,3} \\ m_{1,2} & m_{2,2} & m_{2,3} \\ m_{1,3} & m_{2,3} & m_{3,3} \end{pmatrix} \end{equation*}\] \[|M| = m_{1,1} (m_{2,2} \, m_{3,3} - m_{2,3}^2) + m_{1,2} \, (2 \, m_{2,3} \, m_{1,3} - m_{1,2} \, m_{3,3}) - m_{2,2} \, m_{1,3}^2\]
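
As a short sketch of where this closed form comes from, one can expand the determinant along the first row of the matrix and then use the symmetry of the off-diagonal terms to collect them:

\[\begin{aligned} |M| &= m_{1,1} (m_{2,2} \, m_{3,3} - m_{2,3}^2) - m_{1,2} (m_{1,2} \, m_{3,3} - m_{2,3} \, m_{1,3}) + m_{1,3} (m_{1,2} \, m_{2,3} - m_{2,2} \, m_{1,3}) \\ &= m_{1,1} (m_{2,2} \, m_{3,3} - m_{2,3}^2) - m_{1,2}^2 \, m_{3,3} + 2 \, m_{1,2} \, m_{2,3} \, m_{1,3} - m_{2,2} \, m_{1,3}^2 \\ &= m_{1,1} (m_{2,2} \, m_{3,3} - m_{2,3}^2) + m_{1,2} \, (2 \, m_{2,3} \, m_{1,3} - m_{1,2} \, m_{3,3}) - m_{2,2} \, m_{1,3}^2 \end{aligned}\]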

Here is a pure Python implementation of the row-wise function:

def det_sym33(m11, m12, m13, m22, m23, m33):
    """Compute the determinant of a real symmetric 3x3 matrix 
    given its 6 upper triangular coefficients.
    """
    return (
        m11 * (m22 * m33 - m23**2)
        + m12 * (2.0 * m23 * m13 - m12 * m33)
        - m22 * m13**2
    )
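
As a sanity check of the closed-form expression, the sketch below compares it with a generic cofactor expansion on random symmetric matrices. The helpers det_full33 and check_det_sym33 are hypothetical names introduced only for this illustration; they are not part of the benchmark.

```python
import math
import random


def det_sym33(m11, m12, m13, m22, m23, m33):
    """Closed-form determinant of a real symmetric 3x3 matrix."""
    return (
        m11 * (m22 * m33 - m23**2)
        + m12 * (2.0 * m23 * m13 - m12 * m33)
        - m22 * m13**2
    )


def det_full33(a):
    """Generic cofactor expansion along the first row of a 3x3 matrix."""
    return (
        a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
        - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
        + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0])
    )


def check_det_sym33(n_trials=1000, seed=124):
    """Return True if both formulas agree on n_trials random symmetric matrices."""
    rnd = random.Random(seed)
    for _ in range(n_trials):
        m11, m12, m13, m22, m23, m33 = (rnd.random() for _ in range(6))
        full = [[m11, m12, m13], [m12, m22, m23], [m13, m23, m33]]
        sym = det_sym33(m11, m12, m13, m22, m23, m33)
        if not math.isclose(sym, det_full33(full), rel_tol=1e-9, abs_tol=1e-12):
            return False
    return True


print(check_det_sym33())
```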

Imports

from itertools import cycle
from time import perf_counter

import cython
import dask.dataframe as dd
import matplotlib.pyplot as plt  # used by plot_timings at the end of the post
from numba import jit, float64, njit, prange
import numpy as np
import perfplot
import pandas as pd
import polars as pl
import swifter

%load_ext Cython
%load_ext line_profiler

SD = 124  # random seed
rng = np.random.default_rng(SD)  # random number generator
Python version       : 3.9.10
IPython version      : 8.0.1
polars    : 0.12.19
numpy     : 1.21.5
swifter   : 1.0.9
pandas    : 1.4.1
cython    : 0.29.28
numba     : 0.55.1
perfplot  : 0.10.1
dask      : 2022.2.1

Computations are performed on a laptop with a 4-core (8-thread) Intel i7-7700HQ CPU @ 2.80GHz, running Linux.

Timing function

This function returns the result of the computation along with the best elapsed time over r trials:

def timing(func, df, r=10):
    timings = []
    for i in range(r):
        start = perf_counter()
        s = func(df)
        end = perf_counter()
        elapsed_time = end - start
        timings.append(elapsed_time)
    return s, np.amin(timings)
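
Taking the minimum over several runs tends to filter out noise from other processes and one-off costs such as a first-call JIT compilation. As an illustration only, here is a standard-library variant that also reports the median and performs an untimed warm-up call. The name timing_stats and the warmup parameter are assumptions made for this sketch, not something used in the rest of the post.

```python
from statistics import median
from time import perf_counter


def timing_stats(func, arg, r=10, warmup=1):
    """Hypothetical variant of timing(): result, best time and median time."""
    for _ in range(warmup):
        func(arg)  # untimed call, e.g. to trigger lazy compilation or fill caches
    timings = []
    result = None
    for _ in range(r):
        start = perf_counter()
        result = func(arg)
        timings.append(perf_counter() - start)
    return result, min(timings), median(timings)


# Toy usage: time the built-in sum over a range
result, t_min, t_med = timing_stats(sum, range(100_000), r=5)
print(f"best: {t_min:.7f} s, median: {t_med:.7f} s")
```

A large gap between the best and the median time would suggest that the measurement is noisy and that more runs are needed.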

Create a dataframe with random floats

We start by creating a medium-sized dataframe to perform a first comparison. We will later compare the most efficient methods with longer dataframes.

%%time
n = 100_000  # dataframe length
column_names = ['m11', 'm12', 'm13', 'm22', 'm23', 'm33']
df = pd.DataFrame(data=rng.random((n, 6)), columns=column_names)
CPU times: user 10.2 ms, sys: 143 µs, total: 10.4 ms
Wall time: 7.99 ms
df.head(3)
        m11       m12       m13       m22       m23       m33
0  0.785253  0.785859  0.969136  0.748060  0.655551  0.938885
1  0.178614  0.588647  0.442799  0.348847  0.330929  0.159369
2  0.989463  0.257111  0.715765  0.505885  0.664111  0.702342

Row-wise functions

We now create 3 different row-wise functions, with different argument types for the input row values:

  • distinct scalar arguments
  • an array
  • a Pandas Series
def det_sym33_scalars(m11, m12, m13, m22, m23, m33):
    return (
        m11 * (m22 * m33 - m23**2)
        + m12 * (2.0 * m23 * m13 - m12 * m33)
        - m22 * m13**2
    )
def det_sym33_array(m):
    return (
        m[0] * (m[3] * m[5] - m[4] ** 2)
        + m[1] * (2.0 * m[4] * m[2] - m[1] * m[5])
        - m[3] * m[2] ** 2
    )
def det_sym33_series(s):
    return (
        s.m11 * (s.m22 * s.m33 - s.m23**2)
        + s.m12 * (2.0 * s.m23 * s.m13 - s.m12 * s.m33)
        - s.m22 * s.m13**2
    )

In the following, we may try several of these row-wise functions for a given dataframe looping method, depending on how the dataframe rows are returned by the method.
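
To make the three calling conventions concrete, here is a small sketch applying each row-wise function to the same single row. The row values are arbitrary numbers chosen for the illustration.

```python
import math

import pandas as pd


def det_sym33_scalars(m11, m12, m13, m22, m23, m33):
    return (
        m11 * (m22 * m33 - m23**2)
        + m12 * (2.0 * m23 * m13 - m12 * m33)
        - m22 * m13**2
    )


def det_sym33_array(m):
    return (
        m[0] * (m[3] * m[5] - m[4] ** 2)
        + m[1] * (2.0 * m[4] * m[2] - m[1] * m[5])
        - m[3] * m[2] ** 2
    )


def det_sym33_series(s):
    return (
        s.m11 * (s.m22 * s.m33 - s.m23**2)
        + s.m12 * (2.0 * s.m23 * s.m13 - s.m12 * s.m33)
        - s.m22 * s.m13**2
    )


# One row, with arbitrary values, held as a Pandas Series
row = pd.Series(
    [2.0, 0.5, 0.1, 3.0, 0.2, 4.0],
    index=["m11", "m12", "m13", "m22", "m23", "m33"],
)

d_scalars = det_sym33_scalars(*row.to_numpy())  # six distinct scalars
d_array = det_sym33_array(row.to_numpy())  # positional access on an array
d_series = det_sym33_series(row)  # attribute access on a Series

print(d_scalars, d_array, d_series)
```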

Pandas built-in vectorization

First we are going to use the built-in vectorization operations from Pandas. In the present case the row-wise computation is straightforward and can be performed with basic universal functions applied to the entire columns. This does not make use of any row-wise function, but allows us to have a reference timing.

def pandas_vectorize(df):
    return (
        df.m11 * (df.m22 * df.m33 - df.m23**2)
        + df.m12 * (2.0 * df.m23 * df.m13 - df.m12 * df.m33)
        - df.m22 * df.m13**2
    )

We store the resulting Pandas Series into the det_ref variable, in order to check that the later computations lead to the same result:

det_ref, t = timing(pandas_vectorize, df)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.0044875 s

pandas.DataFrame.iterrows

We know that the iterrows method is slow (see the previous post, for example). It loops over dataframe rows as (index, Series) pairs. Here, we compare the 3 different argument types of the row-wise function, given that the row returned by iterrows is a Pandas Series:

  • scalar values
  • a Numpy array
  • a Pandas Series
| function name | method | returning rows as | row-wise function | argument type |
|---|---|---|---|---|
| iterrows_scalars | pd.DataFrame.iterrows | pd.Series | det_sym33_scalars | float64 |
| iterrows_array | pd.DataFrame.iterrows | pd.Series | det_sym33_array | array |
| iterrows_series | pd.DataFrame.iterrows | pd.Series | det_sym33_series | pd.Series |

Scalar arguments

def iterrows_scalars(df):
    det = np.zeros(len(df), dtype=np.float64)
    i = 0
    for _, row in df.iterrows():
        det[i] = det_sym33_scalars(row.m11, row.m12, row.m13, row.m22, row.m23, row.m33)
        i += 1
    return pd.Series(det, index=df.index)
det, t_scalars = timing(iterrows_scalars, df, r=3)
print(f"Elapsed time: {t_scalars:8.7f} s")
Elapsed time: 5.6840575 s
pd.testing.assert_series_equal(det, det_ref)

numpy.ndarray argument

def iterrows_array(df):
    det = np.zeros(len(df), dtype=np.float64)
    i = 0
    for _, row in df.iterrows():
        det[i] = det_sym33_array(row.values)
        i += 1
    return pd.Series(det, index=df.index)
det, t_array = timing(iterrows_array, df, r=3)
print(f"Elapsed time: {t_array:8.7f} s")
Elapsed time: 2.4847183 s
pd.testing.assert_series_equal(det, det_ref)

pandas.Series argument

def iterrows_series(df):
    det = np.zeros(len(df), dtype=np.float64)
    i = 0
    for _, row in df.iterrows():
        det[i] = det_sym33_series(row)
        i += 1
    return pd.Series(det, index=df.index)
det, t_series = timing(iterrows_series, df, r=3)
print(f"Elapsed time: {t_series:8.7f} s")
Elapsed time: 7.8285359 s
pd.testing.assert_series_equal(det, det_ref)

Comparison

ax = (
    pd.DataFrame(
        data=[{"scalars": t_scalars, "array": t_array, "Series": t_series}],
        index=["elapsed_time"],
    )
    .T.sort_values(by="elapsed_time", ascending=False)
    .plot.bar(legend=False, alpha=0.75, rot=45)
)
_ = ax.set(
    title="iterrows() with 3 different argument types",
    xlabel="Argument type",
    ylabel="Elapsed time (s)",
)

[Figure: bar chart of elapsed time for iterrows() with 3 different argument types]

We observe that using the array of values from the pandas.Series as argument of the row-wise function, and accessing the data with indices is the fastest method when using .iterrows(). However, all three iterrows methods are very slow.
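
A likely reason for this slowness is that iterrows builds a new Series object for every row. The toy example below, on a two-row dataframe invented for the illustration, shows what one iteration yields:

```python
import pandas as pd

# Tiny dataframe used only for this illustration
df_demo = pd.DataFrame({"m11": [0.5, 1.5], "m12": [2.0, 3.0]})

# Each iteration yields an (index label, row as Series) pair
index, row = next(df_demo.iterrows())

print(index)
print(type(row).__name__)
print(row.to_numpy())  # underlying values, as passed to det_sym33_array
```

Calling row.to_numpy() (or row.values, as in iterrows_array) gives the row-wise function a plain array, which avoids the label-based lookups performed on the Series.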

pandas.DataFrame.apply

The apply method also iterates over dataframe rows (with the axis=1 argument), passing each row to the function either as a Series (default) or as an array (with raw=True). Here is what the Pandas documentation says about raw=True:

the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.

| function name | method | returning rows as | row-wise function | argument type |
|---|---|---|---|---|
| apply_series | pd.DataFrame.apply | pd.Series | det_sym33_series | pd.Series |
| apply_array | pd.DataFrame.apply | np.ndarray | det_sym33_array | array |

pandas.Series argument

def apply_series(df):
    return df.apply(det_sym33_series, raw=False, axis=1)
det, t_series = timing(apply_series, df, r=3)
print(f"Elapsed time: {t_series:8.7f} s")
Elapsed time: 5.1164269 s
pd.testing.assert_series_equal(det, det_ref)

numpy.ndarray argument

def apply_array(df):
    return df.apply(det_sym33_array, raw=True, axis=1)
det, t_array = timing(apply_array, df, r=3)
print(f"Elapsed time: {t_array:8.7f} s")
Elapsed time: 0.2638231 s
pd.testing.assert_series_equal(det, det_ref)

Comparison

ax = (
    pd.DataFrame(
        data=[{"array": t_array, "Series": t_series}],
        index=["elapsed_time"],
    )
    .T.sort_values(by="elapsed_time", ascending=False)
    .plot.bar(legend=False, alpha=0.75, rot=45)
)
_ = ax.set(
    title="apply() with 2 different argument types",
    xlabel="Argument type",
    ylabel="Elapsed time (s)",
)

[Figure: bar chart of elapsed time for apply() with 2 different argument types]

Indeed, using arrays instead of Series is way faster (about 0.26 s versus 5.1 s), but still a lot slower than Pandas built-in vectorization (about 0.0045 s).
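
To see what the raw argument changes, the sketch below records the type of object that apply hands to the function in each mode. The dataframe and the make_recorder helper are hypothetical and only serve this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({"m11": [0.5, 1.5], "m12": [2.0, 3.0]})

# Names of the types received by the applied function, per mode
seen_types = {"raw=False": set(), "raw=True": set()}


def make_recorder(label):
    """Build a row-wise function that records the type of its argument."""

    def func(row):
        seen_types[label].add(type(row).__name__)
        return row.sum()  # .sum() exists on both Series and ndarray

    return func


out_series = df_demo.apply(make_recorder("raw=False"), axis=1, raw=False)
out_array = df_demo.apply(make_recorder("raw=True"), axis=1, raw=True)

print(seen_types)
```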

pandas.DataFrame.itertuples

The itertuples method allows iteration over dataframe rows, returning them as namedtuples. Thus, the row values can be accessed either by name or by position. The functions det_sym33_scalars and det_sym33_series are used in the former case, and det_sym33_array in the latter.

| function name | method | returning rows as | row-wise function | argument type |
|---|---|---|---|---|
| itertuples_scalars | pd.DataFrame.itertuples | namedtuple | det_sym33_scalars | float64 |
| itertuples_array | pd.DataFrame.itertuples | namedtuple | det_sym33_array | array |
| itertuples_series | pd.DataFrame.itertuples | namedtuple | det_sym33_series | pd.Series |

Scalar arguments

def itertuples_scalars(df):
    det = np.zeros(len(df), dtype=np.float64)
    i = 0
    for row in df.itertuples():
        det[i] = det_sym33_scalars(row.m11, row.m12, row.m13, row.m22, row.m23, row.m33)
        i += 1
    return pd.Series(det, index=df.index)
det, t_scalars = timing(itertuples_scalars, df)
print(f"Elapsed time: {t_scalars:8.7f} s")
Elapsed time: 0.1051879 s
pd.testing.assert_series_equal(det, det_ref)

numpy.array argument

When row values are accessed by position, we need to skip the dataframe index, which is stored at position 0 of each namedtuple.

def itertuples_array(df):
    det = np.zeros(len(df), dtype=np.float64)
    i = 0
    for row in df.itertuples():
        det[i] = det_sym33_array(row[1:])
        i += 1
    return pd.Series(det, index=df.index)
det, t_array = timing(itertuples_array, df)
print(f"Elapsed time: {t_array:8.7f} s")
Elapsed time: 0.1136810 s
pd.testing.assert_series_equal(det, det_ref)
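
The following toy example, on a one-row dataframe made up for the illustration, shows the layout of the namedtuple returned by itertuples, and the index=False option that leaves the index out:

```python
import pandas as pd

# One-row dataframe with a non-default index label, for illustration only
df_demo = pd.DataFrame({"m11": [0.5], "m12": [2.0]}, index=[10])

# Default: the index label comes first, followed by the column values
row = next(df_demo.itertuples())
print(row._fields)
print(row[0], row[1:])

# With index=False, only the column values are returned
row_no_index = next(df_demo.itertuples(index=False))
print(tuple(row_no_index))
```

With index=False, the row could be passed directly to det_sym33_array without the row[1:] slice.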

pandas.Series argument

def itertuples_series(df):
    det = np.zeros(len(df), dtype=np.float64)
    i = 0
    for row in df.itertuples():
        det[i] = det_sym33_series(row)
        i += 1
    return pd.Series(det, index=df.index)
det, t_series = timing(itertuples_series, df)
print(f"Elapsed time: {t_series:8.7f} s")
Elapsed time: 0.1084384 s
pd.testing.assert_series_equal(det, det_ref)

Comparison

ax = (
    pd.DataFrame(
        data=[{"scalars": t_scalars, "array": t_array, "Series": t_series}],
        index=["elapsed_time"],
    )
    .T.sort_values(by="elapsed_time", ascending=False)
    .plot.bar(legend=False, alpha=0.75, rot=45)
)
_ = ax.set(
    title="itertuples() with 3 different argument types",
    xlabel="Argument type",
    ylabel="Elapsed time (s)",
)

[Figure: bar chart of elapsed time for itertuples() with 3 different argument types]

We get similar results with the 3 different argument types. Once again, we can say that itertuples is to be preferred over iterrows and apply.

numpy.apply_along_axis

Let’s try out this method, which applies a function to 1D slices along a given axis (the rows of a 2D array in our case).

| function name | method | returning rows as | row-wise function | argument type |
|---|---|---|---|---|
| np_apply_along_axis | np.apply_along_axis | np.ndarray | det_sym33_array | array |
def np_apply_along_axis(df):

    return pd.Series(
        np.apply_along_axis(func1d=det_sym33_array, axis=1, arr=df.values),
        index=df.index,
    )
det, t = timing(np_apply_along_axis, df)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.2435897 s
pd.testing.assert_series_equal(det, det_ref)

Well, this is rather slow and disappointing…

numpy.vectorize

np.vectorize evaluates the row-wise function over each element of the input NumPy array(s). However, note the warning in the NumPy documentation:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

| function name | method | returning rows as | row-wise function | argument type |
|---|---|---|---|---|
| np_vectorize_scalars | np.vectorize | float64 | det_sym33_scalars | float64 |
| np_vectorize_array | np.vectorize | np.ndarray | det_sym33_array | array |

Scalar arguments

def np_vectorize_scalars(df):

    return pd.Series(
        np.vectorize(det_sym33_scalars)(df.m11, df.m12, df.m13, df.m22, df.m23, df.m33),
        index=df.index,
    )
det, t_scalars = timing(np_vectorize_scalars, df)
print(f"Elapsed time: {t_scalars:8.7f} s")
Elapsed time: 0.0498839 s
pd.testing.assert_series_equal(det, det_ref)

numpy.ndarray argument

In this case we need to add a signature argument in order to specify the input shape of the row-wise function.

def np_vectorize_array(df):

    return pd.Series(
        np.vectorize(det_sym33_array, signature="(n)->()")(
            df[["m11", "m12", "m13", "m22", "m23", "m33"]].values
        ),
        index=df.index,
    )
det, t_array = timing(np_vectorize_array, df)
print(f"Elapsed time: {t_array:8.7f} s")
Elapsed time: 0.2720111 s
pd.testing.assert_series_equal(det, det_ref)
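
As a minimal sketch of what the signature argument does: "(n)->()" declares that the function consumes a 1D array of length n and returns a scalar, so np.vectorize loops over the remaining leading dimension (the rows). The toy function row_sum_of_squares is made up for this illustration.

```python
import numpy as np


def row_sum_of_squares(m):
    """Toy row-wise function taking a 1D array of length 2."""
    return m[0] ** 2 + m[1] ** 2


arr = np.array([[3.0, 4.0], [6.0, 8.0]])

# Without the signature, the function would be called on each scalar element
vfunc = np.vectorize(row_sum_of_squares, signature="(n)->()")
out = vfunc(arr)

print(out.shape, out)
```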

Comparison

ax = (
    pd.DataFrame(
        data=[{"scalars": t_scalars, "array": t_array}],
        index=["elapsed_time"],
    )
    .T.sort_values(by="elapsed_time", ascending=False)
    .plot.bar(legend=False, alpha=0.75, rot=45)
)
_ = ax.set(
    title="numpy.vectorize() with 2 different argument types",
    xlabel="Argument type",
    ylabel="Elapsed time (s)",
)

[Figure: bar chart of elapsed time for numpy.vectorize() with 2 different argument types]

The version with scalar arguments is quite interesting: at about 0.05 s, it is roughly twice as fast as itertuples (about 0.105 s).

map

Let’s use Python’s built-in map() function.

| function name | method | returning rows as | row-wise function | argument type |
|---|---|---|---|---|
| map_scalars | map | float64 | det_sym33_scalars | float64 |
def map_scalars(df):
    return pd.Series(
        map(det_sym33_scalars, df.m11, df.m12, df.m13, df.m22, df.m23, df.m33),
        index=df.index,
    )
det, t = timing(map_scalars, df)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.0679905 s
pd.testing.assert_series_equal(det, det_ref)

Performance seems to be similar between map_scalars and np_vectorize_scalars.
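
For readers less familiar with map called on several iterables, this small sketch shows that it walks the iterables in lockstep, like zip, and passes one element from each as a separate scalar argument. The function and the column values are invented for the illustration.

```python
def weighted_sum(a, b, c):
    """Toy function taking one scalar per column."""
    return a + 2.0 * b + 3.0 * c


col_a = [1.0, 2.0]
col_b = [10.0, 20.0]
col_c = [100.0, 200.0]

# map with several iterables: one element of each per call
out_map = list(map(weighted_sum, col_a, col_b, col_c))

# Equivalent formulation with zip and a list comprehension
out_zip = [weighted_sum(a, b, c) for a, b, c in zip(col_a, col_b, col_c)]

print(out_map)
```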

Swifter

Swifter is a “package which efficiently applies any function to a pandas dataframe or series in the fastest available manner”. We use the raw=True argument. Here is a quote from the documentation:

raw : bool, default False
False : passes each row or column as a Series to the function.
True : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.

| function name | method | returning rows as | row-wise function | argument type |
|---|---|---|---|---|
| swifter_apply | swifter.apply | np.ndarray | det_sym33_array | array |
def swifter_apply(df):
    return df.swifter.progress_bar(False).apply(det_sym33_array, raw=True, axis=1)
det, t = timing(swifter_apply, df)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.2725223 s
pd.testing.assert_series_equal(det, det_ref)

The computation is rather slow. We are probably missing something here and may not have used this package correctly.

dask.dataframe.map_partitions

We tried to use the great dask library with the map_partitions method. Unfortunately, we did not completely figure out how to handle the meta argument. Here is the description from the documentation:

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

The following implementation does work but there might be more efficient ways to do it…

| function name | method | returning rows as | row-wise function | argument type |
|---|---|---|---|---|
| dask_df_map_partitions | dd.map_partitions | pd.Series | det_sym33_series | pd.Series |
def dask_df_map_partitions(df, n_jobs=8):
    ddf = dd.from_pandas(df, npartitions=n_jobs)
    return pd.Series(
        ddf.map_partitions(
            det_sym33_series, meta=pd.Series(dtype=np.float64)
        ).compute(),
        index=df.index,
    )
det, t = timing(dask_df_map_partitions, df)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.0254871 s
pd.testing.assert_series_equal(det, det_ref)

Not so bad, but still slower than Pandas built-in vectorization. Here is what can be found in Dask’s documentation:

Note that, despite parallelism, Dask.dataframe may not always be faster than Pandas. We recommend that you stay with Pandas for as long as possible before switching to Dask.dataframe.

Also, we suspect that the dataframe is copied when converting from Pandas to Dask. This method is probably better suited to very large dataframes distributed on clusters.

polars.DataFrame.apply

Polars is a fast multi-threaded DataFrame library written in Rust but also available in Python and Node.js. Here we are going to use DataFrame.apply(), which applies a custom function over the rows of a Polars DataFrame. The rows are passed as tuples. However, note this warning from Polars’ documentation:

Beware, this is slow.

| function name | method | returning rows as | row-wise function | argument type |
|---|---|---|---|---|
| polars_apply | polars.DataFrame.apply() | tuple | det_sym33_array | array |
def polars_apply(df):
    df_pl = pl.from_pandas(df)
    det_pl = df_pl.apply(det_sym33_array)
    det_pd = det_pl.to_pandas()["apply"]
    det_pd.index = df.index
    det_pd.name = None
    return det_pd
det, t = timing(polars_apply, df)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.0902959 s
pd.testing.assert_series_equal(det, det_ref)

This is not so bad, but one can wonder whether a large part of the elapsed time is spent converting the dataframe from Pandas to Polars and back. Let’s measure the elapsed time per line with a line profiler:

%lprun -f polars_apply polars_apply(df)
Timer unit: 1e-06 s

Total time: 0.229353 s
File: /tmp/ipykernel_16266/3940900507.py
Function: polars_apply at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def polars_apply(df):
     2         1       6545.0   6545.0      2.9      df_pl = pl.from_pandas(df)
     3         1     221243.0 221243.0     96.5      det_pl = df_pl.apply(det_sym33_array)
     4         1       1542.0   1542.0      0.7      det_pd = det_pl.to_pandas()["apply"]
     5         1         14.0     14.0      0.0      det_pd.index = df.index
     6         1          9.0      9.0      0.0      det_pd.name = None
     7         1          0.0      0.0      0.0      return det_pd

Most of the time is actually spent within the apply method. However, we believe that the from_pandas step copies the data, which would not be optimal regarding memory usage.

Polars vectorize

We can also use Polars built-in vectorization the same way we did with Pandas. This does not make use of the previous row-wise functions.

def polars_vectorize(df):
    df_pl = pl.from_pandas(df)
    det_pl = (
        df_pl.m11 * (df_pl.m22 * df_pl.m33 - df_pl.m23**2)
        + df_pl.m12 * (2.0 * df_pl.m23 * df_pl.m13 - df_pl.m12 * df_pl.m33)
        - df_pl.m22 * df_pl.m13**2
    )
    det_pd = det_pl.to_frame().to_pandas()["m11"]
    det_pd.index = df.index
    det_pd.name = None
    return det_pd
det, t = timing(polars_vectorize, df)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.0090223 s
pd.testing.assert_series_equal(det, det_ref)

If we run the line profiler, we can observe that most of the elapsed time is now spent converting the dataframe between Pandas and Polars:

%lprun -f polars_vectorize polars_vectorize(df)
Timer unit: 1e-06 s

Total time: 0.014961 s
File: /tmp/ipykernel_16266/1430159236.py
Function: polars_vectorize at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def polars_vectorize(df):
     2         1       7295.0   7295.0     48.8      df_pl = pl.from_pandas(df)
     3         1          1.0      1.0      0.0      det_pl = (
     4         3       2915.0    971.7     19.5          df_pl.m11 * (df_pl.m22 * df_pl.m33 - df_pl.m23**2)
     5         1        551.0    551.0      3.7          + df_pl.m12 * (2.0 * df_pl.m23 * df_pl.m13 - df_pl.m12 * df_pl.m33)
     6         1       2363.0   2363.0     15.8          - df_pl.m22 * df_pl.m13**2
     7                                               )
     8         1       1807.0   1807.0     12.1      det_pd = det_pl.to_frame().to_pandas()["m11"]
     9         1         17.0     17.0      0.1      det_pd.index = df.index
    10         1         11.0     11.0      0.1      det_pd.name = None
    11         1          1.0      1.0      0.0      return det_pd

Numba

Numba makes Python code fast! Here we use the Numba @jit decorator, with an explicit signature and nogil=True, for the Numba-dedicated row-wise function det_sym33_numba. The apply_func_numba function implements the loop over the dataframe rows.

@jit(
    float64(float64, float64, float64, float64, float64, float64),
    nogil=True,
)
def det_sym33_numba(m11, m12, m13, m22, m23, m33):
    return (
        m11 * (m22 * m33 - m23**2)
        + m12 * (2.0 * m23 * m13 - m12 * m33)
        - m22 * m13**2
    )


@jit
def apply_func_numba(col_m11, col_m12, col_m13, col_m22, col_m23, col_m33):
    n = len(col_m11)
    det = np.empty(n, dtype="float64")
    for i in range(n):
        det[i] = det_sym33_numba(
            col_m11[i], col_m12[i], col_m13[i], col_m22[i], col_m23[i], col_m33[i]
        )
    return det


def numba_loop(df):
    det = apply_func_numba(
        df["m11"].to_numpy(),
        df["m12"].to_numpy(),
        df["m13"].to_numpy(),
        df["m22"].to_numpy(),
        df["m23"].to_numpy(),
        df["m33"].to_numpy(),
    )
    return pd.Series(det, index=df.index)
det, t = timing(numba_loop, df, r=100)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.0002950 s
pd.testing.assert_series_equal(det, det_ref)

Well this is the fastest implementation so far.

Numba parallel

Now we use the Numba @njit decorator with parallel=True, along with prange looping. By default, all available cores are used. The previous row-wise function det_sym33_numba is reused here.

@njit(parallel=True)
def apply_func_numba_para(col_m11, col_m12, col_m13, col_m22, col_m23, col_m33):
    n = len(col_m11)
    det = np.empty(n, dtype="float64")
    for i in prange(n):
        det[i] = det_sym33_numba(
            col_m11[i], col_m12[i], col_m13[i], col_m22[i], col_m23[i], col_m33[i]
        )
    return det


def numba_loop_para(df):
    det = apply_func_numba_para(
        df["m11"].to_numpy(),
        df["m12"].to_numpy(),
        df["m13"].to_numpy(),
        df["m22"].to_numpy(),
        df["m23"].to_numpy(),
        df["m33"].to_numpy(),
    )
    return pd.Series(det, index=df.index)
det, t = timing(numba_loop_para, df, r=100)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.0002439 s
pd.testing.assert_series_equal(det, det_ref)

The parallel version is not a large improvement over the sequential Numba version. This might be because of the relatively small size of the dataframe.

Cython

Cython is a mix between C and Python. Writing Cython code is a little more involved than writing pure Python. Note that in the following we use the %%cython magic command at the beginning of the notebook cell to compile the Cython code (we also loaded the Cython extension at the beginning of the notebook).

%%cython

cimport cython
import numpy as np
import pandas as pd

@cython.binding(False)
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False) 
cdef void loop_cython(double[:] det, double[:] m11, double[:] m12, double[:] m13, double[:] m22, double[:] m23, double[:] m33) nogil:

    cdef Py_ssize_t i

    for i in range(m11.shape[0]):
        det[i] = m11[i] * (m22[i] * m33[i] - m23[i] ** 2) + m12[i] * (2.0 * m23[i] * m13[i] - m12[i] * m33[i]) - m22[i] * m13[i] ** 2

        
cpdef cython_loop(df):
    
    det = np.zeros_like(df.m11.values)
    loop_cython(det, df.m11.values, df.m12.values, df.m13.values, df.m22.values, df.m23.values, df.m33.values)
    
    return pd.Series(det, index=df.index)
det, t = timing(cython_loop, df, r=100)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.0004649 s
pd.testing.assert_series_equal(det, det_ref)

This is not as fast as Numba but close.

Cython parallel

Similarly to what we did with Numba, we use here the prange looping function. Here is a quote from the documentation:

cython.parallel.prange([start,] stop[, step][, nogil=False][, schedule=None[, chunksize=None]][, num_threads=None])
This function can be used for parallel loops. OpenMP automatically starts a thread pool and distributes the work according to the schedule used. Thread-locality and reductions are automatically inferred for variables.

%%cython --compile-args=-fopenmp --link-args=-fopenmp

cimport cython
from cython.parallel import prange

import numpy as np
import pandas as pd

@cython.binding(False)
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False) 
cdef void loop_cython_para(double[:] det, double[:] m11, double[:] m12, double[:] m13, double[:] m22, double[:] m23, double[:] m33) nogil:

    cdef Py_ssize_t i

    for i in prange(m11.shape[0], nogil=True):
        det[i] = m11[i] * (m22[i] * m33[i] - m23[i] ** 2) + m12[i] * (2.0 * m23[i] * m13[i] - m12[i] * m33[i]) - m22[i] * m13[i] ** 2

        
cpdef cython_loop_para(df):
    
    det = np.zeros_like(df.m11.values)
    loop_cython_para(det, df.m11.values, df.m12.values, df.m13.values, df.m22.values, df.m23.values, df.m33.values)
    
    return pd.Series(det, index=df.index)
det, t = timing(cython_loop_para, df, r=100)
print(f"Elapsed time: {t:8.7f} s")
Elapsed time: 0.0003041 s

The parallel version is a little faster than the sequential one, but we should test it on larger dataframes.

Global comparison

We start by comparing all methods over small size dataframes.

Small size dataframes

def compare_all_methods(rng, funcs, df_sizes):

    column_names = ["m11", "m12", "m13", "m22", "m23", "m33"]
    time_df = pd.DataFrame()
    for n in df_sizes:

        df = pd.DataFrame(data=rng.random((n, 6)), columns=column_names)

        d = {}
        for func in funcs:
            _, t = timing(func, df)
            d[func.__name__] = t
        df_tmp = pd.DataFrame(d, index=[n])
        time_df = pd.concat((time_df, df_tmp), axis=0)

    return time_df
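
As a side note, the timing table could also be assembled in one step from a dictionary of records instead of calling pd.concat inside the loop. With only a couple of dataframe sizes the difference is negligible, so this is purely a stylistic sketch. The names best_time, compare_methods_dict, row_sum and row_max are hypothetical and only used here.

```python
from time import perf_counter

import numpy as np
import pandas as pd


def best_time(func, df, r=3):
    """Best elapsed time over r runs (simplified stand-in for timing())."""
    timings = []
    for _ in range(r):
        start = perf_counter()
        func(df)
        timings.append(perf_counter() - start)
    return min(timings)


def compare_methods_dict(rng, funcs, df_sizes, column_names):
    """Collect one record per dataframe size, then build the table once."""
    records = {}
    for n in df_sizes:
        df = pd.DataFrame(rng.random((n, len(column_names))), columns=column_names)
        records[n] = {func.__name__: best_time(func, df) for func in funcs}
    # Outer keys (sizes) become the index, inner keys (methods) the columns
    return pd.DataFrame.from_dict(records, orient="index")


# Two toy methods to time, standing in for the looping functions of the post
def row_sum(df):
    return df.sum(axis=1)


def row_max(df):
    return df.max(axis=1)


rng_demo = np.random.default_rng(124)
time_demo = compare_methods_dict(
    rng_demo, [row_sum, row_max], [100, 1_000], ["a", "b", "c"]
)
print(time_demo)
```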
funcs = [
    pandas_vectorize,
    iterrows_array,
    apply_array,
    itertuples_scalars,
    np_apply_along_axis,
    np_vectorize_scalars,
    map_scalars,
    swifter_apply,
    dask_df_map_partitions,
    polars_apply,
    polars_vectorize,
    numba_loop,
    numba_loop_para,
    cython_loop,
    cython_loop_para,
]


time_df = compare_all_methods(rng, funcs, df_sizes=[1_000, 10_000])
c_max = time_df.iloc[-1]
c_max = c_max.sort_values(ascending=False)
columns = c_max.index.values.tolist()
time_df = time_df[columns]
ax = time_df.plot.bar(stacked=False, figsize=(16, 8), logy=True, rot=0)
_ = ax.set_ylim(1.0e-5)
_ = ax.set(
    title="Timing comparison of various dataframe looping methods",
    xlabel="Dataframe length",
    ylabel="Elapsed_time [s] (log scale)",
)

[Figure: timing comparison of various dataframe looping methods, small dataframes (log scale)]

Only the Numba and Cython methods are significantly faster than Pandas’ built-in vectorization (pandas_vectorize)!

pd.options.display.float_format = "{:,.2e}".format
time_df.T
| method | 1000 | 10000 |
|---|---|---|
| iterrows_array | 2.50e-02 | 2.76e-01 |
| swifter_apply | 4.08e-03 | 4.41e-02 |
| apply_array | 3.56e-03 | 3.75e-02 |
| np_apply_along_axis | 2.79e-03 | 3.71e-02 |
| dask_df_map_partitions | 1.11e-02 | 1.92e-02 |
| itertuples_scalars | 1.66e-03 | 1.56e-02 |
| polars_apply | 2.82e-03 | 1.21e-02 |
| map_scalars | 9.42e-04 | 9.08e-03 |
| np_vectorize_scalars | 7.03e-04 | 6.39e-03 |
| polars_vectorize | 2.39e-03 | 2.30e-03 |
| pandas_vectorize | 1.50e-03 | 1.52e-03 |
| numba_loop_para | 1.68e-04 | 1.84e-04 |
| cython_loop_para | 1.43e-04 | 1.81e-04 |
| cython_loop | 1.06e-04 | 1.20e-04 |
| numba_loop | 8.01e-05 | 9.84e-05 |

Medium to large size dataframes

funcs = [
    pandas_vectorize,
    numba_loop,
    numba_loop_para,
    cython_loop,
    cython_loop_para,
]


time_df = compare_all_methods(rng, funcs, df_sizes=[1_000_000, 10_000_000])
c_max = time_df.iloc[-1]
c_max = c_max.sort_values(ascending=False)
columns = c_max.index.values.tolist()
time_df = time_df[columns]
ax = time_df.plot.bar(stacked=False, figsize=(16, 8), logy=True, rot=0)
_ = ax.set_ylim(1.0e-5)
_ = ax.set(
    title="Timing comparison of various dataframe looping methods",
    xlabel="Dataframe length",
    ylabel="Elapsed_time [s] (log scale)",
)

[Figure: timing comparison of various dataframe looping methods, large dataframes (log scale)]

time_df.T
| method | 1000000 | 10000000 |
|---|---|---|
| pandas_vectorize | 3.95e-02 | 4.10e-01 |
| numba_loop | 3.92e-03 | 7.33e-02 |
| cython_loop | 4.40e-03 | 5.29e-02 |
| cython_loop_para | 3.21e-03 | 3.55e-02 |
| numba_loop_para | 2.35e-03 | 2.53e-02 |

Let’s focus on the Numba and Cython methods.

funcs = [
    numba_loop,
    numba_loop_para,
    cython_loop,
    cython_loop_para,
]

out = perfplot.bench(
    setup=lambda n: pd.DataFrame(data=rng.random((n, 6)), columns=column_names),
    kernels=funcs,  # each kernel is called with the dataframe built by setup
    labels=[func.__name__ for func in funcs],
    n_range=[10**k for k in range(3, 9)],
)
def plot_timings(out, figsize=(12,12)):
    labels = out.labels
    ms = 10
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(1, 1, 1)

    for i, label in enumerate(labels):
        plt.loglog(out.n_range, out.timings_s[i], "o-", ms=ms, label=label)
    markers = cycle(
        ("", "o", "v", "^", "<", ">", "s", "p", "P", "*", "h", "X", "D", ".")
    )
    for i, line in enumerate(ax.get_lines()):
        marker = next(markers)
        line.set_marker(marker)
    plt.legend()
    plt.grid("on")
    _ = ax.set(
        title="Timing comparison of various dataframe looping methods",
        xlabel="Dataframe length (log scale)",
        ylabel="Elapsed_time [s] (log scale)",
    )

    return ax
ax = plot_timings(out)

[Figure: log-log timing comparison of the Numba and Cython looping methods versus dataframe length]

Conclusion

Pandas built-in vectorization is a very good solution when it is applicable. If we want to be faster with little extra effort, Numba is the best solution. Cython is about as fast as Numba and flexible, but more involved.