This post/notebook is a follow-up to a recent one: Heapsort with Numba and Cython, in which we implemented heapsort in Python/Numba and Cython and compared the execution times with NumPy's heapsort. However, NumPy's heapsort is written in C++ (wrapped in Python) and is not exactly the same implementation as the one used in the previous post. So in the following, we present two different heapsort implementations translated into Cython, both a little more refined than the previous one:

  • the first one (Cython_1) is taken from the book Numerical recipes in C [1]
  • the second one (Cython_2) is taken from NumPy source code: heapsort.cpp

In both implementations, indices are 1-based, and the left (<<) and right (>>) bitwise shift operators are used. In order to use 1-based indices with a NumPy array, we add an extra element at the beginning:

def heapsort(A_in):
    A = np.copy(A_in)  # do not modify the input array
    A = np.concatenate((np.array([np.nan]), A))  # prepend a dummy element so that indices start at 1
    _heapsort(A)
    return A[1:]  # drop the dummy element
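
With 1-based indices, the children of node i sit at indices 2i and 2i + 1, and the parent of node j sits at index floor(j / 2), which is exactly what the bit shifts i << 1 and j >> 1 compute. A minimal plain-Python sketch of this index arithmetic (the helper names below are ours, not part of the implementations):

def left_child(i):  # 2 * i
    return i << 1


def right_child(i):  # 2 * i + 1
    return (i << 1) + 1


def parent(j):  # j // 2
    return j >> 1


def is_max_heap(a):
    # Check the max-heap property on a 1-based array (a[0] is the dummy element).
    n = len(a) - 1
    return all(a[parent(j)] >= a[j] for j in range(2, n + 1))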

Also, both algorithms are implemented in a single function, without recursion.

Imports

import gc
from time import perf_counter

import cython
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import perfplot

%load_ext cython
SD = 124  # random seed
rng = np.random.default_rng(SD)  # random number generator

Float array creation

We still assume that we want to sort a NumPy array of 64-bit floating-point numbers between 0 and 1:

n = 10
A = rng.random(n, dtype=np.float64)
A
array([0.78525311, 0.78585936, 0.96913602, 0.74805977, 0.65555081,
       0.93888454, 0.17861445, 0.58864721, 0.44279917, 0.34884712])

Cython_1

The authors of Numerical recipes in C [1] use a “corporate hierarchy metaphor”:

[…] a heap has every “supervisor” greater than or equal to its two “supervisees,” down through the levels of the hierarchy.

If you have managed to rearrange your array into an order that forms a heap, then sorting it is very easy: You pull off the “top of the heap,” which will be the largest element yet unsorted. Then you “promote” to the top of the heap its largest underling. Then you promote its largest underling, and so on. The process is like what happens (or is supposed to happen) in a large corporation when the chairman of the board retires. You then repeat the whole process by retiring the new chairman of the board.

Imagine that the corporation starts out with N/2 employees on the production line, but with no supervisors. Now a supervisor is hired to supervise two workers. If he is less capable than one of his workers, that one is promoted in his place, and he joins the production line. After supervisors are hired, then supervisors of supervisors are hired, and so on up the corporate ladder. Each employee is brought in at the top of the tree, but then immediately sifted down, with more capable workers promoted until their proper corporate level has been reached.

One execution of the Heapsort function represents the entire life-cycle of a giant corporation: N/2 workers are hired; N/2 potential supervisors are hired; there is a sifting up in the ranks, a sort of super Peter Principle: in due course, each of the original employees gets promoted to chairman of the board.

So we have two phases:

  • hiring
  • retirement-and-promotion

This is very similar to what we described in the previous post: a first step where we build the heap in a bottom-up fashion, and a second step where we keep extracting the root until the heap is empty. Both steps rely on a sift-down process. However, here a single main loop handles both phases.

%%cython -3 --compile-args=-Ofast

# cython: boundscheck=False, cdivision=True, initializedcheck=False, wraparound=False

import cython
import numpy as np
from cython import double, ssize_t


@cython.exceptval(check=False)
@cython.nogil
@cython.cfunc
def _heapsort_1(ra: double[::1]) -> cython.void:
    i: ssize_t
    ir: ssize_t
    j: ssize_t
    l: ssize_t
    n: ssize_t = ra.shape[0] - 1
    rra: double

    if n < 2:
        return
    l = (n >> 1) + 1
    ir = n

    # The index l will be decremented from its initial value down to 1 during the “hiring” (heap
    # creation) phase. Once it reaches 1, the index ir will be decremented from its initial value
    # down to 1 during the “retirement-and-promotion” (heap selection) phase.
    while 1:
        if l > 1:  # Still in hiring phase.
            l -= 1
            rra = ra[l]
        else:  # In retirement-and-promotion phase.
            rra = ra[ir]  # Clear a space at end of array.
            ra[ir] = ra[1]  # Retire the top of the heap into it.
            ir -= 1
            if ir == 1:  # Done with the last promotion
                ra[1] = rra  # The least competent worker of all!
                break
        i = l  # Whether in the hiring phase or promotion phase, we
        j = l + l  # here set up to sift down element rra to its proper level.
        while j <= ir:
            if (j < ir) and (ra[j] < ra[j + 1]):  # Compare to the better underling (short-circuit: ra[j + 1] is not read when j == ir).
                j += 1
            if rra < ra[j]:  # Demote rra.
                ra[i] = ra[j]
                i = j
                j = j << 1
            else:
                # j = ir + 1  # This is rra's level. Set j to terminate the sift-down.
                break
        ra[i] = rra  # Put rra into its slot.


@cython.ccall
def heapsort_cython_1(A_in):
    A = np.copy(A_in)
    A = np.concatenate((np.array([np.nan]), A))
    _heapsort_1(A)
    return A[1:]
A_1 = heapsort_cython_1(A)
A_1
array([0.17861445, 0.34884712, 0.44279917, 0.58864721, 0.65555081,
       0.74805977, 0.78525311, 0.78585936, 0.93888454, 0.96913602])
np.testing.assert_array_equal(A_1, np.sort(A))

Cython_2

This is a straight translation of the NumPy source code (heapsort.cpp) from C++ to Cython. Here is what we can read in the source file about its origin:

 * These sorting functions are copied almost directly from numarray
 * with a few modifications (complex comparisons compare the imaginary
 * part if the real parts are equal, for example), and the names
 * are changed.
 *
 * The original sorting code is due to Charles R. Harris who wrote
 * it for numarray.

This time, the _heapsort_2() function has two distinct loops, one per phase, but it is otherwise very similar to the code above (Cython_1).

%%cython -3 --compile-args=-Ofast

# cython: boundscheck=False, cdivision=True, initializedcheck=False, wraparound=False

import cython
import numpy as np
from cython import double, ssize_t


@cython.exceptval(check=False)
@cython.nogil
@cython.cfunc
def _heapsort_2(a: double[::1]) -> cython.void:
    i: ssize_t
    j: ssize_t
    l: ssize_t
    n: ssize_t = a.shape[0] - 1
    tmp: double

    for l in range(n >> 1, 0, -1):
        tmp = a[l]
        i = l
        j = l << 1
        while j <= n:
            if (j < n) and (a[j] < a[j + 1]):
                j += 1
            if tmp < a[j]:
                a[i] = a[j]
                i = j
                j += j
            else:
                break
        a[i] = tmp

    while n > 1:
        tmp = a[n]
        a[n] = a[1]
        n -= 1
        i = 1
        j = 2
        while j <= n:
            if (j < n) and (a[j] < a[j + 1]):
                j += 1
            if tmp < a[j]:
                a[i] = a[j]
                i = j
                j += j
            else:
                break
        a[i] = tmp


@cython.ccall
def heapsort_cython_2(A_in):
    A = np.copy(A_in)
    A = np.concatenate((np.array([np.nan]), A))
    _heapsort_2(A)
    return A[1:]
A_2 = heapsort_cython_2(A)
np.testing.assert_array_equal(A_2, np.sort(A))

Cython_3

For the sake of completeness, we also include the implementation from the previous post (0-based indices, with a recursive sift-down), compiled with the same compiler directives as above:

%%cython -3 --compile-args=-Ofast

# cython: boundscheck=False, cdivision=True, initializedcheck=False, wraparound=False

import cython
import numpy as np
from cython import double, ssize_t


@cython.exceptval(check=False)
@cython.nogil
@cython.cfunc
def _max_heapify(A: double[::1], size: ssize_t, node_idx: ssize_t) -> cython.void:
    largest: ssize_t = node_idx

    left_child: ssize_t = 2 * node_idx + 1
    right_child: ssize_t = 2 * (node_idx + 1)

    if left_child < size and A[left_child] > A[largest]:
        largest = left_child

    if right_child < size and A[right_child] > A[largest]:
        largest = right_child

    if largest != node_idx:
        A[node_idx], A[largest] = A[largest], A[node_idx]
        _max_heapify(A, size, largest)


@cython.exceptval(check=False)
@cython.nogil
@cython.cfunc
def _heapsort_3(A: double[::1]) -> cython.void:
    i: cython.int
    size: ssize_t = len(A)
    node_idx: ssize_t = size // 2 - 1

    for i in range(node_idx, -1, -1):
        _max_heapify(A, size, i)

    for i in range(size - 1, 0, -1):
        A[i], A[0] = A[0], A[i]
        size -= 1
        _max_heapify(A, size, 0)


@cython.ccall
def heapsort_cython_3(A_in):
    A = np.copy(A_in)
    _heapsort_3(A)
    return A
A_3 = heapsort_cython_3(A)
np.testing.assert_array_equal(A_3, np.sort(A))

Performance comparison

Again, we compare the different versions against NumPy heapsort.

Perfplot

out = perfplot.bench(
    setup=lambda n: rng.random(n, dtype=np.float64),
    kernels=[
        lambda A: heapsort_cython_1(A),
        lambda A: heapsort_cython_2(A),
        lambda A: heapsort_cython_3(A),
        lambda A: np.sort(A, kind="heapsort"),
    ],
    labels=["Cython_1", "Cython_2", "Cython_3", "NumPy"],
    n_range=[10**k for k in range(1, 9)],
)
t_df = pd.DataFrame(out.timings_s.T, columns=out.labels, index=out.n_range)
t_df
            Cython_1   Cython_2   Cython_3      NumPy
10         3.915e-06  3.908e-06  1.663e-06  1.737e-06
100        5.111e-06  4.890e-06  5.054e-06  2.977e-06
1000       5.326e-05  5.319e-05  7.090e-05  6.266e-05
10000      8.942e-04  9.227e-04  8.871e-04  9.634e-04
100000     1.202e-02  1.200e-02  1.272e-02  1.315e-02
1000000    1.779e-01  1.735e-01  2.202e-01  1.973e-01
10000000   2.666e+00  2.654e+00  4.641e+00  3.066e+00
100000000  3.740e+01  3.695e+01  7.699e+01  4.161e+01

So it seems that the Cython_1 and Cython_2 implementations achieve a speed similar to NumPy's version, whereas the Cython_3 implementation is clearly slower as the array gets larger. A likely explanation is that Cython_3 relies on a recursive _max_heapify with a swap-based sift-down (two writes per level), while Cython_1 and Cython_2 save the sifted element and write it back only once it has reached its final level.

A larger array

It is possible to run the algorithms on an array of 1 billion elements, but not through perfplot, because of the memory capacity of the machine used to run these tests. We therefore time each implementation by hand and keep the best of three runs.

%%time
k = 9
A = rng.random(10**k, dtype=np.float64)
CPU times: user 4.05 s, sys: 679 ms, total: 4.72 s
Wall time: 4.72 s
timings = []
for _ in range(3):
    start = perf_counter()
    A_1 = heapsort_cython_1(A)
    end = perf_counter()
    elapsed_time = end - start
    del A_1
    _ = gc.collect()
    timings.append(elapsed_time)
elapsed_time_1 = np.amin(timings)
timings = []
for _ in range(3):
    start = perf_counter()
    A_2 = heapsort_cython_2(A)
    end = perf_counter()
    elapsed_time = end - start
    del A_2
    _ = gc.collect()
    timings.append(elapsed_time)
elapsed_time_2 = np.amin(timings)
timings = []
for _ in range(3):
    start = perf_counter()
    A_3 = heapsort_cython_3(A)
    end = perf_counter()
    elapsed_time = end - start
    del A_3
    _ = gc.collect()
    timings.append(elapsed_time)
elapsed_time_3 = np.amin(timings)
timings = []
for _ in range(3):
    start = perf_counter()
    A_4 = np.sort(A, kind="heapsort")
    end = perf_counter()
    elapsed_time = end - start
    del A_4
    _ = gc.collect()
    timings.append(elapsed_time)
elapsed_time_4 = np.amin(timings)
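
The four timing loops above all follow the same pattern, so they could be factored into a small helper, for example (a sketch; best_of_three is our own name and does not appear elsewhere in this notebook):

def best_of_three(sort_func, A):
    # Return the best wall-clock time (in seconds) over three runs of sort_func(A).
    timings = []
    for _ in range(3):
        start = perf_counter()
        result = sort_func(A)
        end = perf_counter()
        timings.append(end - start)
        del result
        _ = gc.collect()
    return np.amin(timings)


# Equivalent to the four loops above:
elapsed_time_1 = best_of_three(heapsort_cython_1, A)
elapsed_time_2 = best_of_three(heapsort_cython_2, A)
elapsed_time_3 = best_of_three(heapsort_cython_3, A)
elapsed_time_4 = best_of_three(lambda x: np.sort(x, kind="heapsort"), A)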
t_df = pd.concat(
    [
        t_df,
        pd.DataFrame(
            [[elapsed_time_1, elapsed_time_2, elapsed_time_3, elapsed_time_4]],
            columns=out.labels,
            index=[10**k],
        ),
    ],
    axis=0,
)
t_df["n log(n)"] = 5.0e-8 * t_df.index * np.log(t_df.index)

Note that we also included in the following figure the curve corresponding to $O(n\log n)$, scaled by a constant factor of $5 \times 10^{-8}$. We are aware, however, that this describes the asymptotic behavior on an abstract machine and does not take into account hardware effects such as cache optimization.

t_df[['Cython_1', 'Cython_2', 'Cython_3', 'NumPy']]
             Cython_1   Cython_2   Cython_3      NumPy
10          3.915e-06  3.908e-06  1.663e-06  1.737e-06
100         5.111e-06  4.890e-06  5.054e-06  2.977e-06
1000        5.326e-05  5.319e-05  7.090e-05  6.266e-05
10000       8.942e-04  9.227e-04  8.871e-04  9.634e-04
100000      1.202e-02  1.200e-02  1.272e-02  1.315e-02
1000000     1.779e-01  1.735e-01  2.202e-01  1.973e-01
10000000    2.666e+00  2.654e+00  4.641e+00  3.066e+00
100000000   3.740e+01  3.695e+01  7.699e+01  4.161e+01
1000000000  4.547e+02  4.536e+02  1.059e+03  4.142e+02
ax = t_df.plot(loglog=True, figsize=(12, 12), ms=10, alpha=0.6)
markers = ("o", "v", "^", "<", ">", "s", "p", "P", "*", "h", "X", "D", ".")
for i, line in enumerate(ax.get_lines()):
    marker = markers[i]
    line.set_marker(marker)
plt.legend()
plt.grid(True)
_ = ax.set(
    title="Timing comparison between Cython and NumPy heapsort",
    xlabel="Array length (log scale)",
    ylabel="Elapsed_time [s] (log scale)",
)

[Figure: timing comparison between Cython and NumPy heapsort, log-log scale]

Reference

[1] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 1992. Numerical recipes in C (2nd ed.): the art of scientific computing. Cambridge University Press, USA.