Heapsort with Numba and Cython
Updated April 20, 2022 following useful comments by @scoder. Thank you very much for your pull requests!
Heapsort is a classical sorting algorithm. We go into a little bit of the theory behind the algorithm, but refer to Cormen et al. [1] or the heapsort Wikipedia page for more details.
In this post, we are going to implement the classical heapsort in Python, Python/Numba and Cython. The implementation is array-based and performed in place, using 0-based indices. Note that heapsort is not a stable sorting method (a stable sort keeps items with the same key in their original order).
Imports
from itertools import cycle
from binarytree import build
import cython
import matplotlib.pyplot as plt
from numba import njit
import numpy as np
import perfplot
%load_ext cython
SD = 124 # random seed
rng = np.random.default_rng(SD) # random number generator
Language/package versions:
Python implementation: CPython
Python version : 3.9.12
IPython version : 8.2.0
binarytree : 6.5.1
matplotlib : 3.5.1
perfplot : 0.10.2
numpy : 1.21.5
cython : 0.29.28
numba : 0.55.1
Float array creation
We assume that we want to sort a NumPy array of 64-bit floating-point numbers:
n = 10
A = rng.random(n, dtype=np.float64)
A
array([0.78525311, 0.78585936, 0.96913602, 0.74805977, 0.65555081,
0.93888454, 0.17861445, 0.58864721, 0.44279917, 0.34884712])
Binary trees
Heapsort is based on the binary heap data structure, which relies on a nearly complete binary tree:
a [nearly] complete binary tree is a binary tree in which every level, except possibly the last, is completely filled, and all nodes in the last level are as far left as possible.
In a binary heap, the elements in an array are mapped to the tree nodes of a virtual binary tree. Assuming that we know the maximum number of elements in the heap, the array representation of the tree is more convenient than a linked structure. Note that it is also possible to use an array that can grow or shrink dynamically when threshold sizes are reached. Here is a figure showing the array elements and the corresponding tree nodes:
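The same mapping can also be rendered in text form with the binarytree package imported above, by building a tree whose node values are the array indices (a quick sketch, just for illustration):
# each printed node label below is the array index it corresponds to
print(build(list(range(10))))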
So we can see the array as a representation of the tree (an implicit data structure): the node at the top of the tree (A[0]) is called the root, and the ones at the bottom, without children, are called the leaves (A[5], A[6], A[7], A[8], A[9]). An advantage of binary trees is that it is easy to navigate among the nodes. Tree node indexing goes from the root to the leaves and from left to right. Level $k \geq 0$ goes from node $2^k - 1$ to node $2 (2^k - 1)$. Within a level, you can go left by decrementing, and right by incrementing, the node index by one. A node may have two children: the left and right ones. Given a node index i, the parent node index can easily be found as (i - 1) // 2:
def parent(i):
assert i > 0
return (i - 1) // 2
parent(1)
0
parent(8)
3
The left child has index 2 * i + 1 and the right child 2 * (i + 1):
def left_child(i):
return 2 * i + 1
def right_child(i):
return 2 * (i + 1)
left_child(0)
1
right_child(0)
2
The depth of a node is the number of edges from the root node to the target node. The height $h$ of a binary tree is the number of edges from the root to the most distant leaf node (in the above example, the height is equal to 3). We can bound the number of elements $n$ of the nearly complete tree using its height:
\[2^h = \sum_{k=0}^{h-1} 2^k + 1 \leq n \leq \sum_{k=0}^h 2^k = 2^{h+1} - 1\]
So that we have $h = \lfloor \log_2 n \rfloor$.
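As a quick sketch (the level_bounds and tree_height helpers below are only for illustration and are not used later), we can check these index and height formulas numerically:
def level_bounds(k):
    # first and last node index of level k (0-based levels)
    return 2**k - 1, 2 * (2**k - 1)
def tree_height(n):
    # height h of a nearly complete binary tree with n nodes
    return int(np.floor(np.log2(n)))
level_bounds(2)  # -> (3, 6): level 2 spans nodes 3 to 6
tree_height(10)  # -> 3, as in the example above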
Binary heap
Here is the definition of a heap [2]:
A heap is a set of nodes with keys arranged in a complete heap-ordered binary tree, represented as an array.
A binary heap is a binary tree data structure that satisfies a heap property. There are two kinds of heaps: min-heaps and max-heaps, satisfying respectively a min-heap or a max-heap property. In our case, we are going to use a max-heap to easily implement the heapsort algorithm. A min-heap would be used to sort the array in descending order, or would imply extra space/work to sort in ascending order.
max-heap property: the value of a given node i is not larger than the value of its parent node parent(i).
Thus the value of the root node A[0] of a max-heap is greater than or equal to all the tree values. This is also true for any sub-tree of the binary heap: the root node A[i] of any sub-tree is greater than or equal to all the values in this sub-tree.
An important point is that not all the array elements might be in the heap. We differentiate the heap size from the array length $n$: we have $0 \leq size \leq n$. The elements in the heap are the size elements in the left part of the array, A[0:size] (with Python slice indexing). All the remaining elements A[size:n] are implicitly not in the heap.
Initially, given the above array A, it most likely does not satisfy the max-heap property, which would imply in our case:
A[0] >= A[1], A[0] >= A[2], A[1] >= A[3], A[1] >= A[4], A[3] >= A[7], A[3] >= A[8], A[2] >= A[5], A[2] >= A[6].
Indeed we have:
A[0] >= A[1]
False
A[0] >= A[2]
False
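Rather than checking all these inequalities by hand, here is a small helper (an extra function for this post's checks, not part of the heapsort algorithm itself) that verifies the max-heap property over the first size elements:
def is_max_heap(A, size):
    # every node except the root must not be larger than its parent
    return all(A[parent(i)] >= A[i] for i in range(1, size))
is_max_heap(A, len(A))
False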
We need to build the max-heap from an array by moving its elements around. In order to do that, we apply a function called max_heapify to the nodes, from the leaves to the root of the tree, in a bottom-up manner. So let's implement this function.
Max_heapify
The max_heapify(A, size, i) function presupposes that the sub-trees rooted at the children of node i do satisfy the max-heap property. But the i-th node may violate it, i.e. A[i] < A[left_child(i)] or A[i] < A[right_child(i)], assuming that the children are in the heap. If this is the case, the i-th node is swapped with the child holding the largest value:
A[i], A[largest] = A[largest], A[i]
This process is then repeated from the largest child node. Eventually, the sub-tree rooted at i satisfies the max-heap property after the call to max_heapify(A, size, i).
This process is also referred to as sift down: move a value violating the max-heap property down the tree by exchanging it with the largest of its two children's values. It would also be possible to use the exact opposite process, sift up, which starts from a leaf node and moves its value up the tree by exchanging it with the parent node's value when the max-heap property is violated. However, building a max-heap with the sift up process has a larger complexity than with sift down. I found the explanation by @alestanis on Stack Overflow to the question "Why siftDown is better than siftUp in heapify?" really clear.
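For comparison, a minimal sift up sketch could look like the following (this sift_up function is only shown for illustration and is not used in the rest of the post):
def sift_up(A, i):
    # move A[i] up the tree while it is larger than its parent
    while i > 0 and A[parent(i)] < A[i]:
        A[i], A[parent(i)] = A[parent(i)], A[i]
        i = parent(i)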
So here is a Python implementation of the sift-down-based max_heapify:
def max_heapify(A, size, node_idx=0):
largest = node_idx
l = left_child(largest)
r = right_child(largest)
if (l < size) and (A[l] > A[largest]):
largest = l
if (r < size) and (A[r] > A[largest]):
largest = r
if largest != node_idx:
A[node_idx], A[largest] = A[largest], A[node_idx] # exchange 2 nodes
max_heapify(A, size, largest)
Note that this function is recursive. Let’s have a look at the tree values of our example:
tree_values = build(list(A.round(5)))
print(tree_values)
___________________0.78525___________
/ \
__________0.78586___________ ___0.96914___
/ \ / \
___0.74806__ ___0.65555 0.93888 0.17861
/ \ /
0.58865 0.4428 0.34885
This array is random, but we can observe that the max-heap property is satisfied everywhere except at the root! The sub-trees rooted at nodes 1 and 2 happen to satisfy the max-heap property already. So let's call the max_heapify function on the root node.
max_heapify(A, len(A), node_idx=0)
tree_values = build(list(A.round(5)))
print(tree_values)
___________________0.96914___________
/ \
__________0.78586___________ ___0.93888___
/ \ / \
___0.74806__ ___0.65555 0.78525 0.17861
/ \ /
0.58865 0.4428 0.34885
The root value A[0] = 0.78525 moved from node index 0 to 2, and then from 2 to 5. The values at nodes 2 and 5 moved one step up in the process.
Steps:
- node 0 is compared with nodes 1 and 2. Node 0 is smaller than one of its children (0.78525 < 0.96914). Node 2 is the child with the largest value: exchange nodes 0 and 2.
- node 2 is compared with nodes 5 and 6. Node 2 is smaller than one of its children (0.78525 < 0.93888). Node 5 is the child with the largest value: exchange nodes 2 and 5.
- Node 5 is a leaf, so stop.
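Since the recursive call in max_heapify is a tail call, the same sift down can also be written with a plain loop. Here is an equivalent iterative sketch (an alternative formulation, not the version used in the rest of the post):
def max_heapify_iter(A, size, node_idx=0):
    while True:
        largest = node_idx
        l = left_child(node_idx)
        r = right_child(node_idx)
        if (l < size) and (A[l] > A[largest]):
            largest = l
        if (r < size) and (A[r] > A[largest]):
            largest = r
        if largest == node_idx:
            break  # max-heap property restored at node_idx
        A[node_idx], A[largest] = A[largest], A[node_idx]
        node_idx = largest  # continue sifting down from the swapped child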
Let’s generate a new random array to see another example:
A = rng.random(n, dtype=np.float64)
A
array([0.3309295 , 0.15936868, 0.98946349, 0.25711078, 0.71576487,
0.50588512, 0.66411132, 0.70234247, 0.05208023, 0.06009649])
tree_values = build(list(A.round(5)))
print(tree_values)
__________________0.33093___________
/ \
___________0.15937__________ ___0.98946___
/ \ / \
___0.25711___ ___0.71576 0.50589 0.66411
/ \ /
0.70234 0.05208 0.0601
If we start from the leaves toward the root, we can notice a max-heap violation at node 3. So we call max_heapify on that node:
max_heapify(A, len(A), node_idx=3)
tree_values = build(list(A.round(5)))
print(tree_values)
__________________0.33093___________
/ \
___________0.15937__________ ___0.98946___
/ \ / \
___0.70234___ ___0.71576 0.50589 0.66411
/ \ /
0.25711 0.05208 0.0601
Now let’s take care of node 1:
max_heapify(A, len(A), node_idx=1)
tree_values = build(list(A.round(5)))
print(tree_values)
__________________0.33093___________
/ \
___________0.71576__________ ___0.98946___
/ \ / \
___0.70234___ ___0.15937 0.50589 0.66411
/ \ /
0.25711 0.05208 0.0601
And finally we need to fix the root value:
max_heapify(A, len(A), node_idx=0)
tree_values = build(list(A.round(5)))
print(tree_values)
__________________0.98946___________
/ \
___________0.71576__________ ___0.66411___
/ \ / \
___0.70234___ ___0.15937 0.50589 0.33093
/ \ /
0.25711 0.05208 0.0601
We have actually built a max-heap manually. In the following, let's see the general method to build it.
Build_max_heap
The method used to build the max-heap with a sift-down-based heapify is called Floyd’s heap construction:
Floyd’s algorithm starts with the leaves, observing that they are trivial but valid heaps by themselves, and then adds parents. Starting with element n/2 and working backwards, each internal node is made the root of a valid heap by sifting down. The last step is sifting down the first element, after which the entire array obeys the heap property.
This bottom-up heap construction technique runs in $O(n)$ time. Here is a Python implementation:
def build_max_heap(A):
size = len(A)
node_idx = size // 2 - 1 # last non-leaf node index
for i in range(node_idx, -1, -1):
max_heapify(A, size, node_idx=i)
return size
A = rng.random(n, dtype=np.float64)
A
array([0.94535273, 0.25035043, 0.40579395, 0.27596342, 0.30065296,
0.36667218, 0.14878984, 0.34834079, 0.59713033, 0.99416357])
tree_values = build(list(A.round(5)))
print(tree_values)
___________________0.94535___________
/ \
___________0.25035___________ ___0.40579___
/ \ / \
___0.27596___ ___0.30065 0.36667 0.14879
/ \ /
0.34834 0.59713 0.99416
_ = build_max_heap(A)
tree_values = build(list(A.round(5)))
print(tree_values)
___________________0.99416___________
/ \
___________0.94535___________ ___0.40579___
/ \ / \
___0.59713___ ___0.30065 0.36667 0.14879
/ \ /
0.34834 0.27596 0.25035
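Using the is_max_heap helper sketched earlier, we can confirm that the whole array now satisfies the max-heap property:
is_max_heap(A, len(A))
True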
Heapsort
The classical heapsort has two steps:
1 - build the heap
2 - destroy the heap by repeatedly swapping the root with the last element of the heap and removing it from the heap, $n - 1$ times.
Step 2 corresponds to the following iterations:
- swap the root (largest element) with the last leaf
- remove the last leaf from the heap.
- heapify the root on the reduced heap.
The average and worst-case performance are $O(n\log n)$. Here is a Python implementation:
def heapsort(A_in):
A = np.copy(A_in)
# build a max heap
size = build_max_heap(A)
for i in range(size - 1, 0, -1):
# swap the root (largest element) with the last leaf
A[i], A[0] = A[0], A[i]
# removing largest element from the heap
size -= 1
        # call max_heapify from the root on the heap with the remaining elements
max_heapify(A, size, node_idx=0)
return A
n = 10
A = rng.random(n, dtype=np.float64)
A
array([0.18525118, 0.99313433, 0.78561885, 0.44814329, 0.85044505,
0.86088208, 0.96716993, 0.17096352, 0.69956773, 0.8288503 ])
A_sorted_python = heapsort(A)
A_sorted_python
array([0.17096352, 0.18525118, 0.44814329, 0.69956773, 0.78561885,
0.8288503 , 0.85044505, 0.86088208, 0.96716993, 0.99313433])
A_ref = np.sort(A)
np.testing.assert_array_equal(A_sorted_python, A_ref)
Let’s use Numba to make this Python code run fast.
Numba
@njit
def max_heapify_numba(A, size, node_idx=0):
largest = node_idx
left_child = 2 * node_idx + 1
right_child = 2 * (node_idx + 1)
if (left_child < size) and (A[left_child] > A[largest]):
largest = left_child
if (right_child < size) and (A[right_child] > A[largest]):
largest = right_child
if largest != node_idx:
A[node_idx], A[largest] = A[largest], A[node_idx] # exchange 2 nodes
max_heapify_numba(A, size, largest)
@njit
def heapsort_numba(A_in):
A = np.copy(A_in)
# build a max heap
size = len(A)
node_idx = size // 2 - 1 # last non-leaf node index
for i in range(node_idx, -1, -1):
max_heapify_numba(A, size, node_idx=i)
for i in range(size - 1, 0, -1):
# swap the root (largest element) with the last leaf
A[i], A[0] = A[0], A[i]
# removing largest element from the heap
size -= 1
        # call max_heapify_numba from the root on the heap with the remaining elements
max_heapify_numba(A, size)
return A
A_sorted_numba = heapsort_numba(A)
A_sorted_numba
array([0.17096352, 0.18525118, 0.44814329, 0.69956773, 0.78561885,
0.8288503 , 0.85044505, 0.86088208, 0.96716993, 0.99313433])
np.testing.assert_array_equal(A_sorted_numba, A_ref)
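Note that the first call to an @njit function includes the JIT compilation time. Since heapsort_numba has already been called above, the benchmark below only measures the compiled code; when timing from scratch, a warm-up call such as this sketch keeps compilation out of the measurements:
# warm-up: trigger compilation on a tiny array before any timing
_ = heapsort_numba(rng.random(10, dtype=np.float64))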
Cython
%%cython --compile-args=-Ofast
# cython: boundscheck=False, initializedcheck=False, wraparound=False
import cython
import numpy as np
from cython import ssize_t, double
@cython.exceptval(check=False)
@cython.nogil
@cython.cfunc
def _max_heapify(A: double[::1], size: ssize_t, node_idx: ssize_t) -> cython.void:
largest: ssize_t = node_idx
left_child = 2 * node_idx + 1
right_child = 2 * (node_idx + 1)
if left_child < size and A[left_child] > A[largest]:
largest = left_child
if right_child < size and A[right_child] > A[largest]:
largest = right_child
if largest != node_idx:
A[node_idx], A[largest] = A[largest], A[node_idx]
_max_heapify(A, size, largest)
@cython.exceptval(check=False)
@cython.nogil
@cython.cfunc
def _heapsort(A: double[::1]) -> cython.void:
i: cython.int
size: ssize_t = len(A)
node_idx: ssize_t = size // 2 - 1
for i in range(node_idx, -1, -1):
_max_heapify(A, size, i)
for i in range(size - 1, 0, -1):
A[i], A[0] = A[0], A[i]
size -= 1
_max_heapify(A, size, 0)
@cython.ccall
def heapsort_cython(A_in):
A = np.copy(A_in)
_heapsort(A)
return A
A_sorted_cython = heapsort_cython(A)
A_sorted_cython
array([0.17096352, 0.18525118, 0.44814329, 0.69956773, 0.78561885,
0.8288503 , 0.85044505, 0.86088208, 0.96716993, 0.99313433])
np.testing.assert_array_equal(A_sorted_cython, A_ref)
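As an extra consistency check (not part of the benchmark below; A_big is just an illustrative name), we can verify that the Numba and Cython versions agree with NumPy on a larger random array:
A_big = rng.random(10_000, dtype=np.float64)
A_big_ref = np.sort(A_big)
np.testing.assert_array_equal(heapsort_numba(A_big), A_big_ref)
np.testing.assert_array_equal(heapsort_cython(A_big), A_big_ref)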
Performance comparison
We do not include the pure Python version, which is much slower, and compare the Numba and Cython versions with the NumPy heapsort implementation. It seems that this NumPy heapsort is written in C++ and is fairly well optimized.
out = perfplot.bench(
setup=lambda n: rng.random(n, dtype=np.float64),
kernels=[
lambda A: heapsort_numba(A),
lambda A: heapsort_cython(A),
lambda A: np.sort(A, kind="heapsort"),
],
labels=["Numba", "Cython", "NumPy"],
n_range=[10**k for k in range(1, 9)],
)
out
        n        Numba [s]     Cython [s]      NumPy [s]
       10        1.266e-06      1.819e-06      1.750e-06
      100        5.587e-06      5.004e-06      3.043e-06
     1000        1.163e-04      6.156e-05      6.252e-05
    10000        1.819e-03      8.793e-04      9.391e-04
   100000        2.851e-02      1.466e-02      1.350e-02
  1000000        4.240e-01      2.656e-01      2.248e-01
 10000000        5.938e+00      5.286e+00      3.455e+00
100000000        7.854e+01      8.405e+01      4.900e+01
figsize = (14, 14)
labels = out.labels
ms = 10
fig = plt.figure(figsize=figsize)
ax = fig.add_subplot(1, 1, 1)
plt.loglog(
out.n_range,
out.n_range * np.log(out.n_range) * 5.0e-8,
"o-",
    label=r"$c \; n \; \ln(n)$",
)
for i, label in enumerate(labels):
plt.loglog(out.n_range, out.timings_s[i], "o-", ms=ms, label=label)
markers = cycle(("o", "v", "^", "<", ">", "s", "p", "P", "*", "h", "X", "D", "."))
for i, line in enumerate(ax.get_lines()):
marker = next(markers)
line.set_marker(marker)
plt.legend()
plt.grid("on")
_ = ax.set(
title="Timing comparison between Numba, Cython and NumPy heapsort",
xlabel="Array length (log scale)",
ylabel="Elapsed_time [s] (log scale)",
)
Conclusion
Going from Python to Numba is seamless and allows us to reach a level of efficiency similar to Cython's (Cython turned 20 years old last week, happy birthday!!). We can observe that the NumPy heapsort implementation is faster on larger arrays, but we do not know which optimizations it implements.
References
[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms, Third Edition (3rd. ed.). The MIT Press.
[2] Robert Sedgewick. 2002. Algorithms in C (3rd. ed.). Addison-Wesley Longman Publishing Co., Inc., USA.