In-memory cosine distance calculation, NumPy vs DuckDB
This notebook compares the performance of NumPy and DuckDB for computing cosine distances on the same dataset.
The dataset is dbpedia_14, with 630,073 embedded text chunks of Wikipedia text. The embedding model is distilbert-dot-tas_b-b256-msmarco, with an embedding dimension of 768. See this post for how the embedding vectors were created: A Hybrid information retriever with DuckDB.
The data is stored on disk as a .duckdb database file. We are going to load it both into an in-memory DuckDB database and into a NumPy array.
Outline
- Imports and package versions
- Query with DuckDB
- Query with NumPy
- Results
Imports and package versions
import duckdb
import numpy as np
import pandas as pd
Package versions:
Python : 3.13.0
duckdb : 1.2.0
numpy : 2.2.3
pandas : 2.2.3
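These can be checked directly from the imported packages; a minimal sketch (the original notebook may have obtained them differently):
print("duckdb :", duckdb.__version__)
print("numpy  :", np.__version__)
print("pandas :", pd.__version__)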
Query with DuckDB
Load the data into memory
We load the dataset from disk into an in-memory DuckDB database to enable fast queries and computations.
con = duckdb.connect()
query = "ATTACH '/home/francois/Workspace/Search_PDFs/dbpedia_14.duckdb' AS disk;"
con.sql(query)
%%time
query = """
CREATE TABLE dbpedia_14 AS (
SELECT chunk_index, title, chunk_embedding
FROM disk.dbpedia_14
ORDER BY chunk_index);"""
con.sql(query)
CPU times: user 9.17 s, sys: 2.59 s, total: 11.8 s
Wall time: 3.66 s
query = "SELECT COUNT(*) FROM dbpedia_14"
con.sql(query)
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 630073 │
└──────────────┘
So we have 630,073 vectors. We can confirm that each embedding has dimension 768 by inspecting the column type:
query = """
SELECT data_type
FROM information_schema.columns
WHERE table_catalog = 'memory'
AND table_name = 'dbpedia_14'
AND column_name = 'chunk_embedding';"""
con.sql(query)
┌────────────┐
│ data_type │
│ varchar │
├────────────┤
│ FLOAT[768] │
└────────────┘
Normalize the embedding vectors in-place
We first normalize the embedding vectors by dividing them by their L2 norm. While this process is computationally expensive, it is performed only once and eliminates redundant calculations during later similarity searches.
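Concretely, each vector $u$ is replaced by $\hat{u} = u / \|u\|_2$, so that $\|\hat{u}\|_2 = 1$ and the cosine similarity between two vectors later reduces to a plain dot product.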
Checking the L2 norm of the first vector in the column confirms that the vectors are not yet normalized:
query = "SELECT chunk_embedding FROM dbpedia_14 LIMIT 1"
search_embedding = con.sql(query).fetchnumpy()["chunk_embedding"][0]
np.linalg.norm(search_embedding, 2)
np.float32(11.047098)
The normalization is done here as a two-step process. First, we compute the L2 norm of each vector and store it in the l2_norm column:
%%time
query = """
ALTER TABLE dbpedia_14 ADD COLUMN IF NOT EXISTS l2_norm FLOAT;
UPDATE dbpedia_14
SET l2_norm = sqrt(list_sum(list_transform(chunk_embedding, y -> y * y)));"""
con.sql(query)
CPU times: user 3.32 s, sys: 557 ms, total: 3.87 s
Wall time: 765 ms
Now, we divide each embedding by its precomputed norm:
%%time
query = """
UPDATE dbpedia_14
SET chunk_embedding = list_transform(chunk_embedding, x -> x / l2_norm);"""
con.sql(query)
CPU times: user 4.04 s, sys: 385 ms, total: 4.43 s
Wall time: 2.44 s
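As a quick spot check (illustrative only, using the same list functions as above), we can recompute the norm of a few embeddings directly in SQL and verify that it is now 1:
query = """
SELECT chunk_index, sqrt(list_sum(list_transform(chunk_embedding, y -> y * y))) AS l2_norm_check
FROM dbpedia_14
LIMIT 3;"""
con.sql(query)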
Select the element to search against
To perform a similarity search, we first select an element from the dataset and then look for the 5 most similar vectors in the table, i.e. those with the smallest cosine distance. We retrieve the entry with chunk_index = 0:
query = "SELECT chunk_index, title FROM dbpedia_14 WHERE chunk_index = 0"
con.sql(query)
┌─────────────┬──────────────────┐
│ chunk_index │ title │
│ int64 │ varchar │
├─────────────┼──────────────────┤
│ 0 │ E. D. Abbott Ltd │
└─────────────┴──────────────────┘
Next, we extract its corresponding embedding:
query = "SELECT chunk_embedding FROM dbpedia_14 WHERE chunk_index = 0"
search_embedding = con.sql(query).fetchnumpy()["chunk_embedding"][0]
To confirm that our normalization step was successful, we check that the L2 norm of this embedding is 1:
np.linalg.norm(search_embedding, 2)
np.float32(1.0)
Cosine distance query
The cosine distance $D(u,v)$ between two vectors $u$ and $v$ is defined as:
\[D(u, v) = 1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}\]
This is not strictly a “distance” in the mathematical sense, as it does not satisfy the triangle inequality. Since we have normalized our embeddings, their L2 norms are 1, simplifying the equation to:
\[D(u, v) = 1 - u \cdot v\]
Thus, computing the cosine distance reduces to computing the dot/inner product. We are going to use the array_negative_inner_product function to compute $(-u \cdot v)$.
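As a quick sanity check of this simplification, here is a small illustrative sketch on random vectors (independent of the table) confirming that both formulas agree once the vectors are normalized:
rng = np.random.default_rng(0)
u, v = rng.normal(size=768), rng.normal(size=768)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)  # normalize both vectors
full_formula = 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
simplified = 1.0 - np.dot(u, v)
np.isclose(full_formula, simplified)  # True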
%%timeit -r 10 -n 1
query = f"""
SELECT chunk_index, title, 1 + array_negative_inner_product(
chunk_embedding,
CAST(? AS FLOAT[768]))
AS cosine_distance
FROM dbpedia_14
ORDER BY cosine_distance ASC
LIMIT 5;"""
con.execute(query, (search_embedding,)).df()
165 ms ± 6.46 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
| | chunk_index | title | cosine_distance |
|---|---|---|---|
| 0 | 0 | E. D. Abbott Ltd | 0.000000 |
| 1 | 219729 | Abbott Farnham sailplane | 0.112158 |
| 2 | 31145 | Abbott-Baynes Sailplanes | 0.139073 |
| 3 | 38684 | Abbott-Detroit | 0.142139 |
| 4 | 34947 | East Lancashire Coachbuilders | 0.153380 |
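As a side note, recent DuckDB releases also ship a built-in array_cosine_distance function, which divides by the norms itself, so pre-normalization is not required. A sketch of an equivalent query (assumed available in this DuckDB version; not benchmarked here):
query = """
SELECT chunk_index, title,
       array_cosine_distance(chunk_embedding, CAST(? AS FLOAT[768])) AS cosine_distance
FROM dbpedia_14
ORDER BY cosine_distance ASC
LIMIT 5;"""
con.execute(query, (search_embedding,)).df()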
Query with NumPy
Load the data into NumPy
To perform similarity searches with NumPy, we first load all embeddings from the .duckdb file into a NumPy array.
%%time
sql = "SELECT chunk_embedding FROM disk.dbpedia_14 ORDER BY chunk_index"
embeddings = con.execute(sql).fetchnumpy()["chunk_embedding"]
embeddings = np.vstack(embeddings)
CPU times: user 5.14 s, sys: 1.65 s, total: 6.79 s
Wall time: 3.52 s
After loading, we verify the shape of the array:
embeddings.shape
(630073, 768)
To confirm that the embeddings are not yet normalized, we check the L2 norm of the first vector:
np.linalg.norm(embeddings[0, :], 2)
np.float32(11.047098)
Normalize the embedding vectors
Note that we could have reused the normalized vectors from the in-memory DuckDB table. Instead, we start again from the data on disk and normalize the NumPy way:
%%time
norms = np.linalg.norm(embeddings, ord=2, axis=1)
embeddings /= norms[:, np.newaxis]
CPU times: user 451 ms, sys: 2.64 s, total: 3.09 s
Wall time: 3.18 s
np.linalg.norm(embeddings[0, :], 2)
np.float32(0.99999994)
Select the element to search against
We use the same reference element as before, now extracted from the NumPy array of normalized embeddings:
search_embedding = embeddings[0]
Cosine distance query
We calculate the cosine distance using the formula $D(u,v)=1− u \cdot v$:
cosine_distances = 1.0 - np.dot(embeddings, search_embedding)
To efficiently retrieve the top 5 closest results, we use np.argpartition to find the indices of the smallest cosine distances, and then sort those indices to get the final results:
%%timeit -r 10 -n 1
cosine_distances = 1.0 - np.dot(embeddings, search_embedding)
k = 5
smallest_indices = np.argpartition(cosine_distances, k)[:k]
smallest_indices = smallest_indices[np.argsort(cosine_distances[smallest_indices])]
smallest_values = cosine_distances[smallest_indices]
pd.DataFrame({"chunk_index": smallest_indices, "cosine_distance": smallest_values})
36.1 ms ± 2.86 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
| | chunk_index | cosine_distance |
|---|---|---|
| 0 | 0 | 0.000000 |
| 1 | 219729 | 0.112157 |
| 2 | 31145 | 0.139073 |
| 3 | 38684 | 0.142139 |
| 4 | 34947 | 0.153380 |
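For comparison, a full np.argsort over all 630,073 distances would return the same top 5; np.argpartition is used because it avoids sorting the whole array (average O(n) rather than O(n log n)). An illustrative equivalent:
# Same result via a full sort of all distances (more work than needed)
order = np.argsort(cosine_distances)[:5]
pd.DataFrame({"chunk_index": order, "cosine_distance": cosine_distances[order]})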
The top 5 closest elements have the same chunk indexes as those found earlier with DuckDB, with the same distances. We can directly compare the chunk_index from the database table with the NumPy array index because the embeddings were loaded with ORDER BY chunk_index, and that column contains contiguous integers starting at 0.
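This alignment can be verified quickly before closing the connection (an illustrative check, using the table and array defined above):
# chunk_index is contiguous and starts at 0, so row i of the NumPy array
# corresponds to chunk_index = i in the DuckDB table.
idx = con.sql("SELECT chunk_index FROM dbpedia_14 ORDER BY chunk_index").fetchnumpy()["chunk_index"]
assert idx[0] == 0 and (np.diff(idx) == 1).all()
assert len(idx) == embeddings.shape[0]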
Finally, we ensure that the database is properly detached from DuckDB, and the connection is closed.
query = "DETACH disk;"
con.sql(query)
con.close()
Results
We compared the elapsed times for computing the cosine distance using both DuckDB and NumPy on the same in-memory data. As shown in the bar chart below, NumPy performs the computation roughly 4 to 5 times faster than DuckDB in this setting (about 36 ms vs. 165 ms).
timings = pd.Series({"DuckDB": 165, "NumPy": 36.1}).to_frame("elapsed_time_ms")
ax = timings.plot.bar(alpha=0.7, grid=True, legend=False, rot=45)
_ = ax.set(title="Comparing cosine distance computation in-memory:\nDuckDB vs NumPy on the dbpedia_14 Dataset", ylabel="Elapsed time (ms)")
However, these are two different tools with very distinct purposes. NumPy is built on highly optimized linear algebra libraries, which likely explains the observed performance difference in this small brute-force search example. Of course, for any real-world vector search application, a database with vector support is essential: it provides persistent disk-based storage, advanced indexing techniques, and memory mapping to handle large datasets efficiently, while enabling concurrent queries, filtering, and more. This post is only meant to benchmark computational time.
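As an illustration of that direction, DuckDB offers a vss extension with an HNSW index. It is not used in this benchmark, and the sketch below runs on toy in-memory data only; the exact syntax may vary across versions:
import duckdb
import numpy as np

vectors = np.random.rand(1_000, 768).astype(np.float32)   # stand-in embeddings
con = duckdb.connect()                                     # in-memory database
con.sql("INSTALL vss; LOAD vss;")
con.sql("CREATE TABLE docs (chunk_index BIGINT, chunk_embedding FLOAT[768]);")
con.executemany(
    "INSERT INTO docs VALUES (?, CAST(? AS FLOAT[768]))",
    list(enumerate(vectors)),
)
# Approximate nearest-neighbour index with the cosine metric
con.sql("""
CREATE INDEX docs_hnsw ON docs
USING HNSW (chunk_embedding) WITH (metric = 'cosine');""")
# Top-5 approximate search around the first vector
con.execute("""
SELECT chunk_index,
       array_cosine_distance(chunk_embedding, CAST(? AS FLOAT[768])) AS cosine_distance
FROM docs
ORDER BY cosine_distance
LIMIT 5;""", (vectors[0],)).df()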