Skip to content

Commit 6c43799

Browse files
committed
new post
1 parent f069fbb commit 6c43799

1 file changed

Lines changed: 282 additions & 0 deletions

File tree

Lines changed: 282 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,282 @@
1+
---
2+
title: Computing and Visualizing Billions of Bohemian Eigenvalues with Python
3+
layout: post
4+
comments: true
5+
author: François Pacull
6+
tags:
7+
- Python
8+
- numba
9+
- datashader
10+
- parquet
11+
- dask
12+
- colorcet
13+
- linear algebra
14+
- random matrices
15+
- parallel computing
16+
- out-of-core processing
17+
- batch processing
18+
- big data visualization
19+
---
20+
21+
According to [www.bohemianmatrices.com/](http://www.bohemianmatrices.com/),
22+
23+
> A family of Bohemian matrices is a distribution of random matrices where the matrix entries are sampled from a discrete set of bounded height. The discrete set must be independent of the dimension of the matrices.
24+
25+
In our case, we sample 5x5 matrix entries from the discrete set {-1, 0, 1}. For example, here are 2 random matrices with these specifications:
26+
27+
$$
28+
\begin{pmatrix}
29+
0 & 0 & 0 & -1 & 0 \\
30+
0 & 0 & 0 & 1 & -1 \\
31+
1 & -1 & 0 & -1 & -1 \\
32+
1 & -1 & 1 & 1 & 0 \\
33+
-1 & 0 & 1 & 0 & -1
34+
\end{pmatrix}
35+
$$
36+
37+
$$
38+
\begin{pmatrix}
39+
-1 & 0 & 0 & 0 & 0 \\
40+
0 & 1 & 0 & 1 & 1 \\
41+
1 & -1 & 0 & -1 & -1 \\
42+
1 & 1 & 1 & 1 & -1 \\
43+
-1 & 0 & 1 & -1 & -1
44+
\end{pmatrix}
45+
$$
46+
47+
Bohemian **eigenvalues** are the eigenvalues of a family of Bohemian matrices. Eigenvalues $λ$ satisfy the equation $Av = λv$, where $A$ is the matrix and $v$ is a corresponding eigenvector. If the matrices are 5 by 5, the eigenvalue solver is going to return 5 complex eigenvalues.
48+
49+
So what if we want to compute all the possible eigenvalues from all these matrices? Well that would imply $3^{5 \times 5} = 847,288,609,443$ matrices, resulting in $4.236 × 10^{12}$ eigenvalues. Just storing all these matrices would require some significant space. Each 5×5 matrix has 25 entries, and if we need 1 byte per entry to represent {-1, 0, 1}, we get a total storage amount of :
50+
$847,288,609,443$ matrices × 25 bytes/matrix = $21,182,215,236,075$ bytes ≈ 21.18 TB
51+
52+
So instead of computing all possible matrices, we are only going to sample 1 billion matrices. The first motivation to compute these complex eigenvalues is to observe some interesting patterns when plotted in the complex plane. See for example the beautiful gallery from the bohemianmatrices web site : [www.bohemianmatrices.com/gallery/](http://www.bohemianmatrices.com/gallery/). Actually, we are just going to reproduce one of the gallery images :
53+
54+
<p align="center">
55+
<img width="300" src="/img/2025-05-23_01/Random_5x5_1_gallery@2x.jpg" alt="bhime_original">
56+
</p>
57+
58+
Although I like the resulting plot, the main point of this Python notebook is to be able to compute and visualize these 5 billion eigenvalues smoothly. My laptop has 32GB of RAM and a 20-thread Intel i9 CPU. And for that, we are going to use some great packages: numpy, numba, pyarrow, dask and datashader. Note that I tried to perform the eigenvalue computations with PyTorch but it did not really improve the overall efficiency for these very small matrices as compared to numba.
59+
60+
61+
We are going to process by batch for the eigenvalues computation and the visualization. This workflow has two distinct steps:
62+
63+
1. **Generate, compute and store**: We generate matrices in batches, compute eigenvalues in parallel using Numba's `njit`, and write the eigenvalues incrementally to a Parquet file.
64+
65+
2. **Load and visualize**: We use Dask to load data from the Parquet file in chunks, distributing points into partition buckets for out-of-core visualization with Datashader.
66+
67+
This *chunk* approach allows us to process billions of eigenvalues without overwhelming system memory.
68+
69+
## Package versions<a name="imports"></a>
70+
71+
Python implementation: CPython
72+
Python version : 3.13.3
73+
74+
dask : 2025.5.0
75+
datashader: 0.18.1
76+
numpy : 2.2.6
77+
pyarrow : 20.0.0
78+
colorcet : 3.1.0
79+
numba : 0.61.2
80+
tqdm : 4.67.1
81+
82+
## First Part : computing eigenvalues in batches
83+
84+
85+
```python
86+
import gc
87+
import os
88+
89+
import numpy as np
90+
import pyarrow as pa
91+
import pyarrow.parquet as pq
92+
from numba import njit, prange
93+
from tqdm import tqdm
94+
95+
PARQUET_FP = "./eigenvalues.parquet"
96+
RS = 124
97+
```
98+
99+
```python
100+
@njit(parallel=True, fastmath=True)
101+
def compute_eigenvalues_batch(entries, indices, mat_size):
102+
batch_size = indices.shape[0] // (mat_size * mat_size)
103+
eigs = np.empty(batch_size * mat_size, dtype=np.complex64)
104+
for i in prange(batch_size):
105+
mat = np.empty((mat_size, mat_size), dtype=np.complex64)
106+
for j in range(mat_size):
107+
for k in range(mat_size):
108+
index = indices[i * mat_size * mat_size + j * mat_size + k]
109+
mat[j, k] = entries[index]
110+
start = i * mat_size
111+
eigs[start : start + mat_size] = np.linalg.eigvals(mat)
112+
return eigs
113+
114+
def compute_and_save_eigenvalues(
115+
output_path, entries, mat_size, total_count, batch_size, seed=42
116+
):
117+
np.random.seed(seed)
118+
entries_array = np.array(entries, dtype=np.float32)
119+
num_choices = len(entries)
120+
schema = pa.schema(
121+
[
122+
("x", pa.float32()),
123+
("y", pa.float32()),
124+
]
125+
)
126+
with pq.ParquetWriter(output_path, schema) as writer:
127+
for offset in tqdm(range(0, total_count, batch_size)):
128+
current_batch = min(batch_size, total_count - offset)
129+
num_elements = current_batch * mat_size * mat_size
130+
random_indices = np.random.randint(
131+
0, num_choices, size=num_elements
132+
).astype(np.uint8)
133+
eigvals = compute_eigenvalues_batch(entries_array, random_indices, mat_size)
134+
table = pa.table(
135+
{
136+
"x": eigvals.real.astype(np.float32),
137+
"y": eigvals.imag.astype(np.float32),
138+
},
139+
schema=schema,
140+
)
141+
writer.write_table(table)
142+
del eigvals, table
143+
gc.collect()
144+
```
145+
146+
We remove the parquet file if it's already there:
147+
148+
```python
149+
if os.path.exists(PARQUET_FP):
150+
os.remove(PARQUET_FP)
151+
print(f"Deleted: {PARQUET_FP}")
152+
else:
153+
print(f"File does not exist: {PARQUET_FP}")
154+
```
155+
156+
File does not exist: ./eigenvalues.parquet
157+
158+
Let's generate 1 billion 5×5 matrices with entries from {-1, 0, 1} and compute their eigenvalues by batch of 10,000,000:
159+
160+
```python
161+
%%time
162+
163+
compute_and_save_eigenvalues(
164+
output_path=PARQUET_FP,
165+
entries=[-1, 0, 1],
166+
mat_size=5,
167+
total_count=1_000_000_000,
168+
batch_size=10_000_000,
169+
seed=RS,
170+
)
171+
```
172+
173+
0%| | 0/100 [00:00<?, ?it/s]OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
174+
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [19:49<00:00, 11.90s/it]
175+
176+
CPU times: user 4h 14min 28s, sys: 1min 22s, total: 4h 15min 51s
177+
Wall time: 19min 49s
178+
179+
180+
## Second part : visualization with Datashader
181+
182+
Datashader handles large datasets efficiently by processing data in chunks and creating density plots without loading all points into memory, only the final raster (the image) is kept in memory.
183+
184+
185+
```python
186+
import dask.dataframe as dd
187+
import datashader as ds
188+
import datashader.transfer_functions as tf
189+
from datashader import reductions as rd
190+
from colorcet import palette
191+
192+
cmap = palette["kgy"]
193+
bg_col = "black"
194+
```
195+
196+
We filter out real eigenvalues nearly on the real axis (those with imaginary parts `eps` close to zero) to focus on the complex structure. We also remove potential eigenvalues outside the square box $-3\leq x\leq3$ and $-3\leq y\leq3$.
197+
198+
199+
```python
200+
def visualize(
201+
parquet_path,
202+
plot_width=1600,
203+
plot_height=1600,
204+
x_range=(-3, 3),
205+
y_range=(-3, 3),
206+
eps=1e-5,
207+
cmap="fire",
208+
bg_col="black",
209+
output_path="output.png",
210+
how="log",
211+
max_points=1000
212+
):
213+
ddf = dd.read_parquet(parquet_path, columns=["x", "y"])
214+
nrows, ncols = ddf.shape[0].compute(), ddf.shape[1]
215+
print(f"Before filtering : {nrows} rows, {ncols} columns")
216+
ddf = ddf[
217+
(ddf["x"] >= x_range[0])
218+
& (ddf["x"] <= x_range[1])
219+
& (ddf["y"] >= y_range[0])
220+
& (ddf["y"] <= y_range[1])
221+
& ((ddf["y"] <= -eps) | (ddf["y"] >= eps))
222+
]
223+
nrows, ncols = ddf.shape[0].compute(), ddf.shape[1]
224+
print(f"After filtering : {nrows} rows, {ncols} columns")
225+
cvs = ds.Canvas(
226+
plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range
227+
)
228+
agg = cvs.points(ddf, "x", "y", agg=rd.count())
229+
agg_capped = agg.where(agg <= max_points, max_points)
230+
img = tf.shade(agg_capped, cmap=cmap, how=how)
231+
img = tf.set_background(img, bg_col)
232+
img.to_pil().save(output_path)
233+
return img
234+
```
235+
236+
237+
```python
238+
%%time
239+
240+
visualize(
241+
parquet_path=PARQUET_FP,
242+
cmap=cmap,
243+
bg_col=bg_col,
244+
output_path="bohemian_01.png",
245+
eps=1e-3,
246+
how="eq_hist",
247+
max_points=2000
248+
)
249+
```
250+
251+
Before filtering : 5000000000 rows, 2 columns
252+
After filtering : 2640645434 rows, 2 columns
253+
CPU times: user 9min 15s, sys: 11min 16s, total: 20min 31s
254+
Wall time: 1min 45s
255+
256+
257+
<p align="center">
258+
<img width="900" src="/img/2025-05-23_01/output_10_1.png" alt="bhime">
259+
</p>
260+
261+
{% if page.comments %}
262+
<div id="disqus_thread"></div>
263+
<script>
264+
265+
/**
266+
* RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
267+
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/
268+
/*
269+
var disqus_config = function () {
270+
this.page.url = PAGE_URL; // Replace PAGE_URL with your page's canonical URL variable
271+
this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
272+
};
273+
*/
274+
(function() { // DON'T EDIT BELOW THIS LINE
275+
var d = document, s = d.createElement('script');
276+
s.src = 'https://aetperf-github-io-1.disqus.com/embed.js';
277+
s.setAttribute('data-timestamp', +new Date());
278+
(d.head || d.body).appendChild(s);
279+
})();
280+
</script>
281+
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
282+
{% endif %}

0 commit comments

Comments
 (0)