We’ve been hard at work behind the scenes on Codon and are thrilled to finally be releasing the next major version, our first major release of 2025. In this blog post, we’ll go over the new changes, cover the key features, take a technical deep dive, and more!
Summary
For the impatient, here's a quick summary of the announcements:
- Codon is now fully open source under the Apache-2.0 license
- Codon now ships with a new, native NumPy implementation that...
- ... works with Codon's multithreading and GPU backends
- ... includes NumPy-specific compiler optimizations
- ... attains 2.4x average (geo mean) and up to 900x speedups on the NPBench benchmark suite, single-threaded
Codon is now fully open source
We’re excited to share that Codon is switching from the non-commercial-use Business Source License to the popular Apache License, meaning that from here on out Codon is truly open source and free even for commercial use. Exaloop will continue to offer enterprise licenses with support & services packages for organizations that want and need them, but now anyone, anywhere can use and build with Codon freely. We think this is a necessary step towards Codon achieving its full potential when it comes to bringing performance, scalability and hardware acceleration to Python!
A new NumPy implementation built for Codon, in Codon
We’re also thrilled to finally be releasing a feature-complete, Codon-native NumPy implementation that’s built from scratch in Codon itself, fully compiled, can leverage Codon’s multithreading and GPU features, and ships with NumPy-specific compiler optimizations. We’ll talk a lot more about Codon’s new NumPy support in this blog post, but you can also learn more in the docs and view the source code of the implementation on GitHub.
We’ve found that NumPy support has been the biggest barrier for many folks looking to use Codon, and while it’s possible to call vanilla NumPy (or any other Python library) through Codon’s Python interoperability support, doing so leaves a lot of performance on the table and limits Codon’s compilation & optimization benefits. With this release, Codon will ship with native NumPy support that…
- … is fully compiled with Codon itself, meaning Codon’s optimization framework and LLVM backend have full freedom to inline, auto-vectorize and perform other compiler optimizations that aren’t possible with vanilla NumPy’s opaque C extensions.
- … works seamlessly with Codon’s parallelism features. That means you can, for example, operate on arrays in parallel or offload NumPy computations to the GPU.
- … incorporates NumPy-specific compiler optimizations like operator fusion through new Codon IR optimization passes.
Here are some benchmark results, using the NPBench benchmark suite. All of these benchmarks were run on macOS with Apple Silicon as well as x86_64 Linux against Python 3.12 + NumPy 1.26 without using Codon's parallelism features. The geometric mean of speedups is 2.4x and the maximum speedup attained is over 900x.
Of course, leveraging Codon's multithreading and GPU capabilities would result in much larger speedups still. If you want to try running some of these benchmarks or your own programs, Codon’s new NumPy implementation is publicly available now – all you need to do to give it a try is update to the latest version of Codon. For the remainder of this blog post, we want to delve into the technical & design details of Codon-NumPy and also showcase some cool examples!
Codon-NumPy deep dive
Let’s look at a few examples to get our feet wet.
Basics
You can use Codon’s NumPy implementation by simply importing numpy as usual:
import numpy as np
arr = np.arange(10)
print(arr) # 0 1 2 .. 9
print((arr ** 2).sum()) # 285
You can also use the usual submodules, like numpy.linalg:
import numpy as np
a = np.array([[5+2j, 9-2j], [0+2j, 2-1j]])
print(np.linalg.eigh(a).eigenvectors)
# -0.4472 -0.j -0.8944 +0.j
# 0. +0.8944j 0. -0.4472j
Like standard NumPy, Codon-NumPy will use an optimized BLAS library to implement several linear algebra routines, defaulting to OpenBLAS on Linux and Accelerate on macOS.
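For instance, a matrix multiplication like the one below is the kind of routine that gets handed off to the underlying BLAS (a quick sketch; exactly which routines are BLAS-backed is up to the implementation):

import numpy as np

rng = np.random.default_rng(seed=0)
a = rng.random((2000, 2000))
b = rng.random((2000, 2000))

c = a @ b  # dense matrix multiply, typically dispatched to the BLAS 'gemm' routine
print(c.sum())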
Loops
Loops are notoriously slow in standard Python, but they’re very fast in Codon since they get fully compiled to native code. For instance, imagine we want to initialize a large array $A$ such that $A_{i,j,k} = i + j + k$. Here’s a simple example that does this and measures the runtime:
import numpy as np
import time
a = np.empty((300, 300, 300), dtype=np.float32)
t0 = time.time()
for i in range(300):
    for j in range(300):
        for k in range(300):
            a[i, j, k] = i + j + k
t1 = time.time()
print(a.mean())
print('time:', t1 - t0)
On my M1 MacBook Pro, running this in Python takes 3.5 seconds whereas Codon takes ~0.01 seconds (a ~300x speedup). In short, you don’t need to worry about loops or other Python constructs being slow when using Codon, since everything gets compiled and optimized.
Fusions
Codon includes NumPy-specific compiler optimizations that can improve the performance of NumPy programs through techniques like operator fusion and memory allocation elision. Let’s take a look at an example that illustrates this: the following program approximates $\pi$ by generating a series of random points in the unit square $[0, 1) \times [0, 1)$ and computing the fraction of them that lie in the circle of radius $0.5$ centered at $(0.5, 0.5)$. If you do the math, it turns out that this ratio is about $\pi \over 4$. (The code below performs the equivalent check of whether each point lies within distance $1$ of the corner $(1, 1)$, i.e. inside a quarter of the unit circle, which covers the same fraction of the square.) Here’s the code:
import time
import numpy as np
rng = np.random.default_rng(seed=0)
x = rng.random(500_000_000) # x coordinates
y = rng.random(500_000_000) # y coordinates
t0 = time.time()
# pi ~= 4 x (fraction of points in circle)
pi = ((x-1)**2 + (y-1)**2 < 1).sum() * (4 / len(x))
t1 = time.time()
print(f'Computed pi~={pi:.4f} in {t1 - t0:.2f} sec')
By the way, Codon’s numpy.random implements all of the same pseudorandom number generation algorithms that are present in standard NumPy, so the generated random numbers are identical in both Codon and Python for the same seed.
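For example, a snippet like the following should print exactly the same values whether it’s run under CPython with standard NumPy or compiled with Codon (a small sanity check you can try yourself):

import numpy as np

rng = np.random.default_rng(seed=42)
print(rng.random(3))           # same three floats under NumPy and Codon-NumPy
print(rng.standard_normal(3))  # normal variates match for the same seed, too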
The code takes ~2.4 seconds with standard NumPy, but in Codon it takes ~0.42 seconds (on an M1 MacBook Pro) for a ~6x speedup. The core reason for the speedup is that Codon optimizes the expression (x-1)**2 + (y-1)**2 < 1 so that it gets evaluated with just a single pass over the x and y arrays, whereas standard NumPy must first compute all the subexpressions x-1, (x-1)**2 and so on.
You can have Codon print information about fused expressions via the -npfuse-verbose flag:
Optimizing expression at pi.py:10:7
lt <array[bool, 1]> [cost=6]
  add <array[f64, 1]> [cost=5]
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a0 <array[f64, 1]>
        a1 <i64>
      a2 <i64>
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a3 <array[f64, 1]>
        a4 <i64>
      a5 <i64>
  a6 <i64>
-> static fuse:
lt <array[bool, 1]> [cost=6]
  add <array[f64, 1]> [cost=5]
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a0 <array[f64, 1]>
        a1 <i64>
      a2 <i64>
    pow <array[f64, 1]> [cost=2]
      sub <array[f64, 1]> [cost=1]
        a3 <array[f64, 1]>
        a4 <i64>
      a5 <i64>
  a6 <i64>
Codon’s fusion pass uses a cost model internally to determine the optimal execution strategy for a given expression. We’ll talk more about that later!
Multithreading
Since Codon supports full multithreading and, unlike standard Python, has no “Global Interpreter Lock”, we can run NumPy code in parallel. Here’s an example:
import numpy as np
import numpy.random as rnd
import time
N = 100_000_000
n = 10
rng = rnd.default_rng(seed=0)
x = rng.normal(size=(n, N))
y = np.empty(n)
t0 = time.time()
@par(num_threads=n) # tells Codon to run the loop on 'n' threads
for i in range(n):
    y[i] = np.ptp(np.exp(-x[i, :]) ** 2)
t1 = time.time()
print(y)
print(t1 - t0, 'seconds')
On my M1 MacBook Pro the timings are:
- Python: 8.7s
- Codon single-threaded: 6.7s
- Codon multi-threaded: 1.2s (7x speedup over Python)
GPU
Codon’s GPU backend works seamlessly with Codon-NumPy. You can pass arrays to and from the GPU and also process arrays on the device. Here’s an example that computes the Mandelbrot set:
import numpy as np
import gpu
MAX = 1000 # maximum Mandelbrot iterations
N = 4096 # width and height of image
pixels = np.empty((N, N), int)
def scale(x, a, b):
    return a + (x/N)*(b - a)
@gpu.kernel
def mandelbrot(pixels):
    i = (gpu.block.x * gpu.block.dim.x) + gpu.thread.x
    j = (gpu.block.y * gpu.block.dim.y) + gpu.thread.y
    c = complex(scale(j, -2.00, 0.47), scale(i, -1.12, 1.12))
    z = 0j
    iteration = 0
    while abs(z) <= 2 and iteration < MAX:
        z = z**2 + c
        iteration += 1
    pixels[i, j] = 255 * iteration/MAX
mandelbrot(pixels, grid=(N//32, N//32), block=(32, 32))
Of course, because we are in fact using the GPU to compute the Mandelbrot entries in parallel, this code runs thousands of times faster than standard Python/NumPy without GPU offloading.
PyTorch integration
Since PyTorch tensors can be converted to NumPy arrays without any data copying, it’s easy to integrate Codon into PyTorch workflows. Let’s return to the first example of initializing an array (or, tensor in this case) $A$ such that $A_{i,j,k} = i + j + k$. We can do this via Codon’s just-in-time (“JIT”) mode:
import numpy as np
import time
import codon
import torch
@codon.jit
def initialize(arr):
    for i in range(300):
        for j in range(300):
            for k in range(300):
                arr[i, j, k] = i + j + k
# first call JIT-compiles; subsequent calls use cached JIT'd code
tensor = torch.empty(300, 300, 300)
initialize(tensor.numpy()) # note that 'tensor.numpy()' doesn't copy data!
tensor = torch.empty(300, 300, 300)
t0 = time.time()
initialize(tensor.numpy())
t1 = time.time()
print(tensor)
print(t1 - t0, 'seconds')
Timings are similar to those in the first example (i.e. ~300x faster than pure Python). Note that Codon’s JIT will compile the decorated function on the first invocation and cache the compiled code for future calls, meaning the first call incurs the compilation overhead.
It’s also possible to use Codon’s Python extension backend for the same purpose, which compiles everything ahead of time.
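As a rough sketch of what that might look like (the module name here is hypothetical, and the build step itself is described in the docs), you would compile the function ahead of time into a regular Python extension module and import it from Python:

# initialize_ext.py: compiled ahead of time with Codon's Python extension backend
# (see the Codon docs for the exact build command); 'initialize_ext' is a
# hypothetical module name used only for this sketch
def initialize(arr):
    for i in range(300):
        for j in range(300):
            for k in range(300):
                arr[i, j, k] = i + j + k

On the Python side, import initialize_ext and call initialize_ext.initialize(tensor.numpy()) as before, with no per-call JIT compilation step.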
Codon’s new ndarray type
With those few examples under our belt, let’s dig a bit deeper into how Codon-NumPy works internally, both at the library and compiler level. Codon’s ndarray type is very lightweight – in fact, it’s represented by a named tuple with just 3 fields:
- shape: a tuple of ndim integers representing the shape of the array, with one entry for each axis
- strides: a tuple of ndim integers representing the stride in bytes to get to the next element along each axis
- data: a pointer to the actual array data
That’s it. Codon’s ndarray type is itself also parameterized by i) the array data type, dtype, and ii) the array dimension, ndim. We chose to make the array dimension a type parameter because it enables a wider range of optimizations and has almost no effect on the API, since virtually all NumPy functions return arrays whose dimensions are determinable at compile time, given that the input dimensions are known.
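For instance, using the same internal type notation that appears in Codon’s IR (e.g. ndarray[float,1]), dimensions propagate through operations at compile time; the comments below sketch how the types work out:

import numpy as np

a = np.zeros((3, 4))   # ndarray[float, 2]: ndim=2 is part of the array's type
b = a.sum(axis=0)      # ndarray[float, 1]: reducing along an axis drops a dimension
c = a[0]               # ndarray[float, 1]: as does indexing with a single integer
print(b.shape, c.shape)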
Here’s an example of what np.array([[1, 2, 3], [4, 5, 6]]) translates to, both logically and physically:
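Concretely, assuming the default 64-bit integer dtype, the fields of the resulting 2-by-3 array would look as follows (a small sketch of the logical layout):

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)    # (2, 3)  -> one entry per axis
print(arr.strides)  # (24, 8) -> 8 bytes to the next element in a row, 24 to the next row
# data: a pointer to the 6 elements stored contiguously as 1 2 3 4 5 6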
What’s nice about this design is that it has zero runtime overhead and is very amenable to compiler optimizations by LLVM (the compilation framework Codon uses in its backend). For example, shape and strides can usually be optimized out, and the result is often identical to that of equivalent C code that just operates on raw pointers.
Vectorization
Vectorization is a critical element of NumPy, and one of the reasons why it can achieve good performance. Standard NumPy uses Python C extensions to implement vectorized loops for its many operations, from binary operations to reductions and so on, for various CPU architectures and SIMD instruction sets like SSE, AVX, Neon etc. By contrast, because Codon is a compiler, Codon-NumPy leverages LLVM’s auto-vectorization capabilities to automatically generate efficient, vectorized code for the host machine. This has a few different advantages:
- It simplifies the code since we don’t need to deal with vectorization or SIMD at the library level, nor worry about maintaining separate implementations for different instruction sets. We just let the compiler worry about all that!
- It works on user code just as well. For example, a loop or series of NumPy operations written by the user will be vectorized just like the library itself is auto-vectorized.
- It can sometimes outperform NumPy’s hand-vectorized loops as the compiler has complete and detailed knowledge of the host machine and can specialize for the given CPU, architecture, instruction set, etc.
- It works well with Codon’s operator fusion optimizations, since fused operations can be auto-vectorized.
The compiler can be pretty clever when it comes to reasoning about and optimizing loops. For example, consider the following code for initializing all the elements of an array to a constant value:
for i in range(len(a)):
    a[i] = 3.14
This compiles down to a single call to the libc function memset_pattern16, which sets the values in an array to a constant value:
tail call void @memset_pattern16(ptr nonnull %0, ptr nonnull @.memset_pattern, i64 8000000)
Another example is the expression np.sqrt(a**2 + b**2) on two fixed-size contiguous vectors a and b, which compiles to the following LLVM IR:
vector.body: ; preds = %entry, %vector.body
%index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
%14 = getelementptr double, ptr %10, i64 %index
%15 = getelementptr double, ptr %.fca.2.extract.i675.i.i, i64 %index
%16 = getelementptr double, ptr %.fca.2.extract.i676.i.i, i64 %index
%wide.load = load <2 x double>, ptr %15, align 8
%17 = getelementptr double, ptr %15, i64 2
%wide.load123 = load <2 x double>, ptr %17, align 8
%18 = getelementptr double, ptr %15, i64 4
%wide.load124 = load <2 x double>, ptr %18, align 8
%19 = getelementptr double, ptr %15, i64 6
%wide.load125 = load <2 x double>, ptr %19, align 8
%20 = fmul <2 x double> %wide.load, %wide.load
%21 = fmul <2 x double> %wide.load123, %wide.load123
%22 = fmul <2 x double> %wide.load124, %wide.load124
%23 = fmul <2 x double> %wide.load125, %wide.load125
%wide.load126 = load <2 x double>, ptr %16, align 8
%24 = getelementptr double, ptr %16, i64 2
%wide.load127 = load <2 x double>, ptr %24, align 8
%25 = getelementptr double, ptr %16, i64 4
%wide.load128 = load <2 x double>, ptr %25, align 8
%26 = getelementptr double, ptr %16, i64 6
%wide.load129 = load <2 x double>, ptr %26, align 8
%27 = fmul <2 x double> %wide.load126, %wide.load126
%28 = fmul <2 x double> %wide.load127, %wide.load127
%29 = fmul <2 x double> %wide.load128, %wide.load128
%30 = fmul <2 x double> %wide.load129, %wide.load129
%31 = fadd <2 x double> %20, %27
%32 = fadd <2 x double> %21, %28
%33 = fadd <2 x double> %22, %29
%34 = fadd <2 x double> %23, %30
%35 = tail call <2 x double> @llvm.sqrt.v2f64(<2 x double> %31)
%36 = tail call <2 x double> @llvm.sqrt.v2f64(<2 x double> %32)
%37 = tail call <2 x double> @llvm.sqrt.v2f64(<2 x double> %33)
%38 = tail call <2 x double> @llvm.sqrt.v2f64(<2 x double> %34)
store <2 x double> %35, ptr %14, align 8
%39 = getelementptr double, ptr %14, i64 2
store <2 x double> %36, ptr %39, align 8
%40 = getelementptr double, ptr %14, i64 4
store <2 x double> %37, ptr %40, align 8
%41 = getelementptr double, ptr %14, i64 6
store <2 x double> %38, ptr %41, align 8
%index.next = add nuw nsw i64 %index, 8
%42 = icmp eq i64 %index.next, 1000000
br i1 %42, label %"std.numpy.fusion._loop_alloc:0[Tuple[std.numpy.ndarray.ndarray[float,1],std.numpy.ndarray.ndarray[float,1]],Function[Tuple[Ptr[float],Ptr[float],Ptr[float],Tuple[int,int]],NoneType],Tuple[int,int],float].32064.exit", label %vector.body, !llvm.loop !2
If you’re not well-versed in LLVM IR, don’t worry – the key takeaway is that the individual operations get combined (partly thanks to Codon’s operator fusion optimizations) and aggressively vectorized and unrolled.
For some math functions like cos() and exp(), Codon uses Google’s Highway library for efficient vectorized implementations.
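For example, an element-wise expression like the one below benefits from those vectorized kernels (a small illustration; exactly which functions are Highway-backed is an implementation detail):

import numpy as np

x = np.linspace(0.0, 10.0, 1_000_000)
y = np.exp(np.cos(x))  # both cos() and exp() are evaluated with SIMD kernels over the array
print(y.max())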
Operator fusion optimizations
Codon includes compiler passes that optimize NumPy code through techniques like operator fusion, which combines distinct operations so that they can be executed in a single pass over the argument arrays, saving both execution time and memory (since intermediate arrays no longer need to be allocated).
Going back to the same example above for approximating $\pi$:
import time
import numpy as np
rng = np.random.default_rng(seed=0)
x = rng.random(500_000_000) # x coordinates
y = rng.random(500_000_000) # y coordinates
t0 = time.time()
# pi ~= 4 x (fraction of points in circle)
pi = ((x-1)**2 + (y-1)**2 < 1).sum() * (4 / len(x))
t1 = time.time()
print(f'Computed pi~={pi:.4f} in {t1 - t0:.2f} sec')
Codon will fuse the expression (x-1)**2 + (y-1)**2 < 1 so that it gets executed in just a single pass over x and y, instead of computing each subexpression, allocating a new array for each, and so on, as standard NumPy has to do. That gives about a 6x speedup in this case (on my machine).
But how does this optimization work under the hood? It’s actually implemented as a Codon IR pass that extracts NumPy expressions from the program, analyzes them, and uses a cost model to decide when to fuse, when to evaluate sequentially, and so on. For example, for the program above, the pass internally extracts the (x-1)**2 + (y-1)**2 < 1 expression from the IR and puts it into an AST-like format, the same tree shown in the -npfuse-verbose output above.
Then it assigns a cost to each sub-tree based on the operations involved. Leaf nodes have zero cost because they don’t involve a computation, and internal operator nodes have a cost that corresponds roughly to how expensive it is to carry out that particular operation. Simple operations like addition and multiplication have cost 1, whereas more complex operations like sin or log have higher costs. In our expression above, every individual operation has cost 1 (the bracketed costs in the -npfuse-verbose output are cumulative sub-tree costs). Note that “power” (“**”) in general has a higher cost, but because squaring (“** 2”) is such a common operation, it’s special-cased in the compiler pass to have unit cost.
But why do we need to worry about costs at all? It turns out that fusion becomes less beneficial as the expression becomes more complex (i.e. “costly”) due to factors like memory bandwidth limitations, register pressure, instruction pipelining, etc. So we essentially use the cost as a proxy for how beneficial fusion will be, and use a cost threshold to decide when to fuse a given expression as opposed to evaluating it sequentially.
It might be the case that, even within a single expression, the optimal strategy based on this heuristic is to fuse some parts and evaluate others sequentially. For instance, say our cost threshold is 10 and we have an expression in which two sub-expressions each fall below the threshold, while the larger expression that joins them with an addition exceeds it. The best thing to do based on our heuristic is to fuse each sub-expression, then evaluate the addition on the two results, which is precisely what Codon’s optimization pass will do. Algorithmically, we just replace each fused sub-expression with a single node representing the computed result, and repeat until we’re left with either a single node or a tree that can’t be fused any further under our threshold.
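As a purely hypothetical illustration (the actual costs and threshold are internal to the pass), an expression like the following could be split up that way:

import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.random(1_000_000)
y = rng.random(1_000_000) + 1.0  # keep log()'s argument positive

# Hypothetically, suppose each parenthesized half falls below the cost threshold
# while the full expression exceeds it: the pass would fuse each half into its own
# single-pass loop, then evaluate the outer '+' as a separate pass over the two
# intermediate results.
z = (np.sin(x) * np.cos(x) + x**2) + (np.log(y) * np.exp(y) + y**2)
print(z.sum())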
Handling variables effectively
All of this works great if expressions are written straightforwardly as one unit, but real-world code often breaks up what would otherwise effectively be a single expression by introducing variables. For example, take the $\pi$ approximation code above; the pi = ... bit could just as easily have been written as:
s = (x - 1) ** 2
t = (y - 1) ** 2
pi = (s + t < 1).sum() * (4 / len(x))
The program is semantically the same despite the introduction of the s and t variables, but now we suddenly have three expressions instead of one, which limits the extent to which we can perform fusions!
To remedy this, Codon’s fusion pass implements a pre-processing step that tracks variable assignments and automatically “forwards” variables into the other expressions in which they’re used, assuming the compiler can prove that it’s valid to do so. Of course, several terms and conditions apply; for example, the variable should only be used once (otherwise a redundant computation would be introduced), and the assigned expression must be “pure” in that it has no other side effects. In practice we also need to perform a control-flow analysis to figure out which uses of a particular variable are reachable from which assignments, and to make sure we aren’t altering program semantics.
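For instance (an illustrative sketch), the assignment below would not be forwarded, because doing so would compute sin(x) twice:

import numpy as np

rng = np.random.default_rng(seed=0)
x = rng.random(1_000_000)

u = np.sin(x)  # used twice below, so forwarding 'u' would duplicate this computation
a = u + 1.0    # the pass keeps the 'u' assignment and evaluates these normally
b = u * 2.0
print(a.sum(), b.sum())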
Conceptually, the assignments to s and t, the expression that uses them, and everything in between form nodes in the control-flow graph (CFG), and one of the analyses involved in making sure this optimization is valid checks that nothing along the CFG paths from the assignments to the use site would interfere with the forwarding transformation, such as reassignments of s or t, other expressions with side effects, etc. In practice, the data structure representing these forwarded variables and expressions is actually a tree, since there can be transitive expression forwarding opportunities, such as in the following code:
a = x + 1
b = a / 2
c = b ** 2 # really '((x + 1) / 2) ** 2'
The end result is that our new $\pi$ approximation program with variables gets optimized in exactly the same way as the original.
Future work
There are several additional features for Codon’s NumPy-specific optimizations that we have planned for the near future, such as:
- Automatic multithreading: Fused expressions would be automatically parallelized through Codon’s multithreading / OpenMP backend. This would be controlled by a compiler flag or environment variable.
- Automatic GPU acceleration: If the compiler deems it profitable to do so, fused expressions would be automatically evaluated on the GPU, assuming there’s a GPU available on the host system. This would similarly be controlled by a compiler flag or environment variable.
- Handling more operators: While the fusion pass supports nearly all of NumPy’s unary and element-wise binary operations, there are additional non-element-wise operations like matrix multiplication and transpose that can also be handled effectively. Currently, Codon handles these operations sequentially, but it’s possible to generate fused kernels at compile-time that integrate non-element-wise operations efficiently.
- Automatic differentiation: Codon’s fusion framework can also be applied to automatic differentiation by processing the produced expression syntax trees, which would be useful for various AI/ML applications.
So there’s no shortage of interesting things to work on!
Performance tips
To close out this section, here are a few performance tips to keep in mind when using Codon’s new NumPy implementation:
Array layouts
As with standard NumPy, Codon-NumPy performs best when array data is contiguous in memory, ideally in row-major order (also called “C order”, as opposed to column-major / “Fortran” order). Most NumPy functions will return C-order arrays, but operations like slicing and transposing arrays can alter contiguity. You can use numpy.ascontiguousarray() to create a contiguous array from an arbitrary array.
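For example (a small sketch), a transposed view is no longer row-major, and ascontiguousarray gives you back a C-order copy:

import numpy as np

a = np.arange(12.0).reshape(3, 4)  # C-order (row-major)
b = a.T                            # transposed view: no longer C-contiguous
c = np.ascontiguousarray(b)        # fresh C-order copy, best for subsequent operations
print(c.shape, c.strides)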
Linux huge pages
When working with large arrays on Linux, enabling transparent hugepages can result in significant performance improvements.
You can check if transparent hugepages are enabled via
cat /sys/kernel/mm/transparent_hugepage/enabled
and you can enable them via
echo "always" | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
Disabling exceptions
By default, Codon performs various validation checks at runtime (e.g. bounds checks when indexing an array), just like standard NumPy, and raises an exception if they fail. If you know your program will not raise or catch any exceptions, you can disable these checks through the -disable-exceptions compiler flag, which can sometimes yield additional optimization opportunities, e.g. for vectorization. (Note that when using this flag, raising an exception will terminate the program with a SIGTRAP.)
Fast-math
This one is a bit of a double-edged sword, but you can enable “fast-math” optimizations via the -fast-math compiler flag. Use this flag with caution, as it changes floating-point semantics and makes assumptions regarding inf and nan values. For more information, take a look at LLVM’s documentation on fast-math flags, as well as a couple of blog posts describing why it’s best to be cautious with this flag, such as here and here.
What’s next?
We’re excited to see what use cases and applications the community comes up with for Codon with its newfound NumPy support. If you haven’t used Codon before, here are some resources to get you started:
- You can download and install Codon with this command: /bin/bash -c "$(curl -fsSL https://exaloop.io/install.sh)"
- Codon is available on GitHub. Feel free to submit issues and give feedback there.
- You can find detailed docs at docs.exaloop.io.
We have a number of other projects in the works that we’re equally excited about, like a Codon-native Pandas implementation in the same spirit as Codon-NumPy. Codon’s compilation framework lends itself well to optimizing data-frame queries just as it can optimize NumPy expressions, and we’re looking forward to sharing more in the near future. Stay tuned for more updates, and happy coding!