This project is essentially an attempt to recreate some of the Halide project in Rust as a means of learning that Rust language. Halide is a really clever C++ library that allows programmers to define image processing algorithms in domain specific language which are compiled according to some sort of execution strategy. These strategies might be “tile for cache efficiency” or “optimize for execution on a GPU.” The project is definitely worth poking at for a few minutes.
The project I will be discussing in this blog post is an implementation of the first “half” of Halide, using Rust. Specifically, I’ve implemented a simple DSL for image processing which is JIT compiled with LLVM. I picked this project mostly to learn rust, so my result is certainly not production code but it may still be interesting to read a bit about.
Before jumping into a discussion about how all of this works, lets look at an example of the DSL. In this example, we will define the sobel operator, then process an images with it. For a great overview of the sobel operator, check out this article.
In the DSL, there are things to worry about: Functions and Chains. A function is a single unit of work that takes an $(x,y)$ coordinate and an arbitrary number of inputs. For example, suppose we have a function $Grad(x,y)$ that returns the magnitude of the gradient of two images $I_1$ and $I_2$ at the point $(x,y)$. We might denote this function with mathematical notation as:
In the DSL I have defined, we would denote this operation in a similar manner, sans syntactic differences:
This isn’t the most beautiful way to build a representation of our function, but it works and I learned a lot implementing the magic that makes it work.
Each function stores is a syntax tree representing the expression that the function computes.
The syntax tree defined by the
grad function looks something like this:
There is also helper function that can be used to generate functions which perform a convolution on a single image with a kernel matrix For example, to generate a function that takes a single image as input and returns the convolution with the horizontal sobel matrix, use the following code:
Notice that these functions are defined in a purely functional, mathematical sense. They do not mutate their inputs, nor do they store any state, nor are they coupled to any particular inputs.
Now that we have some abstract functions, we need to compose functions to create something meaningful.
The composition of functions in my DSL is called a function chain.
Chains may be thought of as a stream of pixels, starting from
ImageSources, flowing through a number of transformation functions, and finally, resulting in a new image.
ImageSources define the starting inputs for the entire chain.
Then, any number of functions may be chained together.
For example, the entire sobel image processing chain looks something like this:
Compiling and executing the chain
Now that we have built a chain representing the entire sobel algorithm, we only need to compile the chain and use the chain to process an image:
.compile() on an image chain compiles each function in the chain provided into an LLVM module, optimizes the module with LLVM’s optimizer, and uses LLVM’s MCJIT to compile to machine code.
A compiled chain essentially just holds a function pointer to a function which will be called when the chain is executed (and some things used for bookkeeping).
The only work I had to do to go from AST to function pointer is code generation.
For this reason, LLVM is decidedly awesome.
Note: For the full sobel code, see sobel.rs.
Details, Details, Details
There’s lots of little details which may be interesting to discuss, but I’m only going to discuss the compilation method. First, we need some slightly more rigorous definitions of things:
- All of the user defined functions take two 64 bit integer values and a list of inputs. They return a 64 bit integer value.
- Because functions themselves have no concept of chaining (the gradient function doesn’t call
sobel_y), the “inputs” to a user defined function can be thought of as function-pointers which will eventually be resolved to real functions, although this is not how they are implemented.
- All of the system-defined expressions (eg.
SqrtExpr) take a 64 bit integer and return a 64 bit integer.
The compilation strategy for the DSL is very simple: Every DSL function is compiled into a function with a signature that would look something like this in C:
The array of image inputs provided here is not equivalent to the list of the inputs given to the DSL function.
The inputs given to the DSL function are resolved to other compiled functions (using the chain) during code generation, so our generated
grad function will directly call the
Since every value is a 64 bit integer, the code generation for an expression essentially just involves spitting out adds and multiplies for integers.
grad code roughly corresponds to:
Execution and image inputs
A driver function is injected into the module. This function performs some bookkeeping tasks, then just loops over the pixels in the output image, calling the appropriate function (whichever was last in the chain) for every pixel:
Image inputs (the actual images we are processing), are passed to each function.
When the compiler reaches an
ImageSource in the function chain, it emits a call to a function which returns the pixel in the image at a given $(x,y)$ coordinate.
For anyone interested, I’ve dumped the entire LLVM IR module for an unoptimized sobel chain here.
Some of the code is generated from the file core.c in the github repo for the project, if you need some hints to figure out what’s going on here.
The entry point is the function
There’s lots of other interesting little idiosyncrasies in this code but I don’t have space and you don’t have time to read about all of them.
Anyone who knows a little bit about computers and performance is probably hurting a little bit thinking about how this might perform.
You’ve noticed all of the function calls, don’t these have lots of overhead?
You’ve noticed that I’m computing the
sobel_y values twice in the gradient function.
Don’t worry, it isn’t quite so bad.
Anyone who knows a fair amount about computers and performance noticed that
inline keyword and is wondering if I’m somehow relying on function inlining to extract performance from this technique.
The answer is yes.
Every generated function is marked with an LLVM attribute
AlwaysInline which, when combined with the appropriate LLVM optimization passes, guarantees that these functions will always be inlined into their caller.
Aside: Function inlining
For those who are not totally familiar with the concept of function ininling, here’s a quick example (note that the
inline keyword in C doesn’t guarantee this behavior, it is just a hint to the compiler):
It may seem that this optimization is useful because it removes function call overhead.
This is true, but it isn’t the only critical reason that the optimization is useful.
Many compiler optimizations cannot (or do not) cross function boundaries.
Instead, they often view functions as black boxes about which nothing can be known (this is obviously an oversimplification).
This often makes sense because functions may be defined in different compilations units or in shared libraries, where the compiler cannot access their source.
Function inlining allows the compiler to “see” inside functions, then perform additional optimizations which would not have been possible otherwise.
For example, because the call to
foo has been inlined, the compiler can now (easily) optimize the function
Aggressive function inlining gives me lots of freedom in my code generation.
I can generate code which is totally inefficient, then inline everything and let the compiler do some of its magic.
Of course, this isn’t a general rule, but for this problem the generated code is highly uniform, doesn’t do much with memory (other than reading from
readonly images), and has a few other compiler freindly properties.
At the end of the day, LLVM is doing a pretty good job turning my functional style code into a big fat loop and eliminating redundant computations.
If you’re interested in looking at the optimized sobel LLVM module, here it is: gist.
To benchmark this code, I compared the JITed code with an implementation of the exact same thing written directly in Rust. My benchmarking is not extremely rigorous, but I’ve taken steps to try to create an honest benchmark.
- single core, 3.75GB RAM Google cloud compute virtual machine
- Ubuntu 16.04
- Rust stable (1.10.0)
- LLVM 3.9, built from source. Release build with assertions disabled
- Sobel code presented above
The benchmark input was a 1.2 gig collection of 3255 images of various sizes, ranging from 160x120 to 6922x6922 pixels. The image sizes were mixed to try to stave off cache effects and other size-related effects so that I could hopefully just use averages to compare performance.
Long story short, the average JIT/native speedup is 1.05x, so the LLVM JITed code is 1.05x faster than the direct Rust implementation (this AST construction time and compile time). This means that my JIT compiled code runs at the same speed (subject to some jitter) as the native rust code.
Here is a plot of image vs average speedup (the images are sorted by the total number of pixels in the image):
There are many more plots, but the overall conclusion is pretty clear: compared to the Rust, I’m not performing very poorly. Is this a win? I am not sure, I would need to do many more comparisons. These results do indicate to me that I have at least achieved reasonable performance, with a dramatically different programming style.
Final performance notes
It should be noted that these results are not entirely surprising. Rust is also using LLVM as a backend. It is probably reasonable to assume that the code Rust is generating looks pretty similar to the code I am generating, although I have not verified this.
If you’ve been nodding your head along with me, I have a confession to make: I’ve tricked you a little bit. LLVM is doing an awesome job (considering the code I’ve generated), but I’m certainly missing out on lots of opportunities for performance because of my code generation technique. Also, LLVM (or any compiler) should never be expected to be able to totally understand the problem a piece of code is trying to solve and optimize it perfectly. To really get good performance, I would need to pay attention to caching and quite a few other things which I have totally ignored. Hand tuned code should (and certainly would) run in circles around the code JIT compiled algorithm I’ve generated here.
If you want something that gives you an awesome DSL AND all sorts of control over cache scheduling and whatnot, take a look at Halide. If you have no idea what I’m talking about or why any of this matters, take a look at Halide anyway. The Halide talks give fantastic descriptions of many of the problems it aims to solve.
Overall, this project was extremely enjoyable. I had yet another opportunity to fiddle with LLVM, which is always lots of fun (but sometimes very painful). I learned a little bit about image processing and some of the challenges that arise when shuffling pixels around. Finally, I learned a little bit of Rust. I have only one thing to say about Rust: Rust is an amazing language. Go learn Rust.