Write fast and maintainable code with Halide - Part 2
Introduction
Halide is an open-source programming language designed to make it easier to write high-performance image processing or array processing code on modern machines.
In the last article of this series
I explained:
- What is Halide?
- What is the need for Halide?
- How Halide addresses the need.
In this article I’ll dig deeper and share some general concepts in Halide - key data types and different schedule primitives. For different schedule primitives - example of generated code and demo of code execution is included.
This article is pretty much a fork of the documentation and tutorials shared by the Halide development team. I have tried to summarize some of the primary concepts for the readers and my own future reference.
Important Disclaimer: Any opinion called out in this article are my own and don’t reflect opinion or stance of the organizations I work with.
Halide concepts - the general structure
I’ll be covering this topic just at the surface, for learning more please checkout the tutorials at halide-lang.org/tutorials.
A Halide code typically have some input(s)
, one or more output(s)
, some expressions that expresses the conversion of input(s)
to output(s)
and one or more schedules defining how the algorithm should run. We can define multiple schedules each targeting a different subset of hardware. Also, Halide code typically have three key components:
Func
: AFunc
object represents a pipeline stage. It’s a pure function that defines what value each pixel should have. You can think of it as a computed image.Var
:Var
objects are names to use as variables in the definition of a Func. They have no meaning by themselves. We generally usex
andy
as variables that define thex axes
andy axes
of images.Expr
:Expr
allows us to define complex expressions usingVar
s.
I’ll use the gradient example shared in the tutorials as example here:
int main(int argc, char **argv) {
Halide::Func gradient; // Func defined.
Halide::Var x, y; // Vars defined.
Halide::Expr e = x + y; // An expression for gradient defined, this is
// not really needed in this example, but serves the purpose.
gradient(x, y) = e;
// This is the same as writing:
// gradient(x, y) = x + y;
// Run it.
Halide::Buffer<int32_t> output = gradient.realize({800, 600});
return 0;
}
This should generate an image, something like this
Figure: Gradient image generated with f(x, y) = x + y with matplotlib (cmap=vidris).
Schedules
Schedule is Halides way of defining “how to run the algorithm efficiently on a particular machine”.
Following content is summary of the larger tutorial at halide-lang.org > tutorials > lesson 5
For the above gradient()
example, by default the code will be executed in sequence in row major iteration. The generated code shall look something like:
for (int y = 0; y < HEIGHT; ++y) {
for (int x = 0; x < WIDTH; ++x) {
// gradient
}
}
Figure: Execution order of the gradient example - default. Source: halide-lang.org, Apache license.
Let’s take a look at different schedule options.
Reorder
gradient.reorder(x, y);
This reorders the sequence of iteration, in this the loop shall be executed in column major fashion. The generated code will instead look like
for (int x = 0; x < WIDTH; ++x) {
for (int y = 0; y < HEIGHT; ++y) {
// gradient
}
}
Figure: Execution order of the gradient example - column major iteration. Source: halide-lang.org, Apache license.
This kind of change usually have a large impact on the performance based on how the CPU leverages caches.
Split
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 2);
This is a very powerful primitive and allows us to split a loop in certain dimension in multiple parts. The last argument here is called the “split factor”. In this case the generated code changes to something like
for (int y = 0; y < HEIGHT; ++y) {
for (int x_outer = 0; x_outer < X_OUTER_BOUNDS; x_outer++) {
for (int x_inner = 0; x_inner < X_INNER_BOUNDS; x_inner++) {
int x = x_outer * X_INNER_BOUNDS + x_inner;
// gradient.
}
}
}
Fuse
Var fused;
gradient.fuse(x, y, fused);
This is opposite of splitting, in this case we fuse two variables and merges the two loop into a single for loop over the product of the extents. The generated code looks like
for (int fused = 0; fused < WIDTH * HEIGHT; ++fused) {
int y = fused / WIDTH;
int x = fused % WIDTH;
// gradient
}
Tiling
Var x_outer, x_inner, y_outer, y_inner;
gradient.split(x, x_outer, x_inner, 4);
gradient.split(y, y_outer, y_inner, 4);
gradient.reorder(x_inner, y_inner, x_outer, y_outer);
// Shorthand:
Var x_outer, x_inner, y_outer, y_inner;
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
With both split and reorder in hand, we can now do tiled evaluations. The above schedule allows us to split both x
and y
loop by a factor of 4 and reorder the order of execution. The generated code is now complex but should look something like
for (int y_outer = 0; y_outer < 2; y_outer++) {
for (int x_outer = 0; x_outer < 2; x_outer++) {
for (int y_inner = 0; y_inner < 4; y_inner++) {
for (int x_inner = 0; x_inner < 4; x_inner++) {
int x = x_outer * 4 + x_inner;
int y = y_outer * 4 + y_inner;
// gradient
}
}
}
}
Figure: Execution order of the gradient example - tiling. Source: halide-lang.org, Apache license.
Vectorization
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 4);
gradient.vectorize(x_inner);
// Shorthand:
gradient.vectorize(x, 4);
We can instruct the compiler to now generate instructions to vectorize the inner loops in x
for loops.
This generates code like
for (int y = 0; y < HEIGHT; y++) {
for (int x_outer = 0; x_outer < X_OUTER_BOUND; x_outer++) {
for (int x_inner = 0; x_inner < X_INNER_BOUNDS / 4; x_inner+=4) {
int x = x_outer * X_INNER_BOUNDS + x_inner;
int x_vec[] = {x, x + 1, x + 2, x + 3};
int val[] = {
x_vec[0] + y,
x_vec[1] + y,
x_vec[2] + y,
x_vec[3] + y
};
// gradient.
}
}
}
Figure: Execution order of the gradient example - vectorization. Source: halide-lang.org, Apache license.
Unrolling a loop
Var x_outer, x_inner;
gradient.split(x, x_outer, x_inner, 2);
gradient.unroll(x_inner);
// Shorthand:
gradient.unroll(x, 2);
We can easily guide the compiler to unroll the loop by using unroll
primitive.
Fusing, tiling, and parallelizing
Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4);
gradient.fuse(x_outer, y_outer, tile_index);
gradient.parallel(tile_index);
// Equivalent of:
Var x_outer, y_outer, x_inner, y_inner, tile_index;
gradient
.tile(x, y, x_outer, y_outer, x_inner, y_inner, 4, 4)
.fuse(x_outer, y_outer, tile_index)
.parallel(tile_index);
We can basically combine all of these construct depending on the data independence to achieve best performance. This is where tiling and fusing shine. We can parallelize across the tiles.
The above mentioned schedule generates code like
// This outermost loop should be a parallel for loop (hard to show here)
for (int tile_index = 0; tile_index < 4; tile_index++) {
int y_outer = tile_index / 2;
int x_outer = tile_index % 2;
for (int y_inner = 0; y_inner < 4; y_inner++) {
for (int x_inner = 0; x_inner < 4; x_inner++) {
int y = y_outer * 4 + y_inner;
int x = x_outer * 4 + x_inner;
// Gradient
}
}
}
Figure: Execution order of the gradient example - Fusing, tiling and vectorizing. Source: halide-lang.org, Apache license.
Putting it all together
Further combining all the constructs for a large image:
Var x_outer, y_outer, x_inner, y_inner, tile_index;
// We'll process 64x64 tiles in parallel.
gradient
.tile(x, y, x_outer, y_outer, x_inner, y_inner, 64, 64)
.fuse(x_outer, y_outer, tile_index)
.parallel(tile_index);
// We'll compute two scanlines at once while we walk across
// each tile. We'll also vectorize in x. The easiest way to
// express this is to recursively tile again within each tile
// into 4x2 subtiles, then vectorize the subtiles across x and
// unroll them across y:
Var x_inner_outer, y_inner_outer, x_vectors, y_pairs;
gradient
.tile(x_inner, y_inner, x_inner_outer, y_inner_outer, x_vectors, y_pairs, 4, 2)
.vectorize(x_vectors)
.unroll(y_pairs);
And it gets executed something like this
Figure: Putting it all together. Source: halide-lang.org, Apache license.
Shout out to the development team
A huge shout out to the Jonathan Ragan-Kelly and team for coming up with Halide and making it open source. Much of the content in this article is derived from their work published as tutorials at halide-lang.org/tutorials.
References
- Halide - halide-lang.org
- Halide tutorials
- Halide: decoupling algorithms from schedules for high-performance image processing - Open access research article by Jonathan Ragan-Kelley et. al.
- Some of my relevant articles - Processing images fast with native code in Android - How to use RenderScript to convert YUV_420_888 YUV Image to Bitmap - Faster image processing in Android Java using multi threading
Want to read more such similar contents?
I like to write articles on topic less covered on internet. They revolve around writing fast algorithms, image processing as well as general software engineering.
I publish many of them on Medium.
If you are already on medium - Please join 4200+ other members and Subscribe to my articles to get updates as I publish.
If you are not on Medium - Medium has millions of amazing articles from 100K+ authors. To get access to those, please join using my referral link. This will give you access to all the benefits of Medium and Medium shall pay me a piece to support my writing!
Thanks!