Portable SIMD Programming in Rust

by Caleb Zulawski

Rust’s portable SIMD module enables users to write high-performance code without wading into the arcane details of instruction sets.

This book is intended for anyone who is familiar with Rust and curious about SIMD. A few people who might benefit from reading this:

  • Someone who has never used SIMD but is interested in speeding up their code’s arithmetic
  • A programmer looking to simplify a codebase with target-specific SIMD code
  • An experienced SIMD programmer who wants to branch out to other target architectures
  • Someone already familiar with portable SIMD who wants to fill out their knowledge

A quick introduction

SIMD is short for single instruction, multiple data. As the name suggests, a single SIMD instruction can operate on multiple data values simultaneously. This type of parallelism can speed up programs, and comes with much lower overhead and complexity than other types of parallelism, such as concurrency.

Consider the following function:

fn add_array(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        a[3] + b[3],
    ]
}

Without SIMD, each array element is computed separately. On x86-64, this instruction is addss:

[Figure: four separate addss instructions, each adding a single pair of elements: c[0] = a[0] + b[0], c[1] = a[1] + b[1], c[2] = a[2] + b[2], c[3] = a[3] + b[3]]

With SIMD, however, all 4 array elements can be computed with a single instruction! On x86-64, this instruction is addps:

[Figure: a single addps instruction adds all four pairs of elements at once, producing c[0] through c[3]]

SIMD operations make use of special vector registers to perform operations on multiple values at once. The elements of these vectors are also known as lanes.

Portable SIMD

Rust provides a “portable” SIMD implementation, available in the std::simd module.

What is “portable”?

Compilers, including the Rust compiler, often implement SIMD in a few different ways. To understand what “portable” means, let’s explore the different implementations.

Automatic Vectorization

One way that Rust makes use of SIMD is through an optimization called automatic vectorization. Instead of relying on any special SIMD support in the Rust language, the compiler tries to identify regular Rust code that it can turn into SIMD operations.

This is a convenient optimization because the programmer doesn’t need to do anything special to their code. The downside is that the optimization only works in specific circumstances, since the compiler must be able to determine that it can split up a particular bit of code to run in parallel.
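
For example, a simple loop like the following is a good candidate for automatic vectorization. Whether it actually vectorizes depends on the target and the optimization level, so treat this as a sketch rather than a guarantee:

// Each iteration is independent and the iterators have a known length,
// so an optimizing compiler may turn this loop into SIMD instructions.
fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((out, a), b) in out.iter_mut().zip(a).zip(b) {
        *out = a + b;
    }
}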

Vendor Intrinsics

Hardware vendors provide special functions called intrinsics which correspond to particular instructions or behavior. In Rust, these are provided by the std::arch module.

Using vendor intrinsics is a form of explicit SIMD, in which the programmer specifies exactly how the program is parallelized. Vendor intrinsics also provide complete access to features available to a particular target architecture, offering the best potential for performance. This capability comes at the cost of portability: code that uses vendor intrinsics will only work on that particular target.
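
As a sketch, the four-lane addition from the introduction could be written with x86-64 intrinsics like this. Note that this version only compiles for x86-64:

use std::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};

// The `add_array` example from the introduction, written with vendor
// intrinsics.
fn add_array(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    // SAFETY: SSE is part of the x86-64 baseline, so these intrinsics
    // are always available on this target.
    unsafe {
        let sum = _mm_add_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        let mut out = [0.0; 4];
        _mm_storeu_ps(out.as_mut_ptr(), sum);
        out
    }
}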

Portable SIMD

Rust’s portable SIMD implementation offers a middle ground between automatic vectorization and vendor intrinsics.

Portable SIMD is another form of explicit SIMD and allows the programmer to specify how the program is parallelized. Unlike vendor intrinsics, portable SIMD works on all targets. Rust’s portable SIMD vector types provide the most common SIMD operations that perform well on most targets.

In many cases, using portable SIMD can result in similar or identical compiled programs as the equivalent code using vendor intrinsics. With portable SIMD, programmers have the ability to write intricate vectorized programs and be confident that their code will not just perform well, but will also act as expected on any target architecture. Reaching for vendor intrinsics is often only necessary when taking advantage of an unusual feature provided by a specific target.
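
To make the comparison concrete, here is the add_array function from the introduction, rewritten with portable SIMD. On x86-64 this typically compiles to the single addps instruction shown earlier:

#![feature(portable_simd)]
use std::simd::f32x4;

// The same source compiles on any target; the compiler picks suitable
// instructions for each one.
fn add_array(a: f32x4, b: f32x4) -> f32x4 {
    a + b
}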

Vectors

Rust provides portable SIMD vectors with the Simd type.

Simd<T, N> can be thought of as a variation of [T; N]. Simd has the same “shape” as an array, but may have a greater alignment. In fact, Simd<T, N> is easily convertible to and from [T; N], and supports some of the same operations, such as indexing.
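
A quick sketch of that relationship:

#![feature(portable_simd)]
use std::simd::f32x4;

fn example() {
    let array: [f32; 4] = [1.0, 2.0, 3.0, 4.0];
    let vector = f32x4::from_array(array); // [T; N] -> Simd<T, N>
    assert_eq!(vector[0], 1.0);            // indexing works like an array
    assert_eq!(vector.to_array(), array);  // Simd<T, N> -> [T; N]
}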

Elementwise operations on vectors

Unlike arrays, Simd<T, N> is not just a container. While it looks like an array, it also operates like T. Vectors implement all of the same basic operators as their scalar type T, meaning you can do the following:

#![feature(portable_simd)]
use std::simd::f32x4; // an alias for Simd<f32, 4>

/// y = mx + b
fn y(m: f32x4, x: f32x4, b: f32x4) -> f32x4 {
    m * x + b
}

These operators work elementwise, meaning the operator is applied to each element separately.
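
Concretely, each element of the result is computed from the corresponding elements of the inputs. A minimal sketch using the function above:

#![feature(portable_simd)]
use std::simd::f32x4;

fn example() {
    let m = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
    let x = f32x4::splat(10.0);
    let b = f32x4::splat(0.5);
    // y[i] = m[i] * x[i] + b[i] for each element independently
    assert_eq!((m * x + b).to_array(), [10.5, 20.5, 30.5, 40.5]);
}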

Vectors even implement special functions (e.g. abs) with traits:

#![feature(portable_simd)]
use std::simd::{f32x4, SimdFloat}; // SimdFloat provides abs

/// |a - b|
fn distance(a: f32x4, b: f32x4) -> f32x4 {
    (a - b).abs()
}

Reduction operations

Sometimes you may want to perform an operation across a single vector, rather than between vectors. These operations are referred to as reductions:

#![feature(portable_simd)]
use std::simd::{f32x4, SimdFloat}; // SimdFloat provides reduce_sum

fn sum(vectors: &[f32x4]) -> f32 {
    let mut sums = f32x4::splat(0.0); // splat fills each element with the value
    for v in vectors {
        sums += v;
    }
    
    // `sums` now contains the elementwise sums, so we must sum across the vector
    sums.reduce_sum()
}

Note

Reductions are slower than elementwise operations in most cases. It’s best to use elementwise operations when possible, and use reductions only when necessary.
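
To illustrate, this slower variant of the sum function above reduces on every iteration instead of accumulating elementwise and reducing once at the end (a sketch for comparison only):

#![feature(portable_simd)]
use std::simd::{f32x4, SimdFloat};

// Slower: one reduction per loop iteration.
fn sum_slow(vectors: &[f32x4]) -> f32 {
    let mut total = 0.0;
    for v in vectors {
        total += v.reduce_sum();
    }
    total
}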

Masks

Rust also provides a “truthy” vector type: Mask<T, N>.

Although a mask's elements behave like bool, the element type of a mask is always a signed integer. The size of the integer matches the size of the vector elements that the mask can interact with: mask32x4, for example, pairs with vectors of 32-bit elements like f32x4.

Masks operate like [bool; N], but their layout is unspecified and target-specific.
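
A short sketch of how masks convert to and from arrays of bool:

#![feature(portable_simd)]
use std::simd::mask32x4; // an alias for Mask<i32, 4>

fn example() {
    let mask = mask32x4::from_array([true, false, true, false]);
    assert!(mask.test(0)); // read a single element
    assert_eq!(mask.to_array(), [true, false, true, false]);
}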

Elementwise operations with masks

Masks are typically produced by comparisons. Like vectors, they also support elementwise operations:

#![feature(portable_simd)]
// mask32x4 is an alias for Mask<i32, 4>
//
// SimdPartialOrd is the elementwise counterpart to PartialOrd,
// and provides `simd_lt`, `simd_gt`, etc.
use std::simd::{f32x4, mask32x4, SimdPartialOrd};

fn is_between(x: f32x4, lower_bound: f32x4, upper_bound: f32x4) -> mask32x4 {
    let above_lower_bound: mask32x4 = x.simd_gt(lower_bound);
    let below_upper_bound: mask32x4 = x.simd_lt(upper_bound);
    above_lower_bound & below_upper_bound
}

Reduction operations

Masks also support reduction operations:

#![feature(portable_simd)]
use std::simd::{f32x4, mask32x4, SimdFloat}; // SimdFloat provides is_infinite

fn any_infinite(vectors: &[f32x4]) -> bool {
    let mut infinite = mask32x4::splat(false);
    for v in vectors {
        infinite |= v.is_infinite();
    }
    
    // check if any element is `true`
    infinite.any()
}

Conditionally selecting elements

Masks can be used to conditionally select elements from two vectors:

#![feature(portable_simd)]
use std::simd::{f32x4, mask32x4, SimdFloat}; // SimdFloat provides is_nan

/// replace nan values with 0
fn replace_nan(x: f32x4) -> f32x4 {
    let nans: mask32x4 = x.is_nan();
    nans.select(f32x4::splat(0.0), x)
}

Vendor Intrinsics

Sometimes you can’t avoid using vendor intrinsics. In those cases, it’s easy to convert between vendor vector types and portable SIMD vector types:

#![feature(portable_simd)]
use std::arch::x86_64::{_mm_stream_ps, __m128};
use std::simd::f32x4;

unsafe fn non_temporal_store(addr: *mut f32, vector: f32x4) {
    let vendor: __m128 = vector.into(); // convert into the vendor type
    _mm_stream_ps(addr, vendor);
}
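
The conversion works in the other direction as well, so values produced by vendor intrinsics can be brought back into portable code:

#![feature(portable_simd)]
use std::arch::x86_64::__m128;
use std::simd::f32x4;

fn back_to_portable(vendor: __m128) -> f32x4 {
    vendor.into() // convert from the vendor type
}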

Target Features

For many architectures, SIMD support is an optional CPU feature. Rust supports enabling a variety of these “target features”.

To view a list of features supported by a target, run:

rustc --print target-features

To view a list of CPUs supported by a target, run:

rustc --print target-cpus

This list of target features is also available in the documentation for the target-features crate.

The following sections will address different approaches to enabling these target features.

Program-wide with RUSTFLAGS

Target features can be enabled program-wide by setting RUSTFLAGS. The following example enables avx and avx2, and disables fma:

RUSTFLAGS="-Ctarget-feature=+avx,+avx2,-fma" cargo build

Instead of targeting specific features, a particular CPU can be targeted:

RUSTFLAGS="-Ctarget-cpu=skylake" cargo build

Rust can also target your specific CPU, with the special native CPU:

RUSTFLAGS="-Ctarget-cpu=native" cargo build

Warning

Enabling features program-wide can be dangerous!

If the program runs on a CPU that does not support an enabled feature, the result is undefined behavior: the program will typically crash with an illegal instruction error.

Runtime detection of target features

To safely use an optional target feature, the program must detect it at runtime. Once a feature is detected, it can be safely used:

#[target_feature(enable = "avx")]
unsafe fn use_avx() {
    println!("This function uses AVX!")
}

fn main() {
    if is_x86_feature_detected!("avx") {
        unsafe { use_avx() }
    } else {
        println!("We can't use AVX.");
    }
}

In this example, the target_feature attribute enables the avx feature. Unlike setting the target features with RUSTFLAGS, this limits the features to particular functions. The is_*_feature_detected macros can then be used to check if the feature is supported, and safely handle the situation where the feature is not present.

The easy way, with multiversion

The multiversion crate is helpful for automatically multiversioning functions.

A multiversioned function is one that’s compiled multiple times for any number of targets, with the optimal function selected at runtime.

use multiversion::multiversion;

#[multiversion(targets(
    "x86_64+avx",
    "x86_64+sse4.2",
    "arm+neon",
))]
fn multiversioned() {
    println!("This function uses whichever features are available")
}

Multiversion also supports automatically targeting all SIMD features on all architectures:

use multiversion::multiversion;

#[multiversion(targets = "simd")]
fn multiversioned() {
    println!("This function automatically uses the best SIMD feature available")
}

For each target, a copy of the function is compiled with the appropriate target_feature attributes. At runtime, the optimal function is selected.

Info

Compared to the example in the previous section, multiversion also selects the appropriate function with less performance overhead.

Tips and tricks

Portable SIMD is intended to be an accessible yet powerful tool. There are a few common pitfalls that can affect the speed of your code. This section will address those pitfalls and provide some recommendations on how to avoid them.

Inlining and target features

Inlining and target features have a few interactions that can significantly affect the speed of your program.

#[target_feature] hinders inlining

Compilers are free to optimize functions by reordering their operations, as long as doing so doesn’t change the observable results. To prevent these optimizations from reordering operations across target feature detection, the #[target_feature] attribute prevents inlining.

Since non-inlined function calls can be slow, try to use one large function tagged with #[target_feature], rather than many separate functions. Additionally, multiple uses of #[target_feature] may not be necessary, as explained in the next section.

Info

Functions with #[target_feature] can still inline into other functions that support the same features via #[target_feature] or the -Ctarget-feature flag.

Inlined functions can inherit target features

When a function is inlined into another function tagged with #[target_feature], the inlined function is also compiled with those target features. The #[inline] attribute can be used to encourage inlining. This behavior makes it possible to write functions without worrying about what the final target features will be; in fact, this is what portable SIMD does! Functions in std::simd are inlined, allowing the user to specify the target features.

In the following example, all of the code is generated using AVX, because it’s all inlined into the function with #[target_feature]:

#![feature(portable_simd)]
use std::simd::f32x4;

#[inline]
fn double(x: f32x4) -> f32x4 {
    x * f32x4::splat(2.0) // The `splat` and `*` functions are inlined here
}

#[target_feature(enable = "avx")]
unsafe fn double_avx(x: f32x4) -> f32x4 {
    double(x) // The `double` function is inlined here
}

Target features affect calling convention

The target features of a function affect its calling convention: the way arguments and return values are passed when the function is called. Specifically, the target features determine which vector registers are used to pass vectors.

To ensure that functions with different target features can still be used in the same program, Rust passes vectors by reference instead of by value. This can significantly slow down calling functions with vector arguments because it forces the vector to be written to memory.

To avoid this behavior, avoid using vectors as arguments in non-inlined functions. Inlined functions are not affected because the function call is optimized out.
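
A sketch of the difference (the function names and bodies are illustrative):

#![feature(portable_simd)]
use std::simd::f32x4;

// Not inlined: `x` may be passed through memory, because the caller and
// callee could disagree about which vector registers are available.
#[inline(never)]
fn double_outlined(x: f32x4) -> f32x4 {
    x + x
}

// Inlined: the call disappears entirely, so no argument passing occurs.
#[inline]
fn double_inlined(x: f32x4) -> f32x4 {
    x + x
}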

Native vector width

Different architectures have different vector register sizes. To make it even more complicated, some architectures have different vector register sizes available depending on target features and element type.

Using a fixed vector size

Some algorithms might be better suited to particular vector sizes, even if it doesn’t match the native vector size. In the case of mismatched sizes, there are two possibilities:

  • If the vectors are smaller than the native vector registers, native vectors will still be used but will be partially empty. In this scenario, the full parallel capability of the target is underused.
  • If the vectors are larger than the native vector registers, multiple native vectors will be used to emulate one large vector. If too many native vectors are used, the program can spill, meaning it has run out of registers and must use (much slower) memory instead.

Using a fixed vector size is a tradeoff between these possible issues and the design of the particular algorithm.
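
For example, a sketch using a fixed 16-lane vector:

#![feature(portable_simd)]
use std::simd::f32x16;

// With AVX-512, a 16-lane f32 vector fits in a single register. On an
// SSE-only target, the compiler emulates it with four 128-bit registers,
// which can lead to spilling in register-heavy code.
fn scale(v: f32x16, factor: f32) -> f32x16 {
    v * f32x16::splat(factor)
}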

Using the native vector size

The target-features crate provides a function for determining the native vector size:

#![feature(portable_simd)]
use std::simd::Simd;
use target_features::CURRENT_TARGET;

// Different element types can have different vector sizes. Here we use `f32`.
const N: usize = if let Some(size) = CURRENT_TARGET.suggested_vector_width::<f32>() {
    size
} else {
    // If SIMD isn't supported natively, we use a vector of 1 element.
    // This is effectively a scalar value.
    1
};

/// Now we can use `Vector` instead of a particular vector type.
type Vector = Simd<f32, N>;

When detecting features, the multiversion crate provides the same capability:

#![feature(portable_simd)]
use multiversion::{multiversion, target::selected_target};
use std::simd::Simd;

#[multiversion(targets = "simd")]
fn example() {
    // The `selected_target` macro takes into account the detected optional features,
    // and not just the base features used in the previous example.
    const N: usize = if let Some(size) = selected_target!().suggested_vector_width::<f32>() {
        size
    } else {
        1
    };

    /// Once again, we can use `Vector` instead of a particular vector type.
    type Vector = Simd<f32, N>;
}