Vector API (JEP 448): SIMD Computation in Java
Incubator Feature in Java 21 — The Vector API has been incubating since Java 16 (JEP 338). JEP 448 is the sixth incubator iteration in Java 21. The API is stable and production-usable with --add-modules jdk.incubator.vector; finalization is pending Project Valhalla value types.
What Is SIMD and Why Does It Matter?
Modern CPUs can perform the same arithmetic operation on multiple data values in a single instruction. This is called SIMD — Single Instruction, Multiple Data. Instead of adding two pairs of floats in two instructions, an AVX2-equipped x86 processor can add eight pairs of floats in one instruction.
Java has always compiled scalar arithmetic to scalar instructions. A loop adding two float[] arrays would generate one ADDSS (scalar) instruction per element. The Vector API lets you write code that the JIT compiles to VADDPS (vector) instructions — yielding 4–16× throughput depending on your hardware and data type.
The Performance Gap
// Scalar: one float added per instruction
float[] a = ..., b = ..., c = new float[N];
for (int i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
// Vector (AVX2): 8 floats added per instruction
var species = FloatVector.SPECIES_256;
for (int i = 0; i < N; i += species.length()) {
var va = FloatVector.fromArray(species, a, i);
var vb = FloatVector.fromArray(species, b, i);
va.add(vb).intoArray(c, i);
}
On an AVX2 machine, the vector loop processes 8 floats per iteration vs 1 — roughly 8× the throughput for this operation.
When SIMD Applies
SIMD delivers the most value for:
- Numerical computation: signal processing, image filters, ML inference
- Bulk data operations: hashing, checksum computation, encryption primitives
- Scientific computing: matrix math, FFT, physics simulations
- Data transformation: character encoding, compression, serialization
For general business logic with branches, object graphs, and mixed types, the Vector API provides no benefit.
JEP 448: Sixth Incubator
JEP 448 continues the API from JEP 438 (Java 20) with minor refinements:
- Masked operations are more ergonomic
- VectorMask factory methods are expanded
- Documentation and error messages are improved
- No breaking API changes from JEP 438
To use the Vector API in Java 21, add the incubator module at both compile time and run time:
javac --add-modules jdk.incubator.vector VectorExample.java
java --add-modules jdk.incubator.vector VectorExample
Or with Maven:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<compilerArgs>
<arg>--add-modules</arg>
<arg>jdk.incubator.vector</arg>
</compilerArgs>
<release>21</release>
</configuration>
</plugin>
Core Concepts
VectorSpecies
A VectorSpecies<E> describes the element type and vector shape (bit width). It is the entry point to all vector operations.
import jdk.incubator.vector.*;
// 128-bit vector of floats: 4 lanes
VectorSpecies<Float> SPECIES_128 = FloatVector.SPECIES_128;
// 256-bit vector of floats: 8 lanes
VectorSpecies<Float> SPECIES_256 = FloatVector.SPECIES_256;
// 512-bit vector of floats: 16 lanes
VectorSpecies<Float> SPECIES_512 = FloatVector.SPECIES_512;
// Preferred species: widest vector the current CPU supports
VectorSpecies<Float> PREFERRED = FloatVector.SPECIES_PREFERRED;
Always prefer SPECIES_PREFERRED in production code. It selects the widest vector width the JVM detects at startup: SPECIES_256 on a typical AVX2 machine, SPECIES_512 with AVX-512, and SPECIES_128 on hardware with only SSE2 or NEON. Hard-coding SPECIES_512 on a machine without AVX-512 forces the JIT to emulate it in software — often slower than scalar.
// Good: adapts to hardware
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
// Bad: may force software emulation on older hardware
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_512;
Vector
A Vector<E> is an immutable fixed-width collection of primitive values. You cannot create one with new; use factory methods on the species or concrete vector types.
// From an array (SPECIES_256 has 8 float lanes, matching the array length)
float[] arr = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
FloatVector v = FloatVector.fromArray(FloatVector.SPECIES_256, arr, 0);
// Broadcast: every lane gets the same value
FloatVector ones = FloatVector.broadcast(FloatVector.SPECIES_256, 1.0f);
// From individual lane values, via an inline array
FloatVector v2 = FloatVector.fromArray(FloatVector.SPECIES_256, new float[] {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f}, 0);
Lane Operations
Operations are applied elementwise across all lanes:
FloatVector a = FloatVector.fromArray(SPECIES, arrayA, i);
FloatVector b = FloatVector.fromArray(SPECIES, arrayB, i);
FloatVector sum = a.add(b);
FloatVector diff = a.sub(b);
FloatVector product = a.mul(b);
FloatVector quot = a.div(b);
FloatVector absA = a.abs();
FloatVector negA = a.neg();
FloatVector sqrtA = a.sqrt();
FloatVector fma = a.fma(b, c); // fused multiply-add: a*b + c
Writing Results Back
// Write to array
result.intoArray(outputArr, i);
// Write to MemorySegment (integrates with FFM API)
result.intoMemorySegment(segment, byteOffset, ByteOrder.nativeOrder());
The Standard Loop Pattern
Almost all Vector API code follows the same structure:
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
static float[] vectorAdd(float[] a, float[] b) {
float[] result = new float[a.length];
int i = 0;
// Vector loop: process SPECIES.length() elements per iteration
int upperBound = SPECIES.loopBound(a.length);
for (; i < upperBound; i += SPECIES.length()) {
FloatVector va = FloatVector.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
va.add(vb).intoArray(result, i);
}
// Scalar tail: handle remaining elements
for (; i < a.length; i++) {
result[i] = a[i] + b[i];
}
return result;
}
Key points:
- SPECIES.loopBound(n) returns the largest multiple of SPECIES.length() that is ≤ n. This prevents reading past the array end.
- The scalar tail handles the remaining n % SPECIES.length() elements.
- The tail can also be handled with masked operations (see below).
Masked Operations
A VectorMask<E> selects which lanes participate in an operation. Masks are essential for the loop tail and for conditional computation.
Tail Handling with Masks
static float[] vectorAddMasked(float[] a, float[] b) {
float[] result = new float[a.length];
int i = 0;
for (; i < a.length; i += SPECIES.length()) {
// Create mask: true for lanes that are within bounds
VectorMask<Float> mask = SPECIES.indexInRange(i, a.length);
FloatVector va = FloatVector.fromArray(SPECIES, a, i, mask);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i, mask);
va.add(vb).intoArray(result, i, mask);
}
return result;
}
SPECIES.indexInRange(i, length) produces a mask where lane k is active if i + k < length. This handles partial vectors at the end without a separate scalar tail.
Conditional Lane Operations
// Replace negative values with zero (ReLU activation)
static float[] relu(float[] input) {
float[] output = new float[input.length];
FloatVector zero = FloatVector.zero(SPECIES);
int i = 0;
int bound = SPECIES.loopBound(input.length);
for (; i < bound; i += SPECIES.length()) {
FloatVector v = FloatVector.fromArray(SPECIES, input, i);
// mask: true where v > 0
VectorMask<Float> positive = v.compare(VectorOperators.GT, 0f);
// blend: take v where positive is true, zero elsewhere
v.blend(zero, positive.not()).intoArray(output, i);
}
for (; i < input.length; i++) {
output[i] = Math.max(0f, input[i]);
}
return output;
}
Reduction with Masks
// Sum only positive elements
static float sumPositive(float[] arr) {
float total = 0f;
int i = 0;
int bound = SPECIES.loopBound(arr.length);
for (; i < bound; i += SPECIES.length()) {
FloatVector v = FloatVector.fromArray(SPECIES, arr, i);
VectorMask<Float> pos = v.compare(VectorOperators.GT, 0f);
total += v.reduceLanes(VectorOperators.ADD, pos);
}
for (; i < arr.length; i++) {
if (arr[i] > 0) total += arr[i];
}
return total;
}
Reductions
Reductions collapse a vector to a single scalar value across all lanes.
FloatVector v = FloatVector.fromArray(SPECIES, data, i);
float sum = v.reduceLanes(VectorOperators.ADD);
float max = v.reduceLanes(VectorOperators.MAX);
float min = v.reduceLanes(VectorOperators.MIN);
float prod = v.reduceLanes(VectorOperators.MUL);
// Integer OR/AND
IntVector iv = IntVector.fromArray(IntVector.SPECIES_PREFERRED, intData, i);
int orResult = iv.reduceLanes(VectorOperators.OR);
int andResult = iv.reduceLanes(VectorOperators.AND);
Reductions are useful for finding minima/maxima, computing checksums, or aggregating statistics across arrays.
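A minimal sketch putting reductions to work — a vectorized array maximum, assuming SPECIES_PREFERRED as in the earlier examples (class and method names here are illustrative):

```java
import jdk.incubator.vector.*;

class VectorMaxExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float vectorMax(float[] arr) {
        // Start at -Infinity so any real element wins the comparison
        FloatVector acc = FloatVector.broadcast(SPECIES, Float.NEGATIVE_INFINITY);
        int i = 0;
        int bound = SPECIES.loopBound(arr.length);
        for (; i < bound; i += SPECIES.length()) {
            // Lanewise max keeps a running per-lane maximum
            acc = acc.max(FloatVector.fromArray(SPECIES, arr, i));
        }
        // Collapse the per-lane maxima into one scalar
        float max = acc.reduceLanes(VectorOperators.MAX);
        // Scalar tail for the leftover elements
        for (; i < arr.length; i++) {
            max = Math.max(max, arr[i]);
        }
        return max;
    }
}
```

Note the pattern: the reduction happens once, after the loop — reducing inside the loop would throw away most of the SIMD benefit.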
Type Conversions and Cross-Lane Operations
Type Casting
FloatVector fv = FloatVector.fromArray(SPECIES_256, floatArr, i);
// Convert float lanes to int lanes (same bit width)
IntVector iv = (IntVector) fv.convert(VectorOperators.F2I, 0);
// Convert float to double (doubles wider — need two parts)
DoubleVector dvLow = (DoubleVector) fv.convert(VectorOperators.F2D, 0); // low half
DoubleVector dvHigh = (DoubleVector) fv.convert(VectorOperators.F2D, 1); // high half
The second argument to convert is the “part” index — necessary when the output element type is wider than the input.
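A sketch of how the part index is used across a full loop (method and class names illustrative): widening a float[] to a double[] consumes one float vector and produces two double vectors of the same bit width per iteration:

```java
import jdk.incubator.vector.*;

class WidenExample {
    private static final VectorSpecies<Float> F_SPECIES = FloatVector.SPECIES_PREFERRED;

    static double[] widen(float[] src) {
        double[] dst = new double[src.length];
        // A double vector of the same bit width holds half as many lanes
        int half = F_SPECIES.length() / 2;
        int i = 0;
        int bound = F_SPECIES.loopBound(src.length);
        for (; i < bound; i += F_SPECIES.length()) {
            FloatVector fv = FloatVector.fromArray(F_SPECIES, src, i);
            // Part 0 widens the low half of the lanes, part 1 the high half
            ((DoubleVector) fv.convert(VectorOperators.F2D, 0)).intoArray(dst, i);
            ((DoubleVector) fv.convert(VectorOperators.F2D, 1)).intoArray(dst, i + half);
        }
        // Scalar tail
        for (; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }
}
```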
Lane Permutations and Shuffles
// Reverse lane order
VectorShuffle<Float> reverseOrder = VectorShuffle.fromOp(SPECIES, i -> SPECIES.length() - 1 - i);
FloatVector reversed = v.rearrange(reverseOrder);
// Rotate lanes left by 1
FloatVector rotated = v.rearrange(VectorShuffle.iota(SPECIES, 1, 1, true));
// Broadcast lane 0 to all lanes (every lane reads source lane 0)
FloatVector broadcast0 = v.rearrange(VectorShuffle.fromOp(SPECIES, lane -> 0));
Shuffles compile to vpermps / vpshufb instructions on x86. Use them for matrix transposes, convolution kernels, and data layout transformations.
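A small sketch of the technique (names illustrative): swapping adjacent lanes, a building block for interleaved data such as complex numbers or stereo samples. Because SPECIES.length() is always even, the vector loop never splits a pair:

```java
import jdk.incubator.vector.*;

class SwapExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
    // Lane i reads from lane i ^ 1, swapping pairs (0,1), (2,3), ...
    private static final VectorShuffle<Float> SWAP_PAIRS =
            VectorShuffle.fromOp(SPECIES, i -> i ^ 1);

    static float[] swapAdjacent(float[] data) {
        float[] out = new float[data.length];
        int i = 0;
        int bound = SPECIES.loopBound(data.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector.fromArray(SPECIES, data, i)
                       .rearrange(SWAP_PAIRS)
                       .intoArray(out, i);
        }
        // Scalar tail: swap remaining complete pairs
        for (; i + 1 < data.length; i += 2) {
            out[i] = data[i + 1];
            out[i + 1] = data[i];
        }
        if (i < data.length) out[i] = data[i]; // odd leftover copied unchanged
        return out;
    }
}
```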
Integer Vectors
The Vector API covers all primitive types. Integer vectors are common in hashing, compression, and bitwise operations.
private static final VectorSpecies<Integer> INT_SPECIES = IntVector.SPECIES_PREFERRED;
// Bitwise operations
IntVector a = IntVector.fromArray(INT_SPECIES, dataA, i);
IntVector b = IntVector.fromArray(INT_SPECIES, dataB, i);
IntVector anded = a.and(b);
IntVector ored = a.or(b);
IntVector xored = a.xor(b);
IntVector shifted = a.lanewise(VectorOperators.LSHL, 2); // logical shift left 2
Vectorized Popcount (Bit Count)
static int[] vectorPopcount(int[] data) {
int[] result = new int[data.length];
int i = 0;
int bound = INT_SPECIES.loopBound(data.length);
for (; i < bound; i += INT_SPECIES.length()) {
IntVector v = IntVector.fromArray(INT_SPECIES, data, i);
v.lanewise(VectorOperators.BIT_COUNT).intoArray(result, i);
}
for (; i < data.length; i++) {
result[i] = Integer.bitCount(data[i]);
}
return result;
}
Real-World Example: Dot Product
Computing a dot product is a textbook SIMD case — multiply corresponding elements and sum all products.
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
static float dotProduct(float[] a, float[] b) {
assert a.length == b.length;
FloatVector accumulator = FloatVector.zero(SPECIES);
int i = 0;
int bound = SPECIES.loopBound(a.length);
for (; i < bound; i += SPECIES.length()) {
FloatVector va = FloatVector.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
accumulator = va.fma(vb, accumulator); // fused multiply-add: no extra rounding
}
float result = accumulator.reduceLanes(VectorOperators.ADD);
for (; i < a.length; i++) {
result += a[i] * b[i];
}
return result;
}
fma (fused multiply-add) computes a * b + c in a single hardware instruction, avoiding intermediate rounding. On AVX-512, this runs 16 FMAs per iteration.
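One refinement worth knowing — a sketch, not part of the baseline above: with a single accumulator, each fma must wait for the previous one, so the loop is bound by FMA latency. Splitting the sum across two independent accumulators lets the CPU overlap FMAs (method and class names illustrative):

```java
import jdk.incubator.vector.*;

class DotUnrolledExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dotProductUnrolled(float[] a, float[] b) {
        FloatVector acc0 = FloatVector.zero(SPECIES);
        FloatVector acc1 = FloatVector.zero(SPECIES);
        int step = SPECIES.length() * 2;          // two vectors per iteration
        int bound = a.length - (a.length % step); // multiple of two vector widths
        int i = 0;
        for (; i < bound; i += step) {
            // Two independent FMA chains: acc0 and acc1 do not wait on each other
            acc0 = FloatVector.fromArray(SPECIES, a, i)
                              .fma(FloatVector.fromArray(SPECIES, b, i), acc0);
            acc1 = FloatVector.fromArray(SPECIES, a, i + SPECIES.length())
                              .fma(FloatVector.fromArray(SPECIES, b, i + SPECIES.length()), acc1);
        }
        float result = acc0.add(acc1).reduceLanes(VectorOperators.ADD);
        // Scalar tail covers up to (2 * lane count - 1) leftover elements
        for (; i < a.length; i++) {
            result += a[i] * b[i];
        }
        return result;
    }
}
```

Note that changing the accumulation order changes floating-point rounding slightly, as any vectorized sum already does.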
Real-World Example: RGB to Grayscale
Image processing is a classic SIMD domain. Converting RGB pixels to grayscale:
gray = 0.299 * R + 0.587 * G + 0.114 * B
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
static float[] rgbToGrayscale(float[] r, float[] g, float[] b) {
int n = r.length;
float[] gray = new float[n];
FloatVector wr = FloatVector.broadcast(SPECIES, 0.299f);
FloatVector wg = FloatVector.broadcast(SPECIES, 0.587f);
FloatVector wb = FloatVector.broadcast(SPECIES, 0.114f);
int i = 0;
int bound = SPECIES.loopBound(n);
for (; i < bound; i += SPECIES.length()) {
FloatVector vr = FloatVector.fromArray(SPECIES, r, i);
FloatVector vg = FloatVector.fromArray(SPECIES, g, i);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
// Equivalent to: gray[i] = wr*r[i] + wg*g[i] + wb*b[i]
vr.mul(wr).add(vg.mul(wg)).add(vb.mul(wb)).intoArray(gray, i);
}
for (; i < n; i++) {
gray[i] = 0.299f * r[i] + 0.587f * g[i] + 0.114f * b[i];
}
return gray;
}
How the JIT Compiles Vector Code
The Vector API is designed so that the JIT can lower FloatVector.add() directly to hardware vector instructions. This works because:
- Intrinsics: The HotSpot JIT has built-in intrinsics for each Vector API operation. FloatVector.fromArray compiles to vmovups; .add() compiles to vaddps.
- No heap allocation: Vector objects are scalar-replaced by the JIT — they never actually live on the heap. No GC pressure.
- Auto-vectorization overlap: The JIT’s auto-vectorizer and the Vector API use the same backend. The difference is that you control the vector width and operation selection explicitly.
Verifying with -XX:+PrintCompilation
java --add-modules jdk.incubator.vector -XX:+PrintCompilation -XX:+PrintInlining VectorApp 2>&1 | grep -i vector
You should see your vector methods compiled with C2 (the optimizing JIT compiler), not interpreted.
Verifying with Assembly
java --add-modules jdk.incubator.vector \
-XX:+UnlockDiagnosticVMOptions \
-XX:CompileCommand="print,*.vectorAdd" \
VectorApp
Look for vaddps (AVX float add), vmulps (AVX float multiply), or vfmadd (FMA) in the output — these confirm SIMD is active. (Printing real disassembly requires the hsdis plugin on the JVM's library path.)
Benchmarking with JMH
Never benchmark Java with hand-rolled System.nanoTime() loops. Use JMH.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Fork(value = 2, jvmArgs = {"--add-modules", "jdk.incubator.vector"})
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
public class VectorBenchmark {
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
@Param({"1024", "65536", "1048576"})
int size;
float[] a, b, result;
@Setup
public void setup() {
a = new float[size];
b = new float[size];
result = new float[size];
Random rng = new Random(42);
for (int i = 0; i < size; i++) {
a[i] = rng.nextFloat();
b[i] = rng.nextFloat();
}
}
@Benchmark
public float[] scalar() {
for (int i = 0; i < size; i++) result[i] = a[i] + b[i];
return result;
}
@Benchmark
public float[] vector() {
int i = 0;
int bound = SPECIES.loopBound(size);
for (; i < bound; i += SPECIES.length()) {
FloatVector.fromArray(SPECIES, a, i)
.add(FloatVector.fromArray(SPECIES, b, i))
.intoArray(result, i);
}
for (; i < size; i++) result[i] = a[i] + b[i];
return result;
}
}
Typical results on a modern x86 with AVX2:
| Benchmark | Size | ops/ms |
|---|---|---|
| scalar | 65536 | 180 |
| vector | 65536 | 820 |
| scalar | 1048576 | 11 |
| vector | 1048576 | 52 |
The vector version holds a roughly 4–5× advantage at both sizes; for very large arrays, memory bandwidth rather than arithmetic becomes the limiting factor, which is why the gain falls short of the theoretical 8×.
Common Pitfalls
Pitfall 1: Hard-Coded Species Width
// Bad: forces software emulation on machines without AVX-512
VectorSpecies<Float> species = FloatVector.SPECIES_512;
// Good: adapts to hardware capability
VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
Pitfall 2: Forgetting the Scalar Tail
If n is not a multiple of SPECIES.length(), the vector loop leaves elements unprocessed.
// Bug: misses the last (n % SPECIES.length()) elements
for (int i = 0; i < a.length; i += SPECIES.length()) { ... }
// Fix: use loopBound + scalar tail
int bound = SPECIES.loopBound(a.length);
for (int i = 0; i < bound; i += SPECIES.length()) { ... }
for (int i = bound; i < a.length; i++) { ... } // scalar tail
Pitfall 3: Object Accumulation
Vector objects should be local variables so the JIT can scalar-replace them. Storing them in fields or collections defeats this optimization.
// Bad: may allocate on heap, blocking scalar replacement
List<FloatVector> vectors = new ArrayList<>();
vectors.add(FloatVector.fromArray(SPECIES, arr, i));
// Good: local variable, JIT scalar-replaces
FloatVector v = FloatVector.fromArray(SPECIES, arr, i);
v.add(other).intoArray(result, i);
Pitfall 4: Mixing Species in One Expression
You cannot add a 128-bit vector to a 256-bit vector. Mismatched species throw ClassCastException at runtime.
FloatVector v128 = FloatVector.fromArray(FloatVector.SPECIES_128, arr, 0);
FloatVector v256 = FloatVector.fromArray(FloatVector.SPECIES_256, arr, 0);
v128.add(v256); // ClassCastException: species mismatch
Pitfall 5: Premature Optimization
The JIT’s auto-vectorizer handles simple loops well. Profile first. Apply the Vector API only where profiling shows a bottleneck in arithmetic-heavy code.
Integration with Foreign Function & Memory API
The Vector API integrates with the FFM API (JEP 442) for reading/writing off-heap data. Note that in Java 21 the FFM API is itself still a preview feature, so this combination additionally requires --enable-preview:
// Read from MemorySegment
MemorySegment segment = Arena.ofAuto().allocate(SPECIES.vectorByteSize() * n);
FloatVector v = FloatVector.fromMemorySegment(SPECIES, segment, offset, ByteOrder.nativeOrder());
// Write to MemorySegment
v.intoMemorySegment(segment, offset, ByteOrder.nativeOrder());
This is useful when working with native libraries that expose data via pointers — you can operate on the memory directly without copying to a Java array.
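A sketch of a complete off-heap loop under that setup (class and method names illustrative; in Java 21 this needs both --add-modules jdk.incubator.vector and, for FFM, --enable-preview):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import jdk.incubator.vector.*;

class SegmentScaleExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Multiply every float in an off-heap segment by factor, in place
    static void scaleInPlace(MemorySegment seg, float factor) {
        long vbytes = SPECIES.vectorByteSize();
        // Largest byte offset that still leaves room for one full vector
        long bound = (seg.byteSize() / vbytes) * vbytes;
        long off = 0;
        for (; off < bound; off += vbytes) {
            FloatVector.fromMemorySegment(SPECIES, seg, off, ByteOrder.nativeOrder())
                       .mul(factor)
                       .intoMemorySegment(seg, off, ByteOrder.nativeOrder());
        }
        // Scalar tail over any leftover floats
        for (; off < seg.byteSize(); off += Float.BYTES) {
            seg.set(ValueLayout.JAVA_FLOAT, off,
                    seg.get(ValueLayout.JAVA_FLOAT, off) * factor);
        }
    }
}
```

The loop bound is computed in bytes rather than elements, since MemorySegment offsets are byte offsets.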
Supported Hardware
The JIT’s Vector API backend targets:
| Architecture | ISA Extension | Max Width |
|---|---|---|
| x86-64 | SSE2 | 128-bit |
| x86-64 | AVX / AVX2 | 256-bit |
| x86-64 | AVX-512 | 512-bit |
| ARM AArch64 | NEON | 128-bit |
| ARM AArch64 | SVE | Variable |
SPECIES_PREFERRED selects the widest supported width at JVM startup. On SVE (ARM), it selects the hardware’s actual vector length.
When the Vector API Beats Auto-Vectorization
The JIT’s auto-vectorizer is good but conservative. It refuses to vectorize when:
- There are potential array aliasing issues
- There is complex control flow inside the loop
- The loop body contains method calls it cannot inline
- Index expressions are non-trivial
The Vector API bypasses all these restrictions — you express the vector operations explicitly, so the JIT doesn’t need to prove they’re safe. This is the primary reason to reach for the Vector API: control over vectorization when the compiler won’t do it automatically.
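For example, a clamp loop routed through a helper method the JIT declines to inline will typically stay scalar, while the explicit vector form (a sketch; names illustrative) vectorizes regardless:

```java
import jdk.incubator.vector.*;

class ClampExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float[] clamp(float[] in, float lo, float hi) {
        float[] out = new float[in.length];
        FloatVector vlo = FloatVector.broadcast(SPECIES, lo);
        FloatVector vhi = FloatVector.broadcast(SPECIES, hi);
        int i = 0;
        int bound = SPECIES.loopBound(in.length);
        for (; i < bound; i += SPECIES.length()) {
            // max(lo) then min(hi) pins every lane into [lo, hi]
            FloatVector.fromArray(SPECIES, in, i)
                       .max(vlo)
                       .min(vhi)
                       .intoArray(out, i);
        }
        for (; i < in.length; i++) {
            out[i] = Math.min(hi, Math.max(lo, in[i]));
        }
        return out;
    }
}
```

Branchless min/max also avoids the unpredictable branches a scalar clamp with if-statements would introduce.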
Summary
| Concept | Key API |
|---|---|
| Choose vector width | FloatVector.SPECIES_PREFERRED |
| Load from array | FloatVector.fromArray(species, arr, offset) |
| Arithmetic | .add(), .mul(), .fma(), .sqrt() |
| Write to array | .intoArray(arr, offset) |
| Tail handling | species.loopBound(n) + scalar loop |
| Masked operations | species.indexInRange(i, n) |
| Reductions | .reduceLanes(VectorOperators.ADD) |
| Shuffles | .rearrange(VectorShuffle.fromOp(...)) |
| Conditionals | .compare() + .blend() |
The Vector API is production-usable under --add-modules jdk.incubator.vector. Use SPECIES_PREFERRED, follow the standard loop pattern, always handle the tail, and benchmark with JMH before and after to confirm real gains.
What’s Next
Article 15: Migrating to Java 21 — From Java 8, 11, and 17 walks through the practical steps to upgrade an existing codebase to Java 21, covering module system issues, deprecated API removal, and GC tuning changes.