Vector API (JEP 448): SIMD Computation in Java
Incubator Feature in Java 21 — The Vector API has been incubating since Java 16 (JEP 338). JEP 448 is the sixth incubator iteration in Java 21. The API is stable and production-usable with --add-modules jdk.incubator.vector; finalization is pending Project Valhalla value types.
What Is SIMD and Why Does It Matter?
Modern CPUs can perform the same arithmetic operation on multiple data values in a single instruction. This is called SIMD — Single Instruction, Multiple Data. Instead of adding two pairs of floats in two instructions, an AVX2-equipped x86 processor can add eight pairs of floats in one instruction.
Java has always compiled scalar arithmetic to scalar instructions. A loop adding two float[] arrays would generate one ADDSS (scalar) instruction per element. The Vector API lets you write code that the JIT compiles to VADDPS (vector) instructions — yielding 4–16× throughput depending on your hardware and data type.
The Performance Gap
// Scalar: one float added per instruction
float[] a = ..., b = ..., c = new float[N];
for (int i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
// Vector (AVX2): 8 floats added per instruction
var species = FloatVector.SPECIES_256;
for (int i = 0; i < N; i += species.length()) {
var va = FloatVector.fromArray(species, a, i);
var vb = FloatVector.fromArray(species, b, i);
va.add(vb).intoArray(c, i);
}
On an AVX2 machine, the vector loop processes 8 floats per iteration vs 1 — roughly 8× the throughput for this operation.
When SIMD Applies
SIMD delivers the most value for:
- Numerical computation: signal processing, image filters, ML inference
- Bulk data operations: hashing, checksum computation, encryption primitives
- Scientific computing: matrix math, FFT, physics simulations
- Data transformation: character encoding, compression, serialization
For general business logic with branches, object graphs, and mixed types, the Vector API provides no benefit.
JEP 448: Sixth Incubator
JEP 448 continues the API from JEP 438 (Java 20) with minor refinements:
- Masked operations are more ergonomic
- VectorMask factory methods are expanded
- Documentation and error messages are improved
- No breaking API changes from JEP 438
To use the Vector API in Java 21, add the incubator module at both compile time and run time:
javac --add-modules jdk.incubator.vector VectorExample.java
java --add-modules jdk.incubator.vector VectorExample
Or with Maven:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<compilerArgs>
<arg>--add-modules</arg>
<arg>jdk.incubator.vector</arg>
</compilerArgs>
<release>21</release>
</configuration>
</plugin>
Core Concepts
VectorSpecies
A VectorSpecies<E> describes the element type and vector shape (bit width). It is the entry point to all vector operations.
import jdk.incubator.vector.*;
// 128-bit vector of floats: 4 lanes
VectorSpecies<Float> SPECIES_128 = FloatVector.SPECIES_128;
// 256-bit vector of floats: 8 lanes
VectorSpecies<Float> SPECIES_256 = FloatVector.SPECIES_256;
// 512-bit vector of floats: 16 lanes
VectorSpecies<Float> SPECIES_512 = FloatVector.SPECIES_512;
// Preferred species: widest vector the current CPU supports
VectorSpecies<Float> PREFERRED = FloatVector.SPECIES_PREFERRED;
Always prefer SPECIES_PREFERRED in production code. It selects the widest vector width the JVM detects at startup: SPECIES_256 on a typical AVX2 machine, SPECIES_512 with AVX-512, and SPECIES_128 on hardware with only SSE2 or NEON. Hard-coding SPECIES_512 on a machine without AVX-512 forces the JIT to emulate it in software — often slower than scalar.
// Good: adapts to hardware
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
// Bad: may force software emulation on older hardware
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_512;
Vector
A Vector<E> is an immutable fixed-width collection of primitive values. You cannot create one with new; use factory methods on the species or concrete vector types.
// From an array (SPECIES_256 has 8 float lanes, matching the array length)
float[] arr = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
FloatVector v = FloatVector.fromArray(FloatVector.SPECIES_256, arr, 0);
// Broadcast: every lane gets the same value
FloatVector ones = FloatVector.broadcast(FloatVector.SPECIES_256, 1.0f);
// From individual lane values, via an inline array
FloatVector v2 = FloatVector.fromArray(FloatVector.SPECIES_256, new float[] {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f}, 0);
Lane Operations
Operations are applied elementwise across all lanes:
FloatVector a = FloatVector.fromArray(SPECIES, arrayA, i);
FloatVector b = FloatVector.fromArray(SPECIES, arrayB, i);
FloatVector sum = a.add(b);
FloatVector diff = a.sub(b);
FloatVector product = a.mul(b);
FloatVector quot = a.div(b);
FloatVector absA = a.abs();
FloatVector negA = a.neg();
FloatVector sqrtA = a.sqrt();
FloatVector fma = a.fma(b, c); // fused multiply-add: a*b + c
Writing Results Back
// Write to array
result.intoArray(outputArr, i);
// Write to MemorySegment (integrates with FFM API)
result.intoMemorySegment(segment, byteOffset, ByteOrder.nativeOrder());
The Standard Loop Pattern
Almost all Vector API code follows the same structure:
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
static float[] vectorAdd(float[] a, float[] b) {
float[] result = new float[a.length];
int i = 0;
// Vector loop: process SPECIES.length() elements per iteration
int upperBound = SPECIES.loopBound(a.length);
for (; i < upperBound; i += SPECIES.length()) {
FloatVector va = FloatVector.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
va.add(vb).intoArray(result, i);
}
// Scalar tail: handle remaining elements
for (; i < a.length; i++) {
result[i] = a[i] + b[i];
}
return result;
}
Key points:
- SPECIES.loopBound(n) returns the largest multiple of SPECIES.length() that is ≤ n. This prevents reading past the array end.
- The scalar tail handles the remaining n % SPECIES.length() elements.
- The tail can also be handled with masked operations (see below).
Masked Operations
A VectorMask<E> selects which lanes participate in an operation. Masks are essential for the loop tail and for conditional computation.
Tail Handling with Masks
static float[] vectorAddMasked(float[] a, float[] b) {
float[] result = new float[a.length];
int i = 0;
for (; i < a.length; i += SPECIES.length()) {
// Create mask: true for lanes that are within bounds
VectorMask<Float> mask = SPECIES.indexInRange(i, a.length);
FloatVector va = FloatVector.fromArray(SPECIES, a, i, mask);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i, mask);
va.add(vb).intoArray(result, i, mask);
}
return result;
}
SPECIES.indexInRange(i, length) produces a mask where lane k is active if i + k < length. This handles partial vectors at the end without a separate scalar tail.
Conditional Lane Operations
// Replace negative values with zero (ReLU activation)
static float[] relu(float[] input) {
float[] output = new float[input.length];
FloatVector zero = FloatVector.zero(SPECIES);
int i = 0;
int bound = SPECIES.loopBound(input.length);
for (; i < bound; i += SPECIES.length()) {
FloatVector v = FloatVector.fromArray(SPECIES, input, i);
// mask: true where v > 0
VectorMask<Float> positive = v.compare(VectorOperators.GT, 0f);
// blend: take v where positive is true, zero elsewhere
v.blend(zero, positive.not()).intoArray(output, i);
}
for (; i < input.length; i++) {
output[i] = Math.max(0f, input[i]);
}
return output;
}
Reduction with Masks
// Sum only positive elements
static float sumPositive(float[] arr) {
float total = 0f;
int i = 0;
int bound = SPECIES.loopBound(arr.length);
for (; i < bound; i += SPECIES.length()) {
FloatVector v = FloatVector.fromArray(SPECIES, arr, i);
VectorMask<Float> pos = v.compare(VectorOperators.GT, 0f);
total += v.reduceLanes(VectorOperators.ADD, pos);
}
for (; i < arr.length; i++) {
if (arr[i] > 0) total += arr[i];
}
return total;
}
Reductions
Reductions collapse a vector to a single scalar value across all lanes.
FloatVector v = FloatVector.fromArray(SPECIES, data, i);
float sum = v.reduceLanes(VectorOperators.ADD);
float max = v.reduceLanes(VectorOperators.MAX);
float min = v.reduceLanes(VectorOperators.MIN);
float prod = v.reduceLanes(VectorOperators.MUL);
// Integer OR/AND
IntVector iv = IntVector.fromArray(IntVector.SPECIES_PREFERRED, intData, i);
int orResult = iv.reduceLanes(VectorOperators.OR);
int andResult = iv.reduceLanes(VectorOperators.AND);
Reductions are useful for finding minima/maxima, computing checksums, or aggregating statistics across arrays.
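A minimal sketch putting reductions to work — a vectorized array maximum, assuming SPECIES_PREFERRED as in the earlier examples (class and method names here are illustrative):

```java
import jdk.incubator.vector.*;

class VectorMaxExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float vectorMax(float[] arr) {
        // Start at -Infinity so any real element wins the comparison
        FloatVector acc = FloatVector.broadcast(SPECIES, Float.NEGATIVE_INFINITY);
        int i = 0;
        int bound = SPECIES.loopBound(arr.length);
        for (; i < bound; i += SPECIES.length()) {
            // Lanewise max keeps a running per-lane maximum
            acc = acc.max(FloatVector.fromArray(SPECIES, arr, i));
        }
        // Collapse the per-lane maxima into one scalar
        float max = acc.reduceLanes(VectorOperators.MAX);
        // Scalar tail for the leftover elements
        for (; i < arr.length; i++) {
            max = Math.max(max, arr[i]);
        }
        return max;
    }
}
```

Note the pattern: the reduction happens once, after the loop — reducing inside the loop would throw away most of the SIMD benefit.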
Type Conversions and Cross-Lane Operations
Type Casting
FloatVector fv = FloatVector.fromArray(SPECIES_256, floatArr, i);
// Convert float lanes to int lanes (same bit width)
IntVector iv = (IntVector) fv.convert(VectorOperators.F2I, 0);
// Convert float to double (doubles wider — need two parts)
DoubleVector dvLow = (DoubleVector) fv.convert(VectorOperators.F2D, 0); // low half
DoubleVector dvHigh = (DoubleVector) fv.convert(VectorOperators.F2D, 1); // high half
The second argument to convert is the “part” index — necessary when the output element type is wider than the input.
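A sketch of how the part index is used across a full loop (method and class names illustrative): widening a float[] to a double[] consumes one float vector and produces two double vectors of the same bit width per iteration:

```java
import jdk.incubator.vector.*;

class WidenExample {
    private static final VectorSpecies<Float> F_SPECIES = FloatVector.SPECIES_PREFERRED;

    static double[] widen(float[] src) {
        double[] dst = new double[src.length];
        // A double vector of the same bit width holds half as many lanes
        int half = F_SPECIES.length() / 2;
        int i = 0;
        int bound = F_SPECIES.loopBound(src.length);
        for (; i < bound; i += F_SPECIES.length()) {
            FloatVector fv = FloatVector.fromArray(F_SPECIES, src, i);
            // Part 0 widens the low half of the lanes, part 1 the high half
            ((DoubleVector) fv.convert(VectorOperators.F2D, 0)).intoArray(dst, i);
            ((DoubleVector) fv.convert(VectorOperators.F2D, 1)).intoArray(dst, i + half);
        }
        // Scalar tail
        for (; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }
}
```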
Lane Permutations and Shuffles
// Reverse lane order
VectorShuffle<Float> reverseOrder = VectorShuffle.fromOp(SPECIES, i -> SPECIES.length() - 1 - i);
FloatVector reversed = v.rearrange(reverseOrder);
// Rotate lanes left by 1
FloatVector rotated = v.rearrange(VectorShuffle.iota(SPECIES, 1, 1, true));
// Broadcast lane 0 to all lanes (every lane reads source lane 0)
FloatVector broadcast0 = v.rearrange(VectorShuffle.fromOp(SPECIES, lane -> 0));
Shuffles compile to vpermps / vpshufb instructions on x86. Use them for matrix transposes, convolution kernels, and data layout transformations.
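A small sketch of the technique (names illustrative): swapping adjacent lanes, a building block for interleaved data such as complex numbers or stereo samples. Because SPECIES.length() is always even, the vector loop never splits a pair:

```java
import jdk.incubator.vector.*;

class SwapExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
    // Lane i reads from lane i ^ 1, swapping pairs (0,1), (2,3), ...
    private static final VectorShuffle<Float> SWAP_PAIRS =
            VectorShuffle.fromOp(SPECIES, i -> i ^ 1);

    static float[] swapAdjacent(float[] data) {
        float[] out = new float[data.length];
        int i = 0;
        int bound = SPECIES.loopBound(data.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector.fromArray(SPECIES, data, i)
                       .rearrange(SWAP_PAIRS)
                       .intoArray(out, i);
        }
        // Scalar tail: swap remaining complete pairs
        for (; i + 1 < data.length; i += 2) {
            out[i] = data[i + 1];
            out[i + 1] = data[i];
        }
        if (i < data.length) out[i] = data[i]; // odd leftover copied unchanged
        return out;
    }
}
```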
Integer Vectors
The Vector API covers all primitive types. Integer vectors are common in hashing, compression, and bitwise operations.
private static final VectorSpecies<Integer> INT_SPECIES = IntVector.SPECIES_PREFERRED;
// Bitwise operations
IntVector a = IntVector.fromArray(INT_SPECIES, dataA, i);
IntVector b = IntVector.fromArray(INT_SPECIES, dataB, i);
IntVector anded = a.and(b);
IntVector ored = a.or(b);
IntVector xored = a.xor(b);
IntVector shifted = a.lanewise(VectorOperators.LSHL, 2); // logical shift left 2
Vectorized Popcount (Bit Count)
static int[] vectorPopcount(int[] data) {
int[] result = new int[data.length];
int i = 0;
int bound = INT_SPECIES.loopBound(data.length);
for (; i < bound; i += INT_SPECIES.length()) {
IntVector v = IntVector.fromArray(INT_SPECIES, data, i);
v.lanewise(VectorOperators.BIT_COUNT).intoArray(result, i);
}
for (; i < data.length; i++) {
result[i] = Integer.bitCount(data[i]);
}
return result;
}
Real-World Example: Dot Product
Computing a dot product is a textbook SIMD case — multiply corresponding elements and sum all products.
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
static float dotProduct(float[] a, float[] b) {
assert a.length == b.length;
FloatVector accumulator = FloatVector.zero(SPECIES);
int i = 0;
int bound = SPECIES.loopBound(a.length);
for (; i < bound; i += SPECIES.length()) {
FloatVector va = FloatVector.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
accumulator = va.fma(vb, accumulator); // fused multiply-add: no extra rounding
}
float result = accumulator.reduceLanes(VectorOperators.ADD);
for (; i < a.length; i++) {
result += a[i] * b[i];
}
return result;
}
fma (fused multiply-add) computes a * b + c in a single hardware instruction, avoiding intermediate rounding. On AVX-512, this runs 16 FMAs per iteration.
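One refinement worth knowing — a sketch, not part of the baseline above: with a single accumulator, each fma must wait for the previous one, so the loop is bound by FMA latency. Splitting the sum across two independent accumulators lets the CPU overlap FMAs (method and class names illustrative):

```java
import jdk.incubator.vector.*;

class DotUnrolledExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dotProductUnrolled(float[] a, float[] b) {
        FloatVector acc0 = FloatVector.zero(SPECIES);
        FloatVector acc1 = FloatVector.zero(SPECIES);
        int step = SPECIES.length() * 2;          // two vectors per iteration
        int bound = a.length - (a.length % step); // multiple of two vector widths
        int i = 0;
        for (; i < bound; i += step) {
            // Two independent FMA chains: acc0 and acc1 do not wait on each other
            acc0 = FloatVector.fromArray(SPECIES, a, i)
                              .fma(FloatVector.fromArray(SPECIES, b, i), acc0);
            acc1 = FloatVector.fromArray(SPECIES, a, i + SPECIES.length())
                              .fma(FloatVector.fromArray(SPECIES, b, i + SPECIES.length()), acc1);
        }
        float result = acc0.add(acc1).reduceLanes(VectorOperators.ADD);
        // Scalar tail covers up to (2 * lane count - 1) leftover elements
        for (; i < a.length; i++) {
            result += a[i] * b[i];
        }
        return result;
    }
}
```

Note that changing the accumulation order changes floating-point rounding slightly, as any vectorized sum already does.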
Real-World Example: RGB to Grayscale
Image processing is a classic SIMD domain. Converting RGB pixels to grayscale:
gray = 0.299 * R + 0.587 * G + 0.114 * B
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
static float[] rgbToGrayscale(float[] r, float[] g, float[] b) {
int n = r.length;
float[] gray = new float[n];
FloatVector wr = FloatVector.broadcast(SPECIES, 0.299f);
FloatVector wg = FloatVector.broadcast(SPECIES, 0.587f);
FloatVector wb = FloatVector.broadcast(SPECIES, 0.114f);
int i = 0;
int bound = SPECIES.loopBound(n);
for (; i < bound; i += SPECIES.length()) {
FloatVector vr = FloatVector.fromArray(SPECIES, r, i);
FloatVector vg = FloatVector.fromArray(SPECIES, g, i);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
// Equivalent to: gray[i] = wr*r[i] + wg*g[i] + wb*b[i]
vr.mul(wr).add(vg.mul(wg)).add(vb.mul(wb)).intoArray(gray, i);
}
for (; i < n; i++) {
gray[i] = 0.299f * r[i] + 0.587f * g[i] + 0.114f * b[i];
}
return gray;
}
How the JIT Compiles Vector Code
The Vector API is designed so that the JIT can lower FloatVector.add() directly to hardware vector instructions. This works because:
- Intrinsics: The HotSpot JIT has built-in intrinsics for each Vector API operation. FloatVector.fromArray compiles to vmovups; .add() compiles to vaddps.
- No heap allocation: Vector objects are scalar-replaced by the JIT — they never actually live on the heap. No GC pressure.
- Auto-vectorization overlap: The JIT’s auto-vectorizer and the Vector API use the same backend. The difference is that you control the vector width and operation selection explicitly.
Verifying with -XX:+PrintCompilation
java --add-modules jdk.incubator.vector -XX:+PrintCompilation -XX:+PrintInlining VectorApp 2>&1 | grep -i vector
You should see your vector methods compiled with C2 (the optimizing JIT compiler), not interpreted.
Verifying with Assembly
java --add-modules jdk.incubator.vector \
-XX:+UnlockDiagnosticVMOptions \
-XX:CompileCommand="print,*.vectorAdd" \
VectorApp
Look for vaddps (AVX float add), vmulps (AVX float multiply), or vfmadd (FMA) in the output — these confirm SIMD is active. (Printing real disassembly requires the hsdis plugin on the JVM's library path.)
Benchmarking with JMH
Never benchmark Java with hand-rolled System.nanoTime() loops. Use JMH.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Fork(value = 2, jvmArgs = {"--add-modules", "jdk.incubator.vector"})
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
public class VectorBenchmark {
private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
@Param({"1024", "65536", "1048576"})
int size;
float[] a, b, result;
@Setup
public void setup() {
a = new float[size];
b = new float[size];
result = new float[size];
Random rng = new Random(42);
for (int i = 0; i < size; i++) {
a[i] = rng.nextFloat();
b[i] = rng.nextFloat();
}
}
@Benchmark
public float[] scalar() {
for (int i = 0; i < size; i++) result[i] = a[i] + b[i];
return result;
}
@Benchmark
public float[] vector() {
int i = 0;
int bound = SPECIES.loopBound(size);
for (; i < bound; i += SPECIES.length()) {
FloatVector.fromArray(SPECIES, a, i)
.add(FloatVector.fromArray(SPECIES, b, i))
.intoArray(result, i);
}
for (; i < size; i++) result[i] = a[i] + b[i];
return result;
}
}
Typical results on a modern x86 with AVX2:
| Benchmark | Size | ops/ms |
|---|---|---|
| scalar | 65536 | 180 |
| vector | 65536 | 820 |
| scalar | 1048576 | 11 |
| vector | 1048576 | 52 |
The vector version holds a roughly 4–5× advantage at both sizes; for very large arrays, memory bandwidth rather than arithmetic becomes the limiting factor, which is why the gain falls short of the theoretical 8×.
Common Pitfalls
Pitfall 1: Hard-Coded Species Width
// Bad: forces software emulation on machines without AVX-512
VectorSpecies<Float> species = FloatVector.SPECIES_512;
// Good: adapts to hardware capability
VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
Pitfall 2: Forgetting the Scalar Tail
If n is not a multiple of SPECIES.length(), the vector loop leaves elements unprocessed.
// Bug: misses the last (n % SPECIES.length()) elements
for (int i = 0; i < a.length; i += SPECIES.length()) { ... }
// Fix: use loopBound + scalar tail
int bound = SPECIES.loopBound(a.length);
for (int i = 0; i < bound; i += SPECIES.length()) { ... }
for (int i = bound; i < a.length; i++) { ... } // scalar tail
Pitfall 3: Object Accumulation
Vector objects should be local variables so the JIT can scalar-replace them. Storing them in fields or collections defeats this optimization.
// Bad: may allocate on heap, blocking scalar replacement
List<FloatVector> vectors = new ArrayList<>();
vectors.add(FloatVector.fromArray(SPECIES, arr, i));
// Good: local variable, JIT scalar-replaces
FloatVector v = FloatVector.fromArray(SPECIES, arr, i);
v.add(other).intoArray(result, i);
Pitfall 4: Mixing Species in One Expression
You cannot add a 128-bit vector to a 256-bit vector. Mismatched species throw ClassCastException at runtime.
FloatVector v128 = FloatVector.fromArray(FloatVector.SPECIES_128, arr, 0);
FloatVector v256 = FloatVector.fromArray(FloatVector.SPECIES_256, arr, 0);
v128.add(v256); // ClassCastException: species mismatch
Pitfall 5: Premature Optimization
The JIT’s auto-vectorizer handles simple loops well. Profile first. Apply the Vector API only where profiling shows a bottleneck in arithmetic-heavy code.
Integration with Foreign Function & Memory API
The Vector API integrates with the FFM API (JEP 442) for reading/writing off-heap data. Note that in Java 21 the FFM API is itself still a preview feature, so this combination additionally requires --enable-preview:
// Read from MemorySegment
MemorySegment segment = Arena.ofAuto().allocate(SPECIES.vectorByteSize() * n);
FloatVector v = FloatVector.fromMemorySegment(SPECIES, segment, offset, ByteOrder.nativeOrder());
// Write to MemorySegment
v.intoMemorySegment(segment, offset, ByteOrder.nativeOrder());
This is useful when working with native libraries that expose data via pointers — you can operate on the memory directly without copying to a Java array.
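A sketch of a complete off-heap loop under that setup (class and method names illustrative; in Java 21 this needs both --add-modules jdk.incubator.vector and, for FFM, --enable-preview):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import jdk.incubator.vector.*;

class SegmentScaleExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Multiply every float in an off-heap segment by factor, in place
    static void scaleInPlace(MemorySegment seg, float factor) {
        long vbytes = SPECIES.vectorByteSize();
        // Largest byte offset that still leaves room for one full vector
        long bound = (seg.byteSize() / vbytes) * vbytes;
        long off = 0;
        for (; off < bound; off += vbytes) {
            FloatVector.fromMemorySegment(SPECIES, seg, off, ByteOrder.nativeOrder())
                       .mul(factor)
                       .intoMemorySegment(seg, off, ByteOrder.nativeOrder());
        }
        // Scalar tail over any leftover floats
        for (; off < seg.byteSize(); off += Float.BYTES) {
            seg.set(ValueLayout.JAVA_FLOAT, off,
                    seg.get(ValueLayout.JAVA_FLOAT, off) * factor);
        }
    }
}
```

The loop bound is computed in bytes rather than elements, since MemorySegment offsets are byte offsets.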
Supported Hardware
The JIT’s Vector API backend targets:
| Architecture | ISA Extension | Max Width |
|---|---|---|
| x86-64 | SSE2 | 128-bit |
| x86-64 | AVX / AVX2 | 256-bit |
| x86-64 | AVX-512 | 512-bit |
| ARM AArch64 | NEON | 128-bit |
| ARM AArch64 | SVE | Variable |
SPECIES_PREFERRED selects the widest supported width at JVM startup. On SVE (ARM), it selects the hardware’s actual vector length.
When the Vector API Beats Auto-Vectorization
The JIT’s auto-vectorizer is good but conservative. It refuses to vectorize when:
- There are potential array aliasing issues
- There is complex control flow inside the loop
- The loop body contains method calls it cannot inline
- Index expressions are non-trivial
The Vector API bypasses all these restrictions — you express the vector operations explicitly, so the JIT doesn’t need to prove they’re safe. This is the primary reason to reach for the Vector API: control over vectorization when the compiler won’t do it automatically.
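For example, a clamp loop routed through a helper method the JIT declines to inline will typically stay scalar, while the explicit vector form (a sketch; names illustrative) vectorizes regardless:

```java
import jdk.incubator.vector.*;

class ClampExample {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float[] clamp(float[] in, float lo, float hi) {
        float[] out = new float[in.length];
        FloatVector vlo = FloatVector.broadcast(SPECIES, lo);
        FloatVector vhi = FloatVector.broadcast(SPECIES, hi);
        int i = 0;
        int bound = SPECIES.loopBound(in.length);
        for (; i < bound; i += SPECIES.length()) {
            // max(lo) then min(hi) pins every lane into [lo, hi]
            FloatVector.fromArray(SPECIES, in, i)
                       .max(vlo)
                       .min(vhi)
                       .intoArray(out, i);
        }
        for (; i < in.length; i++) {
            out[i] = Math.min(hi, Math.max(lo, in[i]));
        }
        return out;
    }
}
```

Branchless min/max also avoids the unpredictable branches a scalar clamp with if-statements would introduce.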
Summary
| Concept | Key API |
|---|---|
| Choose vector width | FloatVector.SPECIES_PREFERRED |
| Load from array | FloatVector.fromArray(species, arr, offset) |
| Arithmetic | .add(), .mul(), .fma(), .sqrt() |
| Write to array | .intoArray(arr, offset) |
| Tail handling | species.loopBound(n) + scalar loop |
| Masked operations | species.indexInRange(i, n) |
| Reductions | .reduceLanes(VectorOperators.ADD) |
| Shuffles | .rearrange(VectorShuffle.fromOp(...)) |
| Conditionals | .compare() + .blend() |
The Vector API is production-usable under --add-modules jdk.incubator.vector. Use SPECIES_PREFERRED, follow the standard loop pattern, always handle the tail, and benchmark with JMH before and after to confirm real gains.
What’s Next
Article 15: Migrating to Java 21 — From Java 8, 11, and 17 walks through the practical steps to upgrade an existing codebase to Java 21, covering module system issues, deprecated API removal, and GC tuning changes.