As data systems scale to petabyte volumes and workloads become increasingly complex, CPU performance remains central to modern distributed data processing. Even as heterogeneous compute architectures gain traction, CPUs still execute the vast majority of ETL, SQL analytics, and data engineering workloads. What is less obvious is how dramatically today’s CPUs have evolved: wider SIMD registers, deeper cache hierarchies, NUMA-aware memory fabrics, and specialized instructions tailored to analytical access patterns.
However, most distributed frameworks (Spark, Presto/Trino, Flink, etc.) have historically relied on managed runtimes (the JVM), interpreted expression evaluators, and row-oriented storage formats that prevent these hardware capabilities from being fully exploited. To unlock the true performance potential of CPUs, data engines must re-architect their execution paths around three key principles:
- Vectorized operators
- SIMD-optimized kernels
- Native, JVM-bypassing execution pipelines
This deep dive explains how these concepts work at a systems-engineering level and how modern engines like Velox achieve major performance gains through CPU-focused design.
CPU Fundamentals: Why Execution Model Matters More Than Core Count
Modern CPUs offer tremendous hardware capabilities:
- Wide vector registers: AVX2 (256-bit), AVX-512 (512-bit), SVE (scalable vectors up to 2048 bits)
- Large, multi-level caches: L1 (32 KB/core), L2 (256 KB–1 MB/core), L3 (shared, up to 50+ MB)
- High memory bandwidth: 150–400 GB/s per socket with DDR5
- Dozens to hundreds of hardware threads
But these capabilities are only exploitable when:
- Data is laid out contiguously
- Branching is minimized
- Expression evaluators operate on uniform batches
- Memory access patterns are predictable
- Kernels are compiled, not interpreted
Legacy row-at-a-time execution violates every one of these conditions. For example:
- Each row triggers dozens of virtual method calls.
- Expressions run through interpreted bytecode.
- Objects scatter across heap memory, destroying cache locality.
- Branch-heavy logic prevents SIMD vectorization.
Thus, the bottleneck is not CPU hardware; it is how data engines feed instructions and data to the CPU.
The next sections unpack how vectorized execution, SIMD instructions, and native runtimes fix these issues.
1. Vectorized Execution: Rebuilding the Core Execution Model
Vectorized engines process data in columnar batches, typically 1,024 to 16,384 values per batch, rather than row-by-row. This creates a uniform, predictable execution pattern that is ideal for modern CPU hardware.
1.1. Columnar Layout and Memory Behavior
Columnar vector storage achieves:
- Spatial locality: adjacent values are neighbors in memory
- Cache efficiency: each L1/L2 cacheline fetch brings several values
- Prefetchability: CPU hardware prefetchers detect regular strides
- Alignment: vectors align with SIMD register widths
For example, an Arrow column of 10,000 int64 values:
| v0 | v1 | v2 | v3 | … | v9999 |
will be processed 4, 8, or up to 32 values at a time, depending on SIMD width (AVX2, AVX-512, or 2048-bit SVE, respectively).
Compare this to row-based layouts, where each value is embedded in variable-length object structures, pointer-chasing is required, and every row triggers additional memory accesses.
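To make the contrast concrete, here is a minimal sketch of the two layouts. The type names (PriceColumn, RowRecord) are illustrative, not taken from any particular engine; the point is that the columnar scan walks one contiguous array while the row scan chases a pointer per row.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Columnar layout: one contiguous array per column. A scan over `values`
// touches sequential memory, so hardware prefetchers and SIMD apply directly.
struct PriceColumn {
  std::vector<int64_t> values;  // e.g. 10,000 int64 values, back to back
};

// Row layout: every row is a separate heap object mixing all columns.
// Scanning the price now strides over unrelated fields and chases pointers.
struct RowRecord {
  std::string name;   // variable-length, heap-allocated
  int64_t price;
  double discount;
};

int64_t sum_columnar(const PriceColumn& col) {
  int64_t sum = 0;
  for (int64_t v : col.values) sum += v;  // tight, auto-vectorizable loop
  return sum;
}

int64_t sum_rows(const std::vector<RowRecord*>& rows) {
  int64_t sum = 0;
  for (const RowRecord* r : rows) sum += r->price;  // one pointer chase per row
  return sum;
}
```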
1.2. Operator Pipelines
Vectorized engines execute pipelines that operate on vectors:
Scan → Filter → Projection → Aggregation → Sort → Join
Each operator consumes columnar batches and emits new columnar batches.
Operators themselves are implemented as vector kernels:
- Filter returns a selection vector (bitmask or index list)
- Projection applies an expression tree to all values in a batch
- Aggregation applies hash or vectorized group kernels
- Sorting uses vector-based partitioning + SIMD comparisons
- Join uses vectorized hash table probing and build-side scanning
Critically, all these kernels are implemented without branching per row, enabling compilers to generate extremely tight loops.
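The sketch below shows what that batch-in/batch-out contract can look like in simplified form: an operator interface paid for once per batch rather than once per row, and a filter that builds a selection vector with a branch-free loop. Batch, Operator, and GreaterThanFilter are illustrative stand-ins, not the classes a real engine uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A columnar batch: one contiguous buffer per column (single int64 column here).
struct Batch {
  std::vector<int64_t> col;
};

// Operators consume a batch and emit a new batch. One virtual call is paid per
// batch of thousands of rows, not per row.
struct Operator {
  virtual ~Operator() = default;
  virtual Batch next(const Batch& input) = 0;
};

// Filter: evaluate the predicate over the whole batch, collect passing row
// indices into a selection vector, then compact them into the output batch.
struct GreaterThanFilter : Operator {
  int64_t threshold;
  explicit GreaterThanFilter(int64_t t) : threshold(t) {}

  Batch next(const Batch& input) override {
    std::vector<uint32_t> selection(input.col.size());
    size_t n = 0;
    for (size_t i = 0; i < input.col.size(); ++i) {
      selection[n] = static_cast<uint32_t>(i);  // always write...
      n += (input.col[i] > threshold);          // ...advance only on a match
    }
    selection.resize(n);

    Batch out;
    out.col.reserve(n);
    for (size_t k = 0; k < n; ++k) out.col.push_back(input.col[selection[k]]);
    return out;
  }
};
```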
2. SIMD Acceleration: Leveraging the CPU’s Internal Parallelism
SIMD (Single Instruction, Multiple Data) enables one CPU instruction to operate on multiple data elements simultaneously.
Examples:
- AVX2: 256-bit → process 4 x 64-bit ints or 8 x 32-bit ints
- AVX-512: 512-bit → process 8 x 64-bit ints or 16 x 32-bit ints
- SVE: scalable vectors; the width (128 to 2048 bits) is fixed by each hardware implementation, and the same code adapts to it at run time
2.1. SIMD and Branch Avoidance
The biggest enemy of SIMD utilization is per-row branching:
if (value > threshold) { ... }
Branches break vector execution because different lanes may take different paths. Vectorized engines use predication or selection vectors instead:
- Compute comparison on all values in parallel
- Produce a mask (bitset)
- Apply downstream operations only on selected lanes
This structure eliminates per-row branching entirely.
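As an illustration, the following AVX2 sketch (compile with -mavx2) evaluates a greater-than predicate four int64 lanes at a time, converts the comparison result into a lane mask, and appends matching indices to a selection vector. It is a simplified example; production kernels typically replace the small per-lane loop with compress instructions or lookup tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // AVX2 intrinsics
#include <vector>

// Predicated filter: the predicate is evaluated for all 4 lanes in parallel and
// the resulting bitmask drives selection; no data-dependent branch decides what
// an individual row does.
std::vector<uint32_t> select_greater_than(const int64_t* values, size_t n,
                                          int64_t threshold) {
  std::vector<uint32_t> selection;
  selection.reserve(n);
  const __m256i thresh = _mm256_set1_epi64x(threshold);

  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(values + i));
    __m256i cmp = _mm256_cmpgt_epi64(v, thresh);              // lane-wise compare
    int mask = _mm256_movemask_pd(_mm256_castsi256_pd(cmp));  // 4-bit lane mask
    for (int lane = 0; lane < 4; ++lane) {
      if (mask & (1 << lane)) selection.push_back(static_cast<uint32_t>(i + lane));
    }
  }
  for (; i < n; ++i) {                                        // scalar tail
    if (values[i] > threshold) selection.push_back(static_cast<uint32_t>(i));
  }
  return selection;
}
```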
2.2. SIMD in Common Operators
Filters
- value > const is translated into a SIMD compare instruction
- Mask from compare feeds into the selection vector
Projections
Expression trees are fused (via IR) into tight kernels. Example:
(c1 * 5 + c2) / c3
Compiled to:
- Load vector c1 → apply vector multiply
- Load vector c2 → add
- Load vector c3 → divide using SIMD-friendly division
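A hand-written stand-in for such a fused kernel might look like the following; a real engine would generate or specialize this per expression tree and add null handling and divide-by-zero checks that are omitted here.

```cpp
#include <cstddef>
#include <cstdint>

// Fused kernel for (c1 * 5 + c2) / c3 over one batch. Instead of evaluating
// three separate expression nodes (each materializing an intermediate vector),
// the whole tree runs in a single stride-1, branch-free pass that the compiler
// can auto-vectorize. Note: x86 has no native SIMD integer divide, which is why
// engines substitute "SIMD-friendly" division (reciprocal multiplication or
// floating-point evaluation) in practice.
void project_c1x5_plus_c2_div_c3(const int64_t* c1, const int64_t* c2,
                                 const int64_t* c3, int64_t* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = (c1[i] * 5 + c2[i]) / c3[i];  // assumes c3 has no zeros or nulls
  }
}
```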
Aggregations
Vectorized:
- min, max, sum, count use horizontal SIMD reductions
- Group-by uses SIMD hash probes and vector-level batching
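As an example of the horizontal reductions mentioned above, a SUM aggregate over an int64 column can be sketched with AVX2 intrinsics as follows (illustrative only; engine implementations also handle nulls and overflow):

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // AVX2 intrinsics

// Accumulate 4 partial sums in parallel, then collapse them with a horizontal
// reduction once at the end of the batch.
int64_t sum_int64(const int64_t* values, size_t n) {
  __m256i acc = _mm256_setzero_si256();
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(values + i));
    acc = _mm256_add_epi64(acc, v);              // 4 independent partial sums
  }
  // Horizontal reduction: 4 lanes -> 2 lanes -> 1 scalar.
  __m128i lo = _mm256_castsi256_si128(acc);
  __m128i hi = _mm256_extracti128_si256(acc, 1);
  __m128i s2 = _mm_add_epi64(lo, hi);
  int64_t sum = _mm_extract_epi64(s2, 0) + _mm_extract_epi64(s2, 1);
  for (; i < n; ++i) sum += values[i];           // scalar tail
  return sum;
}
```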
Joins
Vectorized:
- Hash table probe with vectorized comparisons
- Use bitmap operations for accelerating match detection
SIMD turns each CPU core into a tiny parallel processor, dramatically increasing throughput.
3. Native Execution: Escaping the JVM for C++ Efficiency
JVM-bound data engines (Spark, Flink, Presto) face inherent challenges:
- Object allocation
- Function pointer indirection
- Garbage collection
- Branch-heavy interpreted expression evaluation
Native execution replaces these bottlenecks with:
- Compiled C++ kernels
- Fixed-width columnar data via Arrow
- Operator fusion
- Tight loops with predictable control flow
- Explicit memory management
3.1. Substrait: The Intermediate Representation
Substrait IR is a hardware-agnostic, language-neutral representation of a query plan:
- Standardizes expressions, operators, and types
- Allows different engines (Spark, Trino, DuckDB-like engines) to share a common execution layer
- Enables optimized C++ backends like Velox
A Spark query becomes:
Spark Catalyst Logical Plan → Substrait → Velox Physical Execution
3.2. Gluten: Bridging Spark with Native Engines
Gluten uses:
- Catalyst → Substrait translation
- Apache Arrow for zero-copy columnar data exchange
- JNI to transfer columnar vectors between JVM and C++
- Velox or other native engines for actual execution
This allows Spark to maintain its distributed scheduler while offloading execution to native C++.
3.3. Velox: A Modern CPU-Native Execution Engine
Velox is a C++ library providing:
- Vectorized columnar data model
- Expression evaluation subsystem
- Function registry and expression-level optimizations (constant folding, common subexpression elimination)
- SIMD-friendly operator kernels
- Specialized memory allocators
- Lazy materialization + dictionary encoding
It acts as a shared execution substrate for multiple engines.
4. Deep Dive: Operator Implementations in Native Engines
4.1. Filters
Velox filter kernel:
- Load 1024 values into SIMD registers
- Compare against predicate (single SIMD instruction)
- Produce a selection mask
- Compress pass-through values into output vector
Extremely efficient, branch-free, and cache-friendly.
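The following AVX-512 sketch (requires AVX-512F) walks the same four steps for an int64 column and a greater-than predicate; it illustrates the technique and is not Velox's actual kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // AVX-512F intrinsics

// 1. load 8 values, 2. one SIMD compare, 3. a lane mask, 4. a compress-store
// that packs only the passing values into the output - no per-row branch.
size_t filter_gt(const int64_t* values, size_t n, int64_t threshold,
                 int64_t* out) {
  const __m512i thresh = _mm512_set1_epi64(threshold);
  size_t out_count = 0;
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m512i v = _mm512_loadu_si512(values + i);                // 1. load
    __mmask8 m = _mm512_cmpgt_epi64_mask(v, thresh);           // 2.+3. mask
    _mm512_mask_compressstoreu_epi64(out + out_count, m, v);   // 4. compact
    out_count += static_cast<size_t>(__builtin_popcount(static_cast<unsigned>(m)));
  }
  for (; i < n; ++i) {                                         // scalar tail
    if (values[i] > threshold) out[out_count++] = values[i];
  }
  return out_count;
}
```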
4.2. Aggregations
Native engines use:
- Hash-based group-by: SIMD probe + vectorized build
- Dictionary-encoded aggregates: operate on the dictionary instead of raw values
- Partial aggregation: runs before global shuffle to reduce data volume
- Final aggregation: merges partial results with SIMD kernels
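As a small illustration of the dictionary idea, a SUM over a dictionary-encoded column can accumulate per-entry counts and touch each distinct value only once. The function below is a simplified sketch that ignores nulls and overflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Each row stores a small index into a dictionary of distinct values. Instead
// of decoding every row, count how often each dictionary entry occurs, then
// combine the counts with the (tiny) dictionary.
int64_t sum_dictionary_encoded(const std::vector<int32_t>& indices,
                               const std::vector<int64_t>& dictionary) {
  std::vector<int64_t> counts(dictionary.size(), 0);
  for (int32_t idx : indices) counts[idx] += 1;     // touches only a small array

  int64_t sum = 0;
  for (size_t d = 0; d < dictionary.size(); ++d) {
    sum += dictionary[d] * counts[d];               // one multiply per distinct value
  }
  return sum;
}
```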
4.3. Joins
Join optimizations include:
- SIMD-accelerated hash table lookups
- Prefetching hash buckets based on vector probes
- Multi-stage probers to reduce branch mispredicts
- Vectorized null-handling and equality checks
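The probe-side batching and prefetching idea can be sketched as follows. This uses a toy open-addressing table and GCC/Clang's __builtin_prefetch, and is a simplification of what production engines do.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Probe phase of a hash join over a batch of keys: hash the whole batch first,
// issue prefetches for the buckets it will touch, then probe. By the time the
// probe loop reaches a bucket, it is likely already in cache.
struct Bucket {
  int64_t key;
  int32_t payload;
  bool occupied;
};

void probe_batch(const std::vector<Bucket>& table,   // power-of-two sized
                 const int64_t* keys, size_t n,
                 std::vector<int32_t>& matches) {
  const size_t mask = table.size() - 1;
  std::vector<size_t> slots(n);

  // Pass 1: hash every key in the batch and prefetch its bucket.
  for (size_t i = 0; i < n; ++i) {
    size_t h = static_cast<size_t>(keys[i]) * 0x9E3779B97F4A7C15ULL;  // mix
    slots[i] = h & mask;
    __builtin_prefetch(&table[slots[i]]);
  }

  // Pass 2: probe with linear probing (kept minimal for clarity).
  for (size_t i = 0; i < n; ++i) {
    size_t s = slots[i];
    while (table[s].occupied) {
      if (table[s].key == keys[i]) { matches.push_back(table[s].payload); break; }
      s = (s + 1) & mask;
    }
  }
}
```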
4.4. Sort and Merge
- Vectorized radix sort
- SIMD comparisons
- Cache-aware partitioning
- Multiway merge performed in batches
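To show the core of the radix-sort approach, here is a single partition pass (histogram, prefix sum, scatter) over one 8-bit digit; a full LSD radix sort simply repeats this pass for each digit. It is a sketch for unsigned keys, not a complete sort implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One pass of an LSD radix sort: count keys per 256-way bucket for one digit,
// turn the counts into starting offsets, then scatter rows into partitions.
// The loops are branch-light and sequential, which suits CPU caches well.
void radix_partition_pass(const std::vector<uint64_t>& in,
                          std::vector<uint64_t>& out, int digit /* 0..7 */) {
  const int shift = digit * 8;
  std::vector<size_t> hist(256, 0);
  for (uint64_t v : in) hist[(v >> shift) & 0xFF]++;               // histogram

  std::vector<size_t> offset(256, 0);                              // exclusive prefix sum
  for (int b = 1; b < 256; ++b) offset[b] = offset[b - 1] + hist[b - 1];

  out.resize(in.size());
  for (uint64_t v : in) out[offset[(v >> shift) & 0xFF]++] = v;    // scatter
}
```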
5. Memory Management in Native Engines
Native engines use custom allocators:
- Arena allocators: batch allocate memory for vector batches
- NUMA-aware pools: pin memory to local NUMA node
- Zero-copy Arrow buffers: minimize serialization
They avoid:
- Per-row object creation
- Paging overhead
- GC pressure
This ensures deterministic, low-latency memory behavior.
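A minimal sketch of the arena ("bump") allocation pattern is shown below; real engines layer size classes, NUMA-aware pools, and memory accounting on top of something like this.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Arena allocator: grab large chunks once, then hand out slices with pointer
// arithmetic. Allocation is a few instructions, everything for a batch is freed
// at once, and nothing is visible to a garbage collector.
// Assumes: bytes <= chunk_size_, alignment is a power of two, and the chunk
// base itself is only max_align_t-aligned in this sketch.
class Arena {
 public:
  explicit Arena(size_t chunk_size = 1 << 20) : chunk_size_(chunk_size) {}

  void* allocate(size_t bytes, size_t alignment = 16) {
    size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
    if (chunks_.empty() || aligned + bytes > chunk_size_) {
      chunks_.push_back(std::make_unique<uint8_t[]>(chunk_size_));
      aligned = 0;
    }
    offset_ = aligned + bytes;
    return chunks_.back().get() + aligned;
  }

  void reset() {            // release everything for the batch at once
    chunks_.clear();
    offset_ = 0;
  }

 private:
  size_t chunk_size_;
  size_t offset_ = 0;
  std::vector<std::unique_ptr<uint8_t[]>> chunks_;
};
```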
6. Putting It All Together: A Modern CPU-Native Execution Stack
A typical CPU-native execution architecture looks like:
Distributed Scheduler (Spark, Trino, Flink)
↓
Substrait IR
↓
Gluten
↓
Arrow Columnar Batches
↓
Velox
[Vectorized + SIMD execution]
Key properties:
- Compiled, vectorized kernels
- SIMD-optimized hot loops
- Zero-copy memory movement
- CPU cache–aligned execution
This architecture achieves significant speedups, often 2×–3× on TPC-style benchmarks, without rewriting queries or pipelines.
7. How to Transition to CPU-Native Acceleration
- Enable vectorized/native modes in Spark, Trino, or other engines.
- Benchmark with representative workloads.
- Inspect operator coverage (operators that fall back to the default JVM engine reduce the gains).
- Tune batch sizes to optimize prefetching and SIMD use.
- Pin memory and executors to NUMA nodes.
- Monitor metrics such as vectorized operator hit rate, batch sizes, and SIMD utilization.
Conclusion: The Untapped Power of CPUs
Modern CPUs are extremely capable analytical processors; all they need is an execution model that feeds them data in a form they can exploit. Vectorization provides the structure. SIMD provides the parallelism. Native execution provides efficiency.
Together, they redefine what CPU-based data processing can deliver.
Organizations can unlock major speedups using the hardware they already own: no changes to SQL, no new languages, no application rewrites. Just better engines.