As data systems scale to petabyte volumes and workloads become increasingly complex, CPU performance remains central to modern distributed data processing. Even as heterogeneous compute architectures gain traction, CPUs still execute the vast majority of ETL, SQL analytics, and data engineering workloads. What is less obvious is how dramatically today’s CPUs have evolved: wider SIMD registers, deeper cache hierarchies, NUMA-aware memory fabrics, and specialized instructions tailored to analytical access patterns.
However, most distributed frameworks (Spark, Presto/Trino, Flink, etc.) have historically relied on managed runtimes (the JVM), interpreted expression evaluators, and row-oriented storage formats that prevent these hardware capabilities from being fully exploited. To unlock the true performance potential of CPUs, data engines must re-architect their execution paths around three key principles:
- Vectorized operators
- SIMD-optimized kernels
- Native, JVM-bypassing execution pipelines
This deep dive explains how these concepts work at a systems-engineering level and how modern engines like Velox achieve major performance gains through CPU-focused design.
CPU Fundamentals: Why Execution Model Matters More Than Core Count
Modern CPUs offer tremendous hardware capabilities:
- Wide vector registers: AVX2 (256-bit), AVX-512 (512-bit), SVE (scalable vectors up to 2048 bits)
- Large, multi-level caches: L1 (32 KB/core), L2 (256 KB–1 MB/core), L3 (shared, up to 50+ MB)
- High memory bandwidth: 150–400 GB/s per socket with DDR5
- Dozens to hundreds of hardware threads
But these capabilities are only exploitable when:
- Data is laid out contiguously
- Branching is minimized
- Expression evaluators operate on uniform batches
- Memory access patterns are predictable
- Kernels are compiled, not interpreted
Legacy row-at-a-time execution violates every one of these conditions. For example:
- Each row triggers dozens of virtual method calls.
- Expressions run through interpreted bytecode.
- Objects scatter across heap memory, destroying cache locality.
- Branch-heavy logic prevents SIMD vectorization.
Thus, the bottleneck is not CPU hardware; it is how data engines feed instructions and data to the CPU.
The next sections unpack how vectorized execution, SIMD instructions, and native runtimes fix these issues.
1. Vectorized Execution: Rebuilding the Core Execution Model
Vectorized engines process data in columnar batches, typically 1,024 to 16,384 values per batch, rather than row-by-row. This creates a uniform, predictable execution pattern that is ideal for modern CPU hardware.
1.1. Columnar Layout and Memory Behavior
Columnar vector storage achieves:
- Spatial locality: adjacent values are neighbors in memory
- Cache efficiency: each L1/L2 cacheline fetch brings several values
- Prefetchability: CPU hardware prefetchers detect regular strides
- Alignment: vectors align with SIMD register widths
For example, an Arrow column of 10,000 int64 values:
| v0 | v1 | v2 | v3 | … | v9999 |
will be processed 4, 8, or up to 32 values at a time, depending on SIMD width (AVX2, AVX-512, or 2048-bit SVE, respectively).
Compare this to row-based layouts, where each value is embedded in variable-length object structures, pointer-chasing is required, and every row triggers additional memory accesses.
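To make the contrast concrete, here is a minimal sketch of the two layouts. The type names (PriceColumn, RowRecord) are illustrative, not taken from any particular engine; the point is that the columnar scan walks one contiguous array while the row scan chases a pointer per row.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Columnar layout: one contiguous array per column. A scan over `values`
// touches sequential memory, so hardware prefetchers and SIMD apply directly.
struct PriceColumn {
  std::vector<int64_t> values;  // e.g. 10,000 int64 values, back to back
};

// Row layout: every row is a separate heap object mixing all columns.
// Scanning the price now strides over unrelated fields and chases pointers.
struct RowRecord {
  std::string name;   // variable-length, heap-allocated
  int64_t price;
  double discount;
};

int64_t sum_columnar(const PriceColumn& col) {
  int64_t sum = 0;
  for (int64_t v : col.values) sum += v;  // tight, auto-vectorizable loop
  return sum;
}

int64_t sum_rows(const std::vector<RowRecord*>& rows) {
  int64_t sum = 0;
  for (const RowRecord* r : rows) sum += r->price;  // one pointer chase per row
  return sum;
}
```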
1.2. Operator Pipelines
Vectorized engines execute pipelines that operate on vectors:
Scan → Filter → Projection → Aggregation → Sort → Join
Each operator consumes columnar batches and emits new columnar batches.
Operators themselves are implemented as vector kernels:
- Filter returns a selection vector (bitmask or index list)
- Projection applies an expression tree to all values in a batch
- Aggregation applies hash or vectorized group kernels
- Sorting uses vector-based partitioning + SIMD comparisons
- Join uses vectorized hash table probing and build-side scanning
Critically, all these kernels are implemented without branching per row, enabling compilers to generate extremely tight loops.
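The sketch below shows what that batch-in/batch-out contract can look like in simplified form: an operator interface paid for once per batch rather than once per row, and a filter that builds a selection vector with a branch-free loop. Batch, Operator, and GreaterThanFilter are illustrative stand-ins, not the classes a real engine uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A columnar batch: one contiguous buffer per column (single int64 column here).
struct Batch {
  std::vector<int64_t> col;
};

// Operators consume a batch and emit a new batch. One virtual call is paid per
// batch of thousands of rows, not per row.
struct Operator {
  virtual ~Operator() = default;
  virtual Batch next(const Batch& input) = 0;
};

// Filter: evaluate the predicate over the whole batch, collect passing row
// indices into a selection vector, then compact them into the output batch.
struct GreaterThanFilter : Operator {
  int64_t threshold;
  explicit GreaterThanFilter(int64_t t) : threshold(t) {}

  Batch next(const Batch& input) override {
    std::vector<uint32_t> selection(input.col.size());
    size_t n = 0;
    for (size_t i = 0; i < input.col.size(); ++i) {
      selection[n] = static_cast<uint32_t>(i);  // always write...
      n += (input.col[i] > threshold);          // ...advance only on a match
    }
    selection.resize(n);

    Batch out;
    out.col.reserve(n);
    for (size_t k = 0; k < n; ++k) out.col.push_back(input.col[selection[k]]);
    return out;
  }
};
```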
2. SIMD Acceleration: Leveraging the CPU’s Internal Parallelism
SIMD (Single Instruction, Multiple Data) enables one CPU instruction to operate on multiple data elements simultaneously.
Examples:
- AVX2: 256-bit → process 4 x 64-bit ints or 8 x 32-bit ints
- AVX-512: 512-bit → process 8 x 64-bit ints or 16 x 32-bit ints
- SVE: scalable vectors; the width (128 to 2048 bits) is fixed by each hardware implementation, and the same code adapts to it at run time
2.1. SIMD and Branch Avoidance
The biggest enemy of SIMD utilization is per-row branching:
if (value > threshold) { ... }
Branches break vector execution because different lanes may take different paths. Vectorized engines use predication or selection vectors instead:
- Compute comparison on all values in parallel
- Produce a mask (bitset)
- Apply downstream operations only on selected lanes
This structure eliminates per-row branching entirely.
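As an illustration, the following AVX2 sketch (compile with -mavx2) evaluates a greater-than predicate four int64 lanes at a time, converts the comparison result into a lane mask, and appends matching indices to a selection vector. It is a simplified example; production kernels typically replace the small per-lane loop with compress instructions or lookup tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // AVX2 intrinsics
#include <vector>

// Predicated filter: the predicate is evaluated for all 4 lanes in parallel and
// the resulting bitmask drives selection; no data-dependent branch decides what
// an individual row does.
std::vector<uint32_t> select_greater_than(const int64_t* values, size_t n,
                                          int64_t threshold) {
  std::vector<uint32_t> selection;
  selection.reserve(n);
  const __m256i thresh = _mm256_set1_epi64x(threshold);

  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(values + i));
    __m256i cmp = _mm256_cmpgt_epi64(v, thresh);              // lane-wise compare
    int mask = _mm256_movemask_pd(_mm256_castsi256_pd(cmp));  // 4-bit lane mask
    for (int lane = 0; lane < 4; ++lane) {
      if (mask & (1 << lane)) selection.push_back(static_cast<uint32_t>(i + lane));
    }
  }
  for (; i < n; ++i) {                                        // scalar tail
    if (values[i] > threshold) selection.push_back(static_cast<uint32_t>(i));
  }
  return selection;
}
```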
2.2. SIMD in Common Operators
Filters
- value > const is translated into a SIMD compare instruction
- Mask from compare feeds into the selection vector
Projections
Expression trees are fused (via IR) into tight kernels. Example:
(c1 * 5 + c2) / c3
Compiled to:
- Load vector c1 → apply vector multiply
- Load vector c2 → add
- Load vector c3 → divide using SIMD-friendly division
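A hand-written stand-in for such a fused kernel might look like the following; a real engine would generate or specialize this per expression tree and add null handling and divide-by-zero checks that are omitted here.

```cpp
#include <cstddef>
#include <cstdint>

// Fused kernel for (c1 * 5 + c2) / c3 over one batch. Instead of evaluating
// three separate expression nodes (each materializing an intermediate vector),
// the whole tree runs in a single stride-1, branch-free pass that the compiler
// can auto-vectorize. Note: x86 has no native SIMD integer divide, which is why
// engines substitute "SIMD-friendly" division (reciprocal multiplication or
// floating-point evaluation) in practice.
void project_c1x5_plus_c2_div_c3(const int64_t* c1, const int64_t* c2,
                                 const int64_t* c3, int64_t* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = (c1[i] * 5 + c2[i]) / c3[i];  // assumes c3 has no zeros or nulls
  }
}
```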
Aggregations
Vectorized:
- min, max, sum, count use horizontal SIMD reductions
- Group-by uses SIMD hash probes and vector-level batching
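As an example of the horizontal reductions mentioned above, a SUM aggregate over an int64 column can be sketched with AVX2 intrinsics as follows (illustrative only; engine implementations also handle nulls and overflow):

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // AVX2 intrinsics

// Accumulate 4 partial sums in parallel, then collapse them with a horizontal
// reduction once at the end of the batch.
int64_t sum_int64(const int64_t* values, size_t n) {
  __m256i acc = _mm256_setzero_si256();
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(values + i));
    acc = _mm256_add_epi64(acc, v);              // 4 independent partial sums
  }
  // Horizontal reduction: 4 lanes -> 2 lanes -> 1 scalar.
  __m128i lo = _mm256_castsi256_si128(acc);
  __m128i hi = _mm256_extracti128_si256(acc, 1);
  __m128i s2 = _mm_add_epi64(lo, hi);
  int64_t sum = _mm_extract_epi64(s2, 0) + _mm_extract_epi64(s2, 1);
  for (; i < n; ++i) sum += values[i];           // scalar tail
  return sum;
}
```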
Joins
Vectorized:
- Hash table probe with vectorized comparisons
- Use bitmap operations for accelerating match detection
SIMD turns each CPU core into a tiny parallel processor, dramatically increasing throughput.
3. Native Execution: Escaping the JVM for C++ Efficiency
JVM-bound data engines (Spark, Flink, Presto) face inherent challenges:
- Object allocation
- Function pointer indirection
- Garbage collection
- Branch-heavy interpreted expression evaluation
Native execution replaces these bottlenecks with:
- Compiled C++ kernels
- Fixed-width columnar data via Arrow
- Operator fusion
- Tight loops with predictable control flow
- Explicit memory management
3.1. Substrait: The Intermediate Representation
Substrait IR is a hardware-agnostic, language-neutral representation of a query plan:
- Standardizes expressions, operators, and types
- Allows different engines (Spark, Trino, DuckDB-like engines) to share a common execution layer
- Enables optimized C++ backends like Velox
A Spark query becomes:
Spark Catalyst Logical Plan → Substrait → Velox Physical Execution
3.2. Gluten: Bridging Spark with Native Engines
Gluten uses:
- Catalyst → Substrait translation
- Apache Arrow for zero-copy columnar data exchange
- JNI to transfer columnar vectors between JVM and C++
- Velox or other native engines for actual execution
This allows Spark to maintain its distributed scheduler while offloading execution to native C++.
3.3. Velox: A Modern CPU-Native Execution Engine
Velox is a C++ library providing:
- Vectorized columnar data model
- Expression evaluation subsystem
- Function registry and expression-level optimizations (constant folding, common subexpression elimination)
- SIMD-friendly operator kernels
- Specialized memory allocators
- Lazy materialization + dictionary encoding
It acts as a shared execution substrate for multiple engines.
4. Deep Dive: Operator Implementations in Native Engines
4.1. Filters
Velox filter kernel:
- Load 1024 values into SIMD registers
- Compare against predicate (single SIMD instruction)
- Produce a selection mask
- Compress pass-through values into output vector
Extremely efficient, branch-free, and cache-friendly.
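The following AVX-512 sketch (requires AVX-512F) walks the same four steps for an int64 column and a greater-than predicate; it illustrates the technique and is not Velox's actual kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // AVX-512F intrinsics

// 1. load 8 values, 2. one SIMD compare, 3. a lane mask, 4. a compress-store
// that packs only the passing values into the output - no per-row branch.
size_t filter_gt(const int64_t* values, size_t n, int64_t threshold,
                 int64_t* out) {
  const __m512i thresh = _mm512_set1_epi64(threshold);
  size_t out_count = 0;
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m512i v = _mm512_loadu_si512(values + i);                // 1. load
    __mmask8 m = _mm512_cmpgt_epi64_mask(v, thresh);           // 2.+3. mask
    _mm512_mask_compressstoreu_epi64(out + out_count, m, v);   // 4. compact
    out_count += static_cast<size_t>(__builtin_popcount(static_cast<unsigned>(m)));
  }
  for (; i < n; ++i) {                                         // scalar tail
    if (values[i] > threshold) out[out_count++] = values[i];
  }
  return out_count;
}
```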
4.2. Aggregations
Native engines use:
- Hash-based group-by: SIMD probe + vectorized build
- Dictionary-encoded aggregates: operate on the dictionary instead of raw values
- Partial aggregation: runs before global shuffle to reduce data volume
- Final aggregation: merges partial results with SIMD kernels
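As a small illustration of the dictionary idea, a SUM over a dictionary-encoded column can accumulate per-entry counts and touch each distinct value only once. The function below is a simplified sketch that ignores nulls and overflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Each row stores a small index into a dictionary of distinct values. Instead
// of decoding every row, count how often each dictionary entry occurs, then
// combine the counts with the (tiny) dictionary.
int64_t sum_dictionary_encoded(const std::vector<int32_t>& indices,
                               const std::vector<int64_t>& dictionary) {
  std::vector<int64_t> counts(dictionary.size(), 0);
  for (int32_t idx : indices) counts[idx] += 1;     // touches only a small array

  int64_t sum = 0;
  for (size_t d = 0; d < dictionary.size(); ++d) {
    sum += dictionary[d] * counts[d];               // one multiply per distinct value
  }
  return sum;
}
```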
4.3. Joins
Join optimizations include:
- SIMD-accelerated hash table lookups
- Prefetching hash buckets based on vector probes
- Multi-stage probers to reduce branch mispredicts
- Vectorized null-handling and equality checks
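The probe-side batching and prefetching idea can be sketched as follows. This uses a toy open-addressing table and GCC/Clang's __builtin_prefetch, and is a simplification of what production engines do.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Probe phase of a hash join over a batch of keys: hash the whole batch first,
// issue prefetches for the buckets it will touch, then probe. By the time the
// probe loop reaches a bucket, it is likely already in cache.
struct Bucket {
  int64_t key;
  int32_t payload;
  bool occupied;
};

void probe_batch(const std::vector<Bucket>& table,   // power-of-two sized
                 const int64_t* keys, size_t n,
                 std::vector<int32_t>& matches) {
  const size_t mask = table.size() - 1;
  std::vector<size_t> slots(n);

  // Pass 1: hash every key in the batch and prefetch its bucket.
  for (size_t i = 0; i < n; ++i) {
    size_t h = static_cast<size_t>(keys[i]) * 0x9E3779B97F4A7C15ULL;  // mix
    slots[i] = h & mask;
    __builtin_prefetch(&table[slots[i]]);
  }

  // Pass 2: probe with linear probing (kept minimal for clarity).
  for (size_t i = 0; i < n; ++i) {
    size_t s = slots[i];
    while (table[s].occupied) {
      if (table[s].key == keys[i]) { matches.push_back(table[s].payload); break; }
      s = (s + 1) & mask;
    }
  }
}
```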
4.4. Sort and Merge
- Vectorized radix sort
- SIMD comparisons
- Cache-aware partitioning
- Multiway merge performed in batches
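To show the core of the radix-sort approach, here is a single partition pass (histogram, prefix sum, scatter) over one 8-bit digit; a full LSD radix sort simply repeats this pass for each digit. It is a sketch for unsigned keys, not a complete sort implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One pass of an LSD radix sort: count keys per 256-way bucket for one digit,
// turn the counts into starting offsets, then scatter rows into partitions.
// The loops are branch-light and sequential, which suits CPU caches well.
void radix_partition_pass(const std::vector<uint64_t>& in,
                          std::vector<uint64_t>& out, int digit /* 0..7 */) {
  const int shift = digit * 8;
  std::vector<size_t> hist(256, 0);
  for (uint64_t v : in) hist[(v >> shift) & 0xFF]++;               // histogram

  std::vector<size_t> offset(256, 0);                              // exclusive prefix sum
  for (int b = 1; b < 256; ++b) offset[b] = offset[b - 1] + hist[b - 1];

  out.resize(in.size());
  for (uint64_t v : in) out[offset[(v >> shift) & 0xFF]++] = v;    // scatter
}
```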
5. Memory Management in Native Engines
Native engines use custom allocators:
- Arena allocators: batch allocate memory for vector batches
- NUMA-aware pools: pin memory to local NUMA node
- Zero-copy Arrow buffers: minimize serialization
They avoid:
- Per-row object creation
- Paging overhead
- GC pressure
This ensures deterministic, low-latency memory behavior.
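A minimal sketch of the arena ("bump") allocation pattern is shown below; real engines layer size classes, NUMA-aware pools, and memory accounting on top of something like this.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Arena allocator: grab large chunks once, then hand out slices with pointer
// arithmetic. Allocation is a few instructions, everything for a batch is freed
// at once, and nothing is visible to a garbage collector.
// Assumes: bytes <= chunk_size_, alignment is a power of two, and the chunk
// base itself is only max_align_t-aligned in this sketch.
class Arena {
 public:
  explicit Arena(size_t chunk_size = 1 << 20) : chunk_size_(chunk_size) {}

  void* allocate(size_t bytes, size_t alignment = 16) {
    size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
    if (chunks_.empty() || aligned + bytes > chunk_size_) {
      chunks_.push_back(std::make_unique<uint8_t[]>(chunk_size_));
      aligned = 0;
    }
    offset_ = aligned + bytes;
    return chunks_.back().get() + aligned;
  }

  void reset() {            // release everything for the batch at once
    chunks_.clear();
    offset_ = 0;
  }

 private:
  size_t chunk_size_;
  size_t offset_ = 0;
  std::vector<std::unique_ptr<uint8_t[]>> chunks_;
};
```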
6. Putting It All Together: A Modern CPU-Native Execution Stack
A typical CPU-native execution architecture looks like:
Distributed Scheduler (Spark, Trino, Flink)
↓
Substrait IR
↓
Gluten
↓
Arrow Columnar Batches
↓
Velox
[Vectorized + SIMD execution]
Key properties:
- Compiled, vectorized kernels
- SIMD-optimized hot loops
- Zero-copy memory movement
- CPU cache–aligned execution
This architecture achieves significant speedups, often 2×–3× on TPC-style benchmarks, without rewriting queries or pipelines.
7. How to Transition to CPU-Native Acceleration
- Enable vectorized/native modes in Spark, Trino, or other engines.
- Benchmark with representative workloads.
- Inspect operator coverage (operators that fall back to the default JVM engine reduce the gains).
- Tune batch sizes to optimize prefetching and SIMD use.
- Pin memory and executors to NUMA nodes.
- Monitor metrics such as vectorized operator hit rate, batch sizes, and SIMD utilization.
Conclusion: The Untapped Power of CPUs
Modern CPUs are extremely capable analytical processors; all they need is an execution model that feeds them data in a form they can exploit. Vectorization provides the structure. SIMD provides the parallelism. Native execution provides efficiency.
Together, they redefine what CPU-based data processing can deliver.
Organizations can unlock major speedups using the hardware they already own: no changes to SQL, no new languages, no application rewrites. Just better engines.