
The future of Spark is heterogeneous


Introduction

Apache Spark’s CPU-centric design is hitting its limits. Built for horizontal scaling, it now struggles as data volumes soar, workloads diversify, and cloud costs rise. CPUs, though reliable, are capped by modest core counts (2–128), limited memory bandwidth (100–300 GB/s), and orchestration overhead. Even with vectorization, SIMD, and native execution via Apache Gluten and Velox, gains are modest at 2–3×.

GPUs, with 2,000–10,000+ cores and 300 GB/s–3 TB/s bandwidth, promise massive parallelism but remain underused due to fragmented integration and memory limits. The result: a widening gap between Spark’s CPU roots and modern AI-scale workloads.

Closing it demands a hybrid execution model that unites CPU control with GPU compute, eliminating data-movement inefficiencies and scaling performance for the next generation of data and AI processing.

The Problem: CPU-Centric Spark Architectures Have Reached Their Scalability Limit

Spark was originally designed to exploit commodity CPUs through horizontal scaling. That design is now proving inadequate for current and emerging workloads, as CPU-only execution saturates and its inefficiencies compound.

1. CPU Architecture Bottlenecks

  • CPUs have limited core counts (tens of cores), which caps the degree of parallel execution.
  • CPU cores and threads are designed for the parallel execution of coarse-grained tasks that apply different instructions to different data, not for fine-grained data parallelism.
  • Bandwidth between the CPU and its caches/memory lags considerably behind that of processors built for high-throughput data processing, such as GPUs.

2. Von Neumann Bottleneck

At a large scale, Spark clusters spend 40–70% of execution time moving data to/from the processing element or node. Every Spark stage typically involves moving data between the CPU cores and memory/storage within a Spark node and shuffling data among the nodes in the Spark cluster.

When organizations scale clusters out further to overcome the CPU-memory constraints of individual nodes, network latency and throughput for remote data shuffles come to dominate execution time, yielding diminishing returns as cluster size and cost grow.
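To make the shuffle cost concrete, here is a minimal PySpark sketch of a wide transformation that forces a full repartition of the data. The dataset size, key cardinality, and output path are illustrative, not taken from any benchmark.

```python
# Minimal PySpark sketch: a wide aggregation that forces a full shuffle.
# Dataset size and key cardinality are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# A narrow transformation (the modulo projection) runs where data lives...
events = spark.range(100_000_000).withColumn("key", F.col("id") % 100_000)

# ...but groupBy repartitions every row by key, moving data across CPU
# cores and memory within a node and over the network between executors.
counts = events.groupBy("key").agg(F.count("*").alias("cnt"))
counts.write.mode("overwrite").parquet("/tmp/shuffle_demo")
```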

3. Cloud Cost Explosion

Cloud vendors offer customers no real alternative, so they are forced to scale out clusters that operate at ever-lower efficiency. To manage the resulting complexity, many adopt serverless cloud solutions, at a cost premium.

The Shift from Scaling Out to Scaling Smart

These bottlenecks have pushed the Spark ecosystem to seek efficiency gains not just through horizontal scaling, but through deeper architectural innovation. Rather than adding more nodes, the new frontier extracts more performance per CPU cycle. 

This has given rise to vectorization, SIMD acceleration, and native execution frameworks that move computation closer to the hardware, bypassing JVM overhead. Projects like Apache Gluten and Velox embody this evolution, bridging Spark’s distributed model with high-performance C++ execution. 

Yet, even as they push CPUs to their architectural limits, the next leap in performance will require going beyond CPUs altogether.

Vectorization, SIMD, and Native Execution Still Need an Extra Boost from GPUs

Apache Gluten and Velox form a native execution stack that modernizes data processing through vectorization, SIMD acceleration, and JIT-compiled native execution. Gluten bridges distributed query engines like Apache Spark with low-level, hardware-optimized frameworks by translating Spark’s plans into Substrait IR, which Velox executes natively.

Built in C++, Velox runs vectorized operators using SIMD instructions to process multiple data elements in parallel, bypassing the JVM and reducing interpretation, serialization, and garbage collection overhead. The stack’s innovation lies in Velox’s columnar vector processing model, which aligns with modern CPU cache hierarchies for high throughput. Gluten orchestrates translation and operator mapping, while Velox executes through precompiled or JIT-optimized kernels. 

Together, they create a tightly integrated native execution layer that retains Spark compatibility yet achieves C++-level performance, leveraging Arrow-based zero-copy data sharing and efficient memory pooling to deliver a 2.5–3× speedup on compute-heavy workloads.
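For readers who want to see what this looks like in practice, the sketch below enables Gluten’s Velox backend on a Spark session. The plugin class and config keys follow the Apache Gluten getting-started guide, but they vary across releases, so treat them as illustrative and check the docs for your version.

```python
# Hedged sketch: enabling Gluten's Velox backend on a Spark session.
# Exact plugin class and settings vary by Gluten release.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gluten-velox-demo")
    # Load Gluten, which intercepts Spark plans, translates them to
    # Substrait IR, and hands supported operators to Velox.
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
    # Columnar shuffle keeps data in columnar form across stages.
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
    # Velox manages its own memory off the JVM heap.
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "20g")
    .getOrCreate()
)
```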

Enter GPUs

Going beyond the 2.5–3× speedup of the CPU-only stack is necessary to counter the growing mismatch between CPU scaling and data growth, as well as the diversification of modern analytic workloads. That requires a heterogeneous stack that includes GPUs, FPGAs, and other hardware.

GPUs, with their massively parallel architectures and high-bandwidth memory, offer a compelling path forward for processing thousands of data elements simultaneously. This enables dramatic acceleration for compute-heavy operations such as joins, aggregations, and model inference, unlocking order-of-magnitude performance gains over traditional CPU-only systems. A study by NVIDIA showed that the NVIDIA Decision Support benchmark (NDS), a variant of TPC-DS, finished in 34 minutes on a four-node GPU cluster with 32 vCPUs, 120 GB RAM, and 8× T4 GPUs, compared to 184 minutes on a four-node CPU cluster with 32 vCPUs and 120 GB RAM: roughly a 5× speedup.
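As a reference point for how this kind of GPU offload is switched on today, here is a hedged sketch of attaching the NVIDIA RAPIDS Accelerator to a Spark session. The plugin class and resource keys follow the spark-rapids documentation, but the amounts are illustrative and depend on your cluster.

```python
# Hedged sketch: attaching the NVIDIA RAPIDS Accelerator to Spark.
# Resource amounts are illustrative; tune them for your hardware.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-demo")
    # The RAPIDS plugin rewrites supported SQL operators to GPU versions
    # and falls back to the CPU for unsupported ones.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # Tell Spark's scheduler each executor owns one GPU, shared by tasks.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)
```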

Velox and Gluten currently lack native GPU execution support. Their execution model and memory management stack are heavily CPU-centric, optimized around x86 and ARM SIMD instructions rather than CUDA or ROCm kernels. Enabling GPU acceleration would require three things: an extensible device abstraction layer; GPU-aware memory management (using unified or pinned memory) that lets the CPU and GPU share a single address space and reduces explicit data transfers; and an execution backend that can offload vectorized operators to GPU kernels while preserving Substrait compatibility. Integration with frameworks such as NVIDIA RAPIDS, Arrow CUDA buffers, or OpenCL-based compute adapters could provide that bridge. Until such extensions are in place, Gluten + Velox remain primarily CPU-optimized frameworks: highly efficient for vectorized native execution, but not yet capable of harnessing the full potential of heterogeneous compute environments.
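To illustrate the shape of the missing extension point, here is a purely hypothetical Python sketch of the device abstraction layer described above. None of these classes exist in Gluten or Velox today; the names and signatures are invented for illustration.

```python
# Purely hypothetical: a device abstraction layer of the kind described
# above. Names and signatures are invented; nothing here exists in
# Gluten or Velox today.
from abc import ABC, abstractmethod
from typing import Any, List

class DeviceBackend(ABC):
    """One implementation per target device: CPU SIMD, CUDA, ROCm, ..."""

    @abstractmethod
    def allocate(self, nbytes: int) -> Any:
        """Allocate device-visible memory (pinned or unified when the CPU
        and GPU share an address space, avoiding explicit copies)."""

    @abstractmethod
    def run_operator(self, substrait_plan: bytes,
                     inputs: List[Any]) -> List[Any]:
        """Execute one Substrait-described vectorized operator here."""

class CpuBackend(DeviceBackend):
    """Today's Velox path: precompiled or JIT-optimized SIMD kernels."""
    def allocate(self, nbytes: int) -> Any:
        return bytearray(nbytes)  # stand-in for a native memory pool
    def run_operator(self, substrait_plan: bytes,
                     inputs: List[Any]) -> List[Any]:
        raise NotImplementedError("dispatch to native Velox kernels")

class CudaBackend(DeviceBackend):
    """Would offload to GPU kernels, e.g. via RAPIDS/libcudf or Arrow CUDA."""
    def allocate(self, nbytes: int) -> Any:
        raise NotImplementedError("allocate pinned/unified GPU memory")
    def run_operator(self, substrait_plan: bytes,
                     inputs: List[Any]) -> List[Any]:
        raise NotImplementedError("launch GPU kernels, keep Substrait compat")
```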

Naive GPU Integration Fails

Naive GPU integration suffers from three problems:

(1) Fragmented Execution: Most GPU acceleration frameworks integrate at the individual Spark operator level, requiring frequent data transfers between CPU and GPU. For every operator, data must be copied from CPU RAM to GPU memory, a GPU kernel must be launched, and then the results are copied back to the CPU for the next stage. This constant “ping-pong” of data across PCIe or NVLink buses significantly diminishes the benefits of parallel GPU compute. Even if a GPU kernel executes 20× faster than a CPU, the overall gain can be negligible or even negative if data movement overhead consumes 80% or more of total execution time.
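A quick back-of-the-envelope calculation shows why. In the sketch below, only the compute portion of the job accelerates, while transfer overhead equal to 80% of the original runtime remains; all numbers are illustrative.

```python
# Back-of-the-envelope model of the claim above. Only the compute portion
# of the job accelerates; data movement does not.

def effective_speedup(kernel_speedup: float, transfer_fraction: float) -> float:
    """Estimated end-to-end speedup when `transfer_fraction` of the
    original runtime is CPU<->GPU data movement and only the remainder
    runs on the (faster) GPU."""
    compute = (1.0 - transfer_fraction) / kernel_speedup
    return 1.0 / (compute + transfer_fraction)

# 20x faster kernels, but copies cost 80% of the original runtime:
print(f"{effective_speedup(20.0, 0.80):.2f}x")  # ~1.23x: negligible gain
```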

(2) GPU Memory Thrashing: While GPU high-bandwidth memory (HBM) is extremely fast, it is also limited in capacity, typically 40–80 GB compared to 256–2,048 GB of CPU RAM. Without intelligent memory management, large datasets quickly overflow GPU memory, forcing data to spill back to the CPU or even to disk. This thrashing saturates the PCIe and NVLink links, causing GPU starvation and performance stalls that negate the expected acceleration benefits.
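The capacity mismatch is easy to quantify. The hypothetical sketch below estimates the time spent purely shuttling an oversized working set across the CPU-GPU bus, assuming PCIe Gen4 x16 bandwidth of roughly 32 GB/s; all sizes are illustrative.

```python
# Hypothetical estimate of pure data-movement time when a working set
# overflows GPU HBM. Capacities and bus bandwidth are illustrative.

def spill_transfer_seconds(working_set_gb: float,
                           hbm_gb: float = 80.0,
                           bus_gb_per_s: float = 32.0) -> float:
    """Seconds spent only shuttling the overflow across the CPU-GPU bus
    (PCIe Gen4 x16 is roughly 32 GB/s), ignoring all compute."""
    overflow_gb = max(0.0, working_set_gb - hbm_gb)
    # Each spilled gigabyte crosses the bus at least twice: out, then back.
    return 2.0 * overflow_gb / bus_gb_per_s

print(f"{spill_transfer_seconds(1024):.0f} s")  # 1 TB working set -> ~59 s
```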

(3) GPU-Oblivious Resource Management: Naive GPU acceleration solutions still rely on Spark's physical planning, task distribution, shuffle coordination, memory management, resource management, etc., which do not consider GPU resources and capabilities. 

Together, these problems leave naive implementations making suboptimal use of GPU parallelism, GPU memory, and the CPU-GPU I/O interfaces. Spark's capabilities should be extended to factor CPU and GPU compute, memory, and I/O resources into its planning, task distribution, and resource management mechanisms and policies.

Why Special-Purpose GPU Acceleration Attempts Have Failed

Several prior systems have attempted to bolt GPUs onto Spark without rethinking the execution model, leading to lackluster results. For example, NVIDIA RAPIDS provides partial operator coverage and falls back to CPUs for string processing, User Defined Functions (UDFs), or complex expressions.

Ad-hoc use of GPUs to offload a subset of operators, bolting GPUs onto a CPU-centric execution model, and GPU-centric execution oblivious to CPU capabilities have only scratched the surface. None has taken a holistic view of the server hardware and software at all layers of the system to develop a comprehensive solution.

GPUs Alone Are Not Enough

While GPUs offer massive theoretical throughput, simply adding GPUs to Spark clusters does not guarantee acceleration. Many organizations have attempted GPU adoption through frameworks like RAPIDS, only to see marginal gains or even performance regressions.

Neither CPUs nor GPUs can individually handle the full spectrum of modern data processing workloads; each excels in different dimensions but falls short in others. CPUs offer larger memory capacity, deeper cache hierarchies, and superior support for branching logic, I/O orchestration, and network throughput, making them ideal for control-heavy, latency-sensitive, or metadata-driven operations such as query planning, shuffling, and complex UDFs. They also provide a richer programming model with mature ecosystems, enabling flexible task scheduling and better fault tolerance.

However, CPUs struggle with massive data-parallel operations where the same computation must be applied to billions of data elements, and this is where GPUs shine. GPUs deliver massively parallel execution with thousands of lightweight cores and high-bandwidth memory (HBM), excelling in compute-bound stages like joins, aggregations, and vectorized ML inference. 

Yet, their smaller memory capacity, higher data transfer costs, and limited concurrency with I/O make them less efficient for orchestration-heavy workloads. 

The Industry is Facing a Compute Economics Crisis

  • Data volumes are doubling every 12 months.
  • Data processing workloads for Analytics and GenAI are increasing at a rapid pace.
  • Cloud providers are charging premium rates for compute and I/O heavy workloads.
  • Traditional Spark scaling is not sustainable for emerging workloads or over the long term.

A new execution model is required: one that (1) leverages the strengths of all processing elements, (2) employs software-hardware co-design for whole-system optimization, (3) drives down cost, and (4) is built from widely available components.

The future of high-performance data systems lies in hybrid architectures that leverage CPUs for coordination, data movement, and mixed workloads, while using GPUs for parallel computation and acceleration. Together, they can deliver an optimal balance of throughput, cost efficiency, and scalability that neither architecture can achieve alone.
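As a thought experiment, that division of labor can be captured in a tiny placement rule. The sketch below is hypothetical: the operator taxonomy, row threshold, and function names are invented for illustration, not drawn from Spark or any shipping planner.

```python
# Hypothetical placement rule for a hybrid CPU/GPU planner. The operator
# taxonomy, threshold, and names are invented for illustration only.

DATA_PARALLEL_OPS = {"hash_join", "aggregate", "sort", "ml_inference"}

def place_operator(op: str, rows: int, fits_gpu_memory: bool) -> str:
    """Send an operator to the GPU only when it is data-parallel, large
    enough to amortize CPU<->GPU transfer cost, and fits in GPU memory;
    control-heavy or small work stays on the CPU."""
    if op in DATA_PARALLEL_OPS and rows > 10_000_000 and fits_gpu_memory:
        return "GPU"
    return "CPU"

print(place_operator("hash_join", rows=500_000_000, fits_gpu_memory=True))  # GPU
print(place_operator("udf", rows=500_000_000, fits_gpu_memory=True))        # CPU
```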

The solution requires rethinking Spark from the ground up, as a truly heterogeneous engine.
