Apache Spark Resource Configuration

From Theory to Practice

Ran Reichman

February 5, 2025

Apache Spark's resource configuration remains one of the most challenging aspects of operating data pipelines at scale. Theoretical best practices are widely available, but production deployments often require adjustments to accommodate real-world constraints. This guide bridges that gap, exploring how to properly size Spark resources—from executors to partitions—while identifying common failure patterns and strategies to address them in production.

The Baseline Configuration

Consider a typical Spark job processing 1TB of data. A standard recommended setup might include:

A cluster of 20 nodes, each with 32 cores and 256GB RAM
Effective capacity of 28 cores and 240GB RAM per node after system overhead
4 executors per node (80 total executors)
7 cores per executor (with 1 core reserved for overhead)
56GB RAM per executor
~128MB partition sizes for optimal parallelism
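The arithmetic behind this baseline can be sketched in a few lines of Python. This is only a back-of-the-envelope check, assuming "1TB" means 1 TiB and that memory is split evenly between executors on a node; the function and variable names are illustrative and not part of any Spark API.

```python
import math

# Figures taken from the baseline list above.
NODES = 20
USABLE_CORES_PER_NODE = 28
USABLE_RAM_GB_PER_NODE = 240
EXECUTORS_PER_NODE = 4
PARTITION_MB = 128
INPUT_MB = 1024 * 1024  # 1 TB, assumed here to mean 1 TiB expressed in MiB


def baseline_plan():
    """Derive executor and partition counts from the baseline figures."""
    total_executors = NODES * EXECUTORS_PER_NODE
    cores_per_executor = USABLE_CORES_PER_NODE // EXECUTORS_PER_NODE
    # Even split of usable node memory; heap plus overhead must fit inside it.
    ram_share_gb = USABLE_RAM_GB_PER_NODE / EXECUTORS_PER_NODE
    task_slots = total_executors * cores_per_executor
    partitions = math.ceil(INPUT_MB / PARTITION_MB)
    return {
        "total_executors": total_executors,
        "cores_per_executor": cores_per_executor,
        "ram_share_gb": ram_share_gb,
        "task_slots": task_slots,
        "partitions": partitions,
        # Rounds of tasks needed if every slot stays busy.
        "waves": partitions / task_slots,
    }


plan = baseline_plan()
for name, value in plan.items():
    print(f"{name}: {value}")
```

Under these assumptions the job has 8,192 partitions spread over 560 task slots, or roughly 15 waves of tasks. The even split gives each executor a 60GB share, so the 56GB figure in the list leaves only a few GB of headroom inside that share.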

While this configuration serves as a solid starting point, production workloads rarely conform to such clean boundaries. Let's examine some common failure patterns and mitigation strategies.

When Reality Hits: Failure Patterns and Solutions

Failure Pattern #1: Workload Evolution Requiring Infrastructure Changes

A typical scenario: A job that previously ran efficiently on 20 nodes begins to experience increasing memory pressure or extended runtimes, despite configuration adjustments. Signs of resource constraints include:

Consistently high GC time across executors (>15% of executor runtime)
Storage fraction frequently dropping below 0.3
Executor memory usage consistently above 85%
Stage attempts failing despite conservative memory settings

Root cause analysis approach:

Analyze growth patterns in your data volume and complexity.
Profile representative jobs to understand resource bottlenecks.

Key scaling triggers:

CPU-bound: When average CPU utilization stays above 80% for most of the job duration.
Memory-bound: When GC time exceeds 15% or OOM errors occur despite tuning.
I/O-bound: When shuffle spill exceeds 20% of executor memory.
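These triggers can be turned into a simple check. The sketch below is a minimal illustration that uses the thresholds listed above; the `JobMetrics` fields are hypothetical names for numbers you would collect yourself from the Spark UI or your monitoring system, not fields of any Spark API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobMetrics:
    """Hypothetical summary of one job run, gathered from your own monitoring."""
    avg_cpu_utilization: float  # 0.0 to 1.0, averaged over the job duration
    gc_time_fraction: float     # GC time divided by executor runtime
    oom_errors: int             # number of OOM failures observed
    shuffle_spill_gb: float     # shuffle spill per executor
    executor_memory_gb: float   # configured executor memory


def scaling_triggers(m: JobMetrics) -> List[str]:
    """Return every trigger from the list above that the metrics satisfy."""
    triggers = []
    if m.avg_cpu_utilization > 0.80:
        triggers.append("cpu-bound")
    if m.gc_time_fraction > 0.15 or m.oom_errors > 0:
        triggers.append("memory-bound")
    if m.shuffle_spill_gb > 0.20 * m.executor_memory_gb:
        triggers.append("io-bound")
    return triggers


print(scaling_triggers(JobMetrics(0.90, 0.05, 0, 2.0, 56.0)))   # ['cpu-bound']
print(scaling_triggers(JobMetrics(0.50, 0.20, 0, 20.0, 56.0)))  # ['memory-bound', 'io-bound']
```

A job can hit more than one trigger at once, which is why the function returns a list rather than a single label.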

If CPU-bound (high CPU utilization, low wait times):

First try increasing cores per executor.
If insufficient, add nodes while maintaining a similar cores/node ratio.

If memory-bound (Out Of Memory - OOM):

First try reducing executors per node to allocate more memory per executor.
If insufficient, add nodes with higher memory configurations.

Failure Pattern #2: Memory Exhaustion In Compute Heavy Operations

A typical scenario: Your job runs fine for many days but then suddenly fails with Out Of Memory (OOM) errors. Investigation reveals that during month-end processing, certain joins produce intermediate results 5-10x larger than your input data. The executor memory gets exhausted trying to handle these large shuffles.

A possible solution would be to update the configuration to:

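spark.executor.memoryOverheadFactor: 0.25 (increased from the default 0.10; before Spark 3.3, set spark.executor.memoryOverhead to an absolute size of about 25% of executor memory instead)
spark.memory.fraction: 0.75 (increased from the default 0.6)

These settings help because they:

Reserve more memory for off-heap operations (shuffles, network buffers)
Enlarge the unified region shared by execution and storage, so large joins and shuffles have more room before they spill
Can be combined with a lower spark.memory.storageFraction (default 0.5) when cached data is crowding out execution

The trade-off is a smaller share of the heap for user data structures, so re-check GC time after the change.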
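To see what these two settings change in practice, the executor memory layout can be worked out by hand. The sketch below assumes Spark's unified memory model (a fixed 300MB reservation, then `spark.memory.fraction` of the remaining heap shared by execution and storage), treats the overhead as a fraction of the heap with a 384MB floor, and borrows the 56GB heap from the baseline. The function name and the returned keys are illustrative.

```python
RESERVED_MB = 300      # fixed reservation inside the executor heap
MIN_OVERHEAD_MB = 384  # floor applied to executor memory overhead


def executor_memory_layout(heap_gb, memory_fraction, overhead_factor,
                           storage_fraction=0.5):
    """Approximate how one executor's memory is divided, in GB."""
    heap_mb = heap_gb * 1024
    usable_mb = heap_mb - RESERVED_MB
    unified_mb = usable_mb * memory_fraction  # execution + storage
    overhead_mb = max(MIN_OVERHEAD_MB, heap_mb * overhead_factor)
    return {
        "unified_gb": unified_mb / 1024,
        "storage_protected_gb": unified_mb * storage_fraction / 1024,
        "user_gb": (usable_mb - unified_mb) / 1024,
        "overhead_gb": overhead_mb / 1024,
        # What the cluster manager must actually grant per executor.
        "container_gb": (heap_mb + overhead_mb) / 1024,
    }


default = executor_memory_layout(56, memory_fraction=0.6, overhead_factor=0.10)
tuned = executor_memory_layout(56, memory_fraction=0.75, overhead_factor=0.25)

for name in default:
    print(f"{name}: {default[name]:.1f} -> {tuned[name]:.1f}")
```

With a 56GB heap, the tuned settings grow the unified region from about 33GB to about 42GB, but the per-executor request rises to 70GB. That no longer fits four executors into 240GB of usable node memory, so in the baseline cluster this change would have to be paired with a smaller heap or fewer executors per node.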

Failure Pattern #3: Data Skew, The Silent Killer

A typical scenario: Your daily aggregation job suddenly takes 4 hours instead of 1 hour. Investigation shows that 90% of the data is going to 10% of the partitions. Common culprits:

Timestamp-based keys clustering around business hours
Geographic data concentrated in major cities
Business IDs with vastly different activity levels

Before implementing solutions, quantify your skew:

Monitor partition sizes through the Spark UI
Track duration variation across tasks within the same stage
Look for orders of magnitude differences in partition sizes
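One way to put numbers on these observations is to export per-partition (or per-task) sizes from the Spark UI and summarize them. The following is a minimal sketch with a hypothetical `skew_report` helper; the two metrics it computes are one reasonable choice, not a standard Spark measure.

```python
import statistics


def skew_report(partition_sizes, top_share=0.10):
    """Summarize skew from a list of partition sizes (bytes, rows, or seconds)."""
    sizes = sorted(partition_sizes, reverse=True)
    total = sum(sizes)
    top_n = max(1, round(len(sizes) * top_share))
    median = statistics.median(sizes)
    return {
        # How much larger the worst partition is than a typical one.
        "max_over_median": sizes[0] / median if median else float("inf"),
        # Fraction of all data held by the largest `top_share` of partitions.
        "top_share_of_data": sum(sizes[:top_n]) / total if total else 0.0,
    }


# Illustrative data: 10 hot partitions and 90 small ones.
skewed = [9000] * 10 + [111] * 90
print(skew_report(skewed))

# A perfectly even layout for comparison.
print(skew_report([100] * 100))
```

In the skewed sample, the top 10% of partitions hold about 90% of the data and the largest partition is roughly 80 times the median, which matches the scenario described above. An even layout scores 1.0 and 0.1 on the same metrics.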

A possible solution would be to analyze your key distribution and, for known skewed keys, implement pre-processing like so:

// For timestamp skew
val smoothed_key = concat(date_col, hash(minute_col) % 10)
// For business ID skew
val salted_key = concat(business_id, hash(row_number) % 5)

Using Spark’s built-in skew handling helps, but understanding the specific skew of your data is more robust and lasting. Spark’s skew handling configurations:

spark.sql.adaptive.enabled: true
spark.sql.adaptive.skewJoin.enabled: true
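The salting idea described above can be illustrated without a cluster. The sketch below is plain Python rather than Spark code: it appends a salt derived from the row index to each key, aggregates per salted key, and then merges the partial results back to the original key. The `#` separator, the helper names, and the sample data are assumptions made for the example.

```python
from collections import Counter

SALT_BUCKETS = 5  # same bucket count as the "% 5" in the business ID example


def salted_key(key, row_index, buckets=SALT_BUCKETS):
    """Spread one logical key over `buckets` physical keys."""
    return f"{key}#{row_index % buckets}"


def two_stage_count(keys, buckets=SALT_BUCKETS):
    """Count rows per key in two stages: per salted key, then per original key."""
    partial = Counter(salted_key(k, i, buckets) for i, k in enumerate(keys))
    final = Counter()
    for salted, count in partial.items():
        original = salted.rsplit("#", 1)[0]  # strip the salt again
        final[original] += count
    return partial, final


# One very active business ID and 1,000 quiet ones.
keys = ["hot"] * 9000 + [f"biz-{i}" for i in range(1000)]
partial, final = two_stage_count(keys)

print(max(Counter(keys).values()))  # 9000 rows in the largest group before salting
print(max(partial.values()))        # 1800 rows in the largest group after salting
print(final["hot"])                 # 9000, the merged result is unchanged
```

The second stage is what keeps the result correct: salting only changes how the work is split. For joins, the other side of the join has to be replicated once per salt value so that every salted key still finds its match.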

Failure Pattern #4: Resource Starvation in Mixed Workloads

A typical scenario: A seemingly well-configured job starts showing erratic behavior: some stages complete quickly while others seem stuck, executors appear underutilized despite high load, and the overall job progress becomes unpredictable. This is a typical case of resource starvation occurring within a single application. Common symptoms:

Late stages in complex DAGs struggle to get resources
Shuffle operations become bottlenecks
Some executors are overwhelmed while others sit idle
Task attempts timeout and retry repeatedly

The root cause often lies in complex transformation chains:

data.join(lookup1).groupBy("key1").agg(...).join(lookup2).groupBy("key2").agg(...)

Each transformation creates intermediate results that compete for resources. Without proper management, earlier stages can hog resources, starving later stages.

Possible solutions include:

Dividing compute-intensive jobs into smaller jobs that use resources more predictably.
If splitting a large job isn’t possible, using checkpoints and persist methods to better divide a single job into distinct parts. (expect a future blog post on these methods)
Applying Spark Shuffle management - setting spark.dynamicAllocation.shuffleTracking.enabled and spark.shuffle.service.enabled to true.

Conclusions & The Path Forward

We've found that most Spark issues manifest first as performance degradation before becoming outright failures. The goal of a data engineering team isn't to prevent all issues but to catch and address them before they impact production stability. While adding resources can sometimes help, precise optimization and proper monitoring often provide more sustainable solutions. Spark offers a robust set of job management tools and settings, but addressing problems through standard Spark configurations alone often proves insufficient.

The Flarion platform transforms this landscape in two key ways: through significant workload acceleration that reduces resource requirements and minimizes garbage collection overhead, and by providing enhanced visibility into Spark deployments. This combination of speed and improved observability enables engineering teams to identify potential issues before they escalate into failures, shifting from reactive troubleshooting to proactive optimization. As a result, data engineering teams experience both reduced failure rates and decreased operational burden, creating a more stable and efficient production environment.

Why the World Needs Flarion

Unlocking Data's Full Potential

November 27, 2024

Did you know that data-driven organizations spend up to 40% of their IT budgets on data processing alone?

As organizations scale their data processing capabilities, two critical challenges emerge: the mounting costs of processing big data and the pressing need for faster performance. Today, we're sharing our journey and explaining why Flarion is transforming how organizations leverage their data assets while staying competitive in an increasingly data-driven world.

The Journey to Better Data Processing

Through years of experience across diverse industries, Flarion’s co-founders witnessed the universal struggle of escalating data processing costs and performance bottlenecks.

During his years building data processing systems for mass-scale consumer applications and autonomous vehicles, Ran experienced firsthand how organizations struggled with the growing costs and performance demands of expanding datasets. In consumer applications, better insights can help create great experiences for hundreds of millions of people, but the high computational costs and processing limitations often make this prohibitively expensive. In autonomous vehicles, data processing at scale allows us to understand and tackle the toughest "long tail" challenges, but technical limitations can make this slow and cost-inefficient.

Through his extensive work with enterprises across various industries, Udi observed a consistent pattern: organizations were hitting both a performance and cost ceiling in their data processing capabilities. Despite significant investments in infrastructure and talent, companies found themselves constrained by processing limitations that held back their ability to launch new features or products while managing escalating infrastructure costs.

The Evolution of Data Processing Needs

The landscape of data processing has evolved dramatically. What started as simple analytics has transformed into complex data pipelines processing hundreds of terabytes daily. These diverse challenges underline the pressing need for solutions that address both speed and cost at scale.

In automotive, processing speed directly impacts vehicle safety and performance, while processing costs affect vehicle affordability and market competitiveness. In financial services, faster data processing enables real-time decision-making and better risk assessment, but the infrastructure costs of high-frequency trading and real-time analytics can quickly erode profit margins. For e-commerce companies, efficient data processing means better customer recommendations and inventory management, yet the cost of processing massive customer datasets across global markets can be prohibitive. Almost every industry relies heavily on efficient data processing and analytics, making both speed and cost optimization critical factors in maintaining competitive advantage.

A New Approach to Performance

Traditional approaches to improving data processing often involve extensive code changes, specialized expertise, or specific deployment requirements. For enterprises with massive legacy codebases, these solutions are often impractical or impossible to implement, creating additional complexity without solving the fundamental challenges of performance and cost efficiency.

We built Flarion with a different vision: what if organizations could dramatically improve their data processing performance without changing their code or disrupting existing workflows? With new Spark, Hadoop and Ray execution engines, we've created a solution that delivers up to 3x performance improvement while maintaining robust reliability and full compatibility. Most importantly, Flarion can be implemented in just 5 minutes, requiring minimal effort from organizations looking to modernize their data stack.

Enabling Innovation Through Efficiency

The impact of accelerated data processing extends far beyond just faster completion times. When organizations can process their data more efficiently and cost-effectively, they can explore new use cases, launch innovative features, and focus on extracting value from their data rather than managing infrastructure costs.

For AI and machine learning applications, efficient data processing is becoming increasingly crucial. The ability to process large datasets quickly and reliably can mean the difference between a successful model deployment and a missed opportunity. With Flarion, organizations can focus on innovation rather than infrastructure optimization, all while maintaining their existing codebase and operations.

The Future of Data Processing

As we enter an era where data drives competitive advantage, organizations need solutions that enable them to process more data, faster and more cost-effectively. The future of data processing isn't just about handling today's workloads - it's about being ready for tomorrow's challenges while managing costs sustainably.

With Flarion, organizations are not just keeping pace—they’re leading the charge into a data-driven future. Our solution enables organizations to unlock the full potential of their data assets, whether they're running data processing in the cloud or on-premises. By delivering significant performance improvements through advanced optimization techniques, we're helping organizations process their data more efficiently while reducing their infrastructure costs. Most importantly, we're doing this in a way that respects the reality of enterprise systems - with a solution that can be implemented in minutes, not months.

The future of data processing should empower organizations to focus on innovation and value creation without being held back by legacy infrastructure or rising costs.

At Flarion, we're making that future a reality.

Streaming in Modern Query Engines: Where DataFusion Shines

May 4, 2025

The landscape of data processing has evolved dramatically over the past few years. As datasets grow exponentially, query engines are adapting beyond traditional batch processing. Today's most innovative engines incorporate streaming capabilities to process data incrementally, enabling analysis of datasets larger than available memory while maintaining high performance. Among the leading contenders - Apache DataFusion, Polars, and DuckDB - the approaches to streaming differ significantly, with DataFusion emerging as the clear frontrunner for true streaming applications.

The Evolution of Streaming Query Execution

The term "streaming" has become somewhat ambiguous in the data processing world, spanning several distinct capabilities:

Pipelined execution: Processing data in small chunks through a query plan
Out-of-core processing: Handling datasets larger than available memory
Continuous processing: Executing long-running queries on never-ending data streams
Real-time ingestion: Continuously incorporating new data from external sources
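The first two senses of "streaming" can be illustrated with ordinary Python generators. This is a conceptual sketch only and does not use any engine's API: each stage pulls one small chunk at a time from the stage before it, so the full dataset never has to sit in memory. The stage names are made up for the example.

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Source operator: emit the input as fixed-size chunks."""
    chunk: List[int] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final partial chunk
        yield chunk


def filter_even(chunks: Iterator[List[int]]) -> Iterator[List[int]]:
    """Filter operator: keep even values, one chunk at a time."""
    for chunk in chunks:
        yield [v for v in chunk if v % 2 == 0]


def running_sum(chunks: Iterator[List[int]]) -> Iterator[int]:
    """Aggregate operator: emit an updated total after every chunk."""
    total = 0
    for chunk in chunks:
        total += sum(chunk)
        yield total


def final_total(n: int, chunk_size: int = 1024) -> int:
    """Run the pipeline over 1..n and return the last emitted total."""
    last = 0
    for last in running_sum(filter_even(read_in_chunks(range(1, n + 1), chunk_size))):
        pass
    return last


print(final_total(1_000_000))  # 250000500000
```

Only one chunk is alive at any moment, which is the essence of pipelined and out-of-core execution. Because `running_sum` emits a result after every chunk instead of waiting for the end of the input, the same shape also hints at continuous processing: with an unbounded source, it would simply keep producing updated totals.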

While all three engines we're examining implement some form of streaming, they vary dramatically in their approach and capabilities. DuckDB and Polars primarily focus on the first two points—efficient execution of traditional queries—while DataFusion uniquely addresses all four aspects, providing a foundation for true streaming applications.

DataFusion's Native Streaming Architecture

Apache DataFusion, the Rust-based query engine at the heart of the Apache Arrow ecosystem, was designed with streaming as a core architectural principle. Most physical operators in DataFusion support an "Unbounded" execution mode specifically for handling infinite streams.

DataFusion's streaming architecture delivers several key advantages:

Streaming-First Design: While other engines adapted batch processing for streaming, DataFusion incorporates streaming principles natively. Its physical execution plan includes operators like StreamTableExec and SymmetricHashJoinExec specifically designed for unbounded data. This fundamental design choice enables true continuous query execution.

Streaming Join Support: Where traditional engines struggle with joins on streaming data, DataFusion's SymmetricHashJoinExec operator efficiently joins unbounded streams on the fly. This critical capability unlocks complex real-time analytics that would otherwise require batch window processing.

Arrow Integration: DataFusion processes data in Arrow record batches, providing memory-efficient, zero-copy operations on columnar data. This tight integration with Arrow gives DataFusion significant performance advantages when streaming data between systems or components.

Low-Level API Flexibility: DataFusion provides the foundational building blocks needed to construct sophisticated streaming applications. While higher-level functionality like watermarking is still emerging, its extensible architecture allows developers to implement these capabilities directly.
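The general idea behind a symmetric hash join can be sketched in a few lines. The code below illustrates the textbook algorithm, not DataFusion's SymmetricHashJoinExec implementation: every arriving row is stored in its own side's hash table and immediately probed against the other side's table, so matches are emitted as soon as both rows have been seen. The event format is an assumption made for the example.

```python
from collections import defaultdict


def symmetric_hash_join(events):
    """Join two interleaved streams on key.

    `events` yields (side, key, value) tuples where side is "left" or "right".
    Yields (key, left_value, right_value) as soon as a match is known.
    """
    tables = {"left": defaultdict(list), "right": defaultdict(list)}
    for side, key, value in events:
        other = "right" if side == "left" else "left"
        tables[side][key].append(value)           # build: remember this row
        for match in tables[other].get(key, []):  # probe: look for partners
            if side == "left":
                yield (key, value, match)
            else:
                yield (key, match, value)


events = [
    ("left", 1, "a"),
    ("right", 1, "x"),
    ("right", 2, "y"),
    ("left", 2, "b"),
    ("left", 1, "c"),
]
for row in symmetric_hash_join(events):
    print(row)
```

This prints `(1, 'a', 'x')`, `(2, 'b', 'y')`, and `(1, 'c', 'x')` in arrival order, without waiting for either stream to end. The sketch keeps every row forever, so its state grows without bound; a real streaming engine has to limit that state, typically by discarding rows that can no longer match.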

Polars and DuckDB: Streaming Capabilities

Both Polars and DuckDB offer streaming-related capabilities, though with important limitations for true streaming:

Polars' Streaming Status: Polars previously implemented a streaming execution mode that processed data in batches. However, it's worth noting that this streaming engine has been deprecated, and while the Polars team is working on a new streaming implementation, it's not currently something to build production systems on. Polars continues to excel at single-node workloads where memory isn't a significant constraint, offering exceptional performance for data transformation and analytics.

DuckDB Pipelined Execution: DuckDB employs a vectorized, pipelined execution model that processes data in small chunks (vectors) through query operators. This approach is particularly effective for quick in-memory operations and can handle streaming workloads efficiently when the data volume fits comfortably in memory. DuckDB's columnar architecture and parallel execution make analytical queries remarkably fast for these scenarios.

Neither engine is designed for continuous streaming of unbounded data. Both lack built-in stream ingestion capabilities and don't maintain persistent state across query executions. Each query runs to completion on the data available at execution time.

Choosing the Right Tool for Your Streaming Needs

Understanding the key differences in streaming capabilities helps select the right tool for specific use cases:

For True Streaming Applications: DataFusion stands out when you need continuous processing of unbounded data streams. Its ability to handle streaming joins, process Kafka data directly through StreamTableExec, and maintain state between batches makes it ideal for real-time applications with continuous data flows.

For Large Dataset Processing: Polars and DuckDB excel when processing large files or datasets that don't fit in memory. Their streaming execution modes efficiently handle out-of-core processing for analytics, ETL, and data transformation tasks with excellent performance.

Use Case Examples:

Real-time analytics pipeline: DataFusion provides the foundation for building systems that continuously ingest from Kafka and maintain up-to-date results.
Large log file analysis: Polars and DuckDB can efficiently process multi-gigabyte log files on modest hardware, even if the files exceed available memory.
Periodic batch processing: For scheduled ETL jobs that process accumulated data at intervals, Polars and DuckDB offer simpler implementation with excellent performance.

Each engine shines in its intended domain. DataFusion excels at true streaming while Polars and DuckDB deliver outstanding performance for analytical workloads and large dataset processing.

The Future of Streaming Query Engines

As data volumes continue growing and real-time analytics becomes increasingly critical, each engine is evolving to better serve its core use cases:

DataFusion continues advancing its streaming capabilities with ongoing development focused on:

Native watermarking support for proper event-time processing
Built-in state checkpointing for fault tolerance
Enhanced connector ecosystem for popular streaming sources

Polars and DuckDB continue to optimize their engines for analytical performance within their target domains, with Polars working on a new streaming engine and DuckDB enhancing its vectorized execution capabilities.

At Flarion, we believe in selecting the right tool for each specific task. We're always evaluating the strengths of different engines and are happy to give each one a chance in the domain where it shines. This pragmatic approach means using DataFusion when true streaming capabilities are required, while leveraging Polars for high-performance single-node analytics and DuckDB for quick in-memory operations.

Vectorized Processing

The Silent Performance Multiplier in Data Systems

March 18, 2025

Vectorization has emerged as the most critical performance innovation in modern data platforms. At its core, the concept is straightforward: process entire batches of data simultaneously rather than one row at a time. This approach unlocks substantial efficiency gains and has become fundamental to high-performance data systems.

The Birth of Vectorized Processing

The database community first embraced vectorization through pioneering systems like MonetDB and VectorWise in the mid-2000s. These systems addressed the observation that traditional row-by-row processing created significant CPU bottlenecks. Their solution involved processing data in batches small enough to fit in CPU caches, dramatically improving query performance by eliminating per-row function call overhead.

In parallel, the scientific Python ecosystem built NumPy and Pandas around vectorized operations, allowing data scientists to perform bulk calculations orders of magnitude faster than Python loops. These early implementations demonstrated that vectorization represented a fundamental paradigm shift in data processing.

How Vectorization Transforms Performance

Vectorization aligns with modern hardware capabilities through multiple mechanisms:

CPU Vector Instructions (SIMD): Modern CPUs include SIMD (Single Instruction Multiple Data) units that can perform the same operation on multiple values simultaneously. These specialized processor features have evolved significantly:
- SIMD Evolution: From early MMX (64-bit registers) and SSE instructions processing 128 bits (four 32-bit integers) at once, to 256-bit AVX2 handling eight, and modern AVX-512 capable of processing 16 32-bit integers or floats in a single instruction
- Hardware Implementation: SIMD registers are wider than standard registers—256 or 512 bits versus 64 bits—allowing a single instruction to operate on multiple data elements
- Operation Types: Common SIMD operations in data processing include vectorized comparison (generating bitmasks for filtering), arithmetic (sum, multiply, divide entire arrays), and specialized operations like shuffle and gather/scatter
- Compiler Support: Modern compilers can auto-vectorize simple loops, while high-performance systems use intrinsics (specialized C functions that map directly to SIMD instructions) for maximum control
- Performance Impact: SIMD instructions can provide theoretical speedups proportional to the vector width—up to 16x for certain operations on AVX-512 systems
Memory Efficiency: Columnar data layouts enable sequential memory access, maximizing cache efficiency and minimizing memory stalls.
Reduced Overhead: With vectorization, the cost of function calls and interpretation is amortized across hundreds or thousands of values.

A simple example illustrates the difference. Consider summing a column with a million values:

Traditional approach: Loop through one million values, with function call overhead for each
Vectorized approach: Process 1,024 values at once in a tight loop, leveraging SIMD instructions

The Role of Apache Arrow

Apache Arrow has become the central enabling technology for the vectorization ecosystem. It provides:

Zero-copy columnar memory format: Arrow defines a standardized in-memory columnar representation that allows data to be processed without serialization or deserialization when moving between systems.
SIMD-optimized compute kernels: Arrow includes a library of vectorized operations optimized for modern CPUs, ensuring that as new vector instruction sets emerge (AVX-512, ARM SVE), all Arrow-based systems can benefit.
Cross-language compatibility: Arrow implementations exist across multiple programming languages (C++, Rust, Python, Java, etc.), enabling efficient data exchange between different environments.
Integration across the ecosystem: Major platforms including Spark, DataFusion, Polars, and Velox have adopted Arrow as their interchange format.
Flight protocol: Arrow Flight provides high-performance data transfer between systems using the Arrow format, offering substantial improvements over traditional protocols.

The significance of Arrow lies in its ability to break down silos between previously isolated data systems. A dataset in Arrow format can move seamlessly between a Spark cluster, Python analysis environment, and GPU-accelerated visualization tool with minimal overhead.

The Vectorization Landscape Today

This approach has permeated virtually every corner of the data ecosystem:

Analytical Databases

ClickHouse processes data in batches, routinely scanning billions of records per second on a single server
DuckDB processes fixed-size vectors (2,048 values by default), matching dedicated database servers for medium-sized datasets
Apache DataFusion operates natively on columnar RecordBatches, performing highly efficient SIMD-enabled computations

Big Data Systems

Apache Spark now leverages Pandas UDFs with Arrow as a zero-copy data interchange format, though it still does not use vectorization in its primary flows
Databricks Photon replaces row-wise processing with a native columnar engine
Meta's Velox provides a unified C++ execution engine with vectorized expression evaluation

Data Science and ML

Polars combines Apache Arrow's memory-efficient format with multi-threaded, SIMD-accelerated operations
TensorFlow and PyTorch leverage optimized libraries like Intel's oneAPI Math Kernel Library and NVIDIA CUDA
Scientific computing applications depend on vectorization to achieve performance at scale

Real-World Impact: Quantifiable Improvements

The performance gains from vectorization translate to measurable improvements:

Databricks Photon achieves over 10× speedups on some SQL and DataFrame operations
Meta's Velox delivers 6-7× faster performance on heavy analytical queries in production at Facebook
CockroachDB's vectorized OLAP engine yields up to 4× speedups in standard analytics benchmarks
In machine learning, GPU-accelerated vectorized operations can be 10-100× faster than CPU-based sequential processing

These improvements enable interactive queries on terabytes of data, ML models trained in minutes instead of hours, and scientific simulations at previously impossible resolutions.

The Future of Vectorized Processing

As hardware continues to evolve with wider vector units, more cores, and specialized accelerators, vectorization remains the foundation of high-performance data systems. The convergence between database technology, data science tools, and ML frameworks demonstrates that vectorization has become a fundamental paradigm for modern computing.

Embracing vectorized processing is now essential for delivering the performance required by data-intensive applications across industries and domains.
