Blog

Vectorized Processing

The Silent Performance Multiplier in Data Systems

Ran Reichman

•

read time

•

March 18, 2025

Heading 2

Vectorization has emerged as the most critical performance innovation in modern data platforms. At its core, the concept is straightforward: process entire batches of data simultaneously rather than one row at a time. This approach unlocks substantial efficiency gains and has become fundamental to high-performance data systems.

The Birth of Vectorized Processing

The database community first embraced vectorization through pioneering systems like MonetDB and VectorWise in the mid-2000s. These systems addressed the observation that traditional row-by-row processing created significant CPU bottlenecks. Their solution involved processing data in batches small enough to fit in CPU caches, dramatically improving query performance by eliminating per-row function call overhead.

In parallel, the scientific Python ecosystem built NumPy and Pandas around vectorized operations, allowing data scientists to perform bulk calculations orders of magnitude faster than Python loops. These early implementations demonstrated that vectorization represented a fundamental paradigm shift in data processing.

How Vectorization Transforms Performance

Vectorization aligns with modern hardware capabilities through multiple mechanisms:

CPU Vector Instructions (SIMD): Modern CPUs include SIMD (Single Instruction Multiple Data) units that can perform the same operation on multiple values simultaneously. These specialized processor features have evolved significantly:
- SIMD Evolution: From early MMX and SSE instructions processing 128 bits (4 integers) at once, to AVX-256 handling 8 integers, and modern AVX-512 capable of processing 16 integers or floats in a single instruction
- Hardware Implementation: SIMD registers are wider than standard registers—256 or 512 bits versus 64 bits—allowing a single instruction to operate on multiple data elements
- Operation Types: Common SIMD operations in data processing include vectorized comparison (generating bitmasks for filtering), arithmetic (sum, multiply, divide entire arrays), and specialized operations like shuffle and gather/scatter
- Compiler Support: Modern compilers can auto-vectorize simple loops, while high-performance systems use intrinsics (specialized C functions that map directly to SIMD instructions) for maximum control
- Performance Impact: SIMD instructions can provide theoretical speedups proportional to the vector width—up to 16x for certain operations on AVX-512 systems
Memory Efficiency: Columnar data layouts enable sequential memory access, maximizing cache efficiency and minimizing memory stalls.
Reduced Overhead: With vectorization, the cost of function calls and interpretation is amortized across hundreds or thousands of values.

A simple example illustrates the difference. Consider summing a column with a million values:

Traditional approach: Loop through one million values, with function call overhead for each
Vectorized approach: Process 1,024 values at once in a tight loop, leveraging SIMD instructions

The Role of Apache Arrow

Apache Arrow has become the central enabling technology for the vectorization ecosystem. It provides:

Zero-copy columnar memory format: Arrow defines a standardized in-memory columnar representation that allows data to be processed without serialization or deserialization when moving between systems.
SIMD-optimized compute kernels: Arrow includes a library of vectorized operations optimized for modern CPUs, ensuring that as new vector instruction sets emerge (AVX-512, ARM SVE), all Arrow-based systems can benefit.
Cross-language compatibility: Arrow implementations exist across multiple programming languages (C++, Rust, Python, Java, etc.), enabling efficient data exchange between different environments.
Integration across the ecosystem: Major platforms including Spark, DataFusion, Polars, and Velox have adopted Arrow as their interchange format.
Flight protocol: Arrow Flight provides high-performance data transfer between systems using the Arrow format, offering substantial improvements over traditional protocols.

The significance of Arrow lies in its ability to break down silos between previously isolated data systems. A dataset in Arrow format can move seamlessly between a Spark cluster, Python analysis environment, and GPU-accelerated visualization tool with minimal overhead.

The Vectorization Landscape Today

This approach has permeated virtually every corner of the data ecosystem:

Analytical Databases

ClickHouse processes data in batches, routinely scanning billions of records per second on a single server
DuckDB processes fixed-size batches of 1,024 values, matching dedicated database servers for medium-sized datasets
Apache DataFusion operates natively on columnar RecordBatches, performing highly efficient SIMD-enabled computations

Big Data Systems

Apache Spark now leverages Pandas UDFs with Arrow as a zero-copy data interchange format, though it still does not use vectorization in its primary flows
Databricks Photon replaces row-wise processing with a native columnar engine
Meta's Velox provides a unified C++ execution engine with vectorized expression evaluation

Data Science and ML

Polars combines Apache Arrow's memory-efficient format with multi-threaded, SIMD-accelerated operations
TensorFlow and PyTorch leverage optimized libraries like Intel's oneAPI Math Kernel Library and NVIDIA CUDA
Scientific computing applications depend on vectorization to achieve performance at scale

Real-World Impact: Quantifiable Improvements

The performance gains from vectorization translate to measurable improvements:

Databricks Photon achieves over 10× speedups on some SQL and DataFrame operations
Meta's Velox delivers 6-7× faster performance on heavy analytical queries in production at Facebook
CockroachDB's vectorized OLAP engine yields up to 4× speedups in standard analytics benchmarks
In machine learning, GPU-accelerated vectorized operations can be 10-100× faster than CPU-based sequential processing

These improvements enable interactive queries on terabytes of data, ML models trained in minutes instead of hours, and scientific simulations at previously impossible resolutions.

The Future of Vectorized Processing

As hardware continues to evolve with wider vector units, more cores, and specialized accelerators, vectorization remains the foundation of high-performance data systems. The convergence between database technology, data science tools, and ML frameworks demonstrates that vectorization has become a fundamental paradigm for modern computing.

Embracing vectorized processing is now essential for delivering the performance required by data-intensive applications across industries and domains.

‍

Resources

Blog

This is some text inside of a div block.

Why the World Needs Flarion

Unlocking Data's Full Potential

read time

November 27, 2024

Did you know that data-driven organizations spend up to 40% of their IT budgets on data processing alone?

As organizations scale their data processing capabilities, two critical challenges emerge: the mounting costs of processing big data and the pressing need for faster performance. Today, we're sharing our journey and explaining why Flarion is transforming how organizations leverage their data assets while staying competitive in an increasingly data-driven world.

The Journey to Better Data Processing

Through years of experience across diverse industries, Flarion’s co-founders witnessed the universal struggle of escalating data processing costs and performance bottlenecks.

During his years building data processing systems for mass-scale consumer applications and autonomous vehicles, Ran experienced firsthand how organizations struggled with the growing costs and performance demands of expanding datasets. In consumer applications, better insights can help create great experiences for hundreds of millions of people, but the high computational costs and processing limitations often make this prohibitively expensive. In autonomous vehicles, data processing at scale allows us to understand and tackle the toughest "long tail" challenges, but technical limitations can make this slow and cost-inefficient.

Through his extensive work with enterprises across various industries, Udi observed a consistent pattern: organizations were hitting both a performance and cost ceiling in their data processing capabilities. Despite significant investments in infrastructure and talent, companies found themselves constrained by processing limitations that held back their ability to launch new features or products while managing escalating infrastructure costs.

The Evolution of Data Processing Needs

The landscape of data processing has evolved dramatically. What started as simple analytics has transformed into complex data pipelines processing hundreds of terabytes daily. These diverse challenges underline the pressing need for solutions that address both speed and cost at scale.

In automotive, processing speed directly impacts vehicle safety and performance, while processing costs affect vehicle affordability and market competitiveness. In financial services, faster data processing enables real-time decision-making and better risk assessment, but the infrastructure costs of high-frequency trading and real-time analytics can quickly erode profit margins. For e-commerce companies, efficient data processing means better customer recommendations and inventory management, yet the cost of processing massive customer datasets across global markets can be prohibitive. Almost every industry relies heavily on efficient data processing and analytics, making both speed and cost optimization critical factors in maintaining competitive advantage.

A New Approach to Performance

Traditional approaches to improving data processing often involve extensive code changes, specialized expertise, or specific deployment requirements. For enterprises with massive legacy codebases, these solutions are often impractical or impossible to implement, creating additional complexity without solving the fundamental challenges of performance and cost efficiency.

We built Flarion with a different vision: what if organizations could dramatically improve their data processing performance without changing their code or disrupting existing workflows? With new Spark, Hadoop and Ray execution engines, we've created a solution that delivers up to 3x performance improvement while maintaining robust reliability and full compatibility. Most importantly, Flarion can be implemented in just 5 minutes, requiring minimal effort from organizations looking to modernize their data stack.

Enabling Innovation Through Efficiency

The impact of accelerated data processing extends far beyond just faster completion times. When organizations can process their data more efficiently and cost-effectively, they can explore new use cases, launch innovative features, and focus on extracting value from their data rather than managing infrastructure costs.

For AI and machine learning applications, efficient data processing is becoming increasingly crucial. The ability to process large datasets quickly and reliably can mean the difference between a successful model deployment and a missed opportunity. With Flarion, organizations can focus on innovation rather than infrastructure optimization, all while maintaining their existing codebase and operations.

The Future of Data Processing

As we enter an era where data drives competitive advantage, organizations need solutions that enable them to process more data, faster and more cost-effectively. The future of data processing isn't just about handling today's workloads - it's about being ready for tomorrow's challenges while managing costs sustainably.

With Flarion, organizations are not just keeping pace—they’re leading the charge into a data-driven future. Our solution enables organizations to unlock the full potential of their data assets, whether they're running data processing in the cloud or on-premises. By delivering significant performance improvements through advanced optimization techniques, we're helping organizations process their data more efficiently while reducing their infrastructure costs. Most importantly, we're doing this in a way that respects the reality of enterprise systems - with a solution that can be implemented in minutes, not months.

The future of data processing should empower organizations to focus on innovation and value creation without being held back by legacy infrastructure or rising costs.

At Flarion, we're making that future a reality.

Featured Blog

Blog

Storage Format Lock-in: The Constraint Limiting Query Performance

read time

November 17, 2025

Modern distributed processing engines achieve remarkable scale and reliability, yet they share a fundamental architectural constraint: rigid storage format requirements. This constraint forces organizations to accept significant performance penalties, sometimes 10-100x slower queries than technically possible, simply because their processing engine cannot adapt to optimized storage formats.

Why Parquet Falls Short for Selective Analytics

Parquet excels as a universal columnar format, but its design prioritizes compatibility and compression over query performance. Understanding its limitations requires examining how analytical queries actually execute.

The Row Group Problem

Parquet organizes data into row groups, typically 100,000 to 1 million rows each. This coarse granularity creates a fundamental problem: if even a single row in a row group matches your query filter, you must read ALL requested columns for the ENTIRE row group.

Consider a query filtering for a specific customer ID in a billion-row table. That customer's 100 transactions might be scattered across 100 different row groups. Parquet must read 100 row groups × 100,000 rows × all requested columns, processing 10 million rows to return 100.

Single-Phase Execution

Parquet readers process queries in a single phase:
‍

1. Read all requested columns for relevant row groups

2. Decompress the data

3. Apply filters

4. Return matching rows

This means reading and decompressing massive amounts of data that will immediately be discarded. There's no mechanism to filter first, then read only what's needed.

Limited Predicate Pushdown

While Parquet stores min/max statistics per row group and optional Bloom filters, these only help skip entire row groups. For analytical queries where some matching rows exist in many row groups, this provides minimal benefit. You still read entire 100,000-row chunks to extract perhaps 10 rows from each.

How Query-Optimized Formats Solve These Problems

Formats like ClickHouse's MergeTree take a fundamentally different approach:

Granular Storage

Instead of 100,000-row groups, data is organized into 8,192-row granules. This 12x finer granularity means reading much less unnecessary data when matches are sparse. Finding those same 100 customer transactions requires reading ~12x less data just from granularity alone.

Two-Phase Execution with PREWHERE

Query-optimized formats implement two-phase execution:

1. Phase 1: Read ONLY filter columns for candidate granules

2. Phase 2: For rows that pass filters, read the remaining requested columns

This seemingly simple change has a profound impact. Instead of reading 20 columns for millions of rows, you read 1-2 filter columns first, identify the 100 matching rows, then read the other 18 columns for just those 100 rows.

Sparse Indexing

A sparse primary index stores one entry per granule (every 8,192 rows), creating a tiny index that fits entirely in memory even for billion-row tables. Binary search on this index instantly identifies which granules to read, eliminating 99%+ of data before any I/O occurs.

Lazy Materialization

Column reads are deferred until absolutely necessary. If a query has multiple filters, the engine applies them progressively, reading additional columns only for rows that survive each filter. This minimizes decompression work and memory bandwidth usage.

The Performance Gap: A First-Principles Calculation

Let's calculate the actual performance difference using a realistic analytical query on data stored in S3:

Scenario:

100 million rows, 100 columns
Query with 0.01% selectivity returning 10,000 rows
Projects 20 columns
All data in S3 (network I/O is the same for both formats)

Parquet Execution:

Data organized in 100,000-row groups
10,000 matching rows spread across ~100 row groups
Must read all 20 requested columns for entire row groups containing any match
Compressed data transferred from S3: 160 MB (with 10:1 compression)
Data decompressed and processed: 1.6 GB

Query-Optimized Format (like ClickHouse MergeTree):

Data organized in 8,192-row granules with sparse index
Two-phase execution: read filter columns first, identify matches, then read other columns
Compressed data transferred from S3: 46 MB (with 3.5:1 compression)
Data decompressed and processed: 162 MB

Despite storing data 2.9x larger on S3 due to less aggressive compression, the query-optimized format transfers 3.5x less data over the network and processes 10x less decompressed data. Processing 1.6 GB vs 162 MB of decompressed data is the difference between 100ms and 10ms query time

Why Processing Engines Don't Support Alternative Formats

While Spark and Ray technically could support arbitrary storage formats through extension APIs, in practice they don't. The ecosystem has converged on Parquet, ORC, and Avro, leaving significant performance opportunities unexplored.

The Spark Reality

Spark provides a DataSource V2 API for custom formats, yet virtually no production deployments use them. The practical barriers:

Format implementations must handle Spark's complex internal row representation
Custom formats cannot leverage Spark's vectorized execution optimizations
The Catalyst optimizer cannot reason about custom format capabilities
Maintaining custom format readers requires deep Spark internals knowledge

Ray's Format Limitations

Ray delegates data loading to Pandas or PyArrow:

python

@ray.remote

def process_partition(file_path):

# Limited to formats Pandas supports

df = pd.read_parquet(file_path)

return df.groupby('device_id').mean()

Adding optimized formats would require writing C++ extensions, ensuring serialization compatibility, and maintaining format-specific optimizations Ray doesn't understand.

The Architectural Impedance Mismatch

Even when custom format support exists, fundamental architectural assumptions prevent leveraging format-specific optimizations. Standard engines read all requested columns before filtering and lack the execution primitives for two-phase execution. The granularity mismatch between coarse row groups and fine granules cannot be bridged through format plugins.

Real-World Impact on S3-Based Data Lakes

Production systems using S3 data lakes demonstrate concrete impact:

E-commerce Recommendation Engine

Dataset in S3: 500M products × 800 attributes
Parquet in Spark: 8.2 seconds average latency
Possible with optimized format: 100ms
Impact: Real-time recommendations impossible

Financial Risk Analytics

Dataset in S3: 10 years of trades, 2,000 columns
Parquet in Spark: 45 seconds per portfolio
Possible with optimized format: 500ms
Impact: Risk managers wait minutes for updates that could take seconds

IoT Monitoring

Dataset in S3: 100,000 devices × 1,000 metrics
Parquet in Ray: 12 seconds per device group
Possible with optimized format: 50ms
Impact: Alert latency prevents real-time response

The Compression-Performance Trade-off

Parquet achieves superior compression, typically 2-3x better than query-optimized formats. A 100GB dataset might compress to 10GB in Parquet versus 30GB in an optimized format, translating to $2.30/month versus $6.90/month on S3.

However, compute costs dominate this equation. When queries run 10-100x faster with optimized formats, the required infrastructure shrinks proportionally. A workload requiring a 16-node Spark cluster at $8,000/month might run on 2 nodes at $1,000/month with format optimization. The $7,000/month compute savings overwhelms the $4.60/month additional S3 storage cost by a factor of 1,500x.

Performance Patterns by Query Type

Different query patterns show varying sensitivity to format selection:

Highly Selective Queries (<0.1% of rows returned)

Performance difference: 20-100x
Parquet's coarse row groups create massive read amplification

Wide Table Projections (many columns, few rows)

Performance difference: 10-50x
Single-phase execution forces reading all columns upfront

Aggregation Queries

Performance difference: 5-20x
Processing unnecessary rows dominates computation

Full Table Scans

Performance difference: 0.7-1.4x (Parquet often faster)
Parquet's superior compression provides advantage when reading everything

The Hidden Costs of Format Lock-in

Beyond direct performance impact, format rigidity creates cascading inefficiencies:

Infrastructure Over-provisioning: Organizations scale clusters to compensate for format inefficiency. A workload naturally requiring 2 nodes might run on 20 nodes to achieve acceptable latency.

Architectural Workarounds: Teams implement pre-aggregation pipelines, materialized views, and caching layers, each adding operational complexity without addressing the root cause.

Opportunity Costs: When queries take 30 seconds instead of 300ms, entire categories of applications become infeasible.

The Path Forward

Parquet remains excellent for data interchange, archival storage, and full-scan workloads. The opportunity lies in enabling processing engines to leverage format diversity based on workload requirements.

The ideal architecture would support understanding the profile of the workloads, adjusting compaction accordingly, and using the best data format for the job. Current processing engines cannot implement this strategy due to their format rigidity. They lack the flexibility to transparently read different formats while applying format-specific optimizations.

This limitation is beginning to change. Next-generation execution engines recognize that format flexibility is essential for modern analytical workloads. By decoupling query execution from storage assumptions, systems like Flarion can deliver the performance that has always been technically possible but practically unreachable.

Organizations no longer need to accept 10x performance penalties as the price of distributed processing. The ability to match storage formats to query patterns represents the difference between queries that frustrate users and analytics that drive real-time decisions.

‍

Featured Blog

Blog

Spark 4.0's Columnar Journey: Real Progress, Real Limitations

read time

August 18, 2025

Apache Spark 4.0 marks a significant milestone in the framework's evolution toward columnar processing. With enhanced Apache Arrow integration, improved UDF support, and refined plugin architectures, Spark has taken meaningful steps forward. Yet understanding both the advances and the remaining gaps reveals why the journey toward truly efficient columnar processing continues.

The Promise and Reality of Columnar Improvements

Spark 4.0's headline improvements center on Apache Arrow integration. The framework now offers direct DataFrame-to-Arrow conversions, Arrow-optimized Python UDFs achieving up to 1.9x performance improvements, and cleaner APIs for columnar data exchange. These changes particularly benefit PySpark users, who've long suffered from serialization overhead when moving data between JVM and Python processes.

Yet examining Spark's architecture reveals a fundamental reality: the core execution engine remains predominantly row-oriented. While Spark provides hooks for columnar execution through its plugin architecture, the built-in operators - projections, filters, joins, aggregations - still process data row by row through optimized Java code generation. This isn't an oversight but a deliberate design choice that prioritizes compatibility and stability over pure performance.

Where Columnar Support Actually Stands

Understanding Spark 4.0's columnar capabilities requires distinguishing between data format and execution model. Spark has long supported columnar storage formats like Parquet. However, during actual computation, most operations convert this columnar data back to rows for processing.

Built-in SQL expressions execute through Catalyst's code generation, producing tight Java loops that process one row at a time. Complex expressions involving conditionals, nested functions, or custom logic follow this row-wise pattern. The JVM's JIT compiler optimizes these loops well, but they fundamentally lack the vectorized operations that define true columnar processing.

UDF support presents a mixed picture. Pandas UDFs genuinely operate on columnar data, leveraging NumPy's vectorized operations. The new Arrow-optimized Python UDFs improve data transfer efficiency but don't change the scalar execution model - they still process individual values, just with better serialization. Scala and Java UDFs remain entirely row-based, forcing any columnar data to convert back to rows for execution.

The architectural split becomes clear when examining memory management. Spark uses its own ColumnVector implementations for internal operations, not pure Arrow format. Converting between Spark's internal format and Arrow involves either copying or wrapping data, adding overhead that pure columnar engines avoid.

The Performance Gap That Remains

The practical implications become evident in production workloads. Join operations still rely on sort-merge or hash algorithms implemented in Java without SIMD optimization. Aggregations process groups row by row rather than operating on entire column chunks. String operations, mathematical expressions, and date manipulations all follow the same pattern - optimized Java code that processes individual values rather than vectors.

Native columnar engines demonstrate what's possible with true vectorized execution. By leveraging SIMD instructions and processing entire column batches simultaneously, these engines achieve significant speedups - often 2x or more - on the same hardware. This isn't because Spark's code is poorly written; it's because columnar execution with hardware vectorization fundamentally outperforms row-wise processing for analytical workloads.

The memory efficiency gap proves equally significant. Native columnar engines process data in its compressed form, maintaining compression through operations wherever possible. Spark's row-wise operations require decompression and materialization, increasing memory pressure and triggering more frequent garbage collection. For workloads pushing memory limits - a common scenario given how frequently Spark jobs encounter OOM errors - this efficiency difference can determine whether jobs complete successfully.

The Path Forward: Complementary Solutions

Spark 4.0's columnar improvements represent genuine progress, particularly for Python workflows and data interchange scenarios. Yet the core execution engine's row-based nature means achieving optimal columnar performance requires additional components.

Organizations increasingly deploy hybrid architectures that leverage Spark's strengths - distributed orchestration, fault tolerance, broad connector support - while delegating performance-critical operations to specialized columnar engines. Whether through native code execution or hardware acceleration, these complementary technologies fill the gaps in Spark's columnar story. This is precisely where solutions like Flarion's Arrow-based columnar processing provide value - plugging directly into existing Spark deployments to accelerate workloads without requiring code changes, while maintaining the distributed capabilities teams already rely on.

Understanding both Spark 4.0's advances and its limitations enables informed architectural decisions. While Spark takes important steps toward columnar processing, the journey toward truly efficient columnar execution often requires recognizing where additional acceleration provides essential value. For teams facing today's performance challenges - growing datasets, tightening SLAs, and mounting infrastructure costs - combining Spark's orchestration capabilities with purpose-built columnar acceleration delivers the performance modern data platforms demand.

‍

Thank you

You have been subscribed.

Oops! Something went wrong while submitting the form.

Vectorized Processing

The Birth of Vectorized Processing

How Vectorization Transforms Performance

The Role of Apache Arrow

The Vectorization Landscape Today

Real-World Impact: Quantifiable Improvements

The Future of Vectorized Processing

Related Posts

The Journey to Better Data Processing

The Evolution of Data Processing Needs

A New Approach to Performance

Enabling Innovation Through Efficiency

The Future of Data Processing

Why Parquet Falls Short for Selective Analytics

The Row Group Problem

Single-Phase Execution

Limited Predicate Pushdown

How Query-Optimized Formats Solve These Problems

Granular Storage

Two-Phase Execution with PREWHERE

Sparse Indexing

Lazy Materialization

The Performance Gap: A First-Principles Calculation

Why Processing Engines Don't Support Alternative Formats

The Architectural Impedance Mismatch

Real-World Impact on S3-Based Data Lakes

The Compression-Performance Trade-off

Performance Patterns by Query Type

The Hidden Costs of Format Lock-in

The Path Forward

The Promise and Reality of Columnar Improvements

Where Columnar Support Actually Stands

The Performance Gap That Remains

The Path Forward: Complementary Solutions

Subscribe today

Thank you