The Challenge of Garbage Collection in Java-based Data Engines

If you've run Spark for any length of time, you might have seen a job stall for no obvious reason. The data is loaded, the executors are up, the CPUs are barely ticking over, and the stage just sits there. The culprit is usually garbage collection. In this post we'll look at why it happens and what can be done about it.
What the JVM Is Doing When Everything Stops
Spark runs on the JVM, and the JVM is responsible for memory management. Your code creates objects as it works through the data, the dead ones pile up, and a part of the JVM called the garbage collector comes along and clears them out. Most of the time this happens without any impact you would notice.
The trouble is in how it clears them. Before the collector can safely reclaim memory, it needs a moment where nothing is changing underneath it, so it can tell which objects are still in use. To get that moment, it stops the program. Every application thread on the executor is halted and waits. These are called stop-the-world pauses, and the name is honest about it. No application code runs until the collector lets the threads go again.
People sometimes ask whether the program really just waits for garbage collection. It does. It's worth being fair to the modern collectors, because they're good. They do most of their bookkeeping concurrently, alongside the running job, precisely to keep these freezes short, and any single pause today is usually quick. But quick isn't free. The pauses never go away entirely, and the collectors tuned for throughput accept longer pauses on purpose, because that lets them get more done in between. Add up enough short freezes across a big fleet and a long job, and you're looking at real idle time, which hits both performance and cost.
Can't I Solve This Problem by Tuning?
The natural fix is to tune. Switch collectors, give it a bigger heap, change how the memory is split. This helps, but only up to a point.
Spark processes data a row at a time, and represents each value as an object on the heap. A single stage over a large dataset creates a huge number of short-lived objects: a value gets boxed, wrapped in an iterator, passed through an operator, and discarded. Every one is work the collector has to do later, and the faster they pile up, the more often it runs. No setting changes that. You can make collection faster and the pauses shorter, but you can't make the collector do less while the engine above it produces garbage by design. That's the ceiling.
Pauses aren't the whole cost either. The objects carry overhead of their own: headers, boxing and unboxing, and a scattered memory layout the CPU can't read efficiently. You pay for it whether the collector is running or not, and it usually stays hidden until someone profiles the job.
Not Making the Garbage in the First Place
If the volume of short-lived objects is the real problem, then the way out is an engine that doesn't create them to begin with, and that's a change you make well below anything a config file can reach. Flarion replaces Spark's row-by-row execution with a native engine written in Rust, built on Apache Arrow and DataFusion. Two things about that design do most of the work.
First, it's columnar. Instead of a parade of individual row objects, data moves through the engine in large contiguous blocks, one per column. The operation that used to allocate an object per value now chews through a whole block at once. The torrent of short-lived objects that kept the collector busy is never generated, so there's nothing to collect.
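To make the difference in memory shape concrete, here is a minimal Rust sketch. It is an analogy rather than Flarion's or Arrow's actual code: `Box<i64>` stands in for a per-value heap object, a plain slice stands in for one contiguous column block, and every function name is made up for the example.

```rust
/// Row-at-a-time analogue: each value lives in its own heap allocation,
/// so building n values costs n separate allocations.
fn build_boxed(n: i64) -> Vec<Box<i64>> {
    (0..n).map(Box::new).collect()
}

/// Columnar analogue: all values of the column sit in one contiguous buffer.
fn build_column(n: i64) -> Vec<i64> {
    (0..n).collect()
}

/// Sums the values by chasing one pointer per value.
fn sum_boxed(rows: &[Box<i64>]) -> i64 {
    rows.iter().map(|v| **v).sum()
}

/// Sums the values with a single pass over contiguous memory.
fn sum_column(column: &[i64]) -> i64 {
    column.iter().sum()
}
```

Both sums return the same answer. What changes is the number of allocations behind it: one per value in the first case, one per column in the second.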
Second, there's no garbage collector in the picture because Rust doesn't use one. It tracks the ownership of memory as part of the language, so a piece of memory is freed at an exact, known point in the program, the moment its owner goes out of scope. Nothing ever has to stop the world to go hunting for what's dead. The data also lives off-heap, outside the JVM's managed memory entirely. For the work that runs natively, there's no mechanism left to produce the idle time you were watching.
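The ownership rule is easy to see in a few lines of Rust. This is a sketch under assumptions: `ColumnBuffer`, `consume`, and `demo` are hypothetical names, and the shared counter exists only so the moment of release can be observed from outside.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a block of column data owned by one operator.
struct ColumnBuffer {
    values: Vec<i64>,
    /// Shared counter so the caller can observe when the buffer is freed.
    freed: Rc<Cell<usize>>,
}

impl Drop for ColumnBuffer {
    fn drop(&mut self) {
        // Runs at the point where the owner goes out of scope.
        self.freed.set(self.freed.get() + 1);
    }
}

/// Takes ownership of the buffer, sums it, and frees it on return.
fn consume(buffer: ColumnBuffer) -> i64 {
    let total: i64 = buffer.values.iter().sum();
    total
    // `buffer` is dropped here, before control returns to the caller.
}

/// Returns (sum, freed count before the call, freed count after the call).
fn demo() -> (i64, usize, usize) {
    let freed = Rc::new(Cell::new(0));
    let buffer = ColumnBuffer {
        values: vec![1, 2, 3, 4],
        freed: Rc::clone(&freed),
    };
    let before = freed.get();
    let sum = consume(buffer);
    let after = freed.get();
    (sum, before, after)
}
```

The counter moves from 0 to 1 exactly when `consume` returns. The release point is fixed by the structure of the program, not chosen later by a collector.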
This doesn't mean we've left the JVM behind. The Spark job still runs there, and it reaches the native engine through JNI (the Java Native Interface), the standard bridge between Java and native code. What crosses that bridge is control and a handle to where the data lives, not the data itself, which stays off-heap on the native side. So handing work across to the engine doesn't pull everything back onto the heap for the collector to find, and the garbage stays uncreated.
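One common way to pass "a handle, not the data" across a native boundary is to hand the caller a pointer-sized integer. The Rust sketch below shows only that pattern. It is not Flarion's interface and it leaves out the JNI glue entirely: `Batch`, `batch_create`, `batch_sum`, and `batch_release` are hypothetical names, and the `i64` plays the role of the `long` a Java caller would hold.

```rust
/// Hypothetical native-side batch; the values never leave native memory.
struct Batch {
    values: Vec<i64>,
}

/// Allocates a batch on the native side and returns an opaque handle
/// (the pointer, cast to an integer) for the caller to hold.
fn batch_create(values: Vec<i64>) -> i64 {
    Box::into_raw(Box::new(Batch { values })) as i64
}

/// Sums the batch behind a handle without copying it.
///
/// Safety: `handle` must come from `batch_create` and must not have been
/// passed to `batch_release` yet.
unsafe fn batch_sum(handle: i64) -> i64 {
    let batch = unsafe { &*(handle as *const Batch) };
    batch.values.iter().sum()
}

/// Frees the batch behind a handle.
///
/// Safety: `handle` must come from `batch_create`, and this must be called
/// at most once per handle.
unsafe fn batch_release(handle: i64) {
    drop(unsafe { Box::from_raw(handle as *mut Batch) });
}
```

The caller only ever holds one integer per batch, so the size of the data has no bearing on what the JVM heap has to track.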
Where It Doesn't Reach
Native execution covers most of a job, not all of it. The driver, the query planning, and any operation the engine doesn't support yet still run on the heap and still make objects for the collector. So garbage collection doesn't disappear from a real workload. It shrinks, and whatever is left tends to gather in the stages that haven't been converted. On more than one workload we've watched it settle into the final write, the last step still handing data back through the JVM. How much you're left with tracks how much of the job still runs on Spark.
None of this is something you have to manage. Flarion goes in as a plugin inside the Spark job you already have. What it supports runs natively, with no collector and no pause, and what it doesn't falls back to Spark and runs the way it always did. The acceleration happens underneath, and for the parts that run natively, the pause you used to wait on isn't there.
What This Adds Up To
Garbage collection sets a floor on the overhead a JVM-based engine carries at scale. Tuning lowers that floor but never removes it, because the engine keeps producing the work the collector has to clear. Getting under the floor means not creating the garbage at all, which is why the same jobs, unchanged, run faster on a native engine. They finish sooner, and the bill for all that idle time goes with them.
