Polars vs pandas 2026: 14GB Benchmark

Every Polars-vs-pandas comparison I've read in the last year uses a 500MB CSV and announces that Polars is faster. Congratulations. I wanted to know what happens on the workload I actually run at work: tens of gigabytes of partitioned parquet, a real join, a real groupby, a real write.

So I built a reproducible benchmark and ran it. Here's the short version. The long version (with EXPLAIN plans, memory graphs over time, and the docker-compose to reproduce it on your own box) lives on my blog.

The workload

Fact table: 120M rows of synthetic event logs, partitioned by day, ~14GB total as snappy parquet.
Dim table: 40M-row user dimension, ~2.1GB parquet.
Query: inner join fact to dim on user_id, group by user_id and day, aggregate (sum, count, max), write the result back as parquet.
Machine: 8 vCPU, 32GB RAM. Nothing exotic.

Versions tested

pandas 2.2.3 with the pyarrow dtype backend
pandas 2.2.3 with the default numpy backend
Polars 1.12 (lazy + streaming=True)
Polars 1.12 (eager, just to embarrass myself)

Results

pandas 2.2 (numpy): 24m 11s wall time, 28.4 GB peak RSS, OOM'd once in 5 runs. pandas 2.2 (pyarrow): 18m 02s, 19.6 GB, no OOM. Polars 1.12 eager: 28m 47s, 26.1 GB, OOM'd twice in 5 runs. Polars 1.12 lazy + streaming: 3m 38s wall time, 8.9 GB peak RSS, no OOM.

The lazy + streaming column is the headline. Eager Polars on data that doesn't fit comfortably in RAM is genuinely worse than pandas-pyarrow.

What I got wrong the first time

My first Polars rewrite looked like this:

import polars as pl

fact = pl.read_parquet("events/*.parquet")
dim = pl.read_parquet("users.parquet")

result = (
    fact.join(dim, on="user_id")
        .group_by(["user_id", "day"])
        .agg([
            pl.col("amount").sum(),
            pl.col("event_id").count(),
            pl.col("score").max(),
        ])
)
result.write_parquet("out.parquet")

read_parquet is eager. The whole fact table has to fit in RAM before the query plan even gets to look at it. Predicate pushdown? Gone. Streaming? Gone.

The lazy version is almost identical but uses scan_parquet and defers materialization:

import polars as pl

fact = pl.scan_parquet("events/*.parquet")
dim = pl.scan_parquet("users.parquet")

plan = (
    fact.join(dim, on="user_id")
        .group_by(["user_id", "day"])
        .agg([
            pl.col("amount").sum(),
            pl.col("event_id").count().alias("events"),
            pl.col("score").max(),
        ])
)

plan.sink_parquet("out.parquet")  # streaming write

Note sink_parquet instead of .collect().write_parquet(). sink_parquet is the streaming engine's preferred terminal — it writes batches as it goes and never tries to materialize the full result.

Gotchas I burned hours on

1. Streaming silently falls back. Some operations (cross joins, certain window functions, a few string ops in 1.12) aren't implemented in the streaming engine. Polars does not log a warning. It just runs them in the in-memory engine, your RSS balloons, and you blame Polars. The fix is to call .explain(streaming=True) and look for STREAMING: markers — anything outside that block is running in-memory.

2. Casts inside aggs block pushdown. This:

.agg(pl.col("amount").cast(pl.Int64).sum())

...prevented predicate pushdown for me in 1.12. Moving the cast into the scan_parquet schema override gave back ~20% of runtime.

3. pl.col("x").apply(...) exists, and you should treat it like a code smell. If you reach for it, search the expression namespace for a native equivalent. It's almost always there, the docs just don't surface it well.

When pandas still wins

For anything under a few GB that fits comfortably in RAM, pandas-pyarrow is within 2x of Polars on most operations and has a much larger ecosystem (matplotlib integration, statsmodels, every tutorial ever written). I am not migrating my notebooks.

For batch jobs that process more data than fits in RAM, lazy Polars is not close. It's a different category.

Reproduce it yourself

The data generator, the benchmark harness, and the docker-compose are all in the full writeup, along with EXPLAIN plans for both engines and memory usage sampled at 200ms intervals:

Read the full benchmark on pythondatabench.com

If you run it on your own hardware, I'd genuinely like to hear the numbers — especially anyone with NVMe vs spinning rust, since I suspect the streaming-engine gap widens on slower storage.

The workload

Versions tested

Results

What I got wrong the first time

Gotchas I burned hours on

When pandas still wins

Reproduce it yourself

Related articles

Related Articles

LLM Quantization in Python: GPTQ vs AWQ vs bitsandbytes vs GGUF (2026)

Geospatial Analysis with GeoPandas 1.0 in Python: A Practical 2026 Guide

LLM Inference Servers in Python: vLLM vs TGI vs SGLang Compared (2026)