Polars apply alternative: when/then/otherwise for 80x speedups
Eight worked examples that swap pandas-style .apply() in Polars for native when/then/otherwise expressions, plus benchmarks showing the real speedup on a 50M-row dataset.
Python Data Bench is where our team unpacks the messy middle of data engineering: the part between a working notebook and a pipeline that survives a Monday morning. We write for engineers who maintain pandas, Polars, and dbt in production, not for tutorial-tourists, so every article is grounded in a real schema, a real bug, or a real bill from Snowflake or BigQuery.
Polars has crossed the line from curiosity to default for a lot of teams, and we have spent the last year migrating real codebases to it. Our coverage compares Polars 1.x lazy frames against pandas 2.x with PyArrow-backed dtypes on the workloads people actually run: wide joins, window functions over event streams, and the dreaded group-by-then-explode pattern. We look at memory ceilings on a 32 GB worker, streaming engine throughput on Parquet partitioned by day, and the ergonomic tax of rewriting apply calls as expressions.
We also stay honest about where pandas still wins. If your codebase leans on scikit-learn pipelines, statsmodels, or the long tail of plotting libraries that speak the pandas dialect, a wholesale rewrite is rarely the cheapest move. Our migration guides cover the Apache Arrow bridge, dtype-backend flags, and the specific cases where Polars-on-Rust beats DuckDB or where DuckDB is the better answer entirely.
Incremental models are the feature that turns dbt from a SQL templater into an actual data platform, and also the feature that quietly breaks the most production warehouses. We document the patterns we use to keep them honest: unique_key on real surrogate keys, insert_overwrite with partition predicates on BigQuery, and the merge strategies that work on Snowflake without exploding micro-partitions. Each pattern is paired with a backfill plan, because the first question on any incident bridge is "can we safely re-run yesterday?"
We pull heavily from the official dbt docs but go further into the edges they skip: late-arriving facts, slowly changing dimensions with type-2 history, and the snapshot table sprawl that hits every team around year two. Expect concrete macros, CI checks built on dbt tests and Great Expectations, and a hard look at when to graduate from dbt Core to dbt Cloud or Coalesce.
2026 is the year SQLMesh stopped being the "alternative to dbt" and started being a credible default for teams that need real virtual data environments and column-level lineage out of the box. We benchmark SQLMesh against dbt on the things that matter to a working data engineer: plan-and-apply semantics, breaking-change detection, blue-green model promotion, and how each tool behaves when a model definition changes mid-backfill. Our reviews are based on parallel production deployments, not a clean toy project.
We also cover the surrounding ecosystem: DuckDB as a local development engine, Daft for multimodal workloads, and how Apache Iceberg table formats change what your transformation layer should and should not handle. The goal is to help you pick a stack that will still make sense in two years.
Browse the latest articles below for our newest field notes, benchmarks, and migration write-ups. New posts go up most weeks, and our archive is organized by tool and by problem so you can jump straight to whatever is on fire today.
A field-tested walkthrough of the five reasons your dbt incremental model silently runs a full refresh, with a reproducible debug recipe using compiled SQL, run results, and warehouse query history.
Every Polars-vs-pandas comparison I've read in the last year uses a 500 MB CSV and announces that Polars is faster. I wanted to know what happens on the workload I actually run at work.
BentoML, Ray Serve, FastAPI, and Triton compared for production ML model serving in Python: latency overhead, GPU batching, autoscaling, and cost per prediction with working code examples.
A practical 2026 guide to conformal prediction in Python with MAPIE 1.0: split conformal, CQR, APS/RAPS classification, alpha selection, and how to keep coverage under drift.
Marimo replaces Jupyter's manual execution with a reactive DAG, plain Python files, and one-command WASM deploys. Here's how the two notebooks compare in 2026.
Compare the three leading open-source Python drift detection libraries — Evidently, NannyML, and Alibi Detect — with runnable code, statistical test guidance, and production patterns for ML monitoring in 2026.
A practical 2026 guide to imbalanced classification in Python: when to reach for SMOTE, ADASYN, BorderlineSMOTE, class_weight, or threshold tuning — with runnable scikit-learn 1.8 and imbalanced-learn 0.14 code, common pitfalls, and a clear decision framework.
Learn how to run A/B tests in Python from start to finish — power analysis with statsmodels, frequentist hypothesis testing with SciPy 1.17, and Bayesian analysis with PyMC 5.28. Includes working code, decision frameworks, and common pitfalls to avoid.
Learn five practical ways to parallelize pandas operations — multiprocessing, Joblib, Dask, Modin, and Swifter — with working code examples, benchmarks, and a decision guide to pick the right tool.
A practical, code-driven guide to hypothesis testing in Python using SciPy 1.17. Covers t-tests, chi-square, ANOVA, Mann-Whitney U, and Kruskal-Wallis with working examples, assumption checking, and a decision framework for choosing the right test.
Find out which automated EDA tool fits your Python workflow. We compare YData Profiling, SweetViz, DataPrep, and D-Tale with code examples, benchmarks, and a practical decision framework.