Polars DataFrame Tutorial: From Getting Started to Mastery (2026)

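In the Python data world of 2026, a quiet shift is under way. After our last piece on the eight core changes in Pandas 3.0, the name that came up most in the comments was Polars. Honestly, this DataFrame library, written in Rust and built on Apache Arrow, can no longer be dismissed as a niche toy. More and more data engineers tell me that Polars is their first choice for day-to-day work.

As of February 2026, the latest Polars release is 1.38.x. To be clear, it is not out to kill Pandas. Now that Pandas 3.0 has landed big changes such as Copy-on-Write and the new string dtype, each library has its own strengths. But if you regularly handle datasets of a million rows or more, or you are fed up with Pandas being slow and running out of memory on large data, Polars deserves a serious look.

This article goes from design philosophy all the way to a hands-on case study, with plenty of runnable code. Whether you have only just heard the name Polars or have already tried it in a project, this guide should give you a fairly complete mental framework.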

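1. Core Design Philosophy of Polars

1.1 Rust Core, Python Shell

The Polars engine is written entirely in Rust. If you are not familiar with Rust, the short version is this: it is a systems language known for zero-cost abstractions and memory safety, it has no garbage-collector pauses, and it sidesteps the Python GIL (Global Interpreter Lock). As a result, Polars can genuinely use all the cores on your machine to run queries in parallel, whereas most Pandas operations are still essentially single-threaded.

The good news is that Polars exposes a very elegant Python API. You do not need to know Rust or care about the low-level details. It is as comfortable to write as Pandas, yet often runs several times faster. Getting performance and ease of use together feels great, frankly.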

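1.2 Apache Arrow Columnar Memory Format

Polars uses Apache Arrow as its underlying memory format. Arrow is a standardized columnar in-memory representation with several practical advantages:

Columnar storage: values from the same column sit contiguously in memory, which is very friendly to modern CPU caches
Zero-copy interoperability: data can be exchanged with other Arrow-aware libraries (DuckDB, Spark, DataFusion, and so on) without copying
Native missing-value support: every column carries its own validity bitmap, so there is no need to make do with NaN as a stand-in for missing data the way NumPy does
Rich type system: native support for dates and times, nested types (List, Struct), categorical types, and more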

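1.3 Lazy Evaluation and Query Optimization

This is the most fundamental difference between Polars and Pandas. Polars has two execution modes:

Eager: like Pandas, every operation runs immediately
Lazy: the query plan is accumulated first, then optimized and executed in one go

In lazy mode, the Polars query optimizer automatically applies predicate pushdown, projection pushdown, common subexpression elimination, and similar optimizations, much like the optimizer in a SQL database. In other words, you rarely need to rack your brain hand-tuning the order of operations; Polars rewrites the plan into a more efficient one for you. (Section 7 covers this in detail.)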

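1.4 Automatic Parallel Execution

Polars analyzes the dependencies between operations and runs the independent ones on different threads. You do not write any parallel code or configure a thread pool; it is all automatic. On an 8-core machine, complex aggregation queries can scale well with the number of cores, although the actual speedup depends on the workload and is rarely perfectly linear.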

2. Installation and Environment Setup

2.1 Basic Installation

Installing Polars is simple; it supports Python 3.9 and above:

# Basic installation
pip install polars

# Install all optional dependencies (recommended; includes numpy and pandas interop)
pip install "polars[all]"

# Or install only specific optional dependencies
pip install "polars[numpy,pandas,pyarrow]"

2.2 Verifying the Installation

import polars as pl
print(pl.__version__)  # 1.38.x
pl.show_versions()     # prints the report itself and returns None, so no print() needed

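2.3 Optional Dependencies

The core of Polars does not depend on any Python package (all computation happens in the Rust layer), but the following optional dependencies are useful in specific situations:

polars[numpy]: convert to and from NumPy arrays
polars[pandas]: convert to and from Pandas DataFrames
polars[pyarrow]: work directly with PyArrow tables
polars[fsspec]: read and write remote file systems (S3, GCS, and the like)
polars[xlsx2csv]: read Excel files
polars[connectorx]: pull data straight from SQL databases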

3. DataFrame and LazyFrame Basics

3.1 Creating a DataFrame

Creating a Polars DataFrame is straightforward. The most common way is to build one from a dictionary:

import polars as pl

# Create from a dictionary
df = pl.DataFrame({
    "name": ["张三", "李四", "王五", "赵六"],
    "age": [28, 35, 42, 31],
    "city": ["北京", "上海", "广州", "深圳"],
    "salary": [15000, 22000, 18000, 25000],
})

print(df)
# shape: (4, 4)
# ┌──────┬─────┬──────┬────────┐
# │ name ┆ age ┆ city ┆ salary │
# │ ---  ┆ --- ┆ ---  ┆ ---    │
# │ str  ┆ i64 ┆ str  ┆ i64    │
# ╞══════╪═════╪══════╪════════╡
# │ 张三 ┆ 28  ┆ 北京 ┆ 15000  │
# │ 李四 ┆ 35  ┆ 上海 ┆ 22000  │
# │ 王五 ┆ 42  ┆ 广州 ┆ 18000  │
# │ 赵六 ┆ 31  ┆ 深圳 ┆ 25000  │
# └──────┴─────┴──────┴────────┘

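3.2 Reading from Files

# Read CSV
df = pl.read_csv("data.csv")

# Read Parquet (the recommended format; usually the fastest to read and write)
df = pl.read_parquet("data.parquet")

# Read JSON
df = pl.read_json("data.json")

# Lazy scan (nothing is loaded into memory yet; covered in detail later)
lf = pl.scan_csv("data.csv")
lf = pl.scan_parquet("data.parquet")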

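3.3 Basic Operations: select, filter, with_columns, sort

The design philosophy of the Polars API is "expressions first". Almost every data operation is done through expressions. It may feel a little different from Pandas at first, but once you get used to it you will find the approach is actually clearer: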

import polars as pl

df = pl.DataFrame({
    "name": ["张三", "李四", "王五", "赵六", "钱七"],
    "department": ["工程", "市场", "工程", "市场", "工程"],
    "age": [28, 35, 42, 31, 26],
    "salary": [15000, 22000, 18000, 25000, 13000],
})

# Select columns
result = df.select("name", "salary")
# Or with the expression syntax
result = df.select(pl.col("name"), pl.col("salary"))

# Filter rows
seniors = df.filter(pl.col("age") > 30)

# Add or modify columns
df_new = df.with_columns(
    (pl.col("salary") * 12).alias("annual_salary"),
    (pl.col("age") >= 35).alias("is_senior"),
)

# Sort
df_sorted = df.sort("salary", descending=True)

# Method chaining: this is the typical Polars style
result = (
    df
    .filter(pl.col("department") == "工程")
    .with_columns(
        (pl.col("salary") * 1.1).round(0).alias("new_salary")
    )
    .sort("age")
    .select("name", "age", "new_salary")
)
print(result)

3.4 Pandas Cheat Sheet

If you have been using Pandas all along, this comparison table helps you find the matching Polars idiom quickly:

# Pandas                              Polars
# ------                              ------
# df["col"]                           df.select("col") or df["col"] (returns a Series)
# df[df["age"] > 30]                  df.filter(pl.col("age") > 30)
# df["new"] = df["a"] + df["b"]      df.with_columns((pl.col("a") + pl.col("b")).alias("new"))
# df.sort_values("col")               df.sort("col")
# df.groupby("col").agg({"v":"sum"})  df.group_by("col").agg(pl.col("v").sum())
# df.rename(columns={"a":"b"})        df.rename({"a": "b"})
# df.drop(columns=["a"])              df.drop("a")
# df.head(5)                           df.head(5)

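3.5 LazyFrame: An Introduction to Lazy Evaluation

LazyFrame is arguably one of the most powerful features of Polars. It does not execute anything right away. Instead it builds a query plan and runs it in one go when you call .collect():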

import polars as pl

# Eager mode: every step runs immediately
df = pl.read_csv("sales.csv")
result = df.filter(pl.col("amount") > 100).select("product", "amount")

# Lazy mode: build up the plan first, then run it in one go
result = (
    pl.scan_csv("sales.csv")          # returns a LazyFrame; no data read yet
    .filter(pl.col("amount") > 100)   # recorded, not executed
    .select("product", "amount")      # still only recorded
    .collect()                        # now the query actually runs
)

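The benefit is that the query optimizer sees the whole plan, so it can read only the columns that are needed (projection pushdown) and skip rows that fail the filter as early as possible (predicate pushdown). On large files this can make a very noticeable difference.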

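4. The Expression System: The Core Strength of Polars

4.1 Basic Expressions

Expressions are the soul of Polars. Personally, I think that once you understand the expression system you have 80% of Polars under your belt, and that is no exaggeration.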

import polars as pl

df = pl.DataFrame({
    "product": ["笔记本", "手机", "平板", "耳机", "手表"],
    "price": [6999, 4999, 3299, 999, 2499],
    "quantity": [120, 350, 200, 800, 150],
    "category": ["电脑", "手机", "电脑", "配件", "配件"],
})

# pl.col() - reference a column
total = df.select(pl.col("price") * pl.col("quantity"))

# pl.lit() - a literal value
df_with_tax = df.with_columns(
    (pl.col("price") * pl.lit(1.13)).alias("price_with_tax")
)

# pl.when().then().otherwise() - conditional expression (like SQL's CASE WHEN)
df_level = df.with_columns(
    pl.when(pl.col("price") > 5000)
    .then(pl.lit("高端"))
    .when(pl.col("price") > 2000)
    .then(pl.lit("中端"))
    .otherwise(pl.lit("入门"))
    .alias("price_level")
)
print(df_level)

4.2 Chaining Expressions

Polars expressions chain very flexibly, so a single expression can carry out a fairly complex transformation:

import polars as pl

df = pl.DataFrame({
    "text": ["  Hello World  ", "  Polars is FAST  ", "  数据分析  "],
    "value": [1.23456, 7.89012, 3.45678],
    "date_str": ["2026-01-15", "2026-02-20", "2026-03-25"],
})

result = df.with_columns(
    # Chained string processing
    pl.col("text")
    .str.strip_chars()
    .str.to_lowercase()
    .str.replace_all(" ", "_")
    .alias("cleaned_text"),

    # Numeric formatting: round, cast to string, then append the unit with `+`
    # (the .str namespace has no concat_horizontal method)
    (pl.col("value").round(2).cast(pl.Utf8) + pl.lit(" 元"))
    .alias("formatted_value"),

    # Date parsing
    pl.col("date_str")
    .str.to_date("%Y-%m-%d")
    .alias("date"),
)
print(result)

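4.3 String and Datetime Expressions

Polars ships with strong string and datetime handling, reached through the .str and .dt namespaces respectively. The design is quite elegant and feels natural to use: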

import polars as pl
from datetime import datetime

df = pl.DataFrame({
    "name": ["张三丰", "李白", "杜甫", "白居易"],
    # Placeholder addresses: the originals were masked as "[email protected]" in the source
    "email": ["user1@example.com", "user2@example.org", "user3@example.net", "user4@example.com"],
    "created_at": [
        datetime(2026, 1, 15, 8, 30),
        datetime(2026, 1, 20, 14, 15),
        datetime(2026, 2, 1, 9, 0),
        datetime(2026, 2, 10, 16, 45),
    ],
})

result = df.with_columns(
    # String expressions
    pl.col("name").str.len_chars().alias("name_length"),
    pl.col("email").str.split("@").list.get(1).alias("domain"),
    pl.col("name").str.contains("白").alias("has_bai"),

    # Datetime expressions
    pl.col("created_at").dt.year().alias("year"),
    pl.col("created_at").dt.month().alias("month"),
    pl.col("created_at").dt.weekday().alias("weekday"),
    pl.col("created_at").dt.strftime("%Y年%m月%d日").alias("formatted_date"),
)
print(result)

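4.4 About map_elements: Avoid It If You Can

map_elements (similar to apply in Pandas) runs a custom Python function on every element. Seriously, though, treat it as a last resort: as soon as you use it you leave the Rust execution engine and call Python once per element, and performance drops off a cliff.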

import polars as pl

df = pl.DataFrame({"value": [1, 2, 3, 4, 5]})

# ❌ Slow: using map_elements
result_slow = df.with_columns(
    pl.col("value").map_elements(lambda x: x ** 2 + 1, return_dtype=pl.Int64).alias("result")
)

# ✅ Fast: using native expressions
result_fast = df.with_columns(
    (pl.col("value").pow(2) + 1).alias("result")
)

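My rule of thumb: if you catch yourself writing map_elements, stop and ask whether native Polars expressions can do the job. 99% of the time the answer is yes. For the remaining 1%, well, you just have to use it (laughs).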

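5. Group-By Aggregation and Window Functions

5.1 Basic Group-By Aggregation

group_by().agg() in Polars is a pleasure to use; a single aggregation can compute several metrics at once: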

import polars as pl

df = pl.DataFrame({
    "department": ["工程", "市场", "工程", "市场", "工程", "运营", "运营"],
    "employee": ["张三", "李四", "王五", "赵六", "钱七", "孙八", "周九"],
    "salary": [15000, 22000, 18000, 25000, 13000, 16000, 14000],
    "years": [3, 7, 10, 5, 1, 4, 2],
})

result = df.group_by("department").agg(
    pl.col("employee").count().alias("人数"),
    pl.col("salary").mean().round(0).alias("平均薪资"),
    pl.col("salary").max().alias("最高薪资"),
    pl.col("salary").min().alias("最低薪资"),
    pl.col("years").mean().round(1).alias("平均年限"),
    pl.col("employee").sort_by("salary", descending=True).first().alias("最高薪员工"),
)
print(result)

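Look closely at the last aggregation expression: it sorts inside the aggregation and then takes the first value. In Pandas you would typically write a custom aggregation function or sort beforehand, but in Polars it is a one-line expression. Once you get used to this expressiveness, it is hard to go back.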

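5.2 Window Functions: .over()

Window functions are one of the most powerful features in the SQL world, and Polars brings them to DataFrames through .over(). Put simply, a window function lets you compute a per-group aggregate without changing the number of rows: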

import polars as pl

df = pl.DataFrame({
    "department": ["工程", "工程", "工程", "市场", "市场"],
    "employee": ["张三", "王五", "钱七", "李四", "赵六"],
    "salary": [15000, 18000, 13000, 22000, 25000],
})

result = df.with_columns(
    # Average salary per department (window aggregation)
    pl.col("salary").mean().over("department").alias("dept_avg_salary"),

    # Salary rank within the department
    pl.col("salary").rank("dense", descending=True).over("department").alias("dept_rank"),

    # Salary as a percentage of the department total
    (pl.col("salary") / pl.col("salary").sum().over("department") * 100)
    .round(1)
    .alias("dept_salary_pct"),
)
print(result)
# shape: (5, 6)
# ┌────────────┬──────────┬────────┬────────────────┬───────────┬─────────────────┐
# │ department ┆ employee ┆ salary ┆ dept_avg_salary┆ dept_rank ┆ dept_salary_pct │
# │ ---        ┆ ---      ┆ ---    ┆ ---            ┆ ---       ┆ ---             │
# │ str        ┆ str      ┆ i64    ┆ f64            ┆ u32       ┆ f64             │
# ╞════════════╪══════════╪════════╪════════════════╪═══════════╪═════════════════╡
# │ 工程       ┆ 张三     ┆ 15000  ┆ 15333.333333   ┆ 2         ┆ 32.6            │
# │ 工程       ┆ 王五     ┆ 18000  ┆ 15333.333333   ┆ 1         ┆ 39.1            │
# │ 工程       ┆ 钱七     ┆ 13000  ┆ 15333.333333   ┆ 3         ┆ 28.3            │
# │ 市场       ┆ 李四     ┆ 22000  ┆ 23500.0        ┆ 2         ┆ 46.8            │
# │ 市场       ┆ 赵六     ┆ 25000  ┆ 23500.0        ┆ 1         ┆ 53.2            │
# └────────────┴──────────┴────────┴────────────────┴───────────┴─────────────────┘

5.3 Rolling Computations

When working with time series, rolling windows are almost a must. Polars supports them well:

import polars as pl
from datetime import date

df = pl.DataFrame({
    "date": pl.date_range(date(2026, 1, 1), date(2026, 1, 10), eager=True),
    "sales": [100, 120, 90, 150, 200, 180, 160, 210, 190, 220],
})

result = df.with_columns(
    # 3-day moving average
    pl.col("sales").rolling_mean(window_size=3).alias("ma_3"),
    # Cumulative sales
    pl.col("sales").cum_sum().alias("cumulative_sales"),
    # Period-over-period growth rate (%)
    ((pl.col("sales") - pl.col("sales").shift(1)) / pl.col("sales").shift(1) * 100)
    .round(1)
    .alias("growth_rate"),
)
print(result)

6. Joining and Combining Data

6.1 Joins

Polars has comprehensive join support with clear, intuitive syntax. The join types are demonstrated below:

import polars as pl

# Orders table
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [101, 102, 103, 101, 104],
    "amount": [500, 300, 800, 200, 600],
})

# Customers table
customers = pl.DataFrame({
    "customer_id": [101, 102, 103, 105],
    "name": ["张三", "李四", "王五", "赵六"],
    "city": ["北京", "上海", "广州", "深圳"],
})

# Inner join: keep only rows that match on both sides
inner = orders.join(customers, on="customer_id", how="inner")

# Left join: keep every row of the left table
left = orders.join(customers, on="customer_id", how="left")

# Full outer join: keep rows from both sides
outer = orders.join(customers, on="customer_id", how="full", coalesce=True)

# Semi join: keep left rows that have a match on the right (without the right table's columns)
semi = orders.join(customers, on="customer_id", how="semi")

# Anti join: keep left rows that have no match on the right
anti = orders.join(customers, on="customer_id", how="anti")
print(f"Orders without customer info:\n{anti}")

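Semi and anti joins are one of the things I particularly like about Polars. In Pandas you have to take quite a detour to get either of them, whereas in Polars you just write how="semi" or how="anti" and you are done.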

6.2 Concatenating Data

import polars as pl

df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "b": [7, 8]})

# Vertical concatenation (like SQL's UNION ALL)
vertical = pl.concat([df1, df2], how="vertical")

# Horizontal concatenation
df3 = pl.DataFrame({"c": [9, 10]})
horizontal = pl.concat([df1, df3], how="horizontal")

# Diagonal concatenation (vertical, but columns may differ; missing columns are filled with null)
df4 = pl.DataFrame({"a": [7, 8], "c": [11, 12]})
diagonal = pl.concat([df1, df4], how="diagonal")
print(diagonal)

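7. Lazy Evaluation and Query Optimization in Depth

7.1 How the Query Optimizer Works

The query optimizer is the main source of Polars' performance advantage, and I consider this section the key to understanding Polars. When you use a LazyFrame, Polars does not execute anything immediately; it builds a query plan tree. When you call .collect(), the optimizer runs several optimization passes over that tree.

The main optimization strategies are:

Predicate pushdown: push filter conditions as close to the data source as possible. For example, if your code joins first and filters afterwards, the optimizer can filter out the unneeded rows before the join
Projection pushdown: read only the columns the query actually uses. The CSV has 100 columns but you use 3? Polars parses only those 3
Common subexpression elimination: when the same expression appears several times, it is computed once
Type coercion: the casts needed to make operand types compatible are inserted automatically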

7.2 Inspecting the Query Plan with .explain()

Want to know what the optimizer actually did? .explain() shows the optimized execution plan:

import polars as pl

# Build a lazy query
query = (
    pl.scan_csv("large_sales.csv")
    .filter(pl.col("region") == "华东")
    .filter(pl.col("amount") > 1000)
    .group_by("product")
    .agg(
        pl.col("amount").sum().alias("total_amount"),
        pl.col("amount").count().alias("order_count"),
    )
    .sort("total_amount", descending=True)
    .head(10)
)

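# Show the optimized query plan
print(query.explain())
# You should see the two filter conditions merged into one,
# the filter applied during the CSV scan (predicate pushdown),
# and only the needed columns being read (projection pushdown)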

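7.3 Streaming: When the Data Does Not Fit in Memory

When a dataset is too large for memory, it is time for Polars' streaming mode. It processes the data in batches and keeps memory usage under control: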

import polars as pl

# Stream through a very large CSV file
result = (
    pl.scan_csv("huge_dataset.csv")  # even a 100GB file is workable
    .filter(pl.col("status") == "active")
    .group_by("category")
    .agg(
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("id").count().alias("record_count"),
    )
    .sort("total_revenue", descending=True)
    .collect(engine="streaming")  # run on the streaming engine (streaming=True is deprecated)
)

# Stream the output to a file (the result is never fully loaded into memory)
(
    pl.scan_csv("huge_dataset.csv")
    .filter(pl.col("year") == 2026)
    .with_columns(
        (pl.col("price") * pl.col("quantity")).alias("total")
    )
    .sink_parquet("output.parquet")  # streaming write to Parquet
)

# Streaming to CSV works too
(
    pl.scan_csv("huge_dataset.csv")
    .filter(pl.col("region") == "华北")
    .sink_csv("filtered_output.csv")  # streaming write to CSV
)

7.4 A Complete Lazy Example

The following example shows how to process data purely in lazy mode so the optimizer can do as much as possible:

import polars as pl

# Suppose we have a large sales data file
# Everything stays lazy; the optimizer takes care of the rest

result = (
    pl.scan_csv("sales_2026.csv")
    # Projection pushdown: read only the needed columns
    .select("date", "product", "region", "amount", "quantity")
    # Predicate pushdown: filter during IO ("date" is still an ISO string here, so string comparison works)
    .filter(
        (pl.col("date") >= "2026-01-01") &
        (pl.col("date") < "2026-02-01") &
        (pl.col("amount") > 0)
    )
    # Add computed columns
    .with_columns(
        (pl.col("amount") * pl.col("quantity")).alias("revenue"),
        pl.col("date").str.to_date("%Y-%m-%d").dt.weekday().alias("weekday"),
    )
    # Group and aggregate
    .group_by("region", "weekday")
    .agg(
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("revenue").mean().round(2).alias("avg_revenue"),
        pl.len().alias("order_count"),
    )
    .sort("region", "weekday")
    .collect()  # optimize and execute in one go
)

print(result)

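8. Performance Comparison: Polars vs Pandas in Practice

8.1 Performance in Typical Scenarios

After all this talk of Polars being fast, how fast is it really? The following results were measured on an Apple M2 Pro (10 cores) with 32 GB of RAM, on a dataset of 10 million rows and 20 columns:

CSV read: Polars about 1.2 s, Pandas about 6.5 s; roughly 5x faster, with about 87% less memory
Sort: Polars about 0.3 s, Pandas about 3.3 s; roughly 11x faster
Group-by aggregation: Polars about 0.15 s, Pandas about 1.2 s; roughly 8x faster
Join: Polars about 0.5 s, Pandas about 2.8 s; roughly 5.6x faster
Filter: Polars about 0.05 s, Pandas about 0.3 s; roughly 6x faster

Of course these numbers are not fixed. The actual speedup depends on the shape of the data, the type of operation, and the hardware. But the overall trend is consistent: on medium to large datasets, Polars is typically 3-10x faster and uses 50-90% less memory. That is not a small difference.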

8.2 Benchmark Code

Want to run it yourself? Here is test code you can copy and run directly:

import polars as pl
import pandas as pd
import numpy as np
import time

# Generate test data
n_rows = 10_000_000
np.random.seed(42)

data = {
    "id": np.arange(n_rows),
    "group": np.random.choice(["A", "B", "C", "D", "E"], n_rows),
    "value1": np.random.randn(n_rows),
    "value2": np.random.uniform(0, 1000, n_rows),
}

# Pandas baseline
pdf = pd.DataFrame(data)

start = time.perf_counter()
pdf_result = pdf.groupby("group").agg({"value1": "mean", "value2": "sum"})
pandas_time = time.perf_counter() - start
print(f"Pandas groupby: {pandas_time:.3f}s")

# Polars baseline
plf = pl.DataFrame(data)

start = time.perf_counter()
plf_result = plf.group_by("group").agg(
    pl.col("value1").mean(),
    pl.col("value2").sum(),
)
polars_time = time.perf_counter() - start
print(f"Polars groupby: {polars_time:.3f}s")

print(f"Speedup: {pandas_time / polars_time:.1f}x")
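
A single timed run like the one above can be noisy (caches warming up, other processes, and so on). A common remedy is to repeat each measurement and keep the best time. The helper below is a small sketch of that idea; `best_of` is a name made up for this article, not part of Polars or Pandas.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Usage with a stand-in workload; in the benchmark above you would pass
# e.g. lambda: pdf.groupby("group").agg({"value1": "mean", "value2": "sum"})
elapsed = best_of(lambda: sum(range(100_000)), repeats=3)
print(f"best of 3: {elapsed:.6f}s")
```

Taking the minimum rather than the mean filters out one-off slowdowns, which is usually what you want when comparing two libraries on the same machine.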

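8.3 When Pandas Is the Better Fit

To be fair, Polars does not crush Pandas in every scenario. In the following cases Pandas may be the better choice:

Small datasets (a few thousand to tens of thousands of rows): the speed difference is negligible, and the Pandas API is more mature and more complete
Machine learning pipelines: scikit-learn, XGBoost, and similar libraries are still built mainly around Pandas DataFrames or NumPy arrays as input
Interactive data exploration: in a Jupyter Notebook, Pandas has more polished display, index-based slicing, and related features
Maintaining legacy projects: if you inherit an all-Pandas codebase, there is no need for a full rewrite just for performance (unless it really is slow)
Third-party ecosystem: libraries such as seaborn and statsmodels still take Pandas objects directly in their APIs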

9. Interoperating with the Existing Ecosystem

9.1 Converting to and from Pandas

Converting between Polars and Pandas is easy, basically a line or two of code:

import polars as pl
import pandas as pd

# Polars to Pandas
pl_df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
pd_df = pl_df.to_pandas()
print(type(pd_df))  # <class 'pandas.core.frame.DataFrame'>

# Pandas to Polars
pd_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
pl_df = pl.from_pandas(pd_df)
print(type(pl_df))  # <class 'polars.dataframe.frame.DataFrame'>

# Conversion through Arrow (zero-copy for most column types)
arrow_table = pl_df.to_arrow()
pl_df_back = pl.from_arrow(arrow_table)

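9.2 Integrating with scikit-learn

scikit-learn's support for Polars DataFrames is still partial (recent releases can, for example, return Polars output via set_output(transform="polars")), so converting where needed remains the most dependable route. My own approach is to do all the preprocessing in Polars (fast) and convert to NumPy only at the model-training step: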

import polars as pl
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Build the data in Polars
df = pl.DataFrame({
    "feature1": np.random.randn(1000),
    "feature2": np.random.randn(1000),
    "feature3": np.random.randn(1000),
    "label": np.random.choice([0, 1], 1000),
})

# Do the preprocessing in Polars (fast)
df_processed = df.with_columns(
    (pl.col("feature1") - pl.col("feature1").mean()).alias("feature1_centered"),
    pl.col("feature2").abs().alias("feature2_abs"),
    (pl.col("feature1") * pl.col("feature3")).alias("interaction"),
)

# Convert to NumPy only at the model-training step
feature_cols = ["feature1_centered", "feature2_abs", "interaction", "feature3"]
X = df_processed.select(feature_cols).to_numpy()
y = df_processed["label"].to_numpy()

# From here on it is the standard scikit-learn workflow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")

9.3 Integrating with Visualization Libraries

import polars as pl
import matplotlib.pyplot as plt

df = pl.DataFrame({
    "month": ["1月", "2月", "3月", "4月", "5月", "6月"],
    "revenue": [120, 150, 180, 165, 200, 220],
    "cost": [80, 95, 110, 100, 120, 130],
})

# Option 1: convert to Pandas, then plot
pd_df = df.to_pandas()
pd_df.plot(x="month", y=["revenue", "cost"], kind="bar")
plt.title("月度收入与成本")
plt.ylabel("金额（万元）")
plt.tight_layout()
plt.savefig("revenue_cost.png")

# Option 2: extract the column data and plot directly
months = df["month"].to_list()
revenue = df["revenue"].to_list()

plt.figure(figsize=(10, 6))
plt.plot(months, revenue, marker="o", linewidth=2)
plt.title("月度收入趋势")
plt.ylabel("金额（万元）")
plt.grid(True, alpha=0.3)
plt.savefig("revenue_trend.png")

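10. Case Study: Building a Complete Data Processing Pipeline

Now for the most practical part. Let's use a complete ETL pipeline to bring together what we have covered so far. Suppose we work at an e-commerce company and need to process sales data from several cities to produce a monthly analysis report.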

import polars as pl
from datetime import date

# ============================================================
# Step 1: Read the data sources (lazy mode)
# ============================================================

# Scan sales data for several cities (in a real project these would be actual files)
# In-memory data is used here for the demo; swap in pl.scan_csv() in real work
orders = pl.LazyFrame({
    "order_id": range(1, 10001),
    "customer_id": [f"C{i % 500 + 1:04d}" for i in range(10000)],
    "product_id": [f"P{i % 50 + 1:03d}" for i in range(10000)],
    "quantity": [((i * 7 + 3) % 10) + 1 for i in range(10000)],
    "unit_price": [round(50 + (i * 13 % 500), 2) for i in range(10000)],
    "order_date": [
        date(2026, (i % 12) + 1, (i % 28) + 1) for i in range(10000)
    ],
    "city": [
        ["北京", "上海", "广州", "深圳", "杭州"][i % 5] for i in range(10000)
    ],
})

products = pl.LazyFrame({
    "product_id": [f"P{i:03d}" for i in range(1, 51)],
    "product_name": [f"商品{i}" for i in range(1, 51)],
    "category": [
        ["电子产品", "服装", "食品", "家居", "图书"][i % 5] for i in range(50)
    ],
})

# ============================================================
# Step 2: Clean and transform the data
# ============================================================

cleaned_orders = (
    orders
    # Filter out invalid rows
    .filter(
        (pl.col("quantity") > 0) &
        (pl.col("unit_price") > 0)
    )
    # Compute the order amount and date parts
    .with_columns(
        (pl.col("quantity") * pl.col("unit_price")).round(2).alias("total_amount"),
        pl.col("order_date").cast(pl.Date).dt.month().alias("month"),
        pl.col("order_date").cast(pl.Date).dt.quarter().alias("quarter"),
    )
    # Bucket orders by amount
    .with_columns(
        pl.when(pl.col("total_amount") > 2000)
        .then(pl.lit("大额订单"))
        .when(pl.col("total_amount") > 500)
        .then(pl.lit("中额订单"))
        .otherwise(pl.lit("小额订单"))
        .alias("order_level")
    )
)

# ============================================================
# Step 3: Join in the product information
# ============================================================

enriched = cleaned_orders.join(products, on="product_id", how="left")

# ============================================================
# Step 4: Multi-dimensional aggregation
# ============================================================

# Monthly summary per city
city_monthly = (
    enriched
    .group_by("city", "month")
    .agg(
        pl.col("total_amount").sum().round(2).alias("总销售额"),
        pl.col("total_amount").mean().round(2).alias("客单价"),
        pl.len().alias("订单数"),
        pl.col("customer_id").n_unique().alias("客户数"),
        pl.col("quantity").sum().alias("总销量"),
    )
    .with_columns(
        (pl.col("总销售额") / pl.col("客户数")).round(2).alias("人均消费")
    )
    .sort("city", "month")
)

# Category analysis
category_analysis = (
    enriched
    .group_by("category")
    .agg(
        pl.col("total_amount").sum().round(2).alias("总销售额"),
        pl.len().alias("订单数"),
        pl.col("total_amount").mean().round(2).alias("平均订单金额"),
        pl.col("product_id").n_unique().alias("商品种类数"),
    )
    .sort("总销售额", descending=True)
)

# ============================================================
# Step 5: Execute and output the results
# ============================================================

# Collect both queries in a single call. pl.collect_all runs the plans in
# parallel instead of calling .collect() on each one separately.
city_monthly_result, category_result = pl.collect_all(
    [city_monthly, category_analysis]
)

print("=== City monthly sales summary (first 10 rows) ===")
print(city_monthly_result.head(10))
print()
print("=== Category analysis ===")
print(category_result)

# In a real project you would probably write the results to Parquet files
# city_monthly_result.write_parquet("output/city_monthly.parquet")
# category_result.write_parquet("output/category_analysis.parquet")

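This case study covers the typical ways Polars is used in real projects: lazy mode from start to finish, expression-driven transformations, method chaining that keeps the code readable, and a single collection step at the end so the optimizer can do its job. If the code feels a lot like writing SQL, your instinct is right: that is exactly what Polars was designed for.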

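Summary and Outlook

Polars vs Pandas: How to Choose

After all that, here is a reasonably clear recommendation:

Choose Polars: datasets of a million rows or more, high-performance ETL pipelines, memory-sensitive workloads, new projects with no legacy baggage
Choose Pandas: small-scale data exploration, deep integration with the ML ecosystem, maintaining an existing codebase, reliance on Pandas-only features (such as MultiIndex)
Use both together: hand data cleaning and transformation to Polars (fast), then convert to Pandas or NumPy for model training and visualization

My own workflow these days is the third one, and honestly the experience is very good.

Where Polars Is Heading

Polars has real momentum. These directions are worth watching:

Polars Cloud: the official cloud-based distributed execution engine, aiming to let Polars handle PB-scale data
GPU acceleration: through the RAPIDS cuDF backend, some operations can already run on the GPU
Ecosystem integration: more and more libraries are adding native Polars support, such as Great Expectations and Plotly
SQL interface: pl.sql() lets you query Polars DataFrames directly with SQL syntax, which lowers the migration cost considerably for SQL users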

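Final Thoughts

As someone who works with data every day, my advice is simple: learn both Pandas and Polars, and pick the tool that fits the situation. The Pandas 3.0 upgrade has given it new life (see our earlier complete upgrade guide to Pandas 3.0), while Polars represents where DataFrame libraries are heading: faster, more memory-efficient, and smarter about query optimization.

Keep both tools in your toolbox, and you will be able to handle whatever data processing task comes your way.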