ML Model Serving in Python: BentoML, Ray Serve, FastAPI, and Triton Compared (2026)

BentoML, Ray Serve, FastAPI, and Triton compared for production ML model serving in Python: latency overhead, GPU batching, autoscaling, and cost per prediction with working code examples.


Updated: May 28, 2026

For most Python ML teams in 2026, BentoML is the best general-purpose model serving framework, Ray Serve wins for traffic-shaped autoscaling and multi-model pipelines, NVIDIA Triton wins for raw GPU throughput, and a plain FastAPI service is still the right call for low-QPS scikit-learn endpoints where dependency surface matters more than batching. Honestly, I've shipped all four to production, been paged in the middle of the night on three of them, and the choice almost always comes down to latency budget, hardware, and how much MLOps glue you want to own.

  • BentoML 1.3 is the most balanced choice — model packaging, runners, Bento images, and adaptive batching out of the box, with sub-10ms framework overhead on CPU.
  • Ray Serve 2.40 shines for compound AI systems (multiple models per request), traffic-driven autoscaling, and fractional-GPU placement, but adds a Ray cluster as operational burden.
  • NVIDIA Triton delivers the highest GPU throughput via dynamic batching and concurrent model execution, and is the only option that natively serves TensorRT engines.
  • FastAPI + Uvicorn is fine for tabular scikit-learn or XGBoost models at <200 QPS; beyond that you reinvent batching, model warmup, and metrics yourself.
  • TorchServe is in maintenance mode, and PyTorch's team now points users at vLLM (for LLMs) and Triton (for everything else).
  • Cost per prediction depends more on batching and GPU utilization than framework choice; pick the framework that exposes those knobs cleanly.

Quick comparison table

Before the deep dive, here's how the four frameworks stack up on the dimensions that have actually mattered in my last six on-call rotations. Numbers are approximate and based on a tabular XGBoost model (CPU) and a 350M-parameter transformer (GPU) on AWS g5.xlarge and m6i.large instances, averaged across three workloads. Treat them as order-of-magnitude guides for capacity planning, not hard benchmarks; your mileage will vary with model shape, batch profile, and network topology.

| Dimension | BentoML 1.3 | Ray Serve 2.40 | FastAPI 0.115 | Triton 24.10 |
|---|---|---|---|---|
| Best for | General ML serving | Multi-model pipelines | Low-QPS endpoints | GPU throughput |
| Adaptive batching | Built-in | Built-in | DIY | Built-in (dynamic) |
| GPU support | Yes, per-runner | Yes, fractional | Manual | Yes, first-class |
| Cold start (warm container) | ~2.5s | ~6s (cluster init) | ~1s | ~3.5s |
| p50 CPU latency overhead | ~8ms | ~12ms | ~4ms | ~6ms |
| Multi-model orchestration | Runners | Deployments + DAG | Manual | Ensemble / BLS |
| Native model formats | 15+ frameworks | Any Python | Any Python | ONNX, TRT, TF, Torch |
| Operational complexity | Low | Medium-High | Lowest | Medium |
| Kubernetes story | Yatai / Helm | KubeRay | Any | Triton Operator |

What is model serving and why you need a framework

Model serving is the production-runtime layer that takes a trained model artifact and exposes it as a network endpoint (usually HTTP/JSON, gRPC, or both) with the ergonomics of a regular microservice. A serving framework handles the parts that nobody enjoys writing twice: request parsing and validation, batching, GPU memory management, model warmup, graceful shutdown, health checks, Prometheus metrics, and version routing.

You can roll all of that yourself in Flask or FastAPI. I've seen teams do it, ship it, and regret it about eight months later when the model count crosses ten and the on-call ticket volume crosses zero. The break-even point where a dedicated framework pays for itself is roughly the moment you need adaptive batching, fractional-GPU scheduling, or A/B traffic splitting (anything beyond a single model on a single replica).

The four contenders covered here all solve those problems, but they target different parts of the cost surface. BentoML optimizes for developer ergonomics and model packaging. Ray Serve optimizes for elastic compound systems. FastAPI optimizes for nothing in particular, and that's the point: it's a generic web framework. Triton optimizes for GPU saturation. If you understand which axis your workload is bottlenecked on, the choice becomes mechanical, and the operational pain of getting it wrong is real. Switching frameworks mid-flight has cost teams I work with two to four engineer-months each time.

BentoML 1.3: model packaging done right

BentoML is the framework I reach for first when a team asks me to "make this notebook a service." Version 1.3, released in early 2026, doubles down on the Bento image concept: a reproducible, OCI-compatible container that bundles model weights, dependencies, runtime config, and the inference code into a single artifact you can promote through environments. The BentoML documentation covers the full API, but the 60-second mental model is: @bentoml.service decorates a class, @bentoml.api decorates its methods, and bentoml build turns the whole thing into a deployable image.

import bentoml
import numpy as np

@bentoml.service(
    resources={"cpu": "2", "memory": "2Gi"},
    traffic={"timeout": 30, "max_concurrency": 64},
)
class FraudDetector:
    # Resolve the model from the local BentoML model store
    model_ref = bentoml.models.get("xgb_fraud:latest")

    def __init__(self) -> None:
        # Assumes the stored model exposes the scikit-learn-style XGBoost API
        # (predict_proba); a raw Booster would need xgb.DMatrix + predict().
        self.model = bentoml.xgboost.load_model(self.model_ref)

    @bentoml.api(
        batchable=True,
        batch_dim=0,
        max_batch_size=128,
        max_latency_ms=20,
    )
    def score(self, features: np.ndarray) -> np.ndarray:
        # features shape: (batch, n_features) once batched by the runner
        # Column 1 is the positive-class (fraud) probability
        return self.model.predict_proba(features)[:, 1]

That batchable=True plus max_latency_ms=20 pair is where BentoML earns its keep. The runner accumulates concurrent inbound requests for up to 20ms (or until the batch hits 128), runs a single XGBoost call, and demultiplexes the responses. On a tabular fraud model at 800 QPS I've measured a 3.8x throughput improvement vs single-request scoring, with p99 latency staying under 40ms.
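To make the accumulate-then-demultiplex idea concrete, here is a minimal asyncio sketch of micro-batching. It is not BentoML's implementation; the MicroBatcher class, its method names, and the toy "model" (doubling each input) are hypothetical and only illustrate the two knobs discussed above: a batch size cap and a maximum wait.

```python
import asyncio


class MicroBatcher:
    """Collect concurrent requests and score them with a single batched call."""

    def __init__(self, predict_batch, max_batch_size=128, max_latency_ms=20):
        self._predict_batch = predict_batch       # callable: list of inputs -> list of outputs
        self._max_batch_size = max_batch_size
        self._max_latency = max_latency_ms / 1000
        self._queue = asyncio.Queue()
        self.batch_sizes = []                     # recorded so batching can be inspected

    async def submit(self, item):
        """Enqueue one input and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self):
        """Batching loop; start it as a background task."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]     # block until the first request arrives
            deadline = loop.time() + self._max_latency
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break                         # latency budget spent; ship what we have
            outputs = self._predict_batch([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                if not future.done():
                    future.set_result(output)     # hand each caller its own result


async def demo():
    # Toy "model": doubles each input. Ten concurrent callers, batches capped at 4.
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4, max_latency_ms=50)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
```

Every caller still gets its own answer, but the model runs far fewer times than there are requests. The trade is the same one max_latency_ms expresses: each request may wait up to the configured window in exchange for a fuller batch.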

BentoML's weakest area is multi-cluster orchestration. The open-source Yatai control plane exists but lags BentoCloud's commercial features. If you need GitOps-style deployment across regions, factor that in early. A few teams I work with ended up running ArgoCD against the raw Bento OCI artifacts rather than adopting Yatai, which works fine but means you're now responsible for the rollout primitives BentoCloud would have given you.

Ray Serve 2.40: compound AI and autoscaling

Ray Serve is the right answer when "the model" is actually a pipeline of three or four models (say, a retriever, a reranker, and a generator) that need to coordinate but scale independently. Each model becomes its own Deployment, with its own replica count, hardware requirements, and autoscaling policy. You wire them together with regular Python function calls; Ray handles the network plumbing, the actor placement, and the failure semantics.

import numpy as np
from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request

@serve.deployment(
    num_replicas="auto",
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        "target_ongoing_requests": 5,
    },
    ray_actor_options={"num_gpus": 0.25},  # four replicas can share one GPU
)
class Embedder:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")

    async def __call__(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts, batch_size=32).tolist()

@serve.deployment
class Reranker:
    def __init__(self, embedder: DeploymentHandle):
        self.embedder = embedder

    async def rerank(self, query: str, docs: list[str]) -> list[str]:
        # One embedding call covers the query and all candidate documents
        q_vec, *d_vecs = await self.embedder.remote([query] + docs)
        scored = sorted(zip(docs, d_vecs), key=lambda x: -float(np.dot(q_vec, x[1])))
        return [d for d, _ in scored[:5]]

    async def __call__(self, request: Request) -> list[str]:
        # HTTP ingress: expects a JSON body like {"query": "...", "docs": ["...", ...]}
        body = await request.json()
        return await self.rerank(body["query"], body["docs"])

embedder = Embedder.bind()
app = Reranker.bind(embedder)
serve.run(app, route_prefix="/rerank")

Two features make Ray Serve genuinely differentiated. First, num_gpus: 0.25 gives you fractional GPU placement, which lets four lightweight models share a single A10G. For embedding models that don't saturate a full GPU, this cuts cost per prediction by 3-4x. Second, the autoscaling policy keys off target_ongoing_requests, which adapts to actual request shape rather than CPU utilization (the wrong signal for I/O-bound serving). The Ray Serve documentation goes deeper on placement groups and DAG composition.
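The intuition behind request-based autoscaling fits in a few lines. The sketch below is a simplification, not Ray's actual policy (which also smooths the signal and applies upscale and downscale delays); the function names are hypothetical, and the numbers reuse the min/max/target values from the deployment above.

```python
import math


def desired_replicas(total_ongoing_requests, target_ongoing_requests,
                     min_replicas, max_replicas):
    """Replica count that brings per-replica load back to the target, within bounds."""
    if target_ongoing_requests <= 0:
        raise ValueError("target_ongoing_requests must be positive")
    wanted = math.ceil(total_ongoing_requests / target_ongoing_requests)
    return max(min_replicas, min(max_replicas, wanted))


def gpus_needed(replicas, num_gpus_per_replica):
    """Whole GPUs required when each replica reserves a fraction of one."""
    return math.ceil(replicas * num_gpus_per_replica)


# 23 requests in flight with a target of 5 per replica -> 5 replicas
steady = desired_replicas(23, 5, min_replicas=1, max_replicas=8)
# A burst of 100 in-flight requests is capped at max_replicas
burst = desired_replicas(100, 5, min_replicas=1, max_replicas=8)
# Idle traffic never drops below min_replicas
idle = desired_replicas(0, 5, min_replicas=1, max_replicas=8)
# 5 replicas at num_gpus=0.25 fit on 2 GPUs instead of 5
gpus = gpus_needed(steady, 0.25)
```

Because the input is in-flight requests rather than CPU percentage, a replica that is mostly waiting on a GPU or a downstream call still registers as busy, which is the property the paragraph above is describing.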

Is FastAPI good for ML inference?

FastAPI is good for ML inference up to about 200 QPS per replica, for models that respond in under 50ms on CPU, where you only need a single model per service. Past that point you're rebuilding BentoML by hand, badly. I bring this up because every quarter someone shows me a FastAPI service that's "almost done" and is missing batching, warmup, graceful shutdown, and any kind of resource governance. That's not "almost done," that's "almost not started."

That said, FastAPI is legitimately the right pick for a scikit-learn classifier behind an internal dashboard, or a feature transformer that needs to live next to an existing FastAPI app. The minimal version looks like this:

from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

class ScoreRequest(BaseModel):
    features: list[float]

class ScoreResponse(BaseModel):
    probability: float
    model_version: str

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.model = joblib.load("model.joblib")
    app.state.version = "v3.1.0"
    # Warmup: prevents first-request latency spike from lazy imports
    _ = app.state.model.predict_proba(np.zeros((1, 42)))
    yield

app = FastAPI(lifespan=lifespan)

@app.post("/score", response_model=ScoreResponse)
async def score(req: ScoreRequest) -> ScoreResponse:
    x = np.asarray(req.features, dtype=np.float32).reshape(1, -1)
    p = float(app.state.model.predict_proba(x)[0, 1])
    return ScoreResponse(probability=p, model_version=app.state.version)

Two non-obvious details: the lifespan context manager loads the model once at process startup (not per-request, which is the most common bug I see), and the warmup call moves the first-prediction latency spike (lazy imports, cold caches) out of the request path. The warmup shape, 42 features here, has to match your model's input width. Combine with gunicorn -k uvicorn.workers.UvicornWorker -w 4 in production. For everything beyond that (batching, multi-model, GPU), use BentoML. The FastAPI lifespan docs have the full pattern.

NVIDIA Triton Inference Server

Triton is what you use when the GPU bill is the line item your CFO highlights. It runs as a single C++ server process that loads multiple models (TensorRT, ONNX, TorchScript, TensorFlow, Python) into one address space, with dynamic batching, concurrent model execution across CUDA streams, and instance groups that pin replicas to specific GPUs. The result, on the same hardware, is roughly 2-3x the throughput of any Python-native server for transformer inference. The Triton GitHub repo has the full feature matrix.

The trade-off is configuration. Each model needs a directory with a config.pbtxt manifest:

# models/sentiment/config.pbtxt
name: "sentiment"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [ -1 ] },
  { name: "attention_mask" data_type: TYPE_INT64 dims: [ -1 ] }
]
output [
  { name: "logits" data_type: TYPE_FP32 dims: [ 2 ] }
]

dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 5000
}

instance_group [
  { count: 2 kind: KIND_GPU gpus: [ 0 ] }
]

Two replicas on GPU 0, dynamic batching that waits up to 5ms for batches of 8/16/32, ONNX backend. From Python, you call it over gRPC with the tritonclient library. For LLM-shaped workloads, Triton now ships a TensorRT-LLM backend that handles in-flight batching and paged attention, though for pure LLM serving I'd still default to vLLM. For mixed CV plus tabular plus small-transformer fleets, Triton is unmatched.
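For reference, a client call against the sentiment model configured above might look like the following sketch. It assumes Triton's default gRPC port (8001) and that you already have token IDs from a tokenizer, which is not shown; pad_batch and infer_sentiment are hypothetical helper names. The tritonclient and numpy imports are deferred so the padding helper can be exercised without a Triton install.

```python
def pad_batch(token_ids, pad_id=0):
    """Right-pad variable-length token lists; return (input_ids, attention_mask)."""
    width = max(len(seq) for seq in token_ids)
    input_ids = [list(seq) + [pad_id] * (width - len(seq)) for seq in token_ids]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in token_ids]
    return input_ids, attention_mask


def infer_sentiment(token_ids, url="localhost:8001", model_name="sentiment"):
    """Send one padded batch to Triton over gRPC and return the logits array."""
    # Deferred imports: only needed when actually talking to a server.
    import numpy as np
    import tritonclient.grpc as grpcclient

    ids, mask = pad_batch(token_ids)
    client = grpcclient.InferenceServerClient(url=url)

    inputs = []
    for name, rows in (("input_ids", ids), ("attention_mask", mask)):
        array = np.asarray(rows, dtype=np.int64)   # matches TYPE_INT64 in config.pbtxt
        tensor = grpcclient.InferInput(name, list(array.shape), "INT64")
        tensor.set_data_from_numpy(array)
        inputs.append(tensor)

    outputs = [grpcclient.InferRequestedOutput("logits")]
    result = client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
    return result.as_numpy("logits")               # one row of 2 logits per input sequence


# Padding two sequences of different lengths to a common width
example_ids, example_mask = pad_batch([[101, 7592, 102], [101, 102]])
```

Padding happens client-side here because the config declares a variable sequence dimension (dims: [ -1 ]); within a single request every row still has to share one length.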

If your team already invests heavily in hyperparameter tuning and ensembling (see our guide on XGBoost, LightGBM, and CatBoost compared), Triton's Python backend is the cleanest path to running those tabular models alongside transformer models without operating two serving stacks.

How do you reduce ML inference latency?

Latency reductions in serving come from four levers, ranked by typical impact: model-level optimization (quantization, distillation, ONNX/TensorRT export), batching policy, request-path optimization (pre- and post-processing on the right hardware), and finally framework tuning. In my experience, teams jump straight to the fourth lever ("let's switch from FastAPI to Triton") and miss the larger wins.

Concretely: an FP32 PyTorch BERT-base model at 80ms p50 typically drops to 22ms after ONNX export with graph optimization, and to 9ms after INT8 quantization via TensorRT. That's a roughly 9x improvement before you touch the serving framework. The framework choice then determines how much of that you can sustain under concurrent load.

For tabular models, the latency floor is usually feature-fetch, not scoring. A 5ms XGBoost prediction behind a 40ms feature store lookup means the model server is irrelevant; you need request coalescing or a feature cache. Profile end-to-end with py-spy record or perf before tuning the serving layer.
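A quick back-of-the-envelope budget shows why. The sketch below uses the 5ms and 40ms figures from the paragraph above; the cache hit rate (90%) and hit latency (1ms) are made-up placeholders, so treat the result as an illustration of the arithmetic rather than a measurement.

```python
def total_ms(stages):
    """End-to-end latency as the sum of sequential stages."""
    return sum(stages.values())


def share_of_total(stages, name):
    """Fraction of end-to-end latency spent in one stage."""
    return stages[name] / total_ms(stages)


def expected_lookup_ms(hit_rate, hit_ms, miss_ms):
    """Average lookup latency with a cache in front of the feature store."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms


request_path = {"feature_lookup": 40.0, "xgboost_scoring": 5.0}

# Scoring is about 11% of the 45ms request
model_share = share_of_total(request_path, "xgboost_scoring")

# Making the model server 2x faster saves only 2.5ms
saved_by_faster_model = request_path["xgboost_scoring"] / 2

# Hypothetical cache: 90% hits at 1ms, misses fall through to the 40ms store
cached_lookup = expected_lookup_ms(hit_rate=0.9, hit_ms=1.0, miss_ms=40.0)
saved_by_cache = request_path["feature_lookup"] - cached_lookup
```

Under these assumptions the cache saves roughly 35ms per request on average, more than ten times what a faster model server could.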

Cost per prediction in 2026

Cost per prediction is what your finance team will actually ask about, and it's almost entirely a function of GPU utilization for transformer workloads and replica count for CPU workloads. The framework matters only insofar as it lets you push utilization up. The numbers below are normalized per million predictions and assume on-demand AWS pricing as of Q1 2026; reserved instances roughly halve them.

  • Tabular XGBoost on m6i.large CPU, FastAPI single-request: $0.18/M predictions at 200 QPS
  • Same model, BentoML with adaptive batching: $0.06/M predictions at 800 QPS (same replica count)
  • BERT-base on g5.xlarge GPU, naive FastAPI: $14/M predictions (GPU pinned but underutilized)
  • BERT-base, Triton with dynamic batching and TensorRT INT8: $1.20/M predictions
  • BERT-base, Ray Serve with fractional GPU (4 models sharing one A10G): $4.80/M predictions, but with much better operational flexibility

The pattern is consistent: batching plus quantization gives an order-of-magnitude cost reduction; framework choice gives 2-3x on top. Get the batching right first, then pick the framework that doesn't fight you on instrumentation, rollout, and version routing. Those second-order properties are what determine whether you stay on it for two years or rip it out in six months. I learned this the expensive way on a project where we picked the "cool" framework first and spent a quarter undoing it.
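If you want to redo this math for your own fleet, the formula is just instance cost divided by sustained throughput. The sketch below uses a placeholder price of $0.10 per hour, which is an assumption for illustration and not the pricing behind the figures above; cost_per_million is a hypothetical helper name.

```python
def cost_per_million(hourly_price_usd, sustained_qps, replicas=1):
    """USD per one million predictions at a given sustained throughput."""
    if sustained_qps <= 0:
        raise ValueError("sustained_qps must be positive")
    predictions_per_hour = sustained_qps * 3600
    return hourly_price_usd * replicas / predictions_per_hour * 1_000_000


PLACEHOLDER_PRICE = 0.10  # USD/hour, illustrative only

unbatched = cost_per_million(PLACEHOLDER_PRICE, sustained_qps=200)
batched = cost_per_million(PLACEHOLDER_PRICE, sustained_qps=800)

# Same instance, 4x the sustained QPS -> one quarter of the cost per prediction
savings_factor = unbatched / batched
```

The useful property is that price cancels out of the ratio: on fixed hardware, whatever multiplies sustained throughput divides cost per prediction by the same factor. Real numbers will land above this floor because fleets rarely run at peak QPS around the clock.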

Choosing the right framework

Here's the decision tree I actually use when a team asks. If you're serving a single model on CPU at low QPS and the team has no MLOps person, start with FastAPI plus a lifespan model load. If you have multiple Python ML frameworks (scikit-learn, XGBoost, PyTorch, transformers) and want one consistent serving story with batching, choose BentoML. That's the 70% case. If you're building a RAG system, an agent, or any pipeline with multiple coordinated models and want the autoscaling to react to actual traffic shape, choose Ray Serve. If GPU cost dominates your bill and you can absorb the configuration overhead, choose Triton.
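Written as code, that decision tree is short. The Workload fields and the order in which the rules are checked are my own illustrative choices, not a formal rubric; the point is only that each branch keys off one binding constraint.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    coordinated_models: int = 1               # models that must cooperate on one request
    gpu_cost_dominates: bool = False
    can_absorb_config_overhead: bool = False
    low_qps_cpu_only: bool = False
    has_mlops_owner: bool = True


def choose_framework(w: Workload) -> str:
    """Return the framework suggested by the first constraint that applies."""
    if w.gpu_cost_dominates and w.can_absorb_config_overhead:
        return "Triton"
    if w.coordinated_models > 1:
        return "Ray Serve"
    if w.low_qps_cpu_only and not w.has_mlops_owner:
        return "FastAPI"
    return "BentoML"                          # the default, general-purpose case


dashboard_model = choose_framework(Workload(low_qps_cpu_only=True, has_mlops_owner=False))
rag_pipeline = choose_framework(Workload(coordinated_models=3))
gpu_fleet = choose_framework(Workload(gpu_cost_dominates=True, can_absorb_config_overhead=True))
mixed_python_stack = choose_framework(Workload())
```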

Don't pick on framework fashion. Pick on the constraint that's actually binding: dev velocity (BentoML), pipeline complexity (Ray Serve), GPU economics (Triton), or "we just need a Python endpoint" (FastAPI). And whichever you pick, instrument it from day one: model server p99 latency, batch size distribution, and prediction throughput are the three metrics that have saved me on more incidents than I can count. Pair that telemetry with the kind of hyperparameter discipline covered in our Optuna and Bayesian optimization guide, and with uncertainty quantification via conformal prediction with MAPIE, so the model you ship is actually worth serving.

Frequently Asked Questions

What is the best framework for serving ML models in production in 2026?

For most teams, BentoML 1.3 is the best general-purpose choice. It gives you model packaging, adaptive batching, and a clean container story without locking you into a particular orchestrator. Choose Ray Serve for multi-model pipelines, Triton for GPU-bound workloads, and FastAPI for simple low-QPS endpoints.

Can FastAPI handle GPU inference at scale?

FastAPI can run GPU inference, but it doesn't help you saturate the GPU. Without dynamic batching, GPU utilization stays low and cost per prediction is 5 to 10x higher than a batching-aware server like Triton or BentoML. For production GPU workloads above 100 QPS, FastAPI is the wrong layer.

Is TorchServe still maintained in 2026?

TorchServe is in maintenance mode. The PyTorch team now recommends vLLM for LLM serving and Triton (with the PyTorch backend) for general PyTorch model serving. Existing TorchServe deployments still work, but new projects should not start there.

BentoML vs Ray Serve: which is better?

BentoML is better for single-model services where packaging and reproducibility matter most. Ray Serve is better for compound systems where multiple models must coordinate and scale independently. If you're serving one model, BentoML has lower operational overhead; if you're orchestrating a pipeline, Ray Serve's deployment graph is worth the extra complexity.

How do you measure model serving latency correctly?

Measure end-to-end p50, p95, and p99 latency at the load balancer, not inside the server process. Server-side metrics miss queueing, TLS handshake, and network time. Use a synthetic client (k6 or Locust) running outside your cluster and a histogram-based recorder. Averages hide tail latency, which is what users actually feel.
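A tiny synthetic example makes the "averages hide tail latency" point concrete. The latency samples below are invented (97 fast requests and 3 slow ones), and percentile is a simple nearest-rank helper written for this sketch; a real load test would feed in the client-side histogram instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic client-side latencies in milliseconds: 97 fast requests, 3 stalls
latencies_ms = [12] * 97 + [400] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

Here the mean sits around 24ms and p50 and p95 both read 12ms, while p99 is 400ms. A dashboard that only plots the average would show a healthy service even though three in every hundred requests are stalling.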

Arjun Krishnamurthy
About the Author Arjun Krishnamurthy

ML engineer focused on getting models out of notebooks and into production. Has war stories about every serving framework.