AI Inference Benchmark 2026: H100 vs A100 vs B200 vs Hosted APIs

Compare 22 inference hosts in 2026 — tokens/sec, latency, dollars per million tokens. Groq, Cerebras, SambaNova, Together, Fireworks, self-host on H100/B200.

7 min read · By AITOT Editorial

AI inference performance in 2026 spans an 8× spread on the same model. Llama 4 70B runs at 580 tokens per second on SambaNova versus 70 tokens per second on DeepInfra's hosted endpoint — identical model weights, completely different hardware underneath. This guide benchmarks 22 inference providers across speed (tokens/sec), latency (TTFT), and cost (dollars per million tokens), and explains when the fast-but-pricey providers are actually worth it. For real-time math across all 22 hosts and your specific token volumes, use our AI Inference Benchmark calculator.

The "Fastest ≠ Cheapest" rule applies sharply here: Groq and Cerebras are nearly always fastest but often not cheapest. SambaNova sometimes manages both. Hyperscaler self-hosting is rarely either. Choosing right depends entirely on what you optimize for.

How fast does Llama 4 70B actually run in 2026?

Output tokens per second at batch=1 streaming decode, sorted fastest first:

Host                        Tokens/sec   TTFT     Cost/1M out
SambaNova                   580          110ms    $0.60
Cerebras                    450          120ms    $0.85
Groq                        320          180ms    $0.79
B200 ×4 self-host           165          220ms    $2.10
Fireworks                   110          290ms    $0.90
Together                    92           320ms    $0.88
Self-host H100 ×4 (vLLM)    85           380ms    $1.95
DeepInfra                   70           410ms    $0.60

Three clusters are visible. Specialized silicon (SambaNova, Cerebras, Groq) at 300–580 tok/sec. B200 at ~165 tok/sec — twice an H100 cluster. NVIDIA GPUs at scale (Together, Fireworks, DeepInfra, self-host) at 70–110 tok/sec.
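
To translate those two columns into what a user actually experiences, time-to-last-token is roughly TTFT plus output length divided by decode speed. A minimal Python sketch using the table's figures; the 300-token reply length is an illustrative assumption:

  def time_to_last_token(ttft_ms, tok_per_sec, output_tokens):
      # Rough perceived completion time in seconds: wait for first token + decode time.
      return ttft_ms / 1000 + output_tokens / tok_per_sec

  # TTFT and tok/sec come from the table above; the 300-token reply is an assumption.
  for host, ttft_ms, tps in [("SambaNova", 110, 580), ("Together", 320, 92)]:
      print(f"{host}: {time_to_last_token(ttft_ms, tps, 300):.2f}s for a 300-token reply")
  # SambaNova: ~0.63s, Together: ~3.58s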

The specialized silicon vendors are a recent shift. As recently as 2024 they were research curiosities; in 2026 they're production-grade enough that companies use them in user-facing chat products.

Which inference provider should you use in 2026?

Decision tree by priority:

  • Lowest latency for chat UX (sub-200ms TTFT, 300+ tok/sec) — Groq, Cerebras, or SambaNova. Pay the premium when user perception of speed matters.
  • Cheapest at any speed — DeepInfra ($0.60/M output) or self-hosted Llama on rented GPUs at under $1/M amortized. Use for batch inference, summarization, or offline workloads.
  • Best balance of speed and cost — SambaNova is the 2026 standout, fast and tied for cheapest. Together and Fireworks are reliable middle-of-the-pack alternatives.
  • Highest-quality model output (Llama 4 405B or DeepSeek V3) — Fireworks or Together. Specialized providers don't host these yet.
  • Predictable enterprise pricing — Together's reserved capacity or AWS Bedrock. Higher base rates but no surprise scaling.
  • Self-host for control — vLLM on H100 SXM or B200 cluster. Justified only above 500M tokens/month or when data residency is a hard requirement.

A common 2026 pattern is multi-host routing: use Groq or SambaNova for user-facing chat (where every 100ms matters), and Together or Fireworks for back-end batch jobs (where cost matters more than latency). Tools like OpenRouter and Helicone make this practical.
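
A minimal Python sketch of that routing rule, with provider names taken from this guide; the function is deliberately generic rather than tied to OpenRouter's or any host's actual SDK:

  def pick_provider(latency_sensitive, needs_405b_class=False):
      # Routing rule from the decision tree above; provider names are illustrative.
      if needs_405b_class:
          return "fireworks"   # specialized silicon doesn't host 405B-class models yet
      if latency_sensitive:
          return "groq"        # or "sambanova": the sub-200ms TTFT tier
      return "together"        # cheaper tier for back-end batch jobs

  # A user-facing chat turn routes to the fast tier; an overnight summarization job does not.
  assert pick_provider(latency_sensitive=True) == "groq"
  assert pick_provider(latency_sensitive=False) == "together"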

What is the formula for dollars-per-million-tokens?

The headline metric:

$/M_output = host_pricing_per_1M_output_tokens

effective_$/M = $/M_output + (input_tokens/output_tokens) × $/M_input

monthly_cost = effective_$/M × output_tokens_per_month / 1,000,000

The "effective dollars per million" calculation matters because input-token cost is often half or less of output cost. For chat workloads (typical input/output 70/30), effective rate is dominated by output. For RAG workloads (typical 95/5), effective rate is dominated by input — and providers like Groq with input at $0.59/M look much better than DeepInfra at $0.39/M output but $0.59/M input.

A worked example for 1,000 input + 500 output tokens per request, 100k requests/month:

Groq (Llama 4 70B):
  100k × 1000 × $0.59 / 1M = $59 input
  100k × 500  × $0.79 / 1M = $39.5 output
  Monthly: $99

Self-hosted on H100 ×4 ($2.99/h × 4 = $11.96/h):
  Throughput at 85 tok/sec output × 80% utilization = 68 tok/sec sustained
  68 tok/sec × 86,400 sec/day × 30 = 176M tok/month
  Workload output: 100k × 500 = 50M output tok/month — 28% utilization
  GPU cost: $11.96 × 24 × 30 = $8,611/month at 100% on
  Effective at ~28% utilization: $8,611 ÷ 4 ≈ $2,153 if you can scale the cluster down to a quarter of always-on hours
  Monthly: $2,153 (mostly stranded capacity)

This is why self-hosting at moderate volume is bad. The H100 cluster idles 72% of the time but costs the same. Hosted APIs charge only for what you use.
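
The same comparison as a short Python sketch; the rates and volumes are the ones used in the two examples above, and nothing here is a provider API:

  def hosted_monthly(requests, in_tok, out_tok, in_rate_per_m, out_rate_per_m):
      # Pay-per-token bill on a hosted API.
      return requests * (in_tok * in_rate_per_m + out_tok * out_rate_per_m) / 1_000_000

  def self_host_monthly(hourly_rate, scale_down_fraction=1.0):
      # Always-on cluster bill, optionally scaled to a fraction of the month's hours.
      return hourly_rate * 24 * 30 * scale_down_fraction

  print(hosted_monthly(100_000, 1_000, 500, 0.59, 0.79))     # ~$98.5 on Groq
  print(self_host_monthly(11.96))                            # ~$8,611 always-on H100 ×4
  print(self_host_monthly(11.96, scale_down_fraction=0.25))  # ~$2,153 if idle hours can be shed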

What hidden costs come with inference?

Five line items that catch most teams off-guard:

  • TTFT inflation on long context. Sending a 32k-token RAG context adds 1–3 seconds to TTFT on most hosts. Groq and Cerebras handle this better; Together and Fireworks scale worse.
  • Rate limits. Most hosted APIs cap at 5–20 requests/second per account. Spiky traffic gets throttled. Plan for a queue or upgrade to dedicated capacity.
  • Cold starts. The first request after 5+ minutes of idle is 3–8× slower. Production apps need keep-alive pings or paid "always-warm" tier.
  • Speculative decoding overhead. Some providers (Anthropic, OpenAI) charge for speculatively-decoded tokens even when rejected. Adds 5–15% to bill.
  • Failed requests don't always refund. Half-completed streams from network drops still bill for completed tokens. Build retry logic that doesn't double-bill.

For complete cost forecasting that captures inference plus the surrounding infrastructure, use our Agent Dev Cost Calculator. For inference-only comparison across 22 hosts, use the Inference Benchmark calculator.

When should I run inference on H100 vs B200 vs A100 in 2026?

GPU choice for self-hosted inference:

  • H100 SXM5 — sweet spot 2026 for most 7B–70B model serving. Mature vLLM/SGLang support, good FP8 inference, ~85 tok/sec on Llama 4 70B batch=1.
  • B200 — wins for sustained high-volume inference. 2× the throughput of H100 SXM at 1.6× the rental cost works out to roughly 20% cheaper per million tokens (see the sketch after this list). Worth it if you're running at >50% utilization.
  • A100 80GB — only worth it for 7B fine-tunes and embedding generation. For 70B+ inference, H100 PCIe at similar price wins on speed.
  • H100 PCIe — 35% cheaper than SXM5 with 80% the inference throughput. Best ROI for inference workloads that don't need NVLink.
  • L40S — surprisingly competitive for sub-7B inference and embedding work. Half the VRAM but 70% the throughput.
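
The B200-versus-H100 claim in the list reduces to a cost-per-token ratio. A quick sketch with normalized numbers; the 1.6× and 2× multipliers are the assumptions stated above, not quoted prices:

  def cost_per_million(hourly_rate, tok_per_sec):
      # Dollars per million output tokens for a GPU rented by the hour.
      return hourly_rate / (tok_per_sec * 3600) * 1_000_000

  h100 = cost_per_million(1.0, 1.0)   # normalized baseline
  b200 = cost_per_million(1.6, 2.0)   # 1.6× the rent, 2× the throughput
  print(b200 / h100)                  # 0.8 -> about 20% cheaper per token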

For pricing by GPU type across 12 cloud providers, see our GPU Pricing Calculator.

How does throughput scale with batch size?

The batch=1 numbers in this guide are streaming-decode (chat UX). Production back-ends can batch requests for 5–20× higher throughput:

Batch size       Llama 4 70B on H100 SXM ×4   Effective $/M output
1 (streaming)    85 tok/sec                   $1.95
8                580 tok/sec                  $0.29
32               1,800 tok/sec                $0.094
64 (max)         2,400 tok/sec                $0.071

So a back-end batch pipeline gets nearly 7× the tokens per dollar of a streaming chat pipeline at batch=8, and over 25× at full batch, on the same hardware. This is why providers like Together and Fireworks offer separate "batch" endpoints at lower rates — they're batching your requests with others to amortize.
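
The right-hand column is just the streaming rate scaled by the throughput gain; a quick Python check against the table (small differences are rounding in the source figures):

  def batched_cost_per_million(streaming_cost, streaming_tps, batched_tps):
      # Same hardware and same hourly bill, so $/M falls in proportion to throughput gained.
      return streaming_cost * streaming_tps / batched_tps

  # Streaming baseline from the table: 85 tok/sec at $1.95/M on H100 SXM ×4.
  for batch, tps in [(8, 580), (32, 1_800), (64, 2_400)]:
      print(batch, round(batched_cost_per_million(1.95, 85, tps), 3))
  # 8 -> ~$0.286, 32 -> ~$0.092, 64 -> ~$0.069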

If your application can tolerate higher latency (anything over 5 seconds is fine), use batch endpoints. Together's batch API saves roughly 50% versus interactive pricing; Fireworks offers similar savings.

What's coming next for inference in 2026?

Three trends to watch:

  1. B200 supply normalization. By Q3 2026 expect B200 prices to drop 30–40% as supply catches up to demand. The premium over H100 will compress.
  2. GB300 cluster availability. The 1 kW Blackwell Ultra GPUs are starting to ship in late 2026. Expect inference-per-watt improvements of 2–3× over current B200.
  3. Specialized chip competition. AMD MI400, Trainium 3, Tenstorrent are all positioning for inference market share. Competition will pressure even the niche specialty silicon (Groq, Cerebras) on price.

We refresh inference benchmark data the first of every month with verified numbers from each provider's reported pricing and our own runs. For broader infrastructure planning, the GPU Pricing Calculator covers the hardware side and the Token & Pricing Comparator covers proprietary model pricing.

The "Fastest ≠ Cheapest" rule isn't going away. SambaNova currently breaks it for Llama 4 70B — expect at least one provider to consistently break it on every popular model by end of 2026.