AI Inference Benchmark 2026: H100 vs A100 vs B200 vs Hosted APIs

Compare 22 inference hosts in 2026 — tokens/sec, latency, dollars per million tokens. Groq, Cerebras, SambaNova, Together, Fireworks, self-host on H100/B200.

7 min read · By AITOT Editorial

AI inference performance in 2026 spans an 8× spread on the same model. Llama 4 70B runs at 580 tokens per second on SambaNova versus 70 tokens per second on DeepInfra's hosted endpoint — identical model weights, completely different hardware underneath. This guide benchmarks 22 inference providers across speed (tokens/sec), latency (TTFT), and cost (dollars per million tokens), and explains when the fast-but-pricey providers are actually worth it. For real-time math across all 22 hosts and your specific token volumes, use our AI Inference Benchmark calculator.

The "Fastest ≠ Cheapest" rule applies sharply here: Groq and Cerebras are nearly always fastest but often not cheapest. SambaNova sometimes manages both. Hyperscaler self-hosting is rarely either. Choosing right depends entirely on what you optimize for.

How fast does Llama 4 70B actually run in 2026?

Output tokens per second at batch=1 streaming decode, sorted fastest first:

Host                        Tokens/sec   TTFT     Cost/1M out
SambaNova                   580          110ms    $0.60
Cerebras                    450          120ms    $0.85
Groq                        320          180ms    $0.79
B200 ×4 self-host           165          220ms    $2.10
Fireworks                   110          290ms    $0.90
Together                    92           320ms    $0.88
Self-host H100 ×4 (vLLM)    85           380ms    $1.95
DeepInfra                   70           410ms    $0.60

Three clusters are visible. Specialized silicon (SambaNova, Cerebras, Groq) at 300–580 tok/sec. B200 at ~165 tok/sec — twice an H100 cluster. NVIDIA GPUs at scale (Together, Fireworks, DeepInfra, self-host) at 70–110 tok/sec.
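
To translate those two columns into what a user actually experiences, time-to-last-token is roughly TTFT plus output length divided by decode speed. A minimal Python sketch using the table's figures; the 300-token reply length is an illustrative assumption:

  def time_to_last_token(ttft_ms, tok_per_sec, output_tokens):
      # Rough perceived completion time in seconds: wait for first token + decode time.
      return ttft_ms / 1000 + output_tokens / tok_per_sec

  # TTFT and tok/sec come from the table above; the 300-token reply is an assumption.
  for host, ttft_ms, tps in [("SambaNova", 110, 580), ("Together", 320, 92)]:
      print(f"{host}: {time_to_last_token(ttft_ms, tps, 300):.2f}s for a 300-token reply")
  # SambaNova: ~0.63s, Together: ~3.58s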

The specialized silicon vendors are a recent shift. As recently as 2024 they were research curiosities; in 2026 they're production-grade enough that companies use them in user-facing chat products.

Which inference provider should you use in 2026?

Decision tree by priority:

  • Lowest latency for chat UX (sub-200ms TTFT, 300+ tok/sec) — Groq, Cerebras, or SambaNova. Pay the premium when user perception of speed matters.
  • Cheapest at any speed — DeepInfra ($0.60/M output) or self-hosted Llama on rented GPUs at under $1/M amortized. Use for batch inference, summarization, or offline workloads.
  • Best balance of speed and cost — SambaNova is the 2026 standout, fast and tied for cheapest. Together and Fireworks are reliable middle-of-the-pack alternatives.
  • Highest-quality model output (Llama 4 405B or DeepSeek V3) — Fireworks or Together. Specialized providers don't host these yet.
  • Predictable enterprise pricing — Together's reserved capacity or AWS Bedrock. Higher base rates but no surprise scaling.
  • Self-host for control — vLLM on H100 SXM or B200 cluster. Justified only above 500M tokens/month or when data residency is a hard requirement.

A common 2026 pattern is multi-host routing: use Groq or SambaNova for user-facing chat (where every 100ms matters), and Together or Fireworks for back-end batch jobs (where cost matters more than latency). Tools like OpenRouter and Helicone make this practical.
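
A minimal Python sketch of that routing rule, with provider names taken from this guide; the function is deliberately generic rather than tied to OpenRouter's or any host's actual SDK:

  def pick_provider(latency_sensitive, needs_405b_class=False):
      # Routing rule from the decision tree above; provider names are illustrative.
      if needs_405b_class:
          return "fireworks"   # specialized silicon doesn't host 405B-class models yet
      if latency_sensitive:
          return "groq"        # or "sambanova": the sub-200ms TTFT tier
      return "together"        # cheaper tier for back-end batch jobs

  # A user-facing chat turn routes to the fast tier; an overnight summarization job does not.
  assert pick_provider(latency_sensitive=True) == "groq"
  assert pick_provider(latency_sensitive=False) == "together"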

What is the formula for dollars-per-million-tokens?

The headline metric:

$/M_output = host_pricing_per_1M_output_tokens

effective_$/M = $/M_output + (input_tokens/output_tokens) × $/M_input

monthly_cost = effective_$/M × output_tokens_per_month / 1,000,000

The "effective dollars per million" calculation matters because input-token cost is often half or less of output cost. For chat workloads (typical input/output 70/30), effective rate is dominated by output. For RAG workloads (typical 95/5), effective rate is dominated by input — and providers like Groq with input at $0.59/M look much better than DeepInfra at $0.39/M output but $0.59/M input.

A worked example for 1,000 input + 500 output tokens per request, 100k requests/month:

Groq (Llama 4 70B):
  100k × 1000 × $0.59 / 1M = $59 input
  100k × 500  × $0.79 / 1M = $39.5 output
  Monthly: $99

Self-hosted on H100 ×4 ($2.99/h × 4 = $11.96/h):
  Throughput at 85 tok/sec output × 80% utilization = 68 tok/sec sustained
  68 tok/sec × 86,400 sec/day × 30 = 176M tok/month
  Workload output: 100k × 500 = 50M output tok/month — 28% utilization
  GPU cost: $11.96 × 24 × 30 = $8,611/month at 100% on
  Effective at ~28% utilization: $8,611 ÷ 4 ≈ $2,153 if you can scale the cluster down to a quarter of always-on hours
  Monthly: $2,153 (mostly stranded capacity)

This is why self-hosting at moderate volume is bad. The H100 cluster idles 72% of the time but costs the same. Hosted APIs charge only for what you use.
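
The same comparison as a short Python sketch; the rates and volumes are the ones used in the two examples above, and nothing here is a provider API:

  def hosted_monthly(requests, in_tok, out_tok, in_rate_per_m, out_rate_per_m):
      # Pay-per-token bill on a hosted API.
      return requests * (in_tok * in_rate_per_m + out_tok * out_rate_per_m) / 1_000_000

  def self_host_monthly(hourly_rate, scale_down_fraction=1.0):
      # Always-on cluster bill, optionally scaled to a fraction of the month's hours.
      return hourly_rate * 24 * 30 * scale_down_fraction

  print(hosted_monthly(100_000, 1_000, 500, 0.59, 0.79))     # ~$98.5 on Groq
  print(self_host_monthly(11.96))                            # ~$8,611 always-on H100 ×4
  print(self_host_monthly(11.96, scale_down_fraction=0.25))  # ~$2,153 if idle hours can be shed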

What hidden costs come with inference?

Five line items that catch most teams off-guard:

  • TTFT inflation on long context. Sending a 32k-token RAG context adds 1–3 seconds to TTFT on most hosts. Groq and Cerebras handle this better; Together and Fireworks scale worse.
  • Rate limits. Most hosted APIs cap at 5–20 requests/second per account. Spiky traffic gets throttled. Plan for a queue or upgrade to dedicated capacity.
  • Cold starts. The first request after 5+ minutes of idle is 3–8× slower. Production apps need keep-alive pings or paid "always-warm" tier.
  • Speculative decoding overhead. Some providers (Anthropic, OpenAI) charge for speculatively-decoded tokens even when rejected. Adds 5–15% to bill.
  • Failed requests don't always refund. Half-completed streams from network drops still bill for completed tokens. Build retry logic that doesn't double-bill.

For complete cost forecasting that captures inference plus the surrounding infrastructure, use our Agent Dev Cost Calculator. For inference-only comparison across 22 hosts, use the Inference Benchmark calculator.

When should I run inference on H100 vs B200 vs A100 in 2026?

GPU choice for self-hosted inference:

  • H100 SXM5 — sweet spot 2026 for most 7B–70B model serving. Mature vLLM/SGLang support, good FP8 inference, ~85 tok/sec on Llama 4 70B batch=1.
  • B200 — wins for sustained high-volume inference. 2× the throughput of H100 SXM at 1.6× the rental cost works out to roughly 20% cheaper per million tokens (see the sketch after this list). Worth it if you're running at >50% utilization.
  • A100 80GB — only worth it for 7B fine-tunes and embedding generation. For 70B+ inference, H100 PCIe at similar price wins on speed.
  • H100 PCIe — 35% cheaper than SXM5 with 80% the inference throughput. Best ROI for inference workloads that don't need NVLink.
  • L40S — surprisingly competitive for sub-7B inference and embedding work. Half the VRAM but 70% the throughput.
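
The B200-versus-H100 claim in the list reduces to a cost-per-token ratio. A quick sketch with normalized numbers; the 1.6× and 2× multipliers are the assumptions stated above, not quoted prices:

  def cost_per_million(hourly_rate, tok_per_sec):
      # Dollars per million output tokens for a GPU rented by the hour.
      return hourly_rate / (tok_per_sec * 3600) * 1_000_000

  h100 = cost_per_million(1.0, 1.0)   # normalized baseline
  b200 = cost_per_million(1.6, 2.0)   # 1.6× the rent, 2× the throughput
  print(b200 / h100)                  # 0.8 -> about 20% cheaper per token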

For pricing by GPU type across 12 cloud providers, see our GPU Pricing Calculator.

How does throughput scale with batch size?

The batch=1 numbers in this guide are streaming-decode (chat UX). Production back-ends can batch requests for 5–20× higher throughput:

Batch size       Llama 4 70B on H100 SXM ×4   Effective $/M output
1 (streaming)    85 tok/sec                   $1.95
8                580 tok/sec                  $0.29
32               1,800 tok/sec                $0.094
64 (max)         2,400 tok/sec                $0.071

So a back-end batch pipeline gets nearly 7× the tokens per dollar of a streaming chat pipeline at batch=8, and over 25× at full batch, on the same hardware. This is why providers like Together and Fireworks offer separate "batch" endpoints at lower rates — they're batching your requests with others to amortize.
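
The right-hand column is just the streaming rate scaled by the throughput gain; a quick Python check against the table (small differences are rounding in the source figures):

  def batched_cost_per_million(streaming_cost, streaming_tps, batched_tps):
      # Same hardware and same hourly bill, so $/M falls in proportion to throughput gained.
      return streaming_cost * streaming_tps / batched_tps

  # Streaming baseline from the table: 85 tok/sec at $1.95/M on H100 SXM ×4.
  for batch, tps in [(8, 580), (32, 1_800), (64, 2_400)]:
      print(batch, round(batched_cost_per_million(1.95, 85, tps), 3))
  # 8 -> ~$0.286, 32 -> ~$0.092, 64 -> ~$0.069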

If your application can tolerate higher latency (anything over 5 seconds is fine), use batch endpoints. Together's batch API saves roughly 50% versus interactive pricing; Fireworks offers similar savings.

What's coming next for inference in 2026?

Three trends to watch:

  1. B200 supply normalization. By Q3 2026 expect B200 prices to drop 30–40% as supply catches up to demand. The premium over H100 will compress.
  2. GB300 cluster availability. The 1 kW Blackwell Ultra GPUs are starting to ship in late 2026. Expect inference-per-watt improvements of 2–3× over current B200.
  3. Specialized chip competition. AMD MI400, Trainium 3, Tenstorrent are all positioning for inference market share. Competition will pressure even the niche specialty silicon (Groq, Cerebras) on price.

We refresh inference benchmark data the first of every month with verified numbers from each provider's reported pricing and our own runs. For broader infrastructure planning, the GPU Pricing Calculator covers the hardware side and the Token & Pricing Comparator covers proprietary model pricing.

The "Fastest ≠ Cheapest" rule isn't going away. SambaNova currently breaks it for Llama 4 70B — expect at least one provider to consistently break it on every popular model by end of 2026.