Inference is the serving stage of a model’s life: the weights are already trained, and you are now running forward passes to produce predictions, completions, embeddings or images for real users. That changes the hardware math compared to training. There are no backward passes, no optimizer states, and no gradient buffers to hold in memory, so the per-request memory footprint is dominated by the model weights plus the key-value cache (for transformer/LLM workloads) rather than by training scratch space. The practical consequence is that you can usually serve a model on far less GPU than it took to train it, and the smart rental decision is about fitting the model and its working set into VRAM without paying for compute you will never saturate.

When you read the comparison above, weigh each option against the three things inference is genuinely sensitive to:

VRAM capacity: the model weights have to live in GPU memory. A useful rule of thumb is roughly 2 bytes per parameter at FP16/BF16, so a 7B model needs about 14 GB for weights alone, before the KV cache; quantizing to INT8 or FP8 roughly halves that, and 4-bit quantization halves it again. The KV cache then grows with context length and the number of concurrent sequences, so long-context or high-concurrency serving needs noticeably more headroom than the weight figure alone suggests.
Memory bandwidth: at the small batch sizes typical of interactive serving, token generation is bound by memory bandwidth rather than compute. Each decoding step reads the full set of weights from memory, so cards with high-bandwidth memory (HBM) deliver more tokens per second than GDDR-based cards of similar raw FLOPS. If latency per token matters to you, bandwidth is the number to chase.
Low-precision support: modern inference leans heavily on INT8, FP8 and 4-bit quantization. Hardware with native tensor-core support for these formats serves more requests per dollar, because quantized weights are smaller (less bandwidth, less VRAM) and execute faster.
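
The VRAM arithmetic above can be sketched in a few lines. This is a rough estimate, not a sizing tool: the weight figure follows the 2-bytes-per-parameter rule from the text, and the KV-cache formula is the standard one for a transformer (one key and one value per layer, KV head and token). The model shape used here (32 layers, 32 KV heads, head dimension 128) is an assumed, illustrative 7B-class configuration, and real serving stacks add runtime overhead on top.

```python
GB = 1e9  # decimal gigabytes, matching how VRAM is usually quoted


def weight_bytes(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in bytes."""
    return num_params * bits_per_param / 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, sequences: int,
                   bytes_per_value: int = 2) -> float:
    """KV-cache size in bytes for `sequences` concurrent requests.

    The leading factor 2 covers one key and one value vector
    per layer, per KV head, per token.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * sequences


# Weights for a 7B model at different precisions.
weights_fp16_gb = weight_bytes(7e9, 16) / GB  # 2 bytes per parameter
weights_int8_gb = weight_bytes(7e9, 8) / GB
weights_int4_gb = weight_bytes(7e9, 4) / GB

# Assumed (hypothetical) 7B-class shape, FP16 cache,
# 8 concurrent sequences of 4,096 tokens each.
kv_gb = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                       context_tokens=4096, sequences=8) / GB

print(f"weights FP16: {weights_fp16_gb:.1f} GB")
print(f"weights INT8: {weights_int8_gb:.1f} GB")
print(f"weights 4-bit: {weights_int4_gb:.1f} GB")
print(f"KV cache: {kv_gb:.1f} GB")
print(f"total at FP16: {weights_fp16_gb + kv_gb:.1f} GB")
```

Under these assumptions the KV cache (about 17 GB) is larger than the FP16 weights (14 GB), which is why concurrency and context length belong in the estimate from the start.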

Batch vs real-time inference, and how it shapes your pick

Not all inference is the same workload, and the right rental depends on which side you sit on.

Real-time / interactive serving (chatbots, copilots, live APIs) cares about time-to-first-token and inter-token latency. Here you want fast single-stream performance and enough VRAM to keep the model resident permanently so you are not reloading weights. You typically run on-demand instances that stay up, because cold starts and interruptions are visible to end users.
Batch / offline inference (bulk embeddings, document classification, nightly scoring, dataset labeling) cares about throughput per dollar, not latency. Large batch sizes keep the GPU busy and amortize the weight reads across many requests. This is the workload where interruptible or spot-style capacity shines, because a job that checkpoints its progress simply resumes after an eviction: you trade reliability for a materially lower effective rate.
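
To see why batching changes the picture, here is a minimal roofline-style sketch. It assumes each decoding step reads the weights once regardless of batch size and costs roughly 2 FLOPs per parameter per generated token; it ignores KV-cache reads, kernel overheads and scheduling, so the results are upper bounds, not benchmarks. The bandwidth and FP16 figures are the B200 values from the comparison table; the 7B FP16 model is the same illustrative example used earlier.

```python
def decode_step_seconds(weight_bytes: float, params: float, batch: int,
                        bandwidth: float, flops: float) -> float:
    """Lower bound on the time for one decoding step over a whole batch."""
    memory_time = weight_bytes / bandwidth    # weights are read once per step
    compute_time = 2 * params * batch / flops  # ~2 FLOPs per parameter per token
    return max(memory_time, compute_time)


def tokens_per_second(weight_bytes: float, params: float, batch: int,
                      bandwidth: float, flops: float) -> float:
    """Upper bound on aggregate tokens per second across the batch."""
    step = decode_step_seconds(weight_bytes, params, batch, bandwidth, flops)
    return batch / step


PARAMS = 7e9          # illustrative 7B model
WEIGHT_BYTES = 14e9   # FP16, 2 bytes per parameter
BANDWIDTH = 8000e9    # 8,000 GB/s, from the comparison table
FLOPS = 2250e12       # 2,250 TFLOPS FP16 (Tensor), from the comparison table

for batch in (1, 64, 512):
    rate = tokens_per_second(WEIGHT_BYTES, PARAMS, batch, BANDWIDTH, FLOPS)
    print(f"batch {batch:>3}: up to {rate:,.0f} tokens/s")
```

In this model throughput scales almost linearly with batch size until the compute term overtakes the memory term, after which extra batching stops helping. That crossover is why batch jobs chase throughput per dollar while single-stream serving chases bandwidth.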

A lot of inference also fits comfortably on a single GPU. Unless the model is too large for one card’s VRAM, multi-GPU and fast interconnect (NVLink and the like) matter less for serving than they do for training. When a model genuinely will not fit, tensor parallelism across GPUs becomes necessary, and at that point interconnect bandwidth re-enters the conversation. For most sub-30B models served at moderate concurrency, though, a single well-chosen accelerator is the cost-effective answer.

Provider features that matter more for inference than for training

Inference is bursty and long-lived in a way training is not, so the surrounding platform features carry real weight:

Billing granularity — per-second or per-minute billing rewards spiky traffic; coarse hourly minimums punish workloads that scale up and down through the day.
Autoscaling and serverless options — if traffic is uneven, the ability to scale replicas (or pay only while a request runs) often beats renting a fixed instance 24/7.
Cold-start behavior — for scale-to-zero setups, how fast a GPU spins up and loads weights directly affects user-facing latency.
Egress fees and storage — image and video inference outputs, plus model artifacts, can make data-transfer and persistent-storage pricing a meaningful share of the bill. Check these in the fine print, not just the headline GPU rate.
Availability and interruptibility — confirm whether the listed capacity is on-demand or spot, and whether your serving stack tolerates eviction.
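
The effect of billing granularity is easy to check with a small calculation. The sketch below assumes each burst is a separate instance launch that is rounded up to the provider's billing increment; the burst durations are hypothetical, and the $2.00/hr rate is taken from the comparison table purely as an example.

```python
import math


def billed_cost(run_seconds: list[float], rate_per_hour: float,
                increment_seconds: int) -> float:
    """Total cost when every run is rounded up to the billing increment."""
    total = 0.0
    for seconds in run_seconds:
        billed = math.ceil(seconds / increment_seconds) * increment_seconds
        total += billed * rate_per_hour / 3600
    return total


# Hypothetical bursty day: four short runs, durations in seconds.
bursts = [90, 400, 45, 1200]
rate = 2.00  # $/hr, example rate from the comparison table

per_second = billed_cost(bursts, rate, increment_seconds=1)
per_minute = billed_cost(bursts, rate, increment_seconds=60)
per_hour = billed_cost(bursts, rate, increment_seconds=3600)

print(f"per-second billing: ${per_second:.2f}")
print(f"per-minute billing: ${per_minute:.2f}")
print(f"hourly minimum:     ${per_hour:.2f}")
```

With these assumed bursts, less than half an hour of actual GPU time is billed as four full hours under an hourly minimum, while per-second and per-minute billing stay close to the time actually used.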

How to read the comparison above for an inference workload

Start from your model, not from the cheapest row. Work out the VRAM you need (weights at your chosen precision, plus realistic KV-cache headroom for your context length and concurrency), then filter the list to options that clear that bar with margin. Among those, prefer higher memory bandwidth if latency matters and broad low-precision support so you can quantize. Only then sort by price.

Estimate VRAM for your model at the precision you will actually serve (FP16, FP8, INT8 or 4-bit).
Add KV-cache headroom for your longest context and expected concurrency.
Decide real-time vs batch — this tells you whether to value latency/uptime or throughput/spot pricing.
Check billing granularity, autoscaling, egress and storage against your traffic shape.
Compare live rates in the table above, since pricing moves and varies by region and commitment.
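
The filter-then-sort order described above can be sketched as follows. The three entries use the VRAM, bandwidth and on-demand figures from the comparison table; the `GpuOption` type, the `shortlist` helper and the 10% safety margin are illustrative assumptions, not part of any provider's API. Options without a listed on-demand rate are skipped, and bandwidth is used only to break price ties.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuOption:
    name: str
    vram_gb: float
    bandwidth_gb_s: float
    price_per_hour: Optional[float]  # None when no on-demand rate is listed


def shortlist(options: list[GpuOption], required_gb: float,
              margin: float = 0.10) -> list[GpuOption]:
    """Keep options that fit the model with headroom, then sort by price."""
    needed = required_gb * (1 + margin)
    fits = [o for o in options
            if o.vram_gb >= needed and o.price_per_hour is not None]
    # Cheapest first; higher bandwidth wins when prices are equal.
    return sorted(fits, key=lambda o: (o.price_per_hour, -o.bandwidth_gb_s))


# Figures copied from the comparison table.
options = [
    GpuOption("MI350X", 288, 8000, None),
    GpuOption("MI325X", 256, 6000, 2.00),
    GpuOption("B200", 192, 8000, 1.99),
]

# A working set of about 31 GB fits every priced option.
small = shortlist(options, required_gb=31.2)
# A 200 GB working set needs 220 GB with margin, which rules out the B200.
large = shortlist(options, required_gb=200)

print([o.name for o in small])
print([o.name for o in large])
```

Filtering on VRAM before sorting matters here: the cheapest row stops being a candidate as soon as the working set outgrows its memory.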

The goal is the smallest, cheapest card that holds your model and its working set comfortably while hitting your latency or throughput target — over-provisioning VRAM you never touch is the most common way inference budgets get wasted.

Frequently asked questions

Do I need as much GPU memory for inference as for training?

Usually no. Training has to hold gradients, optimizer states and activation buffers on top of the weights, which inflates memory needs several times over. Inference mainly needs room for the weights plus the key-value cache, so you can often serve a model on a fraction of the VRAM it took to train — and quantizing to INT8, FP8 or 4-bit shrinks the requirement further.

Are spot or interruptible GPUs safe to use for inference?

It depends on the workload. For offline batch inference that can checkpoint and resume, interruptible capacity is an excellent way to cut cost. For live, user-facing serving, eviction means dropped requests and cold-start latency, so on-demand instances are the safer default unless you have built redundancy and fast failover across replicas.

Should I rent a multi-GPU instance to serve a model?

Only if the model will not fit in a single GPU’s VRAM at your chosen precision. Many models under roughly 30B parameters serve well on one card, especially when quantized. When you do need multiple GPUs, fast interconnect such as NVLink becomes important because tensor parallelism moves data between cards on every forward pass.

What makes one GPU faster than another at generating tokens?

At small batch sizes, token generation is bound by memory bandwidth, because each decoding step reads the entire weight set from GPU memory. Cards with HBM generate tokens faster than GDDR-based cards of comparable raw compute. Native support for low-precision formats also helps, since quantized weights are smaller and move through memory more quickly.

MI350X vs MI325X vs B200: top picks from this guide

	MI350X CDNA 4 · 288 GB	MI325X CDNA 3 · 256 GB	B200 Blackwell · 192 GB
Specifications
Manufacturer	AMD	AMD	NVIDIA
Architecture	CDNA 4	CDNA 3	Blackwell
VRAM	288 GB HBM3e	256 GB HBM3e	192 GB HBM3e
Memory bandwidth	8,000 GB/s	6,000 GB/s	8,000 GB/s
FP16 (Tensor)	1,800 TFLOPS	1,307 TFLOPS	2,250 TFLOPS
FP32	72 TFLOPS	163.4 TFLOPS	75 TFLOPS
TDP	1000 W	1000 W	1000 W
Release year	2025	2024	2024
Segment	Data center	Data center	Data center
Cloud pricing
Cheapest on-demand	n/a	$2.00/hr	$1.99/hr
Providers	1	2	2

Best Cloud GPUs for AI Inference, June 2026

What AI inference actually demands from a rented GPU