This list isolates cloud GPUs that deliver at least 200 teraFLOPS of half-precision (FP16) compute. FP16 TFLOPS is one of the most honest single numbers for ranking GPUs by AI throughput, because nearly all modern training and inference runs in 16-bit precision (FP16 or BF16) rather than full FP32. A higher FP16 rate translates fairly directly into faster matrix multiplies, which is what transformer training and inference spend most of their time doing.

A 200 TFLOPS floor is a meaningful dividing line. It sits comfortably above older or entry-tier accelerators and lands you in the territory of data-center-class cards with dedicated tensor/matrix engines. One important caveat to read carefully when comparing the entries above: vendors quote FP16 numbers in three different ways.

Tensor-core FP16 with FP16 accumulate, often quoted “with sparsity,” produces the largest headline figures.
Dense tensor-core FP16 (no sparsity) is the number that matters for most real training workloads, and it is typically about half the sparse figure.
Vector/CUDA-core FP16, without tensor cores, is far lower and rarely the basis for these comparisons.

When a card is listed at 200+ here, it is almost always on the strength of its tensor-core engine. That is the right thing to filter on, because if your framework is not using tensor cores you are leaving the bulk of the silicon idle regardless of the headline number.
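To compare quotes on a common basis, the halving rule above can be applied before checking the 200 TFLOPS floor. The sketch below is illustrative only: the helper names and the exact 0.5 ratio are assumptions, and the input figure is made up rather than taken from a datasheet.

```python
# Assumption: sparse figures are about 2x dense, as described above.
# Real datasheets may differ; check the footnotes for each part.
SPARSE_TO_DENSE = 0.5


def dense_fp16_tflops(quoted_tflops: float, quoted_with_sparsity: bool) -> float:
    """Approximate dense tensor-core FP16 TFLOPS from a vendor quote."""
    if quoted_with_sparsity:
        return quoted_tflops * SPARSE_TO_DENSE
    return quoted_tflops


def clears_floor(quoted_tflops: float, quoted_with_sparsity: bool,
                 floor_tflops: float = 200.0) -> bool:
    """True if the dense-equivalent figure meets the floor."""
    return dense_fp16_tflops(quoted_tflops, quoted_with_sparsity) >= floor_tflops


# Illustrative quote, not taken from any datasheet.
print(clears_floor(330.0, quoted_with_sparsity=True))   # False: about 165 dense
print(clears_floor(330.0, quoted_with_sparsity=False))  # True
```

The same headline number passes or fails the floor depending on how it was quoted, which is why the quoting convention is worth checking first.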

What hardware typically clears the 200 TFLOPS bar

Crossing 200 dense FP16 TFLOPS generally requires a current or recent-generation data-center GPU with mature tensor cores. Cards in this class tend to share a recognizable profile:

High-bandwidth memory rather than commodity GDDR. Expect HBM2e or HBM3-class memory on the server parts, which keeps the tensor engines fed. Compute this high is bandwidth-hungry, and a card that can hit 200 TFLOPS on paper but is starved for memory bandwidth will underperform on real models.
Large VRAM pools, frequently in the tens of gigabytes, which is what makes these cards usable for large-model training and inference rather than just fast on small ones.
Lower-precision support beyond FP16 — BF16 for more stable training dynamics, and on newer generations FP8 and INT8 for inference, which can multiply effective throughput well past the FP16 figure.
Fast interconnect, such as NVLink between GPUs in a node, so you can scale a job across multiple cards without the PCIe bus becoming the bottleneck.

High-end consumer and prosumer cards can also clear 200 TFLOPS on tensor-core FP16, but they typically pair it with GDDR memory, smaller VRAM, and no high-speed inter-GPU link. That makes them excellent for single-GPU work and price-sensitive jobs, but weaker for multi-GPU scaling. The comparison above mixes both kinds, so look past the raw TFLOPS to memory and interconnect before you choose.
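One way to make "starved for memory bandwidth" concrete is the roofline model, which caps achievable throughput at the lower of the compute peak and the bandwidth ceiling. The sketch below assumes that simple model and uses the FP16 and bandwidth figures listed for B300 in this guide's comparison table; the function names and the 100 FLOP/byte kernel are hypothetical.

```python
def ridge_point_flop_per_byte(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a card is compute-bound."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def attainable_tflops(peak_tflops: float, bandwidth_gb_s: float,
                      kernel_flop_per_byte: float) -> float:
    """Roofline bound: the lower of the compute peak and the bandwidth ceiling."""
    bandwidth_ceiling_tflops = kernel_flop_per_byte * bandwidth_gb_s * 1e9 / 1e12
    return min(peak_tflops, bandwidth_ceiling_tflops)


# B300 as listed in this guide: 2,250 TFLOPS FP16, 8,000 GB/s.
print(ridge_point_flop_per_byte(2250, 8000))  # 281.25 FLOP per byte

# A hypothetical kernel doing 100 FLOP per byte moved is bandwidth-limited.
print(attainable_tflops(2250, 8000, 100))     # 800.0 TFLOPS, well under peak
```

Kernels whose arithmetic intensity falls below the ridge point cannot reach the headline figure no matter how well they are tuned, so two cards with the same TFLOPS but different bandwidth can behave very differently.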

Workloads this tier fits well

Fine-tuning and LoRA on mid-to-large language and diffusion models, where 200+ FP16 TFLOPS cuts iteration time substantially.
High-throughput batch inference, where you can keep the GPU saturated and the extra FLOPS turn directly into higher tokens-per-second or images-per-second.
Single-node training of small-to-medium models, or serving as one slice of a larger distributed run.
Mixed AI plus rendering or HPC pipelines that benefit from both the tensor engines and strong general FP performance.
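For the batch-inference case, a rough compute-bound ceiling on tokens per second can be sketched from the FP16 rate. This assumes the common approximation of about 2 FLOPs per parameter per generated token, ignores attention cost and memory bandwidth, and treats the utilization value as a guess; the function name and example numbers are hypothetical.

```python
def max_tokens_per_second(peak_tflops: float, n_params: float,
                          utilization: float) -> float:
    """Compute-bound upper limit on decode throughput across a whole batch."""
    flops_per_token = 2.0 * n_params  # approximation: forward pass only
    return peak_tflops * 1e12 * utilization / flops_per_token


# Hypothetical 10B-parameter model on a 200 TFLOPS card at 50% utilization.
print(max_tokens_per_second(200, 10e9, 0.5))  # 5000.0
```

Real throughput is usually lower, and at small batch sizes it is set by memory bandwidth rather than FLOPS, so treat this as a ceiling for saturated batch jobs only.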

Where 200 TFLOPS is overkill or the wrong filter

Low-volume real-time inference on small models, where latency and per-second billing matter more than peak FLOPS and even a cheaper card would sit idle most of the time.
VRAM-bound jobs, where a model simply does not fit. Here you should be filtering on memory capacity first; raw compute will not rescue a model that spills to host memory.
Frontier-scale training, where 200 TFLOPS per card is fine but the real constraints become interconnect bandwidth, multi-node networking, and total cluster size rather than a single GPU’s FP16 rate.
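For the VRAM-bound case, a quick capacity check is worth running before looking at TFLOPS at all. The sketch below counts weights only, at 2 bytes per parameter for FP16 or BF16; it leaves out activations, KV cache, and optimizer state, and the `usable_fraction` headroom is an assumed placeholder rather than a measured value.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1, "int8": 1}


def weights_gb(n_params: float, dtype: str = "fp16") -> float:
    """Size of the model weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def weights_fit(n_params: float, vram_gb: float, dtype: str = "fp16",
                usable_fraction: float = 0.8) -> bool:
    """True if the weights fit in the assumed usable share of VRAM."""
    return weights_gb(n_params, dtype) <= vram_gb * usable_fraction


# Hypothetical 70B-parameter model in FP16.
print(weights_gb(70e9))        # 140.0 GB of weights
print(weights_fit(70e9, 288))  # True
print(weights_fit(70e9, 80))   # False
```

Training needs several times more memory than this weights-only figure, so a model that only just fits here will not fit for fine-tuning without sharding or offload.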

How this tier behaves on the rental market

Cards that clear 200 FP16 TFLOPS sit in the upper-middle to premium band of cloud pricing. They are the workhorses of AI infrastructure, so they are usually well stocked on the major platforms, but the newest and highest-end models in this group are the ones most prone to scarcity and waitlists during demand spikes. A few practical points to weigh against the live figures above:

On-demand versus spot/interruptible: spot pricing on this tier can be dramatically cheaper, which suits checkpointable training and batch inference that tolerates restarts. Latency-sensitive serving usually wants on-demand or reserved capacity.
Per-second versus per-hour billing changes the economics for short, bursty jobs more than the headline rate does.
Effective price-per-TFLOP is the number to optimize, not the absolute hourly rate. A slightly pricier card with much higher utilization on your workload can be cheaper per unit of work.
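The last two points can be put into numbers. The sketch below is a minimal illustration with made-up rates and utilization values; the function names are hypothetical and no provider's actual billing rules are implied.

```python
import math


def cost_per_effective_tflop_hour(hourly_rate: float, peak_tflops: float,
                                  utilization: float) -> float:
    """Hourly price divided by the TFLOPS the workload actually achieves."""
    return hourly_rate / (peak_tflops * utilization)


def billed_cost(runtime_seconds: float, hourly_rate: float,
                billing_increment_seconds: int) -> float:
    """Cost when runtime is rounded up to whole billing increments."""
    increments = math.ceil(runtime_seconds / billing_increment_seconds)
    return increments * billing_increment_seconds * hourly_rate / 3600


# Two hypothetical 250 TFLOPS cards: the pricier one is better utilized.
cheap = cost_per_effective_tflop_hour(2.00, 250, 0.30)   # about 0.027
pricey = cost_per_effective_tflop_hour(2.50, 250, 0.50)  # about 0.020
print(pricey < cheap)  # True

# A 10-minute job at a hypothetical $3.00/hour rate.
print(billed_cost(600, 3.00, 1))     # per-second billing: about 0.50
print(billed_cost(600, 3.00, 3600))  # per-hour billing: about 3.00
```

In this example the card with the higher hourly rate is cheaper per unit of work, and the billing increment changes the cost of a short job by a factor of six.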

Because prices move constantly and differ by region and provider, treat the table above as the source of truth for current rates and use the points here to interpret them.

Frequently asked questions

Is 200 FP16 TFLOPS enough to train a large language model?

For full from-scratch training of a very large model, a single 200 TFLOPS GPU is only one piece — you would distribute the job across many of them, and networking and total VRAM become the real limits. For fine-tuning, LoRA, and training small-to-medium models, a card at this tier is a strong, practical choice.
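A back-of-envelope estimate shows why one card is only one piece. The sketch below assumes the widely used approximation that training costs about 6 FLOPs per parameter per token, plus perfect scaling across GPUs; the model size, token count, and 40% utilization figure are hypothetical inputs, not recommendations.

```python
def training_days(n_params: float, n_tokens: float, n_gpus: int,
                  peak_tflops: float, mfu: float) -> float:
    """Estimated wall-clock days, assuming ~6 * params * tokens total FLOPs."""
    total_flops = 6.0 * n_params * n_tokens
    cluster_flops_per_second = n_gpus * peak_tflops * 1e12 * mfu
    return total_flops / cluster_flops_per_second / 86_400


# Hypothetical run: 7B parameters, 1 trillion tokens, 200 TFLOPS cards at 40% MFU.
print(round(training_days(7e9, 1e12, 1, 200, 0.4)))       # about 6076 days on one GPU
print(round(training_days(7e9, 1e12, 512, 200, 0.4), 1))  # about 11.9 days on 512 GPUs
```

The multi-GPU figure is optimistic because it ignores communication overhead, which is exactly where interconnect and networking start to dominate.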

Does the 200 figure include sparsity?

It depends on how each vendor quoted it. Sparse tensor-core figures are roughly double the dense figures. For real-world training that does not exploit structured sparsity, focus on the dense FP16 (or BF16) number, since that reflects the throughput you will actually see.

Should I pick by FP16 TFLOPS or by VRAM?

Filter by VRAM first if your model or batch size is large enough to risk not fitting in memory, because no amount of compute compensates for a model that does not fit. Once capacity is satisfied, use the 200+ TFLOPS floor to rank for speed and cost-efficiency.
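The filter-then-rank order can be sketched as follows. The three entries use the VRAM and FP16 figures from this guide's comparison table; the `Card` type and `shortlist` function are hypothetical, and ranking is by TFLOPS alone because the table lists no on-demand prices for these cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    vram_gb: int
    fp16_tflops: int


# Figures as listed in this guide's comparison table.
CARDS = [
    Card("B300", 288, 2250),
    Card("MI350X", 288, 1800),
    Card("GB200 Superchip", 384, 4500),
]


def shortlist(cards, min_vram_gb: int, floor_tflops: int = 200):
    """Drop cards that fail the capacity or compute floor, then rank by FP16 rate."""
    fits = [c for c in cards
            if c.vram_gb >= min_vram_gb and c.fp16_tflops >= floor_tflops]
    return sorted(fits, key=lambda c: c.fp16_tflops, reverse=True)


print([c.name for c in shortlist(CARDS, min_vram_gb=200)])
print([c.name for c in shortlist(CARDS, min_vram_gb=300)])
```

With live prices available, the sort key could be swapped for a cost-per-TFLOP figure while keeping the capacity filter first.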

Will FP8 or INT8 give me more than the FP16 rating?

On newer generations that support them, yes — FP8 and INT8 tensor throughput is typically well above the FP16 figure and is very effective for inference. If your serving stack supports lower precision, a card’s FP8/INT8 capability may matter more than its FP16 number, so check that column on the cards above.

GB200 Superchip vs B300 vs MI350X: top picks from this guide

	GB200 Superchip Blackwell · 384 GB	B300 Blackwell Ultra · 288 GB	MI350X CDNA 4 · 288 GB
Specifications
Manufacturer	NVIDIA	NVIDIA	AMD
Architecture	Blackwell	Blackwell Ultra	CDNA 4
VRAM	384 GB HBM3e	288 GB HBM3e	288 GB HBM3e
Memory Bandwidth	16,000 GB/s	8,000 GB/s	8,000 GB/s
FP16 (Tensor)	4,500 TFLOPS	2,250 TFLOPS	1,800 TFLOPS
FP32	150 TFLOPS	75 TFLOPS	72 TFLOPS
TDP	2,700 W	1,400 W	1,000 W
Release Year	2024	2025	2025
Segment	Data center	Data center	Data center
Cloud Pricing
Cheapest On-Demand	n/a	n/a	n/a
Providers	0	1	1

Best Cloud GPUs with Over 200 FP16 TFLOPS (June 2026)

What “200 FP16 TFLOPS” actually filters for