This list is filtered to cloud GPUs that deliver at least 100 TFLOPS of FP16 (half-precision) throughput. FP16 (16-bit floating point) is the workhorse precision for most AI training and inference because it halves memory traffic versus FP32 while keeping enough numeric range for most neural networks. A 100 TFLOPS floor is a meaningful dividing line: it filters out older or entry-level accelerators and leaves you with cards that have real, modern tensor cores (or matrix engines) rather than only general-purpose shader cores doing the math.

One important caveat about reading any FP16 number: vendors quote it in two very different ways. The raw vector FP16 rate on the CUDA/stream cores is usually modest, while the tensor-core FP16 rate is several times higher and is what people mean when they say a card does “hundreds of TFLOPS.” Marketing figures sometimes also assume structured sparsity, which can double the headline number. When comparing the entries above, make sure you are comparing dense tensor-core FP16 against dense tensor-core FP16, not a dense figure on one card against a sparse figure on another.
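
As a minimal sketch of that normalization, the snippet below reduces a quoted figure to a dense tensor-core rate before comparing cards. The `QuotedFp16` record, the two example cards, and their numbers are hypothetical, and the 2x sparsity factor is the usual marketing assumption rather than a guarantee for every vendor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotedFp16:
    """A vendor-quoted FP16 figure plus the basis it was quoted on."""
    tflops: float
    tensor_core: bool  # True if the figure is a tensor-core (matrix) rate
    sparse: bool       # True if the figure assumes structured sparsity


def dense_tensor_tflops(q: QuotedFp16, sparsity_factor: float = 2.0) -> Optional[float]:
    """Return the dense tensor-core FP16 rate, or None if it cannot be derived.

    A vector-only figure says nothing about the tensor-core rate, so it is
    reported as None rather than guessed.
    """
    if not q.tensor_core:
        return None
    return q.tflops / sparsity_factor if q.sparse else q.tflops


# Hypothetical spec-sheet entries, for illustration only.
card_a = QuotedFp16(tflops=400.0, tensor_core=True, sparse=True)
card_b = QuotedFp16(tflops=250.0, tensor_core=True, sparse=False)

print(dense_tensor_tflops(card_a))  # 200.0
print(dense_tensor_tflops(card_b))  # 250.0
```

On a like-for-like basis the card with the smaller headline number is the faster one here, which is exactly the trap a mixed dense/sparse comparison sets.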

What hardware lands above this threshold

Crossing 100 dense FP16 TFLOPS on tensor cores generally requires a GPU from the last several hardware generations with dedicated matrix hardware. In practice the qualifying pool tends to include:

Data-center training/inference cards with HBM memory (HBM2e or HBM3-class), wide memory bandwidth measured in terabytes per second, and high VRAM capacities that suit large models and big batches.
High-end workstation and prosumer cards built on recent architectures with GDDR6/GDDR6X memory, which often clear 100 FP16 TFLOPS on their tensor cores even though their memory bandwidth and VRAM are lower than HBM parts.
Newer-generation accelerators that also support FP8 and INT8 on their matrix hardware, pushing far past 100 TFLOPS at those lower precisions, which is useful headroom if your stack can exploit them.

The key insight is that 100 FP16 TFLOPS is a compute bar, not a memory bar. Two cards can both clear it while differing enormously in VRAM, memory bandwidth, and interconnect. That difference usually matters more for your real-world throughput than the raw TFLOPS figure does.

Why compute-bound vs memory-bound changes everything

Whether the FP16 rating helps you depends on which resource your workload is starved for:

Training and fine-tuning large transformers is frequently memory-capacity and bandwidth bound. Here, two cards at the same 100+ TFLOPS can perform very differently because the one with more HBM and higher bandwidth keeps its tensor cores fed instead of stalling.
Dense matrix multiply / GEMM-heavy work (large batch inference, some scientific kernels) is genuinely compute bound, so the FP16 number tracks real speed more closely.
Real-time, low-batch inference often cannot saturate 100 TFLOPS at all — latency, not throughput, dominates, and you may be paying for compute you never use.
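
One common way to reason about these three cases is a roofline estimate: attainable throughput is the lower of the compute peak and memory bandwidth multiplied by the workload's arithmetic intensity (FLOPs per byte moved). The sketch below assumes two hypothetical cards with the same 200 TFLOPS peak but different bandwidth; the function names and all numbers are illustrative, not measurements.

```python
def ridge_point(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a card is compute bound."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)


def attainable_tflops(peak_tflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline estimate: min(compute peak, bandwidth * arithmetic intensity)."""
    memory_roof_tflops = bandwidth_gbs * 1e9 * intensity / 1e12
    return min(peak_tflops, memory_roof_tflops)


# Two hypothetical cards with identical peak FP16 but different memory systems.
gddr_card = {"peak_tflops": 200.0, "bandwidth_gbs": 1000.0}
hbm_card = {"peak_tflops": 200.0, "bandwidth_gbs": 3000.0}

# Illustrative intensities: low-batch inference, bandwidth-heavy training, large GEMM.
for intensity in (1.0, 50.0, 500.0):
    a = attainable_tflops(intensity=intensity, **gddr_card)
    b = attainable_tflops(intensity=intensity, **hbm_card)
    print(f"{intensity:6.1f} FLOP/byte -> GDDR {a:6.1f} TFLOPS, HBM {b:6.1f} TFLOPS")
```

At low intensity both cards sit far below their rated peak and the higher-bandwidth card is three times faster; only at high intensity do the two converge on the 200 TFLOPS figure.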

Choosing among the qualifying GPUs

Once the list is narrowed to 100+ FP16 TFLOPS cards, the FP16 figure stops being the deciding factor. Weigh these dimensions against the comparison above:

VRAM capacity: decides whether your model, optimizer states, and activations fit at all. A card that clears the TFLOPS bar but runs out of memory (OOMs) on your model is useless.
Memory bandwidth and type: HBM parts dramatically outperform GDDR parts on bandwidth-bound jobs, which covers most LLM work, even at equal TFLOPS.
Interconnect: NVLink or a comparable high-speed fabric matters the moment you scale past one GPU for distributed training; PCIe-only links become the bottleneck for tensor/pipeline parallelism.
Supported precisions: FP8 tensor rates are typically about twice the FP16 rate, so a card with FP8 support can deliver well beyond its FP16 figure in effective throughput, but only if your framework and kernels support the format. BF16 usually runs at the same tensor rate as FP16 and is chosen for its wider numeric range rather than for extra speed.
Power and thermal class: higher-TFLOPS data-center parts draw more power and are typically provisioned in smaller quantities, which feeds into availability and pricing.
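
For the VRAM check in particular, a rough estimate is often enough to rule a card in or out. The sketch below assumes mixed-precision training with Adam at roughly 16 bytes per parameter (FP16 weights and gradients plus FP32 master weights and two optimizer moments), which is a common rule of thumb rather than an exact figure; activations are deliberately excluded, and the function names and 10% headroom are illustrative choices.

```python
def training_memory_gb(params_billion: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + Adam optimizer states in GB, excluding activations."""
    return params_billion * 1e9 * bytes_per_param / 1e9


def fits(vram_gb: float, needed_gb: float, headroom: float = 0.9) -> bool:
    """True if the estimate fits while leaving some VRAM for activations and buffers."""
    return needed_gb <= vram_gb * headroom


needed = training_memory_gb(7)  # a 7B-parameter model, as an example
print(f"estimated model state: {needed:.0f} GB")
print("fits on a 288 GB card:", fits(288, needed))
print("fits on a 24 GB card: ", fits(24, needed))
```

A card that fails this check needs sharding across several GPUs or a memory-saving technique, whatever its TFLOPS rating.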

On the rental economics: cards comfortably above 100 FP16 TFLOPS span a wide cost spectrum. The HBM-equipped data-center accelerators sit at the premium end and are the most likely to be scarce or require spot/interruptible queues during demand spikes, while recent GDDR-based workstation cards can clear the same compute bar at materially lower hourly cost — with the tradeoff of less VRAM and bandwidth. Use the live prices in the table above rather than any fixed figure, because rates move with supply, region, and on-demand versus spot availability.
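
To compare offers on cost rather than on hourly rate alone, it can help to price a fixed amount of work at the utilization each card is expected to sustain. The sketch below is only an illustration: the prices, TFLOPS ratings, and utilization fractions are hypothetical placeholders, and `job_cost_usd` is a name made up for this example. Substitute live prices and your own measured utilization.

```python
def job_cost_usd(total_pflop: float, dense_tflops: float,
                 utilization: float, hourly_usd: float) -> float:
    """Cost of a fixed amount of work at a sustained fraction of peak throughput."""
    sustained_flops_per_s = dense_tflops * 1e12 * utilization
    hours = total_pflop * 1e15 / sustained_flops_per_s / 3600
    return hours * hourly_usd


WORK_PFLOP = 360_000  # hypothetical job size (3.6e20 floating-point operations)

# Hypothetical offers: an HBM data-center card and a cheaper GDDR card.
hbm_cost = job_cost_usd(WORK_PFLOP, dense_tflops=400, utilization=0.50, hourly_usd=4.00)
gddr_cost = job_cost_usd(WORK_PFLOP, dense_tflops=200, utilization=0.25, hourly_usd=0.80)

print(f"HBM card:  ${hbm_cost:,.0f}")
print(f"GDDR card: ${gddr_cost:,.0f}")
```

The result is very sensitive to the utilization you assume, so the cheaper hourly rate wins only when the job actually runs well on the smaller memory system.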

When this threshold is the right filter — and when it isn’t

Filtering at 100 FP16 TFLOPS is a sensible default if you are doing modern AI work and simply want to exclude underpowered hardware. It is the wrong primary filter if your real constraint is VRAM (filter on memory first) or if you run latency-sensitive inference that never approaches that throughput (you may be overpaying). For multi-GPU training, treat 100 TFLOPS as a minimum and let interconnect and per-GPU memory drive the final choice.
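
The ordering described here (memory first, compute floor second) can be sketched as a two-stage filter. The VRAM and FP16 values below are the ones listed in this guide's comparison table; the `shortlist` function and its dictionary layout are illustrative assumptions.

```python
CARDS = [
    {"name": "GB200 Superchip", "vram_gb": 384, "fp16_tflops": 4500},
    {"name": "B300", "vram_gb": 288, "fp16_tflops": 2250},
    {"name": "MI350X", "vram_gb": 288, "fp16_tflops": 1800},
]


def shortlist(cards, min_vram_gb, min_fp16_tflops=100):
    """Filter on VRAM first, then apply the FP16 floor as a minimum."""
    enough_memory = [c for c in cards if c["vram_gb"] >= min_vram_gb]
    enough_compute = [c for c in enough_memory if c["fp16_tflops"] >= min_fp16_tflops]
    # Rank by memory, then by compute, largest first.
    ranked = sorted(enough_compute,
                    key=lambda c: (c["vram_gb"], c["fp16_tflops"]),
                    reverse=True)
    return [c["name"] for c in ranked]


print(shortlist(CARDS, min_vram_gb=300))  # ['GB200 Superchip']
print(shortlist(CARDS, min_vram_gb=200))  # ['GB200 Superchip', 'B300', 'MI350X']
```

For multi-GPU training, an interconnect field could be added as a third filter stage in the same way.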

Frequently asked questions

Is 100 FP16 TFLOPS enough for training large language models?

It is a reasonable floor for the compute side, but training large models is usually limited by VRAM and memory bandwidth, not raw FP16 throughput. Past this threshold, prioritize cards with large HBM capacity, high bandwidth, and fast interconnect for multi-GPU scaling.

Does the 100 TFLOPS figure include sparsity?

It can, depending on the vendor. Headline FP16 numbers sometimes assume structured sparsity, which roughly doubles the figure, and some quote tensor-core rates while others quote vector rates. Compare dense tensor-core FP16 against dense tensor-core FP16 so the entries above are measured on the same basis.

Will a cheaper GDDR card that clears 100 TFLOPS match an HBM data-center card?

Only on genuinely compute-bound work. For most LLM and large-batch jobs, the HBM card’s far higher bandwidth and larger VRAM let it sustain throughput that the GDDR card cannot, even when both report similar FP16 TFLOPS. The GDDR card wins mainly on price and for smaller models that fit its memory.

Why not just pick the highest FP16 TFLOPS in the list?

Because the top FP16 number rarely translates into proportional real-world speed. If your job is memory-bound or latency-bound, extra TFLOPS go unused while you pay a premium. Match VRAM, bandwidth, and interconnect to your workload first, and treat 100+ TFLOPS as a satisfied minimum.

GB200 Superchip vs B300 vs MI350X: this guide's top recommendations

	GB200 Superchip (Blackwell, 384 GB)	B300 (Blackwell Ultra, 288 GB)	MI350X (CDNA 4, 288 GB)
Specifications
Manufacturer	NVIDIA	NVIDIA	AMD
Architecture	Blackwell	Blackwell Ultra	CDNA 4
VRAM	384 GB HBM3e	288 GB HBM3e	288 GB HBM3e
Bandwidth	16,000 GB/s	8,000 GB/s	8,000 GB/s
FP16 (tensor)	4,500 TFLOPS	2,250 TFLOPS	1,800 TFLOPS
FP32	150 TFLOPS	75 TFLOPS	72 TFLOPS
TDP	2,700 W	1,400 W	1,000 W
Release year	2024	2025	2025
Segment	Data center	Data center	Data center
Cloud pricing
Cheapest on-demand	n/a	n/a	n/a
Providers	0	1	1

Best Cloud GPUs with 100+ FP16 TFLOPS (June 2026)

What “100 FP16 TFLOPS” actually filters for