Quantifying the Processing-in-Memory compute gap for LLM inference under DRAM process constraints across nodes, power budgets, model sizes, and precision formats.

| Model | Arch | GOPS | Latency (ms) | Sufficiency | GOPS/W |
|---|---|---|---|---|---|
| 7B | PIM | 7.1 | 684 | 0.073 | 5.06 |
| 7B | PNM | 200 | 35.7 | 1.399 | 50.6 |
| 7B | GPU | 623K | 2.3 | 21.86 | 1558 |
| 70B | PIM | 7.0 | 6513 | 0.008 | 5.06 |
| 70B | PNM | 200 | 315 | 0.159 | 50.6 |
| 70B | GPU | 624K | 20.2 | 2.477 | 1561 |
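The sufficiency column reads as the ratio of a target per-token latency to the achieved latency (equivalently, available over required throughput). The rows are consistent with a target of roughly 50 ms/token (20 tokens/s); that target is an assumption inferred from the data, not stated in the source. A minimal sketch under that assumption:

```python
# Sketch: sufficiency as (target per-token latency) / (achieved latency).
# The 50 ms/token target (20 tokens/s) is an assumption inferred from
# the table rows above, not a figure given in the source.
TARGET_LATENCY_MS = 50.0

def sufficiency(latency_ms: float) -> float:
    """Fraction of the required compute that the architecture delivers."""
    return TARGET_LATENCY_MS / latency_ms

# 7B rows from the table: (architecture, per-token latency in ms)
for arch, lat in [("PIM", 684), ("PNM", 35.7), ("GPU", 2.3)]:
    print(f"{arch}: sufficiency ≈ {sufficiency(lat):.3f}")
```

The same relation reproduces the 70B rows (e.g. 50 / 6513 ≈ 0.008 for PIM), up to rounding in the table.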

PIM sufficiency by power budget and model size:

| Power (W) | 7B | 13B | 30B | 70B |
|---|---|---|---|---|
| 0.5 | 0.018 | 0.010 | 0.004 | 0.002 |
| 1.0 | 0.037 | 0.019 | 0.008 | 0.004 |
| 2.0 | 0.073 | 0.038 | 0.016 | 0.008 |
| 3.0 | 0.110 | 0.058 | 0.023 | 0.012 |
| 5.0 | 0.184 | 0.096 | 0.039 | 0.020 |
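At a fixed efficiency (GOPS/W), sufficiency should scale linearly with the power budget, and the table bears this out: every row is the 2 W row scaled by power, within rounding. A quick check against the 7B column:

```python
# Verify that PIM sufficiency scales linearly with power budget,
# using the 7B column of the power table above.
table_7b = {0.5: 0.018, 1.0: 0.037, 2.0: 0.073, 3.0: 0.110, 5.0: 0.184}

# Sufficiency per watt, taken from the 2 W row (~0.0365/W).
per_watt = table_7b[2.0] / 2.0

for power, suff in table_7b.items():
    predicted = per_watt * power
    print(f"{power:.1f} W: table={suff:.3f}, linear model={predicted:.3f}")
```

The linear fit matches every row to within one unit in the last rounded digit.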

PIM sufficiency by precision format and model size:

| Precision | 7B | 13B | 30B | 70B |
|---|---|---|---|---|
| FP16 | 0.073 | 0.038 | 0.016 | 0.008 |
| INT8 | 0.074 | 0.038 | 0.016 | 0.008 |
| INT4 | 0.073 | 0.038 | 0.015 | 0.008 |
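Sufficiency is essentially flat across FP16/INT8/INT4, which is what one would expect if PIM throughput is bound by DRAM array bandwidth rather than by arithmetic format; that interpretation is mine, not stated in the source. Quantifying the spread from the table itself:

```python
# Relative spread of sufficiency across precision formats
# (FP16, INT8, INT4), per model size, from the precision table above.
table = {
    "7B":  [0.073, 0.074, 0.073],
    "13B": [0.038, 0.038, 0.038],
    "30B": [0.016, 0.016, 0.015],
    "70B": [0.008, 0.008, 0.008],
}

spreads = {}
for model, vals in table.items():
    spreads[model] = (max(vals) - min(vals)) / max(vals)
    print(f"{model}: spread {spreads[model]:.1%}")
```

The spread never exceeds a single unit in the last rounded digit, i.e. precision choice barely moves sufficiency at these power budgets.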