Evidence & peer review
Reviewed. Bruised.
Better for it.
StochastiCore went through four IEEE TVLSI reviewers. The revision found a real bug, retracted an energy claim, and rebuilt every number with one reproducible flow. This page is the whole ledger — what broke, what we conceded, and what survived scrutiny.
- ⚒A real RTL bug — the S2B output scaling halved every stochastic result; found by building a gate-level netlist simulator, fixed, and verified on the synthesized netlist.
- ⚒Irreproducible synthesis numbers — the old 4,770-cell figure could not be regenerated; one documented Yosys flow now reproduces every count, calibrated against a textbook binary MAC.
- ⚒NumPy shortcuts — all rendering, dithering, and fault experiments were moved onto the bit-accurate LFSR model, with streams shared across pixels as in real hardware.
- ⚒A rasterizer that didn't synthesize — six per-pixel variable divisions were replaced by a per-triangle reciprocal; now 8,623 LUT4, verified within ≤3 levels of exact division.
- ⚒Speculative energy claims — replaced by an activity-driven gate-level power model on the real netlist, which reversed the original conclusion.
- ✕SC is not a per-operation throughput win: 29×–116× longer per op at useful quality.
- ✕SC is not a per-operation energy win: 71 fJ per op for binary vs 17.6k–69k fJ per op for SC.
- ✕Fault tolerance is a compute-domain property only; storage and control need conventional protection.
- ✕High-performance rendering is explicitly not the target.
- ✓Area parity at higher functional density: a full SC compute unit (4 mul + MAC + tanh + lerp) in fewer cells than ONE binary MAC-4.
- ✓The one-gate marginal multiply — massive parallelism priced in single gates.
- ✓Graceful degradation above ~0.5% BER in the compute domain, with an analytic crossover that matches simulation.
- ✓Free anti-banding: the white-noise dithering theorem, validated in rendered scenes across resolutions.
- ✓A complete, open, reproducible GPU pipeline that fits a $25–55 FPGA — buildable by hand for $80–150.
The S2B story deserves a sentence: building the gate-level power simulator the reviewers demanded revealed a scaling bug that had been silently halving every stochastic result. The criticism made the design correct. That is what review is for.
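For readers new to stochastic computing, the one-gate multiply and the stream-to-binary (S2B) step can be sketched in a few lines of Python. This is an illustration, not the released RTL: the 8-bit LFSR taps, the seeds, the comparator-style generator, and every function name are assumptions. The `extra_shift` argument is a hypothetical stand-in for a halving error of the kind described above; the page does not say what the actual defect looked like.

```python
def lfsr8(seed):
    """Yield states of an 8-bit Fibonacci LFSR (assumed taps 8,6,5,4; period 255)."""
    state = seed & 0xFF
    if state == 0:
        raise ValueError("LFSR seed must be nonzero")
    while True:
        yield state
        fb = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
        state = ((state << 1) | fb) & 0xFF


def sng(p, seed, n):
    """Stochastic number generator: emit 1 when the LFSR state is below a threshold."""
    threshold = round(p * 256)
    gen = lfsr8(seed)
    return [1 if next(gen) < threshold else 0 for _ in range(n)]


def sc_multiply(a_bits, b_bits):
    """Unipolar SC multiply: one AND gate applied bit by bit."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def s2b(bits, extra_shift=0):
    """Stream-to-binary: count the ones, then scale by the stream length.

    extra_shift=1 imitates a hypothetical scaling error that halves the count.
    """
    return (sum(bits) >> extra_shift) / len(bits)


N = 255  # one full LFSR period
a = sng(0.5, seed=0x01, n=N)
b = sng(0.5, seed=0xB7, n=N)  # same LFSR, different phase (assumed decorrelation scheme)
product = s2b(sc_multiply(a, b))
halved = s2b(sc_multiply(a, b), extra_shift=1)
print(f"0.5 x 0.5 decoded: {product:.4f}; with halving error: {halved:.4f}")
```

A check as simple as "0.5 times 0.5 should decode near 0.25" is enough to expose a factor-of-two scaling error, which is why end-to-end checks on the synthesized netlist are worth the effort.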
The numbers
Four measurements that define the niche.
A complete SC compute unit (4 multiply lanes + MAC + tanh + lerp) against 1,912 cells / 674 LUT4 for a single binary MAC-4. The whole SC cluster is 2,395 LUT4, which fits the smallest iCE40 HX8K-class FPGAs (~30% of the device).
SC costs ~250–970× more energy per operation at useful stream lengths. The earlier arithmetic-only model was wrong, and the revision says so. The deterministic generator claws back ~60% of SC energy — still not parity, and we don't pretend it is.
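The ratio is easy to re-derive from the per-operation figures quoted on this page. The sketch below reads 71 fJ as the binary figure and 17.6k–69k fJ as the SC range, which is the reading under which the ratios come out at ~250–970×. The last step assumes the ~60% saving applies uniformly to both ends of the SC range; that is an assumption made here for illustration, not a number from the paper.

```python
BINARY_FJ = 71.0                   # binary energy per op, as quoted on this page
SC_FJ_RANGE = (17_600.0, 69_000.0)  # SC energy per op at useful stream lengths


def energy_ratios(sc_range, binary=BINARY_FJ):
    """Return SC-to-binary energy ratios for the low and high ends of the range."""
    return tuple(sc / binary for sc in sc_range)


def after_savings(sc_range, saved_fraction=0.60):
    """Scale the SC range by an assumed uniform saving (hypothetical simplification)."""
    return tuple(sc * (1.0 - saved_fraction) for sc in sc_range)


ratios = energy_ratios(SC_FJ_RANGE)
reduced = energy_ratios(after_savings(SC_FJ_RANGE))
print(f"SC / binary energy: {ratios[0]:.0f}x to {ratios[1]:.0f}x")
print(f"with an assumed uniform 60% saving: {reduced[0]:.0f}x to {reduced[1]:.0f}x")
```

Even under that generous assumption the ratio stays far above 1, consistent with the "still not parity" statement.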
Site-resolved injection, N = 10,000 with 95% confidence intervals; the analytic crossover (0.6%, Eq. 6) matches simulation. Scoped honestly: the tolerance is a streaming-domain property. Storage and control logic are just as vulnerable to faults and are best protected by SECDED/TMR, which the revised paper says plainly.
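The shape of this result can be seen in a toy bit-flip experiment. The sketch below is not the paper's site-resolved injection and will not reproduce Eq. 6 or the 0.6% crossover: it flips bits independently in a random unipolar stream and in an 8-bit binary word, then reports mean absolute error with a normal-approximation 95% interval. The stream length, word width, test value, trial count, and all function names are assumptions.

```python
import random
import statistics


def flip(bits, ber, rng):
    """Flip each bit independently with probability `ber`."""
    return [b ^ (rng.random() < ber) for b in bits]


def sc_error(value, ber, stream_len, rng):
    """Absolute error of a random unipolar stream after bit flips."""
    stream = [int(rng.random() < value) for _ in range(stream_len)]
    noisy = flip(stream, ber, rng)
    return abs(sum(noisy) / stream_len - value)


def binary_error(value, ber, width, rng):
    """Absolute error of a fixed-point word after bit flips (value must be < 1)."""
    word = round(value * (1 << width))
    bits = [(word >> i) & 1 for i in range(width)]
    noisy = flip(bits, ber, rng)
    decoded = sum(bit << i for i, bit in enumerate(noisy))
    return abs(decoded - word) / (1 << width)


def mean_with_ci(samples):
    """Mean and 95% half-width (normal approximation)."""
    mean = statistics.fmean(samples)
    half = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half


def sweep(value=0.25, stream_len=256, width=8, trials=2000, seed=1):
    rng = random.Random(seed)
    results = {}
    for ber in (0.001, 0.01, 0.1):
        sc = mean_with_ci([sc_error(value, ber, stream_len, rng) for _ in range(trials)])
        bn = mean_with_ci([binary_error(value, ber, width, rng) for _ in range(trials)])
        results[ber] = (sc, bn)
    return results


results = sweep()
for ber, (sc, bn) in results.items():
    print(f"BER {ber:.3f}: SC {sc[0]:.4f} +/- {sc[1]:.4f}, binary {bn[0]:.4f} +/- {bn[1]:.4f}")
```

At low error rates the stream's own sampling noise dominates and binary is more accurate; at high error rates a flipped high-order bit in the binary word costs far more than any single stream bit, so the ordering reverses. Where the two curves cross depends on the stream length and encoding, which is why this toy model should not be compared against the paper's figure.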
Resolution-invariant (36.66–36.69 dB across 80×60 → 320×240); SC quantization noise is spectrally white (proved) — gradients render banding-free where binary low-precision pipelines band. Binary baseline: ~520 fps — about 8× faster, stated plainly.
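The banding argument can be illustrated on a one-dimensional gradient. In the sketch below a uniform random dither stands in for the LFSR-driven streams, and a lag-1 autocorrelation of the quantization error stands in for a full spectral analysis; both are simplifications chosen here, not the paper's proof or pipeline. The ramp width, level count, and function names are assumptions.

```python
import math
import random


def quantize_round(x, levels):
    """Deterministic rounding to the nearest of `levels` steps in [0, 1]."""
    return math.floor(x * levels + 0.5) / levels


def quantize_dither(x, levels, rng):
    """Stochastic rounding: add uniform noise before truncating."""
    return math.floor(x * levels + rng.random()) / levels


def lag1_autocorr(err):
    """Lag-1 autocorrelation of an error sequence (near 0 for white noise)."""
    num = sum(a * b for a, b in zip(err, err[1:]))
    den = sum(e * e for e in err)
    return num / den


rng = random.Random(7)
width, levels = 256, 7  # a 256-pixel ramp quantized to 3 bits
ramp = [i / (width - 1) for i in range(width)]

err_round = [quantize_round(x, levels) - x for x in ramp]
err_dither = [quantize_dither(x, levels, rng) - x for x in ramp]

rho_round = lag1_autocorr(err_round)
rho_dither = lag1_autocorr(err_dither)
print(f"lag-1 error autocorrelation: rounding {rho_round:.2f}, dithered {rho_dither:.2f}")
```

Rounding error on a ramp is a sawtooth, so neighbouring pixels share nearly the same error and the eye sees bands. Dithered error is close to uncorrelated from pixel to pixel, which trades visible steps for fine-grained noise.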
The claims ledger
What we claim. What we refuse to.
- ✓A complete, synthesizable, open-source stochastic-computing GPU pipeline — the first to integrate SC into full 3-D rendering (peer-reviewed at IEEE TVLSI revision stage).
- ✓Multiplication for one gate: each marginal multiply lane costs a single gate, and a full compute unit sits at area parity with a single binary MAC-4.
- ✓Graceful degradation under bit errors above ~0.5% BER in the compute domain — with theory matching simulation.
- ✓Banding-free low-precision gradients by physics (white-noise dithering), not by extra hardware.
- ✓A working physical GPU buildable from $80–150 of off-the-shelf parts with a fully open toolchain.
- ✕A throughput win — SC takes 29×–116× longer per operation at useful quality; a binary baseline renders ~8× more frames.
- ✕An energy win — per operation, SC costs hundreds of times more than binary arithmetic; we publish the gate-level numbers.
- ✕A gaming-class or datacenter GPU — the honest niche is minimum-silicon, reliability-critical, display-class and edge devices.
- ✕Measured silicon results — physical FPGA validation (timing closure, measured power) is in progress and clearly labelled as pending.
Sources: the revised IEEE TVLSI manuscript, the point-by-point response to four reviewers, and the released RTL + Yosys flow that regenerates every count.