Taraṅga

Evidence & peer review

Reviewed. Bruised.
Better for it.

StochastiCore went through four IEEE TVLSI reviewers. The revision found a real bug, retracted an energy claim, and rebuilt every number with one reproducible flow. This page is the whole ledger — what broke, what we conceded, and what survived scrutiny.

What review broke — and we fixed
  • A real RTL bug — the S2B output scaling halved every stochastic result; found by building a gate-level netlist simulator, fixed, and verified on the synthesized netlist.
  • Irreproducible synthesis numbers — the old 4,770-cell figure could not be regenerated; one documented Yosys flow now reproduces every count, calibrated against a textbook binary MAC.
  • NumPy shortcuts — all rendering, dithering, and fault experiments were moved onto the bit-accurate LFSR model, with streams shared across pixels as in real hardware.
  • A rasterizer that didn't synthesize — six per-pixel variable divisions were replaced by a per-triangle reciprocal; now 8,623 LUT4, verified within ≤3 levels of exact division.
  • Speculative energy claims — replaced by an activity-driven gate-level power model on the real netlist, which reversed the original conclusion.
What we conceded — in print
  • SC is not a per-operation throughput win: 29×–116× longer per op at useful quality.
  • SC is not a per-operation energy win: 71 fJ vs 17.6k–69k fJ per op.
  • Fault tolerance is a compute-domain property only; storage and control need conventional protection.
  • High-performance rendering is explicitly not the target.
What survived scrutiny
  • Area parity at higher functional density: a full SC compute unit (4 mul + MAC + tanh + lerp) in fewer cells than ONE binary MAC-4.
  • The one-gate marginal multiply — massive parallelism priced in single gates.
  • Graceful degradation above ~0.5% BER in the compute domain, with an analytic crossover that matches simulation.
  • Free anti-banding: the white-noise dithering theorem, validated in rendered scenes across resolutions.
  • A complete, open, reproducible GPU pipeline that fits a $25–55 FPGA — buildable by hand for $80–150.

The S2B story deserves a sentence: building the gate-level power simulator the reviewers demanded revealed a scaling bug that had been silently halving every stochastic result. The criticism made the design correct. That is what review is for.

The numbers

Four measurements that define the niche.

Area — the win
1,709 cells / 597 LUT4

A complete SC compute unit (4 multiply lanes + MAC + tanh + lerp) against 1,912 cells / 674 LUT4 for a single binary MAC-4. The whole SC cluster is 2,395 LUT4 — it smallest iCE40 HX8K-class FPGAs (~30% of the device).

Energy — the loss, printed large
71 fJ/op vs 17,600 fJ/op–69,000 fJ/op

SC costs ~250–970× more energy per operation at useful stream lengths. The earlier arithmetic-only model was wrong, and the revision says so. The deterministic generator claws back ~60% of SC energy — still not parity, and we don't pretend it is.

Faults — the crossover
graceful above ~0.5% BER

Site-resolved injection, N = 10,000 with 95% confidence intervals; the analytic crossover (0.6% (Eq. 6)) matches simulation. Scoped honestly: the tolerance is a streaming-domain property — storage and control faults are equally vulnerable and are best protected by SECDED/TMR, which the revised paper says plainly.

Rendering — quality and rate
36.7 dB at L=1024 · 68 fps at L=256

Resolution-invariant (36.66–36.69 dB across 80×60 → 320×240); SC quantization noise is spectrally white (proved) — gradients render banding-free where binary low-precision pipelines band. Binary baseline: ~520 fps — about 8× faster, stated plainly.

The claims ledger

What we claim. What we refuse to.

Claimed — supported by the record
  • A complete, synthesizable, open-source stochastic-computing GPU pipeline — the first to integrate SC into full 3-D rendering (peer-reviewed at IEEE TVLSI revision stage).
  • Multiplication for one gate, a full compute unit at area parity with a single binary MAC-4, and a marginal multiply lane for one gate.
  • Graceful degradation under bit errors above ~0.5% BER in the compute domain — with theory matching simulation.
  • Banding-free low-precision gradients by physics (white-noise dithering), not by extra hardware.
  • A working physical GPU buildable from $80–150 of off-the-shelf parts with a fully open toolchain.
Not claimed — and we'll say it first
  • A throughput win — SC takes 29×–116× longer per operation at useful quality; a binary baseline renders ~8× more frames.
  • An energy win — per operation, SC costs hundreds of times more than binary arithmetic; we publish the gate-level numbers.
  • A gaming-class or datacenter GPU — the honest niche is minimum-silicon, reliability-critical, display-class and edge devices.
  • Measured silicon results — physical FPGA validation (timing closure, measured power) is in progress and clearly labelled as pending.

Sources: the revised IEEE TVLSI manuscript, the point-by-point response to four reviewers, and the released RTL + Yosys flow that regenerates every count.