FengHuang Tensor Addressable Bridge

Implementation of the Tensor Addressable Bridge from
"FengHuang: Next-Generation Memory Orchestration for AI Inferencing", Microsoft Research, arXiv 2511.10753v1.

For our final project for CPSC 4261 (Spring 2026, Prof. Richard Yang), we implemented FengHuang, a hardware-software co-design proposal from Microsoft Research Asia for next-generation AI inference infrastructure. FengHuang's core primitive is the Tensor Addressable Bridge (TAB), a shared remote-memory fabric that decouples model-weight storage from per-GPU HBM and streams weights and KV-cache on demand from a pool of LPDDR6 banks over 224G/448G SerDes links. We implemented the TAB as four synthesizable Verilog modules (top-level integrator, round-robin crossbar, banked memory, and global completion tracker), built a roofline-based Python simulator whose bandwidth-sensitivity curve reproduces the paper's Figure 4.2 to within about 1 percentage point at the 4.8 TB/s design point, and profiled Qwen3-235B-A22B on Yale's 8-GPU H200 cluster. The accompanying report discusses the challenges that surfaced during implementation and empirical evaluation and proposes a redesign that addresses them.

Team Members

Anton Melnychuk, am3785, anton.melnychuk@yale.edu

(a) In a conventional cluster each xPU owns a full HBM stack and communicates with peers through a multi-tier switch fabric. AllReduce/AllGather traffic crosses the switches at every step.

(b) In FengHuang, a shared Tensor Addressable Bridge (TAB) replaces the switch fabric. Per-xPU HBM shrinks to 20 GB; bulk weight storage moves to a pool of LPDDR6 banks.

RTL

tab_mem_bank.v - memory bank FSM (OP_READ / OP_WRITE / OP_WR_ACC)
tab_crossbar.v - N×M crossbar with round-robin arbitration
tab_compl_tracker.v - pending write counter + WC_SYNC notification
fenghuang_tab.v - top-level integrator that instantiates all submodules
tb/tb_fenghuang_tab.v - testbench (P2P, AllReduce, AllGather)
All 4 tests pass under iverilog.

To run the testbench and see its output:

cd rtl && ./run.sh
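
For readers who want the arbitration policy without opening the Verilog, here is a minimal behavioral sketch of round-robin arbitration in Python. It is not a translation of tab_crossbar.v: the function name `round_robin_grant`, the port count, and the reset priority are assumptions made for illustration.

```python
def round_robin_grant(requests, last_grant):
    """Return the index of the next requester to grant, or None if idle.

    requests   -- list of bools, one per requesting port
    last_grant -- index granted on the previous cycle; the search starts
                  at the port after it so no requester is starved
    """
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n
        if requests[idx]:
            return idx
    return None


# Three of four ports contend for the same bank on consecutive cycles.
requests = [True, False, True, True]
grants = []
last = 3  # assumed reset state, so port 0 is checked first
for _ in range(4):
    g = round_robin_grant(requests, last)
    grants.append(g)
    last = g
print(grants)  # [0, 2, 3, 0]
```

Each requesting port is served once before any port is served a second time, which is the fairness property a round-robin crossbar is meant to provide.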

Nsight Simulation

Profiled Qwen3-235B-A22B on Yale cluster (H100 80GB × 8) with Nsight Systems
Extracted GPU utilization and compute/overhead breakdown from trace
Built Python simulator: TAB bandwidth prefetch model (simulator_v2.py)
- Model: paper scenario (20 GB local HBM, 38.75 GB remote per GPU)
- Sweep: 4.0 to 6.4 TB/s remote bandwidth
Reproduced Figure 4.2 from paper: see sim/results/comparison_final.png
- Our sim @ 4.8 TB/s: −18.2% TPOT vs Baseline8 (paper: −19.2%)
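
To give a feel for the numbers in the sweep, the sketch below computes how long it would take to stream the 38.75 GB remote footprint at each bandwidth point, and applies a simple roofline bound. This is not the code in simulator_v2.py. It assumes decimal units (1 TB = 1000 GB), that the whole remote footprint is streamed once per step, and that prefetch overlaps perfectly with compute; the function names and the `compute_ms` value are hypothetical.

```python
REMOTE_GB = 38.75  # remote footprint per GPU, from the scenario above


def remote_stream_ms(remote_gb, bw_tbps):
    """Milliseconds to stream remote_gb gigabytes at bw_tbps terabytes/second."""
    return remote_gb / (bw_tbps * 1000.0) * 1000.0


def step_time_ms(compute_ms, remote_gb, bw_tbps):
    """Roofline bound assuming perfect overlap: the slower of compute and transfer."""
    return max(compute_ms, remote_stream_ms(remote_gb, bw_tbps))


for bw in (4.0, 4.8, 6.4):
    print(f"{bw:.1f} TB/s -> {remote_stream_ms(REMOTE_GB, bw):.2f} ms")

# Hypothetical 7 ms of compute per step: transfer-bound at 4.8 TB/s,
# compute-bound at 6.4 TB/s.
print(step_time_ms(7.0, REMOTE_GB, 4.8), step_time_ms(7.0, REMOTE_GB, 6.4))
```

Under these assumptions, raising remote bandwidth only helps until the transfer time drops below the compute time, which is why a bandwidth-sensitivity curve flattens at the high end.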

Run command (requires nsys .sqlite traces from the Yale cluster; pre-computed outputs in sim/results/*.json):

python sim/simulator_v2.py \
  --baseline-trace trace_baseline8.sqlite \
  --fh4-trace      trace_fh4_qa.sqlite \
  --bw-sweep 4.0 4.8 6.4 \
  --output sim/results/results_final.json

Write-Up

Introduction
Problem Statement
Prefetch Simulation
Design and RTL Implementation
Architectural Gaps and Failure Modes
TAB Hybrid Redesign
Conclusion

Proposal Milestones & Deliverables

Simulation

iverilog -g2005-sv -ofenghuang_sim \
  rtl/tab_mem_bank.v rtl/tab_crossbar.v rtl/tab_compl_tracker.v \
  rtl/fenghuang_tab.v tb/tb_fenghuang_tab.v
vvp fenghuang_sim

Expected output:

[PASS] P2P_READ
[PASS] AllReduce_result  got=0xa0
[PASS] U0_data_via_xPU1
[PASS] U3_data_via_xPU0
ALL TESTS PASSED
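
The AllReduce test relies on write-accumulate plus a completion notification. The Python sketch below models that behavior under stated assumptions: the class `BankModel` is hypothetical, and the four operands are chosen only so that they sum to the 0xa0 shown above; the operands actually used by the testbench are not listed here.

```python
class BankModel:
    """Behavioral stand-in for one memory bank plus a pending-write counter."""

    def __init__(self, expected_writes):
        self.mem = {}
        self.pending = expected_writes  # writes still outstanding

    def write_acc(self, addr, value):
        # OP_WR_ACC-style write: add the value to what the bank already holds.
        self.mem[addr] = self.mem.get(addr, 0) + value
        self.pending -= 1

    def sync_done(self):
        # WC_SYNC-style notification: true once every expected write has landed.
        return self.pending == 0

    def read(self, addr):
        return self.mem.get(addr, 0)


bank = BankModel(expected_writes=4)
for contribution in (0x10, 0x20, 0x30, 0x40):  # hypothetical per-xPU operands
    bank.write_acc(0x0, contribution)
result = bank.read(0x0)
print(hex(result), bank.sync_done())  # 0xa0 True
```

The reduction happens inside the bank, so each xPU issues one write and one read instead of exchanging partial sums with its peers.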

FPGA Synthesis

Board: Versal VMK180

Operation	Latency
Read	220 ns
Write	90 ns
Write-Accumulate	90 ns
WC-Sync notification	40 ns
