@emustafa96 emustafa96 commented Apr 29, 2025

Replace leading zero counter with leading zero anticipator in FMA sum path

Summary

This PR optimizes the floating-point multiply-add (FMA) unit by replacing the leading zero counter (LZC) in the sum path, which can only start once the sum is available, with a leading zero anticipator (LZA) that runs in parallel with the adder. This removes the leading zero count from the critical path ahead of normalization and thereby improves FMA performance.

Problem

The previous implementation computed the sum first, then counted leading zeros for normalization:

Multiply → Align → Add/Subtract → Count Leading Zeros → Normalize → Result
                                      ↑
                               Critical path bottleneck

This sequential approach added unnecessary latency to the FMA operation, because the leading zero count, and therefore normalization, could only start once the complete sum was available.
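
As a rough behavioral illustration of this ordering (a Python sketch, not the RTL), the data dependency looks as follows. The function names are hypothetical, and the sketch assumes unsigned operands with `0 <= b < a < 2**width`, so the difference is positive:

```python
def count_leading_zeros(x: int, width: int) -> int:
    """Number of zero bits above the most significant set bit of x."""
    return width - x.bit_length()


def normalize_with_lzc(a: int, b: int, width: int) -> tuple[int, int]:
    """Normalize a - b the sequential way (assumes 0 <= b < a < 2**width)."""
    diff = a - b                               # step 1: the full subtraction
    shift = count_leading_zeros(diff, width)   # step 2: cannot start before diff exists
    return diff << shift, shift                # step 3: normalization shift


print(normalize_with_lzc(0b1000, 0b0001, 4))   # (14, 1): 0b0111 shifted to 0b1110
```

Every step consumes the output of the previous one, which is why the count sits on the critical path.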

Solution

Added Schmookler's leading zero anticipation algorithm (IEEE Xplore), as implemented in the Wally core, which predicts the normalization shift count in parallel with the sum computation:

Multiply → Align → Add/Subtract ──────────→ Normalize → Result
           ↓                               ↗
           └── Leading Zero Anticipator ──┘
           (in parallel)

Technical Details

The LZA implementation:

  • Uses carry-lookahead logic (P/G/K signals) to predict leading zero patterns
  • Handles both addition and subtraction operations via the sub control signal
  • Detects and corrects mispredictions of the shift count by one position
  • Feeds the predicted shift count directly to the normalization stage
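
The points above can be sketched behaviorally in Python. This is a simplified model under stated assumptions, not the RTL in this PR: it covers only effective subtraction with a positive result (`0 <= b < a < 2**width`), all function names are hypothetical, and it uses T/G/Z (propagate/generate/kill) of `a` and `~b`. The indicator word `f` depends only on the operands, so hardware can evaluate it alongside the adder; its leading one is either at the true leading-one position of `a - b` or one position above it.

```python
def lza_indicator(a: int, b: int, width: int) -> int:
    """Indicator word for a - b, assuming 0 <= b < a < 2**width."""
    mask = (1 << width) - 1
    nb = ~b & mask                           # a - b = a + nb + 1 (mod 2**width)
    t = (a ^ nb) & mask                      # propagate: a_i == b_i
    g = a & nb                               # generate:  a_i = 1, b_i = 0
    z = ~a & ~nb & mask                      # kill:      a_i = 0, b_i = 1
    t_up = (t >> 1) | (1 << (width - 1))     # t at bit i+1 (propagate above the MSB)
    g_dn = (g << 1) & mask                   # g at bit i-1 (0 below the LSB)
    z_dn = (z << 1) & mask                   # z at bit i-1 (0 below the LSB)
    f = (t_up & ((g & ~z_dn) | (z & ~g_dn))) | (~t_up & ((z & ~z_dn) | (g & ~g_dn)))
    return f & mask


def lza_shift_count(a: int, b: int, width: int) -> int:
    """Predicted leading zero count of a - b (exact or one too small)."""
    return width - lza_indicator(a, b, width).bit_length()


def normalize_with_lza(a: int, b: int, width: int) -> tuple[int, int]:
    """Normalize a - b using the predicted count plus a one-bit correction."""
    diff = a - b                             # in hardware: computed in parallel with the count
    shift = lza_shift_count(a, b, width)
    shifted = diff << shift
    if not (shifted >> (width - 1)) & 1:     # MSB still 0: prediction was off by one
        shifted <<= 1
        shift += 1
    return shifted, shift


print(normalize_with_lza(0b1000, 0b0001, 4))   # (14, 1): predicted 0, corrected to 1
print(normalize_with_lza(128, 127, 8))         # (128, 7): prediction exact
```

The first call shows the off-by-one case: the indicator predicts no shift for `0b1000 - 0b0001 = 0b0111`, and the check on the shifted MSB adds the missing position. Addition and negative results need additional handling that this sketch omits.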

Testing

  • Verified with the sequential equivalence check of Synopsys VC Formal
  • Proven equivalent to the previous implementation
  Summary Proofs:
   ----------------------------------------------------------------------------------------------------------------------
    VpId |           Name |      Type |         Parent |     #A |     #C |     #S |     #F |     #I |    Status |     %
   ----------------------------------------------------------------------------------------------------------------------
       0 |         seqdef |      root |            nil |     13 |      3 |     13 |      0 |      0 |   success |   100
       0 |      seqdef-rw |        or |         seqdef |      - |      - |      - |      - |      - |         - |     -
       0 |          rw1_1 |       int |      seqdef-rw |      5 |      0 |      5 |      0 |      0 |   success |   100
       0 |       rw1_1-ur |        or |          rw1_1 |      - |      - |      - |      - |      - |         - |     -
       0 |           ur_1 |      leaf |       rw1_1-ur |      4 |      0 |      4 |      0 |      0 |   success |   100
       0 |      rw1_1-dcp | decompose |          rw1_1 |      - |      - |      - |      - |      - |         - |     -
       0 |         idcp_1 |      leaf |      rw1_1-dcp |      4 |      0 |      4 |      0 |      0 |   success |   100
   ----------------------------------------------------------------------------------------------------------------------

@emustafa96
Contributor Author

The following script can be used to verify that the proposed changes are sequentially equivalent to the current implementation, using the sequential equivalence check of Synopsys VC Formal (vcf -file script_below.tcl):

set_fml_appmode SEQ

set SCRIPT_DIR [file normalize [file join [file dirname [info script]] ]]

set flist_golden [list \
 "common_cells/src/cf_math_pkg.sv" \
 "common_cells/src/lzc.sv" \
 "cvfpu/src/fpnew_pkg.sv" \
 "cvfpu/src/fpnew_classifier.sv" \
 "cvfpu/src/fpnew_rounding.sv" \
 "cvfpu/src/fpnew_fma_multi.sv" \
]
set flist_impl [list \
 "common_cells/src/cf_math_pkg.sv" \
 "common_cells/src/lzc.sv" \
 "cvfpu/src/fpnew_pkg.sv" \
 "cvfpu/src/fpnew_classifier.sv" \
 "cvfpu/src/fpnew_rounding.sv" \
 "cvfpu/src/fpnew_fma_multi_new.sv" \
 "cvfpu/vendor/cvw/fma/fmalza.sv" \
]

analyze -format sverilog -library spec -vcs $flist_golden +incdir+common_cells/include
analyze -format sverilog -library impl -vcs $flist_impl +incdir+common_cells/include


elaborate_seq -spectop fpnew_fma_multi -impltop fpnew_fma_multi

map_by_name -clock spec.clk_i

create_clock -period 100 spec.clk_i
create_reset spec.rst_ni -sense low

fvassume -expr {spec.src_fmt_i == 0}
fvassume -expr {spec.src2_fmt_i == 0}
fvassume -expr {spec.dst_fmt_i == 0}

sim_run -stable
sim_set_state -uninitialized -apply 0

check_fv -block

report_proofs

Make sure the paths to cvfpu and common_cells are correct relative to where vcf is called.

cvfpu/src/fpnew_fma_multi_new.sv contains the changes of this patch, while cvfpu/src/fpnew_fma_multi.sv and all other source files are the current version from develop. Different source and destination formats can be tried manually (unfortunately, runtime explodes when attempting to constrain these more loosely via, e.g., spec.src_fmt_i inside {0,1,2,3,4}).

@rgiunti

rgiunti commented Oct 28, 2025

Hi @emustafa96. I tested the PR with the UVM testbench https://github.com/openhwgroup/cvfpu-uvm.git. I configured the FPU instance to use a merged slice for the FMA unit so that the ADD/MUL operations stress your modifications. As a regression test, I ran 10000 random transactions with random operations, operands, FP formats, and FP rounding modes, repeated for 10 different seeds, and compared the results with those given by the MPFR golden model. Everything looks fine, so if you agree with my test and results, I think the PR can be merged.

@emustafa96
Contributor Author

Hi @rgiunti, thank you for your efforts! Given the formal equivalence check and your testing, I also think we can merge.
