Design document of the Branch Prediction Unit #233

dragon540 · 2024-12-02T16:28:57Z

This is the design document for the branch prediction unit (BPU) to be added to the Olympia simulator. It aims to give an overview of the micro-architectural and implementation details of the BPU.

The document will be updated further as development progresses. Any suggestions are appreciated.

jeffnye-gh

I would suggest some organizational changes; compare this document against the template. I would like a better understanding of your class sets, their hierarchy, and their methods.

I would also like a better understanding of how you intend the unit to operate, and a sense of how much code needs to be written, with some mention of the existing classes and APIs.

Global comments:
Take advantage of the adoc syntax when inserting code fragments to help readability.
Be sure to document your material references when external information is used, such as figures.

jeffnye-gh · 2024-12-04T16:20:33Z

docs/bpu_design_document/BPUDesignDoc.adoc

+outcomes and branch targets before branch instructions are actually
+resolved in the pipeline in order to reduce latency between a branch and a
+subsequent instruction.
+


"in order to reduce latency between a branch and a subsequent instruction."

This is unclear. For this kind of introductory matter, there are a number of BPU descriptions available on the web that you can adapt. Use your own style, but something like:

"This prediction allows the processor to prefetch and execute subsequent instructions without waiting to resolve the branch condition, minimizing stalls and maintaining high throughput in the pipeline."

jeffnye-gh · 2024-12-04T16:21:01Z

docs/bpu_design_document/BPUDesignDoc.adoc

+|0.2        | 2024.11.18 | Shobhit Sinha | BPU design documentation
+|0.1        | 2024.11.12 | Jeff Nye | initial template
+|===
+


Keep the revision history updated with a summary of changes.

There are no change bars in adoc, so this summary saves time re-reading unchanged portions of the document.

jeffnye-gh · 2024-12-04T16:30:28Z

docs/bpu_design_document/BPUDesignDoc.adoc

+=== Overview Block Diagram
+
+image:media/bpu_overview.png[image,width=576,height=366]
+


You might consider removing this diagram and/or moving the subsequent diagram to this location. The second diagram conveys more information.

Also, if you are using diagrams from the web, be sure to cite the references.

jeffnye-gh · 2024-12-04T16:35:03Z

docs/bpu_design_document/BPUDesignDoc.adoc

+first tier provides a simple but fast prediction. The second tier consists
+of a more accurate predictor which can predict even complex branches but takes an
+additional cycle.
+


"Branch Prediction Unit in Olympia is a two-tiered branch predictor"

Are you describing the current state of Olympia or the future state with this design completed?

jeffnye-gh · 2024-12-04T16:37:49Z

docs/bpu_design_document/BPUDesignDoc.adoc

+   slots.
+
+** `out_bpu_prediction_req` - in Fetch.cpp. To send PredictionInput to BPU.
+


There should be a reference to the existing BPU API found in BranchPredIF.hpp and how these ports interact with that API. The intent of the API is to allow parallel development of other BPUs with a common interface.

jeffnye-gh · 2024-12-04T16:50:02Z

docs/bpu_design_document/BPUDesignDoc.adoc

+** `pred_req_num` - Total number of prediction requests made to BPU
+** `num_mispred` - Total number of mis-predictions
+** `mispred_perc` - Percentage of mis-predictions
+


I would expect stats/counters for most of the structures in the BPU. These statistics are often used to assess the quality and configuration of a branch predictor implementation, and also to support application analysis.

Consider adding these stats/counters:
Prediction hit/miss rates by conditional branch type and direction
Structure-specific stats such as PHT hit/miss/aliasing conflicts, and similar for the BTB
Histograms of entry usage, PHT entries for example
RAS high/low water marks, RAS utilization
Global distribution of taken/not taken, speculative and resolved T/NT
Some measure of false sharing in the tables

TAGE will add additional stats/counters. I would mention this, but I think you could leave the exact list for the future.
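As a rough illustration of a few of the counters listed above, here is a minimal stand-alone sketch. It deliberately does not use Olympia's or Sparta's statistics classes; `BpuStats`, `recordPrediction`, `recordRasDepth`, and the member names other than `pred_req_num` and `num_mispred` are hypothetical, and for simplicity it assumes every prediction request is counted together with its resolution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-alone statistics holder for a BPU model.
// A real implementation would register these with the simulator's
// statistics framework instead of using plain integers.
struct BpuStats
{
    uint64_t pred_req_num = 0;      // total prediction requests
    uint64_t num_mispred = 0;       // total mispredictions
    uint64_t taken = 0;             // resolved taken branches
    uint64_t not_taken = 0;         // resolved not-taken branches
    std::size_t ras_high_water = 0; // deepest return-stack occupancy seen

    // Count one request together with its resolved outcome
    // (a simplification made for this sketch).
    void recordPrediction(bool predicted_taken, bool actual_taken)
    {
        ++pred_req_num;
        if (predicted_taken != actual_taken) { ++num_mispred; }
        if (actual_taken) { ++taken; } else { ++not_taken; }
    }

    // Track the return-stack high-water mark after each push.
    void recordRasDepth(std::size_t depth)
    {
        if (depth > ras_high_water) { ras_high_water = depth; }
    }

    // Misprediction percentage; 0 when no requests were made.
    double mispredPerc() const
    {
        assert(num_mispred <= pred_req_num);
        if (pred_req_num == 0) { return 0.0; }
        return 100.0 * static_cast<double>(num_mispred)
                     / static_cast<double>(pred_req_num);
    }
};
```

Per-structure histograms and aliasing counters would follow the same pattern, with one such holder per table.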

jeffnye-gh · 2024-12-04T16:51:10Z

docs/bpu_design_document/BPUDesignDoc.adoc

+image:media/bpu_uarch.png[image,width=800,height=366]
+
+Figure 2 - Unit block diagram of BPU
+


see previous comment, perhaps replace the previous figure with this one.

jeffnye-gh · 2024-12-04T16:52:32Z

docs/bpu_design_document/BPUDesignDoc.adoc

+
+Olympia's Branch Predictor API intends the implementation of Branch Predictor to
+define custom PredictionInput
+


I do not understand this sentence.

jeffnye-gh · 2024-12-04T16:53:49Z

docs/bpu_design_document/BPUDesignDoc.adoc

+PredictionInput is received by the BPU from the Fetch unit whenever a request for
+the prediction is made. Based on the data provided by this input, BPU makes the
+prediction.
+


see previous comment on the existing BPU API. This design should conform to that API.

If you propose future changes to that API, it is best to put them in a section of their own.

jeffnye-gh · 2024-12-04T17:03:04Z

docs/bpu_design_document/BPUDesignDoc.adoc

+define custom PredictionInput
+
+[[Overview_of_PredictionInput]]
+===  Overview


Are there header formatting issues with the document? The organization from this point on does not follow a progression I understand.

jeffnye-gh · 2025-01-14T03:56:35Z

This is similar to Shobhit's other PR. Not expected to compile.

We asked that interns post their code, in whatever state, as draft PRs at the end of the internship so as not to lose any progress.

knute-mips · 2025-03-10T16:21:02Z

Shobhit, can you pull the latest master into your branches? CI should pass now.


dragon540 · 2025-03-16T22:53:55Z

@klingaard @arupc this document is still a work in progress; however, some details are much clearer than before. Can you take a look at this and check whether the high-level design of the unit (interaction of the BPU with Fetch and FTQ, organization of its constituent predictors, etc.) looks okay?

klingaard · 2025-03-16T23:18:11Z

> @klingaard @arupc this document is still a work in progress; however, some details are much clearer than before. Can you take a look at this and check whether the high-level design of the unit (interaction of the BPU with Fetch and FTQ, organization of its constituent predictors, etc.) looks okay?

Can do!

oms-vmicro · 2025-05-14T21:20:00Z

@dragon540 I know some of the code development is already being done in #243, but I am providing some feedback specific to the design document.

Overall I think the document is well done and covers a lot of the components one would like to see in a higher-performance BPU and front end.

At a high level, my understanding from the design doc is that the BranchPredIF.hpp interface is intended to be used for the BPU unit. PredictionRequest, PredictionOutput, and UpdateInput correspond to the "prediction input", "prediction output", and "update input" types from the interface.

I don't believe the BranchPredIF.hpp interface is currently being used by the individual branch predictors themselves, though (e.g. PHT, BTB, TAGE, etc.). I believe the intent behind the interface is that individual predictors can and should use the full interface as well. Specifically, inputs and outputs to the individual predictors should be templatized such that they can be redefined as needed. The existing predictors demonstrate where this would be useful, as the PHT takes a hash of PC and global history as input whereas PC is used for the other predictors. By using predictor-specific PredictionRequest, PredictionOutput, and UpdateInput types, a generic interface can be used with any individual predictor and defined to suit that predictor (e.g. PHT -> hash of PC & global hist, BTB -> PC). They can then be defaulted as is currently defined by the design doc. This would make for a less "hardened" approach to interfacing with the individual predictors along with the interfaces to the overall BPU unit.
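To make the idea of predictor-specific request, output, and update types concrete, here is a minimal sketch. It is not the contents of BranchPredIF.hpp: `PredictorIF`, `SimplePht`, `SimpleBtb`, and all of the struct and member names are hypothetical, chosen only to show how one templated interface could serve a PHT indexed by a hash and a BTB indexed by PC.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical generic interface: each predictor picks its own
// request, output, and update types.
template <class PredictionRequestT, class PredictionOutputT, class UpdateInputT>
class PredictorIF
{
  public:
    virtual ~PredictorIF() = default;
    virtual PredictionOutputT getPrediction(const PredictionRequestT &) = 0;
    virtual void updatePredictor(const UpdateInputT &) = 0;
};

// PHT-specific types: indexed by a hash of PC and global history.
struct PhtRequest { uint32_t hashed_idx; };
struct PhtOutput  { bool taken; };
struct PhtUpdate  { uint32_t hashed_idx; bool actual_taken; };

// BTB-specific types: indexed by PC.
struct BtbRequest { uint64_t pc; };
struct BtbOutput  { bool hit; uint64_t target; };
struct BtbUpdate  { uint64_t pc; uint64_t target; };

class SimplePht : public PredictorIF<PhtRequest, PhtOutput, PhtUpdate>
{
  public:
    // Counters start at 1 (weakly not-taken).
    explicit SimplePht(uint32_t size) : ctrs_(size, 1) { assert(size > 0); }

    PhtOutput getPrediction(const PhtRequest & req) override
    {
        return PhtOutput{ctrs_[req.hashed_idx % ctrs_.size()] >= 2};
    }

    void updatePredictor(const PhtUpdate & upd) override
    {
        uint8_t & ctr = ctrs_[upd.hashed_idx % ctrs_.size()];
        if (upd.actual_taken) { if (ctr < 3) { ++ctr; } }
        else                  { if (ctr > 0) { --ctr; } }
    }

  private:
    std::vector<uint8_t> ctrs_; // 2-bit saturating counters
};

class SimpleBtb : public PredictorIF<BtbRequest, BtbOutput, BtbUpdate>
{
  public:
    BtbOutput getPrediction(const BtbRequest & req) override
    {
        const auto it = targets_.find(req.pc);
        if (it == targets_.end()) { return BtbOutput{false, 0}; }
        return BtbOutput{true, it->second};
    }

    void updatePredictor(const BtbUpdate & upd) override
    {
        targets_[upd.pc] = upd.target;
    }

  private:
    std::unordered_map<uint64_t, uint64_t> targets_; // PC -> branch target
};
```

With this shape, the enclosing BPU only needs to know how to build each predictor's request from its own input; the predictors themselves stay interchangeable behind the same two methods.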

In a similar vein, if there is a way to enable or disable individual predictors via the command line or configuration, I think that would be helpful. That is, the ability to define the collection of actual predictors implemented in the BPU, combined with a modular branch-predictor approach, would let one specify that collection in a configuration file (e.g. a BOOM uarch configuration). One could then enable, disable, or even add modular predictors selected via configuration. A simple example with the existing design would be evaluating TAGE with and without the Statistical Corrector predictor. This would likely mean having classes for the individual predictors and then parent classes coupling them into "BasePredictor" and "TAGE_SC_L".

I realize that brings up a question of how does the design function/what assumptions can be made if someone were to disable all individual predictors, but I think we can prevent that and assume there will be one "base" (simple) and possibly one "TAGE_SC_L" (complex) predictor.
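One possible shape for such a configuration is sketched below. The `BpuConfig` flags and `selectPredictors` are hypothetical names, not existing Olympia parameters, and the sketch assumes that the Statistical Corrector and loop predictor are only meaningful when TAGE is enabled. It rejects a configuration with the base predictor disabled, matching the assumption above that one simple predictor is always present.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical configuration flags; in a simulator these would come
// from a configuration file or the command line.
struct BpuConfig
{
    bool enable_base_predictor = true;
    bool enable_tage = true;
    bool enable_statistical_corrector = true;
    bool enable_loop_predictor = true;
};

// Returns the names of the predictors that would be instantiated.
// Throws if the base (simple) predictor is disabled.
inline std::vector<std::string> selectPredictors(const BpuConfig & cfg)
{
    if (!cfg.enable_base_predictor) {
        throw std::invalid_argument("BPU requires the base predictor");
    }
    std::vector<std::string> selected{"BasePredictor"};
    if (cfg.enable_tage) {
        selected.push_back("TAGE");
        // Assumption: SC and L are add-ons to TAGE and are skipped without it.
        if (cfg.enable_statistical_corrector) { selected.push_back("SC"); }
        if (cfg.enable_loop_predictor) { selected.push_back("L"); }
    }
    assert(!selected.empty());
    return selected;
}
```

Evaluating TAGE with and without the Statistical Corrector then becomes a one-flag change between two runs.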

oms-vmicro · 2025-05-14T21:32:12Z

docs/bpu_design_document/BPUDesignDoc.adoc

+None
+
+[[Return_Address_Stack]]
+== Return Address Stack


I know BOOM uses a similar name for their return address predictor, but we might want to consider avoiding the acronym RAS.

When I see the term RAS with respect to computing, I think of features related to "Reliability, Availability, and Serviceability" (wiki). You'll find the acronym used in this manner in a variety of places:

https://github.com/riscv-admin/riscv-ras/blob/main/Specification.adoc

https://www.kernel.org/doc/html/v4.10/admin-guide/ras.html

https://developer.arm.com/documentation/107790/0100/Introduction-to-RAS

https://www.intel.com/content/www/us/en/developer/articles/technical/pmem-RAS.html

Not dead set on this, but to potentially avoid confusion we could consider naming it something different. Some suggestions:

Return Stack Buffer (RSB): https://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_function_returns

Return Address Predictor (RAP)

Return Address Buffer (RAB)

oms-vmicro · 2025-05-14T21:37:36Z

docs/bpu_design_document/BPUDesignDoc.adoc

+=== Functional Description
+
+The proposed Branch Prediction Unit (BPU) is a two-tiered branch predictor where the
+first tier(BasePredictor) provides a simple but fast prediction, whereas the second tier(TAGE_SC_L)


This is a suggestion, but wanted to see if we would want to use terms other than first and second to describe the branch predictor tiers here and in the interfaces/functions below.

The thought was if there is a more defined way of describing these tiers that is still generic but provides some guidance as to what predictors fit in which tier. My suggestion would be to potentially call them "fast" vs "slow" or "single-cycle" vs "multi-cycle" as I believe first-tier predictors are expected to complete in a single cycle whereas second-tier predictors take more than one cycle.

oms-vmicro · 2025-05-14T21:43:57Z

docs/bpu_design_document/BPUDesignDoc.adoc

+to the BTB, BasePredictor and the TAGE_SC_L-Predictor.
+*** If it is a hit on BTB, and the BasePredictor predicts a taken
+branch, then the output is sent to Fetch unit
+*** If it is a hit on BTB, but


Please complete this sentence. Possibly: "... but the BasePredictor predicts a not taken branch, then fall-through is sent to Fetch unit".
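The two BTB-hit cases discussed here can be sketched as a small selection function. This is only an illustration: `NextFetch`, `selectNextFetch`, and the `fetch_bytes` parameter are hypothetical, and the behaviour on a BTB miss (fall through, since no target is known) is an assumption that the quoted text does not cover.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result of the first-tier decision.
struct NextFetch
{
    bool taken;       // true if fetch is redirected to a branch target
    uint64_t next_pc; // address Fetch should use next
};

// pc          : address of the current fetch
// fetch_bytes : bytes consumed on fall-through (hypothetical parameter)
inline NextFetch selectNextFetch(uint64_t pc, uint64_t fetch_bytes,
                                 bool btb_hit, uint64_t btb_target,
                                 bool base_predicts_taken)
{
    assert(fetch_bytes > 0);
    // BTB hit and BasePredictor says taken: send the target to Fetch.
    if (btb_hit && base_predicts_taken) {
        return NextFetch{true, btb_target};
    }
    // BTB hit but predicted not taken: send the fall-through address.
    // BTB miss: assumed to fall through as well (no target available).
    return NextFetch{false, pc + fetch_bytes};
}
```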

oms-vmicro · 2025-05-14T21:54:04Z

docs/bpu_design_document/BPUDesignDoc.adoc

+1. `const uint32_t pht_size_`
+2. `const uint8_t  ctr_bits_`
+3. `const uint8_t  ctr_bits_val_`
+4. `std::map<uint64_t, uint8_t> pht_`


I believe PHT will use a uint32_t index representing a hash of PC and global history. I think this is noted by the use of uint32_t in the increment, decrement, and getPrediction functions noted below.

Should the key/index for the PHT std::map here also be a uint32_t? (e.g. std::map<uint32_t, uint8_t> pht_)
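For reference, a PHT keyed by a `uint32_t` index could look like the sketch below. It reuses the member names from the quoted list, but everything else is assumed for illustration: the class name, the function names `incrementCounter` and `decrementCounter`, the reading of `ctr_bits_val_` as the maximum counter value, and the gshare-style XOR hash.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

class PatternHistoryTable
{
  public:
    PatternHistoryTable(uint32_t pht_size, uint8_t ctr_bits)
        : pht_size_(pht_size),
          ctr_bits_(ctr_bits),
          ctr_bits_val_(static_cast<uint8_t>((1u << ctr_bits) - 1))
    {
        assert(pht_size > 0 && ctr_bits > 0 && ctr_bits <= 8);
    }

    // Assumed hash: XOR of PC and global history, folded into the table size.
    uint32_t index(uint64_t pc, uint32_t global_history) const
    {
        return static_cast<uint32_t>((pc >> 1) ^ global_history) % pht_size_;
    }

    uint8_t counterBits() const { return ctr_bits_; }

    // Saturating increment of the counter at idx.
    void incrementCounter(uint32_t idx)
    {
        uint8_t & ctr = counter(idx);
        if (ctr < ctr_bits_val_) { ++ctr; }
    }

    // Saturating decrement of the counter at idx.
    void decrementCounter(uint32_t idx)
    {
        uint8_t & ctr = counter(idx);
        if (ctr > 0) { --ctr; }
    }

    // Taken when the counter is in the upper half of its range.
    bool getPrediction(uint32_t idx) const
    {
        const auto it = pht_.find(idx);
        const uint8_t ctr = (it == pht_.end()) ? initial() : it->second;
        return ctr > ctr_bits_val_ / 2;
    }

  private:
    // Untouched entries behave as weakly not-taken.
    uint8_t initial() const { return static_cast<uint8_t>(ctr_bits_val_ / 2); }

    // Fetch the counter, creating it with the initial value if absent.
    uint8_t & counter(uint32_t idx)
    {
        return pht_.emplace(idx, initial()).first->second;
    }

    const uint32_t pht_size_;
    const uint8_t  ctr_bits_;
    const uint8_t  ctr_bits_val_;       // assumed: maximum counter value
    std::map<uint32_t, uint8_t> pht_;   // key matches the uint32_t index
};
```

Because `index()` already returns a `uint32_t`, keying the map with the same type avoids an implicit widening at every lookup.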

oms-vmicro · 2025-05-14T22:30:15Z

docs/bpu_design_document/BPUDesignDoc.adoc

+==== Private data members
+1. `uint32_t tage_bim_max_size_` - Represents maximum size of the BIM table of TAGE
+2. `uint8_t tage_bim_ctr_bits_` - Represents the number of bits used in counter of BIM table
+3. `std::vector<uint8_t> Tage_Bimodal_` - Represents the container used for BIM in the


Should this be uint32_t if tage_bim_table_size_ is a uint32_t as defined above?

jeffnye-gh · 2025-05-19T18:32:18Z

From today's call (2025.05.19), regarding changes to the API to support staged predictions.

The short of it is: I think we want to expose signals in the API to allow override of previous predictions. This has some implications for what is behind the BP-API.

In recent open source designs there is the concept of staged prediction, the uBTB/LP/TAGE/SC/ITTAGE all deliver predictions, some with longer latency. Not all of these will exist, not all latency values will be different.

In terms of the visible methods in the BP-API, I think you want to add a method that signals an override to the front-end logic and passes a struct with the override info: BP request queue idx, basic block address, new prediction, and maybe an extension for debug/metadata info.

I am making an assumption that the decision to signal an override has been made behind the BP-API, a topic for discussion, but seems like a way to allow greatest freedom/generality to developers to explore choices and keep the API smaller.

You could imagine I would test the effectiveness of a tournament selector vs simply always choosing the latest prediction that is different than an earlier one, and these differences would not need to be exposed through the api.
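As a starting point for that discussion, the override struct and the simpler of the two policies might be sketched as follows. None of these names come from the existing BP-API: `StagedPrediction`, `OverrideInfo`, and `checkOverride` are hypothetical, and the comparison rule (targets only matter for taken predictions) is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical prediction payload produced by any stage.
struct StagedPrediction
{
    bool taken;
    uint64_t target;
};

// Hypothetical struct passed to the front end when an override is signalled.
struct OverrideInfo
{
    uint32_t request_queue_idx;      // BP request queue index being overridden
    uint64_t basic_block_addr;       // basic block the prediction belongs to
    StagedPrediction new_prediction; // replacement prediction
    std::string debug_info;          // optional debug/metadata extension
};

// Two predictions agree if the direction matches and, when taken,
// the targets match (assumed rule).
inline bool samePrediction(const StagedPrediction & a, const StagedPrediction & b)
{
    return a.taken == b.taken && (!a.taken || a.target == b.target);
}

// "Latest different prediction wins" policy: fill `out` and return true
// whenever the later stage disagrees with the earlier one. The decision
// stays behind the API; only the resulting OverrideInfo is exposed.
inline bool checkOverride(uint32_t request_queue_idx, uint64_t basic_block_addr,
                          const StagedPrediction & earlier,
                          const StagedPrediction & later,
                          OverrideInfo & out)
{
    if (samePrediction(earlier, later)) { return false; }
    out = OverrideInfo{request_queue_idx, basic_block_addr, later,
                       "later stage disagreed"};
    return true;
}
```

A tournament selector could replace the body of `checkOverride` without changing `OverrideInfo`, which is the property argued for above.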

One other thing I did not mention in the call, I believe the BP-API will want to support multiple prediction requests, updates, and returns. If you read the trade press 2 or 4 predictions at a time will become the bar. I believe this just means that request/result data is grouped. Not a significant change on the surface, but occurred to me while writing.

jeffnye-gh · 2025-05-26T21:41:54Z

~~I can put this in a PR but it's one file~~,

I added support for 2T/nT (multiple predictions) and staged prediction results to Arup's BP interface class. This is for discussion.

I put this in a draft pr, seemed easier.
#259

dragon540 added 7 commits November 18, 2024 23:07

BPU Design Documents

cb84444

Design update

b186afb

Document update

0008c72

Update FTQ description

27e9ec9

added corresponding ports and updated parameter names

b965ec4

updated BasePredictor part

f54f802

updated SC and TAGE section

3d560fd

dragon540 marked this pull request as draft December 2, 2024 16:29

Added paramters of TAGE

6c04a98

jeffnye-gh requested changes Dec 4, 2024


dragon540 added 3 commits December 10, 2024 10:17

Added sections for data members, functions for each class

e37435b

Added ports, counters, data members, and functions list

2265cbc

Updated BPU overview diagram with BPU interaction diagram

d56147b

dragon540 mentioned this pull request Dec 16, 2024

Branch prediction unit implementation #240

Closed

Added details of FTQ structure

900f462

dragon540 added 6 commits March 13, 2025 03:18

Updated FTQ ports and bpu_uarch diagram

1ecd00f

Merge branch 'riscv-software-src:master' into shobhit/bpu_design_doc

4ac1159

Added interaction mechanism detail between BPU, FTQ and Fetch

37cc04c

Defined mechanism and functions for internal working of TAGE

8274db4

Update BPU ports and functions to facilitate interaction with FTQ and…

eed4884

… Fetch

Updated formating

c150540

dragon540 requested review from arupc and klingaard March 16, 2025 22:49

oms-vmicro reviewed May 14, 2025


