Design document of the Branch Prediction Unit #233
I would suggest some organizational changes; compare this document to the template. I would like a better understanding of your class sets, their hierarchy, and their methods.
I would also like a better understanding of how you intend the unit to operate and a sense of how much code needs to be written, with some mention of the existing classes and APIs.
Global comments:
- Take advantage of the adoc syntax when inserting code fragments to help readability.
- Be sure to document your material references, such as figures, when external information is used.
outcomes and branch targets before branch instructions are actually
resolved in the pipeline in order to reduce latency between a branch and a
subsequent instruction.
"in order to reduce latency between a branch and a subsequent instruction."
This is unclear. For this kind of introductory matter, there are a number of BPU descriptions available on the web that you can adapt. Use your own style, but something like:
"This prediction allows the processor to prefetch and execute subsequent instructions without waiting to resolve the branch condition, minimizing stalls and maintaining high throughput in the pipeline."
|0.2 | 2024.11.18 | Shobhit Sinha | BPU design documentation
|0.1 | 2024.11.12 | Jeff Nye | initial template
|===
Keep the revision history updated with a summary of changes.
There are no change bars in adoc, so this summary saves time re-reading unchanged portions of the document.
=== Overview Block Diagram

image:media/bpu_overview.png[image,width=576,height=366]
You might consider removing this diagram and/or moving the subsequent diagram to this location. The second diagram conveys more information.
Also, if you are using diagrams from the web, be sure you cite the references.
first tier provides a simple but fast prediction. The second tier consists
of a more accurate predictor which can predict even complex branches but takes an
additional cycle.
"Branch Prediction Unit in Olympia is a two-tiered branch predictor"
Are you describing the current state of Olympia or the future state with this design completed?
slots.

** `out_bpu_prediction_req` - in Fetch.cpp. To send PredictionInput to BPU.
There should be a reference to the existing BPU API found in BranchPredIF.hpp and how these ports interact with that API. The intent of the API is to allow parallel development of other BPUs with a common interface.
** `pred_req_num` - Total number of prediction requests made to BPU
** `num_mispred` - Total number of mis-predictions
** `mispred_perc` - Percentage of mis-predictions
I would expect stats/counters for most of the structures in the BPU. These statistics are often used to assess the quality and configuration of a branch predictor implementation, and also to support application analysis.
Consider adding these stats/counters:
- prediction hit/miss rates by conditional branch type and direction
- structure-specific stats such as PHT hit/miss/aliasing conflicts, and similar for the BTB
- histograms of entry usage, PHT entries for example
- RAS high/low water marks, RAS utilization
- global distribution of taken/not taken, speculative and resolved T/NT
- some measure of false sharing in the tables

TAGE will add additional stats/counters. I would mention this, but I think you could leave the exact list for the future.
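To make the first suggestion concrete, here is a minimal sketch of per-branch-kind counters. It is plain C++ and deliberately does not use the simulator's statistics framework; `BranchKind`, `BranchStats`, and every member name are hypothetical and do not come from the design document.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical branch categories; the design document does not define this enum.
enum class BranchKind : std::size_t {
    Conditional = 0, DirectJump, IndirectJump, Call, Return, Count
};

// Hypothetical counter set: one bucket per branch kind.
struct BranchStats {
    struct Bucket {
        uint64_t predictions = 0;     // predictions that were later resolved
        uint64_t mispredictions = 0;  // predicted direction != resolved direction
        uint64_t taken = 0;           // resolved taken
        uint64_t not_taken = 0;       // resolved not taken
    };

    std::array<Bucket, static_cast<std::size_t>(BranchKind::Count)> buckets{};

    // Record one resolved branch against its earlier prediction.
    void record(BranchKind kind, bool predicted_taken, bool actual_taken) {
        Bucket& b = buckets[static_cast<std::size_t>(kind)];
        ++b.predictions;
        if (predicted_taken != actual_taken) {
            ++b.mispredictions;
        }
        if (actual_taken) {
            ++b.taken;
        } else {
            ++b.not_taken;
        }
    }

    // Misprediction percentage for one kind; 0 when nothing was recorded.
    double mispredPercent(BranchKind kind) const {
        const Bucket& b = buckets[static_cast<std::size_t>(kind)];
        if (b.predictions == 0) {
            return 0.0;
        }
        return 100.0 * static_cast<double>(b.mispredictions)
                     / static_cast<double>(b.predictions);
    }
};
```

In the real unit these would presumably map onto the simulator's own counter and statistic types rather than raw integers.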
image:media/bpu_uarch.png[image,width=800,height=366]

Figure 2 - Unit block diagram of BPU
see previous comment, perhaps replace the previous figure with this one.
Olympia's Branch Predictor API intends the implementation of Branch Predictor to
define custom PredictionInput
I do not understand this sentence.
PredictionInput is received by the BPU from the Fetch unit whenever a request for
the prediction is made. Based on the data provided by this input, BPU makes the
prediction.
See previous comment on the existing BPU API. This design should conform to that API.
If you propose future changes to that API, it is best to put them in a section of their own.
define custom PredictionInput

[[Overview_of_PredictionInput]]
=== Overview
Are there header formatting issues with the document? The organization from this point on does not follow a progression I understand.
This is similar to Shobhit's other PR. Not expected to compile. We asked that interns post their code in whatever state as draft PRs at the end of the internship so as not to lose any progress.
Shobhit, can you pull the latest master in your branches? CI should pass now.
@klingaard @arupc this document is still a work in progress; however, some details are much clearer than before. Can you take a look at this and check whether the high-level design of the unit (interaction of BPU with Fetch and FTQ, organization of its constituent predictors, etc.) looks okay?
Can do!
@dragon540 I know some of the code development is already being done in #243, but I am providing some feedback specific to the design document. Overall I think the document is well done and covers a lot of the components one would like to see in a more high-performance BPU and front end. At a high level, my understanding from the design doc is the I don't believe the

In a similar thought, if there is a way to enable or disable individual predictors via command line or configuration, I think that would be helpful. That is, the ability to define the collection of predictors actually implemented in the BPU, combined with a modular branch-predictor approach, would let one specify that collection in a configuration file (e.g. a BOOM uarch configuration). One could then enable, disable, or even add modular predictors via configuration. A simple example with the existing design would be evaluating TAGE with and without the Statistical Corrector predictor. This would likely mean having classes for the individual predictors and then parent classes coupling them into "BasePredictor" and "TAGE_SC_L".
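As a rough illustration of the modular idea, the sketch below composes predictors behind a common interface and toggles them by name. All class and method names (`DirectionPredictor`, `FixedPredictor`, `CompositePredictor`, `setEnabled`) are hypothetical, the stand-in predictors return a fixed answer, and the "last enabled component wins" policy is only an assumption chosen to keep the example short.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical common interface for an individual predictor.
class DirectionPredictor {
public:
    virtual ~DirectionPredictor() = default;
    virtual std::string name() const = 0;
    virtual bool predictTaken(uint64_t pc) const = 0;
};

// Stand-in predictor that always returns a fixed direction (for illustration only).
class FixedPredictor final : public DirectionPredictor {
public:
    FixedPredictor(std::string name, bool taken)
        : name_(std::move(name)), taken_(taken) {}
    std::string name() const override { return name_; }
    bool predictTaken(uint64_t /*pc*/) const override { return taken_; }
private:
    std::string name_;
    bool taken_;
};

// Hypothetical parent class coupling individual predictors together.
class CompositePredictor {
public:
    void add(std::unique_ptr<DirectionPredictor> p, bool enabled = true) {
        slots_.push_back(Slot{std::move(p), enabled});
    }

    // Enable or disable a component by name, e.g. from a configuration file.
    // Returns false if no component has that name.
    bool setEnabled(const std::string& name, bool enabled) {
        for (auto& s : slots_) {
            if (s.predictor->name() == name) {
                s.enabled = enabled;
                return true;
            }
        }
        return false;
    }

    // Assumed policy: the last enabled component in the chain decides.
    // With no enabled component, predict not taken.
    bool predictTaken(uint64_t pc) const {
        bool taken = false;
        for (const auto& s : slots_) {
            if (s.enabled) {
                taken = s.predictor->predictTaken(pc);
            }
        }
        return taken;
    }

private:
    struct Slot {
        std::unique_ptr<DirectionPredictor> predictor;
        bool enabled;
    };
    std::vector<Slot> slots_;
};
```

With something along these lines, the TAGE with and without Statistical Corrector experiment becomes a one-line configuration change rather than a code change.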
None

[[Return_Address_Stack]]
== Return Address Stack
I know BOOM uses a similar name for their return address predictor, but we might want to consider avoiding the acronym `RAS`.
When I see the term `RAS` with respect to computing, I think of features related to "Reliability, Availability, and Serviceability" (wiki). You'll find the acronym used in this manner in a variety of places:
- https://github.com/riscv-admin/riscv-ras/blob/main/Specification.adoc
- https://www.kernel.org/doc/html/v4.10/admin-guide/ras.html
- https://developer.arm.com/documentation/107790/0100/Introduction-to-RAS
- https://www.intel.com/content/www/us/en/developer/articles/technical/pmem-RAS.html
Not dead set on this, but to potentially avoid confusion we could consider naming it something different. Some suggestions:
- Return Stack Buffer (`RSB`): https://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_function_returns
- Return Address Predictor (`RAP`)
- Return Address Buffer (`RAB`)
=== Functional Description

The proposed Branch Prediction Unit (BPU) is a two-tiered branch predictor where the
first tier(BasePredictor) provides a simple but fast prediction, whereas the second tier(TAGE_SC_L)
This is a suggestion, but wanted to see if we would want to use terms other than `first` and `second` to describe the branch predictor tiers here and in the interfaces/functions below.
The thought was if there is a more defined way of describing these tiers that is still generic but provides some guidance as to what predictors fit in which tier. My suggestion would be to potentially call them "fast" vs "slow" or "single-cycle" vs "multi-cycle" as I believe first-tier predictors are expected to complete in a single cycle whereas second-tier predictors take more than one cycle.
to the BTB, BasePredictor and the TAGE_SC_L-Predictor.
*** If it is a hit on BTB, and the BasePredictor predicts a taken
branch, then the output is sent to Fetch unit
*** If it is a hit on BTB, but
Complete sentence. Possibly: "... but the BasePredictor predicts a not taken branch, then fall-through is sent to Fetch unit".
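The two BTB-hit cases, with the suggested wording for the second, can be sketched as a small selection function. This is only an illustration: `NextFetch` and `selectNextFetch` are hypothetical names, the fall-through address is assumed to be computed by the caller, and treating a BTB miss the same as not-taken is an assumption, since the quoted text does not cover that case.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical result handed back to the Fetch unit.
struct NextFetch {
    uint64_t addr;  // address Fetch should continue from
    bool taken;     // true if redirected to a predicted branch target
};

// btb_target holds the predicted target on a BTB hit and is empty on a miss.
// fallthrough is the next sequential fetch address (assumed to come from the caller).
inline NextFetch selectNextFetch(std::optional<uint64_t> btb_target,
                                 bool base_predicts_taken,
                                 uint64_t fallthrough) {
    // BTB hit and BasePredictor predicts taken: send the BTB target to Fetch.
    if (btb_target.has_value() && base_predicts_taken) {
        return NextFetch{*btb_target, true};
    }
    // BTB hit but predicted not taken: send the fall-through address.
    // Assumption: a BTB miss also falls through.
    return NextFetch{fallthrough, false};
}
```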
1. `const uint32_t pht_size_`
2. `const uint8_t ctr_bits_`
3. `const uint8_t ctr_bits_val_`
4. `std::map<uint64_t, uint8_t> pht_`
I believe PHT will use a `uint32_t` index representing a hash of PC and global history. I think this is noted by the use of `uint32_t` in the increment, decrement, and getPrediction functions noted below.
Should the key/index for the PHT `std::map` here also be a `uint32_t`? (e.g. `std::map<uint32_t, uint8_t> pht_`)
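For illustration, a PHT keyed by `uint32_t` could look like the sketch below. The data member names follow the quoted diff, but the class name, the method names, the meaning of `ctr_bits_val_` (taken here to be the maximum counter value), the initial counter value, and the wrap-around of the index to the table size are all assumptions.

```cpp
#include <cstdint>
#include <map>

// Sketch of a pattern history table of saturating counters.
// Assumes pht_size > 0 and 1 <= ctr_bits <= 8.
class PatternHistoryTable {
public:
    PatternHistoryTable(uint32_t pht_size, uint8_t ctr_bits)
        : pht_size_(pht_size),
          ctr_bits_(ctr_bits),
          ctr_bits_val_(static_cast<uint8_t>((1u << ctr_bits_) - 1)) {}

    // Saturating increment: never exceeds the maximum counter value.
    void incrementCounter(uint32_t idx) {
        uint8_t& ctr = entry(idx);
        if (ctr < ctr_bits_val_) {
            ++ctr;
        }
    }

    // Saturating decrement: never drops below zero.
    void decrementCounter(uint32_t idx) {
        uint8_t& ctr = entry(idx);
        if (ctr > 0) {
            --ctr;
        }
    }

    // Predict taken when the counter is in the upper half of its range.
    bool getPrediction(uint32_t idx) {
        return entry(idx) > ctr_bits_val_ / 2;
    }

private:
    // Look up an entry, creating it in the weakly-not-taken state on first use.
    // The index wraps to the table size, so distinct hashes can alias.
    uint8_t& entry(uint32_t idx) {
        const uint8_t initial = static_cast<uint8_t>(ctr_bits_val_ / 2);
        return pht_.try_emplace(idx % pht_size_, initial).first->second;
    }

    const uint32_t pht_size_;
    const uint8_t ctr_bits_;
    const uint8_t ctr_bits_val_;         // assumed: maximum counter value
    std::map<uint32_t, uint8_t> pht_;    // key type matches the index type
};
```

Using the same `uint32_t` type for the map key and for the method parameters avoids an implicit widening at every lookup and makes the intended index width explicit.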
==== Private data members
1. `uint32_t tage_bim_max_size_` - Represents maximum size of the BIM table of TAGE
2. `uint8_t tage_bim_ctr_bits_` - Represents the number of bits used in counter of BIM table
3. `std::vector<uint8_t> Tage_Bimodal_` - Represents the container used for BIM in the
Should this be `uint32_t` if `tage_bim_table_size_` is a `uint32_t` as defined above?
From today's call (2025.06.19), regarding changes to the API to support staged predictions. The short of it is: I think we want to expose signals in the API to allow override of previous predictions. This has some implications for what is behind the BP-API.

In recent open source designs there is the concept of staged prediction: the uBTB/LP/TAGE/SC/ITTAGE all deliver predictions, some with longer latency. Not all of these will exist, and not all latency values will be different.

In terms of the visible methods in the BP-API, I think you want to add a method to signal an override to the front-end logic and pass a struct with override info: BP request queue idx, basic block address, new prediction, and maybe an extension for debug/metadata info.

I am making an assumption that the decision to signal an override has been made behind the BP-API. That is a topic for discussion, but it seems like a way to allow the greatest freedom and generality for developers to explore choices while keeping the API smaller. You could imagine I would test the effectiveness of a tournament selector vs. simply always choosing the latest prediction that is different from an earlier one, and these differences would not need to be exposed through the API.

One other thing I did not mention in the call: I believe the BP-API will want to support multiple prediction requests, updates, and returns. If you read the trade press, 2 or 4 predictions at a time will become the bar. I believe this just means that request/result data is grouped. Not a significant change on the surface, but it occurred to me while writing.
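A minimal sketch of what such an override signal might look like is below. None of this is the existing BP-API: `PredictionOverride`, `StagedOverrideNotifier`, and their members are hypothetical, and the policy shown (override whenever the later stage disagrees with the earlier one) is just the simpler of the two options mentioned above, kept behind the interface.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical override payload, following the fields listed in the comment.
struct PredictionOverride {
    uint32_t request_queue_idx;  // BP request queue index being overridden
    uint64_t basic_block_addr;   // basic block address of the request
    bool new_taken;              // new predicted direction
    uint64_t new_target;         // new predicted target
    std::string debug_info;      // optional debug/metadata extension
};

// Hypothetical piece of logic living behind the API boundary.
class StagedOverrideNotifier {
public:
    using Callback = std::function<void(const PredictionOverride&)>;

    // The front end registers how it wants to be told about overrides.
    void setCallback(Callback cb) { callback_ = std::move(cb); }

    // Called when a later (slower) stage delivers its result.
    // Assumed policy: signal an override only when the direction differs.
    // Returns true if an override was signalled.
    bool onLateResult(uint32_t idx, uint64_t bb_addr, bool early_taken,
                      bool late_taken, uint64_t late_target) {
        if (late_taken == early_taken || !callback_) {
            return false;
        }
        callback_(PredictionOverride{idx, bb_addr, late_taken, late_target,
                                     "late stage disagreed"});
        return true;
    }

private:
    Callback callback_;
};
```

Swapping the policy for a tournament selector would change only `onLateResult`, which is the point of keeping the decision behind the API. Grouping several such structs per cycle would be one way to cover the multiple-prediction case.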
I added support for 2T/nT (multiple predictions) and staged prediction results to Arup's BP interface class. This is for discussion. I put this in a draft PR, which seemed easier.
This is the design document for the branch prediction unit to be added to the Olympia simulator. It aims to give an overview of the micro-architectural and implementation details of the BPU.
The document will be updated further as development progresses. Any suggestions are appreciated.