Design document of the Branch Prediction Unit #233


Draft pull request · wants to merge 18 commits into master

Conversation

dragon540 (Collaborator)

This is the design document for the branch prediction unit to be added to the Olympia simulator. This document aims to give an overview of the micro-architectural and implementation details of the BPU.

The document will be further updated as development progresses. Any suggestions are appreciated.

@dragon540 dragon540 marked this pull request as draft December 2, 2024 16:29
@jeffnye-gh (Collaborator) left a comment


There are organizational changes I would suggest; I would compare this to the template. I would like a better understanding of your class sets, their hierarchy, and their methods.

I would like to get a better understanding of how you intend the unit to operate and a sense of how much code needs to be written, with some mention of the existing classes and APIs.

Global comments:
* Take advantage of the adoc syntax when inserting code fragments to help readability.
* Be sure to document your material references when external information, such as figures, is used.

outcomes and branch targets before branch instructions are actually
resolved in the pipeline in order to reduce latency between a branch and a
subsequent instruction.


"in order to reduce latency between a branch and a subsequent instruction."

This is unclear. For this kind of introductory material, there are a number of BPU descriptions available on the web that you can adapt. Use your own style, but for example:

"This prediction allows the processor to prefetch and execute subsequent instructions without waiting to resolve the branch condition, minimizing stalls and maintaining high throughput in the pipeline."

|0.2 | 2024.11.18 | Shobhit Sinha | BPU design documentation
|0.1 | 2024.11.12 | Jeff Nye | initial template
|===

@jeffnye-gh (Collaborator) Dec 4, 2024

Keep the revision history updated with a summary of changes.

There are no change bars in adoc, so this summary saves time re-reading unchanged portions of the document.

=== Overview Block Diagram

image:media/bpu_overview.png[image,width=576,height=366]


You might consider removing this diagram and/or moving the subsequent diagram to this location; the second diagram conveys more information.

Also, if you are using diagrams from the web, be sure to quote the references.

first tier provides a simple but fast prediction. The second tier consists
of a more accurate predictor which can predict even complex branches but takes an
additional cycle.


"Branch Prediction Unit in Olympia is a two-tiered branch predictor"

Are you describing the current state of Olympia or the future state with this design completed?

slots.

** `out_bpu_prediction_req` - in Fetch.cpp. To send PredictionInput to BPU.


There should be a reference to the existing BPU API found in BranchPredIF.hpp and how these ports interact with that API. The intent of the API is to allow parallel development of other BPUs with a common interface.

** `pred_req_num` - Total number of prediction requests made to BPU
** `num_mispred` - Total number of mis-predictions
** `mispred_perc` - Percentage of mis-predictions


I would expect stats/counters for most of the structures in the BPU. These statistics are often used to assess the quality/configuration of a branch predictor implementation, but also to support application analysis.

Consider adding these stats/counters:
* prediction hit/miss rates by conditional branch type and direction
* structure-specific stats such as PHT hit/miss/aliasing conflicts; similarly for the BTB
* histograms of entry usage, PHT entries for example
* RAS high/low water marks, RAS utilization
* global distribution of taken/not taken, speculative and resolved T/NT
* some measure of false sharing in the tables

TAGE will add additional stats/counters. I would mention this, but I think you could leave the exact list for the future.
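As a concrete sketch of what such a counter set might look like, grouped roughly by structure (all names here are illustrative, not Olympia's actual sparta counters):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of per-structure BPU statistics; field names are
// hypothetical and do not correspond to Olympia's actual counters.
struct BpuStats
{
    // global prediction stats
    uint64_t pred_req_num = 0;  // total prediction requests made to BPU
    uint64_t num_mispred  = 0;  // total mispredictions

    // structure-specific stats
    uint64_t pht_hits            = 0;
    uint64_t pht_misses          = 0;
    uint64_t pht_alias_conflicts = 0;
    uint64_t btb_hits            = 0;
    uint64_t btb_misses          = 0;

    // RAS utilization
    uint32_t ras_high_water = 0;
    uint32_t ras_low_water  = 0;

    // speculative vs resolved taken/not-taken distribution
    uint64_t spec_taken = 0, spec_not_taken = 0;
    uint64_t resolved_taken = 0, resolved_not_taken = 0;

    // derived stat, matching the mispred_perc counter in the document
    double mispredPercent() const
    {
        return pred_req_num == 0
             ? 0.0
             : 100.0 * static_cast<double>(num_mispred) / pred_req_num;
    }
};
```

In a sparta-based model these would presumably become `sparta::Counter`/`StatisticDef` objects rather than a plain struct; the sketch is only meant to enumerate the categories.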

image:media/bpu_uarch.png[image,width=800,height=366]

Figure 2 - Unit block diagram of BPU


See the previous comment; perhaps replace the previous figure with this one.


Olympia's Branch Predictor API intends the implementation of Branch Predictor to
define custom PredictionInput


I do not understand this sentence.

PredictionInput is received by the BPU from the Fetch unit whenever a request for
the prediction is made. Based on the data provided by this input, BPU makes the
prediction.


See the previous comment on the existing BPU API; this design should conform to that API.

If you propose future changes to that API, it is best to put that in a section of its own.

define custom PredictionInput

[[Overview_of_PredictionInput]]
=== Overview

Are there header formatting issues with the document? The organization from this point on does not follow a progression I understand.

@jeffnye-gh (Collaborator)

This is similar to Shobhit's other PR. Not expected to compile.

We asked that interns post their code, in whatever state, as draft PRs at the end of the internship so as not to lose any progress.

@knute-mips (Contributor)

Shobhit, can you pull the latest master into your branches? CI should pass now.

@dragon540 dragon540 requested review from arupc and klingaard March 16, 2025 22:49
@dragon540 (Collaborator, Author)

@klingaard @arupc this document is still a work in progress; however, some details are much clearer than before. Can you take a look at this, and check if the high-level design of the unit (interaction of the BPU with Fetch and the FTQ, organization of its constituent predictors, etc.) looks okay?

@klingaard (Collaborator)

> @klingaard @arupc this document is still a work in progress; however, some details are much clearer than before. Can you take a look at this, and check if the high-level design of the unit (interaction of the BPU with Fetch and the FTQ, organization of its constituent predictors, etc.) looks okay?

Can do!

@oms-vmicro

@dragon540 I know some of the code development is already being done in #243, but I am providing some feedback specific to the design document.

Overall I think the document is well done and covers a lot of the components one would like to see in a higher-performance BPU and front-end.

At a high level, my understanding from the design doc is the BranchPredIF.hpp interface is intended to be used for the BPU unit. PredictionRequest, PredictionOutput, and UpdateInput correspond to the interface "prediction input", "prediction output" and "update input" types from the interface.

I don't believe the BranchPredIF.hpp interface is currently being used by the individual branch predictors themselves, though (e.g. PHT, BTB, TAGE, etc.). I believe the intent behind the interface is that individual predictors can and should use the full interface as well. Specifically, inputs and outputs to the individual predictors should be templatized so that they can be redefined as needed. The existing predictors demonstrate where this would be useful, as the PHT takes a hash of PC and global history as input, versus the PC being used for the other predictors. Using predictor-specific PredictionRequest, PredictionOutput, and UpdateInput types, a generic interface could be used with any individual predictor and defined to suit that predictor (e.g. PHT -> hash of PC & global history, BTB -> PC). They can then be defaulted as currently defined by the design doc. This would make for a less "hardened" approach to interfacing with the individual predictors, along with the interfaces to the overall BPU unit.
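To illustrate the idea, a templated per-predictor interface might look like the sketch below. This is not the actual contents of BranchPredIF.hpp; the class, struct, and method names are hypothetical, chosen only to show a PHT keyed on a PC/history hash next to a BTB keyed on the PC.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a templated per-predictor interface; names are illustrative,
// not the actual BranchPredIF.hpp declarations.
template <typename PredictionRequestT,
          typename PredictionOutputT,
          typename UpdateInputT>
class BranchPredictorIF
{
  public:
    virtual ~BranchPredictorIF() = default;
    virtual PredictionOutputT getPrediction(const PredictionRequestT &) = 0;
    virtual void updatePredictor(const UpdateInputT &) = 0;
};

// A PHT could then key on a hash of PC and global history...
struct PhtRequest { uint32_t pc_ghr_hash; };
struct PhtOutput  { bool taken; };
struct PhtUpdate  { uint32_t pc_ghr_hash; bool actually_taken; };

// ...while a BTB keys directly on the PC.
struct BtbRequest { uint64_t pc; };
struct BtbOutput  { bool hit; uint64_t target; };
struct BtbUpdate  { uint64_t pc; uint64_t target; };

// Minimal PHT instance of the interface, using 2-bit saturating counters.
class SimplePht : public BranchPredictorIF<PhtRequest, PhtOutput, PhtUpdate>
{
  public:
    PhtOutput getPrediction(const PhtRequest & req) override
    {
        // Counter values 2 and 3 predict taken.
        return { ctr_[req.pc_ghr_hash % kSize] >= 2 };
    }
    void updatePredictor(const PhtUpdate & upd) override
    {
        uint8_t & c = ctr_[upd.pc_ghr_hash % kSize];
        if (upd.actually_taken) { if (c < 3) { ++c; } }
        else                    { if (c > 0) { --c; } }
    }
  private:
    static constexpr uint32_t kSize = 1024;
    uint8_t ctr_[kSize] = {};  // initialized strongly not-taken
};
```

A BTB class would instantiate the same template with its own request/output/update types, which is the "redefined as needed" property described above.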

In a similar vein, if there is a way to enable or disable individual predictors via the command line or configuration, I think that would be helpful. That is, an ability to define the collection of predictors implemented in the BPU would be useful, along with a modular branch-predictor approach, so that a configuration file can specify the collection of predictors implemented in the BPU (e.g. a BOOM uarch configuration). One could then enable, disable, or even add modular predictors selected via configuration. A simple example with the existing design: evaluating TAGE with and without the Statistical Corrector predictor. This would likely mean having classes for the individual predictors and then parent classes coupling them into "BasePredictor" and "TAGE_SC_L".

  • I realize that brings up the question of how the design functions / what assumptions can be made if someone were to disable all individual predictors, but I think we can prevent that and assume there will be one "base" (simple) and possibly one "TAGE_SC_L" (complex) predictor.


[[Return_Address_Stack]]
== Return Address Stack


I know BOOM uses a similar name for their return address predictor, but we might want to consider avoiding the acronym RAS.

When I see the term RAS with respect to computing, I think of features related to "Reliability, Availability, and Serviceability" (wiki). You'll find the acronym used in this manner in a variety of places:

Not dead set on this, but to potentially avoid confusion we could consider naming it something different. Some suggestions:

=== Functional Description

The proposed Branch Prediction Unit (BPU) is a two-tiered branch predictor where the
first tier(BasePredictor) provides a simple but fast prediction, whereas the second tier(TAGE_SC_L)


This is a suggestion, but I wanted to see if we would want to use terms other than "first" and "second" to describe the branch predictor tiers here and in the interfaces/functions below.

The thought was that there may be a more defined way of describing these tiers that is still generic but provides some guidance as to which predictors fit in which tier. My suggestion would be to call them "fast" vs "slow", or "single-cycle" vs "multi-cycle", as I believe first-tier predictors are expected to complete in a single cycle whereas second-tier predictors take more than one cycle.

to the BTB, BasePredictor and the TAGE_SC_L-Predictor.
*** If it is a hit on BTB, and the BasePredictor predicts a taken
branch, then the output is sent to Fetch unit
*** If it is a hit on BTB, but


Complete the sentence. Possibly: "... but the BasePredictor predicts a not-taken branch, then the fall-through is sent to the Fetch unit".
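The first-tier decision being described might be sketched as follows. This is only an illustration of the hit/taken cases from the bullets above; the function and struct names are hypothetical, not Olympia's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the first-tier BTB + BasePredictor decision.
struct FirstTierResult
{
    bool     taken;
    uint64_t target;  // predicted target, or the fall-through address
};

inline FirstTierResult firstTierPredict(bool     btb_hit,
                                        uint64_t btb_target,
                                        bool     base_predicts_taken,
                                        uint64_t fall_through_pc)
{
    // Hit on BTB and BasePredictor predicts taken: send target to Fetch.
    if (btb_hit && base_predicts_taken) {
        return { true, btb_target };
    }
    // Hit on BTB but predicted not taken (or a BTB miss):
    // the fall-through address is sent to Fetch.
    return { false, fall_through_pc };
}
```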

1. `const uint32_t pht_size_`
2. `const uint8_t ctr_bits_`
3. `const uint8_t ctr_bits_val_`
4. `std::map<uint64_t, uint8_t> pht_`


I believe PHT will use a uint32_t index representing a hash of PC and global history. I think this is noted by the use of uint32_t in the increment, decrement, and getPrediction functions noted below.

Should the key/index for the PHT std::map here also be a uint32_t? (e.g. std::map<uint32_t, uint8_t> pht_)
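A minimal sketch of the indexing in question, assuming (as the uint32_t in the increment/decrement/getPrediction functions suggests) a hash of PC and global history as the key. The hash function and method names here are illustrative, not the document's actual definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Sketch of a PHT keyed by a uint32_t hash of PC and global history,
// matching the uint32_t used by the increment/decrement/getPrediction
// functions. Names are illustrative.
class Pht
{
  public:
    static uint32_t hashIndex(uint64_t pc, uint32_t ghr)
    {
        // gshare-style: XOR-fold the (word-aligned) PC with global history
        return static_cast<uint32_t>(pc >> 2) ^ ghr;
    }
    uint8_t getCounter(uint32_t idx) { return pht_[idx]; }
    void incrementCounter(uint32_t idx)
    {
        if (pht_[idx] < ctr_max_) { ++pht_[idx]; }
    }
    void decrementCounter(uint32_t idx)
    {
        if (pht_[idx] > 0) { --pht_[idx]; }
    }
  private:
    const uint8_t ctr_max_ = 3;        // 2-bit saturating counter
    std::map<uint32_t, uint8_t> pht_;  // uint32_t key, per the comment above
};
```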

==== Private data members
1. `uint32_t tage_bim_max_size_` - Represents maximum size of the BIM table of TAGE
2. `uint8_t tage_bim_ctr_bits_` - Represents the number of bits used in counter of BIM table
3. `std::vector<uint8_t> Tage_Bimodal_` - Represents the container used for BIM in the


Should this be uint32_t if tage_bim_table_size_ is a uint32_t as defined above?

@jeffnye-gh (Collaborator)

jeffnye-gh commented May 19, 2025

From today's call 2025.06.19, regarding changes to the API to support staged predictions.

The short of it is: I think we want to expose signals in the API to allow override of previous predictions. This has some implications for what is behind the BP-API.

In recent open-source designs there is the concept of staged prediction: the uBTB/LP/TAGE/SC/ITTAGE all deliver predictions, some with longer latency. Not all of these will exist, and not all latency values will be different.

In terms of the visible methods in the BP-API, I think you want to add a method to signal an override to the front-end logic and pass a struct with override info: BP request queue index, basic block address, new prediction, and maybe an extension for debug/metadata info.

I am making an assumption that the decision to signal an override has been made behind the BP-API. That is a topic for discussion, but it seems like a way to allow the greatest freedom/generality to developers to explore choices while keeping the API smaller.

You could imagine I would test the effectiveness of a tournament selector vs. simply always choosing the latest prediction that differs from an earlier one; these differences would not need to be exposed through the API.

One other thing I did not mention in the call: I believe the BP-API will want to support multiple prediction requests, updates, and returns. If you read the trade press, 2 or 4 predictions at a time will become the bar. I believe this just means that request/result data is grouped. Not a significant change on the surface, but it occurred to me while writing.
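A sketch of what such an override hook might look like, combining the override struct described above with the grouped (2-or-4-wide) requests. All struct, class, and method names here are hypothetical, for discussion only, and are not part of the current BP-API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical override-info struct: the fields follow the list in the
// comment above (request queue index, basic block address, new
// prediction, debug/metadata extension).
struct PredictionOverride
{
    uint32_t bp_request_queue_idx;  // which in-flight request to override
    uint64_t basic_block_addr;      // basic block being re-steered
    bool     new_taken;             // new prediction direction
    uint64_t new_target;            // new predicted target
    uint32_t debug_meta = 0;        // optional debug/metadata extension
};

// Hypothetical front-end-facing interface. The decision of *when* to
// override stays behind the BP-API, as suggested above.
class FrontEndOverrideIF
{
  public:
    virtual ~FrontEndOverrideIF() = default;

    // Signal that a later-stage predictor (e.g. TAGE/SC) has overridden
    // an earlier prediction.
    virtual void signalOverride(const PredictionOverride &) = 0;

    // Grouped form, to support multiple predictions per cycle.
    virtual void signalOverrides(const std::vector<PredictionOverride> & v)
    {
        for (const auto & o : v) { signalOverride(o); }
    }
};
```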

@jeffnye-gh (Collaborator)

jeffnye-gh commented May 26, 2025

I can put this in a PR, but it's one file.

I added to Arup's BP interface class support for 2T/nT (multiple predictions) and staged prediction results. This is for discussion.

I put this in a draft PR, it seemed easier: #259
