Skip to content

Enable RAS manager for fatal and runtime errors #102

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 21, 2025
Merged

Conversation

aasboddu
Copy link
Collaborator

@aasboddu aasboddu commented Jan 23, 2025

Enable RAS manager for fatal and runtime errors

APML RAS Manager Initialization:

  • Added initialization for APML RAS Manager.
  • Included conditional compilation for APML support.
  • Added a placeholder error log for PLDM RAS capabilities, indicating
    that they are yet to be enabled.
  • The init function repeatedly attempts to get the BMC RAS OOB
    configuration until successful.
  • The function initializes the platform with the block ID's that needs
    to be harvested during a crashdump and sets up a D-Bus match to
    monitor watchdog state changes to monitor BIOS post complete.
  • It reads CPU IDs for all CPUs and logs errors on failure.
  • The init function also creates separate polling threads for MCA,
    DRAM CECC and PCIE AER error monitoring.
  • The function also handles BIOS post-completion by configuring PCIE
    OOB settings and enabling PCIE error thresholds based on watchdog
    timer changes.
  • It also clears SbrmiAlertMask register so APML_ALERT_L will be
    asserted during a syncflood in the system.
  • The commit has oem_cper.h providing the outline of file format for
    both runtime and crashdump CPER records.
  • Added additional json config parameters to enable OOB registers
    during initialization.
  • Overall , this commit provides all the necessary preps needed to
    enable the crashdump flow and runtime error monitoring.

Crashdump monitoring:

  • This commit introduces the handling of GPIO events for P0 and
    P1 APML alerts
  • Binds the P0 alert event handler and P1 alert evernt handler
    to manage these alerts.
  • Read RAS status register and check for errors.
  • Log and send alerts for various RAS errors including:
    • SYS_MGMT_CTRL_ERR: Trigger cold reset based on policy.
    • RESET_HANG_ERR: Suggest manual immediate reset.
    • FATAL_ERROR: Harvest MCA data and reset based on policy.
    • MCA_ERR_OVERFLOW: Log MCA runtime error counter overflow.
    • DRAM_CECC_ERR_OVERFLOW: Log DRAM CECC runtime error
      counter overflow.
    • PCIE_ERR_OVERFLOW: Log PCIE runtime error counter overflow.

CPER record generation:

  • Add functionality to generate Common Platform Error Record (CPER)
    entries when a FATAL_ERROR is detected.
  • It also creates CPER record when a runtime MCA , DRAM CECC and
    PCIE AER error is reported via polling or threshold overflow.
  • The system stores maximum of 10 CPER records in BMC including
    fatal and runtime errors.
  • Create D-Bus object paths for each CPER file in the system,
    allowing for download via redfish.
  • Update properties for each CPER object, including Filename,
    Log, and Timestamp.

root@morocco-d89c:~# busctl tree com.amd.RAS
- /com - /com/amd
- /com/amd/RAS - /com/amd/RAS/0

Signed-off-by: aasboddu [email protected], Abinaya Dhandapani [email protected]

@@ -0,0 +1,61 @@
#pragma once

#include "libcper/Cper.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still open discussion to use libcper library or copy the contents of libcper/Cper.h to amd-ras repo is still under discussion.

@supven01 Can you please provide your inputs

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing this comment as it is finalized to use the standaed cper.h file from libcper library

@abinayaddhandapani abinayaddhandapani force-pushed the review branch 3 times, most recently from 5ce505a to 6f25110 Compare February 10, 2025 19:30
@abinayaddhandapani abinayaddhandapani force-pushed the review branch 2 times, most recently from 1c1b01f to ad056d4 Compare February 11, 2025 10:47
@abinayaddhandapani abinayaddhandapani changed the title Enable APML RAS Manager Initialization Enable Fatal error monitoring and CPER file creation Feb 11, 2025
constexpr uint16_t PCIE_VENDOR_ID = 0x1022;
constexpr int MINOR_REVISION = 0xB;

/*template void dumpHeaderSection(const std::shared_ptr<FatalCperRecord>& data,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this commented code left out for debug purposes?

Copy link
Collaborator

@abinayaddhandapani abinayaddhandapani Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved to cper_tul.cpp file.
Will delete these commented out lines.

{
if (*systemRecovery == "WARM_RESET")
{
if ((buf & 0x4)) // SYS_MGMT_CTRL_ERR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add bitmask macro

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@abinayaddhandapani abinayaddhandapani force-pushed the review branch 2 times, most recently from a4d00b8 to 78a6f41 Compare February 18, 2025 12:15
@abinayaddhandapani abinayaddhandapani force-pushed the review branch 2 times, most recently from 83fae27 to ae7a5f1 Compare February 26, 2025 16:34
@abinayaddhandapani abinayaddhandapani changed the title Enable Fatal error monitoring and CPER file creation Enable RAS manager for fatal and runtime errors Feb 26, 2025
CRASHDUMP_T CrashDumpData[32];
DF_DUMP DfDumpData;
UINT32 Reserved1[96];
UINT32 DebugLogIdData[12124];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of generic things:

  1. please define macros for all the array sizes of various members of these structures in this file.
  2. Is it OK to have the typdes as all capital letters?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Ack. Replaced all the array sizes with Macros.
  2. The oem_cper.h is an extension of https://github.com/openbmc/libcper/blob/main/include/libcper/Cper.h with few additional structures appended as per the AMD requirement. So followed the same format as of Cper.h file in libcper library as it is defined under "extern C"

Manager(amd::ras::config::Manager&);

int errCount = 0;
uint8_t cpuCount;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my be late in the game. If I were you, I would make these members as private, and write get rouines for each of them, and initialize these variables as part of the constructor instead of initializing them in the getSocketInfo() method

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Moved these variables to protected and initialized these variables in the constructor.
I opted to keep the variables as protected instead of private because I need to use base class pointer to point to objects of the derived class.

* @return On failure to create index file or read CPER index number
* throw std::runtime_error exception.
*/
void createCperIndexFile();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be part of cper_util.hpp?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

*/
class CrashdumpInterface : public CrashdumpBase
{
public:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the coding guideline? is it public first, or private first? not a major concern..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other openbmc repo's follow public , protected and private order.
Opted the same here as well

* This function is invoked when an alert event occurs on P0. The function
* handles the event by processing the necessary response.
*/
void p0AlertEventHandler();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p0AlertEventHandler(), and p1AertEventHandler() code will be similar, I am assuming. Instead, if we can have this Handler as a routine which would take the p0/p1 specifics as input, this code will be cleaner, and if we have to support p3, we are ready.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

* This GPIO line is used to detect hardware alerts for P0 and trigger
* events for processing.
*/
gpiod::line p0apmlAlertLine;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAS is not tested in 2P verona platforms as APML works only for 1 socket.
Not sure having one gpioline for both P0 and P1 will have any issues in the gpio monitoring.
I will mark this as TODO and change it once it is verified in Venice.

}
}

void triggerWarmReset()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a SEL log on any type of host reset might help during debug.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SEL events will be logged for DC power off and ON as part of the x86-power-control application.

{
bool ret = false;
int socNum = 0;
uint32_t tempVar[2][8];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shoud the socNum 2 be a macro/const?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

errorMgr->configure();
#endif

#ifdef PLDM

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see if you can make this as a variable based check? That is how it is going to be. if not now, may be later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. This is a TODO item once the flow is finalized for PLDM. I will revert this and change it to variable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

Copy link

@ojayanth ojayanth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like lot of places coding guidelines is not aligned to openbmc . Please refer , #99 and #98 reveiw comments and update accordingly. Not looked Cpp files , will start reviewing cpp files only after addressing generic comments.

@@ -4,7 +4,7 @@
"ApmlRetries": {
"Description": "Number of APML retry count",
"Value": 10,
"MaxBoundLimit": "50"
"MaxBoundLimit": 50
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a good practice to fix bugs in this commit , please add new commit and fix/change this if it is not introduced with this commit specific chnages.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This is a bug in the previous code. The maxBoundLimit should be integer not a string.
I will revert this fix in this commit and create a new commit.

},
{
"DramCeccPollingEn": {
"Description": "If this field is true, DRAM Cecc correctable errors will be polled.",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cecc stands for Dynamic Random-Access Memory Correctable Error Correction Code, why you are using Cecc correctable errors here ? please repharse with Cecc exanded form.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the expansion during first time or usage and acronym is used for further places

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From PPR document:

DRAM CECC: Returns
Error count, counter info and
corresponding UMC_MCA
information from the valid
DRAM ECC Correctable
Error counter instance

Hence updating the description to "DRAM ECC Correctable Error" in the description

},
{
"PcieAerPollingEn": {
"Description": "If this field is true, PCIE AER correctable errors will be polled.",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit inclued AER expaned form.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

},
{
"McaThresholdEn": {
"Description": "If this field is true, error thresholding is enable for MCa errors",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why MCa?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

},
{
"McaThresholdEn": {
"Description": "If this field is true, error thresholding is enable for MCa errors",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

period is missing at end. Be consistent .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased this to "Polling interval in seconds for MCA errors ".
Removed period at the end for all the description to maintain the consistency with the previous commit.

},
{
"DramCeccThresholdEn": {
"Description": "If this field is true, error thresholding is enable for DRAM CECC errors.",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If this field is true, error thresholding for MCA errors will be enabled.",

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

namespace ras
{
namespace apml
{
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General scope :

Please revisit the documentation and update according to OpenbMC guidelines.
Documentation: Improved documentation for clarity and consistency.
Parameter Names: Added parameter names in function declarations for better readability.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All classes are ordered in the public/protected/private sequence.
Replaced uint8_t with size_t wherever possible.
All variable names are in lowerCamelCase.
All functions are named in lowerCamelCase, except those called from the APML library, which use lower_snake_case.
All structs, classes, enums, and templates are in UpperCamelCase, except structures from the APML library, which follow lower_snake_case.
All constants are in lowerCamelCase.
APIs under a top-level namespace do not use the namespace names again in the API names.
Replaced all const int with fixed-width integer types.
Replaced all static const std::string with constexpr std::string_view.
Consistency is maintained across files in the comment descriptions of functions. This consistency is maintained file to file.

o/p of format-script.sh

Formatting code under /mnt/work2/adhandap/Code/amd-bmc-ras
eslint: cannot find .eslintrc.json; using /mnt/work2/adhandap/Code/openbmc-build-scripts/config/eslint-global-config.json
markdownlint: cannot find .markdownlint.yaml; using /mnt/work2/adhandap/Code/openbmc-build-scripts/config/markdownlint.yaml
Running commit_gitlint
Running commit_spelling

codespell-dictionary - misspelling count >> 0
generic-dictionary - misspelling count >> 0
Running clang_format
Running eslint
Running markdownlint
Running prettier
config/ras_config.json 54ms (unchanged)
config/gpio_config.json 2ms (unchanged)
README.md 34ms (unchanged)
.prettierrc.yaml 21ms (unchanged)
Result differences...
Format: PASSED

* This GPIO line is used to detect hardware alerts for P1 and trigger
* events for processing.
*/
gpiod::line p1apmlAlertLine;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard coding p0 /p1 approch is not scalable , Need better way to handle this instead of hard coding p0 and p1.

Like gpio line should be enabled based on config types.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added gpio_config.json which contains list of ALERT_L gpio's and this is handled dynamically by reading the file.

static const std::string runtimeDramErr = "RUNTIME_DRAM_ERROR";
static const std::string fatalErr = "FATAL";

namespace amd
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you use namespace amd {
namespace ras {
namespace cper {
namespace util { approch you don't need to use cper name attachment in the below defintion . please reveiw and update .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

namespace cper_util
{

constexpr int SEV_NON_FATAL_UNCORRECTED = 0;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General: Using int for constants that represent specific sizes or types can lead to ambiguity and potential issues. It's better to use fixed-width integer types from to ensure the correct size and type. Here's the corrected version:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

},
{
"DramCeccPollingEn": {
"Description": "If this field is true, Dynamic Random Accecc Memory Error Correction Code (DRAM ECC) correctable errors will be polled",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix typo" Accecc"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@@ -0,0 +1,3 @@
{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename the file name to amd_ras_gpio_config.json.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@@ -0,0 +1,302 @@
#include "error_monitor.hpp"

extern "C"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scope: All hpp/cpp files.
Include Only What You Need: Only include headers that are necessary for the file to compile. Avoid unnecessary includes to reduce compilation time.
Self-Sufficient Headers:Each header file should be able to compile on its own
Use Include Guards #pragma once ( hpp).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

{
public:
Manager() = delete;
~Manager() = default;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually Constructors put in one block and the detsructor.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

{
namespace apml
{
class Manager : public amd::ras::Manager

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class definition is missing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

* @details This function configures a GPIO line and stream descriptor to
* listen for events. It triggers the provided callback function upon event
* detection.
*/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete function definitions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

*
* @details This function is invoked when an alert event occurs on P0 or P1.
* The function handles the event by processing the necessary response.
*/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete function definition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

std::mutex dramErrorHarvestMtx;
std::mutex pcieErrorHarvestMtx;

/** @brief Update processor OOB configuration.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Better to follow the standard function descriptioms.consitent with public one.

sdbusplus::asio::object_server& objectServer;
std::shared_ptr<sdbusplus::asio::connection>& systemBus;

uint8_t progId;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please group all mutex varibale toogether and and make sure these mutexes initialized as part of constructor.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

meson.build Outdated
@@ -11,12 +11,27 @@ project(
)

config_file = '/var/lib/amd-bmc-ras/ras_config.json'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move these to meson.options

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@@ -0,0 +1,83 @@
#include "dbus_util.hpp"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source path is already under utils , file name rename to dbus.cpp

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer valid as all the utils are clubbed togerther in util.hpp and cper_util.hpp
However, renamed the cper_util.hpp and platform_util.hpp to util/cper.hpp and util/platform.hpp

@@ -0,0 +1,83 @@
#include "dbus_util.hpp"

namespace amd

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name space should be amd::ras::util::dbus

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Renamed the gpio_config.json to amd_ras_gpio_config.json
  2. Added documentation to all the API's
  3. Only included headers that are necessary for the file to compile.
  4. Moved the AMD CRB specifc changes to utils/platform.hpp
  5. Changed to a common util file-based approach instead of having multiple util files.

APML RAS Manager Initialization:

- Added initialization for APML RAS Manager.
- Included conditional compilation for APML support.
- Added a placeholder error log for PLDM RAS capabilities, indicating
  that they are yet to be enabled.
- The init function repeatedly attempts to get the BMC RAS OOB
  configuration until successful.
- The function initializes the platform with the block ID's that needs
  to be harvested during a crashdump and sets up a D-Bus match to
  monitor watchdog state changes to monitor BIOS post complete.
- It reads CPU IDs for all CPUs and logs errors on failure.
- The init function also creates separate polling threads for MCA
  errors, DRAM CECC errors and PCIE advanced error reporting.
- The function also handles BIOS post-completion by configuring PCIE
  OOB settings and enabling PCIE error thresholds based on watchdog
  timer changes.
- It also clears SbrmiAlertMask register so APML_ALERT_L will be
  asserted during a syncflood in the system.
- The commit has oem_cper.h providing the outline of file format for
  both runtime and crashdump CPER records.
- Added additional json config parameters to enable OOB registers
  during initialization.
- Overall , this commit provides all the necessary preps needed to
  enable the crashdump flow and runtime error monitoring.

Crashdump monitoring:

- This commit introduces the handling of GPIO events for APML alerts
  by identifying the GPIO's from the gpio_config.json
- Binds the P0 alert event handler and P1 alert evernt handler
  to manage these alerts.
- Read RAS status register and check for errors.
- Log and send alerts for various RAS errors including:
  - SYS_MGMT_CTRL_ERR: Trigger cold reset based on policy.
  - RESET_HANG_ERR: Suggest manual immediate reset.
  - FATAL_ERROR: Harvest MCA data and reset based on policy.
  - MCA_ERR_OVERFLOW: Log MCA runtime error counter overflow.
  - DRAM_CECC_ERR_OVERFLOW: Log DRAM CECC runtime error
    counter overflow.
  - PCIE_ERR_OVERFLOW: Log PCIE runtime error counter overflow.

CPER record generation:

- Add functionality to generate Common Platform Error Record (CPER)
  entries when a FATAL_ERROR is detected.
- It also creates CPER record when a runtime MCA , DRAM CECC and
  PCIE advanced error reporting is reported via polling or
  threshold overflow.
- The system stores maximum of 10 CPER records in BMC including
  fatal and runtime errors.
- Create D-Bus object paths for each CPER file in the system,
  allowing for download via redfish.
- Update properties for each CPER object, including Filename,
  Log, and Timestamp.

root@morocco-d89c:~# busctl tree com.amd.RAS
`- /com
  `- /com/amd
    `- /com/amd/RAS
      `- /com/amd/RAS/0

Verified in Venice SOC 1P system.

Signed-off-by: aasboddu <[email protected]>, Abinaya Dhandapani <[email protected]>
@@ -0,0 +1,17 @@
{
"CpuCount": 1,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future, we should change it to 2.

{
"DramCeccPollingPeriod": {
"Description": "Polling interval in seconds for DRAM ECC correctable errors",
"Value": 5
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the next commit, please disable polling and enable threshold interrupts for DRAM CECC.

}
},
{
"McaPollingEn": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the next commit, please disable polling and enable threshold interrupts for MCA.

}
},
{
"PcieAerPollingPeriod": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the next commit, please disable polling and enable threshold interrupts for PCIe AER.

@@ -19,6 +24,20 @@ int main()

amd::ras::config::Manager manager(objectServer, systemBus);

#ifdef APML
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove #ifdef in the future commit

errorMgr->configure();
#endif

#ifdef PLDM
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

Copy link
Collaborator

@supven01 supven01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit has been a stellar example of how the reviews need to happen. Thanks Jayanth, Mahesh for the thorough reviews and Abinaya for addressing them diligently.

@supven01 supven01 merged commit 5f87bcf into integ_sp7 May 21, 2025
@supven01 supven01 deleted the review branch May 21, 2025 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants