-
Notifications
You must be signed in to change notification settings - Fork 5
Enable RAS manager for fatal and runtime errors #102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
include/oem_cper.hpp
Outdated
@@ -0,0 +1,61 @@ | |||
#pragma once | |||
|
|||
#include "libcper/Cper.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still open discussion to use libcper library or copy the contents of libcper/Cper.h to amd-ras repo is still under discussion.
@supven01 Can you please provide your inputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Closing this comment as it is finalized to use the standaed cper.h file from libcper library
5ce505a
to
6f25110
Compare
1c1b01f
to
ad056d4
Compare
include/cper_util.hpp
Outdated
constexpr uint16_t PCIE_VENDOR_ID = 0x1022; | ||
constexpr int MINOR_REVISION = 0xB; | ||
|
||
/*template void dumpHeaderSection(const std::shared_ptr<FatalCperRecord>& data, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this commented code left out for debug purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved to cper_tul.cpp file.
Will delete these commented out lines.
src/utils/host_util.cpp
Outdated
{ | ||
if (*systemRecovery == "WARM_RESET") | ||
{ | ||
if ((buf & 0x4)) // SYS_MGMT_CTRL_ERR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add bitmask macro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
a4d00b8
to
78a6f41
Compare
83fae27
to
ae7a5f1
Compare
include/oem_cper.hpp
Outdated
CRASHDUMP_T CrashDumpData[32]; | ||
DF_DUMP DfDumpData; | ||
UINT32 Reserved1[96]; | ||
UINT32 DebugLogIdData[12124]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of generic things:
- please define macros for all the array sizes of various members of these structures in this file.
- Is it OK to have the typdes as all capital letters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Ack. Replaced all the array sizes with Macros.
- The oem_cper.h is an extension of https://github.com/openbmc/libcper/blob/main/include/libcper/Cper.h with few additional structures appended as per the AMD requirement. So followed the same format as of Cper.h file in libcper library as it is defined under "extern C"
include/error_monitor.hpp
Outdated
Manager(amd::ras::config::Manager&); | ||
|
||
int errCount = 0; | ||
uint8_t cpuCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my be late in the game. If I were you, I would make these members as private, and write get rouines for each of them, and initialize these variables as part of the constructor instead of initializing them in the getSocketInfo() method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Moved these variables to protected and initialized these variables in the constructor.
I opted to keep the variables as protected instead of private because I need to use base class pointer to point to objects of the derived class.
include/error_monitor.hpp
Outdated
* @return On failure to create index file or read CPER index number | ||
* throw std::runtime_error exception. | ||
*/ | ||
void createCperIndexFile(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be part of cper_util.hpp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
*/ | ||
class CrashdumpInterface : public CrashdumpBase | ||
{ | ||
public: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the coding guideline? is it public first, or private first? not a major concern..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other openbmc repo's follow public , protected and private order.
Opted the same here as well
include/apml_manager.hpp
Outdated
* This function is invoked when an alert event occurs on P0. The function | ||
* handles the event by processing the necessary response. | ||
*/ | ||
void p0AlertEventHandler(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p0AlertEventHandler(), and p1AertEventHandler() code will be similar, I am assuming. Instead, if we can have this Handler as a routine which would take the p0/p1 specifics as input, this code will be cleaner, and if we have to support p3, we are ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
include/apml_manager.hpp
Outdated
* This GPIO line is used to detect hardware alerts for P0 and trigger | ||
* events for processing. | ||
*/ | ||
gpiod::line p0apmlAlertLine; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RAS is not tested in 2P verona platforms as APML works only for 1 socket.
Not sure having one gpioline for both P0 and P1 will have any issues in the gpio monitoring.
I will mark this as TODO and change it once it is verified in Venice.
src/utils/host_util.cpp
Outdated
} | ||
} | ||
|
||
void triggerWarmReset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a SEL log on any type of host reset might help during debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SEL events will be logged for DC power off and ON as part of the x86-power-control application.
src/utils/cper_util.cpp
Outdated
{ | ||
bool ret = false; | ||
int socNum = 0; | ||
uint32_t tempVar[2][8]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shoud the socNum 2 be a macro/const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
errorMgr->configure(); | ||
#endif | ||
|
||
#ifdef PLDM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see if you can make this as a variable based check? That is how it is going to be. if not now, may be later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. This is a TODO item once the flow is finalized for PLDM. I will revert this and change it to variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK
ae7a5f1
to
7862321
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config/ras_config.json
Outdated
@@ -4,7 +4,7 @@ | |||
"ApmlRetries": { | |||
"Description": "Number of APML retry count", | |||
"Value": 10, | |||
"MaxBoundLimit": "50" | |||
"MaxBoundLimit": 50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a good practice to fix bugs in this commit , please add new commit and fix/change this if it is not introduced with this commit specific chnages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. This is a bug in the previous code. The maxBoundLimit should be integer not a string.
I will revert this fix in this commit and create a new commit.
config/ras_config.json
Outdated
}, | ||
{ | ||
"DramCeccPollingEn": { | ||
"Description": "If this field is true, DRAM Cecc correctable errors will be polled.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cecc stands for Dynamic Random-Access Memory Correctable Error Correction Code, why you are using Cecc correctable errors here ? please repharse with Cecc exanded form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the expansion during first time or usage and acronym is used for further places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From PPR document:
DRAM CECC: Returns
Error count, counter info and
corresponding UMC_MCA
information from the valid
DRAM ECC Correctable
Error counter instance
Hence updating the description to "DRAM ECC Correctable Error" in the description
config/ras_config.json
Outdated
}, | ||
{ | ||
"PcieAerPollingEn": { | ||
"Description": "If this field is true, PCIE AER correctable errors will be polled.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit inclued AER expaned form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
config/ras_config.json
Outdated
}, | ||
{ | ||
"McaThresholdEn": { | ||
"Description": "If this field is true, error thresholding is enable for MCa errors", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why MCa?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
config/ras_config.json
Outdated
}, | ||
{ | ||
"McaThresholdEn": { | ||
"Description": "If this field is true, error thresholding is enable for MCa errors", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
period is missing at end. Be consistent .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased this to "Polling interval in seconds for MCA errors ".
Removed period at the end for all the description to maintain the consistency with the previous commit.
config/ras_config.json
Outdated
}, | ||
{ | ||
"DramCeccThresholdEn": { | ||
"Description": "If this field is true, error thresholding is enable for DRAM CECC errors.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"If this field is true, error thresholding for MCA errors will be enabled.",
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
namespace ras | ||
{ | ||
namespace apml | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General scope :
Please revisit the documentation and update according to OpenbMC guidelines.
Documentation: Improved documentation for clarity and consistency.
Parameter Names: Added parameter names in function declarations for better readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All classes are ordered in the public/protected/private sequence.
Replaced uint8_t with size_t wherever possible.
All variable names are in lowerCamelCase.
All functions are named in lowerCamelCase, except those called from the APML library, which use lower_snake_case.
All structs, classes, enums, and templates are in UpperCamelCase, except structures from the APML library, which follow lower_snake_case.
All constants are in lowerCamelCase.
APIs under a top-level namespace do not use the namespace names again in the API names.
Replaced all const int with fixed-width integer types.
Replaced all static const std::string with constexpr std::string_view.
Consistency is maintained across files in the comment descriptions of functions. This consistency is maintained file to file.
o/p of format-script.sh
Formatting code under /mnt/work2/adhandap/Code/amd-bmc-ras
eslint: cannot find .eslintrc.json; using /mnt/work2/adhandap/Code/openbmc-build-scripts/config/eslint-global-config.json
markdownlint: cannot find .markdownlint.yaml; using /mnt/work2/adhandap/Code/openbmc-build-scripts/config/markdownlint.yaml
Running commit_gitlint
Running commit_spelling
codespell-dictionary - misspelling count >> 0
generic-dictionary - misspelling count >> 0
Running clang_format
Running eslint
Running markdownlint
Running prettier
config/ras_config.json 54ms (unchanged)
config/gpio_config.json 2ms (unchanged)
README.md 34ms (unchanged)
.prettierrc.yaml 21ms (unchanged)
Result differences...
Format: PASSED
include/apml_manager.hpp
Outdated
* This GPIO line is used to detect hardware alerts for P1 and trigger | ||
* events for processing. | ||
*/ | ||
gpiod::line p1apmlAlertLine; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hard coding p0 /p1 approch is not scalable , Need better way to handle this instead of hard coding p0 and p1.
Like gpio line should be enabled based on config types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added gpio_config.json which contains list of ALERT_L gpio's and this is handled dynamically by reading the file.
include/cper_util.hpp
Outdated
static const std::string runtimeDramErr = "RUNTIME_DRAM_ERROR"; | ||
static const std::string fatalErr = "FATAL"; | ||
|
||
namespace amd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you use namespace amd {
namespace ras {
namespace cper {
namespace util { approch you don't need to use cper name attachment in the below defintion . please reveiw and update .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
include/cper_util.hpp
Outdated
namespace cper_util | ||
{ | ||
|
||
constexpr int SEV_NON_FATAL_UNCORRECTED = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General: Using int for constants that represent specific sizes or types can lead to ambiguity and potential issues. It's better to use fixed-width integer types from to ensure the correct size and type. Here's the corrected version:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
7862321
to
0e9b9c2
Compare
config/ras_config.json
Outdated
}, | ||
{ | ||
"DramCeccPollingEn": { | ||
"Description": "If this field is true, Dynamic Random Accecc Memory Error Correction Code (DRAM ECC) correctable errors will be polled", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typo" Accecc"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
config/gpio_config.json
Outdated
@@ -0,0 +1,3 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename the file name to amd_ras_gpio_config.json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
@@ -0,0 +1,302 @@ | |||
#include "error_monitor.hpp" | |||
|
|||
extern "C" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scope: All hpp/cpp files.
Include Only What You Need: Only include headers that are necessary for the file to compile. Avoid unnecessary includes to reduce compilation time.
Self-Sufficient Headers:Each header file should be able to compile on its own
Use Include Guards #pragma once ( hpp).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
include/apml_manager.hpp
Outdated
{ | ||
public: | ||
Manager() = delete; | ||
~Manager() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually Constructors put in one block and the detsructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
{ | ||
namespace apml | ||
{ | ||
class Manager : public amd::ras::Manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class definition is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
* @details This function configures a GPIO line and stream descriptor to | ||
* listen for events. It triggers the provided callback function upon event | ||
* detection. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete function definitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
* | ||
* @details This function is invoked when an alert event occurs on P0 or P1. | ||
* The function handles the event by processing the necessary response. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete function definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
std::mutex dramErrorHarvestMtx; | ||
std::mutex pcieErrorHarvestMtx; | ||
|
||
/** @brief Update processor OOB configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Better to follow the standard function descriptioms.consitent with public one.
include/apml_manager.hpp
Outdated
sdbusplus::asio::object_server& objectServer; | ||
std::shared_ptr<sdbusplus::asio::connection>& systemBus; | ||
|
||
uint8_t progId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please group all mutex varibale toogether and and make sure these mutexes initialized as part of constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
meson.build
Outdated
@@ -11,12 +11,27 @@ project( | |||
) | |||
|
|||
config_file = '/var/lib/amd-bmc-ras/ras_config.json' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move these to meson.options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
src/utils/dbus_util.cpp
Outdated
@@ -0,0 +1,83 @@ | |||
#include "dbus_util.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
source path is already under utils , file name rename to dbus.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer valid as all the utils are clubbed togerther in util.hpp and cper_util.hpp
However, renamed the cper_util.hpp and platform_util.hpp to util/cper.hpp and util/platform.hpp
src/utils/dbus_util.cpp
Outdated
@@ -0,0 +1,83 @@ | |||
#include "dbus_util.hpp" | |||
|
|||
namespace amd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name space should be amd::ras::util::dbus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Renamed the gpio_config.json to amd_ras_gpio_config.json
- Added documentation to all the API's
- Only included headers that are necessary for the file to compile.
- Moved the AMD CRB specifc changes to utils/platform.hpp
- Changed to a common util file-based approach instead of having multiple util files.
0e9b9c2
to
ea0737b
Compare
ea0737b
to
ac69d7b
Compare
ac69d7b
to
94a0e11
Compare
588d9b8
to
c7309af
Compare
APML RAS Manager Initialization: - Added initialization for APML RAS Manager. - Included conditional compilation for APML support. - Added a placeholder error log for PLDM RAS capabilities, indicating that they are yet to be enabled. - The init function repeatedly attempts to get the BMC RAS OOB configuration until successful. - The function initializes the platform with the block ID's that needs to be harvested during a crashdump and sets up a D-Bus match to monitor watchdog state changes to monitor BIOS post complete. - It reads CPU IDs for all CPUs and logs errors on failure. - The init function also creates separate polling threads for MCA errors, DRAM CECC errors and PCIE advanced error reporting. - The function also handles BIOS post-completion by configuring PCIE OOB settings and enabling PCIE error thresholds based on watchdog timer changes. - It also clears SbrmiAlertMask register so APML_ALERT_L will be asserted during a syncflood in the system. - The commit has oem_cper.h providing the outline of file format for both runtime and crashdump CPER records. - Added additional json config parameters to enable OOB registers during initialization. - Overall , this commit provides all the necessary preps needed to enable the crashdump flow and runtime error monitoring. Crashdump monitoring: - This commit introduces the handling of GPIO events for APML alerts by identifying the GPIO's from the gpio_config.json - Binds the P0 alert event handler and P1 alert evernt handler to manage these alerts. - Read RAS status register and check for errors. - Log and send alerts for various RAS errors including: - SYS_MGMT_CTRL_ERR: Trigger cold reset based on policy. - RESET_HANG_ERR: Suggest manual immediate reset. - FATAL_ERROR: Harvest MCA data and reset based on policy. - MCA_ERR_OVERFLOW: Log MCA runtime error counter overflow. - DRAM_CECC_ERR_OVERFLOW: Log DRAM CECC runtime error counter overflow. - PCIE_ERR_OVERFLOW: Log PCIE runtime error counter overflow. CPER record generation: - Add functionality to generate Common Platform Error Record (CPER) entries when a FATAL_ERROR is detected. - It also creates CPER record when a runtime MCA , DRAM CECC and PCIE advanced error reporting is reported via polling or threshold overflow. - The system stores maximum of 10 CPER records in BMC including fatal and runtime errors. - Create D-Bus object paths for each CPER file in the system, allowing for download via redfish. - Update properties for each CPER object, including Filename, Log, and Timestamp. root@morocco-d89c:~# busctl tree com.amd.RAS `- /com `- /com/amd `- /com/amd/RAS `- /com/amd/RAS/0 Verified in Venice SOC 1P system. Signed-off-by: aasboddu <[email protected]>, Abinaya Dhandapani <[email protected]>
c7309af
to
34a36c8
Compare
@@ -0,0 +1,17 @@ | |||
{ | |||
"CpuCount": 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future, we should change it to 2.
{ | ||
"DramCeccPollingPeriod": { | ||
"Description": "Polling interval in seconds for DRAM ECC correctable errors", | ||
"Value": 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the next commit, please disable polling and enable threshold interrupts for DRAM CECC.
} | ||
}, | ||
{ | ||
"McaPollingEn": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the next commit, please disable polling and enable threshold interrupts for MCA.
} | ||
}, | ||
{ | ||
"PcieAerPollingPeriod": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the next commit, please disable polling and enable threshold interrupts for PCIe AER.
@@ -19,6 +24,20 @@ int main() | |||
|
|||
amd::ras::config::Manager manager(objectServer, systemBus); | |||
|
|||
#ifdef APML |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove #ifdef in the future commit
errorMgr->configure(); | ||
#endif | ||
|
||
#ifdef PLDM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit has been a stellar example of how the reviews need to happen. Thanks Jayanth, Mahesh for the thorough reviews and Abinaya for addressing them diligently.
Enable RAS manager for fatal and runtime errors
APML RAS Manager Initialization:
that they are yet to be enabled.
configuration until successful.
to be harvested during a crashdump and sets up a D-Bus match to
monitor watchdog state changes to monitor BIOS post complete.
DRAM CECC and PCIE AER error monitoring.
OOB settings and enabling PCIE error thresholds based on watchdog
timer changes.
asserted during a syncflood in the system.
both runtime and crashdump CPER records.
during initialization.
enable the crashdump flow and runtime error monitoring.
Crashdump monitoring:
P1 APML alerts
to manage these alerts.
counter overflow.
CPER record generation:
entries when a FATAL_ERROR is detected.
PCIE AER error is reported via polling or threshold overflow.
fatal and runtime errors.
allowing for download via redfish.
Log, and Timestamp.
root@morocco-d89c:~# busctl tree com.amd.RAS
- /com
- /com/amd- /com/amd/RAS
- /com/amd/RAS/0Signed-off-by: aasboddu [email protected], Abinaya Dhandapani [email protected]