Enable RAS manager for fatal and runtime errors #102
base: integ_sp7
@@ -0,0 +1,61 @@
#pragma once

#include "libcper/Cper.h"
Whether to use the libcper library or copy the contents of libcper/Cper.h into the amd-ras repo is still under discussion.
@supven01 Can you please provide your inputs?
5ce505a
to
6f25110
Compare
1c1b01f
to
ad056d4
Compare
include/cper_util.hpp
Outdated
constexpr uint16_t PCIE_VENDOR_ID = 0x1022;
constexpr int MINOR_REVISION = 0xB;

/*template void dumpHeaderSection(const std::shared_ptr<FatalCperRecord>& data,
Was this commented-out code left in for debug purposes?
This has been moved to the cper_util.cpp file.
I will delete these commented-out lines.
src/utils/host_util.cpp
Outdated
{
if (*systemRecovery == "WARM_RESET")
{
if ((buf & 0x4)) // SYS_MGMT_CTRL_ERR
Please add a bitmask macro.
Ack
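One possible shape for the requested bitmask, as a minimal sketch. Only the value 0x4 and the name SYS_MGMT_CTRL_ERR come from the diff above; the constant name `SYS_MGMT_CTRL_ERR_MASK` and the helper `hasSysMgmtCtrlErr` are hypothetical, and a `constexpr` is used here in place of a preprocessor macro.

```cpp
#include <cstdint>

// Hypothetical constant name; the value 0x4 and the SYS_MGMT_CTRL_ERR label
// are taken from the `(buf & 0x4)` check in the diff.
constexpr uint32_t SYS_MGMT_CTRL_ERR_MASK = 0x4;

// Hypothetical helper: returns true when the SYS_MGMT_CTRL_ERR bit is set
// in a RAS status value.
constexpr bool hasSysMgmtCtrlErr(uint32_t rasStatus)
{
    return (rasStatus & SYS_MGMT_CTRL_ERR_MASK) != 0;
}
```

With this in place, the check in host_util.cpp could read `if (hasSysMgmtCtrlErr(buf))` rather than testing a bare literal.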
a4d00b8
to
78a6f41
Compare
78a6f41
to
83fae27
Compare
APML RAS Manager Initialization:
- Added initialization for APML RAS Manager.
- Included conditional compilation for APML support.
- Added a placeholder error log for PLDM RAS capabilities, indicating that they are yet to be enabled.
- The init function repeatedly attempts to get the BMC RAS OOB configuration until successful.
- The function initializes the platform with the block IDs that need to be harvested during a crashdump and sets up a D-Bus match on watchdog state changes to detect BIOS post complete.
- It reads CPU IDs for all CPUs and logs errors on failure.
- The init function also creates separate polling threads for MCA, DRAM CECC and PCIE AER error monitoring.
- The function also handles BIOS post-completion by configuring PCIE OOB settings and enabling PCIE error thresholds based on watchdog timer changes.
- It also clears the SbrmiAlertMask register so APML_ALERT_L will be asserted during a syncflood in the system.
- The commit has oem_cper.h providing the outline of the file format for both runtime and crashdump CPER records.
- Added additional JSON config parameters to enable OOB registers during initialization.
- Overall, this commit provides all the preparation needed to enable the crashdump flow and runtime error monitoring.

Crashdump monitoring:
- This commit introduces the handling of GPIO events for P0 and P1 APML alerts.
- Binds the P0 alert event handler and P1 alert event handler to manage these alerts.
- Read the RAS status register and check for errors.
- Log and send alerts for various RAS errors including:
  - SYS_MGMT_CTRL_ERR: Trigger cold reset based on policy.
  - RESET_HANG_ERR: Suggest manual immediate reset.
  - FATAL_ERROR: Harvest MCA data and reset based on policy.
  - MCA_ERR_OVERFLOW: Log MCA runtime error counter overflow.
  - DRAM_CECC_ERR_OVERFLOW: Log DRAM CECC runtime error counter overflow.
  - PCIE_ERR_OVERFLOW: Log PCIE runtime error counter overflow.

CPER record generation:
- Add functionality to generate Common Platform Error Record (CPER) entries when a FATAL_ERROR is detected.
- It also creates a CPER record when a runtime MCA, DRAM CECC or PCIE AER error is reported via polling or threshold overflow.
- The system stores a maximum of 10 CPER records in the BMC, including fatal and runtime errors.
- Create D-Bus object paths for each CPER file in the system, allowing for download via Redfish.
- Update properties for each CPER object, including Filename, Log, and Timestamp.

root@morocco-d89c:~# busctl tree com.amd.RAS
`- /com
  `- /com/amd
    `- /com/amd/RAS
      `- /com/amd/RAS/0

Signed-off-by: aasboddu <[email protected]>, Abinaya Dhandapani <[email protected]>
83fae27
to
ae7a5f1
Compare
CRASHDUMP_T CrashDumpData[32];
DF_DUMP DfDumpData;
UINT32 Reserved1[96];
UINT32 DebugLogIdData[12124];
A couple of generic things:
- Please define macros for all the array sizes of the various members of the structures in this file.
- Is it OK to have the typedefs in all capital letters?
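The first point could look roughly like the sketch below. The sizes 32, 96 and 12124 come from the struct in the diff; every name here is hypothetical, `constexpr` constants stand in for macros, and `CrashdumpEntry` and `DfDump` are placeholder types because the real CRASHDUMP_T and DF_DUMP layouts are not shown in this conversation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical constant names; the values are the literals used in the diff.
constexpr std::size_t CRASHDUMP_ENTRY_COUNT = 32;
constexpr std::size_t RESERVED1_WORD_COUNT = 96;
constexpr std::size_t DEBUG_LOG_ID_WORD_COUNT = 12124;

// Placeholder stand-ins for CRASHDUMP_T and DF_DUMP (real layouts not shown).
struct CrashdumpEntry
{
    uint32_t placeholder;
};
struct DfDump
{
    uint32_t placeholder;
};

// Same member order as the struct under review, with named array sizes.
struct CrashdumpPayload
{
    CrashdumpEntry crashDumpData[CRASHDUMP_ENTRY_COUNT];
    DfDump dfDumpData;
    uint32_t reserved1[RESERVED1_WORD_COUNT];
    uint32_t debugLogIdData[DEBUG_LOG_ID_WORD_COUNT];
};
```

Naming the sizes also lets loops and bounds checks elsewhere reuse the same constants instead of repeating the literals.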
Manager(amd::ras::config::Manager&);

int errCount = 0;
uint8_t cpuCount;
It may be late in the game, but if I were you, I would make these members private, write getter routines for each of them, and initialize them in the constructor instead of in the getSocketInfo() method.
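A minimal sketch of that suggestion follows. The member names `errCount` and `cpuCount` come from the diff; the class name `SocketInfo`, the getters and `recordError()` are hypothetical, and the real Manager constructor (which takes an `amd::ras::config::Manager&`) would obtain the CPU count differently.

```cpp
#include <cstdint>

// Hypothetical class illustrating private state with read-only accessors.
class SocketInfo
{
  public:
    // cpuCount is fixed at construction instead of being filled in later
    // by a separate getSocketInfo()-style call.
    explicit SocketInfo(uint8_t cpuCount) : cpuCount(cpuCount) {}

    uint8_t getCpuCount() const
    {
        return cpuCount;
    }

    int getErrCount() const
    {
        return errCount;
    }

    // The only way to change the error counter from outside the class.
    void recordError()
    {
        ++errCount;
    }

  private:
    int errCount = 0;
    uint8_t cpuCount;
};
```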
* @return On failure to create index file or read CPER index number
* throw std::runtime_error exception.
*/
void createCperIndexFile();
shouldn't this be part of cper_util.hpp?
*/
class CrashdumpInterface : public CrashdumpBase
{
public:
What is the coding guideline: public first, or private first? Not a major concern.
* This function is invoked when an alert event occurs on P0. The function
* handles the event by processing the necessary response.
*/
void p0AlertEventHandler();
I assume the code in p0AlertEventHandler() and p1AlertEventHandler() will be similar. If we instead have a single handler routine that takes the p0/p1 specifics as input, the code will be cleaner, and if we have to support p3, we are ready.
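A rough sketch of such a shared handler is shown below. Everything in it is hypothetical: the class `ApmlAlertHandler`, the `StatusReader` callback (a stand-in for however the PR reads the RAS status register), and the rule that a non-zero status means an error was seen. It shows only the parameterization by socket index, not the PR's actual GPIO or APML calls.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical handler shared by all sockets.
class ApmlAlertHandler
{
  public:
    // Stand-in for the real RAS status register read, keyed by socket index.
    using StatusReader = std::function<uint32_t(uint8_t)>;

    explicit ApmlAlertHandler(StatusReader reader) :
        readStatus(std::move(reader))
    {}

    // One body for P0, P1, or any further socket: the socket index is the
    // only per-socket input. Returns true if the status reported an error.
    bool alertEventHandler(uint8_t socket)
    {
        const uint32_t status = readStatus(socket);
        if (status == 0)
        {
            return false;
        }
        handled.push_back(socket);
        return true;
    }

    const std::vector<uint8_t>& handledSockets() const
    {
        return handled;
    }

  private:
    StatusReader readStatus;
    std::vector<uint8_t> handled;
};
```

The per-socket GPIO events could then each bind a small lambda such as `[&handler] { handler.alertEventHandler(0); }`, so adding a socket means adding one binding rather than one more handler function.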
* This GPIO line is used to detect hardware alerts for P0 and trigger
* events for processing.
*/
gpiod::line p0apmlAlertLine;
Same comment as above.
}
}

void triggerWarmReset()
Adding a SEL log on any type of host reset might help during debug.
{
bool ret = false;
int socNum = 0;
uint32_t tempVar[2][8];
Should the socNum 2 be a macro/const?
errorMgr->configure();
#endif

#ifdef PLDM
See if you can make this a variable-based check. That is how it is going to be, if not now, then maybe later.
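For illustration, a variable-based check could replace the `#ifdef APML` / `#ifdef PLDM` blocks along these lines. This is a sketch under assumptions: `RasFeatureConfig` and `enabledBackends` are hypothetical names, and how the flags would be populated (for example from the JSON config the commit message mentions) is not shown in the PR conversation.

```cpp
#include <string>
#include <vector>

// Hypothetical runtime flags replacing the compile-time APML/PLDM switches.
struct RasFeatureConfig
{
    bool apml = false;
    bool pldm = false;
};

// Returns the backends to initialize, decided at runtime rather than by
// the preprocessor.
std::vector<std::string> enabledBackends(const RasFeatureConfig& cfg)
{
    std::vector<std::string> backends;
    if (cfg.apml)
    {
        backends.push_back("APML");
    }
    if (cfg.pldm)
    {
        backends.push_back("PLDM");
    }
    return backends;
}
```

The trade-off is that both code paths must always compile, but a single binary can then serve platforms with either capability.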