Enable RAS manager for fatal and runtime errors

APML RAS Manager initialization:
- Added initialization for the APML RAS Manager.
- Included conditional compilation for APML support.
- Added a placeholder error log for PLDM RAS capabilities, indicating that they are yet to be enabled.
- The init function repeatedly attempts to get the BMC RAS OOB configuration until it succeeds.
- The function initializes the platform with the block IDs that need to be harvested during a crashdump, and sets up a D-Bus match on watchdog state changes to detect BIOS POST complete.
- It reads the CPU IDs for all CPUs and logs errors on failure.
- The init function also creates separate polling threads for MCA, DRAM CECC and PCIE AER error monitoring.
- The function also handles BIOS POST completion by configuring PCIE OOB settings and enabling PCIE error thresholds based on watchdog timer changes.
- It also clears the SbrmiAlertMask register so that APML_ALERT_L is asserted during a syncflood in the system.
- The commit adds oem_cper.h, which outlines the file format for both runtime and crashdump CPER records.
- Added additional JSON config parameters to enable OOB registers during initialization.
- Overall, this commit provides the preparation needed to enable the crashdump flow and runtime error monitoring.

Crashdump monitoring:
- This commit introduces the handling of GPIO events for P0 and P1 APML alerts.
- Binds the P0 alert event handler and the P1 alert event handler to manage these alerts.
- Reads the RAS status register and checks for errors.
- Logs and sends alerts for various RAS errors, including:
  - SYS_MGMT_CTRL_ERR: Trigger cold reset based on policy.
  - RESET_HANG_ERR: Suggest manual immediate reset.
  - FATAL_ERROR: Harvest MCA data and reset based on policy.
  - MCA_ERR_OVERFLOW: Log MCA runtime error counter overflow.
  - DRAM_CECC_ERR_OVERFLOW: Log DRAM CECC runtime error counter overflow.
  - PCIE_ERR_OVERFLOW: Log PCIE runtime error counter overflow.
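
For illustration only, the mapping from RAS status bits to actions listed above could be sketched as a table-driven decoder. This is not the code from this commit: the bit positions, the `RasEvent` type and the `decodeRasStatus` helper are all hypothetical, and the real bit layout should be taken from the APML documentation.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bit positions, chosen only for this sketch.
// The real RAS status register layout may differ.
constexpr uint8_t kFatalError = 1u << 0;
constexpr uint8_t kResetHangErr = 1u << 1;
constexpr uint8_t kSysMgmtCtrlErr = 1u << 2;
constexpr uint8_t kMcaErrOverflow = 1u << 3;
constexpr uint8_t kDramCeccErrOverflow = 1u << 4;
constexpr uint8_t kPcieErrOverflow = 1u << 5;

// One decoded error: its name and the action described in the commit message.
struct RasEvent
{
    std::string name;
    std::string action;
};

// Table entry tying a status bit to its name and action.
struct RasStatusEntry
{
    uint8_t mask;
    const char* name;
    const char* action;
};

// Return one RasEvent per status bit that is set, in table order.
std::vector<RasEvent> decodeRasStatus(uint8_t status)
{
    static const RasStatusEntry table[] = {
        {kSysMgmtCtrlErr, "SYS_MGMT_CTRL_ERR", "cold reset based on policy"},
        {kResetHangErr, "RESET_HANG_ERR", "suggest manual immediate reset"},
        {kFatalError, "FATAL_ERROR",
         "harvest MCA data and reset based on policy"},
        {kMcaErrOverflow, "MCA_ERR_OVERFLOW",
         "log MCA runtime error counter overflow"},
        {kDramCeccErrOverflow, "DRAM_CECC_ERR_OVERFLOW",
         "log DRAM CECC runtime error counter overflow"},
        {kPcieErrOverflow, "PCIE_ERR_OVERFLOW",
         "log PCIE runtime error counter overflow"},
    };

    std::vector<RasEvent> events;
    for (const auto& entry : table)
    {
        if (status & entry.mask)
        {
            events.push_back({entry.name, entry.action});
        }
    }
    return events;
}
```

A table keeps the alert handler short: the P0 and P1 handlers could share one decoder and differ only in which socket's register they read.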

CPER record generation:
- Add functionality to generate Common Platform Error Record (CPER) entries when a FATAL_ERROR is detected.
- It also creates a CPER record when a runtime MCA, DRAM CECC or PCIE AER error is reported via polling or threshold overflow.
- The system stores a maximum of 10 CPER records on the BMC, including fatal and runtime errors.
- Create D-Bus object paths for each CPER file in the system, allowing for download via Redfish.
- Update properties for each CPER object, including Filename, Log, and Timestamp.

root@morocco-d89c:~# busctl tree com.amd.RAS
`- /com
  `- /com/amd
    `- /com/amd/RAS
      `- /com/amd/RAS/0

Signed-off-by: aasboddu <[email protected]>, Abinaya Dhandapani <[email protected]>