Enable RAS manager for fatal and runtime errors
APML RAS Manager Initialization:

- Added initialization for APML RAS Manager.
- Included conditional compilation for APML support.
- Added a placeholder error log for PLDM RAS capabilities, indicating
  that they are yet to be enabled.
- The init function repeatedly attempts to get the BMC RAS OOB
  configuration until successful.
- The function initializes the platform with the block IDs that need
  to be harvested during a crashdump and sets up a D-Bus match on
  watchdog state changes to detect BIOS POST completion.
- It reads CPU IDs for all CPUs and logs errors on failure.
- The init function also creates separate polling threads for MCA,
  DRAM CECC and PCIE AER error monitoring.
- The function also handles BIOS post-completion by configuring PCIE
  OOB settings and enabling PCIE error thresholds based on watchdog
  timer changes.
- It also clears the SbrmiAlertMask register so that APML_ALERT_L is
  asserted during a syncflood in the system.
- The commit includes oem_cper.h, which outlines the file format for
  both runtime and crashdump CPER records.
- Added additional JSON config parameters to enable OOB registers
  during initialization.
- Overall, this commit provides the preparation needed to enable the
  crashdump flow and runtime error monitoring.
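
The per-source polling threads described above can be pictured with a minimal Python sketch. It is not the commit's C++ implementation: `poll_loop`, `start_pollers`, and the `pollers` mapping are hypothetical names, and the config is assumed to be already flattened into a plain name-to-value dict. Only the key names (`McaPollingEn`, `McaPollingPeriod`, and their DRAM CECC and PCIE AER counterparts) come from config/ras_config.json.

```python
import threading


def poll_loop(poll_fn, period_s, stop):
    """Call poll_fn every period_s seconds until the stop event is set."""
    while not stop.is_set():
        poll_fn()
        # Event.wait acts as a sleep that returns early once stop is set.
        stop.wait(period_s)


def start_pollers(config, pollers, stop):
    """Start one daemon thread per error source whose polling flag is true.

    config  : flat dict, e.g. {"McaPollingEn": True, "McaPollingPeriod": 3}
    pollers : hypothetical mapping of source prefix ("Mca", "DramCecc",
              "PcieAer") to a zero-argument polling callable
    """
    threads = []
    for name, poll_fn in pollers.items():
        if not config.get(f"{name}PollingEn", False):
            continue  # polling disabled for this source
        period_s = config[f"{name}PollingPeriod"]
        thread = threading.Thread(
            target=poll_loop,
            args=(poll_fn, period_s, stop),
            name=f"{name}-poller",
            daemon=True,
        )
        thread.start()
        threads.append(thread)
    return threads
```

With the default values in the config diff, only the MCA poller would start, because `DramCeccPollingEn` and `PcieAerPollingEn` are false. Sharing one stop event keeps shutdown simple: setting it wakes every sleeping poller at once.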

Crashdump monitoring:

- This commit introduces the handling of GPIO events for P0 and
  P1 APML alerts.
- Binds the P0 alert event handler and the P1 alert event handler
  to manage these alerts.
- Reads the RAS status register and checks for errors.
- Logs and sends alerts for various RAS errors, including:
  - SYS_MGMT_CTRL_ERR: Trigger cold reset based on policy.
  - RESET_HANG_ERR: Suggest manual immediate reset.
  - FATAL_ERROR: Harvest MCA data and reset based on policy.
  - MCA_ERR_OVERFLOW: Log MCA runtime error counter overflow.
  - DRAM_CECC_ERR_OVERFLOW: Log DRAM CECC runtime error
    counter overflow.
  - PCIE_ERR_OVERFLOW: Log PCIE runtime error counter overflow.
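
As a rough illustration of the status check listed above, the following Python sketch decodes a status word into the named errors and the action the commit message attaches to each. The bit positions are placeholders chosen for the example and are NOT taken from the APML register definition; `RasStatus`, `ACTIONS`, and `pending_actions` are hypothetical names.

```python
from enum import IntFlag


class RasStatus(IntFlag):
    # ASSUMPTION: these bit positions are illustrative placeholders,
    # not the real RAS status register layout.
    SYS_MGMT_CTRL_ERR = 1 << 0
    RESET_HANG_ERR = 1 << 1
    FATAL_ERROR = 1 << 2
    MCA_ERR_OVERFLOW = 1 << 3
    DRAM_CECC_ERR_OVERFLOW = 1 << 4
    PCIE_ERR_OVERFLOW = 1 << 5


# Action text paraphrased from the commit message.
ACTIONS = {
    RasStatus.SYS_MGMT_CTRL_ERR: "trigger cold reset based on policy",
    RasStatus.RESET_HANG_ERR: "suggest manual immediate reset",
    RasStatus.FATAL_ERROR: "harvest MCA data and reset based on policy",
    RasStatus.MCA_ERR_OVERFLOW: "log MCA runtime error counter overflow",
    RasStatus.DRAM_CECC_ERR_OVERFLOW: "log DRAM CECC runtime error counter overflow",
    RasStatus.PCIE_ERR_OVERFLOW: "log PCIE runtime error counter overflow",
}


def pending_actions(status):
    """Return (error name, action) pairs for every bit set in status."""
    return [(flag.name, ACTIONS[flag]) for flag in RasStatus if status & flag]
```

A status word with several bits set yields one entry per error, in the order the flags are declared, so a handler can log and act on each independently.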

CPER record generation:

- Add functionality to generate Common Platform Error Record (CPER)
  entries when a FATAL_ERROR is detected.
- It also creates a CPER record when a runtime MCA, DRAM CECC, or
  PCIE AER error is reported via polling or threshold overflow.
- The system stores a maximum of 10 CPER records on the BMC, counting
  both fatal and runtime errors.
- Create D-Bus object paths for each CPER file in the system,
  allowing for download via Redfish.
- Update properties for each CPER object, including Filename,
  Log, and Timestamp.

root@morocco-d89c:~# busctl tree com.amd.RAS
`- /com
  `- /com/amd
    `- /com/amd/RAS
      `- /com/amd/RAS/0
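
One way to picture the 10-record limit and the per-record object paths is a fixed set of slots that wrap around. The commit message does not say what happens when an eleventh record arrives, so overwriting the oldest slot is an assumption of this Python sketch. `CperIndex` and its methods are hypothetical; the `/com/amd/RAS/<n>` path shape and the Filename, Log, and Timestamp property names are taken from the text and the tree above.

```python
MAX_CPER_RECORDS = 10  # limit stated in the commit message


class CperIndex:
    """Track CPER records in a fixed number of slots."""

    def __init__(self, limit=MAX_CPER_RECORDS):
        self.limit = limit
        self.records = {}   # slot number -> property dict
        self.next_slot = 0

    def add(self, filename, log, timestamp):
        """Store a record and return the object path of the slot it occupies.

        ASSUMPTION: once all slots are full, the oldest slot is reused.
        """
        slot = self.next_slot
        self.records[slot] = {
            "Filename": filename,
            "Log": log,
            "Timestamp": timestamp,
        }
        self.next_slot = (slot + 1) % self.limit
        return f"/com/amd/RAS/{slot}"
```

Under this assumption the first record lands at `/com/amd/RAS/0`, matching the single entry in the tree above, and the eleventh record would reuse that same path.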

Signed-off-by: aasboddu <[email protected]>, Abinaya Dhandapani <[email protected]>
aasboddu authored and Abinaya Dhandapani committed Feb 26, 2025
1 parent f0f49c8 commit ae7a5f1
Showing 22 changed files with 4,404 additions and 2 deletions.
74 changes: 73 additions & 1 deletion config/ras_config.json
@@ -4,7 +4,7 @@
"ApmlRetries": {
"Description": "Number of APML retry count",
"Value": 10,
"MaxBoundLimit": "50"
"MaxBoundLimit": 50
}
},
{
@@ -67,6 +67,78 @@
"Description": "Disable AIFS Reset on syncflood counter",
"Value": true
}
},
{
"DramCeccPollingEn": {
"Description": "If this field is true, DRAM Cecc correctable errors will be polled.",
"Value": false
}
},
{
"McaPollingEn": {
"Description": "If this field is true, MCA correctable errors will be polled.",
"Value": true
}
},
{
"PcieAerPollingEn": {
"Description": "If this field is true, PCIE AER correctable errors will be polled.",
"Value": false
}
},
{
"DramCeccThresholdEn": {
"Description": "If this field is true, error thresholding is enable for DRAM CECC errors.",
"Value": false
}
},
{
"McaThresholdEn": {
"Description": "If this field is true, error thresholding is enable for MCa errors",
"Value": false
}
},
{
"PcieAerThresholdEn": {
"Description": "If this field is true, error thresholding is enable for PCIE AER errors.",
"Value": false
}
},
{
"McaPollingPeriod": {
"Description": "Polling time period in seconds for MCA errors",
"Value": 3
}
},
{
"DramCeccPollingPeriod": {
"Description": "Polling time period in seconds for DRAM CECC errors",
"Value": 5
}
},
{
"PcieAerPollingPeriod": {
"Description": "Polling time period in seconds for PCIE AER errors",
"Value": 7
}
},
{
"DramCeccErrThresholdCnt": {
"Description": "Error threshold value for DRAM CECC errors.",
"Value": 1
}
},
{
"McaErrThresholdCnt": {
"Description": "Error threshold value for MCA errors.",
"Value": 1
}
},
{
"PcieAerErrThresholdCnt": {
"Description": "Error threshold value for PCIE AER errors.",
"Value": 1
}
}
]
}
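
Each config entry in this file is a single-key object whose value carries a `Description`, a `Value`, and sometimes a `MaxBoundLimit`. The Python sketch below shows one way such a list could be flattened into a name-to-value dict and bounds-checked. It is not the service's own parser: `flatten_config` is a hypothetical helper, and because the diff does not show the top-level key that holds the list, the sketch works on the list itself.

```python
import json

# A few entries copied from the diff; the enclosing top-level key is not
# visible there, so only the list is modelled here.
SAMPLE = """
[
  {"ApmlRetries": {"Description": "Number of APML retry count",
                   "Value": 10, "MaxBoundLimit": 50}},
  {"McaPollingEn": {"Description": "If this field is true, MCA correctable errors will be polled.",
                    "Value": true}},
  {"McaPollingPeriod": {"Description": "Polling time period in seconds for MCA errors",
                        "Value": 3}}
]
"""


def flatten_config(entries):
    """Map each attribute name to its Value, checking MaxBoundLimit if present."""
    flat = {}
    for entry in entries:
        for name, attrs in entry.items():
            value = attrs["Value"]
            limit = attrs.get("MaxBoundLimit")
            if limit is not None:
                # A quoted limit such as "50" cannot be compared numerically.
                if not isinstance(limit, int):
                    raise TypeError(f"{name}: MaxBoundLimit must be a number")
                if value > limit:
                    raise ValueError(f"{name}: {value} exceeds limit {limit}")
            flat[name] = value
    return flat


config = flatten_config(json.loads(SAMPLE))
```

The type check hints at why the first hunk changes `MaxBoundLimit` from the string `"50"` to the number `50`: a numeric limit can be compared against `Value` directly, without a conversion step.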