
Commit ae7a5f1

aasboddu and Abinaya Dhandapani authored and committed
Enable RAS manager for fatal and runtime errors
APML RAS Manager initialization:
- Added initialization for the APML RAS Manager.
- Included conditional compilation for APML support.
- Added a placeholder error log for PLDM RAS capabilities, which are yet to be enabled.
- The init function repeatedly attempts to get the BMC RAS OOB configuration until it succeeds (sketched after this message).
- It initializes the platform with the block IDs that need to be harvested during a crashdump, and sets up a D-Bus match on watchdog state changes to monitor BIOS POST complete.
- It reads CPU IDs for all CPUs and logs errors on failure.
- The init function also creates separate polling threads for MCA, DRAM CECC, and PCIE AER error monitoring.
- It handles BIOS POST completion by configuring PCIE OOB settings and enabling PCIE error thresholds based on watchdog timer changes.
- It also clears the SbrmiAlertMask register so APML_ALERT_L will be asserted during a syncflood in the system.
- oem_cper.h provides the outline of the file format for both runtime and crashdump CPER records.
- Added JSON config parameters to enable OOB registers during initialization.
- Overall, this commit provides all the preparation needed to enable the crashdump flow and runtime error monitoring.

Crashdump monitoring:
- Introduces handling of GPIO events for P0 and P1 APML alerts, and binds the P0 and P1 alert event handlers to manage these alerts (see the GPIO sketch below).
- Reads the RAS status register and checks for errors.
- Logs and sends alerts for various RAS errors, including:
  - SYS_MGMT_CTRL_ERR: trigger a cold reset based on policy.
  - RESET_HANG_ERR: suggest a manual immediate reset.
  - FATAL_ERROR: harvest MCA data and reset based on policy.
  - MCA_ERR_OVERFLOW: log MCA runtime error counter overflow.
  - DRAM_CECC_ERR_OVERFLOW: log DRAM CECC runtime error counter overflow.
  - PCIE_ERR_OVERFLOW: log PCIE runtime error counter overflow.

CPER record generation:
- Adds functionality to generate Common Platform Error Record (CPER) entries when a FATAL_ERROR is detected.
- Also creates a CPER record when a runtime MCA, DRAM CECC, or PCIE AER error is reported via polling or threshold overflow.
- The BMC stores a maximum of 10 CPER records, including both fatal and runtime errors.
- Creates D-Bus object paths for each CPER file in the system, allowing download via Redfish (see the D-Bus sketch below).
- Updates properties for each CPER object, including Filename, Log, and Timestamp.

root@morocco-d89c:~# busctl tree com.amd.RAS
`- /com
  `- /com/amd
    `- /com/amd/RAS
      `- /com/amd/RAS/0

Signed-off-by: aasboddu <[email protected]>, Abinaya Dhandapani <[email protected]>
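A minimal sketch of the init flow described above: retry the OOB configuration read until it succeeds, then watch watchdog property changes for BIOS POST complete. This assumes sdbusplus and boost::asio; getOobConfig() and the watchdog object path are illustrative stand-ins, not this commit's actual identifiers.

// Sketch only: retry-until-success config read plus a watchdog D-Bus match.
// getOobConfig() and the watchdog path below are assumptions, not the
// commit's code.
#include <boost/asio.hpp>
#include <sdbusplus/asio/connection.hpp>
#include <sdbusplus/bus/match.hpp>
#include <chrono>
#include <thread>

static bool getOobConfig()
{
    static int attempts = 0; // hypothetical stand-in for the APML read:
    return ++attempts > 3;   // "fail" a few times, then succeed
}

int main()
{
    boost::asio::io_context io;
    auto conn = std::make_shared<sdbusplus::asio::connection>(io);

    // Keep retrying until the BMC RAS OOB configuration is read successfully.
    while (!getOobConfig())
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    // Watch watchdog property changes to detect BIOS POST complete.
    sdbusplus::bus::match_t postMatch(
        *conn,
        sdbusplus::bus::match::rules::propertiesChanged(
            "/xyz/openbmc_project/watchdog/host0",
            "xyz.openbmc_project.State.Watchdog"),
        [](sdbusplus::message_t&) {
            // On POST complete: configure PCIE OOB settings and enable
            // PCIE error thresholds, per the commit message above.
        });

    io.run();
}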
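A sketch of the crashdump-monitoring flow: block on a falling edge of a socket's APML alert GPIO, read the RAS status register, and dispatch on the error bits. It uses the libgpiod C++ bindings; the line name, bit positions, and readRasStatus() are assumptions for illustration.

// Sketch only: GPIO alert -> RAS status decode. Line name, bit layout,
// and readRasStatus() are assumed, not taken from the commit.
#include <gpiod.hpp>
#include <chrono>
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the APML RAS status register read.
static uint32_t readRasStatus(int socket)
{
    (void)socket;
    return 0x04; // pretend FATAL_ERROR is set
}

static void decodeRasStatus(int socket, uint32_t status)
{
    if (status & 0x01) // SYS_MGMT_CTRL_ERR (bit position assumed)
        std::cout << "P" << socket << ": cold reset per policy\n";
    if (status & 0x02) // RESET_HANG_ERR (bit position assumed)
        std::cout << "P" << socket << ": suggest manual immediate reset\n";
    if (status & 0x04) // FATAL_ERROR (bit position assumed)
        std::cout << "P" << socket << ": harvest MCA data, reset per policy\n";
    // MCA/DRAM CECC/PCIE error-counter overflow bits handled the same way.
}

int main()
{
    // Line name is an assumption; real platforms name it in the device tree.
    gpiod::line alert = gpiod::find_line("P0_APML_ALERT_L");
    alert.request({"amd-ras", gpiod::line_request::EVENT_FALLING_EDGE, 0});

    while (true)
    {
        if (alert.event_wait(std::chrono::seconds(10)))
        {
            alert.event_read(); // consume the edge event
            decodeRasStatus(0, readRasStatus(0));
        }
    }
}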
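A sketch of exporting one CPER record object as in the busctl tree above, using sdbusplus's asio object server. The interface name "com.amd.RAS.CPER" and the property values are placeholders; only the service name, object path, and property names come from the commit message.

// Sketch only: one D-Bus object per stored CPER file (10-record cap above).
// The interface name and property values are assumptions.
#include <boost/asio.hpp>
#include <sdbusplus/asio/connection.hpp>
#include <sdbusplus/asio/object_server.hpp>
#include <string>

int main()
{
    boost::asio::io_context io;
    auto conn = std::make_shared<sdbusplus::asio::connection>(io);
    conn->request_name("com.amd.RAS");

    sdbusplus::asio::object_server server(conn);

    // Objects would be indexed /com/amd/RAS/0 .. /com/amd/RAS/9.
    auto iface = server.add_interface("/com/amd/RAS/0", "com.amd.RAS.CPER");
    iface->register_property("Filename", std::string("ras-error0.cper"));
    iface->register_property("Log", std::string("FATAL_ERROR"));
    iface->register_property("Timestamp", std::string("2024-01-01T00:00:00Z"));
    iface->initialize();

    io.run();
}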
1 parent f0f49c8 commit ae7a5f1

22 files changed: +4404 −2 lines changed

config/ras_config.json

Lines changed: 73 additions & 1 deletion
@@ -4,7 +4,7 @@
       "ApmlRetries": {
         "Description": "Number of APML retry count",
         "Value": 10,
-        "MaxBoundLimit": "50"
+        "MaxBoundLimit": 50
       }
     },
     {
@@ -67,6 +67,78 @@
         "Description": "Disable AIFS Reset on syncflood counter",
         "Value": true
       }
+    },
+    {
+      "DramCeccPollingEn": {
+        "Description": "If this field is true, DRAM CECC correctable errors will be polled.",
+        "Value": false
+      }
+    },
+    {
+      "McaPollingEn": {
+        "Description": "If this field is true, MCA correctable errors will be polled.",
+        "Value": true
+      }
+    },
+    {
+      "PcieAerPollingEn": {
+        "Description": "If this field is true, PCIE AER correctable errors will be polled.",
+        "Value": false
+      }
+    },
+    {
+      "DramCeccThresholdEn": {
+        "Description": "If this field is true, error thresholding is enabled for DRAM CECC errors.",
+        "Value": false
+      }
+    },
+    {
+      "McaThresholdEn": {
+        "Description": "If this field is true, error thresholding is enabled for MCA errors.",
+        "Value": false
+      }
+    },
+    {
+      "PcieAerThresholdEn": {
+        "Description": "If this field is true, error thresholding is enabled for PCIE AER errors.",
+        "Value": false
+      }
+    },
+    {
+      "McaPollingPeriod": {
+        "Description": "Polling time period in seconds for MCA errors",
+        "Value": 3
+      }
+    },
+    {
+      "DramCeccPollingPeriod": {
+        "Description": "Polling time period in seconds for DRAM CECC errors",
+        "Value": 5
+      }
+    },
+    {
+      "PcieAerPollingPeriod": {
+        "Description": "Polling time period in seconds for PCIE AER errors",
+        "Value": 7
+      }
+    },
+    {
+      "DramCeccErrThresholdCnt": {
+        "Description": "Error threshold value for DRAM CECC errors.",
+        "Value": 1
+      }
+    },
+    {
+      "McaErrThresholdCnt": {
+        "Description": "Error threshold value for MCA errors.",
+        "Value": 1
+      }
+    },
+    {
+      "PcieAerErrThresholdCnt": {
+        "Description": "Error threshold value for PCIE AER errors.",
+        "Value": 1
+      }
     }
   ]
 }
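A sketch of how a consumer might read the new polling/threshold fields. Field names match the JSON added above and the array-of-one-key-objects shape matches the diff, but the lookup helper itself is an assumption about how the RAS manager parses this file.

// Sketch only: read polling/threshold settings from ras_config.json.
// Field names match the diff above; the helper is an assumption.
#include <nlohmann/json.hpp>
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream in("config/ras_config.json");
    nlohmann::json cfg = nlohmann::json::parse(in);

    // Entries are one-key objects like {"McaPollingPeriod": {"Value": 3}}.
    auto getValue = [&](const std::string& name) -> nlohmann::json {
        for (const auto& [key, section] : cfg.items())
            if (section.is_array())
                for (const auto& entry : section)
                    if (entry.contains(name))
                        return entry[name]["Value"];
        return {};
    };

    if (getValue("McaPollingEn").get<bool>())
        std::cout << "Poll MCA errors every "
                  << getValue("McaPollingPeriod").get<int>() << "s\n";
}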
