Enable RAS manager for fatal and runtime errors #102

Open: wants to merge 1 commit into base: integ_sp7

@aasboddu commented Jan 23, 2025

Enable RAS manager for fatal and runtime errors

APML RAS Manager Initialization:

  • Added initialization for APML RAS Manager.
  • Included conditional compilation for APML support.
  • Added a placeholder error log for PLDM RAS capabilities, indicating
    that they are yet to be enabled.
  • The init function repeatedly attempts to get the BMC RAS OOB
    configuration until successful.
  • The function initializes the platform with the block IDs that need
    to be harvested during a crashdump and sets up a D-Bus match on
    watchdog state changes to detect BIOS POST completion.
  • It reads CPU IDs for all CPUs and logs errors on failure.
  • The init function also creates separate polling threads for MCA,
    DRAM CECC and PCIE AER error monitoring.
  • The function also handles BIOS post-completion by configuring PCIE
    OOB settings and enabling PCIE error thresholds based on watchdog
    timer changes.
  • It also clears the SbrmiAlertMask register so that APML_ALERT_L is
    asserted during a syncflood in the system.
  • The commit adds oem_cper.h, which outlines the file format for
    both runtime and crashdump CPER records.
  • Added additional JSON config parameters to enable OOB registers
    during initialization.
  • Overall, this commit provides all the preparation needed to
    enable the crashdump flow and runtime error monitoring.
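
The "repeatedly attempts to get the BMC RAS OOB configuration until successful" step could look roughly like the sketch below. This is only an illustration: `OobConfig`, `waitForOobConfig`, the `fetch` callback and the delay are hypothetical names chosen for the example, not the PR's actual code.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <thread>

// Hypothetical stand-in for the BMC RAS OOB configuration.
struct OobConfig
{
    unsigned mcaPollingPeriodSec;
};

// Calls `fetch` until it yields a value, sleeping `delay` between attempts.
// When `attempts` is non-null it receives the number of calls that were made.
inline OobConfig waitForOobConfig(
    const std::function<std::optional<OobConfig>()>& fetch,
    std::chrono::milliseconds delay, unsigned* attempts = nullptr)
{
    unsigned tries = 0;
    for (;;)
    {
        ++tries;
        if (auto cfg = fetch())
        {
            if (attempts != nullptr)
            {
                *attempts = tries;
            }
            return *cfg;
        }
        // Back off before retrying so the loop does not spin.
        std::this_thread::sleep_for(delay);
    }
}
```

Passing the fetch operation as a callback keeps the retry policy testable without the real configuration service.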

Crashdump monitoring:

  • This commit introduces the handling of GPIO events for P0 and
    P1 APML alerts.
  • Binds the P0 alert event handler and P1 alert event handler
    to manage these alerts.
  • Read RAS status register and check for errors.
  • Log and send alerts for various RAS errors including:
    • SYS_MGMT_CTRL_ERR: Trigger cold reset based on policy.
    • RESET_HANG_ERR: Suggest manual immediate reset.
    • FATAL_ERROR: Harvest MCA data and reset based on policy.
    • MCA_ERR_OVERFLOW: Log MCA runtime error counter overflow.
    • DRAM_CECC_ERR_OVERFLOW: Log DRAM CECC runtime error
      counter overflow.
    • PCIE_ERR_OVERFLOW: Log PCIE runtime error counter overflow.

CPER record generation:

  • Add functionality to generate Common Platform Error Record (CPER)
    entries when a FATAL_ERROR is detected.
  • It also creates a CPER record when a runtime MCA, DRAM CECC or
    PCIE AER error is reported via polling or threshold overflow.
  • The system stores a maximum of 10 CPER records in the BMC, including
    fatal and runtime errors.
  • Create D-Bus object paths for each CPER file in the system,
    allowing for download via Redfish.
  • Update properties for each CPER object, including Filename,
    Log, and Timestamp.

root@morocco-d89c:~# busctl tree com.amd.RAS
`- /com
  `- /com/amd
    `- /com/amd/RAS
      `- /com/amd/RAS/0
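
As a rough illustration of the 10-record limit and the per-record object path, here is a minimal sketch. The limit of 10 and the `/com/amd/RAS/0` path come from the text above. The helper names `cperSlot` and `cperObjectPath`, the assumption that the path suffix is the record index, and the assumption that the oldest record is overwritten once the limit is reached are all hypothetical and may not match the PR.

```cpp
#include <cstddef>
#include <string>

// Limit stated in the commit message.
constexpr std::size_t maxCperRecords = 10;

// Hypothetical helper: slot used by the n-th record (0-based), ASSUMING the
// oldest record is overwritten once the limit is reached.
constexpr std::size_t cperSlot(std::size_t recordCount)
{
    return recordCount % maxCperRecords;
}

// Hypothetical helper: D-Bus object path for a slot, ASSUMING the numeric
// suffix seen in the busctl output is the record index.
inline std::string cperObjectPath(std::size_t slot)
{
    return "/com/amd/RAS/" + std::to_string(slot);
}
```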

Signed-off-by: aasboddu [email protected], Abinaya Dhandapani [email protected]

@@ -0,0 +1,61 @@
#pragma once

#include "libcper/Cper.h"

There is still an open discussion on whether to use the libcper library or copy the contents of libcper/Cper.h into the amd-ras repo.

@supven01 Can you please provide your inputs?

@abinayaddhandapani force-pushed the review branch 3 times, most recently from 5ce505a to 6f25110, on February 10, 2025 19:30
@abinayaddhandapani force-pushed the review branch 2 times, most recently from 1c1b01f to ad056d4, on February 11, 2025 10:47
@abinayaddhandapani changed the title from "Enable APML RAS Manager Initialization" to "Enable Fatal error monitoring and CPER file creation" on Feb 11, 2025
constexpr uint16_t PCIE_VENDOR_ID = 0x1022;
constexpr int MINOR_REVISION = 0xB;

/*template void dumpHeaderSection(const std::shared_ptr<FatalCperRecord>& data,

Is this commented code left out for debug purposes?

@abinayaddhandapani commented Feb 14, 2025

This has been moved to the cper_tul.cpp file.
I will delete these commented-out lines.

{
if (*systemRecovery == "WARM_RESET")
{
if ((buf & 0x4)) // SYS_MGMT_CTRL_ERR

Please add a bitmask macro.

Ack
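
One possible shape for the requested change, as a sketch only: the value `0x4` and the name SYS_MGMT_CTRL_ERR come from the snippet under review, while `sysMgmtCtrlErrMask`, `hasFlag` and the 32-bit width of `buf` are assumptions made for the example. It uses a `constexpr` constant in place of a preprocessor macro, in line with the `constexpr` declarations elsewhere in this diff.

```cpp
#include <cstdint>

// Value taken from the snippet under review (buf & 0x4). Masks for the other
// RAS status bits are omitted because their positions are not shown here.
constexpr std::uint32_t sysMgmtCtrlErrMask = 0x4;

// Hypothetical helper: true when any bit of `mask` is set in `status`.
constexpr bool hasFlag(std::uint32_t status, std::uint32_t mask)
{
    return (status & mask) != 0;
}
```

With this in place the check would read `if (hasFlag(buf, sysMgmtCtrlErrMask))`, which removes the need for the trailing comment.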

@abinayaddhandapani changed the title from "Enable Fatal error monitoring and CPER file creation" to "Enable RAS manager for fatal and runtime errors" on Feb 26, 2025
CRASHDUMP_T CrashDumpData[32];
DF_DUMP DfDumpData;
UINT32 Reserved1[96];
UINT32 DebugLogIdData[12124];

A couple of generic things:

  1. Please define macros for the array sizes of the various members of the structures in this file.
  2. Is it OK to have the typedefs in all capital letters?
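
A sketch of what point 1 could look like for the members quoted above. The sizes 32, 96 and 12124 are taken from the diff. The constant names, the `DumpLayout` struct name and the placeholder element types are hypothetical, since the real `CRASHDUMP_T` and `DF_DUMP` definitions are not reproduced here. It uses `constexpr` constants where the comment asks for macros.

```cpp
#include <cstddef>
#include <cstdint>

// Placeholder element types standing in for CRASHDUMP_T and DF_DUMP.
struct CrashdumpEntry
{
    std::uint32_t raw;
};
struct DfDump
{
    std::uint32_t raw;
};

// Named sizes replacing the literals 32, 96 and 12124 from the diff.
constexpr std::size_t crashDumpEntryCount = 32;
constexpr std::size_t reserved1WordCount = 96;
constexpr std::size_t debugLogIdWordCount = 12124;

// Hypothetical wrapper struct holding only the members quoted in the review.
struct DumpLayout
{
    CrashdumpEntry CrashDumpData[crashDumpEntryCount];
    DfDump DfDumpData;
    std::uint32_t Reserved1[reserved1WordCount];
    std::uint32_t DebugLogIdData[debugLogIdWordCount];
};

// Compile-time check that the named size really drives the array length.
static_assert(sizeof(DumpLayout::Reserved1) ==
                  reserved1WordCount * sizeof(std::uint32_t),
              "Reserved1 length must match reserved1WordCount");
```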

Manager(amd::ras::config::Manager&);

int errCount = 0;
uint8_t cpuCount;

This may be late in the game, but if I were you, I would make these members private, write get routines for each of them, and initialize them in the constructor instead of in the getSocketInfo() method.
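
A minimal sketch of that suggestion, assuming a simplified `Manager` that takes the CPU count directly. The member names `errCount` and `cpuCount` are from the diff. The constructor signature, the getter names and `recordError()` are hypothetical, and the real class takes an `amd::ras::config::Manager&`.

```cpp
#include <cstdint>

// Simplified stand-in for the Manager class in the diff.
class Manager
{
  public:
    // Members are set once here instead of inside getSocketInfo().
    explicit Manager(std::uint8_t cpuCountIn) : cpuCount(cpuCountIn) {}

    std::uint8_t getCpuCount() const
    {
        return cpuCount;
    }

    int getErrCount() const
    {
        return errCount;
    }

    // Hypothetical mutator so callers never write errCount directly.
    void recordError()
    {
        ++errCount;
    }

  private:
    int errCount = 0;
    std::uint8_t cpuCount;
};
```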

* @return On failure to create index file or read CPER index number
* throw std::runtime_error exception.
*/
void createCperIndexFile();

Shouldn't this be part of cper_util.hpp?

*/
class CrashdumpInterface : public CrashdumpBase
{
public:

What is the coding guideline: public first or private first? Not a major concern.

* This function is invoked when an alert event occurs on P0. The function
* handles the event by processing the necessary response.
*/
void p0AlertEventHandler();

I am assuming the p0AlertEventHandler() and p1AlertEventHandler() code will be similar. If we instead have a single handler routine that takes the p0/p1 specifics as input, this code will be cleaner, and if we have to support p3, we are ready.
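
Roughly what that could look like, as a sketch: one routine takes the socket number, and the existing per-socket entry points become thin wrappers. `AlertHandlers`, `alertEventHandler` and the `handled` log are hypothetical names used only for this example. The real handler would read the RAS status register for the given socket.

```cpp
#include <cstdint>
#include <vector>

class AlertHandlers
{
  public:
    // Single routine parameterized by socket. Here it only records the
    // socket number so the behavior can be observed.
    void alertEventHandler(std::uint8_t socket)
    {
        handled.push_back(socket);
    }

    // Existing entry points kept as wrappers so GPIO bindings stay unchanged.
    void p0AlertEventHandler()
    {
        alertEventHandler(0);
    }

    void p1AlertEventHandler()
    {
        alertEventHandler(1);
    }

    // Sockets whose alerts were handled, in arrival order.
    std::vector<std::uint8_t> handled;
};
```

Supporting another socket then only needs one more wrapper, or a direct call with a new socket number.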

* This GPIO line is used to detect hardware alerts for P0 and trigger
* events for processing.
*/
gpiod::line p0apmlAlertLine;

same comment as above.

}
}

void triggerWarmReset()

Adding a SEL log on any type of host reset might help during debug.

{
bool ret = false;
int socNum = 0;
uint32_t tempVar[2][8];

Should the socNum 2 be a macro/const?

errorMgr->configure();
#endif

#ifdef PLDM

See if you can make this a variable-based check. That is how it is going to be, if not now, then maybe later.
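
One way to read this suggestion, sketched under assumptions: the transport choice moves from `#ifdef APML` / `#ifdef PLDM` to flags read at runtime. `RasOptions`, `initSteps` and the returned step strings are hypothetical and exist only to make the branching visible. Where the flags come from (for example the JSON config) is not shown.

```cpp
#include <string>
#include <vector>

// Hypothetical runtime options replacing the APML / PLDM compile-time macros.
struct RasOptions
{
    bool apmlEnabled;
    bool pldmEnabled;
};

// Returns the initialization steps that would run for the given options.
inline std::vector<std::string> initSteps(const RasOptions& opts)
{
    std::vector<std::string> steps;
    if (opts.apmlEnabled)
    {
        steps.push_back("apml: init and configure");
    }
    if (opts.pldmEnabled)
    {
        // Mirrors the placeholder log for PLDM in the commit message.
        steps.push_back("pldm: RAS capabilities not yet enabled");
    }
    return steps;
}
```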
