v6.0

released this 23 May 22:22

skiboot-6.0

skiboot v6.0 was released on Friday May 11th 2018. It is the first
release of skiboot 6.0, which is the new stable release of skiboot
following the 5.11 release, first released April 6th 2018.

Skiboot 6.0 is the basis for op-build v2.0 and is required for
POWER9 systems.

skiboot v6.0 contains all bug fixes as of :ref:skiboot-5.11,
:ref:skiboot-5.10.5, and :ref:skiboot-5.4.9 (the currently maintained
stable releases). We do not expect any further stable releases in the
5.10.x series, nor in the 5.11.x series.

For details on how the skiboot stable releases work, see :ref:stable-rules.

Over skiboot-5.11, we have the following changes:

New Features

Since 6.0-rc1:

  • Update default stop-state-disable mask to cut only stop11

    Stability improvements in microcode for stop4/stop5 are
    available in upstream hcode images. Stop4 and stop5 can
    be safely enabled by default.

    Use ~0xE0000000 to cut all but stop0,1,2 in case there
    are any issues with stop4/5.

    example: ::

    nvram -p ibm,skiboot --update-config opal-stop-state-disable-mask=0x1FFFFFFF

    Note that DD2.1 chips with a frequency <1867MHz may need to
    run an hcode image different from the default in op-build (set
    BR2_HCODE_LATEST_VERSION=y in your config).
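
    The mask arithmetic in this entry can be checked with a short sketch. It
    assumes, based on the statement that ~0xE0000000 cuts all but stop0,1,2,
    that stop level n maps to bit n counted from the most significant bit of
    a 32-bit mask. The helper name disable_mask_keeping is made up for
    illustration and is not part of skiboot.

```python
def disable_mask_keeping(enabled_levels):
    """Return a 32-bit disable mask that cuts every stop level not listed.

    Assumption: stop level n corresponds to bit n counted from the most
    significant bit (0x80000000 >> n), inferred from the ~0xE0000000 example.
    """
    keep = 0
    for level in enabled_levels:
        keep |= 0x80000000 >> level
    # Invert the "keep" bits and clamp the result to 32 bits.
    return ~keep & 0xFFFFFFFF


if __name__ == "__main__":
    # Keeping only stop0, stop1 and stop2 gives the value used in the
    # nvram example above.
    print(hex(disable_mask_keeping([0, 1, 2])))
```

    Under that assumption, keeping only stop0,1,2 yields the same value as
    the nvram example in this entry.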

  • ibm,firmware-versions: add hcode to device tree

    op-build commit 736a08b996e292a449c4996edb264011dfe56a40
    added hcode to the VERSION partition, let's parse it out
    and let the user know.

  • ipmi: Add BMC firmware version to device tree

    The BMC Get Device ID command gives BMC firmware version details. Let's add
    this to the device tree. User space tools will use this information to
    display BMC version details.

Since 5.11:

  • Disable stop states from OPAL

    On ZZ, stop4,5,11 are enabled for PowerVM, even though doing
    so may cause problems with OPAL due to bugs in hcode.

    For other platforms, this isn't so much of an issue as
    we can just control stop states by the MRW. However the
    rebuild-the-world approach to changing values there is a bit
    annoying if you just want to rule out a specific stop state
    from being problematic.

    Provide an nvram option to override what's disabled in OPAL.

    The OPAL mask is currently ~0xE0000000 (i.e. all but stop 0,1,2)

    You can set an NVRAM override with: ::

    nvram -p ibm,skiboot --update-config opal-stop-state-disable-mask=0xFFFFFFFF
    

    This nvram override will disable all stop states.

  • interrupts: Create an "interrupts" property in the OPAL node

    Deprecate the old "opal-interrupts". It's still there, but the new
    property follows the standard and allows us to specify whether an
    interrupt is level or edge sensitive.

    Similarly create "interrupt-names" whose content is identical to
    "opal-interrupts-names".

  • SBE: Add timer support on POWER9

    The SBE on P9 provides a one-shot programmable timer facility. We can use
    this to implement OPAL timers and hence limit the reliance on the Linux
    heartbeat (similar to the HW timer facility provided by SLW on P8).

  • Add SBE driver support

    SBE (Self Boot Engine) on P9 has two different jobs:

    • Boot the chip up to the point the core is functional
    • Provide various services like timer, scom, stash MPIPL, etc., at runtime

    We will use SBE for various purposes like timer, MPIPL, etc.

  • opal:hmi: Add missing processor recovery reason string.

    With this patch, the reason string is now printed for the CORE_WOF[43] bit. ::

    [ 477.352234986,7] HMI: [Loc: U78D3.001.WZS004A-P1-C48]: P:8 C:22 T:3: Processor recovery occurred.
    [ 477.352240742,7] HMI: Core WOF = 0x0000000000100000 recovered error:
    [ 477.352242181,7] HMI: PC - Thread hang recovery

  • Add DIMM actual speed to device tree

    Recent HDAT provides the actual DIMM speed. Let's add this to the device tree.

  • Fix DIMM size property

    Today we parse the VPD blob to get DIMM size information. This is limited
    to FSP based systems. HDAT provides the DIMM size value, so let's use that
    to populate the device tree. That way we can get size information on BMC
    based systems as well.

  • PCI: Set slot power limit when supported

    The PCIe slot capability can be implemented in a root or switch
    downstream port to set the maximum power a card is allowed to draw
    from the system. This patch adds support for setting the power limit
    when the platform has defined one.

  • hdata/spira: parse vpd to add part-number and serial-number to xscom@ node

    Expected by FWTS and associates our processor with the part/serial
    number, which is obviously a good thing for one's own sanity.

Improved HMI Handling
^^^^^^^^^^^^^^^^^^^^^

  • opal/hmi: Add documentation for opal_handle_hmi2 call

  • opal/hmi: Generate hmi event for recovered HDEC parity error.

  • opal/hmi: check thread 0 tfmr to validate latched tfmr errors.

    Due to P9 errata, HDEC parity and TB residue errors are latched for
    non-zero threads 1-3 even if they are cleared. But these are not
    latched on thread 0. Hence, use xscom SCOMC/SCOMD to read thread 0 tfmr
    value and ignore them on non-zero threads if they are not present on
    thread 0.

  • opal/hmi: Print additional debug information in rendezvous.

  • opal/hmi: Fix handling of TFMR parity/corrupt error.

    While testing TFMR parity/corrupt error it has been observed that HMIs are
    delivered twice for this error

    • First time HMI is delivered with HMER[4,5]=1 and TFMR[60]=1.
    • Second time HMI is delivered with HMER[4,5]=1 and TFMR[60]=0 with valid TB.

    On second HMI we end up throwing "HMI: TB invalid without core error
    reported" even though TB is in a valid state.

  • opal/hmi: Stop flooding HMI event for TOD errors.

    Fix the issue where every thread on the chip sends an HMI event to the host
    for TOD errors. TOD errors are reported to all the cores/threads on the chip.
    Any one thread can fix the error and send the event. The rest of the threads
    don't need to send an HMI event.

  • opal/hmi: Fix soft lockups during TOD errors

    There are some TOD errors which do not affect the working of TOD and TB. They
    stay in a valid state. Hence we don't need a rendez-vous for TOD errors that
    do not affect TB working.

    TOD errors that affect TOD/TB will report a global error on TFMR[44]
    along with bit 51, and they will go into the rendez-vous path as expected.

    But the TOD errors that do not affect the TB register set only TFMR bit 51.
    The TFMR bit 51 is cleared when any single thread clears the TOD error.
    Once cleared, bit 51 is reflected to all the cores on that chip. Any
    thread that reads the TFMR register after the error is cleared will see
    TFMR bit 51 reset. Hence the threads that see TFMR[51]=1 fall into the
    rendez-vous path and the threads that see TFMR[51]=0 return doing
    nothing. This ends up in soft lockups in the host kernel.

    This patch fixes this issue by not considering the TOD interrupt (TFMR[51])
    as a core-global error, hence avoiding the rendez-vous path completely.
    Instead, threads that see TFMR[51]=1 will now take a different path that
    just does the TOD error recovery.

  • opal/hmi: Do not send HMI event if no errors are found.

    For TOD errors, all the cores in the chip get HMIs. Any one thread from any
    core can fix the issue, and TFMR will have the error conditions cleared. The
    rest of the threads need not take any action if the TOD errors are already
    cleared. Hence thread 0 of every core should get a fresh copy of TFMR before
    going ahead with the recovery path. Initialize recover = -1, so that if no
    errors are found, that thread need not send an HMI event to Linux. This stops
    every thread from flooding the host with HMI events even when no errors are
    found.

  • opal/hmi: Initialize the hmi event with old value of HMER.

    Do this before we check for TFAC errors. Otherwise the event at host console
    shows no error reported in HMER register.

    Without this patch the console event show HMER with all zeros ::

    [ 216.753417] Severe Hypervisor Maintenance interrupt [Recovered]
    [ 216.753498] Error detail: Timer facility experienced an error
    [ 216.753509] HMER: 0000000000000000
    [ 216.753518] TFMR: 3c12000870e04000

    After this patch it shows old HMER values on host console: ::

    [ 2237.652533] Severe Hypervisor Maintenance interrupt [Recovered]
    [ 2237.652651] Error detail: Timer facility experienced an error
    [ 2237.652766] HMER: 0840000000000000
    [ 2237.652837] TFMR: 3c12000870e04000

  • opal/hmi: Rework HMI handling of TFAC errors

    This patch reworks the HMI handling for TFAC errors by introducing
    4 rendez-vous points to improve the thread synchronization while handling
    timebase errors that require all threads to clear dirty data from the
    TB/HDEC registers before clearing the errors.

  • opal/hmi: Don't bother passing HMER to pre-recovery cleanup

    The test for TFAC error is now redundant so we remove it and
    remove the HMER argument.

  • opal/hmi: Move timer related error handling to a separate function

    Currently no functional change. This is a first step to completely
    rewriting how these things are handled.

  • opal/hmi: Add a new opal_handle_hmi2 that returns direct info to Linux

    It returns a 64-bit flags mask currently set to provide info
    about which timer facilities were lost, and whether an event
    was generated.

  • opal/hmi: Remove races in clearing HMER

    Writing to HMER acts as an "AND". The current code writes back the
    value we originally read with the bits we handled cleared. This is
    racy, if a new bit gets set in HW after the original read, we'll end
    up clearing it without handling it.

    Instead, use an all 1's mask with only the bit handled cleared.
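
    The race described above can be reproduced with a toy model. This is only
    a sketch: the AndOnWriteRegister class, the helper names and the two bit
    values are invented for illustration, and the only behaviour taken from
    the entry is that a write ANDs the written value into the register.

```python
class AndOnWriteRegister:
    """Toy register where a write ANDs the value into the current contents."""

    def __init__(self, value=0):
        self.value = value

    def read(self):
        return self.value

    def write(self, mask):
        self.value &= mask

    def hw_set(self, bits):
        # Stands in for hardware raising a new bit behind our back.
        self.value |= bits


ALL_ONES = 0xFFFFFFFFFFFFFFFF
HANDLED = 0x0800000000000000  # bit being handled (arbitrary choice)
NEW_BIT = 0x0000400000000000  # bit raised by hardware during handling


def racy_clear(reg):
    snapshot = reg.read()
    reg.hw_set(NEW_BIT)             # arrives after the original read
    reg.write(snapshot & ~HANDLED)  # also wipes NEW_BIT, which was never handled
    return reg.read()


def safe_clear(reg):
    reg.read()
    reg.hw_set(NEW_BIT)
    reg.write(ALL_ONES & ~HANDLED)  # clears only the handled bit
    return reg.read()
```

    With the racy variant the late bit is lost, while the all-ones mask leaves
    it set so that it can still be handled.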

  • opal/hmi: Don't re-read HMER multiple times

    We want to make sure all reporting and actions are based
    upon the same snapshot of HMER in case bits get added
    by HW while we are in OPAL.

libflash and ffspart
^^^^^^^^^^^^^^^^^^^^

Many improvements to the ffspart utility and libflash have come
in this release, making ffspart suitable for building PNOR images that
are bit-identical to those built by the existing tooling used by op-build.
The plan is to switch op-build to use this infrastructure in the not too
distant future.

  • libflash/blocklevel: Make read/write be ECC agnostic for callers

    The blocklevel abstraction allows for regions of the backing store to be
    marked as ECC protected so that blocklevel can decode/encode the ECC
    bytes into the buffer automatically without the caller having to be ECC
    aware.

    Unfortunately this abstraction is far from perfect. It is only useful
    if reads and writes are performed at the start of the ECC region or, in
    some circumstances, at an ECC aligned position, which requires the
    caller to be aware of the ECC regions.

    The problem that has arisen is that the blocklevel abstraction is
    initialised somewhere but when it is later called the caller is unaware
    if ECC exists in the region it wants to arbitrarily read and write to.
    This should not have been a problem since blocklevel knows. Currently
    misaligned reads will fail ECC checks and misaligned writes will
    overwrite ECC bytes and the backing store will become corrupted.

    This patch adds the smarts to blocklevel_read() and blocklevel_write() to
    cope with the problem. Note that ECC can always be bypassed by calling
    the blocklevel_raw_*() functions.

    All this work means that the gard tool can safely call
    blocklevel_read() and blocklevel_write() and, as long as blocklevel
    knows of the presence of ECC, it will deal with all cases.

    This commit also removes code in the gard tool which compensated for
    inadequacies no longer present in blocklevel.

  • libflash/blocklevel: Return region start from ecc_protected()

    Currently all ecc_protected() does is say if a region is ECC protected
    or not. Knowing a region is ECC protected is one thing but there isn't
    much that can be done afterwards if this is the only known fact. A lot
    more can be done if the caller is told where the ECC region begins.

    Knowing where the ECC region starts allows the caller to align its
    reads and writes. This allows for more flexibility when calling read and
    write without knowing exactly how the backing store is organised.

  • libflash/ecc: Add helpers to align a position within an ecc buffer

    As part of ongoing work to make ECC invisible to higher levels up the
    stack this function converts a 'position' which should be ECC agnostic
    to the equivalent position within an ECC region starting at a specified
    location.
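
    As a rough illustration of this kind of position conversion, the sketch
    below assumes a layout in which every 8 data bytes are followed by 1 ECC
    byte. Both that layout and the function names are assumptions made for
    the example. They are not taken from the libflash sources.

```python
BYTES_PER_ECC = 8  # assumed: 8 data bytes are followed by 1 ECC byte


def data_to_raw_pos(region_start, data_pos):
    """Map an ECC-agnostic data offset to a raw offset in the backing store.

    data_pos counts data bytes from the start of the ECC region.
    """
    full_words, remainder = divmod(data_pos, BYTES_PER_ECC)
    return region_start + full_words * (BYTES_PER_ECC + 1) + remainder


def align_raw_pos(region_start, raw_pos):
    """Round a raw position down to the start of its 9-byte ECC word."""
    return raw_pos - (raw_pos - region_start) % (BYTES_PER_ECC + 1)


if __name__ == "__main__":
    start = 0x1000
    raw = data_to_raw_pos(start, 10)
    print(hex(raw), hex(align_raw_pos(start, raw)))
```

    A caller that knows only the region start can use helpers like these to
    turn a data offset into an aligned raw read without knowing anything else
    about how the backing store is organised.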

  • libflash/ecc: Add functions to deal with unaligned ECC memcpy

  • external/ffspart: Improve error output

  • libffs: Fix bad checks for partition overlap

    Not all TOCs are written at zero

  • libflash/libffs: Allow caller to specify header partition

    An FFS TOC is comprised of two parts. The first is a small header which has
    a magic and very minimal information about the TOC which will be common to
    all partitions, things like number of partitions, block sizes and the like.
    Following this small header are a series of entries. Importantly there
    is always an entry which encompasses the TOC itself; this is usually
    called the 'part' partition.

    Currently libffs always assumes that the 'part' partition is at zero.
    While in practice there is always a TOC at zero, there doesn't actually
    have to be. PNORs may have multiple TOCs within them, therefore libffs
    needs to be flexible enough to allow callers to specify TOCs not at zero.

    The 'part' partition is otherwise a regular partition which may have
    flags associated with it. libffs should allow the user to set the flags
    for the 'part' partition.

    This patch achieves both by allowing the caller to specify the 'part'
    partition. The caller may choose not to, in which case libffs will
    provide a sensible default.

  • libflash/libffs: Refcount ffs entries

    Currently consumers can add a new ffs entry to multiple headers. This
    is fine, but freeing any of the headers will cause the entry to be freed,
    which causes double free problems.

    Even if only one header is used, the consumer of the library still has a
    reference to the entry, which they may well reuse at some other point.

    libffs will now refcount entries and only free when there are no more
    references.

    This patch also removes the pointless return value of ffs_hdr_free()
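
    The ownership rule described above can be sketched as follows. This is a
    toy model only: the Entry and Header classes and their methods are
    hypothetical stand-ins and do not mirror the libffs structures or API.

```python
class Entry:
    def __init__(self, name):
        self.name = name
        self.refs = 1          # the creator holds the first reference
        self.freed = False

    def get(self):
        self.refs += 1
        return self

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # stands in for the real free()


class Header:
    def __init__(self):
        self.entries = []

    def add(self, entry):
        # Each header takes its own reference on the shared entry.
        self.entries.append(entry.get())

    def free(self):
        # Dropping a header only drops the references that header holds.
        for entry in self.entries:
            entry.put()
        self.entries = []
```

    Freeing one header leaves the entry alive for the other header and for
    the consumer. The entry is released only when the last reference is
    dropped.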

  • libflash/libffs: Switch to storing header entries in an array

    Since the libffs no longer needs to sort the entries as they get added
    it makes little sense to have the complexity of a linked list when an
    array will suffice.

  • libflash/libffs: Remove backup partition from TOC generation code

    It turns out this code was messy and not all that reliable. Doing it at
    the library level adds complexity to the library and restrictions to the
    caller.

    A simpler approach can be achieved by just instantiating multiple
    ffs_header structures pointing to different parts of the same file.

  • libflash/libffs: Remove the 'sides' from the FFS TOC generation code

    It turns out this code was messy and not all that reliable. Doing it at
    the library level adds complexity to the library and restrictions to the
    caller.

    A simpler approach can be achieved by just instantiating multiple
    ffs_header structures pointing to different parts of the same file.

  • libflash/libffs: Always add entries to the end of the TOC

    It turns out that sorted order isn't the best idea. This removes
    flexibility from the caller. If the user wants their partitions in
    sorted order, they should insert them in sorted order.

  • external/ffspart: Remove side, order and backup options

    These options are currently flakey in libflash/libffs so there isn't
    much point to being able to use them in ffspart.

    Future reworks planned for libflash/libffs will render these options
    redundant anyway.

  • libflash/libffs: ffs_close() should use ffs_hdr_free()

  • libflash/libffs: Add setter for a partition's actual size

  • pflash: Use ffs_entry_user_to_string() to standardise flag strings

  • libffs: Standardise ffs partition flags

    It seems we've developed a character representation for ffs partition
    flags. Currently only pflash really prints them so it hasn't been a
    problem, but now ffspart wants to read them in from user input.

    It is important that what libffs reads and what pflash prints remain
    consistent, so we should move the code into libffs to avoid problems.

  • external/ffspart: Allow # comments in input file

p9dsu Platform changes

The p9dsu platform from SuperMicro (also known as 'Boston') has received
a number of updates, and the patches once carried by SuperMicro are now
upstream.

Since 6.0-rc1:

  • p9dsu: timeout for variant detection, default to 2uess

Since 5.11:

  • p9dsu: detect p9dsu variant even when hostboot doesn't tell us

    The SuperMicro BMC can tell us what riser type we have, which dictates
    the PCI slot tables. Usually, in an environment that a customer would
    experience, Hostboot will do the query with an SMC specific patch
    (not upstream as there's no platform specific code in hostboot)
    and skiboot knows what variant it is based on the compatible string.

    However, if you're using upstream hostboot, you only get the bare
    'p9dsu' compatible type. We can work around this by asking the BMC
    ourselves and setting the slot table appropriately. We do this
    synchronously in platform init so that we don't start probing
    PCI before we set up the slot table.

  • p9dsu: add slot power limit.

  • p9dsu: add pci slot table for Boston LC 1U/2U and Boston LA/ESS.

  • p9dsu HACK: fix system-vpd eeprom

  • p9dsu: change esel command from AMI to IBM 0x3a.

ZZ Platform Changes

  • hdata/i2c: Fix up pci hotplug labels

    These labels are used on the devices used to do PCIe slot power control
    for implementing PCIe hotplug. I'm not sure how they ended up as
    "eeprom-pgood" and "eeprom-controller" since that doesn't make any sense.

  • hdata/i2c: Ignore multi-port I2C devices

    Recent FSP firmware builds add support for multi-port I2C devices such
    as the GPIO expanders used for the presence detect of OpenCAPI devices
    and the PCIe hotplug controllers used to power cycle PCIe slots on ZZ.

    The OpenCAPI driver inside of skiboot currently uses a platform-specific
    method to talk to the relevant I2C device rather than relying on HDAT
    since not all platforms correctly report the I2C devices (hello Zaius).
    Additionally the nature of multi-port devices requires that we have a
    device specific handler so that we generate the correct DT bindings.
    Currently we don't, and there is no immediate need for this support, so
    just ignore the multi-port devices for now.

  • hdata/i2c: Replace i2c_ prefix with dev_

    The current naming scheme makes it easy to conflate "i2cm_port" and
    "i2c_port." The latter is used to describe multi-port I2C devices such
    as GPIO expanders and multi-channel PCIe hotplug controllers. Rename
    i2c_port to dev_port to make the two a bit more distinct.

    Also rename i2c_addr to dev_addr for consistency.

  • hdata/i2c: Ignore CFAM I2C master

    Recent FSP firmware builds put in information about the CFAM I2C master
    in addition to the host I2C masters accessible via XSCOM. Odds are this
    information should not be there since there's no handshaking between the
    FSP/BMC and the host over who controls that I2C master, but it is, so
    we need to deal with it.

    This patch adds filtering to the HDAT parser so it ignores the CFAM I2C
    master. Without this it will create a bogus i2cm@ which might cause
    issues.

  • ZZ: hw/imc: Add support to load imc catalog lid file

    Add support to load the imc catalog from a lid file packaged
    as part of the system firmware. Lid number allocated
    is 0x80f00103.lid.

Bugs Fixed

Since 6.0-rc2:

  • core/opal: Fix recursion check in opal_run_pollers()

    An earlier commit introduced a counter variable poller_recursion to
    limit the number of error messages shown when opal_pollers
    are run recursively. However the check for the counter value was
    placed in a way that the poller recursion was only detected the first 16
    times and then allowed afterwards.

    This patch fixes this by moving the check for the counter value inside
    the conditional branch with some re-factoring so that opal_poller
    recursion is not erroneously allowed after poller recursion is detected
    the first 16 times.
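
    The difference between the two placements of the counter check can be
    sketched as below. This is a simplified model written for illustration.
    The Pollers class, its fields and the way recursion is simulated are
    hypothetical and are not the code in opal_run_pollers().

```python
MAX_REPORTS = 16


class Pollers:
    def __init__(self, fixed):
        self.fixed = fixed
        self.in_poller = False
        self.reports = 0         # error messages printed so far
        self.recursive_runs = 0  # recursive runs that slipped through

    def _recursion_blocked(self):
        if not self.in_poller:
            return False
        if self.fixed:
            # Counter check sits inside the branch: it limits only the message.
            if self.reports < MAX_REPORTS:
                self.reports += 1
            return True
        # Buggy variant: the counter is part of the detection condition.
        if self.reports < MAX_REPORTS:
            self.reports += 1
            return True
        return False             # after 16 reports, recursion is allowed

    def run(self, recurse=False):
        if self._recursion_blocked():
            return
        if self.in_poller:
            self.recursive_runs += 1
            return               # record the slip, do not nest further
        self.in_poller = True
        if recurse:
            self.run()           # a poller that re-enters run()
        self.in_poller = False
```

    Driving both variants with 20 recursive attempts shows the buggy one
    letting the last 4 through, while the fixed one refuses all of them and
    still prints at most 16 messages.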

  • phb4: Print WOF registers on fence detect

    Without the WOF registers it's hard to figure out what went wrong first,
    so print those when we print the FIRs when a fence is detected.

  • p9dsu: detect variant in init only if probe fails to find it.

    Currently the slot table init happens twice, in both the probe and init
    functions, because the variant detection logic is called with an
    incorrect condition check.

Since 6.0-rc1:

  • core/direct-controls: improve p9_stop_thread error handling

    p9_stop_thread should fail the operation if it finds the thread was
    already quiesced. This implies something else is doing direct controls
    on the thread (e.g., pdbg) or there is some exceptional condition we
    don't know how to deal with. Proceeding here would cause things to
    trample on each other, for example the hard lockup watchdog trying to
    send a sreset to the core while it is stopped for debugging with pdbg
    will end in tears.

    If p9_stop_thread times out waiting for the thread to quiesce, do
    not hit it with a core_start direct control, because we don't know
    what state things are in and doing more things at this point is worse
    than doing nothing. There is no good recipe described in the workbook
    to de-assert the core_stop control if it fails to quiesce the thread.
    After timing out here, the thread may eventually quiesce and get
    stuck, but that's simpler to debug than undefined behaviour.

  • core/direct-controls: fix p9_cont_thread for stopped/inactive threads

    Firstly, p9_cont_thread should check that the thread actually was
    quiesced before it tries to resume it. Anything could happen if we
    try this from an arbitrary thread state.

    Then when resuming a quiesced thread that is inactive or stopped (in
    a stop idle state), we must not send a core_start direct control,
    clear_maint must be used in these cases.

  • hmi: Clear unknown debug trigger

    On some systems, we see hangs like this when Linux starts: ::

    [ 170.027252763,5] OCC: All Chip Rdy after 0 ms
    [ 170.062930145,5] INIT: Starting kernel at 0x20011000, fdt at 0x30ae0530 366247 bytes)
    [ 171.238270428,5] OPAL: Switch to little-endian OS
    

    If you look at the in-memory skiboot console (or do
    nvram -p ibm,skiboot --update-config log-level-driver=7) we see the
    console get spammed with: ::

    [ 5209.109790675,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
    [ 5209.109792716,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
    [ 5209.109794695,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
    [ 5209.109796689,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
    

    We're taking the debug trigger (bit 17) early on, before the
    hmi_debug_trigger function in the kernel is set up.

    This clears the HMI in Skiboot and reports to the kernel instead of
    bringing down the machine.
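
    The HMER value in the log above can be decoded with a small helper. The
    sketch assumes that bit numbers in this entry count from the most
    significant bit of a 64-bit register, which is consistent with the value
    0x0000400000000000 being described as bit 17. The function name ibm_bits
    is made up for the example.

```python
def ibm_bits(value, width=64):
    """List the set bits of value, numbering bit 0 as the most significant."""
    return [bit for bit in range(width) if value & (1 << (width - 1 - bit))]


if __name__ == "__main__":
    # HMER value taken from the console spam shown above.
    print(ibm_bits(0x0000400000000000))
```

    The same helper can be applied to other register dumps in these notes,
    under the same numbering assumption.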

  • core/hmi: assign flags=0 in case nothing set by handle_hmi_exception

    Theoretically we could have returned junk to the OS in this parameter.

  • SLW: Fix mambo boot to use stop states

    After commit 35c66b8 ("SLW: Move MAMBO simulator checks to
    slw_init"), mambo boot no longer calls add_cpu_idle_state_properties()
    and as such we never enable stop states.

    After adding the call back, we get more testing coverage as well
    as faster mambo SMT boots.

  • phb4: Hardware init updates

    CFG Write Request Timeout was incorrectly set to informational and not
    fatal for both non-CAPI and CAPI, so set it to fatal. This was a
    mistake in the specification. Correcting this fixes a niche bug in
    escalation (which is necessary on pre-DD2.2) that can cause a checkstop
    due to a NCU timeout.

    In addition, set the values in the timeout control registers to match.
    This fixes an extremely rare and unreproducible bug, though the current
    timings don't make sense since they're higher than the NCU timeout (16)
    which will checkstop the machine anyway.

  • SLW: quieten 'Configuring self-restore' for DARN,NCU_SPEC_BAR and HRMOR

Since 5.11:

  • core: Fix iteration condition to skip garded cpu

  • uart: fix uart_opal_flush to take console lock over uart_con_flush

    This bug meant that OPAL_CONSOLE_FLUSH didn't take the appropriate locks.
    Luckily, this call is currently only used in the crash path.

  • xive: fix missing unlock in error path

  • OPAL_PCI_SET_POWER_STATE: fix locking in error paths

    Otherwise we could exit OPAL holding locks, potentially leading
    to all sorts of problems later on.

  • hw/slw: Don't assert on an unknown chip

    For some reason skiboot populates nodes in /cpus/ for the cores on
    chips that are deconfigured. As a result Linux includes the threads
    of those cores in its set of possible CPUs in the system and attempts
    to set the SPR values that should be used when waking a thread from
    a deep sleep state.

    However, in the case where we have a deconfigured chip, we don't create
    an xscom node for that chip and as a result we don't have a proc_chip
    structure for that chip either. In turn, this results in an assertion
    failure when calling opal_slw_set_reg() since it expects the chip
    structure to exist. Fix this up and print an error instead.

  • opal/hmi: Generate one event per core for processor recovery.

    Processor recovery is a per-core error. All threads on that core receive
    the HMI, but they don't all need to generate an HMI event for the same
    error.

    Let only thread 0 generate the event.

  • sensors: Don't add DTS sensors when OCC inband sensors are available

    There are two sets of core temperature sensors today. One is DTS scom
    based core temperature sensors and the second group is the sensors
    provided by OCC. DTS is the highest temperature among the different
    temperature zones in the core while OCC core temperature sensors are
    the average temperature of the core. DTS sensors are read directly by
    the host by SCOMing the DTS sensors while OCC sensors are read and
    updated by OCC to main memory.

    Reading DTS sensors by SCOMing is a heavier and slower operation
    compared to reading OCC sensors, which is as good as reading memory.
    So don't add DTS sensors when OCC sensors are available.

  • core/fast-reboot: Increase timeout for dctl sreset to 1sec

    Direct control xscom can take more time to complete. We seem to
    wait too little on Boston, failing fast-reboot for no good reason.

    Increase timeout to 1 sec as a reasonable value for sreset to be delivered
    and core to start executing instructions.

  • occ: sensors-groups: Add DT properties to mark HWMON sensor groups

    Fix the sensor type to match HWMON sensor types. Add compatible flag
    to indicate the environmental sensor groups so that operations on
    these groups can be handled by HWMON linux interface.

  • core: Correctly load initramfs in stb container

    Skiboot does not calculate the actual size and start location of the
    initramfs if it is wrapped by an STB container (for example if loading
    an initramfs from the ROOTFS partition).

    Check if the initramfs is in an STB container and determine the size and
    location correctly in the same manner as the kernel. Since
    load_initramfs() is called after load_kernel() move the call to
    trustedboot_exit_boot_services() into load_and_boot_kernel() so it is
    called after both of these.

  • hdat/i2c.c: quieten "v2 found, parsing as v1"

  • hw/imc: Check for pause_microcode_at_boot() return status

    pause_microcode_at_boot() loops through each chip's ucode
    control block and pauses the ucode if it is in the running state.
    But it does not fail if any of the chips' ucode is not initialised.

    Add code to return a failure if the ucode is not initialised on any
    of the chips. Since pause_microcode_at_boot() is called just before
    attaching the IMC device nodes in imc_init(), add code to check the
    function's return value.

Slot location code fixes:

  • npu2: Use ibm,loc-code rather than ibm,slot-label

    The ibm,slot-label property is to name the slot that appears under a
    PCIe bridge. In the past we (ab)used the slot tables to attach names
    to GPU devices and their corresponding NVLinks which resulted in npu2.c
    using slot-label as a location code rather than as a way to name slots.

    Fix this up since it's confusing.

  • hdata/slots: Apply slot label to the parent slot

    Slot names only really make sense when applied to an actual slot rather
    than a device. On witherspoon the GPU devices have a name associated with
    the device rather than the slot for the GPUs. Add a hack that moves the
    slot label to the parent slot rather than on the device itself.

  • pci-dt-slot: Big ol' cleanup

    The underlying data that we get from HDAT can only really describe a
    PCIe system. As such we can simplify the devicetree slot lookup code
    by only caring about the important cases, namely root ports and switch
    downstream ports.

    This also fixes a bug where root ports didn't get a slot label applied,
    which results in devices under that port not having ibm,loc-code set.
    This in turn results in the EEH core being unable to report the location
    of EEHed devices under that port.

opal-prd
^^^^^^^^

  • opal-prd: Insert powernv_flash module

    Explicitly load the powernv_flash module on BMC based systems so that we
    are sure that the flash device is created before starting the opal-prd
    daemon.

    Note that I have replaced the pnor_available() check with is_fsp_system(),
    as we want to load the module on BMC systems only. Also, pnor_init has
    enough logic to detect the flash device, hence pnor_available() becomes a
    redundant check.

NPU2/NVLINK2
^^^^^^^^^^^^

  • npu2/hw-procedures: fence bricks on GPU reset

    The NPU workbook defines a way of fencing a brick and
    getting the brick out of fence state. We do have an implementation
    of bringing the brick out of fenced/quiesced state. We do
    the latter in our procedures, but to support run time reset
    we need to do the former.

    The fencing ensures that access to memory behind the links
    will not lead to HMIs, but instead SUEs will be populated
    in cache (in the case of speculation). The expectation is then
    that prior to and after reset, the operating system components
    will flush the cache for the region of memory behind the GPU.

    This patch does the following:

    1. Implements a npu2_dev_fence_brick() function to set/clear
      the fence state
    2. Clears FIR bits prior to clearing the fence status
    3. Clears the fence status
    4. Takes the powerbus out of CQ fence much later,
      in credits_check(), which is the last hardware procedure
      called after link training.

  • hw/npu2.c: Remove static configuration of NPU2 register

    The NPU_SM_CONFIG0 register currently needs to be configured in Skiboot to
    select NVLink mode, however Hostboot should configure other bits in this
    register.

    For some reason Skiboot was explicitly clearing bit-6
    (CONFIG_DISABLE_VG_NOT_SYS). It is unclear why this bit was getting cleared
    as recent Hostboot versions explicitly set it to the correct value based on
    the specific system configuration. Therefore Skiboot should not alter it.

    Bit-58 (CONFIG_NVLINK_MODE) selects if NVLink mode should be enabled or
    not. Hostboot does not configure this bit so Skiboot should continue to
    configure it.
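
    The split of ownership described above can be modelled as a
    read-modify-write that touches only bit 58. This is a Python sketch, not
    skiboot code: `ppc_bit()` and `enable_nvlink_mode()` are hypothetical
    names, and it assumes IBM bit numbering, where bit 0 is the most
    significant bit of the 64-bit register.

```python
# Illustrative model only; skiboot itself is C and uses its own register helpers.

def ppc_bit(n: int) -> int:
    """Mask for bit n in IBM numbering (bit 0 = MSB of a 64-bit word)."""
    return 1 << (63 - n)

CONFIG_DISABLE_VG_NOT_SYS = ppc_bit(6)   # left to Hostboot
CONFIG_NVLINK_MODE = ppc_bit(58)         # still configured by Skiboot

def enable_nvlink_mode(sm_config0: int) -> int:
    """Set only the NVLink mode bit and preserve everything Hostboot wrote."""
    return sm_config0 | CONFIG_NVLINK_MODE
```

    With this shape, whatever value Hostboot chose for bit 6 survives the
    update unchanged.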

  • npu2: Improve log output of GPU-to-link mapping

    Debugging issues related to unconnected NVLinks can be a little less
    irritating if we use the NPU2DEV{DBG,INF}() macros instead of prlog().

    In short, change this: ::

    NPU2: comparing GPU 'GPU2' and NPU2 'GPU1'
    NPU2: comparing GPU 'GPU3' and NPU2 'GPU1'
    NPU2: comparing GPU 'GPU4' and NPU2 'GPU1'
    NPU2: comparing GPU 'GPU5' and NPU2 'GPU1'
          :
    npu2_dev_bind_pci_dev: No PCI device for NPU2 device 0006:00:01.0 to bind to. If you expect a GPU to be there, this is a problem.
    

    to this: ::

    NPU6:0:1.0 Comparing GPU 'GPU2' and NPU2 'GPU1'
    NPU6:0:1.0 Comparing GPU 'GPU3' and NPU2 'GPU1'
    NPU6:0:1.0 Comparing GPU 'GPU4' and NPU2 'GPU1'
    NPU6:0:1.0 Comparing GPU 'GPU5' and NPU2 'GPU1'
          :
    NPU6:0:1.0 No PCI device found for slot 'GPU1'
    
  • npu2: Move NPU2_XTS_BDF_MAP_VALID assignment to context init

    A bad GPU or other condition may leave us with a subset of links that
    never get initialized. If an ATSD is sent to one of those bricks, it
    will never complete, leaving us waiting forever for a response: ::

    watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [acos:2050]
    ...
    Modules linked in: nvidia_uvm(O) nvidia(O)
    CPU: 23 PID: 2050 Comm: acos Tainted: G W O 4.14.0 #2
    task: c0000000285cfc00 task.stack: c000001fea860000
    NIP: c0000000000abdf0 LR: c0000000000acc48 CTR: c0000000000ace60
    REGS: c000001fea863550 TRAP: 0901 Tainted: G W O (4.14.0)
    MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28004484 XER: 20040000
    CFAR: c0000000000abdf4 SOFTE: 1
    GPR00: c0000000000acc48 c000001fea8637d0 c0000000011f7c00 c000001fea863820
    GPR04: 0000000002000000 0004100026000000 c0000000012778c8 c00000000127a560
    GPR08: 0000000000000001 0000000000000080 c000201cc7cb7750 ffffffffffffffff
    GPR12: 0000000000008000 c000000003167e80
    NIP [c0000000000abdf0] mmio_invalidate_wait+0x90/0xc0
    LR [c0000000000acc48] mmio_invalidate.isra.11+0x158/0x370

    ATSDs are only sent to bricks which have a valid entry in the XTS_BDF
    table. So to prevent the hang, don't set NPU2_XTS_BDF_MAP_VALID unless
    we make it all the way to creating a context for the BDF.

Secure and Trusted Boot
^^^^^^^^^^^^^^^^^^^^^^^

  • hdata/tpmrel: detect tpm not present by looking up the stinfo->status

    Skiboot detects if tpm is present by checking if a secureboot_tpm_info
    entry exists. However, if a tpm is not present, hostboot also creates a
    secureboot_tpm_info entry. In this case, hostboot creates an empty
    entry, but sets the field tpm_status to TPM_NOT_PRESENT.

    This patch detects if tpm is not present by looking up the stinfo->status.

    This fixes the "TPMREL: TPM node not found for chip_id=0 (HB bug)"
    issue, reproduced when skiboot is running on a system that has no tpm.
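
    The difference between "an entry exists" and "a TPM is present" can be
    sketched as follows. This is a Python model, not the skiboot
    implementation: the `TpmInfo` class and `tpm_is_present()` are
    hypothetical, and the numeric status codes are placeholders chosen for
    illustration.

```python
# Placeholder status codes; the real values come from the HDAT definitions
# and are not given in these notes.
TPM_PRESENT = 1
TPM_NOT_PRESENT = 3

class TpmInfo:
    """Hypothetical stand-in for a secureboot_tpm_info entry."""
    def __init__(self, chip_id, status):
        self.chip_id = chip_id
        self.status = status

def tpm_is_present(entry):
    """An entry existing is not enough; its status must not be NOT_PRESENT."""
    if entry is None:
        return False
    return entry.status != TPM_NOT_PRESENT
```

    Under the old rule, the empty entry created by hostboot would have
    counted as a TPM, which is what led to the "TPM node not found" message.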

PCI
^^^

  • phb4: Restore bus numbers after CRS

    Currently we restore PCIe bus numbers right after the link is
    up. Unfortunately at this point we haven't done CRS so config space
    may not be accessible.

    This moves the bus number restore till after CRS has happened.

  • romulus: Add a barebones slot table

  • phb4: Quieten and improve "Timeout waiting for electrical link"

    This happens normally if a slot doesn't have a working HW presence
    detect and relies instead on inband presence detect.

    The message we display is scary and not very useful unless you
    are debugging, so quieten it and change it to something more
    meaningful.

  • pcie-slot: Don't fail powering on an already on switch

    If the power state is already the required value, return
    OPAL_SUCCESS rather than OPAL_PARAMETER to avoid spurious
    errors during boot.
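
    A minimal Python sketch of the early return, assuming a hypothetical
    `Slot` object that tracks its power state. The function name and the
    state change are stand-ins for the real slot code; OPAL_SUCCESS is 0 in
    the OPAL API.

```python
OPAL_SUCCESS = 0

class Slot:
    """Hypothetical slot model with a single on/off power state."""
    def __init__(self, power_on):
        self.power_on = power_on
        self.transitions = 0

def set_power_state(slot, want_on):
    # Already in the requested state: nothing to do, and not an error.
    # (Before the fix this case was reported as OPAL_PARAMETER.)
    if slot.power_on == want_on:
        return OPAL_SUCCESS
    slot.power_on = want_on      # stand-in for the real power sequence
    slot.transitions += 1
    return OPAL_SUCCESS
```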

CAPI/OpenCAPI
^^^^^^^^^^^^^

  • capi: Keep the current mmio windows in the mbt cache table.

    When the phb is used as a CAPI interface, the current mmio windows list
    is cleaned before adding the capi and the prefetchable memory (M64)
    windows, which implies that the non-prefetchable BAR is no longer
    configured.
    This patch sets only the mbt bar needed to pass the capi mmio window
    and keeps the other mmio values (M32 and M64) as defined.

  • npu2-opencapi: Fix 'link internal error' FIR, take 2

    When setting up an opencapi link, we set the transport muxes first,
    then set the PHY training config register, which includes disabling
    nvlink mode for the bricks. That's the order of the init sequence, as
    found in the NPU workbook.

    In reality, doing so works, but it raises 2 FIR bits in the PowerBus
    OLL FIR Register for the 2 links when we configure the transport
    muxes. Presumably because nvlink is not disabled yet and we are
    configuring the transport muxes for opencapi.

    bit 60: link0 internal error
    bit 61: link1 internal error

    Overall the current setup ends up being correct and everything works,
    but we raise 2 FIR bits.

    So tweak the order of operations to disable nvlink before configuring
    the transport muxes. Incidentally, this is what the scripts from the
    opencapi enablement team were doing all along.

  • npu2-opencapi: Fix 'link internal error' FIR, take 1

    When we set up a link, we always enable ODL0 and ODL1 at the same time
    in the PHY training config register, even though we are setting up
    only one OTL/ODL, so it raises a "link internal error" FIR bit in the
    PowerBus OLL FIR Register for the second link. The error is harmless,
    as we'll eventually set up the second link, but there's no reason to
    raise that FIR bit.

    The fix is simply to only enable the ODL we are using for the link.

  • phb4: Do not set the PBCQ Tunnel BAR register when enabling capi mode.

    The cxl driver will set the capi value, like other drivers already do.

  • phb4: set TVT1 for tunneled operations in capi mode

    The ASN indication is used for tunneled operations (as_notify and
    atomics). Tunneled operation messages can be sent in PCI mode as
    well as CAPI mode.

    The address field of as_notify messages is hijacked to encode the
    LPID/PID/TID of the target thread, so those messages should not go
    through address translation. Therefore bit 59 is part of the ASN
    indication.

    This patch sets TVT#1 in bypass mode when capi mode is enabled,
    to prevent as_notify messages from being dropped.

Debugging/Testing improvements

Since 6.0-rc1:

  • mambo: Enable XER CA32 and OV32 bits on P9

    POWER9 adds 32 bit carry and overflow bits to the XER, but we need to
    set the relevant CTRL1 bit to enable them.

  • Makefile: Fix building natively on ppc64le

    When on ppc64le and CROSS is not set by the environment, make assumes
    ppc64 and sets a default CROSS. Check for ppc64le as well, so that
    'make' works out of the box on ppc64le.
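
    The decision the Makefile now makes can be modelled like this. It is a
    Python sketch of the logic only, not the Makefile itself: `pick_cross()`
    is a hypothetical name and the default prefix is a placeholder, since
    these notes do not say which prefix is used.

```python
NATIVE_MACHINES = ("ppc64", "ppc64le")

def pick_cross(machine, env_cross=None, default_cross="powerpc64-linux-"):
    """Return the toolchain prefix to use for a build on `machine`."""
    if env_cross is not None:
        return env_cross         # an explicit CROSS always wins
    if machine in NATIVE_MACHINES:
        return ""                # native build: no cross prefix needed
    return default_cross         # placeholder default for other hosts
```

    For the current host this could be called as
    `pick_cross(platform.machine(), os.environ.get("CROSS"))`.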

  • Experimental support for building with Clang

  • Improvements to testing and Travis CI

Since 5.11:

  • core/stack: backtrace unwind basic OPAL call details

    Put OPAL callers' r1 into the stack back chain, and then use that to
    unwind back to the OPAL entry frame (as opposed to boot entry, which
    has a 0 back chain).

    From there, dump the OPAL call token and the caller's r1. A backtrace
    looks like this: ::

    CPU 0000 Backtrace:
     S: 0000000031c03ba0 R: 000000003001a548   ._abort+0x4c
     S: 0000000031c03c20 R: 000000003001baac   .opal_run_pollers+0x3c
     S: 0000000031c03ca0 R: 000000003001bcbc   .opal_poll_events+0xc4
     S: 0000000031c03d20 R: 00000000300051dc   opal_entry+0x12c
     --- OPAL call entry token: 0xa caller R1: 0xc0000000006d3b90 ---
    

    This is pretty basic for the moment, but it does give you the bottom
    of the Linux stack. It will allow some interesting improvements in
    future.

    First, with the eframe, all the call's parameters can be printed out
    as well. The ___backtrace / ___print_backtrace API needs to be
    reworked in order to support this, but it's otherwise very simple
    (see opal_trace_entry()).

    Second, it will allow Linux's stack to be passed back to Linux via
    a debugging opal call. This will allow Linux's BUG() or xmon to
    also print the Linux back trace in case of an NMI or MCE or watchdog
    lockup that hits in OPAL.

  • asm/head: implement quiescing without stack or clobbering regs

    Quiescing currently is implemented in C in opal_entry before the
    opal call handler is called. This works well enough for simple
    cases like fast reset when one CPU wants all others out of the way.

    Linux would like to use it to prevent an sreset IPI from
    interrupting firmware, which could lead to deadlocks when crash
    dumping or entering the debugger. Linux interrupts do not recover
    well when returning back to general OPAL code, due to r13 not being
    restored. OPAL also can't be re-entered, which may happen e.g.,
    from the debugger.

    So move the quiesce hold/reject to entry code, before the stack or
    r1 or r13 registers are switched. OPAL can be interrupted and
    returned to or re-entered during this period.

    This does not completely solve all such problems. OPAL will be
    interrupted with sreset if the quiesce times out, and it can be
    interrupted by MCEs as well. These still have the issues above.

  • core/opal: Allow poller re-entry if OPAL was re-entered

    If an NMI interrupts the middle of running pollers and the OS
    invokes pollers again (e.g., for console output), the poller
    re-entrancy check will prevent it from running and spam the
    console.

    That check was designed to catch a poller calling opal_run_pollers;
    OPAL re-entrancy is something different and is detected elsewhere.
    Avoid the poller recursion check if OPAL has been re-entered. This
    is a best-effort attempt to cope with errors.
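
    The relaxed check can be summarised as a small predicate. This is a
    Python sketch with hypothetical names; the real code tracks these
    conditions in per-CPU state rather than taking them as arguments.

```python
def may_run_pollers(in_poller, opal_reentered):
    """Decide whether a request to run pollers should proceed."""
    if opal_reentered:
        # Best effort: OPAL re-entry (e.g. after an NMI) is detected
        # elsewhere, so do not treat it as poller recursion.
        return True
    # Otherwise keep the original guard against a poller calling
    # opal_run_pollers from inside itself.
    return not in_poller
```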

  • core/opal: Emergency stack for re-entry

    This detects OPAL being re-entered by the OS, and switches to an
    emergency stack if it was. This protects the firmware's main stack
    from re-entrancy and allows the OS to use NMI facilities for crash
    / debug functionality.

    Further nested re-entry will destroy the previous emergency stack
    and prevent returning, but those should be rare cases.

    This stack is sized at 16kB, which doubles the size of CPU stacks,
    so as not to introduce a regression in primary stack size. The 16kB
    stack originally had a 4kB machine check stack at the top, which was
    removed by 80eee19 ("opal: Remove machine check interrupt patching
    in OPAL."). So it is possible the size could be tightened again, but
    that would require further analysis.

  • hdat_to_dt: hash_prop the same on all platforms

    Fixes this unit test on ppc64le hosts.

  • mambo: Add persistent memory disk support

    This adds support for mapping disk images using persistent
    memory. Disks can be added by setting this environment variable:

    PMEM_DISK="/mydisks/disk1.img,/mydisks/disk2.img"

    These will show up in Linux as /dev/pmem0 and /dev/pmem1.

    This uses a new feature in mambo "mysim memory mmap .." which is only
    available since mambo commit 0131f0fc08 (from 24/4/2018).

    This also needs the of_pmem.c driver in Linux which is only available
    since v4.17. It works with powernv_defconfig + CONFIG_OF_PMEM.
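
    As an illustration of how the variable maps to devices, the following
    Python sketch splits a PMEM_DISK value and pairs each image with a
    device name. It assumes devices are numbered in the order the images
    are listed, as the example above suggests; `pmem_devices()` is a
    hypothetical helper and not part of the mambo scripts.

```python
def pmem_devices(pmem_disk):
    """Map each image in a PMEM_DISK value to its expected /dev/pmemN name."""
    images = [path for path in pmem_disk.split(",") if path]
    return {f"/dev/pmem{i}": path for i, path in enumerate(images)}
```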

  • external/mambo: Add di command to decode instructions

    By default you get 16 instructions but you can specify the number you
    want, e.g. ::

    systemsim % di 0x100 4
    0x0000000000000100: Enc:0xA64BB17D : mtspr   HSPRG1,r13
    0x0000000000000104: Enc:0xA64AB07D : mfspr   r13,HSPRG0
    0x0000000000000108: Enc:0xF0092DF9 : std     r9,0x9F0(r13)
    0x000000000000010C: Enc:0xA6E2207D : mfspr   r9,PPR
    

    Using di since it's what xmon uses.
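
    The Enc values in the sample output appear to be shown in little-endian
    byte order. That is an inference from the listing, not something the
    commit states. Under that assumption, swapping the bytes gives the
    usual big-endian instruction word, from which the primary opcode (top 6
    bits) and the first register field (next 5 bits) can be read. The helper
    names in this Python sketch are hypothetical.

```python
def swap32(word):
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(word.to_bytes(4, "little"), "big")

def primary_opcode(enc):
    """Top 6 bits of the instruction word after byte swapping."""
    return swap32(enc) >> 26

def reg_field(enc):
    """The 5-bit RS/RT field that follows the primary opcode."""
    return (swap32(enc) >> 21) & 0x1F
```

    For the first line of the listing, `primary_opcode(0xA64BB17D)` gives
    31, the primary opcode shared by mtspr and mfspr, and
    `reg_field(0xA64BB17D)` gives 13, matching r13.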

  • mambo/mambo_utils.tcl: Inject an MCE at a specified address

    Currently we don't support injecting an MCE on a specific address.
    This is useful for testing functionality like memcpy_mcsafe()
    (see https://patchwork.ozlabs.org/cover/893339/)

    The core of the functionality is a routine called
    inject_mce_ue_on_addr, which takes an addr argument and injects
    an MCE (load/store with UE) when the specified address is accessed
    by code. This functionality can easily be enhanced to cover
    instruction UE's as well.

    A sample use case to create an MCE on stack access would be ::

    set addr [mysim display gpr 1]
    inject_mce_ue_on_addr $addr

    This would cause an MCE on any r1 or r1-based access.

  • external/mambo: improve helper for machine checks

    Improve workarounds for stop injection, because mambo often will
    trigger on 0x104/204 when injecting sreset/mces.

    This also adds a workaround to skip injecting on reservations to
    avoid infinite loops when doing inject_mce_step.

  • travis: Enable ppc64le builds

    At least on the IBM Travis Enterprise instance, we can now do
    ppc64le builds!

    We can only build a subset of our matrix due to availability of
    ppc64le distros. The Dockerfiles need some tweaking to only
    attempt to install (x86_64 only) Mambo binaries, as well as the
    build scripts.

  • external: Add "lpc" tool

    This is a little front-end to the lpc debugfs files to access
    the LPC bus from userspace on the host.

  • core/test/run-trace: fix on ppc64el