Skiboot 5.4.0-rc3 #696
retest this please |
Can someone who has any idea how to read HBRT logs and decipher what on earth could have gone wrong please look at this failure? That it takes 300kb of log to say nothing is a bug in itself, of course... |
@jazurin maybe you possess the magic decoder ring? |
@cjcain or @wilbryan could you please take a look at this failure? CI failed twice while trying to run |
A couple things I see right away:
|
The 2A02 error is an informational error that gets generated based on the reset. I believe these are always generated on the reset. Is this log what is triggering the CI failure? No clue about the Uncorrectable ECC error. |
The test timed out while running
The collected syslog shows an "Uncorrectable ECC error" while reading from PNOR.
@dcrowell77 does HBRT try to save the OCC informational error logs to PNOR ? Any recent changes that went in HBRT that would affect PNOR functionality recently? Thanks. |
Yes, we are trying to save the error log to the PNOR. |
As part of the reset, we communicate with the OCC and notify them that they will be reset and they have some FFDC data that we collect (to be used to debug why a reset may have been done). This is all a normal part of the OCC reset. |
retest this please |
(seeing if this is intermittent enough to only fail 2/3 times) |
I fired up skiboot 5.4.0-rc1 on the firestone version of secureboot tree w/ .. with secure mode enabled; it failed with a CAPP verification error ... Is the secure driver supposed to be signing that? As far as I know that's not being done. |
retest this please |
@bofferdn (as sent in email, but reproducing here for completeness): Your CAPP partition is likely incorrect for STB. Due to bugs in ffs The CAPP partition contains "sub partitions", that is, we have a table The way that code worked is that you'd read the first block of the CAPP With a secure boot container, it only signs the whole partition. Which Which is where the problems start... FFS (that giant bit of unmaintained code with zero tests and eight But it's worse than that, because if you reflash a CAPP partition Yes, for some reason we have ECC on the CAPP partition but not on the As you can measure something without a STB header (of course), if On top of that, it seems that the AMI BMC may incorporate this code that Luckily, nobody inside IBM development does that as nobody knows how to So, we use pflash. pflash does correctly set the actualSize header, so The work-around is to break CAPI by putting crap in the CAPP partition: And this is why it hasn't mattered previously, HOSTBOOT just read the But, IT GETS WORSE. So there's a current bug where we'll measure the BOOTKERNEL partition So the measurement for BOOTKERNEL will be constant, if you have two So, we need to go and rework the code around loading every bit of thing Why should we work out the size rather than defaulting to reading the To workaround that issue, write crap to your CAPP partition if enforcing signature verification, that way we'll fail to load the CAPP ucode and gracefully continue booting. |
That is entirely an implementation choice. Hostboot has the exact same problem with the partitions that contain scan ring data. There is a sub-partition for each processor revision. They can sign at a sub-partition level so that the lab can update each sub-partition independently. |
Unmaintained? This code came from the FSP driver team and they are the ones that invented ffs. The ffs code is maintained by them internally to resolve issues related to FSP. We might have some issue with the back-and-forth flow of this code.
"buggy"? This was done on purpose. The FSP code that uses ffs natively, for their own NOR device, actually uses the 'actualSize' field I believe. But in that case they are the sole owner of their chip. For the PNOR chip we decided very early in the FSP/Hostboot program that we would not use that field and instead always set them equal. Since there are two owners of the PNOR chip, you have to make sure they agree on how that field is managed. We decided to save the pain of having FSP and Hostboot, BMC and Hostboot, and There is also the huge problem of the golden side... In systems that have a golden side, we are not supposed to ever touch the golden side. It could, in fact, be a locked NOR module. But, the golden side contains information pointing to partitions on the non-golden side for "sideless" data. In addition you have the primary and backup TOC. So you'd also have to modify pflash (and all the BMC implementations) to modify 4 TOCs whenever a user modifies a partition. And hopefully your hardware implementation wasn't a lockable flash for the Golden side. And, what do we do for partitions that contain modifiable data? Are we going to have the Hostboot code update the TOC entry every time it adds a new error log or guard record? This is not only dangerous but also likely to contribute to premature flash wear-out. There is a lot more to this problem than simply "fixing a tool". |
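For readers following along, the "always set them equal" convention described above can be expressed as a simple invariant check. This is only a sketch: `TocEntry`, its field names, and the sizes below are hypothetical stand-ins, not the real ffs on-flash layout or any existing tool's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TocEntry:
    """Hypothetical, simplified view of one partition entry in a TOC."""
    name: str
    size: int         # space allocated to the partition (bytes, assumed unit)
    actual_size: int  # the 'actualSize' field discussed in the thread


def mismatched_entries(toc: List[TocEntry]) -> List[str]:
    """Return names of entries that break the 'actualSize == size' convention."""
    return [entry.name for entry in toc if entry.actual_size != entry.size]


# Illustrative values only; real partition sizes will differ.
example_toc = [
    TocEntry("CAPP", 0x24000, 0x24000),
    TocEntry("BOOTKERNEL", 0xF00000, 0x800000),
]
```

With the example table, `mismatched_entries(example_toc)` reports only `BOOTKERNEL`, which is the kind of entry a tool that does set the actual size would produce.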
A key point I would like to make here is that by using xz compression, Hostboot only loads exactly as much as needed because the xz stream has an EOF marker. Since we use virtual memory operations to load PNOR pages on-demand we never look at the whole PAYLOAD partition. Maybe your answer is to xz compress most of the partitions that can be xz compressed. |
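The end-of-stream behaviour mentioned above can be illustrated with Python's standard `lzma` module. This is a minimal sketch, not Hostboot's loader: the page size, the 0xFF padding, and the helper names `make_partition` and `load_payload` are assumptions made for the example.

```python
import lzma
from typing import Tuple

PAGE_SIZE = 4096  # assumed page size, for illustration only


def make_partition(payload: bytes, partition_size: int) -> bytes:
    """Build a fake partition: an xz stream followed by 0xFF padding."""
    stream = lzma.compress(payload, format=lzma.FORMAT_XZ)
    if len(stream) > partition_size:
        raise ValueError("compressed payload does not fit in the partition")
    return stream + b"\xff" * (partition_size - len(stream))


def load_payload(partition: bytes) -> Tuple[bytes, int]:
    """Decompress page by page, stopping once the xz stream reports its end.

    Returns the decompressed payload and the number of pages actually read.
    """
    decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    out = bytearray()
    pages_read = 0
    for offset in range(0, len(partition), PAGE_SIZE):
        out += decomp.decompress(partition[offset:offset + PAGE_SIZE])
        pages_read += 1
        if decomp.eof:  # end-of-stream marker seen; the rest is padding
            break
    if not decomp.eof:
        raise ValueError("partition ended before the xz stream did")
    return bytes(out), pages_read
```

Because the loop stops as soon as `decomp.eof` is set, the reader never needs to know how much of the partition is in use; the padding after the stream is simply never read.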
Patrick Williams [email protected] writes:
I've been waiting on review for 10 months on these: open-power/ffs#10 and the last commit was 18 months ago. If we're going to rely on code, we should ensure someone is looking
It would help if this was documented anywhere, as pflash has been doing We have something that has ended up being ABI that is undocumented,
Although those pointers are to whole-partition structures so it isn't Even then though, errors in that sideless data can cause the golden side
The current design of "file" size = whole partition is the elegant
Stewart Smith |
Patrick Williams [email protected] writes:
This has the issue where if by loading an alternate subpartition could Stewart Smith |
Patrick Williams [email protected] writes:
Yeah, the XZ stream having the EOF marker and Hostboot's on demand My RFC hostboot patch (I need to go and make that not RFC but actually I've been thinking of extending/modifying the interface so that it's Stewart Smith |
Wow... mail got delayed by days. I wish IBM could get email remotely correct. |
Anyway, skiboot-5.4.0-rc2 (to be released later today) will address the partition size issue. The remaining issue is this OCC test (where is that test by the way, I don't recognize something from op-test-framework?). Has anyone been able to further decode what's going on? If it's an error talking to PNOR, looking at the skiboot log could be enlightening too. |
Force-pushed from 9c59d4b to d7c7315.
retest this please |
@jk-ozlabs is this some ancient kernel you built way back when? @jazurin what OS is this machine running? It looks like something ancient. Remotely recent opal-prd even? |
@stewart-ibm that habanero system is running ubuntu 14.10
Looks like @pridhiviraj was able to reproduce the same problem see: #709 |
Ubuntu 14.10 reached End of Life on Thursday, July 23rd 2015. It has not received security updates nor bugfixes for the past 15 months. I suggest that machine is upgraded to the latest LTS (Ubuntu 16.04). |
Joel Stanley [email protected] writes:
There's that too. Although we shouldn't regress functionality with old Stewart Smith |
Signed-off-by: Stewart Smith <[email protected]>
Force-pushed from d7c7315 to 306303a.
ffs_index was used to ensure we updated the ffs header with the actual size. However, the ffs_index was hardcoded to -1 and never updated, so this code was never executed. Secondly, recent discussion[1] on the open-power bug tracker suggests that this was never something that should be done. [1] open-power/op-build#696 (comment) Change-Id: I302b48213561c4d4490927fa0953c65a52d82c11 Signed-off-by: Joel Stanley <[email protected]>