-
Notifications
You must be signed in to change notification settings - Fork 751
boot: bootutil: Refactor erase functionality to fix watchdog feeding, also fix swap scratch corrupt image issue #2275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d32100f
to
5893f04
Compare
f5badf3
to
374a654
Compare
4d58442
to
fe35c27
Compare
off = flash_sector_get_off(§or); | ||
csize = flash_sector_get_size(§or); | ||
|
||
rc = flash_area_erase(fa, off, csize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we still have an issue here as depending on the flash driver implementation, the sector returned by flash_area_get_sector
might be a virtual sector, that is composed of multiple physical sectors. For example, I'm doing that on a project where I have an internal flash memory with 128 KiB physical sectors and an external flash memory with 64 KiB sector. My flash_map_backend
implementation simply use a virtual sector size of 128 KiB for simplicity, but also to be able to support swap-move/offset on that device.
Perhaps we want to add a backwards
argument to flash_area_erase
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flash_area_erase
is an OS function, so it can't be changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we still have an issue here as depending on the flash driver implementation, the sector returned by
flash_area_get_sector
might be a virtual sector, that is composed of multiple physical sectors. For example, I'm doing that on a project where I have an internal flash memory with 128 KiB physical sectors and an external flash memory with 64 KiB sector. Myflash_map_backend
implementation simply use a virtual sector size of 128 KiB for simplicity, but also to be able to support swap-move/offset on that device.Perhaps we want to add a
backwards
argument toflash_area_erase
then?
Whatever API returns as a sector is a sector, you do not care beyond that. So your backend just returns sectors. I do not understand what would be benefit of knowing what type of sector that really is.
I am working at making MCUboot funciton with software defined sectors, that may be considered virtual sectors, and reducing code that queries the info allover the place. Software sectors also allow devices with storage on RRAM/NVRAM/MRAM/EEPROM function more effectively as they drop the physical constrains like alignment. Actually, for the purpose to ramp up support of RRAM devices, we made them pretend to be flash and have paged layout, even though it does nothing for them and is only annoying for software above them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, let say my scratch area is 128 KiB and composed of two 64 KiB sectors. Imagine my flash backend is considering 128-KiB virtual sectors, so when trying to erase the scratch area backwards, the backend will return a single 128-KiB sector to boot_erase_region
that we will erase with flash_area_erase
. The backend doesn't know it has to erase the sectors backwards, so it is quite likely it will start by erasing the first 64 KiB sector of the scratch area. If a power cycle occurs at that moment, I guess we'd observe the same issue spotted by @nordicjm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's different, because MCUboot would see it as a single 128KiB sector, so if we assume this is being used for swap status, swap status is going to be at the beginning of that in the first sector, MCUboot will not split up the sectors. Though having though about it I guess that maybe the patch here does only work because of the limited number of sectors, if there were a lot of sectors, so much that the trailer section spanned multiple sectors and power was lost during erasing them (backwards) then I assume it means that the swap would resume earlier than it should do, but also creating a corrupt image... might need a new device in simulator for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's different, because MCUboot would see it as a single 128KiB sector, so if we assume this is being used for swap status, swap status is going to be at the beginning of that in the first sector, MCUboot will not split up the sectors. Though having though about it I guess that maybe the patch here does only work because of the limited number of sectors, if there were a lot of sectors, so much that the trailer section spanned multiple sectors and power was lost during erasing them (backwards) then I assume it means that the swap would resume earlier than it should do, but also creating a corrupt image... might need a new device in simulator for that
Yeah, I was mentioning that problem with swap algorithms too, several times. If your log crosses to a next page it will be left there, in case of interruption, and once new swap starts you will end up in situation where next swap tries to overwrite existing log.
Solution would be to check on every boot, if trailer has erased magic or magic is somehow damaged, and try to erase the entire, presumed, log range. That check should happen on all boots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, let say my scratch area is 128 KiB and composed of two 64 KiB sectors. Imagine my flash backend is considering 128-KiB virtual sectors, so when trying to erase the scratch area backwards, the backend will return a single 128-KiB sector to
boot_erase_region
that we will erase withflash_area_erase
. The backend doesn't know it has to erase the sectors backwards, so it is quite likely it will start by erasing the first 64 KiB sector of the scratch area. If a power cycle occurs at that moment, I guess we'd observe the same issue spotted by @nordicjm
Well you are not sure about that. OK, so you may with that hardware, but if you have a device like SPI NOR, the driver may choose to erase data in bulk and select command it will send to device depending on range you are trying to erase. For example if you ask it to erase 4k, then it will send command to erase 4k, if you ask it to erase 8k, it will send two 4k commands, but if you ask it to erase 64k then it will erase it with one command. In all above cases, the spi nor will report page size as 4k, so you actually do not know how it does the erase in the end.
You may also have a modified driver that will jump out of erase every 4k to allow other software to function, so it is hard to relay on erase as atomic operation. And actually even the erase, as hardware does it, is not atomic, and may be interrupted leaving some data not erased.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well you are not sure about that. OK, so you may with that hardware, but if you have a device like SPI NOR, the driver may choose to erase data in bulk and select command it will send to device depending on range you are trying to erase. For example if you ask it to erase 4k, then it will send command to erase 4k, if you ask it to erase 8k, it will send two 4k commands, but if you ask it to erase 64k then it will erase it with one command. In all above cases, the spi nor will report page size as 4k, so you actually do not know how it does the erase in the end.
You may also have a modified driver that will jump out of erase every 4k to allow other software to function, so it is hard to relay on erase as atomic operation. And actually even the erase, as hardware does it, is not atomic, and may be interrupted leaving some data not erased.
You're absolutely right, but in that case does that mean swap-scratch is broken by design? Indeed, if my understanding of the bug described by @nordicjm is correct, it seems swap-scratch assumes the trailer of the scratch area is erased before the other data contained in the scratch area, which wouldn't be the case in the situations you're describing, even though boot_erase_region
is erasing backwards.
And in fact, we might even have an issue for swap-move or swap-offset, the swap_scramble_trailer_sectors
routine erases the trailer backwards to ensure we cannot have a case where, after a reboot during the erase of the trailer, the magic number (at the end of the trailer) is valid but the remaining of the trailer has already been erased and is therefore corrupted. Since we cannot trust the flash driver to erase the last part of our trailer first and since the erase is not atomic at hardware-level, I'm wondering if there are some cases where we would have an issue.
boot/bootutil/src/loader.c
Outdated
goto done; | ||
} else if (off >= flash_area_get_size(fa) || (flash_area_get_size(fa) - off) < size) { | ||
rc = -1; | ||
goto done; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to add the same check to boot_erase_region
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can do, will do it in separate PR though since CI is very slow today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. Think mynewt might be broken with this PR...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. Think mynewt might be broken with this PR...
What will break it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI was running for 4 hours on mynewt, I assume it just locked up in some loop, what broke it though I have no clue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that keeps happening I will try to run this locally, should still have mynewet env available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment left, but generally looks ok.
82707e0
to
24b3c97
Compare
size_t csize; | ||
|
||
/* Get current sector and, also, correct offset */ | ||
rc = flash_area_get_sector(fa, off, §or); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nordicjm Something is wrong with this function from mynewt. For the second sector starting at offset 131072 it returns sector start offset as 0, which causes the line 624 to assign 0 to offset and the loop never ends.
Refactors the erase functionality in bootutil so that it can be used alongside feeding the watchdog. This has also optimised some functions out. Signed-off-by: Jamie McCrae <[email protected]>
Fixes an issue with the swap using scratch algorithm that would cause the image loaded into the primary slot to be corrupt and unbootable if a device was rebooted during an erase of the scratch section that had not completed Signed-off-by: Jamie McCrae <[email protected]>
Adds some additional fixes that have been fixed for 2.2.0 Signed-off-by: Jamie McCrae <[email protected]>
I did have one minor concern about how helpful this will be if we add support to be able to lie about having a larger erase size than the underlying device. This would then do the erase forward, which might not fix the issue this fixes. |
Refactors the erase functionality in bootutil so that it can be used alongside feeding the watchdog. This has also optimised some functions out.
Also fixes a serious issue with swap using scratch that would leave primary slot images unbootable and potentially brick devices