Monday, September 19, 2022

Zephyr RTOS with MCUboot "USAGE FAULT" after "Jumping to the first image slot", fixed.

Subjectively, what is it like to develop embedded software?

When things go wrong, it's days of feeling like your hands are wrapped in mittens while wearing a blindfold, with your primary source of documentation being a Douglas Adams novel like "The Hitchhiker's Guide to the Galaxy".  Or even worse, a Charles Stross "Laundry Files" novel.  Then, after days of struggle, the problem is fixed!  Development rapidly proceeds at a supersonic pace.  Until impact with the next block, and you're stuck for a couple of days of zero progress.  Such is life when doing embedded / IoT software development.


Take, for example, this recent three-day struggle:

Consider that for more than a year, you've been using older, lower-RAM STM32 dev boards like the disco_l475_iot1, a truly fine board, other than only having about 128K of RAM and Zephyr needing about 127.9K to run everything other than your application code, which will only require another 128K of RAM LOL.  Or a classic F767 board like the nucleo_f767zi, a wonderful board and a joy to use, other than being wired Ethernet instead of WiFi.  The good news is the b_l4s5i_iot01a is a massive upgrade over the disco_l475_iot1: it has around six times the RAM... awesome... it's the board every "starved for memory" disco_l475_iot1 developer dreams of... this should be a simple port job... Right?  Right?  Just recompile and upload?  It's never that easy in embedded development...

Not to put the cart before the horse, but the problem I ran into is, in a technical sense only, documented at:

https://docs.zephyrproject.org/latest/services/device_mgmt/dfu.html

There it explains that you need to set zephyr,code-partition (a property of the chosen node) in your DTS overlay file, using a line looking something like:

zephyr,code-partition = &slot0_partition;

However, old dev boards automatically or automagically had that set, no developer effort required.  To make it crystal clear, although it disagrees with the docs above, code for old boards (admittedly, using older versions of Zephyr...) works perfectly without setting a code-partition variable.  I just verified using the latest version of Zephyr that a venerable 'F767 compiles and boots perfectly with MCUboot, no need to manually set a code-partition setting.

What happens if you do not configure a code-partition and compile for a nice new l4s5i board?  Certainly no error messages at compile time, not the slightest indication of problems to come.  Run imgtool to sign the new image with your firmware keys and mark it as confirmed so MCUboot will be happy, upload to your dev board at address 0x08020000 as you do on that platform, reboot, and I kid you not, this is all you have to troubleshoot the problem:

(the usual MCUboot startup messages, it recognizes my primary image as good, etc)
I: Jumping to the first image slot
E: ***** USAGE FAULT *****
E: Attempt to execute undefined instruction
(boring register dump here)
E: Faulting instruction address (r15/pc): 0x0801bca8
and it crashes out and halts the system.

Notice that the PC address where it crashes out is NOT in the Zephyr firmware slot ranges, it's in the MCUboot range of addresses.  Interesting!  So it's "obviously" an MCUboot bug, amirite?  Not so fast...
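
The address check above is easy to do by hand, but here is a minimal Python sketch of it. Only the 0x08020000 upload address and the faulting PC come from this post; the partition names follow common Zephyr DTS naming, and the partition end addresses are assumed placeholders, so check your board's DTS for the real layout.

```python
# Hypothetical flash layout. Only the 0x08020000 boundary comes from the post
# (it is the upload address for the application image); the slot0 end address
# is a placeholder, not a value read from the b_l4s5i_iot01a DTS.
PARTITIONS = {
    "boot_partition (MCUboot)": (0x08000000, 0x08020000),
    "slot0_partition (Zephyr app)": (0x08020000, 0x08100000),
}


def classify(addr):
    """Return the name of the partition containing addr (end is exclusive)."""
    for name, (start, end) in PARTITIONS.items():
        if start <= addr < end:
            return name
    return "outside known partitions"


# The faulting PC from the crash log above.
print(hex(0x0801BCA8), "is in", classify(0x0801BCA8))
```

With this assumed layout the faulting PC lands in the bootloader's range, which is what made the crash look like an MCUboot bug.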

I think I have a good MCUboot compilation and source tree; in fact, if I change the "west" command to force the same Zephyr source tree to compile MCUboot for my old boards, they work fine.  So, a code bug in MCUboot?  I dunno, there's not much space in "jump to the first image slot" to have a bug.

Clearly (sarcasm) it can't be a problem with my code that is configured as per my old notes and runs fine on my old boards.  MCUboot is crashing at an MCUboot program counter address ONLY on the brand new dev board and fine on the other hardware.

Try one of several dev boards?  Sure, why not.  Turns out none of them work, all fail the same way.  OK then.  So it's not hardware, it's not Zephyr, it's crashing at an MCUboot address, it's not my build system, which works fine when told to compile for other boards...  I've "proven" it has to be an MCUboot problem.

I spent a couple days learning about the innards of MCUboot.  Fun, but not productive and it didn't fix my problem.

After some days of messing around trying everything else under the sun, as is typical for embedded development, I finally checked the Zephyr docs for device management (as opposed to using my own documentation, source code comments, and working source code).  Sometime after Zephyr version 2.2.1 and before Zephyr 3.0.0, the online Zephyr docs were changed to reflect the need for a code-partition setting (apparently only for SOME boards, as most of my boards don't require this, and certainly never required it in the past).

Add that code-partition setting, recompile, sign the binary with imgtool, upload with STM32CubeProgrammer, and we're off and running, where I should have been three days ago.
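
One plausible reading of the crash, which is an assumption rather than something verified here: without a code-partition, the image is linked for the start of flash rather than for slot 0, so the reset vector MCUboot jumps to points back into the bootloader's own address range. If that is what is happening, it can be caught before flashing by reading the reset vector out of the image. The sketch below does that on synthetic images; the 0x200 header size, the slot end address, and the helper names are all hypothetical.

```python
import struct

SLOT0_START = 0x08020000  # upload address used in this post
SLOT0_END = 0x08100000    # placeholder, check your board's DTS
HEADER_SIZE = 0x200       # assumed size of the image header before the vector table


def reset_vector(image, header_size=HEADER_SIZE):
    """Read the Cortex-M reset handler address from an image.

    The vector table holds the initial stack pointer in word 0 and the
    reset handler in word 1, both little-endian 32-bit values.
    """
    _initial_sp, reset = struct.unpack_from("<II", image, header_size)
    return reset & ~1  # clear the Thumb bit to get the real address


def linked_for_slot0(image):
    """True if the reset handler lies inside the assumed slot 0 range."""
    return SLOT0_START <= reset_vector(image) < SLOT0_END


def fake_image(reset_addr):
    """Build a synthetic image: blank header, then a two-word vector table."""
    return bytes(HEADER_SIZE) + struct.pack("<II", 0x20010000, reset_addr | 1)


good = fake_image(0x08021234)  # made-up address inside slot 0
bad = fake_image(0x08001234)   # made-up address inside the bootloader range

print("good image linked for slot 0:", linked_for_slot0(good))
print("bad image linked for slot 0:", linked_for_slot0(bad))
```

To try it on a real build, read the signed binary's bytes from disk and pass them to linked_for_slot0 in place of the synthetic images, after adjusting the constants to match your board.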

Two interesting summary points:  Good luck finding anything explaining this using Google, because I certainly did not; all I had to work with was the crash address 0x0801bca8, and searching for that doesn't find much.  Also, yes, weird as it sounds, I have proven you can crash MCUboot while it's running, by giving it a Zephyr firmware image that compiled perfectly yet was incorrectly configured.

By posting this on my blog maybe Google will index it, then people who experience crashes in MCUboot at address 0x0801bca8 on the newest L4 STM32 boards while trying to run old code that runs fine on older boards, will find this explanation and fix...


I could also tell the typical embedded development story of how my spyware blocking "lie protection" software thought the Ninja build tool, used by Zephyr, is spyware so it would crash the build for fun, not bother to log any errors or other explanation, just kind of randomly crash the suspected spyware program while in mid-build.  That was so much fun.


So... yeah.  That's what people mean when they talk about the Embedded IoT software development experience.  On the other hand, it's more fun than the above blog post implies, really, it is.  It's just that the struggles are real.
