drivers/mtd/w25n: implement bad block management#18771
drivers/mtd/w25n: implement bad block management#18771julianoes wants to merge 1 commit intoapache:masterfrom
Conversation
315b125 to
c9e976d
Compare
Use the chip's built-in 20-entry non-volatile Bad Block Management Look-Up Table (datasheet section 8.2.7) to transparently route around bad blocks. Init: - Reserve the top 24 blocks of the array as a spare pool - Clamp the MTD geometry to W25N_USER_BLOCKS = 1000 so upper layers never see the spare area (125 MB usable instead of 128 MB) - Force BUF=1 alongside enabling ECC. The W25N01GVxxIT variant power-ups with BUF=0 (Continuous Read mode), in which Read Data ignores the column address and always starts at byte 0 - which silently broke any read targeting a non-zero column (OOB markers, sub-page reads in w25n_read). - Scan all 1024 blocks for factory bad markers (non-FFh at byte 0 of the spare area of page 0) and remap any user-area bad blocks via the A1h BBM command. Idempotent across reboots: blocks already present in the LUT are skipped, so repeated scans don't consume LUT slots. Runtime: - On E-FAIL from w25n_block_erase or P-FAIL from w25n_program_execute, allocate a spare and issue A1h, then retry the operation once. The chip routes the retry to the spare PBA transparently. Data buffer is reloaded on program retry. - Uncorrectable read ECC is left as -EIO (soft errors shouldn't trigger permanent remap, and remapping discards data we may still recover). Safeguards against burning LUT slots on bogus bad blocks: - w25n_pick_free_spare erases each candidate spare as an active proof of life before returning it - the factory OOB marker alone isn't trusted. - w25n_bbm_swap rejects A1h with LBA outside the user area or PBA outside the spare pool. Stack discipline for the logger-thread hot path: - The 20-entry cached LUT lives in the device struct, not on the stack. - w25n_read_bbm_lut decodes 4 bytes at a time instead of reading the full 80-byte LUT dump into a local buffer. Boot diagnostics are emitted via syslog so they appear unconditionally: - [w25n] BBM scan summary (new/remapped/unremapped/previously-remapped/ LUT slots used) - [w25n] W25N01GV ready line with user blocks, spare count, geometry, and actual SPI frequency - [w25n] per-remap info and warnings on runtime E-FAIL/P-FAIL paths Note: existing littlefs filesystems become unmountable because the block count shrinks from 1024 to 1000; both PX4 board init.c paths already mount with autoformat so they reformat on first boot after this change. Signed-off-by: Julian Oes <julian@oes.ch>
c9e976d to
fc5f6df
Compare
julianoes
left a comment
There was a problem hiding this comment.
Reviewed myself. From what I can tell this looks sane.
| } | ||
| } | ||
|
|
||
| syslog(LOG_INFO, |
| * spare PBA transparently. | ||
| */ | ||
|
|
||
| syslog(LOG_WARNING, |
| * data buffer is reloaded above on the retry pass. | ||
| */ | ||
|
|
||
| syslog(LOG_WARNING, |
| finfo("W25N initialized: %d blocks, %d bytes/block\n", | ||
| priv->nblocks, W25N_BLOCK_SIZE); | ||
|
|
||
| syslog(LOG_INFO, |
|
How does it work with the license if Claude is the one who wrote the code? Can you submit it under Apache license or does Anthropic hold the copyright? |
| continue; | ||
| } | ||
|
|
||
| if (w25n_is_factory_bad(priv, b)) |
There was a problem hiding this comment.
Merge this with the previous if statement
| continue; | ||
| } | ||
|
|
||
| if (!w25n_is_factory_bad(priv, block)) |
| for (attempt = 0; attempt < 2; attempt++) | ||
| { | ||
| ret = w25n_block_erase(priv, block); | ||
| if (ret == OK || ret != -EIO) |
There was a problem hiding this comment.
Why catch only EIO error?
@michallenc nobody know for sure, it is like the monkey selfie: https://www.bbc.com/future/article/20260414-the-monkey-selfie-that-predicted-the-ai-age something like public domain. So, I think it is fine to use it, otherwise Linux kernel and all other projects using AI should face legal actions. And the most important thing here is that @julianoes was honest to say it was created by Claude. |
| * Each entry is encoded as (lba_raw << 16) | pba_raw, where the high two | ||
| * LBA bits are status flags (Enable, Invalid). | ||
| * See datasheet section 8.2.8. | ||
| * |
There was a problem hiding this comment.
Please add Input Parameters
| * Description: | ||
| * Add an LBA->PBA link to the chip's non-volatile BBM LUT. Subsequent | ||
| * accesses to the LBA are transparently routed to the PBA by the chip. | ||
| * |
There was a problem hiding this comment.
Please add Input Parameters and Returned Value
| * Read byte 0 of the spare area of the first page of `block`. The factory | ||
| * marks bad blocks with a non-FFh value here. Returns the marker value | ||
| * (0..255), or 0x00 if the read failed (treated as bad). | ||
| * |
There was a problem hiding this comment.
Please add Input Parameters and Returned Value
| * Find an unused, non-factory-bad block in the spare pool that is not | ||
| * already a remap target in the LUT. Returns the block index, or | ||
| * 0xffff if none is available. | ||
| * |
| * Allocate a spare and add a BBM LUT entry mapping `block` to it. | ||
| * Returns OK if the chip will now route accesses to `block` to a good | ||
| * spare, or a negative errno if no spare is available or the LUT is full. | ||
| * |
| * blocks via the chip's BBM LUT. Idempotent across reboots: blocks that | ||
| * are already remapped (present in the LUT) are skipped, so repeated | ||
| * scans don't consume LUT slots. Takes ~200 ms (1024 OOB reads). | ||
| * |
|
@julianoes are you using W25N01GVZEIG ? I have in on my shelf: https://acassis.wordpress.com/2018/01/14/a-huddle-of-electronic-modules/ |
Summary
Use the chip's built-in 20-entry non-volatile Bad Block Management Look-Up Table (datasheet section 8.2.7) to transparently route around bad blocks.
This is mostly developed using Claude Code. I'll do a thorough review myself and compare things to the datasheet.
Init:
Runtime:
Safeguards against burning LUT slots on bogus bad blocks:
Stack discipline for the logger-thread hot path:
Boot diagnostics are emitted via syslog so they appear unconditionally:
Impact
Note: existing littlefs filesystems become unmountable because the block count shrinks from 1024 to 1000; both PX4 board init.c paths already mount with autoformat so they reformat on first boot after this change.
Testing
Tested downstream using PX4 manually. Unfortunately (or fortunately) none of my boards have any bad blocks, so I could not actually test this. If anyone has an idea how to test this, I'm all ears.