Skip to content

drivers/mtd/w25n: implement bad block management#18771

Open
julianoes wants to merge 1 commit intoapache:masterfrom
julianoes:pr-w25n-bad-block
Open

drivers/mtd/w25n: implement bad block management#18771
julianoes wants to merge 1 commit intoapache:masterfrom
julianoes:pr-w25n-bad-block

Conversation

@julianoes
Copy link
Copy Markdown
Contributor

Summary

Use the chip's built-in 20-entry non-volatile Bad Block Management Look-Up Table (datasheet section 8.2.7) to transparently route around bad blocks.

This is mostly developed using Claude Code. I'll do a thorough review myself and compare things to the datasheet.

Init:

  • Reserve the top 24 blocks of the array as a spare pool
  • Clamp the MTD geometry to W25N_USER_BLOCKS = 1000 so upper layers never see the spare area (125 MB usable instead of 128 MB)
  • Force BUF=1 alongside enabling ECC. The W25N01GVxxIT variant power-ups with BUF=0 (Continuous Read mode), in which Read Data ignores the column address and always starts at byte 0 - which silently broke any read targeting a non-zero column (OOB markers, sub-page reads in w25n_read).
  • Scan all 1024 blocks for factory bad markers (non-FFh at byte 0 of the spare area of page 0) and remap any user-area bad blocks via the A1h BBM command. Idempotent across reboots: blocks already present in the LUT are skipped, so repeated scans don't consume LUT slots.

Runtime:

  • On E-FAIL from w25n_block_erase or P-FAIL from w25n_program_execute, allocate a spare and issue A1h, then retry the operation once. The chip routes the retry to the spare PBA transparently. Data buffer is reloaded on program retry.
  • Uncorrectable read ECC is left as -EIO (soft errors shouldn't trigger permanent remap, and remapping discards data we may still recover).

Safeguards against burning LUT slots on bogus bad blocks:

  • w25n_pick_free_spare erases each candidate spare as an active proof of life before returning it - the factory OOB marker alone isn't trusted.
  • w25n_bbm_swap rejects A1h with LBA outside the user area or PBA outside the spare pool.

Stack discipline for the logger-thread hot path:

  • The 20-entry cached LUT lives in the device struct, not on the stack.
  • w25n_read_bbm_lut decodes 4 bytes at a time instead of reading the full 80-byte LUT dump into a local buffer.

Boot diagnostics are emitted via syslog so they appear unconditionally:

  • [w25n] BBM scan summary (new/remapped/unremapped/previously-remapped/ LUT slots used)
  • [w25n] W25N01GV ready line with user blocks, spare count, geometry, and actual SPI frequency
  • [w25n] per-remap info and warnings on runtime E-FAIL/P-FAIL paths

Impact

Note: existing littlefs filesystems become unmountable because the block count shrinks from 1024 to 1000; both PX4 board init.c paths already mount with autoformat so they reformat on first boot after this change.

Testing

Tested downstream using PX4 manually. Unfortunately (or fortunately) none of my boards have any bad blocks, so I could not actually test this. If anyone has an idea how to test this, I'm all ears.

@github-actions github-actions Bot added Area: Drivers Drivers issues Size: M The size of the change in this PR is medium labels Apr 21, 2026
@julianoes julianoes force-pushed the pr-w25n-bad-block branch 2 times, most recently from 315b125 to c9e976d Compare April 21, 2026 01:26
Use the chip's built-in 20-entry non-volatile Bad Block Management
Look-Up Table (datasheet section 8.2.7) to transparently route around
bad blocks.

Init:
- Reserve the top 24 blocks of the array as a spare pool
- Clamp the MTD geometry to W25N_USER_BLOCKS = 1000 so upper layers
  never see the spare area (125 MB usable instead of 128 MB)
- Force BUF=1 alongside enabling ECC. The W25N01GVxxIT variant
  power-ups with BUF=0 (Continuous Read mode), in which Read Data
  ignores the column address and always starts at byte 0 - which
  silently broke any read targeting a non-zero column (OOB markers,
  sub-page reads in w25n_read).
- Scan all 1024 blocks for factory bad markers (non-FFh at byte 0 of
  the spare area of page 0) and remap any user-area bad blocks via the
  A1h BBM command. Idempotent across reboots: blocks already present
  in the LUT are skipped, so repeated scans don't consume LUT slots.

Runtime:
- On E-FAIL from w25n_block_erase or P-FAIL from w25n_program_execute,
  allocate a spare and issue A1h, then retry the operation once. The
  chip routes the retry to the spare PBA transparently. Data buffer is
  reloaded on program retry.
- Uncorrectable read ECC is left as -EIO (soft errors shouldn't trigger
  permanent remap, and remapping discards data we may still recover).

Safeguards against burning LUT slots on bogus bad blocks:
- w25n_pick_free_spare erases each candidate spare as an active proof
  of life before returning it - the factory OOB marker alone isn't
  trusted.
- w25n_bbm_swap rejects A1h with LBA outside the user area or PBA
  outside the spare pool.

Stack discipline for the logger-thread hot path:
- The 20-entry cached LUT lives in the device struct, not on the stack.
- w25n_read_bbm_lut decodes 4 bytes at a time instead of reading the
  full 80-byte LUT dump into a local buffer.

Boot diagnostics are emitted via syslog so they appear unconditionally:
- [w25n] BBM scan summary (new/remapped/unremapped/previously-remapped/
  LUT slots used)
- [w25n] W25N01GV ready line with user blocks, spare count, geometry,
  and actual SPI frequency
- [w25n] per-remap info and warnings on runtime E-FAIL/P-FAIL paths

Note: existing littlefs filesystems become unmountable because the
block count shrinks from 1024 to 1000; both PX4 board init.c paths
already mount with autoformat so they reformat on first boot after
this change.

Signed-off-by: Julian Oes <julian@oes.ch>
Copy link
Copy Markdown
Contributor Author

@julianoes julianoes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed myself. From what I can tell this looks sane.

@julianoes julianoes marked this pull request as ready for review April 21, 2026 01:31
Comment thread drivers/mtd/w25n.c
}
}

syslog(LOG_INFO,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finfo

Comment thread drivers/mtd/w25n.c
* spare PBA transparently.
*/

syslog(LOG_WARNING,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwarn

Comment thread drivers/mtd/w25n.c
* data buffer is reloaded above on the retry pass.
*/

syslog(LOG_WARNING,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwarn

Comment thread drivers/mtd/w25n.c
finfo("W25N initialized: %d blocks, %d bytes/block\n",
priv->nblocks, W25N_BLOCK_SIZE);

syslog(LOG_INFO,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finfo

@michallenc
Copy link
Copy Markdown
Contributor

How does it work with the license if Claude is the one who wrote the code? Can you submit it under Apache license or does Anthropic hold the copyright?

Comment thread drivers/mtd/w25n.c
continue;
}

if (w25n_is_factory_bad(priv, b))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge this with the previous if statement

Comment thread drivers/mtd/w25n.c
continue;
}

if (!w25n_is_factory_bad(priv, block))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment thread drivers/mtd/w25n.c
for (attempt = 0; attempt < 2; attempt++)
{
ret = w25n_block_erase(priv, block);
if (ret == OK || ret != -EIO)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why catch only EIO error?

@acassis
Copy link
Copy Markdown
Contributor

acassis commented Apr 21, 2026

How does it work with the license if Claude is the one who wrote the code? Can you submit it under Apache license or does Anthropic hold the copyright?

@michallenc nobody know for sure, it is like the monkey selfie: https://www.bbc.com/future/article/20260414-the-monkey-selfie-that-predicted-the-ai-age something like public domain.

So, I think it is fine to use it, otherwise Linux kernel and all other projects using AI should face legal actions. And the most important thing here is that @julianoes was honest to say it was created by Claude.

Comment thread drivers/mtd/w25n.c
* Each entry is encoded as (lba_raw << 16) | pba_raw, where the high two
* LBA bits are status flags (Enable, Invalid).
* See datasheet section 8.2.8.
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Input Parameters

Comment thread drivers/mtd/w25n.c
* Description:
* Add an LBA->PBA link to the chip's non-volatile BBM LUT. Subsequent
* accesses to the LBA are transparently routed to the PBA by the chip.
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Input Parameters and Returned Value

Comment thread drivers/mtd/w25n.c
* Read byte 0 of the spare area of the first page of `block`. The factory
* marks bad blocks with a non-FFh value here. Returns the marker value
* (0..255), or 0x00 if the read failed (treated as bad).
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Input Parameters and Returned Value

Comment thread drivers/mtd/w25n.c
* Find an unused, non-factory-bad block in the spare pool that is not
* already a remap target in the LUT. Returns the block index, or
* 0xffff if none is available.
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Comment thread drivers/mtd/w25n.c
* Allocate a spare and add a BBM LUT entry mapping `block` to it.
* Returns OK if the chip will now route accesses to `block` to a good
* spare, or a negative errno if no spare is available or the LUT is full.
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Comment thread drivers/mtd/w25n.c
* blocks via the chip's BBM LUT. Idempotent across reboots: blocks that
* are already remapped (present in the LUT) are skipped, so repeated
* scans don't consume LUT slots. Takes ~200 ms (1024 OOB reads).
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@acassis
Copy link
Copy Markdown
Contributor

acassis commented Apr 21, 2026

@julianoes are you using W25N01GVZEIG ? I have in on my shelf: https://acassis.wordpress.com/2018/01/14/a-huddle-of-electronic-modules/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Drivers Drivers issues Size: M The size of the change in this PR is medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants