use lazy map for dataset preprocessing #1917

@akoumpa

Description

Is your feature request related to a problem? Please describe.
AM uses .map(fn) to transform datasets (e.g., apply a template, tokenize, etc.). Unfortunately, this runs eagerly at startup over the whole dataset. In addition, the current caching often fails, so repeated runs pay the same preprocessing delay again.

2026-04-20 12:09:22 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
2026-04-20 12:09:22 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16/6d6f531753869fcd8cbbaefa7ee94b93966d2d04/config.json "HTTP/1.1 200 OK"
2026-04-20 12:09:23 | INFO | nemo_automodel._transformers.auto_tokenizer | Using custom tokenizer MistralCommonBackend for model type 'mistral3'
2026-04-20 12:09:23 | INFO | httpx | HTTP Request: GET https://huggingface.co/api/models/akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16/tree/main?recursive=true&expand=false "HTTP/1.1 200 OK"
2026-04-20 12:09:24 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16/resolve/main/tekken.json "HTTP/1.1 302 Found"
2026-04-20 12:09:24 | INFO | mistral_common.tokens.tokenizers.tekken | Non special vocabulary size is 130072 with 1000 special tokens.
2026-04-20 12:09:24 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/datasets/rajpurkar/squad/resolve/main/README.md "HTTP/1.1 307 Temporary Redirect"
2026-04-20 12:09:24 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/api/resolve-cache/datasets/rajpurkar/squad/7b6d24c440a36b6815f21b70d25016731768db1f/README.md "HTTP/1.1 200 OK"
2026-04-20 12:09:24 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/datasets/rajpurkar/squad/resolve/7b6d24c440a36b6815f21b70d25016731768db1f/squad.py "HTTP/1.1 404 Not Found"
2026-04-20 12:09:25 | INFO | httpx | HTTP Request: HEAD https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/rajpurkar/squad/rajpurkar/squad.py "HTTP/1.1 404 Not Found"
2026-04-20 12:09:25 | INFO | httpx | HTTP Request: GET https://huggingface.co/api/datasets/rajpurkar/squad/revision/7b6d24c440a36b6815f21b70d25016731768db1f "HTTP/1.1 200 OK"
2026-04-20 12:09:25 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/datasets/rajpurkar/squad/resolve/7b6d24c440a36b6815f21b70d25016731768db1f/.huggingface.yaml "HTTP/1.1 404 Not Found"
2026-04-20 12:09:25 | INFO | httpx | HTTP Request: GET https://datasets-server.huggingface.co/info?dataset=rajpurkar/squad "HTTP/1.1 200 OK"
2026-04-20 12:09:25 | INFO | httpx | HTTP Request: GET https://huggingface.co/api/datasets/rajpurkar/squad/tree/7b6d24c440a36b6815f21b70d25016731768db1f/plain_text?recursive=true&expand=false "HTTP/1.1 200 OK"
2026-04-20 12:09:25 | INFO | httpx | HTTP Request: GET https://huggingface.co/api/datasets/rajpurkar/squad/tree/7b6d24c440a36b6815f21b70d25016731768db1f?recursive=false&expand=false "HTTP/1.1 200 OK"
2026-04-20 12:09:26 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/datasets/rajpurkar/squad/resolve/7b6d24c440a36b6815f21b70d25016731768db1f/dataset_infos.json "HTTP/1.1 404 Not Found"
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 87599/87599 [00:30<00:00, 2918.98 examples/s]
2026-04-20 12:09:56 | INFO | root | Using model config to instantiate tokenizer
2026-04-20 12:09:56 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
2026-04-20 12:09:56 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16/6d6f531753869fcd8cbbaefa7ee94b93966d2d04/config.json "HTTP/1.1 200 OK"
2026-04-20 12:09:56 | INFO | nemo_automodel._transformers.auto_tokenizer | Using custom tokenizer MistralCommonBackend for model type 'mistral3'
2026-04-20 12:09:56 | INFO | httpx | HTTP Request: GET https://huggingface.co/api/models/akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16/tree/main?recursive=true&expand=false "HTTP/1.1 200 OK"
2026-04-20 12:09:56 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16/resolve/main/tekken.json "HTTP/1.1 302 Found"
2026-04-20 12:09:56 | INFO | mistral_common.tokens.tokenizers.tekken | Non special vocabulary size is 130072 with 1000 special tokens.
2026-04-20 12:09:57 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/datasets/rajpurkar/squad/resolve/main/README.md "HTTP/1.1 307 Temporary Redirect"
2026-04-20 12:09:57 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/api/resolve-cache/datasets/rajpurkar/squad/7b6d24c440a36b6815f21b70d25016731768db1f/README.md "HTTP/1.1 200 OK"
2026-04-20 12:09:57 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/datasets/rajpurkar/squad/resolve/7b6d24c440a36b6815f21b70d25016731768db1f/squad.py "HTTP/1.1 404 Not Found"
2026-04-20 12:09:57 | INFO | httpx | HTTP Request: HEAD https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/rajpurkar/squad/rajpurkar/squad.py "HTTP/1.1 404 Not Found"
2026-04-20 12:09:57 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/datasets/rajpurkar/squad/resolve/7b6d24c440a36b6815f21b70d25016731768db1f/.huggingface.yaml "HTTP/1.1 404 Not Found"
Map:   0%|▌                                                                                                                                                           | 286/87599 [00:00<00:30, 2842.86 examples/s]
2026-04-20 12:09:58 | INFO | httpx | HTTP Request: GET https://datasets-server.huggingface.co/info?dataset=rajpurkar/squad "HTTP/1.1 200 OK"
Map:   1%|█▎                                                                                                                                                          | 710/87599 [00:00<00:30, 2820.32 examples/s]
2026-04-20 12:09:58 | INFO | httpx | HTTP Request: HEAD https://huggingface.co/datasets/rajpurkar/squad/resolve/7b6d24c440a36b6815f21b70d25016731768db1f/dataset_infos.json "HTTP/1.1 404 Not Found"
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:00<00:00, 2617.07 examples/s]
Map:   2%|██▌                                                                                                                                                        | 1459/87599 [00:00<00:32, 2678.25 examples/s]
Map:   2%|██▌                                                                                                                                                        | 1457/87599 [00:00<00:32, 2675.43 examples/s]
Map:  51%|██████████████████████████████████████████████████████████████████████████████▍                                                                           | 44630/87599 [00:14<00:14, 2940.95 examples/s]
Map:  52%|███████████████████████████████████████████████████████████████████████████████▋                                                                          | 45310/87599 [00:15<00:15, 2670.04 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 87599/87599 [00:30<00:00, 2878.41 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 87599/87599 [00:30<00:00, 2877.41 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 87599/87599 [00:30<00:00, 2873.01 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 87599/87599 [00:30<00:00, 2864.85 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 87599/87599 [00:30<00:00, 2863.68 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 87599/87599 [00:30<00:00, 2860.51 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 87599/87599 [00:30<00:00, 2857.92 examples/s]
2026-04-20 12:10:29 | INFO | root | Signal handler installed for 15

For example, the log above shows that the full dataset (87,599 samples) is preprocessed before finetuning can start, with each Map pass taking about 30 seconds.

Describe the solution you'd like
We want to switch to lazy preprocessing that processes items on the fly and caches the results.
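
To make the request concrete, here is a minimal sketch of what lazy preprocessing with caching could look like. It is not AM code: the `LazyMapDataset` class and the `toy_tokenize` function are hypothetical names made up for illustration, and the cache here is a per-process in-memory dict, so it would not by itself fix the cost of repeated runs (that would need an on-disk cache).

```python
from typing import Any, Callable, Dict, Sequence


class LazyMapDataset:
    """Hypothetical wrapper: apply `fn` to an item on first access, then reuse the result."""

    def __init__(self, data: Sequence[Any], fn: Callable[[Any], Any]) -> None:
        self._data = data
        self._fn = fn
        # In-memory cache keyed by (non-negative) index; assumption for this sketch only.
        self._cache: Dict[int, Any] = {}

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, idx: int) -> Any:
        if idx < 0:
            # Normalize negative indices so ds[-1] and ds[len-1] share a cache entry.
            idx += len(self._data)
        if idx not in self._cache:
            # Preprocessing happens here, only for the item actually requested.
            self._cache[idx] = self._fn(self._data[idx])
        return self._cache[idx]


# Toy stand-in for template application + tokenization (not a real tokenizer).
calls = []


def toy_tokenize(example: dict) -> dict:
    calls.append(example["question"])
    return {"input_ids": [len(word) for word in example["question"].split()]}


raw = [{"question": "What is lazy map"}, {"question": "Why cache"}]
ds = LazyMapDataset(raw, toy_tokenize)

first = ds[0]   # runs toy_tokenize on item 0 only
again = ds[0]   # served from the cache, toy_tokenize is not called again
```

With this shape nothing is processed at construction time, so startup no longer scales with dataset size; the cost moves to the first access of each item. Hugging Face datasets also provides Dataset.with_transform for on-the-fly transforms, though as far as I know it does not cache the transformed output.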

Describe alternatives you've considered
N/A

Additional context
N/A
