Commit 6a174f4
fix(tuner): Include sm_drivers channel in HyperparameterTuner jobs (#5634)
* fix(tuner): Include sm_drivers channel in HyperparameterTuner jobs
When ModelTrainer has distributed=Torchrun(), the sm_drivers channel
contains torchrun_driver.py and sm_train.sh which are required for
multi-GPU execution. The tuner was not building this channel, causing
the framework container to fall back to the legacy single-GPU entry
point (python train.py) instead of torchrun.
This caused a tensor size mismatch (batch_size vs accumulated_batch)
in TRL's compute_loss when gradient_accumulation_steps > 1, because
the single-process path doesn't partition batches across ranks.
Fix: Replace _upload_source_code_and_configure_hyperparameters with
_build_driver_and_code_channels that replicates ModelTrainer's channel
building logic (sm_drivers, code, distributed.json, sourcecode.json,
sm_train.sh). Also pass through environment and VPC config.
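The fallback described above can be pictured with a toy model. This is only a sketch of the behaviour the message describes, not the framework container's actual code: the function name `resolve_entry_command` and its parameters are hypothetical, and the real container's decision logic is not shown in this commit.

```python
from typing import List


def resolve_entry_command(channel_names: List[str], entry_script: str, gpus: int) -> List[str]:
    """Toy model: pick a launch command from the channels attached to a job."""
    if "sm_drivers" in channel_names:
        # Driver scripts are available, so a multi-process torchrun launch is possible.
        return ["torchrun", f"--nproc_per_node={gpus}", entry_script]
    # No driver channel: fall back to the legacy single-process entry point.
    return ["python", entry_script]


# Without the channel, the job degrades to a single process.
print(resolve_entry_command(["train"], "train.py", gpus=4))
# With the channel, the same job launches one process per GPU.
print(resolve_entry_command(["train", "sm_drivers"], "train.py", gpus=4))
```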
* fix(tuner): Harden _build_training_job_definition against missing attributes
- Use getattr with fallback for static_hyperparameters (fixes
test_build_training_job_definition_includes_internal_channels)
- Guard _prepare_model_trainer_for_tuning with isinstance check on
entry_script to avoid calling _build_driver_and_code_channels on
MagicMock model trainers
- Guard environment passthrough with isinstance(env, dict) check
- Guard VPC config passthrough with try/except for mock safety
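The guards listed above can be illustrated with a small standalone sketch. It assumes a hypothetical helper `collect_passthrough` and an attribute layout (`environment`, `source_code.entry_script`) chosen for illustration; the real method bodies are not shown in this commit message. The point is that a MagicMock attribute is truthy but is neither a `dict` nor a `str`, so type checks are what keep mocked trainers out of the real code path.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def collect_passthrough(trainer, tuner):
    """Gather optional settings, skipping anything that is missing or mocked."""
    definition = {}
    # getattr with a default avoids AttributeError when the attribute was never set.
    static = getattr(tuner, "static_hyperparameters", None)
    definition["static_hyperparameters"] = static if isinstance(static, dict) else {}
    # Only pass the environment through when it is a real dict.
    env = getattr(trainer, "environment", None)
    if isinstance(env, dict):
        definition["environment"] = env
    # Only a real string entry script should trigger channel building.
    source = getattr(trainer, "source_code", None)
    entry = getattr(source, "entry_script", None)
    definition["build_channels"] = isinstance(entry, str)
    return definition


real_trainer = SimpleNamespace(
    environment={"A": "1"},
    source_code=SimpleNamespace(entry_script="train.py"),
)
print(collect_passthrough(real_trainer, SimpleNamespace()))
print(collect_passthrough(MagicMock(), SimpleNamespace(static_hyperparameters={"lr": "0.1"})))
```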
* fix(test): Rewrite tuner distributed integ test to match CI patterns
- Use sagemaker_session fixture from conftest (auto-resolves role/region)
- Use ml.m5.xlarge CPU instance (cheaper, available in CI)
- Remove hardcoded role ARN and training_mode
- Remove @pytest.mark.slow (not registered in CI config)
- Use module-level function instead of class (matches other integ tests)
- Use DEFAULT_CPU_IMAGE consistent with test_model_trainer.py
* fix(tuner): Upload sourcedir.tar.gz for framework container compatibility
The HPT API uses the legacy framework container path which expects
sagemaker_submit_directory (a tar.gz on S3) to be downloaded and
extracted to /opt/ml/code/. The previous approach of using a 'code'
input channel mounted the code at /opt/ml/input/data/code/ instead,
causing 'No such file or directory' errors.
Fix: Create and upload sourcedir.tar.gz to S3, set both
sagemaker_program and sagemaker_submit_directory hyperparameters.
Remove the separate 'code' input channel since the framework
container handles code extraction via sagemaker_submit_directory.
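A minimal sketch of the packaging step described above, using only the standard library. The helper names `package_source` and `code_hyperparameters` are hypothetical, the S3 upload itself is left out, and any value encoding the container may require for these hyperparameters is not covered here.

```python
import tarfile
from pathlib import Path


def package_source(source_dir: str, out_dir: str) -> Path:
    """Pack the contents of source_dir into out_dir/sourcedir.tar.gz.

    out_dir should be a different directory from source_dir so the archive
    does not try to include itself.
    """
    archive = Path(out_dir) / "sourcedir.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(Path(source_dir).rglob("*")):
            # Store paths relative to source_dir so files extract at the archive root.
            tar.add(path, arcname=str(path.relative_to(source_dir)), recursive=False)
    return archive


def code_hyperparameters(entry_script: str, s3_uri: str) -> dict:
    """Hyperparameters that point the container at the uploaded archive."""
    return {
        "sagemaker_program": entry_script,
        "sagemaker_submit_directory": s3_uri,
    }
```

After uploading the archive, the resulting S3 URI would be passed as `s3_uri`, so that the container downloads and extracts the code under /opt/ml/code/ rather than finding it under /opt/ml/input/data/code/.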
* test(tuner): Add unit tests for driver/code channel building
Add 25 unit tests covering the tuner changes from PR #5634:
- _prepare_model_trainer_for_tuning guard logic
- _build_driver_and_code_channels sm_drivers channel creation
- _build_training_job_definition _tuner_channels inclusion
- Environment and VPC config passthrough
- sourcedir.tar.gz upload and sagemaker_submit_directory HP
- static_hyperparameters getattr fallback
1 parent 10df8a4 commit 6a174f4
File tree
3 files changed, +801 -60 lines changed
- sagemaker-train
  - src/sagemaker/train
  - tests
    - integ/train
    - unit/train
Lines changed: 127 additions & 0 deletions