[core] use swap_tensors in group offloading where possible
#29104
| Job | Run time |
|---|---|
| | 25s |
| | 25s |