My training process is frozen #107

Hello,
The training process starts without problems, but at some point partway through it freezes, like this:

![image](https://github.com/jiangxiluning/FOTS.PyTorch/assets/60608880/7f0eabde-dfca-4b47-9f9b-982071568a59)

After that, the console output stops changing.

I checked the GPUs while the process was frozen: GPU-Util (not memory) is maxed out, which I think is a clue to the problem:

![image](https://github.com/jiangxiluning/FOTS.PyTorch/assets/60608880/c66d742c-a3e9-45f7-a020-2299b3310c53)

I tried changing parameters such as batch_size and the number of workers, but it doesn't help.
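One way to narrow down where the process is stuck would be to have Python dump the stack of every thread if the run is still going after a timeout. Below is a minimal sketch using the standard-library `faulthandler` module. It assumes the hang is visible from the Python side of the main training process (DataLoader worker processes would need the same call themselves). `watch_for_hang` and `stop_watching` are hypothetical helper names, not part of FOTS.PyTorch or PyTorch Lightning.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s: float, out=None) -> None:
    """Dump the stack of every thread to `out` each time `timeout_s` seconds pass.

    `out` must be a real file (something with a file descriptor);
    it defaults to sys.stderr and must stay open until stop_watching() is called.
    """
    if out is None:
        out = sys.stderr
    # repeat=True keeps dumping every `timeout_s` seconds until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def stop_watching() -> None:
    """Cancel the pending dump, e.g. once training has finished normally."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage in a training script (the names below are assumptions):
#
#     watch_for_hang(600)       # dump stacks every 10 minutes
#     trainer.fit(model, data)  # the call that appears to freeze
#     stop_watching()
```

If the dumps keep showing the same line (for example inside a data loading call or a CUDA synchronization), that would point to where the freeze happens.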

Can anyone help?

My environment is miniconda3 with CUDA 11.8, and the versions are:
PyTorch 2.0.0
PyTorch Lightning 2.0.2
