I followed a tutorial and trained my first LoRA today. I was surprised to see it was using both my GPUs, a GTX 1080 Ti and an RTX 3060, but then it failed halfway through. I won’t print the whole log, but here are the parts that caught my attention:
More than one GPU was found, enabling multi-GPU training.
2023-09-17 10:35:32.654285: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Blocksparse is not available: the current GPU does not expose Tensor cores
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
My guess is that the errors come from the GTX card, which doesn’t have Tensor cores. I removed that card and everything ran fine with just the 3060. I imagine either card would work by itself, but the differences between the two may have been enough to cause data corruption.
So I’m wondering if anyone has this working with multiple RTX cards. Can it work across generations (a 3060 and a 4060 Ti, for example), or do the cards have to be the same generation? Thanks in advance.
As for the LoRA itself, it needs more work (denim boots)
Just an update: I was able to make it use only the 3060, without removing the other card, by adding
export CUDA_VISIBLE_DEVICES=0
to gui.sh
To list your device IDs, you can enter
python3 -c "import torch; print([(i, torch.cuda.get_device_properties(i)) for i in range(torch.cuda.device_count())])"
https://stackoverflow.com/questions/39649102/how-do-i-select-which-gpu-to-run-a-job-on
https://stackoverflow.com/a/73179074
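If you'd rather not read the index off that printout by eye, here is a rough sketch that turns a list of GPU names into the export line for gui.sh. The helper names (pick_device_index, export_line, list_gpu_names) and the example GPU names are made up for illustration; only torch.cuda.device_count() and torch.cuda.get_device_name() are real PyTorch calls. It assumes CUDA_VISIBLE_DEVICES is not already set when you list the devices, otherwise the indices are relative to the already-filtered list.

```python
def pick_device_index(names, wanted):
    """Return the index of the first GPU whose name contains `wanted` (case-insensitive)."""
    wanted = wanted.lower()
    for index, name in enumerate(names):
        if wanted in name.lower():
            return index
    raise ValueError(f"no GPU name contains {wanted!r}: {names}")


def export_line(names, wanted):
    """Build the line to paste into the launch script (gui.sh in my case)."""
    return f"export CUDA_VISIBLE_DEVICES={pick_device_index(names, wanted)}"


def list_gpu_names():
    """Ask PyTorch for the GPU names, in the order PyTorch indexes them."""
    import torch  # imported lazily so the helpers above work without PyTorch
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


# Example with made-up names; on a real machine use list_gpu_names() instead.
example_names = ["NVIDIA GeForce RTX 3060", "NVIDIA GeForce GTX 1080 Ti"]
print(export_line(example_names, "3060"))  # prints: export CUDA_VISIBLE_DEVICES=0
```

Run the listing in a separate Python session and paste the resulting line into the script. Setting the variable inside a process that has already initialised CUDA generally has no effect, which is why this only prints the line instead of changing the environment itself.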