Differentiate thread failure from GPU failure by declaring a GPU sick first and trying to restart its thread without re-initialising the card. If that fails, try once more at ten minutes and then declare the GPU dead. This should prevent an attempt to re-initialise one GPU from taking out other GPUs.
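
A minimal sketch of the sick/dead escalation described above, in Python for illustration only; it is not the actual implementation. The names `Health`, `GpuWatch`, `restart_thread` and `check` are hypothetical, and the 60-second sick threshold is an assumption (only the ten-minute mark comes from the text). The restart callback is assumed to restart the thread alone, without touching the card.

```python
from enum import Enum


class Health(Enum):
    WELL = "well"
    SICK = "sick"
    DEAD = "dead"


SICK_AFTER_S = 60    # assumption: the text gives no threshold for "sick"
DEAD_AFTER_S = 600   # "ten minutes", from the text


class GpuWatch:
    """Tracks one GPU's health from how long its thread has been silent."""

    def __init__(self, restart_thread):
        # Hypothetical callback: restarts the thread only, without
        # re-initialising the card. Returns True if the restart worked.
        self.restart_thread = restart_thread
        self.health = Health.WELL

    def check(self, silent_for_s):
        """Update and return the health given seconds of thread silence."""
        if self.health is Health.DEAD:
            # No further restart attempts once the GPU is declared dead.
            return self.health
        if self.health is Health.WELL and silent_for_s >= SICK_AFTER_S:
            # First stage: declare sick and try a thread-only restart.
            self.health = Health.SICK
            if self.restart_thread():
                self.health = Health.WELL
        elif self.health is Health.SICK:
            if silent_for_s < SICK_AFTER_S:
                # The thread has reported in again on its own.
                self.health = Health.WELL
            elif silent_for_s >= DEAD_AFTER_S:
                # Second stage: one more try, then declare dead.
                self.health = Health.WELL if self.restart_thread() else Health.DEAD
        return self.health
```

A watchdog loop would call `check()` periodically with the time since the thread last reported in. Because neither stage re-initialises the card, a failure on one GPU is not expected to disturb the others.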