I encountered an error when I deployed my training job in my notebook instance.
This is what it says:
"UnexpectedStatusException: Error for Training job tensorflow-training-2021-01-26-09-55-05-768: Failed. Reason: ClientError: Data download failed:Could not download s3://forex-model-data/data/train2001_2020.npz: insufficient disk space"
I launched the training job on several instance types, running 3 epochs each, to compare them. I used ml.c5.4xlarge, ml.c5.18xlarge, and ml.m5.24xlarge. I also have two training datasets: train2001_2020.npz and train2016_2020.npz.
First, I ran train2001_2020 on ml.c5.18xlarge and ml.c5.18xlarge, and the training jobs completed. Then I switched to train2016_2020 and ran it on ml.c5.4xlarge and ml.c5.18xlarge, which also went well. But when I tried to run it on ml.m5.24xlarge, I got the error quoted above, even though my dataset was train2016_2020, not train2001_2020. When I then reran the job on all the other instance types, I got the same error. What happened?
I stopped the instances and refreshed everything, but I still encounter the same issue.
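
For readers who want to reproduce the setup, here is a minimal sketch of the kind of launch loop described above. It assumes the SageMaker Python SDK's TensorFlow estimator. The entry point, role, and framework versions are hypothetical placeholders, and the location of train2016_2020.npz under the same S3 prefix is also an assumption; none of these come from the actual job. `volume_size` is the SDK's setting for the attached storage size in GB; whether it is related to the error above is not established here.

```python
# Sketch only: placeholder values are marked and are NOT taken from the real job.
S3_PREFIX = "s3://forex-model-data/data/"  # prefix seen in the error message
INSTANCE_TYPES = ["ml.c5.4xlarge", "ml.c5.18xlarge", "ml.m5.24xlarge"]
DATASETS = {
    "train2001_2020": S3_PREFIX + "train2001_2020.npz",
    # Assumption: the second dataset lives under the same prefix.
    "train2016_2020": S3_PREFIX + "train2016_2020.npz",
}


def build_estimator_kwargs(instance_type, epochs=3, volume_size_gb=30):
    """Collect the estimator arguments for one instance type."""
    return {
        "entry_point": "train.py",        # hypothetical script name
        "role": "<execution-role-arn>",   # placeholder, fill in your own
        "instance_count": 1,
        "instance_type": instance_type,
        "volume_size": volume_size_gb,    # attached storage size in GB
        "framework_version": "2.3",       # assumed version
        "py_version": "py37",             # assumed version
        "hyperparameters": {"epochs": epochs},
    }


def launch(instance_type, dataset_name):
    """Start one training job; needs the SageMaker SDK and AWS credentials."""
    from sagemaker.tensorflow import TensorFlow

    estimator = TensorFlow(**build_estimator_kwargs(instance_type))
    estimator.fit({"training": DATASETS[dataset_name]})
    return estimator
```

Calling `launch("ml.m5.24xlarge", "train2016_2020")` would correspond to the run that failed. Comparing the S3 URI passed to `fit` with the one named in the error message is one way to check which dataset a job is actually trying to download.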
question from:
https://stackoverflow.com/questions/65902366/aws-sagemaker-clienterror-data-download-failedcould-not-download