What can you do when GPU memory runs out while training deep learning models?
Author: 游客26024 @ Zhihu (republished with permission)
Source: https://www.zhihu.com/question/461811359/answer/2492822726
Editor: 极市平台
As an aside, why am I writing this post at all? Because I'm broke! Renting a multi-GPU server burns through money in no time (while GPU memory goes underused), so I urgently needed a trick to cut memory use and speed training up.
Back to the topic. If the dataset is large and the network is deep, training gets slow. To speed it up we can use PyTorch's AMP (autocast and GradScaler). That is what this post is about: accelerating model training with PyTorch's automatic mixed precision, comparing autocast on its own against autocast plus GradScaler.
Note that PyTorch 1.6 already ships with torch.cuda.amp built in, so NVIDIA's apex library (half-precision acceleration) is no longer required. Since apex is also a hassle to install, this post uses torch.cuda.amp throughout.
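As a quick environment check (my addition, not part of the original post): the import below raises an ImportError on anything older than PyTorch 1.6:

```python
import torch
from torch.cuda.amp import autocast, GradScaler  # available since PyTorch 1.6

print(torch.__version__)
```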
AMP stands for Automatic Mixed Precision. So what is automatic mixed precision?
First, a quick bit of history: NVIDIA's apex came first, and NVIDIA's developers later contributed it to PyTorch 1.6, producing torch.cuda.amp. [This timeline is my own reconstruction and may be wrong; corrections welcome in the comments.]
In more detail: by default, most deep learning frameworks train in 32-bit floating point. In 2017, NVIDIA developed a method for mixed-precision training (apex) that combines single precision (FP32) with half precision (FP16) during training; with the same hyperparameters it reaches nearly the same accuracy as pure FP32, and runs considerably faster.
Then came the AMP era (meaning specifically torch.cuda.amp, introduced in PyTorch 1.6). There are two keywords here: automatic and mixed precision. "Automatic" means tensor dtypes change automatically: the framework adjusts each tensor's dtype as needed (though a few places may still require manual intervention). "Mixed precision" means tensors of more than one precision are in play: torch.FloatTensor and torch.HalfTensor. And as the name torch.cuda.amp makes clear, this feature only works on CUDA!

So why should we use AMP, i.e. automatic mixed precision?
1. It reduces memory usage (an FP16 advantage).
2. It speeds up training and inference (an FP16 advantage).
3. Tensor Cores (NVIDIA Tensor Core) are now widespread, and they favor low precision (an FP16 advantage).
4. Mixed precision has to cope with rounding error (an FP16 weakness that FP32 avoids).
5. Loss scaling: with mixed precision alone the model may even fail to converge, because small activation gradient values underflow; torch.cuda.amp.GradScaler prevents gradient underflow by scaling the loss up.

A minimal sketch of the "automatic" dtype switching behind points 1-3 follows this list.
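To make the "automatic" part concrete, here is a minimal sketch of my own (not from the original post). It assumes a CUDA device is available, and shows that inside an autocast region a matmul runs in torch.float16 while the inputs themselves stay torch.float32:

```python
import torch
from torch.cuda.amp import autocast

a = torch.randn(8, 8, device="cuda")  # created as torch.float32
b = torch.randn(8, 8, device="cuda")

with autocast():
    c = a @ b                          # matmul is autocast to half precision

print(a.dtype, c.dtype)                # torch.float32 torch.float16
```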
To be clear: the point of this post is making training faster, not explaining the underlying theory. The running example uses AlexNet as the architecture (it expects 227x227x3 input images), CIFAR10 as the dataset, AdamW as the optimizer, and ReduceLROnPlateau as the learning-rate scheduler. The machine is a Lenovo Legion laptop with an RTX 2060: modest, but good enough for these tests.
The post walks through three setups: 1. training and evaluation without DDP or DP (then with AMP added); 2. DP training and evaluation (then with AMP added); 3. single-process multi-GPU DDP training and evaluation (then with AMP added).
Directory layout when running these programs:

```
D:\PycharmProject\Simple-CV-Pytorch-master
├── AMP
│   ├── train_without.py
│   ├── train_DP.py
│   ├── train_autocast.py
│   ├── train_GradScaler.py
│   ├── eval_XXX.py
│   └── alexnet.py        (added later for the DP/DDP experiments)
├── tensorboard            (tensorboard log folder)
├── checkpoint             (saved models)
└── data                   (dataset folder)
```

1. Training and evaluation without DDP or DP
The run without DDP or DP serves as the control group for our experiments.

(1) Training and evaluation source for the original model:
Training source:
Note: this code is extremely bare-bones, only a rough skeleton; it just has to get the idea across!
train_without.py:

```python
import time
import torch
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torchvision.models import alexnet
from torchvision import transforms
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()


args = parse_args()

# 1. Create SummaryWriter
if args.tensorboard:
    writer = SummaryWriter(args.tensorboard_log)

# 2. Ready dataset
if args.dataset == 'CIFAR10':
    train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True,
                                                 transform=transforms.Compose(
                                                     [transforms.Resize(args.img_size),
                                                      transforms.ToTensor()]),
                                                 download=True)
else:
    raise ValueError("Dataset is not CIFAR10")

cuda = torch.cuda.is_available()
print('CUDA available: {}'.format(cuda))

# 3. Length
train_dataset_size = len(train_dataset)
print('the train dataset size is {}'.format(train_dataset_size))

# 4. DataLoader
train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)

# 5. Create model
model = alexnet()
if args.cuda == cuda:
    model = model.cuda()

# 6. Create loss
cross_entropy_loss = nn.CrossEntropyLoss()

# 7. Optimizer
optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)

# 8. Set some parameters to control loop
# epoch
iter = 0
t0 = time.time()
for epoch in range(args.epochs):
    t1 = time.time()
    print('the {} number of training epoch'.format(epoch))
    model.train()
    for data in train_dataloader:
        loss = 0
        imgs, targets = data
        if args.cuda == cuda:
            cross_entropy_loss = cross_entropy_loss.cuda()
            imgs, targets = imgs.cuda(), targets.cuda()
        outputs = model(imgs)
        loss_train = cross_entropy_loss(outputs, targets)
        loss = loss_train.item() + loss
        if args.tensorboard:
            writer.add_scalar("train_loss", loss_train.item(), iter)
        optim.zero_grad()
        loss_train.backward()
        optim.step()
        iter = iter + 1
        if iter % 100 == 0:
            print('Epoch: {} Iteration: {} lr: {} loss: {} np.mean(loss): {}'.format(
                epoch, iter, optim.param_groups[0]['lr'], loss_train.item(), np.mean(loss)))
    if args.tensorboard:
        writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
    scheduler.step(np.mean(loss))
    t2 = time.time()
    h = (t2 - t1) // 3600
    m = ((t2 - t1) % 3600) // 60
    s = ((t2 - t1) % 3600) % 60
    print('epoch {} is finished, and time is {}h{}m{}s'.format(epoch, int(h), int(m), int(s)))
    if epoch > 10:
        print('Save state, iter: {}'.format(epoch))
        torch.save(model.state_dict(), '{}/AlexNet_{}.pth'.format(args.checkpoint, epoch))
    torch.save(model.state_dict(), '{}/AlexNet.pth'.format(args.checkpoint))

t3 = time.time()
ht = (t3 - t0) // 3600
mt = ((t3 - t0) % 3600) // 60
st = ((t3 - t0) % 3600) % 60
print('The finished time is {}h{}m{}s'.format(int(ht), int(mt), int(st)))
if args.tensorboard:
    writer.close()
```
Running results:
Tensorboard view:
Evaluation source:
The code is especially rough, in particular the device handling and the accuracy computation; it is for reference only, do not imitate it!
eval_without.py:

```python
import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision.transforms import transforms
from alexnet import alexnet
import argparse


# eval
def parse_args():
    parser = argparse.ArgumentParser(description='CV Evaluation')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()


args = parse_args()

# 1. Create model
model = alexnet()

# 2. Ready Dataset
if args.dataset == 'CIFAR10':
    test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                transform=transforms.Compose(
                                                    [transforms.Resize(args.img_size),
                                                     transforms.ToTensor()]),
                                                download=True)
else:
    raise ValueError("Dataset is not CIFAR10")

# 3. Length
test_dataset_size = len(test_dataset)
print('the test dataset size is {}'.format(test_dataset_size))

# 4. DataLoader
test_dataloader = DataLoader(dataset=test_dataset, batch_size=args.batch_size)

# 5. Set some parameters for testing the network
total_accuracy = 0

# test
model.eval()
with torch.no_grad():
    for data in test_dataloader:
        imgs, targets = data
        device = torch.device('cpu')
        imgs, targets = imgs.to(device), targets.to(device)
        model_load = torch.load('{}/AlexNet.pth'.format(args.checkpoint), map_location=device)
        model.load_state_dict(model_load)
        outputs = model(imgs)
        outputs = outputs.to(device)
        accuracy = (outputs.argmax(1) == targets).sum()
        total_accuracy = total_accuracy + accuracy
        accuracy = total_accuracy / test_dataset_size
print('the total accuracy is {}'.format(accuracy))
```
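Since the script above reloads the checkpoint from disk on every batch, here is a cleaner version of the loop (my suggestion, not part of the original): load the weights once before iterating. It reuses the names (args, model, test_dataloader, test_dataset_size) from the script above.

```python
device = torch.device('cpu')
model_load = torch.load('{}/AlexNet.pth'.format(args.checkpoint), map_location=device)
model.load_state_dict(model_load)   # load the checkpoint exactly once
model.eval()

total_accuracy = 0
with torch.no_grad():
    for imgs, targets in test_dataloader:
        imgs, targets = imgs.to(device), targets.to(device)
        outputs = model(imgs)
        total_accuracy += (outputs.argmax(1) == targets).sum().item()
print('the total accuracy is {}'.format(total_accuracy / test_dataset_size))
```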
Running results:
Analysis:
Training the original model for 20 epochs took 22 min 22 s, with a final accuracy of 0.8191.

(2) Training and evaluation source for the original model with autocast added:
Training source:
Rough outline of the training flow:

```python
from torch.cuda.amp import autocast as autocast

# Create model, default torch.FloatTensor
model = Net().cuda()
# SGD, Adam, AdamW, ...
optim = optim.XXX(model.parameters(), ...)
...
for imgs, targets in dataloader:
    imgs, targets = imgs.cuda(), targets.cuda()
    ...
    # forward pass and loss run under autocast
    with autocast():
        outputs = model(imgs)
        loss = loss_fn(outputs, targets)
    ...
    # backward and step stay outside the autocast region
    optim.zero_grad()
    loss.backward()
    optim.step()
    ...
```
train_autocast_without.py is train_without.py with one extra import and the forward pass plus loss computation wrapped in autocast; everything else (arguments, data, optimizer, scheduler, saving) is unchanged:

```python
from torch.cuda.amp import autocast

# ... same setup as train_without.py ...

# inside the batch loop, the forward and the loss move under autocast:
with autocast():
    outputs = model(imgs)
    loss_train = cross_entropy_loss(outputs, targets)
# backward and step stay exactly as before:
optim.zero_grad()
loss_train.backward()
optim.step()
```
Running results:
Tensorboard view:
Evaluation source:
eval_without.py is the same as in 1.(1).
Running results:
Analysis:
Training the original model for 20 epochs took 22 min 22 s; with autocast added it took 21 min 21 s, so the model did get faster, and accuracy rose from the previous 0.8191 to 0.8403.

(3) Training and evaluation source with autocast and GradScaler added:
torch.cuda.amp.GradScaler works by scaling the loss value up to prevent gradient underflow (the gradients are unscaled again before the optimizer step).
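To see why the scaling matters, here is a tiny demonstration (mine, not from the original) of FP16 underflow: values below FP16's smallest subnormal (about 6e-8) flush to zero, while the scaled version survives:

```python
import torch

g = torch.tensor(1e-8)          # a typical tiny activation-gradient value
print(g.half())                 # tensor(0., dtype=torch.float16): underflows to zero

scaled = (g * 65536).half()     # scale up before the FP16 cast, as loss scaling does
print(scaled.float() / 65536)   # unscale in FP32 (as GradScaler does): ~1e-08 survives
```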
Training source:
Rough outline of the training flow:

```python
from torch.cuda.amp import autocast as autocast
from torch.cuda.amp import GradScaler as GradScaler

# Create model, default torch.FloatTensor
model = Net().cuda()
# SGD, Adam, AdamW, ...
optim = optim.XXX(model.parameters(), ...)
scaler = GradScaler()
...
for imgs, targets in dataloader:
    imgs, targets = imgs.cuda(), targets.cuda()
    ...
    optim.zero_grad()
    ...
    with autocast():
        outputs = model(imgs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optim)              # unscales the gradients, then steps
    scaler.update()                 # adjusts the scale factor for the next iteration
    ...
```
train_GradScaler_without.py builds on train_autocast_without.py: a GradScaler is created next to the scheduler, optim.zero_grad() moves to the top of the batch loop, and backward/step go through the scaler; the rest is unchanged:

```python
from torch.cuda.amp import autocast, GradScaler

# ... same setup as before, then next to the scheduler:
scaler = GradScaler()

# inside the batch loop:
optim.zero_grad()
with autocast():
    outputs = model(imgs)
    loss_train = cross_entropy_loss(outputs, targets)
scaler.scale(loss_train).backward()
scaler.step(optim)
scaler.update()
```
Running results:
Tensorboard view:
Evaluation source:
eval_without.py is the same as in 1.(1).
Running results:
Analysis:
Why did training for 20 epochs now take 27 min 27 s, more than the original model without any AMP at all (22 min 22 s)?
This is because the GradScaler loss scaling adds work that slows training down; another possible reason is that my own GPU is too small for the mixed-precision speedup to show.

2. DP (DataParallel) training and evaluation

(1) Training and evaluation source for the original DP model:
Training source:
train_DP.py is train_without.py with the model additionally wrapped in torch.nn.DataParallel at step 5; nothing else changes:

```python
# 5. Create model
model = alexnet()
if args.cuda == cuda:
    model = model.cuda()
    model = torch.nn.DataParallel(model).cuda()
else:
    model = torch.nn.DataParallel(model)
```
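One caveat worth knowing (my note, not in the original): torch.save(model.state_dict()) on a DataParallel-wrapped model saves keys prefixed with "module.", which is why eval_DP.py below has to wrap the model in DataParallel again before loading. Saving the underlying module instead avoids the prefix entirely:

```python
# Saves plain (unprefixed) keys, loadable into a bare alexnet():
torch.save(model.module.state_dict(), '{}/AlexNet.pth'.format(args.checkpoint))
```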
Running results:
Tensorboard view:
Evaluation source:
eval_DP.py is eval_without.py with one change at step 1: the model is wrapped in DataParallel before the checkpoint is loaded (the state dict saved by train_DP.py carries the "module." key prefix):

```python
# 1. Create model
model = alexnet()
model = torch.nn.DataParallel(model)
```
Running results:
(2) DP training and evaluation source with autocast:
Training source:
If you write the code like this, your autocast does nothing!!! (DataParallel runs each replica's forward pass in its own thread, and autocast state is thread-local, so an autocast region opened in the main thread does not cover the replicas.)

```python
...
model = Model()
model = torch.nn.DataParallel(model)
...
with autocast():
    output = model(imgs)
    loss = loss_fn(output)
```
The correct way, as a rough flow:

```python
# Option 1: decorate forward
class Model(nn.Module):
    @autocast()
    def forward(self, input):
        ...

# Option 2: open the autocast region at the top of forward
class Model(nn.Module):
    def forward(self, input):
        with autocast():
            ...
```
Either option 1 or option 2 works; after that:

```python
...
model = Model()
model = torch.nn.DataParallel(model)
with autocast():
    output = model(imgs)
    loss = loss_fn(output)
...
```
The model:
You must either decorate the forward function with @autocast() or open a with autocast(): block at the very top of forward.
alexnet.py:

```python
import torch
import torch.nn as nn
from torchvision.models.utils import load_state_dict_from_url
from torch.cuda.amp import autocast
from typing import Any

__all__ = ['AlexNet', 'alexnet']

model_urls = {
    'alexnet': 'https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth',
}


class AlexNet(nn.Module):

    def __init__(self, num_classes: int = 1000) -> None:
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    @autocast()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x


def alexnet(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> AlexNet:
    r"""AlexNet model architecture from the
    `"One weird trick..." <https://arxiv.org/abs/1404.5997>`_ paper.

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    model = AlexNet(**kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls['alexnet'], progress=progress)
        model.load_state_dict(state_dict)
    return model
```
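A quick way to check that the decorator took effect (my own sanity check, not in the original): run a dummy forward through the DataParallel-wrapped model and confirm the output comes back in half precision:

```python
import torch
from alexnet import alexnet   # the file above

model = torch.nn.DataParallel(alexnet().cuda())
x = torch.randn(2, 3, 227, 227, device="cuda")
print(model(x).dtype)          # torch.float16 -> autocast is active inside forward
```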
train_DP_autocast.py imports the alexnet.py above instead of torchvision's AlexNet, and is otherwise train_DP.py with the forward pass and loss wrapped in autocast, exactly as in 1.(2):

```python
from alexnet import alexnet                    # our own alexnet.py, autocast inside forward
from torch.cuda.amp import autocast as autocast

# inside the batch loop:
with autocast():
    outputs = model(imgs)
    loss_train = cross_entropy_loss(outputs, targets)
```
Running results:
Tensorboard view:
Evaluation source:
eval_DP.py as in 2.(1), except it imports our own alexnet.py.
Running results:
Analysis:
DP with autocast finished 20 epochs in 21 min 21 s, 1 min 1 s faster than DP without it (22 min 22 s).
DP without AMP reached an accuracy of 0.8216; now it drops to 0.8188, so the mixed-precision speedup does cost the model some accuracy. One mitigation for later: increase the batch size so the runtime climbs back to the un-accelerated level while the accuracy goes up.

(3) DP training and evaluation source with autocast and GradScaler:
Training source:
train_DP_GradScaler.py combines the previous pieces: it imports our own alexnet.py, wraps the model in DataParallel, creates a GradScaler next to the scheduler, moves optim.zero_grad() to the top of the batch loop, and runs backward/step through scaler.scale(loss_train).backward(); scaler.step(optim); scaler.update(), exactly as in 1.(3). No other lines change.
Running results:
Tensorboard view:
Evaluation source:
eval_DP.py as in 2.(1), except it imports our own alexnet.py.
Running results:
Analysis:
As before, DP with GradScaler's loss scaling trained more slowly.
DP with autocast and GradScaler reaches an accuracy of 0.8409, up from the 0.8188 of DP with autocast alone, and also a clear improvement over the 0.8216 of DP without AMP.

3. Single-process multi-GPU DDP training and evaluation

(1) Training and evaluation source for the original DDP model:
Training source:
train_DDP.py: compared with train_without.py, the script gains distributed arguments, wraps everything in a train() function that first initializes the process group, adds a DistributedSampler, and wraps the model in DistributedDataParallel. The training loop itself is unchanged. The pieces that differ:

```python
import torch.distributed as dist
from torchvision.models.alexnet import alexnet

# extra arguments in parse_args():
#   parser.add_argument('--rank', type=int, default=0)
#   parser.add_argument('--world_size', type=int, default=1)
#   parser.add_argument('--master_addr', type=str, default='127.0.0.1')
#   parser.add_argument('--master_port', type=str, default='12355')


def train():
    dist.init_process_group("gloo",
                            init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank, world_size=args.world_size)

    # ... SummaryWriter, dataset and length checks as in train_without.py ...

    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)

    # 4. DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size,
                                  sampler=train_sampler, num_workers=2, pin_memory=True)

    # 5. Create model
    model = alexnet()
    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(model).cuda()
    else:
        model = torch.nn.parallel.DistributedDataParallel(model)

    # ... loss, optimizer, scheduler and the training loop as in train_without.py ...


if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: {}".format(local_size))
    train()
```
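For reference (my addition, a hypothetical sketch rather than part of the original setup): with the defaults --rank 0 --world_size 1 this runs as a single process occupying the GPUs. If you later wanted one process per GPU, you could launch workers with torch.multiprocessing, each initializing the process group with its own rank:

```python
# Hypothetical multi-process launch sketch; not used in this post's experiments.
import torch
import torch.multiprocessing as mp
import torch.distributed as dist


def worker(rank, world_size):
    dist.init_process_group("gloo",
                            init_method="tcp://127.0.0.1:12355",
                            rank=rank, world_size=world_size)
    # ... build dataset, sampler, model and the training loop here,
    # pinning each process to its own device via torch.cuda.set_device(rank)


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```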
Running results:
Tensorboard view:
Evaluation source:
eval_DDP.py: relative to eval_DP.py, it gains the same distributed arguments as train_DDP.py and wraps everything in an eval() function that initializes the process group, uses a DistributedSampler for the test set, and wraps the model in DistributedDataParallel. The pieces that differ:

```python
import torch.distributed as dist
from torchvision.models.alexnet import alexnet


def eval():
    dist.init_process_group("gloo",
                            init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank, world_size=args.world_size)

    # 1. Create model
    model = alexnet()
    model = torch.nn.parallel.DistributedDataParallel(model)

    # ... dataset and length checks as in eval_without.py ...

    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)

    # 4. DataLoader
    test_dataloader = DataLoader(dataset=test_dataset, sampler=test_sampler,
                                 batch_size=args.batch_size, num_workers=2, pin_memory=True)

    # ... the evaluation loop as in eval_without.py ...


if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: {}".format(local_size))
    eval()
```
Running results:
(2) DDP training and evaluation source with autocast:
Training source:
train_DDP_autocast.py imports our own alexnet.py and is otherwise train_DDP.py with the forward pass and loss wrapped in autocast, as in 1.(2); the DDP setup, sampler, and saving logic are unchanged.
Running results:
Tensorboard view:
Evaluation source:
eval_DDP.py as in 3.(1), but importing our own alexnet.py. (Note: in the original listing the line from torchvision.models.alexnet import alexnet appears after from alexnet import alexnet and would shadow it; the torchvision import has to be dropped for the custom model to actually be used.)
Running results:
Analysis:
DDP without AMP took 21 min 21 s, and DDP with autocast took 20 min 20 s, so it did get faster.
DDP without AMP reached an accuracy of 0.8224; with autocast it dropped to 0.8162.

(3) DDP training and evaluation source with autocast and GradScaler
Training source:
train_DDP_GradScaler.py imports our own alexnet.py and applies the GradScaler recipe from 1.(3) on top of train_DDP.py: scaler = GradScaler() next to the scheduler, optim.zero_grad() at the top of the batch loop, and scaler.scale(loss_train).backward(); scaler.step(optim); scaler.update() in place of the plain backward and step. Everything else is unchanged.
Running results:
Tensorboard view:
Evaluation source:
eval_DDP.py is the same as in 3.(2), importing our own alexnet.py.
Running results:
Analysis:
It runs, and it is noticeably faster than DDP without AMP (20 min 20 s versus 21 min 21 s). DDP without AMP reached an accuracy of 0.8224; with autocast and GradScaler it reaches 0.8252, an improvement.
References:
1. PyTorch automatic mixed precision (AMP) training: https://blog.csdn.net/ytusdc/article/details/122152244
2. PyTorch distributed training basics: DDP usage: https://zhuanlan.zhihu.com/p/358974461