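The previous chapter covered model training with the object_detection open-source framework on tensorflow-cpu (unless otherwise noted, everything in this document runs on tensorflow-cpu). This section covers training with the object_detection framework on GPU servers, both single-machine and distributed, using the SSD model as the example. Only the model training step from Section 4.1.5 differs; all other steps are the same.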
Single-machine training runs on one GPU server, 172.22.13.223.
Change to the /root/software/object_detection/models/research/object_detection directory and run:

    python3 legacy/train.py \
        --logtostderr \
        --pipeline_config_path=datasetvocssd/ssd_mobilenet_v1_pascal.config \
        --train_dir=datasetvocssd/train_dir1
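Distributed training uses the three China Mobile 4A GPU servers, 172.22.13.221 through 172.22.13.223. 172.22.13.223 acts as the parameter server, which handles parameter updates. 172.22.13.221 and 172.22.13.222 act as compute servers, which run the neural network computation. 172.22.13.222 is also responsible for initializing the parameters and for saving the model and the summaries. The data and configuration on all three servers are identical.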
Change to the /root/software/object_detection/models/research/object_detection directory on each server (the commands below use relative paths), then start the three tasks.

On hadoop1 (172.22.13.223), run the ps task:

    TF_CONFIG='{"cluster":{"master":["hadoop2:2002"],"ps":["hadoop1:2222"],"worker":["hadoop3:3003"]},"task":{"index":0,"type":"ps"}}' \
    nohup python3 legacy/train.py \
        --logtostderr \
        --pipeline_config_path=datasetvocssd/ssd_mobilenet_v1_pascal.config \
        --train_dir=datasetvocssd/train_dir >> datasetvocssd/train.log &

On hadoop2 (172.22.13.222), run the master task:

    TF_CONFIG='{"cluster":{"master":["hadoop2:2002"],"ps":["hadoop1:2222"],"worker":["hadoop3:3003"]},"task":{"index":0,"type":"master"}}' \
    nohup python3 /root/software/object_detection/models/research/object_detection/legacy/train.py \
        --logtostderr \
        --pipeline_config_path=datasetvocssd/ssd_mobilenet_v1_pascal.config \
        --train_dir=datasetvocssd/train_dir >> datasetvocssd/train.log &

On hadoop3 (172.22.13.221), run the worker task:

    TF_CONFIG='{"cluster":{"master":["hadoop2:2002"],"ps":["hadoop1:2222"],"worker":["hadoop3:3003"]},"task":{"index":0,"type":"worker"}}' \
    nohup python3 /root/software/object_detection/models/research/object_detection/legacy/train.py \
        --logtostderr \
        --pipeline_config_path=datasetvocssd/ssd_mobilenet_v1_pascal.config \
        --train_dir=datasetvocssd/train_dir >> datasetvocssd/train.log &
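The three TF_CONFIG values differ only in the task type, so they are easy to get wrong when copied by hand. The following is a minimal sketch of generating them from one cluster definition. The helper name make_tf_config is hypothetical and is not part of object_detection; the host names and ports are the ones used in the commands above.

```python
import json

# Cluster layout copied from the TF_CONFIG values used in this section.
CLUSTER = {
    "master": ["hadoop2:2002"],
    "ps": ["hadoop1:2222"],
    "worker": ["hadoop3:3003"],
}


def make_tf_config(task_type, index=0, cluster=CLUSTER):
    """Return the TF_CONFIG JSON string for one task (hypothetical helper)."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} task with index {index}")
    config = {"cluster": cluster, "task": {"index": index, "type": task_type}}
    # Compact separators keep the string free of spaces, like the originals.
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    # Print one shell-ready assignment per role.
    for role in ("ps", "master", "worker"):
        print(f"TF_CONFIG='{make_tf_config(role)}'")
```

Each printed line can be placed in front of the corresponding nohup python3 command on the matching server. Validating the task type and index up front catches a mistyped role before a training process is started.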
Reposted from: http://tnxin.baihongyu.com/