[add] Upload training benchmark by z00560161

2020-10-19 20:22:23 +08:00
parent 22b83024f5
commit 82522e2f61
1225 changed files with 345421 additions and 0 deletions
@@ -0,0 +1,197 @@
+# AlexNet for TensorFlow
+
+This repository provides a script and recipe to train the AlexNet model.
+
+## Table Of Contents
+
+* [Model overview](#model-overview)
+  * [Model architecture](#model-architecture)
+  * [Default configuration](#default-configuration)
+    * [Data augmentation](#data-augmentation)
+* [Setup](#setup)
+  * [Requirements](#requirements)
+* [Quick start guide](#quick-start-guide)
+* [Advanced](#advanced)
+  * [Command line arguments](#command-line-arguments)
+  * [Training process](#training-process)
+* [Performance](#performance)
+  * [Results](#results)
+    * [Training accuracy results](#training-accuracy-results)
+    * [Training performance results](#training-performance-results)
+
+
+    
+
+## Model overview
+
+The AlexNet model follows Alex Krizhevsky, "One weird trick for parallelizing convolutional neural networks", <https://arxiv.org/abs/1404.5997>.
+
+Reference implementation: <https://pytorch.org/docs/stable/_modules/torchvision/models/alexnet.html#alexnet>
+### Model architecture
+
+
+
+### Default configuration
+
+The following sections introduce the default configuration and hyperparameters for the AlexNet model.
+
+#### Optimizer
+
+This model uses the Momentum optimizer from TensorFlow with the following hyperparameters:
+
+- Momentum: 0.9
+- Learning rate (LR): 0.06
+- LR schedule: cosine_annealing
+- Batch size: 128
+- Weight decay: 0.0001
+- Label smoothing: 0.1
+- Training length:
+  - 150 epochs -> 60.1% top-1 accuracy
+
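The `cosine_annealing` schedule listed above can be sketched in plain Python as follows. This is a minimal illustration, not the schedule implemented in `train.py`: the function name, the `min_lr=0.0` floor, and the absence of warmup are assumptions; only the base learning rate of 0.06 comes from the list above.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=0.06, min_lr=0.0):
    """Cosine-annealed learning rate at a given global step.

    base_lr matches the default listed above (0.06).
    min_lr=0.0 and the lack of a warmup phase are assumptions.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    total = 1000  # hypothetical number of training steps
    for s in (0, 250, 500, 750, 1000):
        print(s, round(cosine_annealing_lr(s, total), 5))
```

The rate starts at `base_lr`, reaches half of it at the midpoint, and decays to `min_lr` at the final step.
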
+#### Data augmentation
+
+This model uses the following data augmentation:
+
+- For training:
+  - RandomResizeCrop, scale=(0.08, 1.0), ratio=(0.75, 1.333)
+  - RandomHorizontalFlip, prob=0.5
+  - Normalize, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
+- For inference:
+  - Resize to (256, 256)
+  - CenterCrop to (224, 224)
+  - Normalize, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
+
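The inference pipeline above (resize to 256x256, center crop to 224x224, normalize) can be illustrated with a small dependency-free sketch. The helper names are hypothetical, and the assumption that pixels are scaled to [0, 1] before the mean and std are applied is the usual convention for these ImageNet statistics rather than something confirmed by `train.py`.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel mean from the list above
STD = (0.229, 0.224, 0.225)   # per-channel std from the list above


def center_crop_box(height, width, crop_h=224, crop_w=224):
    """Return (top, left, bottom, right) of a centered crop window."""
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, top + crop_h, left + crop_w


def normalize_pixel(rgb):
    """Normalize one RGB pixel given as 0-255 values.

    Assumption: values are first scaled to [0, 1], then shifted by MEAN
    and divided by STD, channel by channel.
    """
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


if __name__ == "__main__":
    print(center_crop_box(256, 256))
    print(normalize_pixel((128, 128, 128)))
```

For a 256x256 input the crop window starts 16 pixels in from the top and left edges.
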
+## Setup
+The following section lists the requirements to start training the AlexNet model.
+### Requirements
+
+- TensorFlow
+- NPU environment
+
+## Quick Start Guide
+
+### 1. Clone the repository
+
+```shell
+git clone xxx
+cd  Model_zoo_Alexnet_HARD
+```
+
+### 2. Download and preprocess the dataset
+
+1. Download the ImageNet dataset.
+2. Extract the training data.
+3. The training and validation images should be under the train/ and val/ directories, respectively. All images within one folder have the same label.
+
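The folder-per-label layout described above can be checked with a short sketch that walks one split directory and assigns an integer label to each class folder. The function names and the choice of sorted folder order for label assignment are assumptions for illustration; the input pipeline in this repository may index classes differently.

```python
import os


def build_label_map(split_dir):
    """Map each class sub-directory name to an integer label (sorted order)."""
    classes = sorted(
        entry for entry in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, entry))
    )
    return {name: index for index, name in enumerate(classes)}


def list_samples(split_dir):
    """Return (image_path, label) pairs for every file under split_dir."""
    label_map = build_label_map(split_dir)
    samples = []
    for name, label in label_map.items():
        class_dir = os.path.join(split_dir, name)
        for filename in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(class_dir, filename), label))
    return samples
```

Calling `list_samples("/path/to/dataset/train")` and comparing the number of distinct labels against `--num_classes` is a quick sanity check before starting a long run.
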
+### 3. Train
+- Train on a single NPU:
+    - **Edit** *scripts/train_alexnet_1p.sh* (see the example below).
+    - Run `bash scripts/run_npu_1p.sh`.
+- Train on 8 NPUs:
+    - **Edit** *scripts/train_alexnet_8p.sh* (see the example below).
+    - Run `bash scripts/run_npu_8p.sh`.
+
+
+For example:
+- Single NPU
+    - In scripts/train_alexnet_1p.sh, the Python command should look as follows. For more details on the command line arguments, see [Command line arguments](#command-line-arguments).
+```shell
+python3.7 ${EXEC_DIR}/train.py --rank_size=1 \
+	--iterations_per_loop=100 \
+	--batch_size=256 \
+	--data_dir=/path/to/dataset \
+	--mode=train \
+	--lr=0.015 \
+	--log_dir=./model_1p > ./train_${device_id}.log 2>&1 
+```
+Run the program:
+```
+bash scripts/run_npu_1p.sh
+```
+- 8 NPUs
+    - In scripts/train_alexnet_8p.sh, the Python command should look as follows.
+```shell 
+python3.7 ${EXEC_DIR}/train.py --rank_size=8 \
+	--iterations_per_loop=100 \
+	--batch_size=128 \
+	--data_dir=/path/to/dataset \
+	--mode=train \
+	--lr=0.06 \
+	--log_dir=./model_8p > ./train_${device_id}.log 2>&1 
+```
+Run the program:
+```
+bash scripts/run_npu_8p.sh
+```
+
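The two example commands above appear consistent with the linear learning-rate scaling rule: 1 NPU x 256 images with `--lr=0.015`, and 8 NPUs x 128 images with `--lr=0.06`. The sketch below only checks that arithmetic. It assumes `--batch_size` is per NPU, as the option description states, and the function names are hypothetical.

```python
def global_batch_size(rank_size, per_device_batch):
    """Total images per step across all NPUs (batch_size is per NPU)."""
    return rank_size * per_device_batch


def scaled_lr(reference_lr, reference_batch, target_batch):
    """Linear scaling rule: learning rate proportional to global batch size."""
    return reference_lr * target_batch / reference_batch


if __name__ == "__main__":
    batch_1p = global_batch_size(rank_size=1, per_device_batch=256)
    batch_8p = global_batch_size(rank_size=8, per_device_batch=128)
    print(batch_1p, batch_8p)
    print(scaled_lr(0.015, batch_1p, batch_8p))
```

When moving to a different number of NPUs or a different per-device batch size, the same rule gives a starting point for `--lr`; whether it preserves accuracy for this model has not been verified here.
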
+### 4. Test
+- Follow the same procedure as for training, with two modifications:
+    - Change `--mode=train` to `--mode=evaluate`.
+    - Add `--checkpoint_dir=/path/to/checkpoints`.
+
+
+## Advanced
+### Command line arguments
+
+```
+  --data_dir                        training data directory
+  --num_classes                     number of classes in ImageNet (default: 1000)
+  --image_size                      image size of the dataset
+  --batch_size                      mini-batch size per NPU (default: 128)
+  --pretrained                      path to a pretrained model
+  --lr                              initial learning rate
+  --max_epochs                      maximum number of epochs to train the model
+  --warmup_epochs                   number of warmup epochs (for large batch sizes)
+  --weight_decay                    weight decay (default: 1e-4)
+  --momentum                        momentum (default: 0.9)
+  --label_smoothing                 label smoothing used in the cross-entropy loss (default: 0.1)
+  --save_summary_steps              logging interval (default: 100)
+  --log_dir                         path to save checkpoints and logs
+  --log_name                        name of the log file
+  --save_checkpoints_steps          interval between checkpoint saves
+  --mode                            mode to run the program (train, evaluate)
+  --checkpoint_dir                  path to the checkpoint for evaluation
+  --max_train_steps                 maximum number of training steps (default: None); when set, it overrides --max_epochs
+  --synthetic                       whether to use synthetic data
+  --version                         weight initialization for the model
+  --do_checkpoint                   whether to save checkpoints
+  --rank_size                       local rank for distributed training (default: 0)
+  --group_size                      world size for distributed training (default: 1)
+```
+For a complete list of options, please refer to `train.py`.
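To make the `--label_smoothing` option concrete, the sketch below shows one common formulation of label-smoothed cross-entropy: the one-hot target is scaled by `1 - smoothing` and `smoothing / num_classes` is added to every class. This is also how TensorFlow's built-in cross-entropy losses apply their `label_smoothing` argument, but whether `train.py` uses exactly this form is an assumption. The function names are hypothetical.

```python
import math


def smooth_labels(true_class, num_classes=1000, smoothing=0.1):
    """Return smoothed target probabilities for a single example."""
    off_value = smoothing / num_classes
    targets = [off_value] * num_classes
    targets[true_class] += 1.0 - smoothing
    return targets


def cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a target distribution."""
    # Numerically stable log-sum-exp for the softmax normalizer.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return -sum(t * (x - log_sum) for x, t in zip(logits, targets))


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    hard = cross_entropy(logits, smooth_labels(0, num_classes=3, smoothing=0.0))
    soft = cross_entropy(logits, smooth_labels(0, num_classes=3, smoothing=0.1))
    print(hard, soft)
```

With smoothing enabled, a confident correct prediction is penalized slightly more than with hard labels, which discourages over-confident logits.
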
+### Training process
+
+All training results are stored in the `results` directory.
+The script stores:
+ - checkpoints
+ - logs
+ 
+## Performance
+
+### Results
+
+Our results were obtained by running the applicable training script. To achieve the same results, follow the steps in the Quick Start Guide.
+
+#### Training accuracy results
+
+| **epochs** |   Top1/Top5   |
+| :--------: | :-----------: |
+|    150     | 60.12%/82.06% |
+
+#### Training performance results
+
+| **NPUs** | train performance |
+| :------: | :---------------: |
+|    8     |   30000+  img/s   |
+
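As a rough reading of the throughput figure above, the sketch below converts images per second into time per epoch and total training time. It assumes the ILSVRC-2012 training set (1,281,167 images) and a sustained rate of exactly 30000 img/s with no evaluation or checkpointing overhead, so the result is an estimate rather than a measured value.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size (assumed dataset)


def seconds_per_epoch(images_per_second, num_images=IMAGENET_TRAIN_IMAGES):
    """Time for one pass over the training set at a sustained throughput."""
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    return num_images / images_per_second


def training_hours(epochs, images_per_second, num_images=IMAGENET_TRAIN_IMAGES):
    """Total training time in hours, ignoring evaluation and checkpointing."""
    return epochs * seconds_per_epoch(images_per_second, num_images) / 3600.0


if __name__ == "__main__":
    print(seconds_per_epoch(30000))
    print(training_hours(150, 30000))
```

At 30000 img/s one epoch takes roughly 43 seconds, so 150 epochs come to a little under two hours of pure training time.
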
+
+
+
+
+
+
+
+
+
+
@@ -0,0 +1,9 @@
+{
+    "server_count": "1",
+    "server_list": [{
+        "device": [{devices}],
+        "server_id": "127.0.0.1"
+    }],
+    "status": "completed",
+    "version": "1.0"
+}
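The JSON file above is a template: `{devices}` has to be replaced with the device entries before the file is valid JSON. The sketch below shows one way to do that substitution. The function name and the example device fields (`device_id`, `rank_id`) are hypothetical placeholders; the fields actually expected by the NPU runtime are not shown in this commit.

```python
import json

TEMPLATE = """{
    "server_count": "1",
    "server_list": [{
        "device": [{devices}],
        "server_id": "127.0.0.1"
    }],
    "status": "completed",
    "version": "1.0"
}"""


def render_rank_table(devices, template=TEMPLATE):
    """Fill the {devices} placeholder with JSON-encoded device entries."""
    # str.format() would trip over the literal JSON braces, so use replace().
    encoded = ", ".join(json.dumps(device) for device in devices)
    rendered = template.replace("{devices}", encoded)
    # Parsing the result verifies that the rendered text is valid JSON.
    return json.loads(rendered)


if __name__ == "__main__":
    # Hypothetical device entry; real field names may differ.
    table = render_rank_table([{"device_id": "0", "rank_id": "0"}])
    print(json.dumps(table, indent=4))
```

Parsing the rendered text immediately surfaces quoting or comma mistakes in the device list before the file is handed to a training job.
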
@@ -0,0 +1,36 @@
+#!/bin/bash
+
+rm -rf /var/log/npu/slog/host-0/*
+# Main environment: use the NNAE install if present, otherwise fall back to ascend-toolkit
+if [ -d /usr/local/Ascend/nnae/latest ];then
+
+	export LD_LIBRARY_PATH=/usr/local/:/usr/local/lib/:/usr/lib/:/usr/local/Ascend/nnae/latest/fwkacllib/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/:/usr/local/Ascend/driver/tools/hccn_tool/:/usr/local/mpirun4.0/lib
+	export PYTHONPATH=$PYTHONPATH:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages:/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages
+	export PATH=$PATH:/usr/local/Ascend/nnae/latest/fwkacllib/ccec_compiler/bin:/usr/local/mpirun4.0/bin
+	export ASCEND_OPP_PATH=/usr/local/Ascend/nnae/latest/opp
+else
+	export LD_LIBRARY_PATH=/usr/local/lib/:/usr/lib/:/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/:/usr/local/mpirun4.0/lib
+	export PYTHONPATH=$PYTHONPATH:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe:/usr/local/Ascend/ascend-toolkit/latest//fwkacllib/python/site-packages/:/usr/local/Ascend/ascend-toolkit/latest/tfplugin/python/site-packages:$projectDir
+	export PATH=$PATH:/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/ccec_compiler/bin:/usr/local/mpirun4.0/bin
+	export ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/
+	
+fi
+
+export DDK_VERSION_FLAG=1.60.T17.B830
+export HCCL_CONNECT_TIMEOUT=600
+export JOB_ID=9999001
+
+export NEW_GE_FE_ID=1
+export GE_AICPU_FLAG=1
+export SOC_VERSION=Ascend910
+export DUMP_GE_GRAPH=1
+export DUMP_GRAPH_LEVEL=3
+export PRINT_MODEL=1
+export SLOG_PRINT_TO_STDOUT=1
+
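+# Profiling is disabled below; FP_POINT and BP_POINT appear to name ops from other networks (ssd, resnet34), not AlexNet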
+export PROFILING_MODE=false
+export PROFILING_OPTIONS=training_trace
+export FP_POINT=ssd/block7-conv1x1/Relu
+export BP_POINT=gradients/resnet34/Relu_grad/ReluGrad
+export AICPU_PROFILING_MODE=false