[add] Upload training benchmark by z00560161

liang_chaoming@huawei.com
2020-10-19 20:22:23 +08:00
parent 22b83024f5
commit 82522e2f61
1225 changed files with 345421 additions and 0 deletions
@@ -0,0 +1,197 @@
# AlexNet for TensorFlow
This repository provides a script and recipe to train the AlexNet model.
## Table Of Contents
* [Model overview](#model-overview)
    * [Model architecture](#model-architecture)
    * [Default configuration](#default-configuration)
    * [Data augmentation](#data-augmentation)
* [Setup](#setup)
    * [Requirements](#requirements)
* [Quick start guide](#quick-start-guide)
* [Advanced](#advanced)
    * [Command line arguments](#command-line-arguments)
    * [Training process](#training-process)
* [Performance](#performance)
    * [Results](#results)
        * [Training accuracy results](#training-accuracy-results)
        * [Training performance results](#training-performance-results)
## Model overview
The AlexNet model is described in
`Alex Krizhevsky. "One weird trick for parallelizing convolutional neural networks". <https://arxiv.org/abs/1404.5997>.`

Reference implementation: <https://pytorch.org/docs/stable/_modules/torchvision/models/alexnet.html#alexnet>
### Model architecture
### Default configuration
The following sections introduce the default configuration and hyperparameters for the AlexNet model.
#### Optimizer
This model uses the Momentum optimizer from TensorFlow with the following hyperparameters:
- Momentum: 0.9
- Learning rate (LR): 0.06
- LR schedule: cosine_annealing
- Batch size: 128
- Weight decay: 0.0001
- Label smoothing: 0.1
- We train for:
  - 150 epochs -> 60.1% top1 accuracy
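
The schedule and label smoothing listed above can be sketched in plain Python. This is a minimal illustration, not the code in `train.py`: it assumes the textbook cosine annealing formula with no warmup and a final learning rate of 0, and the function names are hypothetical.

```python
import math

BASE_LR = 0.06       # learning rate from the list above
TOTAL_EPOCHS = 150   # training length from the list above


def cosine_annealing_lr(epoch, base_lr=BASE_LR, total_epochs=TOTAL_EPOCHS, min_lr=0.0):
    """Textbook cosine annealing from base_lr down to min_lr.

    Assumption: no warmup and min_lr = 0; the real schedule may differ.
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def smooth_labels(label, num_classes, smoothing=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    off_value = smoothing / num_classes
    target = [off_value] * num_classes
    target[label] += 1.0 - smoothing
    return target


if __name__ == "__main__":
    for epoch in (0, 75, 150):
        print(epoch, cosine_annealing_lr(epoch))
```

With these assumptions the learning rate starts at 0.06, passes through half of that value at epoch 75, and reaches 0 at epoch 150.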
#### Data augmentation
This model uses the following data augmentation:
- For training:
  - RandomResizeCrop, scale=(0.08, 1.0), ratio=(0.75, 1.333)
  - RandomHorizontalFlip, prob=0.5
  - Normalize, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
- For inference:
  - Resize to (256, 256)
  - CenterCrop to (224, 224)
  - Normalize, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
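
To make the inference-time steps concrete, the following sketch computes the center-crop window and normalizes a single pixel. It is only an illustration of the arithmetic: the helper names are hypothetical, and it assumes pixel values are scaled to [0, 1] before the mean and standard deviation are applied, which is the usual convention for these ImageNet statistics.

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel mean from the list above
STD = (0.229, 0.224, 0.225)   # per-channel std from the list above


def center_crop_box(height, width, crop_h=224, crop_w=224):
    """Return (top, left, bottom, right) of a centered crop window."""
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, top + crop_h, left + crop_w


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Normalize one RGB pixel given as 0-255 integers.

    Assumption: values are divided by 255 before mean/std are applied.
    """
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


if __name__ == "__main__":
    # A 256x256 image cropped to 224x224 keeps rows and columns 16..239.
    print(center_crop_box(256, 256))
    print(normalize_pixel((128, 128, 128)))
```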
## Setup
The following section lists the requirements to start training the AlexNet model.
### Requirements
- TensorFlow
- NPU environment
## Quick Start Guide
### 1. Clone the repository
```shell
git clone xxx
cd Model_zoo_Alexnet_HARD
```
### 2. Download and preprocess the dataset
1. Download the ImageNet dataset.
2. Extract the training data.
3. The train and val images are under the `train/` and `val/` directories, respectively. All images within one folder have the same label.
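
Given the layout described above (one folder per class under `train/` and `val/`), a quick sanity check before training might look like the following. This is a hypothetical helper, not part of the repository; in particular, assigning labels by sorted folder name is an assumption, and `train.py` may read the data differently.

```python
import os


def build_label_map(split_dir):
    """Map each class folder name to an integer label, sorted by name."""
    classes = sorted(
        entry for entry in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, entry))
    )
    return {name: index for index, name in enumerate(classes)}


def check_splits(data_dir):
    """Return the label map if train/ and val/ contain the same class folders."""
    train_map = build_label_map(os.path.join(data_dir, "train"))
    val_map = build_label_map(os.path.join(data_dir, "val"))
    if train_map != val_map:
        raise ValueError("train/ and val/ do not contain the same class folders")
    return train_map
```

For ImageNet, `len(check_splits("/path/to/dataset"))` should equal 1000.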
### 3. Train
- Train on a single NPU
  - **edit** *scripts/train_alexnet_1p.sh* (see example below)
  - `bash scripts/run_npu_1p.sh`
- Train on 8 NPUs
  - **edit** *scripts/train_alexnet_8p.sh* (see example below)
  - `bash scripts/run_npu_8p.sh`

Examples:
- Single NPU
  - In `scripts/train_alexnet_1p.sh`, the Python command should look as follows. For more details on the command-line arguments, refer to [Command line arguments](#command-line-arguments).
```shell
python3.7 ${EXEC_DIR}/train.py --rank_size=1 \
--iterations_per_loop=100 \
--batch_size=256 \
--data_dir=/path/to/dataset \
--mode=train \
--lr=0.015 \
--log_dir=./model_1p > ./train_${device_id}.log 2>&1
```
Run the program:
```
bash scripts/run_npu_1p.sh
```
- 8 NPUs
  - In `scripts/train_alexnet_8p.sh`, the Python command should look as follows.
```shell
python3.7 ${EXEC_DIR}/train.py --rank_size=8 \
--iterations_per_loop=100 \
--batch_size=128 \
--data_dir=/path/to/dataset \
--mode=train \
--lr=0.06 \
--log_dir=./model_8p > ./train_${device_id}.log 2>&1
```
Run the program:
```
bash scripts/run_npu_8p.sh
```
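
The two example configurations are consistent with the linear learning-rate scaling rule, assuming `--batch_size` is the per-NPU batch size as the option list states. The README does not say the values were derived this way, so the sketch below is an observation about the numbers rather than a description of the authors' method. Function names are hypothetical.

```python
def global_batch_size(per_device_batch, rank_size):
    """Total number of images per step across all devices."""
    return per_device_batch * rank_size


def scaled_lr(ref_lr, ref_global_batch, new_global_batch):
    """Linear scaling rule: LR grows in proportion to the global batch size."""
    return ref_lr * new_global_batch / ref_global_batch


batch_1p = global_batch_size(256, 1)   # 256
batch_8p = global_batch_size(128, 8)   # 1024

# Scaling the 1-NPU learning rate (0.015) to the 8-NPU global batch
# gives a value that matches the 8-NPU example (0.06) up to rounding.
print(scaled_lr(0.015, batch_1p, batch_8p))
```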
### 4. Test
- Same procedure as training, except for the following two modifications:
  - change `--mode=train` to `--mode=evaluate`
  - add `--checkpoint_dir=/path/to/checkpoints`
## Advanced
### Command line arguments
```
--data_dir                 train data dir
--num_classes              num of classes in ImageNet (default: 1000)
--image_size               image size of the dataset
--batch_size               mini-batch size per NPU (default: 128)
--pretrained               path of pretrained model
--lr                       initial learning rate
--max_epochs               max epoch num to train the model
--warmup_epochs            warmup epochs (when batch size is large)
--weight_decay             weight decay (default: 1e-4)
--momentum                 momentum (default: 0.9)
--label_smoothing          use label smoothing in CE (default: 0.1)
--save_summary_steps       logging interval (default: 100)
--log_dir                  path to save checkpoint and log
--log_name                 name of log file
--save_checkpoints_steps   the interval to save checkpoint
--mode                     mode to run the program (train, evaluate)
--checkpoint_dir           path to checkpoint for evaluation
--synthetic                whether to use synthetic data or not
--version                  weight initialization for model
--do_checkpoint            whether to save checkpoint or not
--rank_size                local rank of distributed (default: 0)
--group_size               world size of distributed (default: 1)
--max_train_steps          max number of training steps (default: None); when set, it overrides --max_epochs
```
For a complete list of options, refer to `train.py`.
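
As an illustration of how `--max_train_steps` can override `--max_epochs`, here is a small `argparse` sketch. It is not the parser from `train.py`: only four of the documented flags are reproduced, the defaults for `--rank_size` and `--max_epochs` are chosen for the example, rounding the steps per epoch up is an assumption, and the dataset size is that of the ILSVRC-2012 training set.

```python
import argparse
import math

IMAGENET_TRAIN_IMAGES = 1281167  # ILSVRC-2012 training set size


def build_parser():
    """Hypothetical subset of the documented options."""
    parser = argparse.ArgumentParser(description="Sketch of a few train.py options")
    parser.add_argument("--batch_size", type=int, default=128)      # per NPU
    parser.add_argument("--rank_size", type=int, default=1)         # illustrative default
    parser.add_argument("--max_epochs", type=int, default=150)      # illustrative default
    parser.add_argument("--max_train_steps", type=int, default=None)
    return parser


def total_train_steps(args, num_images=IMAGENET_TRAIN_IMAGES):
    """Use --max_train_steps when set; otherwise derive steps from --max_epochs."""
    if args.max_train_steps is not None:
        return args.max_train_steps
    steps_per_epoch = math.ceil(num_images / (args.batch_size * args.rank_size))
    return steps_per_epoch * args.max_epochs


if __name__ == "__main__":
    args = build_parser().parse_args(["--rank_size=8"])
    print(total_train_steps(args))
```

Under these assumptions, 8 NPUs with a per-NPU batch of 128 give 1252 steps per epoch.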
### Training process
All training results are stored in the `results` directory.
The script stores:
- checkpoints
- logs
## Performance
### Results
Our results were obtained by running the applicable training script. To achieve the same results, follow the steps in the Quick Start Guide.
#### Training accuracy results
| **epochs** | Top1/Top5 |
| :--------: | :-----------: |
| 150 | 60.12%/82.06% |
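
For readers unfamiliar with the metric, Top1/Top5 accuracy is the fraction of samples whose true label is among the 1 or 5 highest-scoring classes. The sketch below shows the computation on toy data; it is a generic definition, not the evaluation code used by this repository.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda i: sample_scores[i],
            reverse=True,
        )
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    toy_labels = [1, 1, 0]
    print(top_k_accuracy(toy_scores, toy_labels, k=1))
    print(top_k_accuracy(toy_scores, toy_labels, k=2))
```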
#### Training performance results
| **NPUs** | train performance |
| :------: | :---------------: |
| 8 | 30000+ img/s |
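
The throughput figure can be turned into a rough estimate of training time. The sketch assumes the ILSVRC-2012 training set (1,281,167 images) and a steady 30000 img/s; it ignores evaluation, checkpointing, and startup time, and since the table reports "30000+", the compute portion may be shorter in practice.

```python
IMAGENET_TRAIN_IMAGES = 1281167  # ILSVRC-2012 training set size (assumption)


def training_time_hours(images_per_second, epochs, num_images=IMAGENET_TRAIN_IMAGES):
    """Idealized wall-clock time in hours at a constant throughput."""
    return num_images * epochs / images_per_second / 3600.0


if __name__ == "__main__":
    # 150 epochs at 30000 img/s comes out a little under two hours.
    print(round(training_time_hours(30000, 150), 2))
```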
@@ -0,0 +1,9 @@
{
"server_count": "1",
"server_list": [{
"device": [{devices}],
"server_id": "127.0.0.1"
}],
"status": "completed",
"version": "1.0"
}
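
The `{devices}` token in the template above is a placeholder that has to be filled in before the file is valid JSON. A minimal way to do that in Python is sketched below. The entry fields used here (`device_id`, `rank_id`) are hypothetical; the real schema is defined by the NPU tooling and is not shown in this diff.

```python
import json

TEMPLATE = """{
"server_count": "1",
"server_list": [{
"device": [{devices}],
"server_id": "127.0.0.1"
}],
"status": "completed",
"version": "1.0"
}"""


def render_rank_table(template, device_entries):
    """Replace the {devices} placeholder with a comma-separated list of entries.

    str.format() is not usable here because the template contains
    literal JSON braces, so a plain string replacement is used instead.
    """
    devices = ", ".join(json.dumps(entry) for entry in device_entries)
    return template.replace("{devices}", devices)


# Hypothetical entries for two devices; field names are illustrative only.
entries = [{"device_id": str(i), "rank_id": str(i)} for i in range(2)]
table = json.loads(render_rank_table(TEMPLATE, entries))
print(len(table["server_list"][0]["device"]))
```

Parsing the result with `json.loads` is a cheap check that the rendered file is well-formed.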
@@ -0,0 +1,36 @@
#!/bin/bash
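# Clear host logs left over from previous runs
rm -rf /var/log/npu/slog/host-0/*
# Main environment: use the NNAE install if present, otherwise fall back to ascend-toolkit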
if [ -d /usr/local/Ascend/nnae/latest ];then
export LD_LIBRARY_PATH=/usr/local/:/usr/local/lib/:/usr/lib/:/usr/local/Ascend/nnae/latest/fwkacllib/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/:/usr/local/Ascend/driver/tools/hccn_tool/:/usr/local/mpirun4.0/lib
export PYTHONPATH=$PYTHONPATH:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages:/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages
export PATH=$PATH:/usr/local/Ascend/nnae/latest/fwkacllib/ccec_compiler/bin:/usr/local/mpirun4.0/bin
export ASCEND_OPP_PATH=/usr/local/Ascend/nnae/latest/opp
else
export LD_LIBRARY_PATH=/usr/local/lib/:/usr/lib/:/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/:/usr/local/mpirun4.0/lib
export PYTHONPATH=$PYTHONPATH:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe:/usr/local/Ascend/ascend-toolkit/latest//fwkacllib/python/site-packages/:/usr/local/Ascend/ascend-toolkit/latest/tfplugin/python/site-packages:$projectDir
export PATH=$PATH:/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/ccec_compiler/bin:/usr/local/mpirun4.0/bin
export ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/
fi
export DDK_VERSION_FLAG=1.60.T17.B830
export HCCL_CONNECT_TIMEOUT=600
export JOB_ID=9999001
export NEW_GE_FE_ID=1
export GE_AICPU_FLAG=1
export SOC_VERSION=Ascend910
export DUMP_GE_GRAPH=1
export DUMP_GRAPH_LEVEL=3
export PRINT_MODEL=1
export SLOG_PRINT_TO_STDOUT=1
export PROFILING_MODE=false
export PROFILING_OPTIONS=training_trace
export FP_POINT=ssd/block7-conv1x1/Relu
export BP_POINT=gradients/resnet34/Relu_grad/ReluGrad
export AICPU_PROFILING_MODE=false