[add] Upload training benchmark by z00560161
+197
@@ -0,0 +1,197 @@
# AlexNet for TensorFlow

This repository provides a script and recipe to train the AlexNet model.

## Table Of Contents

* [Model overview](#model-overview)
  * [Model architecture](#model-architecture)
  * [Default configuration](#default-configuration)
    * [Data augmentation](#data-augmentation)
* [Setup](#setup)
  * [Requirements](#requirements)
* [Quick start guide](#quick-start-guide)
* [Advanced](#advanced)
  * [Command line arguments](#command-line-arguments)
  * [Training process](#training-process)
* [Performance](#performance)
  * [Results](#results)
    * [Training accuracy results](#training-accuracy-results)
    * [Training performance results](#training-performance-results)

## Model overview

The AlexNet model is described in
`Alex Krizhevsky. "One weird trick for parallelizing convolutional neural networks". <https://arxiv.org/abs/1404.5997>.`

Reference implementation: <https://pytorch.org/docs/stable/_modules/torchvision/models/alexnet.html#alexnet>

### Model architecture

### Default configuration

The following sections introduce the default configuration and hyperparameters for the AlexNet model.

#### Optimizer

This model uses the Momentum optimizer from TensorFlow with the following hyperparameters:

- Momentum: 0.9
- Learning rate (LR): 0.06
- LR schedule: cosine_annealing
- Batch size: 128 (per NPU)
- Weight decay: 0.0001
- Label smoothing: 0.1
- We train for:
  - 150 epochs -> 60.1% top1 accuracy

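To make the `cosine_annealing` schedule listed above concrete, here is a minimal sketch. It assumes the standard cosine decay from the base learning rate down to a minimum value, with no warmup phase; the function name `cosine_annealing_lr` and the `min_lr` parameter are illustrative and not taken from this repository, so the schedule in `train.py` may differ in detail.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=0.06, min_lr=0.0):
    """Assumed standard cosine decay from base_lr to min_lr over total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, round(cosine_annealing_lr(step, total_steps=1000), 5))
```

Under this assumed form the rate starts at the base value, passes through half of it at the midpoint of training, and reaches `min_lr` at the final step.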
#### Data augmentation

This model uses the following data augmentation:

- For training:
  - RandomResizeCrop, scale=(0.08, 1.0), ratio=(0.75, 1.333)
  - RandomHorizontalFlip, prob=0.5
  - Normalize, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
- For inference:
  - Resize to (256, 256)
  - CenterCrop to (224, 224)
  - Normalize, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)

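The training-time crop and the normalization above can be illustrated with a small, dependency-free sketch. It assumes the crop box is sampled the way torchvision-style `RandomResizedCrop` implementations commonly do it (area fraction drawn from `scale`, aspect ratio drawn log-uniformly from `ratio`, up to 10 attempts, then a centered fallback), and that the mean and std apply to pixel values scaled to [0, 1]. Neither assumption is confirmed by this README, and the function names are hypothetical.

```python
import math
import random


def sample_random_resized_crop(width, height, scale=(0.08, 1.0),
                               ratio=(0.75, 1.333), rng=random, attempts=10):
    """Return (left, top, crop_w, crop_h) for one random crop (assumed algorithm)."""
    area = width * height
    log_lo, log_hi = math.log(ratio[0]), math.log(ratio[1])
    for _ in range(attempts):
        target_area = area * rng.uniform(scale[0], scale[1])
        aspect = math.exp(rng.uniform(log_lo, log_hi))  # log-uniform aspect ratio
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Fallback: largest centered crop whose aspect ratio lies inside `ratio`.
    in_ratio = width / height
    if in_ratio < ratio[0]:
        crop_w, crop_h = width, int(round(width / ratio[0]))
    elif in_ratio > ratio[1]:
        crop_w, crop_h = int(round(height * ratio[1])), height
    else:
        crop_w, crop_h = width, height
    return (width - crop_w) // 2, (height - crop_h) // 2, crop_w, crop_h


def normalize_pixel(rgb, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Normalize one RGB pixel given as 0-255 values (assumes [0, 1] scaling first)."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


if __name__ == "__main__":
    print(sample_random_resized_crop(500, 375, rng=random.Random(0)))
    print(normalize_pixel((128, 128, 128)))
```

The sampled box would then be resized to the network input size; the resize itself is left out because it needs an image library.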
## Setup

The following section lists the requirements to start training the AlexNet model.

### Requirements

- TensorFlow
- NPU environment

## Quick Start Guide

### 1. Clone the repository

```shell
git clone xxx
cd Model_zoo_Alexnet_HARD
```

### 2. Download and preprocess the dataset

1. Download the ImageNet dataset.
2. Extract the training data.
3. The train and val images are under the train/ and val/ directories, respectively. All images within one folder have the same label.

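Before starting a long run it can help to confirm that the extracted data matches the layout described above. The following is a minimal sketch of such a check, not part of this repository: the function names are hypothetical, and it assumes that `train/` and `val/` each contain one subfolder per class with the same folder names in both splits.

```python
import os


def count_images_per_class(data_dir, split):
    """Return {class_folder: file_count} for data_dir/split."""
    split_dir = os.path.join(data_dir, split)
    if not os.path.isdir(split_dir):
        raise FileNotFoundError(f"missing directory: {split_dir}")
    counts = {}
    for entry in sorted(os.scandir(split_dir), key=lambda e: e.name):
        if entry.is_dir():
            counts[entry.name] = sum(1 for f in os.scandir(entry.path) if f.is_file())
    return counts


def check_dataset(data_dir):
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    problems = []
    train = count_images_per_class(data_dir, "train")
    val = count_images_per_class(data_dir, "val")
    if not train:
        problems.append("train/ contains no class folders")
    mismatched = sorted(set(train) ^ set(val))
    if mismatched:
        problems.append(f"class folders not present in both splits: {mismatched}")
    empty = [f"train/{k}" for k, n in train.items() if n == 0]
    empty += [f"val/{k}" for k, n in val.items() if n == 0]
    if empty:
        problems.append(f"empty class folders: {empty}")
    return problems


if __name__ == "__main__":
    # /path/to/dataset is the same placeholder used for --data_dir below.
    for problem in check_dataset("/path/to/dataset"):
        print("WARNING:", problem)
```

If the training script expects a different on-disk format (for example pre-converted records), this check does not apply as written.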
### 3. Train

- Train on a single NPU
  - **Edit** *scripts/train_alexnet_1p.sh* (see the example below)
  - `bash scripts/run_npu_1p.sh`
- Train on 8 NPUs
  - **Edit** *scripts/train_alexnet_8p.sh* (see the example below)
  - `bash scripts/run_npu_8p.sh`

For example:

- Case for a single NPU
  - In scripts/train_alexnet_1p.sh, the Python command should look as follows. For more details on the command line arguments, refer to [Command line arguments](#command-line-arguments).

```shell
python3.7 ${EXEC_DIR}/train.py --rank_size=1 \
    --iterations_per_loop=100 \
    --batch_size=256 \
    --data_dir=/path/to/dataset \
    --mode=train \
    --lr=0.015 \
    --log_dir=./model_1p > ./train_${device_id}.log 2>&1
```

Run the program:

```
bash scripts/run_npu_1p.sh
```

- Case for 8 NPUs
  - In scripts/train_alexnet_8p.sh, the Python command should look as follows.

```shell
python3.7 ${EXEC_DIR}/train.py --rank_size=8 \
    --iterations_per_loop=100 \
    --batch_size=128 \
    --data_dir=/path/to/dataset \
    --mode=train \
    --lr=0.06 \
    --log_dir=./model_8p > ./train_${device_id}.log 2>&1
```

Run the program:

```
bash scripts/run_npu_8p.sh
```

### 4. Test

- Follow the same procedure as for training, with the following two modifications:
  - change `--mode=train` to `--mode=evaluate`
  - add `--checkpoint_dir=/path/to/checkpoints`

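The two modifications can be expressed as a small transformation of the training argument list. This is only an illustrative sketch with a hypothetical helper name (`to_evaluate_args`); in practice the edit is made by hand in the `scripts/train_alexnet_*.sh` file.

```python
import shlex


def to_evaluate_args(train_args, checkpoint_dir):
    """Turn training arguments into evaluation arguments as described above."""
    eval_args = []
    for arg in train_args:
        if arg.startswith("--mode="):
            eval_args.append("--mode=evaluate")      # modification 1: switch the mode
        elif not arg.startswith("--checkpoint_dir="):
            eval_args.append(arg)                    # keep every other argument as-is
    if "--mode=evaluate" not in eval_args:
        eval_args.append("--mode=evaluate")
    eval_args.append(f"--checkpoint_dir={checkpoint_dir}")  # modification 2
    return eval_args


if __name__ == "__main__":
    train_args = ["--rank_size=1", "--batch_size=256",
                  "--data_dir=/path/to/dataset", "--mode=train"]
    eval_args = to_evaluate_args(train_args, "/path/to/checkpoints")
    print(" ".join(shlex.quote(a) for a in eval_args))
```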
## Advanced

### Command line arguments

```
--data_dir                  train data dir
--num_classes               num of classes in ImageNet (default: 1000)
--image_size                image size of the dataset
--batch_size                mini-batch size per NPU (default: 128)
--pretrained                path of pretrained model
--lr                        initial learning rate
--max_epochs                max number of epochs to train the model
--warmup_epochs             number of warmup epochs (when the batch size is large)
--weight_decay              weight decay (default: 1e-4)
--momentum                  momentum (default: 0.9)
--label_smoothing           label smoothing used in the cross-entropy loss (default: 0.1)
--save_summary_steps        logging interval (default: 100)
--log_dir                   path to save checkpoints and logs
--log_name                  name of log file
--save_checkpoints_steps    interval at which to save checkpoints
--mode                      mode to run the program (train, evaluate)
--checkpoint_dir            path to checkpoint for evaluation
--max_train_steps           max number of training steps (default: None); when set, it overrides --max_epochs
--synthetic                 whether to use synthetic data or not
--version                   weight initialization for model
--do_checkpoint             whether to save checkpoint or not
--rank_size                 local rank of distributed (default: 0)
--group_size                world size of distributed (default: 1)
```

For a complete list of options, refer to `train.py`.

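The interaction between `--max_epochs`, `--batch_size` and `--max_train_steps` can be sketched as a step-count calculation. The sketch below rests on several assumptions that this README does not confirm: that the global batch is `batch_size * rank_size` (with `rank_size` being the number of NPUs, as in the Quick Start examples), that a partial final batch is dropped, and that the training set is the ILSVRC-2012 split with 1,281,167 images. The function name is hypothetical.

```python
IMAGENET_TRAIN_IMAGES = 1281167  # ILSVRC-2012 training set size


def total_train_steps(max_epochs, batch_size, rank_size,
                      num_images=IMAGENET_TRAIN_IMAGES, max_train_steps=None):
    """Number of optimizer steps; max_train_steps, when set, overrides max_epochs."""
    if max_train_steps is not None:
        return max_train_steps
    global_batch = batch_size * rank_size          # assumed: per-NPU batch times NPU count
    steps_per_epoch = num_images // global_batch   # assumed: partial batch dropped
    return max_epochs * steps_per_epoch


if __name__ == "__main__":
    # 8-NPU configuration from the Quick Start: 128 images per NPU, 150 epochs.
    print(total_train_steps(max_epochs=150, batch_size=128, rank_size=8))
    # An explicit step budget takes precedence over the epoch count.
    print(total_train_steps(max_epochs=150, batch_size=128, rank_size=8,
                            max_train_steps=1000))
```

A short run with a small `--max_train_steps` value is a convenient way to smoke-test the environment before committing to the full schedule.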
### Training process

All training results are stored in the `results` directory.
The script stores:

- checkpoints
- logs

## Performance

### Results

Our results were obtained by running the applicable training script. To achieve the same results, follow the steps in the Quick Start Guide.

#### Training accuracy results

| **epochs** |   Top1/Top5   |
| :--------: | :-----------: |
|    150     | 60.12%/82.06% |

#### Training performance results

| **NPUs** | train performance |
| :------: | :---------------: |
|    8     |   30000+ img/s    |

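As a rough reading of the throughput figure, the sketch below converts images per second into an estimated wall-clock time for the 150-epoch schedule. It is plain arithmetic rather than a measurement: it assumes the ILSVRC-2012 training set (1,281,167 images), a sustained rate of exactly 30000 img/s, and no time spent on evaluation, checkpointing or startup. The function name is hypothetical.

```python
def estimated_training_hours(epochs, images_per_epoch, images_per_second):
    """Wall-clock estimate in hours, ignoring evaluation and checkpoint overhead."""
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    return epochs * images_per_epoch / images_per_second / 3600.0


if __name__ == "__main__":
    hours = estimated_training_hours(epochs=150,
                                     images_per_epoch=1281167,
                                     images_per_second=30000)
    print(f"{hours:.2f} hours")
```

Under these assumptions the estimate comes to a little under two hours. Since the table reports "30000+", the pure training time would be no longer than this, while real runs add the overheads that the sketch ignores.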
+9
@@ -0,0 +1,9 @@
{
    "server_count": "1",
    "server_list": [{
        "device": [{devices}],
        "server_id": "127.0.0.1"
    }],
    "status": "completed",
    "version": "1.0"
}
+36
@@ -0,0 +1,36 @@
#!/bin/bash

rm -rf /var/log/npu/slog/host-0/*
# main env
if [ -d /usr/local/Ascend/nnae/latest ];then

    export LD_LIBRARY_PATH=/usr/local/:/usr/local/lib/:/usr/lib/:/usr/local/Ascend/nnae/latest/fwkacllib/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/:/usr/local/Ascend/driver/tools/hccn_tool/:/usr/local/mpirun4.0/lib
    export PYTHONPATH=$PYTHONPATH:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages:/usr/local/Ascend/nnae/latest/opp/op_impl/built-in/ai_core/tbe:/usr/local/Ascend/nnae/latest/fwkacllib/python/site-packages/:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages
    export PATH=$PATH:/usr/local/Ascend/nnae/latest/fwkacllib/ccec_compiler/bin:/usr/local/mpirun4.0/bin
    export ASCEND_OPP_PATH=/usr/local/Ascend/nnae/latest/opp
else
    export LD_LIBRARY_PATH=/usr/local/lib/:/usr/lib/:/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/Ascend/add-ons/:/usr/local/mpirun4.0/lib
    export PYTHONPATH=$PYTHONPATH:/usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe:/usr/local/Ascend/ascend-toolkit/latest//fwkacllib/python/site-packages/:/usr/local/Ascend/ascend-toolkit/latest/tfplugin/python/site-packages:$projectDir
    export PATH=$PATH:/usr/local/Ascend/ascend-toolkit/latest/fwkacllib/ccec_compiler/bin:/usr/local/mpirun4.0/bin
    export ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/

fi

export DDK_VERSION_FLAG=1.60.T17.B830
export HCCL_CONNECT_TIMEOUT=600
export JOB_ID=9999001

export NEW_GE_FE_ID=1
export GE_AICPU_FLAG=1
export SOC_VERSION=Ascend910
export DUMP_GE_GRAPH=1
export DUMP_GRAPH_LEVEL=3
export PRINT_MODEL=1
export SLOG_PRINT_TO_STDOUT=1


export PROFILING_MODE=false
export PROFILING_OPTIONS=training_trace
export FP_POINT=ssd/block7-conv1x1/Relu
export BP_POINT=gradients/resnet34/Relu_grad/ReluGrad
export AICPU_PROFILING_MODE=false