AlexNet model from Alex Krizhevsky. "One weird trick for parallelizing convolutional neural networks". <https://arxiv.org/abs/1404.5997>. reference implementation: https://pytorch.org/docs/stable/_modules/torchvision/models/alexnet.html#alexnet

Model architecture

Default configuration

The following sections introduce the default configurations and hyperparameters for AlexNet model.

Optimizer

This model uses Momentum optimizer from Tensorflow with the following hyperparameters:

Momentum : 0.9
Learning rate (LR) : 0.06
LR schedule: cosine_annealing
Batch size : 128
Weight decay : 0.0001.
Label smoothing = 0.1
We train for:
- 150 epochs -> 60.1% top1 accuracy

Data augmentation

This model uses the following data augmentation:

For training:
- RandomResizeCrop, scale=(0.08, 1.0), ratio=(0.75, 1.333)
- RandomHorizontalFlip, prob=0.5
- Normalize, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
For inference:
- Resize to (256, 256)
- CenterCrop to (224, 224)
- Normalize, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)

Setup

The following section lists the requirements to start training the Alexnet model.

Requirements

Tensorflow NPU environmemnt

Quick Start Guide

1. Clone the respository

git clone xxx
cd  Model_zoo_Alexnet_HARD

2. Download and preprocess the dataset

down load the imagenet dataset
Extract the training data
The train and val images are under the train/ and val/ directories, respectively. All images within one folder have the same label.

3. Train

train on single NPU
- edit scripts/train_alexnet_1p.sh( see example below)
- bash scripts/run_npu_1p.sh
train on 8 NPUs
- edit scripts/train_alexnet_8p.sh(see example below)
- bash scripts/run_npu_8p.sh

for example:

case for single NPU
- In scripts/train_alexnet_1p.sh , python scripts part should look like as follows. For more detailed command lines arguments, please refer to Command line arguments

python3.7 ${EXEC_DIR}/train.py --rank_size=1 \
	--iterations_per_loop=100 \
	--batch_size=256 \
	--data_dir=/path/to/dataset \
	--mode=train \
	--lr=0.015 \
	--log_dir=./model_1p > ./train_${device_id}.log 2>&1

run the program

bash scripts/run_npu_1p.sh

case for 8 NPUs
- In scripts/train_alexnet_8p.sh , python scripts part should look like as follows.

python3.7 ${EXEC_DIR}/train.py --rank_size=8 \
	--iterations_per_loop=100 \
	--batch_size=128 \
	--data_dir=/path/to/dataset \
	--mode=train \
	--lr=0.06 \
	--log_dir=./model_8p > ./train_${device_id}.log 2>&1

run the program

bash scripts/run_npu_1p.sh

4. Test

same procedure as training except 2 following modifications
- change --mode=train to --mode=evaluate
- add --checkpoint_dir=/path/to/checkpoints

Advanced

Commmand-line options

  --data_dir                        train data dir
  --num_classes                     num of classes in ImageNet（default:1000)
  --image_size                      image size of the dataset
  --batch_size                      mini-batch size (default: 128) per npu
  --pretrained                      path of pretrained model
  --lr                              initial learning rate
  --max_epochs                      max epoch num to train the model
  --warmup_epochs                   warmup epoch(when batchsize is large)
  --weight_decay                    weight decay (default: 1e-4)
  --momentum                        momentum(default: 0.9)
  --label_smoothing                 use label smooth in CE, default 0.1
  --save_summary_steps              logging interval(dafault:100)
  --log_dir                         path to save checkpoint and log
  --log_name                        name of log file
  --save_checkpoints_steps          the interval to save checkpoint
  --mode                            mode to run the program (train, evaluate)
  --checkpoint_dir                  path to checkpoint for evaluation
  --max_train_steps                 max number of training steps 
  --synthetic                       whether to use synthetic data or not
  --version                         weight initialization for model
  --do_checkpoint                   whether to save checkpoint or not 
  --rank_size                       local rank of distributed(default: 0)
  --group_size                      world size of distributed(default: 1)
  --max_train_steps                 number of training step , default : None, when set ,it will override the max_epoch

for a complete list of options, please refer to train.py

Training process

All the results of the training will be stored in the directory results. Script will store:

checkpoints.
log.

Performance

Result

Our result were obtained by running the applicable training script. To achieve the same results, follow the steps in the Quick Start Guide.

Training accuracy results

epochs	Top1/Top5
150	60.12%/82.06%

Training performance results

NPUs	train performance
8	30000+ img/s

README.md Unescape Escape

Alexnet for Tensorflow

Table Of Contents

Model overview