Project author: Zhenye-Na

Project description:
✍️ Convolutional Recurrent Neural Network in Pytorch | Text Recognition
Primary language: Jupyter Notebook
Project URL: git://github.com/Zhenye-Na/crnn-pytorch.git
Created: 2019-04-27T03:11:48Z
Project community: https://github.com/Zhenye-Na/crnn-pytorch

License: MIT License



Text Recognizer Pro

Deploy a computer vision and natural language processing system into production.






High level architecture of the Full Stack Deep Learning project. Image Source: Pieter Abbeel, Sergey Karayev, Josh Tobin. Spring 2019 Full Stack Deep Learning Bootcamp




Project structure

Web backend

  1. api/ # Code for serving predictions as a REST API.
  2. tests/test_app.py # Test that predictions are working
  3. Dockerfile # Specifies Docker image that runs the web server.
  4. __init__.py
  5. app.py # Flask web server that serves predictions.
  6. serverless.yml # Specifies AWS Lambda deployment of the REST API.

Data (not under version control - one level up in the hierarchy)

  1. data/ # Training data lives here
  2. raw/
  3. emnist/metadata.toml # Specifications for downloading data

Experimentation

  1. evaluation/ # Scripts for evaluating model on eval set.
  2. evaluate_character_predictor.py
  3. evaluate_line_predictor.py
  4. notebooks/ # For snapshots of initial exploration, before solidifying code as proper Python files.
  5. 01-look-at-emnist.ipynb
  6. 02-look-at-emnist-lines.ipynb
  7. 03-look-at-iam-lines.ipynb
  8. 04-look-at-iam-paragraphs.ipynb
  9. 04b-look-at-line-detector-predictions.ipynb
  10. 05-look-at-fsdl-handwriting.ipynb

Convenience scripts

  1. tasks/
  2. # Deployment
  3. build_api_docker.sh
  4. deploy_api_to_lambda.sh
  5. # Code quality
  6. lint.sh
  7. # Tests
  8. run_prediction_tests.sh
  9. run_validation_tests.sh
  10. test_api.sh
  11. # Training
  12. train_character_predictor.sh
  13. train_cnn_line_predictor.sh
  14. train_line_detector.sh
  15. train_lstm_line_predictor.sh
  16. train_lstm_line_predictor_on_iam.sh
  17. # Update metadata
  18. update_fsdl_paragraphs_metadata.sh

Main model

  1. text_recognizer/ # Package that can be deployed as a self-contained prediction system
  2. __init__.py
  3. character_predictor.py # Takes a raw image and obtains a prediction
  4. line_predictor.py
  5. paragraph_text_recognizer.py
  6. datasets/ # Code for loading datasets
  7. __init__.py
  8. dataset.py # Base class for datasets - logic for downloading data
  9. dataset_sequence.py
  10. emnist_dataset.py
  11. emnist_essentials.json
  12. emnist_lines_dataset.py
  13. fsdl_handwriting_dataset.py
  14. iam_dataset.py
  15. iam_lines_dataset.py
  16. iam_paragraphs_dataset.py
  17. sentence_generator.py
  18. models/ # Code for instantiating models, including data preprocessing and loss functions
  19. __init__.py
  20. base.py # Base class for models
  21. character_model.py
  22. line_detector_model.py
  23. line_model.py
  24. line_model_ctc.py
  25. networks/ # Code for building neural networks (i.e., 'dumb' input->output mappings) used by models
  26. __init__.py
  27. ctc.py
  28. fcn.py
  29. lenet.py
  30. line_cnn_all_conv.py
  31. line_lstm_ctc.py
  32. misc.py
  33. mlp.py
  34. tests/
  35. support/ # Raw data used by tests
  36. create_emnist_lines_support_files.py
  37. create_emnist_support_files.py
  38. create_iam_lines_support_files.py
  39. emnist/ # emnist sample images
  40. emnist_lines/ # emnist line sample images
  41. iam_lines/ # iam line sample images
  42. iam_paragraphs/ # iam paragraphs sample images
  43. test_character_predictor.py # Test model on a few key examples
  44. test_line_predictor.py
  45. test_paragraph_text_recognizer.py
  46. weights/ # Weights for production model
  47. CharacterModel_EmnistDataset_mlp_weights.h5
  48. LineDetectorModel_IamParagraphsDataset_fcn_weights.h5
  49. LineModelCtc_EmnistLinesDataset_line_lstm_ctc_weights.h5
  50. LineModelCtc_IamLinesDataset_line_lstm_ctc_weights.h5
  51. util.py

Training

  1. training/ # Code for running training experiments and selecting the best model.
  2. gpu_manager.py
  3. gpu_util_sampler.py
  4. prepare_experiments.py
  5. run_experiment.py # Parse experiment config and launch training.
  6. update_metadata.py
  7. util.py # Logic for training a model with a given config

Set up

Run

  1. pipenv install --dev

From now on, precede commands with pipenv run to make sure they use the correct
environment.

Run

  1. pipenv sync --dev

to make sure your package versions are correct.

Dataset

EMNIST dataset

The EMNIST dataset is a set of handwritten letters and digits derived from the NIST Special Database 19 and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset. Further information on the dataset contents and conversion process can be found in the paper available at https://arxiv.org/abs/1702.05373v1.



EMNIST dataset.






EMNIST Lines dataset

This is a synthetic dataset built for this project. It is generated from sample sentences drawn from the Brown corpus: for each character in a sentence, we sample a random EMNIST image of that character and place it on a line (with some random overlap between neighbouring characters).
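
The placement step can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name compose_line, the fixed overlap parameter (the text above describes random overlap), and the nested-list image representation are all assumptions made for the example.

```python
def compose_line(char_images, overlap=0.25):
    """Paste equally sized character images left to right with horizontal overlap.

    char_images: list of 2-D lists (rows of pixel intensities), all the same shape.
    overlap: fraction of a character's width shared with its neighbour
             (fixed here for clarity; it could be sampled per line instead).
    """
    height = len(char_images[0])
    width = len(char_images[0][0])
    step = width - int(width * overlap)  # horizontal advance per character
    line_width = step * (len(char_images) - 1) + width
    canvas = [[0] * line_width for _ in range(height)]
    for i, img in enumerate(char_images):
        x0 = i * step
        for r in range(height):
            for c in range(width):
                # Keep the brighter pixel where neighbouring characters overlap.
                canvas[r][x0 + c] = max(canvas[r][x0 + c], img[r][c])
    return canvas


# Two hypothetical 2x4 "characters" pasted with 50% overlap.
a = [[1, 1, 1, 1], [1, 1, 1, 1]]
b = [[2, 2, 2, 2], [2, 2, 2, 2]]
line = compose_line([a, b], overlap=0.5)
```

Taking the pixelwise maximum is one simple way to merge overlapping strokes; other blending rules would work as well.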



EMNIST Lines dataset.



IAM Lines dataset

The IAM Handwriting Database contains forms of handwritten English text which can be used to train and test handwritten text recognizers and to perform writer identification and verification experiments. The database contains forms of unconstrained handwritten text, which were scanned at a resolution of 300dpi and saved as PNG images with 256 gray levels. The figure below provides samples of a complete form, a text line and some extracted words.



IAM Lines dataset.



Usage

EMNIST dataset

Networks and training code

  • Look at text_recognizer/networks/mlp.py
  • Look at text_recognizer/networks/lenet.py
  • Look at text_recognizer/models/base.py
  • Look at text_recognizer/models/character_model.py
  • Look at training/util.py

Train MLP and CNN

You can run the shortcut command tasks/train_character_predictor.sh, which runs the following:

  1. pipenv run python training/run_experiment.py --save '{"dataset": "EmnistDataset", "model": "CharacterModel", "network": "mlp", "train_args": {"batch_size": 256}}'

It will take a couple of minutes to train your model.

Just for fun, you could also try a larger MLP, with a smaller batch size:

  1. pipenv run python training/run_experiment.py '{"dataset": "EmnistDataset", "model": "CharacterModel", "network": "mlp", "network_args": {"num_layers": 8}, "train_args": {"batch_size": 128}}'

Let’s also train a CNN on the same task.

  1. pipenv run python training/run_experiment.py '{"dataset": "EmnistDataset", "model": "CharacterModel", "network": "lenet", "train_args": {"epochs": 1}}'

Training the single epoch will take about 2 minutes (that’s why we only do one epoch in this lab :)). Leave it running while we go on to the next part.

It is very useful to be able to subsample the dataset for quick experiments. This is possible by passing subsample_fraction=0.1 (or some other fraction) at dataset initialization, or in dataset_args in the run_experiment.py dictionary, for example:

  1. pipenv run python training/run_experiment.py '{"dataset": "EmnistDataset", "dataset_args": {"subsample_fraction": 0.1}, "model": "CharacterModel", "network": "mlp"}'
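
Because the experiment config is passed as a single JSON string, it can be easier to build it in Python than to type it by hand. The sketch below is only an illustration: the helper name experiment_command is hypothetical, and the keys are the ones used in the commands above.

```python
import json
import shlex


def experiment_command(config, save=False):
    """Build the run_experiment.py command line for a config dictionary."""
    parts = ["pipenv", "run", "python", "training/run_experiment.py"]
    if save:
        parts.append("--save")
    parts.append(json.dumps(config))
    # shlex.quote wraps the JSON in single quotes so the shell passes it as one argument.
    return " ".join(shlex.quote(p) for p in parts)


config = {
    "dataset": "EmnistDataset",
    "dataset_args": {"subsample_fraction": 0.1},
    "model": "CharacterModel",
    "network": "mlp",
}
command = experiment_command(config)
print(command)
```

Generating the string this way avoids quoting mistakes when sweeping over several values of subsample_fraction or batch_size.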

Testing

First, let’s take a look at how the test works at

  1. text_recognizer/tests/test_character_predictor.py

Now let’s see if it works by running:

  1. pipenv run pytest -s text_recognizer/tests/test_character_predictor.py

Or, use the shorthand tasks/run_prediction_tests.sh

Testing should finish quickly.

EMNIST Lines dataset

Train LSTM model with CTC loss

Let’s train an LSTM model with CTC loss.

  1. pipenv run python training/run_experiment.py --save '{"train_args": {"epochs": 16}, "dataset": "EmnistLinesDataset", "model": "LineModelCtc", "network": "line_lstm_ctc"}'

or the shortcut tasks/train_lstm_line_predictor.sh
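
At prediction time, a network trained with CTC loss emits one label per time step, including a special blank label, and the output string is recovered by merging repeated labels and then dropping blanks. The sketch below shows this greedy (best-path) decoding rule in plain Python. It illustrates the general technique and is not the code in text_recognizer/networks/ctc.py; the blank symbol "_" and the function name are assumptions.

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse a per-time-step label sequence into an output string.

    Rule: merge consecutive repeats first, then remove blanks, so a blank
    between two identical labels keeps them as two separate characters.
    """
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


decoded_text = ctc_greedy_decode("hh_e_ll_ll_oo")
print(decoded_text)
```

The blank between the two runs of "l" is what allows a doubled letter to survive the merge step.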

IAM Lines Dataset

Training

Let’s train with the default params by running tasks/train_lstm_line_predictor_on_iam.sh, which runs the following command:

  1. pipenv run python training/run_experiment.py --save '{"dataset": "IamLinesDataset", "model": "LineModelCtc", "network": "line_lstm_ctc"}'

This uses our LSTM with CTC model. Training for 8 epochs reaches 40% accuracy and takes about 10 minutes.

Training longer keeps improving the result: the same settings reach 60% accuracy in 40 epochs.

Line Detection

Our approach will be to train a model that, when given an image containing lines of text, returns a pixelwise labeling of that image, with each pixel belonging to either background, odd line of handwriting, or even line of handwriting. Given the output of the model, we can find line regions with an easy image processing operation.
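
One simple way to turn such a labeling into line regions is to scan the rows of the predicted label map and group consecutive rows that share the same dominant line class. The sketch below assumes a label encoding of 0 = background, 1 = odd line, 2 = even line and uses a hypothetical helper name; the project's actual post-processing may differ.

```python
from collections import Counter

BACKGROUND, ODD_LINE, EVEN_LINE = 0, 1, 2  # assumed label encoding


def find_line_rows(label_map):
    """Return (first_row, last_row, label) for each run of rows on the same line.

    label_map: 2-D list of per-pixel class labels.
    A row is assigned its most common non-background label, or background
    if it contains no line pixels.
    """
    regions = []
    current_label, start = BACKGROUND, None
    for y, row in enumerate(label_map):
        counts = Counter(v for v in row if v != BACKGROUND)
        row_label = counts.most_common(1)[0][0] if counts else BACKGROUND
        if row_label != current_label:
            if current_label != BACKGROUND:
                regions.append((start, y - 1, current_label))
            current_label, start = row_label, y
    if current_label != BACKGROUND:
        regions.append((start, len(label_map) - 1, current_label))
    return regions


predicted = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [2, 2, 2, 0],
    [0, 2, 2, 0],
    [0, 0, 0, 0],
]
regions = find_line_rows(predicted)
print(regions)
```

In this toy example the two lines touch (rows 2 and 3 are adjacent), and it is the alternating odd/even labels that let them be separated.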

We are starting from the IAM dataset, which includes not only lines but the original writing sample forms, with each line and word region annotated.

Let’s load the IAM dataset by running pipenv run python text_recognizer/datasets/iam_dataset.py. Then let’s look at the raw data files, which are in ./data/raw/iam/iamdb/forms.

We want to crop out the region of each page corresponding to the handwritten paragraph as our model input, and generate corresponding ground truth.

Code to do this is in text_recognizer/datasets/iam_paragraphs_dataset.py

We can look at the results in notebooks/04-look-at-iam-paragraphs.ipynb and by looking at some debug images we output in data/interim/iam_paragraphs.

Training data augmentation

The model code for our new LineDetector is in text_recognizer/models/line_detector_model.py.

Because we only have about a thousand images to learn this task on, data augmentation will be crucial. Image augmentations such as stretching, slight rotations, offsets, contrast and brightness changes, and potentially even mirror-flipping are tedious to code, and most frameworks provide optimized utility code for the task.

We use Keras’s ImageDataGenerator, and you can see the parameters for it in text_recognizer/models/line_detector_model.py. We can take a look at what the data transformations look like in the same notebook.

Network description

The network used in this model is text_recognizer/networks/fcn.py.

The basic idea is a deep convolutional network with resnet-style blocks (input to block is concatenated to block output). We call it FCN, as in “Fully Convolutional Network,” after the seminal paper that first used convnets for segmentation.

Unlike the original FCN, however, we do not maxpool or upsample, but instead rely on dilated convolutions to rapidly increase the effective receptive field. A receptive field calculator is very handy for checking the effective receptive field size of a convnet.

The crucial thing to understand is that because we are labeling odd and even lines differently, each predicted pixel must have the context of the entire image to correctly label — otherwise, there is no way to know whether the pixel is on an odd or even line.
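
To see why dilation helps here, the receptive field of a stack of stride-1 convolutions can be computed directly: each layer widens it by (kernel_size - 1) * dilation. The sketch below applies that standard formula to two hypothetical six-layer stacks; the layer configurations are made up for illustration and are not the ones in text_recognizer/networks/fcn.py.

```python
def receptive_field(layers):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions.

    layers: list of (kernel_size, dilation) pairs, ordered from input to output.
    Each layer widens the field by (kernel_size - 1) * dilation.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Hypothetical stacks of six 3x3 convolutions.
plain = [(3, 1)] * 6
dilated = [(3, 2 ** i) for i in range(6)]  # dilations 1, 2, 4, 8, 16, 32

print(receptive_field(plain))    # 13
print(receptive_field(dilated))  # 127
```

With the same number of layers and parameters, doubling the dilation at each layer grows the field exponentially rather than linearly, which is how each output pixel can see enough of the image without pooling.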

Review results

The model converges to something really good.

Check out notebooks/04b-look-at-line-detector-predictions.ipynb to see sample predictions on the test set.

We also plot some sample training data augmentation in that notebook.

Training result



IAM Lines dataset.



Testing result



IAM Lines dataset.



Combining the two models

Now we are ready to combine the new LineDetector model and the LinePredictor model that we trained earlier.

This is done in text_recognizer/paragraph_text_recognizer.py, which loads both models, finds line regions with one, and runs each crop through the other.

We can see that it works as expected (albeit not too accurately yet) by running pipenv run pytest -s text_recognizer/tests/test_paragraph_text_recognizer.py.

Data Labeling and Versioning

Export data and update metadata file

You may have noticed the metadata.toml files in all of our data/raw directories. They contain the remote source of the data, the filename it should have when downloaded, and a SHA-256 hash of the downloaded file.

The idea is that the data file has all the information needed for our dataset. In our case, it has image URLs and all the annotations we made. From this, we can download the images, and transform the annotation data into something usable by our training scripts. The hash, combined with the state of the codebase (tracked by git), then uniquely identifies the data we’re going to use to train.

We replace the current fsdl_handwriting.json with the one we just exported, and now need to update the metadata file, since the hash is different. The SHA-256 hash of any file can be computed by running shasum -a 256 <filename>. We can also update metadata.toml with a convenient script that replaces the SHA-256 of the current file with the SHA-256 of the new file. There is a convenience task script defined: tasks/update_fsdl_paragraphs_metadata.sh.
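
The same hash can be computed from Python with the standard hashlib module, which is useful inside a script that updates the metadata file. This is a minimal sketch; the function name compute_sha256 is an assumption for the example and may not match the project's own utility code.

```python
import hashlib
import os
import tempfile


def compute_sha256(filename, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; for real data, pass the path of the exported file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_hash = compute_sha256(tmp.name)
os.remove(tmp.name)
print(demo_hash)
```

Reading in chunks keeps memory use constant, which matters for large data archives.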

The data file itself is checked into version control, but tracked with git-lfs, as it can get heavyweight and can change frequently as we keep adding and annotating more data. Git-lfs actually does something very similar to what we more manually do with metadata.toml. The reason we also use the latter is for standardization across other types of datasets, which may not have a file we want to check into even git-lfs — for example, EMNIST and IAM, which are too large as they include the images.

Download images

The class FsdlHandwritingDataset in text_recognizer/datasets/fsdl_handwriting_dataset.py must be able to load the data in the exported format and present it to consumers in a format they expect (e.g. dataset.line_regions_by_id).

Since this data export does not come with images, but only pointers to remote locations of the images, the class must also be responsible for downloading the images.

When downloading many images, it is very useful to do so in parallel. We use the concurrent.futures.ThreadPoolExecutor class, and use the tqdm package to provide a nice progress bar.
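
The pattern looks roughly like the sketch below. It is an illustration under stated assumptions rather than the project's code: the function name download_all, the url-to-path dictionary, and the injectable fetch argument are hypothetical, and a no-op stand-in is used if tqdm is not installed.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

try:
    from tqdm import tqdm  # progress bar package mentioned above
except ImportError:  # fall back to a plain iterator if tqdm is not installed
    def tqdm(iterable, **kwargs):
        return iterable


def download_all(url_to_path, fetch=urlretrieve, max_workers=8):
    """Download every URL to its target path using a pool of threads.

    url_to_path: dict mapping remote URL -> local filename.
    fetch: callable(url, path); urlretrieve by default, replaceable for testing.
    """
    urls = list(url_to_path)
    paths = [url_to_path[u] for u in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in input order; wrapping the iterator
        # in tqdm advances the progress bar as each result is consumed.
        results = list(tqdm(executor.map(fetch, urls, paths), total=len(urls)))
    return results


# Dry run with a stand-in fetch function so no network access is needed.
def fake_fetch(url, path):
    return (url, path)


done = download_all({"http://example.com/a.jpg": "a.jpg"}, fetch=fake_fetch, max_workers=2)
```

Threads suit this job because downloading is I/O-bound: the workers spend most of their time waiting on the network rather than competing for the CPU.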

Looking at the data

We can confirm that we loaded the data correctly by looking at line crops and their corresponding strings.

Take a look at notebooks/05-look-at-fsdl-handwriting.ipynb.




Training on the new dataset

We’re not going to have time to train on the new dataset, but that is something that is now possible. As an exercise, you could write FsdlHandwritingLinesDataset and FsdlHandwritingParagraphsDataset, and be able to train a model on a combination of IAM and FSDL Handwriting data on both the line detection and line text prediction tasks.

Web Deployment

Serving predictions from a web server

First, we will get a Flask web server up and running and serving predictions.

  1. pipenv run python api/app.py

Open up another terminal tab (click on the ‘+’ button under ‘File’ to open the launcher). In this terminal, we’ll send a test image to the web server we’re running in the first terminal.

Make sure to cd into the lab9 directory in this new terminal.

  1. export API_URL=http://0.0.0.0:8000
  2. curl -X POST "${API_URL}/v1/predict" -H 'Content-Type: application/json' --data '{ "image": "data:image/png;base64,'$(base64 -w0 -i text_recognizer/tests/support/emnist_lines/or\ if\ used\ the\ results.png)'" }'

If you want to look at the image you just sent, you can navigate to lab9/text_recognizer/tests/support/emnist_lines in the file browser on the left, and open the image.

We can also send a request specifying a URL to an image:

  1. curl "${API_URL}/v1/predict?image_url=http://s3-us-west-2.amazonaws.com/fsdl-public-assets/emnist_lines/or%2Bif%2Bused%2Bthe%2Bresults.png"
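
The POST body used in the first curl command is a JSON object whose image field holds the PNG as a base64 data URI. The sketch below builds that body in Python with the standard library only; the helper name build_predict_payload is hypothetical, and actually sending the request (for example with urllib.request) is left out.

```python
import base64
import json


def build_predict_payload(image_bytes, mime_type="image/png"):
    """Return the JSON body for POST /v1/predict: the image as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": f"data:{mime_type};base64,{encoded}"})


# Stand-in bytes; in practice read the PNG file with open(path, "rb").read().
image_bytes = b"\x89PNG"
payload = build_predict_payload(image_bytes)
print(payload)
```

The resulting string can be sent with the same Content-Type: application/json header as in the curl example.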

Adding web server tests

The web server code should have a unit test just like the rest of our code.

Let’s check it out: the tests are in api/tests/test_app.py. You can run them with

  1. tasks/test_api.sh

Running web server in Docker

Now, we’ll build a docker image with our application. The Dockerfile in api/Dockerfile defines how we’re building the docker image.

Still in the lab9 directory, run:

  1. tasks/build_api_docker.sh

This should take a couple of minutes to complete.

When it’s finished, you can run the server with

  1. docker run -p 8000:8000 --name api -it --rm text_recognizer_api

You can run the same curl commands as you did when you ran the flask server earlier, and see that you’re getting the same results.

  1. curl -X POST "${API_URL}/v1/predict" -H 'Content-Type: application/json' --data '{ "image": "data:image/png;base64,'$(base64 -w0 -i text_recognizer/tests/support/emnist_lines/or\ if\ used\ the\ results.png)'" }'
  2. curl "${API_URL}/v1/predict?image_url=http://s3-us-west-2.amazonaws.com/fsdl-public-assets/emnist_lines/or%2Bif%2Bused%2Bthe%2Bresults.png"

If needed, you can connect to your running docker container by running:

  1. docker exec -it api bash

Different settings to try

  • Change sliding window width/stride
  • Not using a sliding window: instead of sliding a LeNet over, you could just run the input through a few conv/pool layers, squeeze out the last (channel) dimension (which should be 0), and input the result into the LSTM. You can play around with the parameters there.
  • Change number of LSTM dimensions
  • Wrap the LSTM in a Bidirectional() wrapper, which will have two LSTMs read the input forward and backward and concatenate the outputs
  • Stack a few layers of LSTMs
  • Try to get an all-conv approach to work for faster training
  • Add BatchNormalization
  • Play around with learning rate. In order to launch experiments with different learning rates, you will have to implement something in training/run_experiment.py and text_recognizer/models/base.py
  • Train on EmnistLines and fine-tune on IamLines. In order to do that, you might want to implement a model wrapper class that can take multiple datasets.
  • Try adding more data augmentations, or mess with the parameters of the existing ones
  • Try the U-Net architecture, which max-pools down and then upsamples back up, with increased conv layer channel dimensions in the middle (https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/).

References

[1] Pieter Abbeel, Sergey Karayev, Josh Tobin. “Spring 2019 Full Stack Deep Learning Bootcamp”.
[2] Alex Graves, Santiago Fernandez, Faustino Gomez, Jurgen Schmidhuber. “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks”.
[3] Batuhan Balci, Dan Saadati, Dan Shiferaw. “Handwritten Text Recognition using Deep Learning”.