Project author: kzvdar42

Project description:
Russian to English machine translation using Transformer.
Primary language: Jupyter Notebook
Repository: git://github.com/kzvdar42/machine-translation.git
Created: 2020-03-06T20:56:40Z
Project community: https://github.com/kzvdar42/machine-translation

Machine Translation

Russian to English machine translation using a Transformer model.

Purpose

This project was done during the Practical Machine Learning and Deep Learning course in the Spring 2020 semester at Innopolis University (Innopolis, Russia).

How to run

First, install all requirements from requirements.txt.
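For example, using pip:

  pip install -r requirements.txt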

Data preparation

Then prepare your dataset for the model by running preparation.py:

  usage: preparation.py [-h] [--dataset_size [DATASET_SIZE]]
                        [--vocab_size [VOCAB_SIZE]]
                        [--temp_file_path [TEMP_FILE_PATH]]
                        dataset_path store_path

  positional arguments:
    dataset_path          path to the dataset.
    store_path            path where to store the results.

  optional arguments:
    -h, --help            show this help message and exit
    --dataset_size [DATASET_SIZE]
                          max number of samples in the dataset.
    --vocab_size [VOCAB_SIZE]
                          the size of the tokenizer's vocabulary.
    --temp_file_path [TEMP_FILE_PATH]
                          path where to save the temp file for the tokenizer.

It will create the train/test/val datasets and a BPE tokenizer for the model.
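As an illustration, a preparation run might look like the following; the dataset path, output directory, and parameter values here are placeholders, not files shipped with the project:

  python preparation.py --dataset_size 1000000 --vocab_size 30000 data/corpus.txt data/prepared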

Training

After that, you can train the model using run_transformer.py:

  usage: run_transformer.py [-h] [--model_save_path [MODEL_SAVE_PATH]]
                            [--batch_size [BATCH_SIZE]]
                            [--learning_rate [LEARNING_RATE]]
                            [--n_words [N_WORDS]] [--emb_size [EMB_SIZE]]
                            [--n_hid [N_HID]] [--n_layers [N_LAYERS]]
                            [--n_head [N_HEAD]] [--dropout [DROPOUT]]
                            data_path n_epochs tokenizer_path

  positional arguments:
    data_path             path to the train/test/val sets.
    n_epochs              number of training epochs.
    tokenizer_path        path to the tokenizer.

  optional arguments:
    -h, --help            show this help message and exit
    --model_save_path [MODEL_SAVE_PATH]
                          where to load/save the model.
    --batch_size [BATCH_SIZE]
                          batch size for training/validation.
    --learning_rate [LEARNING_RATE]
                          learning rate for training.
    --n_words [N_WORDS]   number of words to train on.
    --emb_size [EMB_SIZE]
                          embedding dimension.
    --n_hid [N_HID]       the dimension of the feedforward network model in
                          nn.TransformerEncoder.
    --n_layers [N_LAYERS]
                          the number of encoder/decoder layers in the
                          transformer.
    --n_head [N_HEAD]     the number of heads in the multi-head attention
                          layers.
    --dropout [DROPOUT]   dropout rate during training.

After training, it will compute the BLEU score on the test dataset.
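For instance, a training run over the prepared data might look like this; the paths and hyperparameter values below are illustrative, not the script's defaults:

  python run_transformer.py --model_save_path models/model.pt --batch_size 128 data/prepared 10 data/prepared/tokenizer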

Translation

To translate text, use translate.py:

  usage: translate.py [-h] [--encoding [ENCODING]] [--max_len [MAX_LEN]]
                      model_path tokenizer_path in_data_path out_data_path

  positional arguments:
    model_path            path to the trained model.
    tokenizer_path        path to the tokenizer.
    in_data_path          path to the input data.
    out_data_path         path where to save the results.

  optional arguments:
    -h, --help            show this help message and exit
    --encoding [ENCODING]
                          encoding for files.
    --max_len [MAX_LEN]   maximum translation length.
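For example, reusing the placeholder paths from the training step above (the input and output file names are likewise illustrative):

  python translate.py --max_len 100 models/model.pt data/prepared/tokenizer input.ru output.en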