English-Hindi Machine Transliteration in TensorFlow
In this project, we have built a character-level model for transliterating English text into Hindi, implemented in TensorFlow.
Transliteration is the phonetic translation of words in a source language (here, English) into equivalent words in a target language (here, Hindi); it preserves the pronunciation of the words.
This project has been inspired by the NMT model developed by deep-diver.
Transliteration being a many-to-many sequence problem, we built an encoder-decoder model in TensorFlow. The objective of the model is to transliterate English text into Hindi script.
The preprocessing consists of two important steps.
We then created a sequence-to-sequence model, i.e. encoder and decoder layers.
Inputs to the encoding layer:
enc_dec_model_inputs(): a function we defined to create placeholders for the encoder and decoder inputs.
hyperparam_inputs(): creates placeholders for the hyperparameters that are needed later, lr_rate (the learning rate) and keep_prob (the dropout keep probability).
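A minimal sketch of what these two helpers might look like in TensorFlow 1.x (the placeholder names, shapes and return values are assumptions):

```python
import tensorflow as tf  # TensorFlow 1.x

def enc_dec_model_inputs():
    # Integer-encoded source and target character sequences.
    inputs = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    # Target lengths are needed later for masking and the decoder helpers.
    target_sequence_length = tf.placeholder(tf.int32, [None], name='target_sequence_length')
    max_target_len = tf.reduce_max(target_sequence_length)
    return inputs, targets, target_sequence_length, max_target_len

def hyperparam_inputs():
    # lr_rate: learning rate, keep_prob: dropout keep probability.
    lr_rate = tf.placeholder(tf.float32, name='lr_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    return lr_rate, keep_prob
```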
Encoder:
It contains two parts: an embedding layer and an RNN layer. The embedding layer converts each character of the word into a dense vector whose size is specified by encoding_embedding_size. The RNN layer consists of LSTM cells wrapped with dropout and stacked together.
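A sketch of such an encoding layer, assuming the usual TF 1.x building blocks (the function signature and hyperparameter names are illustrative, not the project's exact code):

```python
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob,
                   source_vocab_size, encoding_embedding_size):
    # Embedding layer: each character id becomes a vector of
    # encoding_embedding_size features.
    embed = tf.contrib.layers.embed_sequence(rnn_inputs,
                                             vocab_size=source_vocab_size,
                                             embed_dim=encoding_embedding_size)

    # RNN layer: LSTM cells wrapped with dropout, then stacked.
    def make_cell():
        cell = tf.contrib.rnn.LSTMCell(rnn_size)
        return tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)

    stacked_cells = tf.contrib.rnn.MultiRNNCell([make_cell() for _ in range(num_layers)])
    outputs, state = tf.nn.dynamic_rnn(stacked_cells, embed, dtype=tf.float32)
    return outputs, state
```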
Decoder:
The decoding model mainly comprises two phases: training and inference. Both share the same architecture and parameters; the difference lies in how the shared model is fed.
A <GO> token is placed in front of all target data to mark the start of transliteration (handled by process_decoder_input). Unlike the encoder, the target data is not embedded with tf.contrib.layers.embed_sequence: the embedding layer has to be passed to both the training and inference phases, and embed_sequence embeds the prepared dataset before running, whereas the inference phase needs dynamic embedding capability. tf.variable_scope() is used to share parameters and variables between the training and inference processes, since they share the same architecture. Finally, the previously created layers, encoding_layer, process_decoder_input and decoding_layer, are put together to build the full-fledged sequence-to-sequence model.
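A condensed sketch of the decoder side as described above: process_decoder_input prepends <GO>, the embedding is a plain variable shared by both phases, and tf.variable_scope reuse ties training and inference together. The helper names, the <EOS> end token and the hyperparameters are assumptions, not the project's exact code:

```python
def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    # Drop the last character of every target sequence and prepend <GO>.
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    go_id = target_vocab_to_int['<GO>']
    return tf.concat([tf.fill([batch_size, 1], go_id), ending], 1)

def decoding_layer(dec_input, encoder_state, target_sequence_length, max_target_len,
                   target_vocab_to_int, target_vocab_size, rnn_size, num_layers,
                   decoding_embedding_size, keep_prob, batch_size):
    # Embedding as a plain variable so it can be shared and looked up
    # dynamically during inference (unlike tf.contrib.layers.embed_sequence).
    dec_embeddings = tf.Variable(
        tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

    cells = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.LSTMCell(rnn_size),
                                       output_keep_prob=keep_prob)
         for _ in range(num_layers)])
    output_layer = tf.layers.Dense(target_vocab_size)

    # Training phase: feed the ground-truth (shifted) target characters.
    with tf.variable_scope("decode"):
        train_helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input,
                                                         target_sequence_length)
        train_decoder = tf.contrib.seq2seq.BasicDecoder(cells, train_helper,
                                                        encoder_state, output_layer)
        train_output, _, _ = tf.contrib.seq2seq.dynamic_decode(
            train_decoder, maximum_iterations=max_target_len)

    # Inference phase: same variables (reuse=True), but each step is fed
    # the embedding of the previously predicted character.
    with tf.variable_scope("decode", reuse=True):
        start_tokens = tf.tile(tf.constant([target_vocab_to_int['<GO>']],
                                           dtype=tf.int32), [batch_size])
        infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
            dec_embeddings, start_tokens, target_vocab_to_int['<EOS>'])
        infer_decoder = tf.contrib.seq2seq.BasicDecoder(cells, infer_helper,
                                                        encoder_state, output_layer)
        infer_output, _, _ = tf.contrib.seq2seq.dynamic_decode(
            infer_decoder, maximum_iterations=max_target_len)

    return train_output, infer_output
```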
Cost Function:
tf.contrib.seq2seq.sequence_loss is used to compute the loss, i.e. a weighted softmax cross-entropy. The weights are provided explicitly as an argument and can be created with tf.sequence_mask.
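A sketch of how the loss could be wired up, reusing the tensor names assumed in the sketches above:

```python
# training_logits: [batch, time, vocab] logits from the training decoder
# targets:         [batch, time] integer-encoded target characters
masks = tf.sequence_mask(target_sequence_length, max_target_len,
                         dtype=tf.float32, name='masks')
cost = tf.contrib.seq2seq.sequence_loss(training_logits, targets, masks)
```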
Optimizer:
We used the Adam optimizer (tf.train.AdamOptimizer) with the specified learning rate.
Gradient Clipping:
tf.clip_by_value is used to clip the gradients in order to overcome the exploding-gradient problem.
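Putting the optimizer and the gradient clipping together (the clipping range of [-1, 1] is an assumption):

```python
optimizer = tf.train.AdamOptimizer(lr_rate)

# Compute the gradients, clip each one element-wise, then apply them.
gradients = optimizer.compute_gradients(cost)
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var)
                    for grad, var in gradients if grad is not None]
train_op = optimizer.apply_gradients(capped_gradients)
```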
We defined a get_accuracy function to compute training and validation accuracy.
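One possible shape for get_accuracy, assuming the targets and the arg-maxed predictions are integer matrices of shape [batch, time]:

```python
import numpy as np

def get_accuracy(target, logits):
    # Pad the shorter of the two along the time axis so the shapes match,
    # then compare character by character.
    max_seq = max(target.shape[1], logits.shape[1])
    if max_seq > target.shape[1]:
        target = np.pad(target, [(0, 0), (0, max_seq - target.shape[1])], 'constant')
    if max_seq > logits.shape[1]:
        logits = np.pad(logits, [(0, 0), (0, max_seq - logits.shape[1])], 'constant')
    return np.mean(np.equal(target, logits))
```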
We then saved the batch_size and save_path parameters for inference.
Finally, we defined a function word_to_seq to transliterate words input by the user.
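Presumably word_to_seq converts the user's word into the sequence of character ids that the trained model expects; a minimal sketch under that assumption (the <UNK> fallback token is itself an assumption):

```python
def word_to_seq(word, source_vocab_to_int):
    # Map each character of the input word to its integer id; characters not
    # present in the vocabulary fall back to the <UNK> token.
    return [source_vocab_to_int.get(ch, source_vocab_to_int['<UNK>'])
            for ch in word.lower()]
```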