Welcome to Day-to-Day Codes By Indrajith: Baseline Translation Sinhala-English

Hi !

Welcome to the baseline translation tutorial from Sinhala to English. First of all you need to identify the target language. Target language is the language into which a text, document, or speech is translated. In our SATS tutorial, the target language is "English".

To continue with the tutorial, you need a Sinhala-English parallel corpus. If you are not sure what a parallel corpus is, just look at the picture below.

[picture: a sinhala-english parallel corpus]

We need to save these lines separately in two different files, with the extension .si for Sinhala and .en for English.

indrajith.en-si.si
indrajith.en-si.en

You can manually create these two files.
⚠ Both files should have an equal number of lines, and both should be encoded in UTF-8.
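
If you want to verify both conditions before going further, something like the following could be used. This is only a minimal sketch: check_parallel is a helper name I made up for illustration (it is not part of Moses), and it assumes wc, tr and iconv are available on your system.

```shell
# Hypothetical helper: checks that two files have the same number of
# lines and that both decode cleanly as UTF-8.
check_parallel() {
    cp_src="$1"
    cp_tgt="$2"
    # tr strips the padding that some wc implementations add
    cp_src_lines=$(wc -l < "$cp_src" | tr -d ' ')
    cp_tgt_lines=$(wc -l < "$cp_tgt" | tr -d ' ')
    if [ "$cp_src_lines" -ne "$cp_tgt_lines" ]; then
        echo "line count mismatch: $cp_src_lines vs $cp_tgt_lines" >&2
        return 1
    fi
    for cp_file in "$cp_src" "$cp_tgt"; do
        # iconv exits with an error when the input is not valid UTF-8
        if ! iconv -f UTF-8 -t UTF-8 "$cp_file" > /dev/null 2>&1; then
            echo "$cp_file is not valid UTF-8" >&2
            return 1
        fi
    done
    echo "OK: $cp_src_lines line pairs"
}

# Example call (file names as used in this tutorial):
#   check_parallel indrajith.en-si.si indrajith.en-si.en
```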

If you need them, I can give you my files; send a request to aiukumara@gmail.com

Go to the sats folder and execute the following commands:

indrajith@indrajith-Inspiron-N5050 ~/sats $  mkdir corpus
indrajith@indrajith-Inspiron-N5050 ~/sats $  cd corpus
indrajith@indrajith-Inspiron-N5050 ~/sats/corpus $ mkdir training

Place the indrajith.en-si.si and indrajith.en-si.en files in the "training" directory.

The first step is

tokenisation: This means that spaces have to be inserted between (e.g.) words and punctuation.
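
To get a feel for what "inserting spaces between words and punctuation" means, here is a toy sketch using sed. It is not what tokenizer.perl does internally (the real script also handles abbreviations, apostrophes and more); toy_tokenize is just a name I made up for illustration.

```shell
# Toy illustration only: put a space around every punctuation character,
# then squeeze repeated spaces and trim the ends of the line.
toy_tokenize() {
    sed -E 's/([[:punct:]])/ \1 /g; s/ +/ /g; s/^ //; s/ $//'
}

echo "Hello, world!" | toy_tokenize
# prints: Hello , world !
```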

Save the tokenized data into "corpus" directory.

~/sats/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    < ~/sats/corpus/training/indrajith.en-si.en    \
    > ~/sats/corpus/indrajith.en-si.tok.en

⚠ Tokenizing the Sinhala Unicode text with the same script, as in the command below, prints the following warning, because the tokenizer has no abbreviation list for 'si' and falls back to the English version. So avoid this step.

 ~/sats/mosesdecoder/scripts/tokenizer/tokenizer.perl -l si \
    < ~/sats/corpus/training/indrajith.en-si.si    \
    > ~/sats/corpus/indrajith.en-si.tok.si

Tokenizer Version 1.1
Language: si
Number of threads: 1

WARNING: No known abbreviations for language 'si', attempting fall-back to English version...

So, just copy the indrajith.en-si.si file into "corpus" directory and rename it as indrajith.en-si.tok.si
DONE !!! 😊😊😊

Now check whether you have indrajith.en-si.tok.si and indrajith.en-si.tok.en files in the sats/corpus directory.

The next step is

truecasing: The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.

The truecaser first requires training, in order to extract some statistics about the text:

That means, we need to create two separate truecase-models for English and Sinhala.

~/sats/mosesdecoder/scripts/recaser/train-truecaser.perl \
     --model ~/sats/corpus/indra-truecase-model.en --corpus     \
     ~/sats/corpus/indrajith.en-si.tok.en
~/sats/mosesdecoder/scripts/recaser/train-truecaser.perl \
     --model ~/sats/corpus/indra-truecase-model.si --corpus     \
     ~/sats/corpus/indrajith.en-si.tok.si

But the truecaser has no real effect on the Sinhala file, because Sinhala script has no upper and lower case. Just let it be! No harm at all.

Now, we use the above trained truecase-models to truecase the tokenized data.

~/sats/mosesdecoder/scripts/recaser/truecase.perl \
   --model ~/sats/corpus/indra-truecase-model.en         \
   < ~/sats/corpus/indrajith.en-si.tok.en \
   > ~/sats/corpus/indrajith.en-si.true.en

~/sats/mosesdecoder/scripts/recaser/truecase.perl \
   --model ~/sats/corpus/indra-truecase-model.si         \
   < ~/sats/corpus/indrajith.en-si.tok.si \
   > ~/sats/corpus/indrajith.en-si.true.si

After these steps, you'll find two files, indrajith.en-si.true.si and indrajith.en-si.true.en, in the corpus directory. They might look very similar to the original files.
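
If you are curious how many lines truecasing actually changed, a rough sketch like the one below can count them. count_changed_lines is a hypothetical helper of mine, not a Moses script, and it assumes that no line in either file contains a tab character.

```shell
# Hypothetical helper: count the lines that differ between two files
# of equal length. Assumes the lines themselves contain no tabs.
count_changed_lines() {
    # $1 = file before truecasing, $2 = file after truecasing
    paste "$1" "$2" |
        awk -F '\t' '($1 "") != ($2 "") { n++ } END { print n + 0 }'
}

# Example call (paths as used in this tutorial):
#   count_changed_lines ~/sats/corpus/indrajith.en-si.tok.en \
#                       ~/sats/corpus/indrajith.en-si.true.en
```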

Next we need to clean the data. (Don't skip this step. It is very important when training the system. If you skip it, the "died with signal 11, without coredump" error will appear in the training step.)

cleaning: Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously mis-aligned sentences are removed.

Notice that the following command processes both sides at once.

~/sats/mosesdecoder/scripts/training/clean-corpus-n.perl \
    ~/sats/corpus/indrajith.en-si.true si en \
    ~/sats/corpus/indrajith.en-si.clean 1 80

This cleaning step is very important! So do not skip it.
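
To make the 1 and 80 in the cleaning command concrete, here is a simplified sketch of the length filter. It only keeps sentence pairs where both sides have between the minimum and maximum number of tokens; the real clean-corpus-n.perl does more than this (for example, it also removes obviously mis-aligned pairs). filter_pairs is a made-up name, and the sketch assumes the sentences contain no tab characters.

```shell
# Simplified illustration of length-based cleaning (not the Moses script).
# Prints the surviving pairs as "source<TAB>target".
filter_pairs() {
    # $1 = source file, $2 = target file, $3 = min tokens, $4 = max tokens
    paste "$1" "$2" | awk -F '\t' -v min="$3" -v max="$4" '
        {
            ns = split($1, s, " ")   # token count on the source side
            nt = split($2, t, " ")   # token count on the target side
            if (ns >= min && ns <= max && nt >= min && nt <= max) print
        }'
}

# Example call (paths as used in this tutorial):
#   filter_pairs ~/sats/corpus/indrajith.en-si.true.si \
#                ~/sats/corpus/indrajith.en-si.true.en 1 80
```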

Next is Language Model Training. We should do this step only to the target language. In our case, the target language is English.

First create a directory for the language model, "lm", in the sats folder and go into that directory. Then execute the following code.

mkdir ~/sats/lm
cd ~/sats/lm

~/sats/mosesdecoder/bin/lmplz -o 3 <~/sats/corpus/indrajith.en-si.true.en > indrajith.en-si.arpa.en

After this step we need to binarise (for faster loading) the *.arpa.en file using KenLM:

 ~/sats/mosesdecoder/bin/build_binary \
  indrajith.en-si.arpa.en \
   indrajith.en-si.blm.en

If you see the following lines in your console, then you have succeeded. A file named indrajith.en-si.blm.en should be created inside the 'lm' directory.

----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS

If you did not get this, try re-doing from the beginning.

Now we can check whether our language model has been built successfully by using the following command.

indrajith@indrajith-Inspiron-N5050 ~/sats/lm $ echo "මේක මායාව සහ පුදුමය පිරුනු බිමක්" | ~/sats/mosesdecoder/bin/query indrajith.en-si.blm.en

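If you get output like the following, the language model has been created successfully. (The query sentence here is Sinhala while the model is English, so every word is reported as out-of-vocabulary: OOVs: 6.)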

මේක=0 1 -4.9586806 මායාව=0 1 -4.291745 සහ=0 1 -4.291745 පුදුමය=0 1 -4.291745 පිරුනු=0 1 -4.291745 බිමක්=0 1 -4.291745 </s>=2 1 -0.913644 Total: -27.331049 OOV: 6
Perplexity including OOVs: 8024.824928231607
Perplexity excluding OOVs: 8.196763534546985
OOVs: 6
Tokens: 7
Name:query VmPeak:29620 kB VmRSS:8584 kB RSSMax:9540 kB user:0 sys:0.008 CPU:0.008 real:0.0391223

Training the Translation System

First of all, create a sub-directory called 'working' in the sats directory. Go into working.

mkdir ~/sats/working
cd ~/sats/working

Finally we come to the main event - training the translation model. To do this, we run word-alignment (using GIZA++), phrase extraction and scoring, create lexicalised reordering tables and create your Moses configuration file, all with a single command. I recommend that you create an appropriate directory as follows, and then run the training command, catching logs:

Try this:

nohup nice ~/sats/mosesdecoder/scripts/training/train-model.perl \
    -root-dir train \
    -corpus ~/sats/corpus/indrajith.en-si.clean \
    -f si -e en \
    -alignment grow-diag-final-and \
    -reordering msd-bidirectional-fe \
    -lm 0:3:$HOME/sats/lm/indrajith.en-si.blm.en:8 \
    -external-bin-dir ~/sats/mosesdecoder/tools \
    >& indra-training.out &

For this training, cleaning the corpus is essential. Cleaning removes the problematic lines from our corpus, which may otherwise lead to error 11.
Error 11 looks like this:
ERROR: Execution of: /home/indrajith/sats/mosesdecoder/tools/GIZA++ -CoocurrenceFile /home/indrajith/sats/working/train/giza.si-en/si-en.cooc -c /home/indrajith/sats/working/train/corpus/si-en-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -nodumps 1 -nsmooth 4 -o /home/indrajith/sats/working/train/giza.si-en/si-en -onlyaldumps 1 -p0 0.999 -s /home/indrajith/sats/working/train/corpus/en.vcb -t /home/indrajith/sats/working/train/corpus/si.vcb
died with signal 11, without coredump

I overcame this error by cleaning the corpus. Note that mert.out is only created later, in the tuning step; the log to check at this point is indra-training.out.
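
If you hit this error and want to see whether your corpus contains empty lines (one of the things the cleaning step removes), a small sketch like this lists them. list_blank_lines is a hypothetical helper name, not part of Moses.

```shell
# Hypothetical helper: print the line numbers of lines that are empty
# or contain only whitespace. Prints nothing if there are none.
list_blank_lines() {
    grep -n '^[[:space:]]*$' "$1" | cut -d: -f1
}

# Example call (path as used in this tutorial):
#   list_blank_lines ~/sats/corpus/indrajith.en-si.true.si
```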

If any issue comes up, open the indra-training.out file, which is created in the working directory. It makes troubleshooting easier.
Once training has finished, there should be a moses.ini file in the directory ~/sats/working/train/model.

After this step, the next is the tuning part. We need to prepare a small Sinhala-English parallel corpus to tune the system. Here I am using indra-test.en and indra-test.si for tuning. We have to tokenise and truecase them first (don't forget to use the correct language if you're not building a si->en system).

cd ~/sats/corpus
~/sats/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
   < indra-test.en > indra-test.tok.en

# The Sinhala training text was left untokenized, so the tuning text is copied as-is to match it.
cp indra-test.si indra-test.tok.si

 ~/sats/mosesdecoder/scripts/recaser/truecase.perl --model indra-truecase-model.en \
   < indra-test.tok.en > indra-test.true.en

 ~/sats/mosesdecoder/scripts/recaser/truecase.perl --model indra-truecase-model.si \
   < indra-test.tok.si > indra-test.true.si

Now go back to the working directory.
cd ~/sats/working

Let's now launch the tuning process.

nohup nice ~/sats/mosesdecoder/scripts/training/mert-moses.pl \
~/sats/corpus/indra-test.true.si ~/sats/corpus/indra-test.true.en \
~/sats/mosesdecoder/bin/moses train/model/moses.ini --mertdir \
~/sats/mosesdecoder/bin/ &> mert.out &

This tuning step takes time.
If you have several cores at your disposal, then it'll be a lot faster to run Moses multi-threaded. Add --decoder-flags="-threads 4" to the last line above in order to run the decoder with 4 threads. The end result of tuning is an ini file with trained weights, which should be in ~/sats/working/mert-work/moses.ini if you've used the same directory structure as me.

TADA !!!!!

Now we have come to the end of our training.
Execute the following command from anywhere (the paths are absolute), and try entering Sinhala sentences. You will get the English translation.

 ~/sats/mosesdecoder/bin/moses -f ~/sats/working/mert-work/moses.ini

TUTORIAL COMPLETED !

For anyone who wishes to do the BLEU test:

You can test the decoder by first translating the test set (takes a wee while) and then running the BLEU script on it.
Go to the working directory:

cd ~/sats/working

Here we translate the test set indra-test.true.si and then compare the output against the reference indra-test.true.en.

nohup nice ~/sats/mosesdecoder/bin/moses            \
   -f ~/sats/working/mert-work/filtered/moses.ini   \
   < ~/sats/corpus/indra-test.true.si                \
   > ~/sats/working/indra-test.translated.en         \
   2> ~/sats/working/indra.out

That worked. Now the BLEU test:

~/sats/mosesdecoder/scripts/generic/multi-bleu.perl \
   -lc ~/sats/corpus/indra-test.true.en              \
   < ~/sats/working/indra-test.translated.en
