In Python, I'm using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be a time-consuming process, especially over large corpora. It would be nice to do alignments in batch one day and use those alignments later on.
from nltk import IBMModel1 as ibm

biverses = [...]  # list of AlignedSent objects
model = ibm(biverses, 20)

with open(path + "eng-taq_model.txt", 'w') as f:
    f.write(model.train(biverses, 20))  # makes empty file
Once I create a model, how can I (1) save it to disk and (2) reuse it later?
You discuss saving the aligner model, but your question seems to be more about saving the bitexts you have aligned: "It would be nice to do alignments in batch one day and use those alignments later on." That is the question I'm going to answer.
In the NLTK environment, the best way to use a corpus-like resource is to access it with a corpus reader. NLTK doesn't come with corpus writers, but the format supported by NLTK's AlignedCorpusReader is very easy to generate (NLTK 3 version):

That's it. You can later reload and use your aligned sentences exactly as you'd use the comtrans corpus:

As you can see, you don't need the aligner object itself. The aligned sentences can be loaded with a corpus reader, and the aligner itself is pretty useless unless you want to study the embedded probabilities.
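The plain-text layout mentioned above can be sketched without NLTK. This is a minimal illustration, assuming the three-lines-per-pair layout of the comtrans files (source words, target words, then i-j alignment pairs); the tuples, the helper names, and the file name are hypothetical stand-ins for real AlignedSent objects and paths.

```python
import os
import tempfile

# Hypothetical stand-ins for aligned sentences:
# (source words, target words, alignment pairs).
aligned = [
    (["the", "house"], ["das", "Haus"], [(0, 0), (1, 1)]),
    (["a", "small", "book"], ["ein", "kleines", "Buch"], [(0, 0), (1, 1), (2, 2)]),
]


def write_aligned(path, sentences):
    """Write each pair as three lines: source, target, 'i-j' alignment pairs."""
    with open(path, "w", encoding="utf-8") as out:
        for words, mots, alignment in sentences:
            out.write(" ".join(words) + "\n")
            out.write(" ".join(mots) + "\n")
            out.write(" ".join("%d-%d" % pair for pair in alignment) + "\n")


def read_aligned(path):
    """Read the three-line records back into the same tuple structure."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    sentences = []
    for i in range(0, len(lines), 3):
        words = lines[i].split()
        mots = lines[i + 1].split()
        alignment = [tuple(int(n) for n in pair.split("-"))
                     for pair in lines[i + 2].split()]
        sentences.append((words, mots, alignment))
    return sentences


# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "eng-taq_aligned.txt")
write_aligned(path, aligned)
reloaded = read_aligned(path)
```

With NLTK installed, a file in this layout should be readable through AlignedCorpusReader (directory plus file pattern) and its aligned_sents() method, though I have not checked that against every NLTK version.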
Comment: I'm not sure I would call the aligner object a "model". In NLTK 2, the aligner is not set up to align new text; it doesn't even have an align() method. In NLTK 3 the function align() can align new text, but only if used from Python 2; in Python 3 it is broken, apparently because of the tightened rules for comparing objects of different types. If you nevertheless want to be able to pickle and reload the aligner, I'll be happy to add it to my answer; from what I've seen it can be done with vanilla cPickle.

The immediate answer is to pickle it; see https://wiki.python.org/moin/UsingPickle
But because IBMModel1 returns a lambda function, it's not possible to pickle it with the default pickle/cPickle (see https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 and https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L104).

So we'll use dill. First, install dill; see "Can Python pickle lambda functions?" Then:
To use the pickled model:
If you try to pickle the IBMModel1 object, which is a lambda function, you'll end up with this:

(Note: the above code snippet comes from NLTK version 3.0.0)
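The failure can be reproduced without NLTK. The sketch below assumes the unpicklable part is a lambda held inside the object, here as the default factory of a defaultdict; the probabilities table is a hypothetical stand-in, and the linked source lines remain the authority on what IBMModel1 actually stores.

```python
import pickle
from collections import defaultdict

# Hypothetical stand-in for a trained probability table whose default
# value comes from a lambda (an assumption about the real object).
probabilities = defaultdict(lambda: 0.0)
probabilities[("house", "Haus")] = 0.9

try:
    pickle.dumps(probabilities)
    pickled_ok = True
except (pickle.PicklingError, AttributeError) as err:
    # The exception type varies with the Python version and with where
    # the lambda was defined, so both are caught here.
    pickled_ok = False
    error_message = str(err)

# Copying the entries into a plain dict drops the lambda, so the standard
# pickle module can handle the data (the default value is lost).
restored = pickle.loads(pickle.dumps(dict(probabilities)))
```

The plain-dict copy is a different technique from the dill approach described in this answer: it keeps only the stored entries, which may be enough if all you want is to inspect the learned probabilities.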
In Python 3 with NLTK 3.0.0, you will face the same problem because IBMModel1 returns a lambda function:

(Note: In Python 3, pickle is cPickle; see http://docs.pythonsprints.com/python3_porting/py-porting.html)

If you want, and it looks like you do, you can store it as an AlignedSent list:
After that, you can save the list with dill as a pickle:
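A minimal sketch of the save-and-reload round trip follows. So that it runs without NLTK or dill, it uses the standard pickle module and plain tuples as hypothetical stand-ins for AlignedSent objects; dill offers dump and load calls with the same shape, so swapping the import should be the only change needed if the real objects require it. The file name is also hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for AlignedSent objects:
# (source words, target words, alignment pairs).
aligned = [
    (["the", "house"], ["das", "Haus"], [(0, 0), (1, 1)]),
    (["a", "book"], ["ein", "Buch"], [(0, 0), (1, 1)]),
]

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "eng-taq_aligned.pk")

# Save the list in one batch run...
with open(path, "wb") as fout:
    pickle.dump(aligned, fout)

# ...and load it again in a later session.
with open(path, "rb") as fin:
    reloaded = pickle.load(fin)
```

Pickle files must be opened in binary mode ('wb' and 'rb'); opening them in text mode is a common cause of errors in Python 3.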