We're looking for an open source Machine Translation Engine that could be incorporated into our localization workflow. We're looking at the options below:
- Moses (C++)
- Joshua (Java)
- Phrasal (Java)
Among these, Moses has the widest community support and has been tried out by many localization companies and researchers. We are actually leaning towards a Java-based engine, since our applications are all in Java. Have any of you used either Joshua or Phrasal as part of your workflow? Could you please share your experiences with them? Or is Moses too far ahead of these in terms of the features it provides and its ease of integration?
We also require that the engine supports:
- Domain-specific training (i.e. it should maintain a separate phrase table for each domain that the input data belongs to).
- Incremental training (i.e. avoiding having to retrain the model from scratch every time we wish to use some new training data).
- Parallelizing the translation process.
This question is better asked on the Moses mailing list (moses-support@mit.edu), I think. There are lots of people there working with different types of systems, so you'll get an objective answer. Apart from that, here's my input:
- With respect to Java: it does not matter in which language the MT system is written. No offense, but you may safely assume that even if the code were written in a language you are familiar with, it would be too difficult to understand without a deeper knowledge of MT. So what you are looking for are interfaces. Moses's XML-RPC interface works fine.
- With respect to MT systems: look for the best results and ignore the programming language the system is written in. Results are here: matrix.statmt.org. The people using your MT system are interested in the output, not in your coding preferences.
- With respect to the whole venture: once you start offering MT output, make sure you can adapt it quickly. MT is rapidly shifting towards a pipeline process in which an MT system is the core (and not the only) component. So focus on maintainability. In the ideal case, you would be able to connect any MT system to your framework.
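To make the point about interfaces a bit more concrete, here is a minimal Python sketch (not a definitive implementation) of hiding the engine behind a small interface so that it can be swapped later. It assumes a Moses server reachable at http://localhost:8080/RPC2 that exposes a `translate` method taking a struct with a `text` field and returning a struct with a `text` field. The URL, the `Translator` protocol, and the class and function names are illustrative and not part of Moses.

```python
import xmlrpc.client
from typing import List, Protocol


class Translator(Protocol):
    """Hypothetical interface the rest of the workflow depends on."""

    def translate(self, sentence: str) -> str: ...


class MosesXmlRpcTranslator:
    """Backend that forwards sentences to a Moses XML-RPC server."""

    def __init__(self, url: str = "http://localhost:8080/RPC2", proxy=None):
        # The URL is an assumption; point it at your own server.
        # A ready-made proxy object can be injected, e.g. for testing.
        self._proxy = proxy if proxy is not None else xmlrpc.client.ServerProxy(url)

    def translate(self, sentence: str) -> str:
        # Assumed request/response shape: {"text": ...} in, {"text": ...} out.
        response = self._proxy.translate({"text": sentence})
        return response["text"].strip()


def translate_document(engine: Translator, sentences: List[str]) -> List[str]:
    """Workflow code that only knows about the Translator interface."""
    return [engine.translate(s) for s in sentences]
```

With this shape, connecting Joshua, Phrasal, or any other engine would mean writing another class with the same `translate` method, while `translate_document` and the code around it stay unchanged.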
And here's some input on your feature requests:
- Domain-specific training: you don't need that feature. You get the best MT results by training on customer-specific data.
- Incremental training: see Stream Based Statistical Machine Translation
- Parallelizing the translation process: you will have to implement this yourself. Note that most MT software is purely academic and will never reach a 1.0 milestone. It helps of course if a multi-threaded server is available (Moses), but even then, you will need lots of harnessing code.
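As a rough idea of what the simplest piece of that harnessing code might look like, here is a small Python sketch that fans sentences out to a pool of worker threads and returns the translations in input order. The function name `translate_parallel` and the `translate_one` callable are hypothetical; `translate_one` stands for whatever single-sentence call you have, for example a request to a multi-threaded translation server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def translate_parallel(
    translate_one: Callable[[str], str],
    sentences: List[str],
    max_workers: int = 4,
) -> List[str]:
    """Translate sentences concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which request finishes first.
        return list(pool.map(translate_one, sentences))
```

A real setup would also need timeouts, retries, and handling of sentences that fail, none of which is shown here. If `translate_one` wraps a network client, it is probably safer to create one client per thread or per call than to share a single client object across threads.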
Hope this helps. Feel free to PM me if you have any more questions.
A lot has moved forward, so I thought I'd give an update on this topic and leave the previous answer in place to document the progress.
Domain-specific training: domain adaptation techniques can be useful if your data is taken from various sources and you need to optimise towards a sub-domain. From our experience, there is no single solution that consistently performs best, so you need to try out as many approaches as possible and compare the results. A message on the Moses mailing list lists possible methods (http://thread.gmane.org/gmane.comp.nlp.moses.user/9742/focus=9799), and the following page gives an overview of the current research: http://www.statmt.org/survey/Topic/DomainAdaptation
Incremental training: there was an interesting talk at IWSLT 2013 (http://www.iwslt2013.org/downloads/Assessing_Quick_Update_Methods_of_Statistical_Translation_Models.pdf). It demonstrated that current incremental methods (1) take your system offline, so you have no real "live update" of your models, and (2) are outperformed by full retraining. It seems that the problem has not been solved yet.
Parallelizing the translation process: the Moses server lags behind the moses-cmd binary. So if you want to use the latest features, it is better to start from moses-cmd. Also, the community has not kept its promise of never releasing a 1.0 version :-). In fact, you can find the latest release (2.1) here: http://www.statmt.org/moses/?n=Moses.Releases
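If you start from the command-line binary rather than the server, one way to parallelize is to split the input into shards and run one decoder process per shard. The following Python sketch illustrates the idea under the assumption that the decoder reads one sentence per line from stdin and writes one translation per line to stdout. The function names are hypothetical, and the command you pass in (for instance something like `["moses", "-f", "moses.ini"]`, with the binary name and config path depending on your installation) is an assumption as well.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_shards(lines: List[str], n_shards: int) -> List[List[str]]:
    """Split lines into at most n_shards contiguous, roughly equal chunks."""
    size = max(1, -(-len(lines) // n_shards))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def run_decoder(command: Sequence[str], lines: List[str]) -> List[str]:
    """Feed one sentence per line to the decoder's stdin and read its stdout."""
    result = subprocess.run(
        list(command),
        input="\n".join(lines) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if the decoder exits with an error
    )
    return result.stdout.splitlines()


def translate_batch(
    command: Sequence[str], lines: List[str], n_shards: int = 4
) -> List[str]:
    """Run one decoder process per shard and concatenate outputs in order."""
    shards = split_into_shards(lines, n_shards)
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        outputs = list(pool.map(lambda shard: run_decoder(command, shard), shards))
    return [line for shard_output in outputs for line in shard_output]
```

Note that each process loads its own copy of the models, so memory use grows with the number of shards. This approach suits offline batch jobs better than interactive use.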