What is the best way to translate a big amount of

Posted 2020-02-26 09:14

I have a lot of text data and want to translate it to different languages.

Possible ways I know:

  • Google Translate API
  • Bing Translate API

The problem is that all these services limit text length, number of calls, etc., which makes them inconvenient to use.

What services or approaches would you advise in this case?

10 answers
趁早两清
Reply #2 · 2020-02-26 10:03

Disclaimer: While I definitely find tokenizing suspect as a means of translation, splitting on sentences, as later illustrated by typoking, may produce results that meet your requirements.

I suggested that his code could be improved by reducing the 30+ lines of string munging to the one-line regex he asked for in another question, but the suggestion was not well received.

Here is an implementation using the Google API for .NET, in both C# and VB.

Program.cs

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using Google.API.Translate;

namespace TokenizingTranslatorCS
{
    internal class Program
    {
        private static readonly TranslateClient Client =
            new TranslateClient("http://code.google.com/p/google-api-for-dotnet/");

        private static void Main(string[] args)
        {
            Language originalLanguage = Language.English;
            Language targetLanguage = Language.German;

            string filename = args[0];

            StringBuilder output = new StringBuilder();

            string[] input = File.ReadAllLines(filename);

            foreach (string line in input)
            {
                List<string> translatedSentences = new List<string>();
                string[] sentences = Regex.Split(line, "\\b(?<sentence>.*?[\\.!?](?:\\s|$))");
                foreach (string sentence in sentences)
                {
                    string sentenceToTranslate = sentence.Trim();

                    if (!string.IsNullOrEmpty(sentenceToTranslate))
                    {
                        translatedSentences.Add(TranslateSentence(sentenceToTranslate, originalLanguage, targetLanguage));
                    }
                }


                output.AppendLine(string.Format("{0}{1}", string.Join(" ", translatedSentences.ToArray()),
                                                Environment.NewLine));
            }

            Console.WriteLine("Translated:{0}{1}{0}", Environment.NewLine, string.Join(Environment.NewLine, input));
            Console.WriteLine("To:{0}{1}{0}", Environment.NewLine, output);
            Console.WriteLine("{0}Press any key{0}", Environment.NewLine);


            Console.ReadKey();
        }

        private static string TranslateSentence(string sentence, Language originalLanguage, Language targetLanguage)
        {
            string translatedSentence = Client.Translate(sentence, originalLanguage, targetLanguage);
            return translatedSentence;
        }
    }
}

Module1.vb

Imports System.Text.RegularExpressions
Imports System.IO
Imports System.Text
Imports Google.API.Translate


Module Module1

    Private Client As TranslateClient = New TranslateClient("http://code.google.com/p/google-api-for-dotnet/")

    Sub Main(ByVal args As String())

        Dim originalLanguage As Language = Language.English
        Dim targetLanguage As Language = Language.German

        Dim filename As String = args(0)

        Dim output As New StringBuilder

        Dim input As String() = File.ReadAllLines(filename)

        For Each line As String In input
            Dim translatedSentences As New List(Of String)
            Dim sentences As String() = Regex.Split(line, "\b(?<sentence>.*?[\.!?](?:\s|$))")
            For Each sentence As String In sentences

                Dim sentenceToTranslate As String = sentence.Trim

                If Not String.IsNullOrEmpty(sentenceToTranslate) Then

                    translatedSentences.Add(TranslateSentence(sentenceToTranslate, originalLanguage, targetLanguage))

                End If

            Next

            output.AppendLine(String.Format("{0}{1}", String.Join(" ", translatedSentences.ToArray), Environment.NewLine))

        Next

        Console.WriteLine("Translated:{0}{1}{0}", Environment.NewLine, String.Join(Environment.NewLine, input))
        Console.WriteLine("To:{0}{1}{0}", Environment.NewLine, output)
        Console.WriteLine("{0}Press any key{0}", Environment.NewLine)
        Console.ReadKey()


    End Sub

    Private Function TranslateSentence(ByVal sentence As String, ByVal originalLanguage As Language, ByVal targetLanguage As Language) As String

        Dim translatedSentence As String = Client.Translate(sentence, originalLanguage, targetLanguage)
        Return translatedSentence
    End Function

End Module

Input (stolen directly from typoking)

Just to prove a point I threw this together :) It is rough around the edges, but it will handle a WHOLE lot of text and it does just as good as Google for translation accuracy because it uses the Google API. I processed Apple's entire 2005 SEC 10-K filing with this code and the click of one button (took about 45 minutes). The result was basically identical to what you would get if you copied and pasted one sentence at a time into Google Translator. It isn't perfect (ending punctuation is not accurate and I didn't write to the text file line by line), but it does show proof of concept. It could have better punctuation if you worked with Regex some more.

Results (to German, for typoking):

Nur um zu beweisen einen Punkt warf ich dies zusammen:) Es ist Ecken und Kanten, aber es wird eine ganze Menge Text umgehen und es tut so gut wie Google für die Genauigkeit der Übersetzungen, weil es die Google-API verwendet. Ich verarbeitet Apple's gesamte 2005 SEC 10-K Filing bei diesem Code und dem Klicken einer Taste (dauerte ca. 45 Minuten). Das Ergebnis war im wesentlichen identisch zu dem, was Sie erhalten würden, wenn Sie kopiert und eingefügt einem Satz in einer Zeit, in Google Translator. Es ist nicht perfekt (Endung Interpunktion ist nicht korrekt und ich wollte nicht in die Textdatei Zeile für Zeile) schreiben, aber es zeigt proof of concept. Es hätte besser Satzzeichen, wenn Sie mit Regex arbeitete einige mehr.

放我归山
Reply #3 · 2020-02-26 10:04

You could use Amazon's Mechanical Turk: https://www.mturk.com/

You set a fee for translating a sentence or paragraph, and real people will do the work. Plus you can automate it with Amazon's APIs.

beautiful°
Reply #4 · 2020-02-26 10:05

I had to solve the same problem when integrating language translation with an XMPP chat server. I partitioned my payload (the text I needed to translate) into smaller subsets of complete sentences. I can't recall the exact number, but with Google's REST-based translation URL I translated sets of complete sentences that collectively totaled at most 1024 characters, so a large paragraph would result in multiple translation service calls.
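
The partitioning described above can be sketched roughly as follows. This is a minimal illustration, not the code I actually used: the function names are made up for the example, the sentence splitter is a crude regex heuristic, and the 1024-character limit is the figure I recalled above, so check it against whatever service you call.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def batch_sentences(sentences, limit=1024):
    """Greedily pack whole sentences into chunks of at most `limit` characters."""
    batches = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                batches.append(current)
            # A single sentence longer than the limit still becomes its own
            # chunk; the caller has to decide how to split it further.
            current = sentence
    if current:
        batches.append(current)
    return batches
```

Each chunk returned by batch_sentences is then sent as one translation request, and the translated chunks are joined back together in order.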

唯我独甜
Reply #5 · 2020-02-26 10:05

There are plenty of different Machine Translation APIs: Google, Microsoft, Yandex, IBM, PROMT, Systran, Baidu, YeeCloud, DeepL, SDL, SAP.

Some of them support batch requests (translating an array of texts at once). I would translate sentence by sentence, with proper handling of 403/429 errors (the status codes usually returned when a quota is exceeded).
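
One way to handle those errors is to retry with exponential backoff. The sketch below assumes a hypothetical translate callable that raises a QuotaExceeded exception carrying the HTTP status; neither name belongs to any real client library, so adapt the exception handling to the SDK you use. Note that some services also return 403 for permanent authorization failures, in which case retrying will not help.

```python
import time


class QuotaExceeded(Exception):
    """Hypothetical error carrying the HTTP status returned by the service."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def translate_with_retry(translate, text, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Call translate(text), backing off exponentially on 403/429 responses."""
    for attempt in range(max_attempts):
        try:
            return translate(text)
        except QuotaExceeded as err:
            retryable = err.status in (403, 429)
            last_attempt = attempt == max_attempts - 1
            if not retryable or last_attempt:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The sleep parameter exists only so the delay can be replaced in tests; in normal use the default time.sleep is fine.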

I may refer you to our recent evaluation study (November 2017): https://www.slideshare.net/KonstantinSavenkov/state-of-the-machine-translation-by-intento-november-2017-81574321

兄弟一词,经得起流年.
Reply #6 · 2020-02-26 10:10

Google provides a useful tool, Google Translator Toolkit, which lets you upload files and translate them all at once into any language Google Translate supports. It's free if you use the automated translations, and there is also an option to hire real people to translate your documents for you.

From Wikipedia:

Google Translator Toolkit is a web application designed to allow translators to edit the translations that Google Translate automatically generates. With the Google Translator Toolkit, translators can organize their work and use shared translations, glossaries and translation memories. They can upload and translate Microsoft Word documents, OpenOffice.org, RTF, HTML, text, and Wikipedia articles.

Link

混吃等死
Reply #7 · 2020-02-26 10:11

This is a long shot, but here goes:

Perhaps this blog post, which describes using Second Life to translate articles, would be helpful for you too?

I am not sure whether Second Life's API allows you to do the translation in an automated way, though.
