Implementing internationalization (language string

2020-05-19 06:43发布

问题:

I want to build a CMS that can handle fetching locale strings to support internationalization. I plan on storing the strings in a database, and then placing a key/value cache like memcache in between the database and the application to prevent performance drops for hitting the database each page for a translation.

This is more complex than using PHP files with arrays of strings - but that method is incredibly inefficient when you have 2,000 translation lines.

I thought about using gettext, but I'm not sure that users of the CMS will be comfortable working with the gettext files. If the strings are stored in a database, then a nice administration system can be setup to allow them to make changes whenever they want and the caching in RAM will insure that the fetching of those strings is as fast, or faster than gettext. I also don't feel safe using the PHP extension considering not even the zend framework uses it.

Is there anything wrong with this approach?

Update

I thought perhaps I would add more food for thought. One of the problems with string translations it is that they doesn't support dates, money, or conditional statements. However, thanks to intl PHP now has MessageFormatter which is what really needs to be used anyway.

// Load string from gettext file
$string = _("{0} resulted in {1,choice,0#no errors|1#single error|1<{1, number} errors}");

// Format using the current locale
msgfmt_format_message(setlocale(LC_ALL, 0), $string, array('Update', 3));

On another note, one of the things I don't like about gettext is that the text is embedded into the application all over the place. That means that the team responsible for the primary translation (usually English) has to have access to the project source code to make changes in all the places the default statements are placed. It's almost as bad as applications that have SQL spaghetti-code all over.

So, it makes sense to use keys like _('error.404_not_found') which then allow the content writers and translators to just worry about the PO/MO files without messing in the code.

However, in the event that a gettext translation doesn't exist for the given key then there is no way to fall back to a default (like you could with a custom handler). This means that you either have the writter mucking around in your code - or have "error.404_not_found" shown to users that don't have a locale translation!

In addition, I am not aware of any large projects which use PHP's gettext. I would appreciate any links to well-used (and therefore tested), systems which actually rely on the native PHP gettext extension.

回答1:

Gettext uses a binary protocol that is quite quick. Also the gettext implementation is usually simpler as it only requires echo _('Text to translate');. It also has existing tools for translators to use and they're proven to work well.

You can store them in a database but I feel it would be slower and a bit overkill, especially since you'd have to build the system to edit the translations yourself.

If only you could actually cache the lookups in a dedicated memory portion in APC, you'd be golden. Sadly, I don't know how.



回答2:

For those that are interested, it seems full support for locales and i18n in PHP is finally starting to take place.

// Set the current locale to the one the user agent wants
$locale = Locale::acceptFromHttp(getenv('HTTP_ACCEPT_LANGUAGE'));

// Default Locale
Locale::setDefault($locale);
setlocale(LC_ALL, $locale . '.UTF-8');

// Default timezone of server
date_default_timezone_set('UTC');

// iconv encoding
iconv_set_encoding("internal_encoding", "UTF-8");

// multibyte encoding
mb_internal_encoding('UTF-8');

There are several things that need to be condered and detecting the timezone/locale and then using it to correctly parse and display input and output is important. There is a PHP I18N library that was just released which contains lookup tables for much of this information.

Processing User input is important to make sure you application has clean, well-formed UTF-8 strings from whatever input the user enters. iconv is great for this.

/**
 * Convert a string from one encoding to another encoding
 * and remove invalid bytes sequences.
 *
 * @param string $string to convert
 * @param string $to encoding you want the string in
 * @param string $from encoding that string is in
 * @return string
 */
function encode($string, $to = 'UTF-8', $from = 'UTF-8')
{
    // ASCII is already valid UTF-8
    if($to == 'UTF-8' AND is_ascii($string))
    {
        return $string;
    }

    // Convert the string
    return @iconv($from, $to . '//TRANSLIT//IGNORE', $string);
}


/**
 * Tests whether a string contains only 7bit ASCII characters.
 *
 * @param string $string to check
 * @return bool
 */
function is_ascii($string)
{
    return ! preg_match('/[^\x00-\x7F]/S', $string);
}

Then just run the input through these functions.

$utf8_string = normalizer_normalize(encode($_POST['text']), Normalizer::FORM_C);

Translations

As Andre said, It seems gettext is the smart default choice for writing applications that can be translated.

  1. Gettext uses a binary protocol that is quite quick.
  2. The gettext implementation is usually simpler as it only requires _('Text to translate')
  3. Existing tools for translators to use and they're proven to work well.

When you reach facebook size then you can work on implementing RAM-cached, alternative methods like the one I mentioned in the question. However, nothing beats "simple, fast, and works" for most projects.

However, there are also addition things that gettext cannot handle. Things like displaying dates, money, and numbers. For those you need the INTL extionsion.

/**
 * Return an IntlDateFormatter object using the current system locale
 *
 * @param string $locale string
 * @param integer $datetype IntlDateFormatter constant
 * @param integer $timetype IntlDateFormatter constant
 * @param string $timezone Time zone ID, default is system default
 * @return IntlDateFormatter
 */
function __date($locale = NULL, $datetype = IntlDateFormatter::MEDIUM, $timetype = IntlDateFormatter::SHORT, $timezone = NULL)
{
    return new IntlDateFormatter($locale ?: setlocale(LC_ALL, 0), $datetype, $timetype, $timezone);
}

$now = new DateTime();
print __date()->format($now);
$time = __date()->parse($string);

In addition you can use strftime to parse dates taking the current locale into consideration.

Sometimes you need the values for numbers and dates inserted correctly into locale messages

/**
 * Format the given string using the current system locale
 * Basically, it's sprintf on i18n steroids.
 *
 * @param string $string to parse
 * @param array $params to insert
 * @return string
 */
function __($string, array $params = NULL)
{
    return msgfmt_format_message(setlocale(LC_ALL, 0), $string, $params);
}

// Multiple choices (can also just use ngettext)
print __(_("{1,choice,0#no errors|1#single error|1<{1, number} errors}"), array(4));

// Show time in the correct way
print __(_("It is now {0,time,medium}), time());

See the ICU format details for more information.

Database

Make sure your connection to the database is using the correct charset so that nothing gets currupted on storage.

String Functions

You need to understand the difference between the string, mb_string, and grapheme functions.

// 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"
$char_a_ring_nfd = "a\xCC\x8A";

var_dump(grapheme_strlen($char_a_ring_nfd));
var_dump(mb_strlen($char_a_ring_nfd));
var_dump(strlen($char_a_ring_nfd));

// 'LATIN CAPITAL LETTER A WITH RING ABOVE' (U+00C5)
$char_A_ring = "\xC3\x85";

var_dump(grapheme_strlen($char_A_ring));
var_dump(mb_strlen($char_A_ring));
var_dump(strlen($char_A_ring));

Domain name TLD's

The IDN functions from the INTL library are a big help processing non-ascii domain names.



回答3:

There are a number of other SO questions and answers similar to this one. I suggest you search and read them as well.

Advice? Use an existing solution like gettext or xliff as it will save you lot's of grief when you hit all the translation edge cases such as right to left text, date formats, different text volumes, French is 30% more verbose than English for example that screw up formatting etc. Even better advice Don't do it. If the users want to translate they will make a clone and translate it. Because Localisation is more about look and feel and using colloquial language this is usually what happens. Again giving and example Anglo-Saxon culture likes cool web colours and san-serif type faces. Hispanic culture like bright colours and Serif/Cursive types. Which to cater for you would need different layouts per language.

Zend actually cater for the following adapters for Zend_Translate and it is a useful list.

  • Array:- Use PHP arrays for Small pages; simplest usage; only for programmers
  • Csv:- Use comma separated (.csv/.txt) files for Simple text file format; fast; possible problems with unicode characters
  • Gettext:- Use binary gettext (*.mo) files for GNU standard for linux; thread-safe; needs tools for translation
  • Ini:- Use simple INI (*.ini) files for Simple text file format; fast; possible problems with unicode characters
  • Tbx:- Use termbase exchange (.tbx/.xml) files for Industry standard for inter application terminology strings; XML format
  • Tmx:- Use tmx (.tmx/.xml) files for Industry standard for inter application translation; XML format; human readable
  • Qt:- Use qt linguist (*.ts) files for Cross platform application framework; XML format; human readable
  • Xliff:- Use xliff (.xliff/.xml) files for A simpler format as TMX but related to it; XML format; human readable
  • XmlTm:- Use xmltm (*.xml) files for Industry standard for XML document translation memory; XML format; human readable
  • Others:- *.sql for Different other adapters may be implemented in the future


回答4:

I'm using the ICU stuff in my framework and really finding it simple and useful to use. My system is XML-based with XPath queries and not a database as you're suggesting to use. I've not found this approach to be inefficient. I played around with Resource bundles too when researching techniques but found them quite complicated to implement.

The Locale functionality is a god send. You can do so much more easily:

// Available translations
$languages = array('en', 'fr', 'de');

// The language the user wants
$preference = (isset($_COOKIE['lang'])) ?
    $_COOKIE['lang'] : ((isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) ?
        Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']) : '');

// Match preferred language to those available, defaulting to generic English
$locale = Locale::lookup($languages, $preference, false, 'en');

// Construct path to dictionary file
$file = $dir . '/' . $locale . '.xsl';

// Check that dictionary file is readable
if (!file_exists($file) || !is_readable($file)) {
    throw new RuntimeException('Dictionary could not be loaded');
}

// Load and return dictionary file
$dictionary = simplexml_load_file($file);

I then perform word lookups using a method like this:

$selector = '/i18n/text[@label="' . $word . '"]';
$result = $dictionary->xpath($selector);
$text = array_shift($result);

if ($formatted && isset($text)) {
    return new MessageFormatter($locale, $text);
 }

The bonus for my system is that the template system is XSL-based which means I can use the same translation XML files directly in my templates for simple messages that don't need any i18n formatting.



回答5:

Stick with gettext, you won't find a faster alternative in PHP.

Regarding the how, you can use a database to store your catalog and allow other users to translate the strings using a friendly gui. When the new changes are reviewed/approved, hit a button, compile a new .mo file and deploy.

Some resources to get you on track:

  • http://code.google.com/p/simplepo/
  • http://www.josscrowcroft.com/2011/code/php-mo-convert-gettext-po-file-to-binary-mo-file-php/
  • https://launchpad.net/php-gettext/
  • http://sourceforge.net/projects/tcktranslator/


回答6:

What about csv files (which can be easily edited in many apps) and caching to memcache (wincache, etc.)? This approach works well in magento. All languages phrases in the code are wrapped into __() function, for example

<?php echo $this->__('Some text') ?>

Then, for example before new version release, you run simple script which parses source files, finds all text wrapped into __() and puts into .csv file. You load csv files and cache them to memcache. In __() function you look into your memcache where translations are cached.



回答7:

In a recent project, we considered using gettext, but it turned out to be easier to just write our own functionality. It really is quite simple: Create a JSON file per locale (e.g. strings.en.json, strings.es.json, etc.), and create a function somewhere called "translate()" or something, and then just call that. That function will determine the current locale (from the URI or a session var or something), and return the localized string.

The only thing to remember is to make sure any HTML you output is encoded in UTF-8, and marked as such in the markup (e.g. in the doctype, etc.)



回答8:

Maybe not really an answer to your question, but maybe you can get some ideas from the Symfony translation component? It looks very good to me, although I must confess I haven't used it myself yet.

The documentation for the component can be found at

http://symfony.com/doc/current/book/translation.html

and the code for the component can be found at

https://github.com/symfony/Translation.

It should be easy to use the Translation component, because Symfony components are intended to be able to be used as standalone components.



回答9:

On another note, one of the things I don't like about gettext is that the text is embedded into the application all over the place. That means that the team responsible for the primary translation (usually English) has to have access to the project source code to make changes in all the places the default statements are placed. It's almost as bad as applications that have SQL spaghetti-code all over.

This isn't actually true. You can have a header file (sorry, ex C programmer), such as:

<?php
define(MSG_404_NOT_FOUND, 'error.404_not_found')
?>

Then whenever you want a message, use _(MSG_404_NOT_FOUND). This is much more flexible than requiring developers to remember the exact syntax of the non-localised message every time they want to spit out a localised version.

You could go one step further, and generate the header file in a build step, maybe from CSV or database, and cross-reference with the translation to detect missing strings.



回答10:

have a zend plugin that works very well for this.

<?php
/** dependencies **/
require 'Zend/Loader/Autoloader.php';
require 'Zag/Filter/CharConvert.php';

Zend_Loader_Autoloader::getInstance()->setFallbackAutoloader(true);

//filter
$filter = new Zag_Filter_CharConvert(array(
    'replaceWhiteSpace' => '-',
    'locale' => 'en_US',
    'charset'=> 'UTF-8'
));

echo $filter->filter('ééé ááá 90');//eee-aaa-90
echo $filter->filter('óóó 10aáééé');//ooo-10aaeee

if you do not want to use the zend framework can only use the plugin.

hug!