Language issues
Part of an article on Building a Multilingual Web Service Using the ACS, by John Lowry (lowry@arsdigita.com)
Translating site content into different languages is one of the most
tangible requirements for building a multilingual site. What is less
obvious are all the steps involved in getting content translated. In
this section, we describe how to determine which language a user
prefers; how to build a message catalog of text strings that get
displayed; how to build a data model with language-dependent columns;
how to do language-aware sorting; and how to manage the translation
process.
Language negotiation
A web site chooses the language in which to serve its content by a
process known as language negotiation. This process begins with finding
out which language a user prefers. Here are some possible ways to do this:
- Preference determined from the HTTP request
The HTTP standard includes an Accept-Language header which the
client can send to the server as part of a request for a web
page. Here is a request sent by a web browser, which includes a
language preference:
GET /index.adp HTTP/1.0
User-Agent: Mozilla/4.61 [en] (X11; I; Linux 2.2.12-20 i686) default
Host: www.arsdigita.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en-GB, en
Accept-Charset: iso-8859-1,*,utf-8
This user prefers British English and accepts other types of English.
An advantage of using the Accept-Language header is that it can
contain a list, thereby increasing the likelihood of a preferred
language being available. This header, however, may not always be the
most convenient or reliable way to choose a user's language. The
user's preferred languages may not be available on our web site, or
the user may have incorrectly configured his language preference in
his browser.
- Preference specified by user on first visit to the site
The first time a user visits a site he is prompted to select a
language from the list available. This causes a cookie to be stored on
the user's local computer, which the user's browser sends back to the server
on every subsequent request. A user's language preference will be
lost if he moves to a different computer or deletes his cookie file.
- Preference determined by page that links to the site
A user's language preference can be encoded in a URL. For example,
http://host/en/index.tcl would specify that the contents
be served in English. In order for this method to work, all links to
the site must have the language identifier embedded correctly. The
page that contains the link lives on a remote server so it would be
hard to ensure that this method works reliably. Users that preferred
different languages would not be able to share URLs because their
language preferences would be included.
- Preference specified by user at registration
A user can select a language preference when he registers for the
site. Whenever he returns to the site the language can be ascertained
from his login token. On the surface, this seems almost identical to
the cookie mechanism described above. However, the implementation of
the filter that populates the ad_locale data structure is quite
different. It is inexpensive to get the value of a cookie. However, it
would not be possible to retrieve the user's language preference from
the database without a more significant loss in performance. Therefore
we memoize the code that creates the user's preferences. If a user
changes his preferred language in the database, the change will not be
recognized until the cache is flushed or the server is restarted.
In practice, the method used to discover a user's language preference
will depend on the requirements of the site. What works for one site
may be inappropriate for another. In some cases, the cookie solution
will turn out to be best. In others, however, we will need an
algorithm to select a default choice if the language specified by a
user's Accept-Language header is not available. For now, let's just
assume that we have written a procedure that will implement whatever
solution we have chosen.
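As an illustration of what such a procedure might do with the Accept-Language header, here is a minimal sketch. It is written in Python rather than AOLserver Tcl so that it can be run on its own, and it is not the ACS implementation. The function names, the fallback from a regional tag such as en-GB to its primary language, and the default of en are all assumptions made for the example.

```python
def parse_accept_language(header):
    """Return the language tags in an Accept-Language header, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        quality = 1.0  # a tag without an explicit q value counts as q=1
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed q value as "not wanted"
        if quality > 0:
            weighted.append((-quality, position, tag))
    # Highest quality first; ties keep the order used in the header.
    return [tag for _, _, tag in sorted(weighted)]


def negotiate_language(header, available, default="en"):
    """Pick the first preferred language that the site can serve."""
    for tag in parse_accept_language(header):
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # assumption: en-gb may fall back to en
        if primary in available:
            return primary
    return default  # assumption: the site has a default language


print(negotiate_language("en-GB, en", {"en", "fr"}))
print(negotiate_language("de, fr;q=0.8", {"en", "fr"}))
print(negotiate_language("ja", {"en", "fr"}))
```

With a site that serves only English and French, the header from the request shown earlier resolves to en, a German speaker who also accepts French gets fr, and a user whose only preference is unavailable gets the default.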
A site may not be able, or may not choose, to serve content in a
user's preferred language. In some cases, a site may wish to serve a
page in multiple languages. For example, part of the content may be in
a user's preferred language and part may be in the language of the
group that owns the page. We need to provide an API for
determining in which language to serve a page that can be
sensitive to the page context.
Because we need to run this procedure before serving each page on the
site, it is convenient to return other details besides the language to
use. For example, we may want to return other properties that can be
used to localize the content on a web page, such as timezone and
locale. For this reason, our procedure is separate from the language API we
describe later. Here is its signature and a
short description of how it is invoked:
- ad_locale context property
- This procedure returns the value of a locale-specific property
within the context of a web request. Possible values for context are
user (the current logged-in user) or community (the
group that owns the requested web page). Possible values for property
are language, locale or timezone.
A programmer will need to call the ad_locale procedure
before any content is generated in response to a request. For
performance reasons, we populate the data structure that lies behind
ad_locale the first time a user requests a page or a group's content
is accessed.
We accomplish this by using a pre-authorization filter
that runs for every web page that gets requested. For each possible context,
the filter creates a global variable containing a set of properties. Adding new contexts or properties is easily done by either adding a new global variable or an additional key to the set of properties.
The programmer can determine a user's language as follows:
set user_lang [ad_locale user language]
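To make the shape of this data structure concrete, here is a minimal Python sketch of one set of properties per context. It is only an illustration: the real filter builds these values per request in AOLserver, and the sample property values and the error handling below are assumptions.

```python
# Hypothetical stand-in for the per-context property sets built by the filter.
LOCALE_PROPERTIES = {
    "user": {"language": "fr", "locale": "fr", "timezone": "Europe/Paris"},
    "community": {"language": "en", "locale": "en", "timezone": "America/New_York"},
}


def ad_locale(context, prop):
    """Return one locale-specific property for the given context."""
    try:
        return LOCALE_PROPERTIES[context][prop]
    except KeyError:
        raise ValueError(f"unknown context or property: {context} {prop}") from None


print(ad_locale("user", "language"))
print(ad_locale("community", "timezone"))
```

Adding a new context corresponds to adding a top-level entry, and adding a new property corresponds to adding a key to each inner set.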
Message catalog
In a static web site, we can translate the content on a page by page
basis. But on a dynamic site, the content on a page can be different
each time it is displayed, because it is generated by a program. In
order to translate this content, we need to divide it up into shorter
elements that are at the same level of granularity as handled by the
program.
A message catalog is a list of the text strings that the application
will display. For example, if the program needs to display a phrase
such as Welcome Philip Greenspun, the string that gets stored
in the message catalog is Welcome. The name Philip
Greenspun is dynamically generated from the database.
We maintain a message catalog for each language that the
web site uses. A programmer can look up a string in the appropriate message
catalog using a key. On a typical web page
there will be dozens of strings that need to be displayed. The catalogs
need to be very efficient at doing lookups to return the translated
content.
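The idea can be sketched in a few lines of Python. This is a toy catalog held in a dictionary, not the ACS code; the key name welcome and the French entry are made up for the example.

```python
# One catalog per language, each mapping a key to the translated string.
CATALOG = {
    "en": {"welcome": "Welcome"},
    "fr": {"welcome": "Bienvenue"},
}


def lookup(lang, key):
    """Return the catalog string for this key in this language."""
    return CATALOG[lang][key]


def greeting(lang, full_name):
    # The static text comes from the catalog; the name comes from the database.
    return f"{lookup(lang, 'welcome')} {full_name}"


print(greeting("en", "Philip Greenspun"))
print(greeting("fr", "Philip Greenspun"))
```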
A typical web page contains the text to display interposed with HTML
tags. Ideally, each chunk of text between HTML tags should be a single
message in the catalog. This makes it simpler to separate the content
from the markup tags so that translators don't have to enter HTML
when they make their entries in the catalog. A web page designer that
is using a message catalog cannot expect to have the same freedom to
arrange content as he would in a site that was displayed in only one
language. He faces the following constraints:
- Content retrieved from the database cannot be interposed with
content that comes from the message catalog. Imagine a shopping site
that is selling different types of wine, such as red,
white and rose. The types are stored in the database and
then displayed on a web page. A page designer must be careful not to
specify a content string such as red wine where the first word
is taken from the database and the second is a message catalog
lookup. In French, this would display as rouge vin which is an
incorrect translation. It should be vin rouge.
- For best performance, we need to ensure that there are as few
message catalog lookups as possible per web page. One way to do this is
to use the templating module,
supplied as part of the ACS toolkit, which can cache partial pages
under certain conditions. We can thus ensure that the only parts of
the page that require lookup are those that are dynamically generated
for each request.
- Designers should strive to have big chunks of text between HTML
tags so that there will be as few catalog lookups as possible per
page. Each catalog lookup requires, at a minimum, a procedure call and
a lookup in a hash table. If performance becomes critical, we
could avoid message catalog lookups entirely by coding a different web
page for each language. However, this becomes much more expensive to
maintain.
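The word-order constraint in the wine example can be demonstrated with a short Python sketch. The key naming scheme (wine.red) is hypothetical; the point is only that a whole phrase stored under one key lets the translator control word order, while gluing separately translated words together does not.

```python
# Word-by-word translations, glued together in English word order.
WORDS = {
    "en": {"red": "red", "wine": "wine"},
    "fr": {"red": "rouge", "wine": "vin"},
}

# Whole phrases stored as single catalog entries (hypothetical key names).
PHRASES = {
    "en": {"wine.red": "red wine"},
    "fr": {"wine.red": "vin rouge"},
}


def glued_label(lang, wine_type):
    """Combine two separately translated words: wrong order in French."""
    return f"{WORDS[lang][wine_type]} {WORDS[lang]['wine']}"


def phrase_label(lang, wine_type):
    """Look up the complete phrase so that each language keeps its own order."""
    return PHRASES[lang][f"wine.{wine_type}"]


print(glued_label("fr", "red"))
print(phrase_label("fr", "red"))
```

The first call prints rouge vin and the second prints vin rouge.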
Let's now look at our proposed implementation for a message catalog
using AOLserver, Oracle and Tcl. First, here is the data
model. Messages are stored in the lang_messages table:
CREATE TABLE lang_messages (
key VARCHAR(200),
locale REFERENCES ad_locales,
message CLOB,
-- if the message is "registered" then it exists in a file
-- so we should not allow editing via the web interface
registered_p CHAR(1) CHECK(registered_p IN ('t','f')),
PRIMARY KEY (key, locale)
);
Each message has a unique key and locale combination. You might be
surprised to see that we claim to be storing messages for each
language, but the column is actually named locale rather than
language. A locale can specify language, country and
dialect. Specifying dialect, however, is needlessly fine-grained for most web
sites. We would likely not want to provide different translations for
countries with the same language such as England and the United States.
In fact, we have chosen to populate the ad_locales table
with locale identifiers that omit the dialect and country and use just
a language identifier as its primary key. Examples include en
(English) or fr (French). Consequently, everywhere else
within the web site that we need to specify a locale in a database
table, we will use the same level of granularity. A full list of
possible language codes, which we can use in the primary key column of
the ad_locales table, is defined in the ISO 639 standard.
The procedures for inserting and retrieving messages to
and from the catalog are described below:
- ad_lang_message_lookup lang key
- This procedure retrieves a message from the catalog given a key and a
locale. We provide _ (underscore) as a synonym for
ad_lang_message_lookup, so that our API follows a naming convention
similar to that used by GNU's gettext tools.
- ad_lang_message_register lang key message
- This procedure inserts a message into the catalog. Callers to this
procedure need to specify a key and a locale for the message. We provide
_mr as a synonym for ad_lang_message_register.
The implementation of ad_lang_message_lookup needs to pay great
attention to performance, since this could be called many times per
page request. The messages are cached in the server's memory in a set
structure. There is one set for each language. Each message lookup
thus requires a procedure call and a set lookup, but the overhead
of a database query is normally avoided once the data structure has
been initialized.
How does a page designer include translated strings in the HTML for a
web page? A programmer can use the procedure above to return a string
within a Tcl script. For example, the code below retrieves the message
with key hello_world for the French language:
_ fr hello_world
However, we also need to provide an interface to the message
catalog for web page designers that are not programmers. Since most
modern web servers have a mechanism for specifying custom markup tags,
the natural choice for our interface is to provide web page designers
with a custom tag for displaying a translated string. We name our new
tag <TRN> and it is used as follows:
<TRN symbolic="hello_world">Hello world</TRN>
When the server's parser encounters this tag, it returns the results of a
message catalog lookup for the key hello_world in the current
user's language. We can configure the language to depend on the
context in which this page is served. By default, the context is
user but we can set it using the type property. In the following
case, the ADP parser returns the result of the catalog lookup in the
language of the group that owns the web page:
<TRN type="community" symbolic="hello_world">Hello world</TRN>
The string between the opening and closing tags is not printed out on the
web page that gets displayed to the user. It serves two
purposes. First, it allows the web page designer to identify what is
going to get printed out to the user. Second, it is a method for
entering a translation for the specified key into the message
catalog.
The first time the tag is encountered by the server, the content of the tag
is automatically registered in the Oracle database and the server's cache.
By default, English is assumed to be the language of the
translation. We could specify a different language
for the message text that gets registered with the following code:
<TRN lang="fr" symbolic="hello_world">Bonjour monde</TRN>
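The behavior of the tag can be approximated with a small Python sketch. It uses regular expressions instead of the ADP parser, so it is only a model of the behavior described above. The context languages, the pre-loaded French entry, and the choice to fall back to the tag body when no translation exists are assumptions.

```python
import re

# Toy catalog and per-context languages (hypothetical sample values).
catalog = {"fr": {"hello_world": "Bonjour monde"}}
context_language = {"user": "fr", "community": "en"}

TRN_TAG = re.compile(r"<TRN\b([^>]*)>(.*?)</TRN>", re.IGNORECASE | re.DOTALL)
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def expand_trn(page):
    """Replace each <TRN> tag with the catalog entry for its key."""

    def replace(match):
        attrs = {name.lower(): value for name, value in ATTRIBUTE.findall(match.group(1))}
        key = attrs["symbolic"]
        body = match.group(2)
        # Register the tag body the first time this key is seen.
        # The body is assumed to be English unless lang says otherwise.
        source_lang = attrs.get("lang", "en")
        catalog.setdefault(source_lang, {}).setdefault(key, body)
        # Serve the message in the language of the requested context.
        target_lang = context_language[attrs.get("type", "user")]
        return catalog.get(target_lang, {}).get(key, body)

    return TRN_TAG.sub(replace, page)


print(expand_trn('<TRN symbolic="hello_world">Hello world</TRN>'))
print(expand_trn('<TRN type="community" symbolic="hello_world">Hello world</TRN>'))
```

The first call serves the French entry to the user and registers the English body as a side effect; the second call serves that English entry to the community context.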
There is another way that an entry can be added to the message
catalog. This is by calling the ad_lang_message_register
procedure described earlier. Programmers can do this within a Tcl
library that gets sourced on server startup. This must be done for all
messages that are not present in ADP pages. Examples would include the
results of procedures that return strings that get displayed on a web
page. These procedures should return a message catalog key that can
get translated into the user's language before display on a page.
A few problems can arise with this message catalog. The code that
registers the message can be contained within conditional statements
and never be executed. In that case, it will not be stored in the
database and flagged for translation. Or a programmer could
write a procedure that returns a key to the message catalog without
ensuring that the message has been registered. Programmers must
therefore be careful to ensure that all their messages have
been registered properly. The design of our system does not implement
a mechanism to enforce proper registration of messages in the catalog.
An additional problem is that there is no automatic means for a
programmer to ensure that a message catalog key is unique. It is
possible to have different ADP pages and Tcl libraries with different
messages provided for the same key. The programmers who write the HTML
pages and the Tcl code must be careful not to duplicate keys. We help
ensure this by using a unique prefix to the key for each module so
that a programmer can have his own namespace. For example, the events
module will have all message keys begin with the prefix events. so that
they will never conflict with the global keys.
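The prefix convention, together with a check that our system does not perform, can be sketched in Python as follows. The function name, its arguments, and the DuplicateKeyError exception are hypothetical; the sketch only shows how a module prefix keeps keys apart and how a conflicting registration could be detected.

```python
class DuplicateKeyError(Exception):
    """Raised when a key is registered twice with different text."""


catalog = {}


def register_message(lang, module, name, message):
    """Register a message under a key prefixed with the module name."""
    key = f"{module}.{name}"
    existing = catalog.get((lang, key))
    if existing is not None and existing != message:
        raise DuplicateKeyError(f"{key} is already registered for {lang}")
    catalog[(lang, key)] = message
    return key


events_key = register_message("en", "events", "title", "Upcoming events")
news_key = register_message("en", "news", "title", "Latest news")
print(events_key, news_key)
```

Both modules use the short name title, but the resulting keys events.title and news.title do not collide.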
Data Model
Columns in the database may need to be translated into
each language that will be used on the web site. The approach we use
to solve this problem is the same in each case. We split all tables
into language-dependent and language-independent tables.
As an example, we will look at the country codes table which is part
of the core ACS toolkit. Here is a version of the original table.
CREATE TABLE country_codes (
iso CHAR(2) PRIMARY KEY,
country_name VARCHAR(150) NOT NULL,
telephone_code VARCHAR(10)
);
The iso column is the primary key and is thus included in each
table so that we can join on this column. The telephone code, which is
the country's international dialing code, is language independent
because it is the same in each locale. For example, the international
dialing code for the UK is 44, whatever country you are
in. The country name is language dependent: United
States in English becomes Etats-Unis in French. Here are
the tables created after we have split country_codes into
language-dependent and language-independent parts. By convention, we've used
the suffix _data to denote a language-independent table and
_lang to denote the language-dependent table.
CREATE TABLE country_codes_data (
iso CHAR(2) PRIMARY KEY,
telephone_code VARCHAR(10)
);
CREATE TABLE country_codes_lang (
iso REFERENCES country_codes_data,
locale REFERENCES ad_locales,
country_name VARCHAR(150) NOT NULL,
UNIQUE(locale, iso)
);
As a convenience, we create a view with the same name as the original
table.
CREATE OR REPLACE VIEW country_codes AS
SELECT ccd.iso,
ccd.telephone_code,
ccl.locale,
ccl.country_name
FROM country_codes_data ccd,
country_codes_lang ccl
WHERE ccd.iso = ccl.iso;
Using this view, we need only make a small modification to our
database queries to ensure that the results appear in the correct
language. Before introducing the multilingual data model, we might have a query
that looked like this.
SELECT country_name FROM country_codes;
We now replace this query with the following which ensures that the
country name is translated.
SELECT country_name FROM country_codes WHERE locale = '$user_locale';
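The split tables and the view can be tried out with the following Python sketch, which uses an in-memory SQLite database instead of Oracle. The column types are adjusted for SQLite, the ad_locales foreign key is left out, and the sample rows are made up. Passing the locale as a bound parameter, rather than interpolating it into the SQL text, is a choice of this sketch.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE country_codes_data (
    iso CHAR(2) PRIMARY KEY,
    telephone_code VARCHAR(10)
);
CREATE TABLE country_codes_lang (
    iso CHAR(2) REFERENCES country_codes_data,
    locale VARCHAR(30),
    country_name VARCHAR(150) NOT NULL,
    UNIQUE (locale, iso)
);
CREATE VIEW country_codes AS
    SELECT ccd.iso, ccd.telephone_code, ccl.locale, ccl.country_name
    FROM country_codes_data ccd, country_codes_lang ccl
    WHERE ccd.iso = ccl.iso;
INSERT INTO country_codes_data VALUES ('us', '1'), ('gb', '44');
INSERT INTO country_codes_lang VALUES
    ('us', 'en', 'United States'), ('us', 'fr', 'Etats-Unis'),
    ('gb', 'en', 'United Kingdom'), ('gb', 'fr', 'Royaume-Uni');
""")


def country_names(user_locale):
    """Return the country names translated into the given locale."""
    rows = db.execute(
        "SELECT country_name FROM country_codes WHERE locale = ? ORDER BY iso",
        (user_locale,),
    )
    return [name for (name,) in rows]


print(country_names("en"))
print(country_names("fr"))
```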
The example table we have chosen, country_codes, is not subject to
any insert, update or delete statements in the ACS
toolkit. We populate this table when we load the data model and do not
need to provide an interface to change a country name in the
database.
However, in many cases a user of the web site will be permitted to
modify language dependent columns of tables. All the insert, update
and delete statements that affect these tables will need to be
modified so that they refer to the user's locale when selecting the
rows of the table that need to be modified. In some cases, when a row
is inserted or updated, it will be necessary to create or change
translations for every language. An example where this may be
required is the user_groups table. Lists of groups
appear on various pages in the ACS. If a new group is created, it will
be necessary to have the group name translated so that it can be read
in each language.
Language-aware sorting
Lists of textual data displayed on a web page often need to be sorted
in alphabetical order. But each language can use a different sorting
sequence for characters. Oracle provides the NLSSORT function that
carries out linguistic sorting. An example using this function for a
Spanish language sort is shown below.
SELECT key
FROM testsort
ORDER BY NLSSORT(key, 'NLS_SORT = XSpanish');
Rather than find and replace each query that uses an ORDER BY, we
originally wrote all queries to use a Tcl procedure, ad_lang_sort,
that returns an appropriate call to NLSSORT for a
particular database column and language.
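A procedure of this kind might look like the following Python sketch. It is not the ACS Tcl code. The mapping from language codes to Oracle sort names is an assumption, apart from XSpanish which appears in the query above, and so are the BINARY fallback and the check on the column name.

```python
import re

# Assumed mapping from language code to an Oracle NLS_SORT name.
NLS_SORT_BY_LANGUAGE = {
    "es": "XSpanish",
    "fr": "French",
    "de": "German",
}


def ad_lang_sort(column, lang):
    """Return an NLSSORT expression for use in an ORDER BY clause."""
    # Only accept plain column names, since the result is spliced into SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_.]*", column):
        raise ValueError(f"not a valid column name: {column}")
    nls_sort = NLS_SORT_BY_LANGUAGE.get(lang, "BINARY")  # assumed fallback
    return f"NLSSORT({column}, 'NLS_SORT = {nls_sort}')"


query = f"SELECT key FROM testsort ORDER BY {ad_lang_sort('key', 'es')}"
print(query)
```

For Spanish this reproduces the ORDER BY clause of the hand-written query shown above.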
Translation
It is easy to forget that someone must go through a multilingual web
site and translate every single item of content.
We need to provide an interface that can be used by translators to ensure that
all necessary parts of the web site are translated into each
language.
First, we need to record all database tables that contain columns that
need to be translated. We do this in this table:
CREATE TABLE lang_translate_columns (
column_id INTEGER PRIMARY KEY,
on_which_table VARCHAR2(50),
on_which_column VARCHAR2(50),
required_p CHAR(1)
CHECK(required_p in ('t','f')),
UNIQUE (on_which_table, on_which_column)
);
We can use the required_p flag to indicate whether all entries
in a column must be translated for the site to function. Ideally, we
would also maintain dependency information so that we automatically
identify things that need to be retranslated, though this is not done
in the data model shown.
We also need to maintain a list of all entries in the message catalog
and flag the ones which require translation. It would be possible to
present a translator with a web page which simply listed each string
in the catalog and provide a text input box for entering the
translation. However, it can be hard to provide a good translation
when a message is shown out of context.
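Flagging the entries that still need translation amounts to comparing the keys of two catalogs. Here is a minimal Python sketch, using a dictionary as a stand-in for the lang_messages table; the sample keys and the function name are made up.

```python
# Toy catalog: English is complete, French is missing one entry.
catalog = {
    "en": {"hello_world": "Hello world", "events.title": "Upcoming events"},
    "fr": {"hello_world": "Bonjour monde"},
}


def untranslated_keys(source_lang, target_lang):
    """List keys that exist in the source language but not in the target."""
    source = catalog.get(source_lang, {})
    target = catalog.get(target_lang, {})
    return sorted(key for key in source if key not in target)


print(untranslated_keys("en", "fr"))
print(untranslated_keys("en", "de"))
```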
A better way to do this is to provide a special view of the web site
to translators. We can modify the ad_lang_message_lookup procedure
to display a hyperlink to a translation page beside each message. A
translator can simply browse the web site and click on the appropriate
link to get a form to enable him to translate a string.
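A lookup procedure with such a translator mode might be sketched like this in Python. The URL /admin/translate, the translator_mode argument and the link text are hypothetical; the sketch only shows a lookup that decorates its result with a link carrying the language and key.

```python
from html import escape
from urllib.parse import urlencode

catalog = {("fr", "hello_world"): "Bonjour monde"}


def lookup(lang, key, translator_mode=False):
    """Return a message, followed by a translation link for translators."""
    message = catalog.get((lang, key), key)  # assumption: fall back to the key
    if not translator_mode:
        return message
    # Hypothetical translation page that receives the language and the key.
    url = "/admin/translate?" + urlencode({"lang": lang, "key": key})
    return f'{escape(message)} <a href="{escape(url)}">[translate]</a>'


print(lookup("fr", "hello_world"))
print(lookup("fr", "hello_world", translator_mode=True))
```

Ordinary visitors see only the message, while a translator browsing the same page sees each message followed by a link to its translation form.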
Code example
Below is an example of an AOLserver Tcl page that creates an entry in
the message catalog for four languages: English, French, Spanish and
German. We use Babelfish as a quick and dirty way to provide
translations for our original message text which is Hello
world. The script displays a web page showing the translations
taken from the message catalog:
# procedure to translate strings using Babelfish
proc babel_translate { msg lang } {
set marker "XXYYZZXX. "
set qmsg "$marker $msg"
set url "http://babel.altavista.com/translate.dyn?doit=done&BabelFishFrontPage=yes&bblType=urltext&url="
set babel_result [ns_httpget "$url&lp=$lang&urltext=[ns_urlencode $qmsg]"]
set result_pattern "$marker (\[^<\]*)"
set msg_tr "** Babelfish TRANSLATION ERROR **"
regexp -nocase $result_pattern $babel_result ignore msg_tr
regsub "$marker." $msg_tr "" msg_tr
return [string trim $msg_tr]
}
# set a test message and add it to the catalog
set msg "Hello world"
_mr en "hello_world" $msg
# add the translations to the catalog
_mr fr "hello_world" [babel_translate $msg en_fr]
_mr es "hello_world" [babel_translate $msg en_es]
_mr de "hello_world" [babel_translate $msg en_de]
# return a web page that displays the translations
ns_return 200 text/plain "
English: [_ en hello_world]
Français: [_ fr hello_world]
Español: [_ es hello_world]
Deutsch: [_ de hello_world]"
If you have the code described in this article,
you can run the above Tcl script to get the following results:
English: Hello world
Français: Bonjour monde
Español: Hola mundo
Deutsch: Hallo Welt
We have to wait for three requests to Babelfish before the web page
is returned, so this page might be quite slow.
More information
The Accept-Language header section of the HTTP specification
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4
GNU gettext tools http://www.gnu.org/manual/gettext/html_mono/gettext.html
AOLserver NSV sets http://aolserver.com/doc/3.0/nsv.txt
ACS templating module http://www.arsdigita.com/doc/templates/
Apache Content Negotiation http://www.apache.org/docs/content-negotiation.html
Go Translator http://translator.go.com/
Babelfish http://babel.altavista.com/
Reader's Comments
Using an automatic translator like babelfish is very useful, but you will need to warn the user as the translation is shocking!
I have tried viewing Japanese pages through these online translators and they are often absolutely meaningless (and very funny!) in English. For example here's Yahoo Japan in English.
Also here in Japan the use of web based translators is already very common, particularly Excite's. Be careful that you offer the original language also, and not forcing the user to see your page through a translator.
If you want to look professional then you would be better off using a software localization company. And while I'm about it I will shamelessly plug the company I work for in Tokyo! Intersoft who offer localization into Japanese. :)
-- Matthew Lock, February 4, 2001