Character Set Encoding
Part of an article on Building a Multilingual Web Service Using the ACS, by Henry Minsky (hqm@ai.mit.edu)
Overview of Character Set Handling in AOLserver and ACS
The ISO-8859-1 character set, also known as ISO Latin 1, can handle
most characters in Western European languages. You can avoid character
set problems if you are building a multilingual web site that only
uses these languages, provided you use a programming language and
database that handle ISO Latin 1 correctly. AOLserver's embedded
scripting language, Tcl, uses Unicode internally in recent
versions. Versions of AOLserver after the ad5 release can
automatically convert between ISO Latin 1 and Unicode when data is
transferred to and from the client.
If, however, you need to use languages from outside Western Europe or
you need to use characters that are not part of ISO Latin 1, such as
the Euro currency symbol, you will need to deal with character
sets. The following areas may require character set conversion:
- Loading data from the filesystem into AOLserver/Tcl
- Exchanging data between AOLserver and Oracle
- Delivering content to web browsers
- Processing text input from users
We will discuss each of these areas, and then describe a character set
encoding management scheme that handles content in the ACS in a uniform way.
Unicode in AOLserver and ACS
Our solution to managing different character sets is to convert
content to Unicode as soon as possible, and keep it in Unicode for as
long as possible. Unicode is a character set that can represent most
of the world's writing systems. By using a Unicode-centric approach,
we reduce the complexity of trying to manage content in many different
character set encodings throughout the system.
AOLserver 3.0 uses Tcl 8.3 as its internal scripting engine. Tcl 8.3
represents strings internally in Unicode using the UTF-8 encoding. Tcl 8.3 has
support for conversion between about 30 common character set
encodings, and new encodings can be added to the system library if
needed.
In addition, ArsDigita has augmented the current AOLserver 3.0 API
with functions which perform character
set encoding conversions on HTTP connection data.
Unfortunately, Unicode is only beginning to be widely supported. So
for the near future, content will have to be delivered to browsers in
legacy encodings, publishers will want to author content in specific
encodings, and developers will want to edit their code using their
favorite tools which only support certain encodings. In making
decisions about how we manage content and character set encodings in
ACS, we have to decide how accommodating to be towards these different
"legacy" users of the system, while trying to reduce the complexity of
the publishing environment.
It is important to understand that encoding a text string as Unicode
does not relieve us of the task of representing the language or
languages of the content it contains. Unicode, by design, does not
attempt to represent what language a string of characters belongs to.
When we need to know what language a string contains, say to sort it
correctly or to present the correct user interface in a specific
language, we must implement some mechanism to associate a language or
locale with the string. That is the job of the internationalization
and localization facilities of the system. The Language section of this article describes the
approach for representing and storing this language and locale
information for database fields and text strings.
Definitions
- Character
- A character is an abstract entity, such as "LATIN
CAPITAL LETTER A" or "JAPANESE HIRAGANA
KA".
- Coded Character Set
- A Coded Character Set (CCS) is a mapping from a set of characters to a
set of integers, as defined in RFC 2277 and RFC 2130.
- Character Encoding Scheme
- A Character Encoding Scheme (CES) is a mapping from a CCS (or several CCSs)
to a set of bytes, as defined in RFC 2277 and RFC 2130. UTF-8
is an example of a character encoding scheme.
- Character Set
- This document uses the term character set or
charset to mean a set of rules for mapping from
a sequence of bytes to a sequence of characters, such as
the combination of a coded character set and a character
encoding scheme; this is also what is used as an
identifier in MIME "charset=" parameters, and
registered in the IANA charset registry.
In this document we will use the
terms charset and encoding somewhat interchangeably;
using "encoding" when referring to
Tcl/AOLServer character set encoding conversions, and
"character set" when talking about the MIME and
HTTP content-type information.
- ISO-8859-1
- ISO-8859-1, also known as ISO Latin 1, is the
default character set for HTML. It is a single-byte
encoding, and can be used to represent text in most Western European
languages. ISO-8859-1 is a superset of 7-bit ASCII.
- Unicode
- Unicode defines a coded character set
(also known as UCS, the Universal Character Set) which encompasses most of
the world's writing systems. The Unicode Standard, Version 3.0, is code-for-code identical with International Standard ISO/IEC 10646.
- UTF-8
- UTF-8 is an example of a common character encoding scheme for
representing Unicode. UTF-8 uses a variable length encoding for the
character values, where a single character can be represented by from
one to six bytes. UTF-8 has some features which make it convenient
for representing Unicode on today's operating systems. One of the
primary features of UTF-8 is backward compatibility with ordinary
7-bit ASCII text. ISO-8859-1 text, however, is not valid UTF-8: its
characters above 127 use the high bit of the byte, which UTF-8 reserves
for marking multi-byte sequences, so such text must be converted rather
than passed through unchanged.
Loading Data from the Filesystem into AOLserver
While using Unicode internally to encode strings, Tcl uses a default
"system" encoding to communicate with the operating system. That is,
it will convert between UTF-8 and the system encoding for passing
string data back and forth from the underlying operating system; data
such as filenames and other system call arguments and return values.
This description of the system encoding process comes from the Tcl
Internationalization How-To:
Tcl attempts to determine the system encoding during initialization
based on the platform and locale settings. Tcl usually can determine a
reasonable default system encoding based on these settings, but if for
some reason it cannot, it uses ISO 8859-1 as the default system
encoding.
You can override the default system encoding with the encoding
system command. The internationalization how-to recommends that
you avoid using this command if at all possible. If you set the
default system encoding to anything other than the actual encoding
used by your operating system, Tcl will likely find it impossible to
communicate properly with your operating system.
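For reference, here is a minimal Tcl sketch of inspecting (and, only if genuinely
necessary, overriding) the system encoding; the euc-jp value is purely illustrative.
puts "system encoding: [encoding system]"   ;# e.g. "iso8859-1" on many Unix hosts
puts "known encodings: [encoding names]"
# Overriding the system encoding is discouraged (see above); shown for completeness only.
# encoding system euc-jp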
When dealing with text strings in international character sets, you
must take care when you use Tcl or AOLserver facilities which
communicate with the operating system. Trying to create a file with a
filename containing Japanese characters on a Solaris machine, for
example, would cause trouble, since the filename cannot be represented
directly in the operating system's character set. In this particular
case an extra layer of encoding using a 7-bit compatible encoding such
as ISO-2022-JP might be used, but you would have to make
sure you explicitly encoded and decoded strings from this character
set when they were passed to the operating system. Windows NT, by
contrast, uses Unicode internally and so will support such
filenames. To avoid portability issues, restricting filenames to 7-bit
ASCII is recommended.
Character Set Encoding in Tcl Script Files
Whenever Tcl attempts to load a script or library file from disk into
the interpreter for evaluation, it must convert that file into UTF-8
internally. To do that correctly, it must be told what character set
encoding the source file is in.
There is one exception to this process: during server bootstrap, the .tcl
startup files are read directly with no encoding conversion, so we must
use UTF-8 encoding for these files.
ACS and AOLserver load Tcl files in a number of places. At server
startup, Tcl library files are loaded from certain directories in a
bootstrap process, and then the ACS package loader takes over. At
runtime, .tcl files may be sourced dynamically to service URL requests
from HTTP clients.
In the common case, a Tcl file is loaded using the source command, which
reads the file in using the Tcl system encoding. That will generally be ISO-8859-1, unless someone
has explicitly set it differently. However, when developing applications
in other languages, we may want to author Tcl files containing text strings
in Unicode UTF-8 or in other character sets. In that case we must explicitly
tell Tcl what encoding conversion to use when loading the file.
In order to load a file in another encoding, Japanese EUC in this
example, the following Tcl code can be used to set the file channel
encoding:
# open the script and read it through a euc-jp channel encoding
set fd [open "app.tcl" r]
fconfigure $fd -encoding euc-jp
set jpscript [read $fd]
close $fd
# the string is now in Tcl's internal (UTF-8) form; evaluate it
eval $jpscript
In general, of course, we don't want web site developers to be doing
this kind of conversion manually on individual files. Some approaches
to a framework for providing automatic encoding conversion will be
discussed in the ACS Encoding
Management section.
Character Set Encoding in HTML Files
AOLserver is capable of delivering static content files from disk
directly to a browser with no character set conversion or
interpretation. This may be useful in some cases; however, the general
ACS content-processing methodology is to allow various dynamic
transformations on static content. This could mean appending general
comments, headers and footers, or other dynamic modifications. In any
cases where content will be loaded into Tcl, the file must be
converted into UTF-8 internally, and thus we need a way to determine
the file's source encoding before loading it.
Thus, the same encoding questions which apply to Tcl script files also
apply to HTML or any other content that gets loaded into Tcl
internally for processing; if the file is not in the system encoding,
then how do we specify its encoding to the ACS, so it can be
converted to Unicode correctly when it is read from disk?
See the sidebar on HTML Character Entity
References for ISO-Latin-1 and Unicode Characters for information
on displaying characters from international character sets within an
HTML document.
Character Set Encoding in Template (ADP) Files
The AOLserver ADP parser assumes that a template is in UTF-8. If the
content has been loaded into a Tcl string, then it will already be in
UTF-8. Of course the proper encoding conversions must have been done
to get it into a legal UTF-8 string in the first place. However, if
ns_adp_parse is reading a file directly from disk (with the -file option),
then we must ensure that the ADP file is in UTF-8 encoding.
The ACS currently has several different mechanisms for handling
templates. There's the Dynamic Publishing
system, the new Document API, and various project-specific
templating mechanisms in use on different sites.
We need a mechanism for the ACS which lets
us specify the source language and encoding of template documents, and
also possibly independently specify the desired output character set
for a template document. The required encoding conversions can then
be performed automatically by the template handling section of the
ACS request processor.
Delivering Content to Browsers
When a URL is requested from the web server, via an HTTP GET or POST
request, the web server returns the requested content, along with some
HTTP/MIME headers. The most important header is the Content-Type,
which has a common value of text/html for HTML documents. Here is a
stripped-down HTTP response to a GET request from an ACS server:
HTTP/1.0 200 OK
MIME-Version: 1.0
Content-Type: text/html
The Content-Type header can contain a parameter which specifies the
character set, such as in the example below. Adding a Content-Language
header is a good thing to do as well, as it gives the browser more
information on how best to present the document.
HTTP/1.0 200 OK
MIME-Version: 1.0
Content-Type: text/html; charset=euc-jp
Content-Language: ja
The character set parameter tells the client what encoding the content
is in. According to the HTTP specifications, if no character set
parameter is sent, the client should assume it is in
ISO-8859-1. In practice, however, many servers of non-ISO-8859-1 content
neglect to send the character set information, and either the browser
must guess the correct encoding or the user must manually set the
browser to whatever looks like the correct encoding.
Note on Browser Autodetect of Character Set Encodings
Particularly with respect to the Asian languages, there may be
multiple encodings in common use for documents. Japanese for example
has ShiftJIS, EUC, ISO-2022-JP, and possibly Unicode encodings. Many
browsers have an auto-select mode where they will try to guess the
charset encoding of a document. There are algorithms which can quickly
determine if a document is unambiguously in a particular Japanese or
Chinese encoding.
To allow browsers to properly render our content, we must make sure
that a character set encoding parameter is always sent with every
web page we serve.
In many cases the author of content in a non-ISO-8859-1 character set
will include an HTML META tag in the document body which specifies the
content type and character set. For example:
<meta http-equiv="content-type" content="text/html;charset=x-sjis">
Many browsers will use the META tag to help determine the page
encoding, although if it conflicts with a charset specified in the
Content-Type header, then the results will be unpredictable.
The following HTTP GET from Yahoo Japan's web site shows that they
included no character set information in the HTTP header, nor was any
present in the HTML content itself via a META tag. The
encoding of the document was in fact Japanese EUC, but the browser
must guess this heuristically in order to correctly render the page.
telnet www.yahoo.co.jp 80
Trying 210.140.200.16...
Connected to www.yahoo.co.jp.
Escape character is '^]'.
GET / HTTP/1.0
HTTP/1.0 200 OK
Content-Type: text/html
Content-Length: 21673
Managing the Output Encoding
AOLserver has several pathways for returning data to the HTTP connection.
Serving files directly from disk can be done verbatim; the file is
sent byte-for-byte with no translation performed by the
server. The AOLserver
encoding API addresses those cases, and describes
how to configure the MIME content type tables to help identify the
character set encoding to a client. In that case no encoding
translation is required.
The AOLserver API functions for returning content to a network stream are:
ns_writefp
ns_connsendfp
ns_returnfp
ns_respond
ns_returnfile
ns_return (and variants like ns_returnerror)
ns_write
We extended the AOLserver API to support explicit specification of a
character set encoding conversion for the output stream. For example,
ns_return has been modified to inspect the Content-Type header, and if it
finds a charset parameter, it will perform encoding conversion to that
character set:
ns_return 200 "text/html; charset=euc-jp" $html
If no character set is specified, the output will default to the init
parameter ns/parameters/OutputCharset, or to ISO-8859-1 if no explicit
default is specified.
The new API function ns_startcontent is used to explicitly set the
conversion on the network output stream to a given charset encoding.
It is used in conjunction with ns_write to manually set the output
encoding for a document:
ReturnHeaders "text/html; charset=euc-jp"
ns_startcontent -charset "euc-jp"
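As a usage sketch (the Japanese string is illustrative): once the output
encoding has been set this way, subsequent ns_write calls are converted from
Tcl's internal UTF-8 representation to EUC-JP on the way out.
# continuing the example above; \u6771\u4eac is "Tokyo" in Kanji
ns_write "<html><head><title>\u6771\u4eac</title></head><body>...</body></html>"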
Why not just send Unicode?
At this point you might ask if we could just encode all output in
Unicode (UTF-8), thus relieving us of the task of deciding which
output encoding to use. Ultimately that would be the most portable way
for all documents to be delivered, greatly relieving the burden of
tracking the numerous redundant character sets in use today.
However, Unicode is still a relatively new standard, and many browsers
and other tools can not work with it yet. So we must provide support
to encode content in the character sets that are in common use today.
Given this requirement, we must decide how we are going to build an
API for developers and content publishers to specify the desired
output encoding for individual or entire classes of documents.
Exchanging Data between AOLserver and Oracle
Text data can be passed between Oracle and Tcl without requiring any character
set conversion if the database is configured to use UTF-8
encoding. Ideally, the UTF-8 character encoding is specified when the
CREATE DATABASE command is run, though it is also possible to change a
database's encoding after it has been created.
You need to set the character encoding as an environment variable when
you start up your Oracle client process. If you are using the
multi-threaded server (MTS) option, the client processes are started
when the Oracle server is started. If, however, you have configured your
Oracle server to have a dedicated server-side process for each client,
you can explicitly set its character set encoding as UTF-8 in a
start-up script such as the following:
#!/bin/sh
. /etc/shell-mods.sh
TCL_LIBRARY=/home/aol30/lib/tcl8.3
export TCL_LIBRARY
NLS_LANG=AMERICAN_AMERICA.UTF8
export NLS_LANG
TZ=GMT
export TZ
exec `dirname $0`/nsd8x-i18n $*
The AOLserver nsd8x-i18n executable referenced above is a version of
AOLserver compiled with the ArsDigita extensions for character set
encoding support described in Character Encoding in AOLserver 3.0 and ACS.
You must also use a version of ArsDigita's Oracle
driver with the patch for LOB fetches and variable width
character sets (version 2.2 or later):
$ strings /home/aol30/bin/ora8.so | grep ArsDigita
ArsDigita Oracle Driver version 2.2
If your driver does not have this support, you may get the following
error when sending multibyte characters from AOLserver to Oracle:
[05/Jun/2000:23:20:04][5128.11][-conn1-] Error: ora8.c:1398:ora_exec:error in `OCIStmtExecute()':
ORA-03127: no new operations allowed until the active operation ends
When loading the ACS data model into Oracle, the following script
ensures that the character set is explicitly set to ISO-8859-1, so
that any high-bit characters in the SQL code are properly converted to
UTF-8.
#!/bin/sh
NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
export NLS_LANG
exec sqlplus $* < load-data-model.sql
Processing Text Input from Users
Arbitrary text data can be posted to the server in HTTP POST and GET
requests. Unfortunately the HTTP protocol is very weak in the area of
annotating user submitted data with character set information. This creates
some problems in unambiguously converting user-submitted data into
a common UTF-8 encoding.
Form data can arrive at the server in three ways:
- Query data in the URL path query string
- Form (POST) data in application/x-www-form-urlencoded encoding
- Form (POST) data in multipart/form-data encoding
URL Query String Data
A URL can contain arbitrary data in the query section of the path (the
part after the first '?' character). Imagine a web server receives a
GET request for a URL from a browser which looks like this:
http://hqm.arsdigita.com/i18n/examples/form-1.tcl?mydata=%C5%EC%B5%FE
Looking at just the structure of this URL, we cannot determine which
charset the variable mydata is encoded in. We will have to examine the
content itself and make our best guess. It happens that this was the
Kanji for "Tokyo" encoded in Japanese EUC, but it could be a legal
string in any number of character sets.
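To make the ambiguity concrete, here is a small Tcl illustration, assuming we
already have the four raw bytes in hand; the same bytes decode to entirely
different text depending on which charset we assume.
set bytes "\xC5\xEC\xB5\xFE"
# interpreted as Japanese EUC, this is the Kanji for "Tokyo"
puts [encoding convertfrom euc-jp $bytes]
# interpreted as ISO-8859-1, it is four unrelated accented Latin characters
puts [encoding convertfrom iso8859-1 $bytes]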
The HTTP specification mandates that text data in a URL query be
URL-encoded (hence the %XX escape sequences), but it does not say
anything about how to specify which character set is being used.
In practice we have to rely on other contextual knowledge to figure out
what character set query data are encoded in. Most browsers look at the
character set declared in the HTTP header of the last document they
retrieved, or at its META tag if there is one, and set their current
character set to match; in that sense, the browser's character set is
set automatically to that of the last document viewed (usually the form
on which the user clicks 'submit'). However, the user can usually
override that choice by setting the encoding manually from a menu. If we
have some way, implicit or explicit, of knowing the encoding of the form
from which the GET or POST came, then we can properly decode the query
strings.
Note on Autodetect of Character Set Data from Browsers
Under some circumstances a user's browser may POST or GET data in an
encoding which is different from that which your application
expects. At that point it will be impossible for AOLserver to
correctly convert the data into UTF-8. In the case of Japanese, it is
usually possible to heuristically guess the encoding, because the
common Japanese encodings have enough redundancy that in even a small
sample of text, some code sequences can be unambiguously assigned to a
particular character set. Several software libraries are available to
heuristically detect Japanese character sets. In order to use this
mechanism for character set detection it would be necessary to modify
the AOLserver URL and form data decoding process by adding a call-out
to a user-configurable character set detection library.
Handling Simple URL and Form Data
When a simple HTTP GET or POST request comes in from a browser, the
query and form data have been URL encoded. URL encoding is a content
transfer encoding in which a single byte may be encoded as a '%'
character followed by two hexadecimal digits, such as %CF. The encoding
is designed to be safe for 7-bit transmission channels, using only a
common subset of the ASCII printing characters. The request data needs
to be decoded back into bytes, using the methods described above to
discover the character set in use by the client browser.
Composing URL-Encoded Strings
In developing an application you may need to create hyperlinks in your
documents and encode data into the URL. The W3C recommends that UTF-8
be used to encode data within a URL. However, we have provided an API
to allow different character encodings within URLs. The ns_urlencode
function has been extended to accept a character set argument. Thus,
you could call:
ns_urlencode -charset shift_jis $your_data
See
Character
Encoding in AOLserver 3.0 and ACS for more details on this API
function.
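As a brief usage sketch building on the extended ns_urlencode described above
(the page name and form variable are illustrative), a Shift_JIS-encoded link
could be composed like this:
# encode the data in Shift_JIS before splicing it into the query string
set encoded [ns_urlencode -charset shift_jis $your_data]
set href "form-1.tcl?mydata=$encoded"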
Decoding Form Data
The AOLserver encoding API defines the ns_urlcharset command, which can
be used to set the automatic decoding of submitted form data from a
specified character set to UTF-8. In the current API, ns_urlcharset
must be called before form data is requested from AOLserver for the
first time:
ns_urlcharset "euc-jp"
...
...
ad_page_contract { foo } {
Allow user to submit foo data
}
By default, form data is treated as if it were in the ISO-8859-1
encoding. This default can be changed with a configuration parameter
(the URLCharset setting described below).
Another new API function, ns_formfieldcharset, allows you to designate a
field in the form data as containing the name of the form's character
set encoding. This value is then extracted and used as if it had been
passed to ns_urlcharset.
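A hedged sketch of how this might be used (the exact argument convention for
ns_formfieldcharset is an assumption based on the description above): the form
carries a hidden field naming its own charset, and the server reads that field
before decoding the rest of the form data.
# the form includes a hidden field such as:
#   <input type=hidden name=formcharset value="euc-jp">
# designate that field as the source of the form's charset, then read the data
ns_formfieldcharset formcharset
set mydata [ns_queryget mydata]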
At this time the chance of getting correct language or character set
HTTP header information from browsers submitting user data is quite low.
Neither Internet Explorer 5 nor Netscape 4 specifies the character set
for either GET or POST requests. Even if
browsers do send character set headers, it would be unwise to depend
on them since they may be incorrect. Thus, we are not going to suggest
trying to use browser HTTP language or charset header information to
decode user submitted data. Since HTTP and HTML have up to now been
primarily developed in either the English speaking world or in single
language communities, the internationalization protocols for HTTP have
been loosely adhered to at best.
Deducing the correct character set in which to interpret user
submitted data is largely a matter of setting up your application
context so that you already know what language you expect, as much as
possible.
Multipart form submission
An HTML form which contains the enctype=multipart/form-data attribute
will encode the POSTed form as a MIME-style multipart message.
RFC 2388 says:
4.5 Charset of text in form data
Each part of a multipart/form-data is supposed to have a Content-
Type. In the case where a field element is text, the charset
parameter for the text indicates the character encoding used.
For example, a form with a text field in which a user typed 'Joe owes
€100', where € is the Euro symbol, might have form data returned
as:
--AaB03x
Content-Disposition: form-data; name="field1"
Content-Type: text/plain;charset=windows-1250
Content-Transfer-Encoding: quoted-printable
This would seem to be a great help in allowing us to find the
character set encoding of submitted form data. Unfortunately, most
browsers do not obey this standard and will not provide any
Content-Type header at all with the parts of a multipart form
submission, much less character set information.
Below is the trace of a POST request from Internet Explorer 5.0 from
a form which is encoded in Japanese EUC. Note that the only
information provided in the multipart/form-data entry for the form
variable mydata is the name of the form field itself. No
character set or content-transfer-encoding field was sent. Note that even
though the page containing the form was being viewed in Japanese, the
browser sent an Accept-Language header stating that it accepts only English.
<form method=post action=mpform-1.tcl enctype=multipart/form-data>
<input size=40 type=text name=mydata value=\"[util_quotehtml $mydata]\">
<input type=submit>
</form>
POST /i18n/examples/mpform-1.tcl HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, ...
Accept-Language: en-us
Content-Type: multipart/form-data; boundary=---------------------------7d02c699c0
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
Host: hqm.arsdigita.com
Content-Length: 139
Pragma: no-cache
-----------------------------7d02c699c0
Content-Disposition: form-data; name="mydata"
[... RAW EUC-JP BYTE DATA HERE ...]
-----------------------------7d02c699c0--
Another warning: because of carelessness or bugs in browser
implementations, if you deliver a form which contains default data for
an input field, the form may come back to the server in a different
encoding than it was sent down in, even if the user does not modify the
input field at all. It is therefore wise to compose forms with a hidden
"reference" field containing a string that can be decoded unambiguously
by a character set auto-detection routine. This provides some protection
against this kind of "floating character-set" problem.
It is wise to use only 7-bit ASCII for the names of form variables.
That way, there is little chance of improperly decoding them when
processing the form submission, since most character set
encodings support 7-bit ASCII as a subset.
Currently, while ns_urlcharset will do automatic charset conversion on
data submitted in application/x-www-form-urlencoded encoding, it does
not do any automatic encoding conversion on data from
multipart/form-data submissions. That is, the ns_urlcharset command will
have no effect if the form enctype was multipart/form-data.
If you want to decode multipart posted data, you will need to call
encoding convertfrom explicitly to convert each string to UTF-8 for use
in your application.
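Here is a hedged sketch of that manual decoding, assuming the form was
submitted in EUC-JP and that the multipart field arrives in the form set
undecoded; the field name is illustrative.
set form [ns_getform]
# raw bytes of the multipart field, untouched by ns_urlcharset
set raw [ns_set get $form "mydata"]
# convert explicitly from the form's charset into Tcl's internal UTF-8
set mydata [encoding convertfrom euc-jp $raw]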
This restriction is currently present in order to allow uploaded file
data to be read without character set conversion being performed. This
is necessary if the form data is going to be stored verbatim directly
into the database in BLOB form, without being interpreted in a
particular character set.
Character Set Management for ACS
As of ACS version 3.3, absolutely every request is served through a
unified request processor (rpp_handler in
/packages/acs-core/request-processor-procs.tcl), described in detail
in the
request
processor documentation.
We can solve the character set conversion problem by extending the request
processor. It needs to be aware of the encodings of source documents
and the
desired output encodings, then automatically perform encoding
conversion where needed. Below is an example of how this
facility would appear to the site developers and publisher.
As a general principle, we need to have a convention for specifying
the source encoding of a document, and for specifying the output
encoding in which we deliver the document to a browser. In the common
case, the source encoding and output encoding for a document will be
the same, but we want to provide an easy way for a page author to
override this. So, for example, it should be possible for a developer
to create a Japanese template file in UTF-8, but have it served to
browsers encoded in ShiftJIS.
For multi-lingual web sites, it is probably more appropriate to try to
provide an API which lets authors work at the "language" level rather
than the character set level. In most cases, the content authors will
want to think in terms of managing content in different languages, and
would like to have the character set encoding issues take care of
themselves as much as possible. However we also want to provide a way
to explicitly specify character set encodings to give developers full
control over how their documents are interpreted.
File Name Conventions for a Multilingual Site
Maintaining a multilingual web site can be made easier by making use
of the
abstract
URL facility to create the logical structure of the site, independent of
language, and then having the system automatically dispatch to the
correct language-specific file for a given user or session. Language
identification is discussed in the
Language section of this article.
In the common case, a document will be requested by a browser using an
abstract URL. We need the request processor to combine this request
with the connection's environment information, based on cookies, user
and session id's, URL pathname, etc., and come up with the desired
target language for the document.
Once we have computed the desired language for a document, the request
processor can be extended to search for the file on
disk which matches the language and perform the necessary encoding
conversions in order to process and deliver the file to the browser.
Here are some techniques for structuring the files on a
multi-lingual web site:
- One Language per Template, with Abstract URLs
-
A site where each source file has an explicit
language. For a given abstract URL, these might be named by
overloading filename extensions, such as:
foo.en_US.html American HTML file
foo.en_GB.html English HTML file
foo.fr.html French HTML file
bar.ja.adp Japanese ADP file
bar.el.adp Greek ADP file
baz.ru.tcl Russian Tcl file
The language code is a language_country pair, using abbreviations defined by the ISO
639 standard for language names and the ISO 3166
standard for country names.
When we are only dealing with a single locale for a given language, we
can shorten the filename suffix from the full locale to just the
language name for convenience, for example fr instead of fr_FR. It has
also been suggested that aliases for file language suffixes be assigned
for convenience, such as .us for en_US and .gb for en_GB.
- Single Fully-Multilingual Template Files
-
The other style of usage is to create multilingual applications
where a single template page serves content in multiple languages,
with all language-specific content generated using the Message Catalog
and Localization API's. In that case, it does not necessarily make
sense to assign a single language to a template or source file.
For a truly multilingual page the source encoding would probably be either
ISO-8859-1 or UTF-8. Since the actual language-specific content will
be drawn from the message catalog or database, the page itself should
be thought of as a language-independent logical template. The output
encoding in which we send the content to the browser, however, will
depend on what language we are serving in the request. So we
must have an API for dynamically computing and setting the output
encoding from within the document request itself.
Automatic Mapping of Language to Character Set
A configuration table can be created to map languages to
character set encodings. This might look like:
en = iso-8859-1 English
fr = iso-8859-1 French
ja = sjis Japanese ShiftJIS
ru = iso-8859-5 Russian
el = iso-8859-7 Greek
An API function called ad_charset_for_language could be used to select
the appropriate character set for a given language.
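Here is a minimal sketch of how such a function might be implemented, assuming
the mapping table above is kept in a Tcl array; the storage mechanism and the
fallback value are illustrative choices, not an existing ACS API.
# language -> charset table, from the mapping above
array set ad_language_charset_map {
    en iso-8859-1
    fr iso-8859-1
    ja sjis
    ru iso-8859-5
    el iso-8859-7
}
proc ad_charset_for_language {lang} {
    global ad_language_charset_map
    if { [info exists ad_language_charset_map($lang)] } {
        return $ad_language_charset_map($lang)
    }
    # fall back to the HTML default charset
    return "iso-8859-1"
}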
It is certainly mandatory that we know what character set a source
file is in so that it can be converted to UTF-8 correctly when it is
loaded into Tcl. We can also use the character set of the file as the default
output encoding when we deliver the final page content to the
browser. This will generally be the correct thing to do. In some cases
the publisher may want to override this manually for some files. For
example, Japanese content may be more easily authored in EUC (or even
Unicode) on Unix systems, but the site may want to deliver the content
to users in the more widely used ShiftJIS encoding. Our system should
allow for configuration options and APIs that let authors do this.
While we are overloading filenames with language info, it might be
useful to be able to specify a character set directly in the same way.
Thus we could simultaneously allow filenames like this:
foo.ej.html Japanese EUC HTML file
foo.sjis.html Japanese ShiftJIS HTML file
foo.iso8859-5.html ISO-8859-5 HTML file
bar.utf8.adp UTF-8 template
bar.iso8859-1.adp ISO-8859-1 template
As long as we use charset names which are distinct from language
names, there should be no conflict. One algorithm for handling an
abstract URL could have the request processor look for a file with an
explicit charset first, and then look for language files, or vice versa.
ACS/AOLserver Platform-specific Changes
The sections below outline some modifications that can be made to
an ACS version 3.x system running on the AOLserver/Tcl platform
to perform some of the character set encoding handling described above.
For more detailed and up-to-the-minute patch kits, see the
ACS I18N Patch Kit and
ACS 3.4.x International Character Set Support v.3.4.5.
Serving Static Files
AOLserver is capable of delivering a document from disk directly to a
web browser, byte-for-byte. In that case, it does not need to know
anything about the character set encoding of the document. If you have
a file stored in a non-ISO-8859-1 encoding, it is possible to have
AOLserver add the correct content-type character-set parameter to the
HTTP header using its built-in MIME-type lookup mechanism. Entries can
be added to the AOLserver MIME-type table in the ns/mimetypes section of
the init file, assigning a MIME type to a given file name extension.
yourserver.ini:
[ns/mimetypes]
Default=text/plain
NoExtension=text/plain
.html=text/html; charset=iso-8859-1
.html_sj=text/html; charset=Shift_JIS
.html_ej=text/html; charset=euc-jp
.tcl_sj=text/plain; charset=Shift_JIS
.tcl_ej=text/plain; charset=euc-jp
.adp_ej=text/html; charset=euc-jp
.adp=text/html; charset=utf-8
This approach will work as long as the data passes directly from the
file through AOLserver and out to the browser, without being loaded
into Tcl. However the ACS generally does do some processing in Tcl on
a file's contents before it delivers it to the browser. For example,
.tcl scripts will need to be loaded by the interpreter and evaluated,
static pages may have generalized comments and system headers or
footers appended to them, and .adp template files need to have a
template parser run on their content, possibly running arbitrary Tcl
code. In all of these cases we must know the source file's encoding
in order to read it properly into a Tcl string.
Serving HTML Files with Dynamic Annotations
On a typical ACS installation, there may be many static HTML files
which are annotated dynamically by the system before they are
delivered to the browser. In these cases, we must know the source encoding
of the files. Using the filename extensions suggested above we can
tell the system what the encoding is, and the request processor
routine that handles static files (rp_handle_html_request) can
be modified to set the correct channel encoding when loading the file.
Patching ad_serve_html_page with the following code will allow HTML files
with alternate encodings to be served. For example, with the ns/mimetypes
shown above, a file foo.html_ej would be loaded into Tcl using the EUC-JP
encoding and, by default, delivered to the browser in that same encoding.
ad-html.tcl:
ad_serve_html_page
set type [ns_guesstype $full_filename]
set encoding [ns_encodingfortype $type]
set stream [open $full_filename r]
fconfigure $stream -encoding $encoding
set whole_page [read $stream]
close $stream
# set the default output encoding to the file mime type
ns_startcontent -type $type
If we want to use the generalized filename convention suggested in the
previous section, where the file's language is encoded as a second
suffix (foo.locale.html), then the code above will need to use an
alternative to the AOLserver built-in ns_guesstype command for mapping a
filename to a MIME type. ns_guesstype only parses out simple filename
extensions, where everything after the last '.' in the filename is
considered to be the extension. We will need to write a function which
parses the language field out of the filename, looks it up in our
language-to-character-set table, and constructs the appropriate MIME type.
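Here is a hedged sketch of such a function; the name is illustrative, and it
relies on the ad_charset_for_language mapping proposed earlier. It splits the
filename on '.', maps the language field to a charset, and builds a MIME type
carrying an explicit charset parameter.
proc ad_guesstype_with_language {filename} {
    set parts [split [file tail $filename] "."]
    set n [llength $parts]
    if { $n >= 3 } {
        # e.g. foo.ja.html -> language "ja", extension "html"
        set lang [lindex $parts [expr {$n - 2}]]
        set ext  [lindex $parts [expr {$n - 1}]]
        set charset [ad_charset_for_language $lang]
        # base MIME type for the real extension, with any default charset stripped
        set base [lindex [split [ns_guesstype "x.$ext"] ";"] 0]
        return "$base; charset=$charset"
    }
    # no language field present; defer to the normal lookup
    return [ns_guesstype $filename]
}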
Specifying Source and Output Encoding of Tcl Files
In the same manner as static HTML files, .tcl script filenames can be
annotated with a language or character set extension, and
the request processor routine rp_handle_tcl_request can be modified to
set this channel encoding when loading the file.
For output encoding, rp_handle_tcl_request should by default set the
output character set to whatever the source character set is, but the
developer should be able to explicitly override this at any point by
one or more of the methods below:
- Calling ns_startcontent
- Setting an output Content-Type header with a charset parameter
- Calling ns_return or related functions with an explicit character set
Modifications to the Request Processor
The new API function ns_encodingfortype is used to return the Tcl
encoding to use for a given document MIME type. The following
source_with_encoding function would replace the basic source call in
rp_handle_tcl_request:
proc_doc source_with_encoding {filename} { Loads filename, using the charset encoding
looked up via the ns_encodingfortype command, based on the ns_guesstype MIME
type of the filename. } {
set type [ns_guesstype $filename]
set encoding [ns_encodingfortype $type]
set fd [open $filename r]
fconfigure $fd -encoding $encoding
set code [read $fd]
close $fd
ns_startcontent -type $type
uplevel 1 $code
}
proc_doc rp_handle_tcl_request {} { Handles a request for a .tcl file. } {
global ad_conn
doc_init
rp_eval [list source_with_encoding [ad_conn file]]
if { [doc_exists_p] } {
# The file returned a document. We need to serve it.
rp_eval doc_serve_document
}
}
acs-core/abstract-url-init.tcl:
foreach { type handler } {
tcl rp_handle_tcl_request
tcl_ej rp_handle_tcl_request
adp rp_handle_adp_request
adp_ej rp_handle_adp_request
html rp_handle_html_request
htm rp_handle_html_request
} {
rp_register_extension_handler $type $handler
}
Specifying Source and Output Encoding of Template Files
Using ns_adp_parse directly, template (.adp) files can only be
authored in UTF-8. However, we can modify the request processor to
read the files into Tcl strings first, performing any needed
conversions, and then apply the ADP parser to the string. There
are some issues of efficiency here, but the same issues are manifest in
general caching architectures, so whatever solution we design for
caching template pages should also be able to improve the performance
of dynamically converting the source file charset encoding.
proc_doc rp_handle_adp_request {} { Handles a request for an .adp file. } {
doc_init
set mimetype [ns_guesstype [ad_conn file]]
set encoding [ns_encodingfortype $mimetype]
set fd [open [ad_conn file] r]
fconfigure $fd -encoding $encoding
set template [read $fd]
close $fd
if { ![rp_eval [list ns_adp_parse -string $template] adp] } {
return
}
if { [doc_exists_p] } {
doc_set_property body $adp
rp_eval doc_serve_document
} else {
set content_type [ns_set iget [ns_conn outputheaders] "content-type"]
if { $content_type == "" } {
set content_type [ns_guesstype [ad_conn file]]
} else {
ns_set idelkey [ns_conn outputheaders] "content-type"
}
ns_return 200 $content_type $adp
}
}
Setting the Template's Output Encoding
When evaluating a template file, there should be a well-defined chain
of control for setting the document's output encoding.
The output charset for a template should default to whatever
source charset was computed, as described above.
From that point, code inside the template can override the default
encoding. For example, in an ADP file the output encoding can
be set explicitly by the following means:
META HTTP-EQUIV="Content-Type"
The HTML META HTTP-EQUIV tag can be included in a document as an extra
hint to inform a browser what charset the document is using. From
RFC 2070:
In any document, it is possible to include an
indication of the encoding scheme like the following, as early as
possible within the HEAD of the document: <META
HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=ISO-2022-JP">
This is not foolproof, but
will work if the encoding scheme is such that ASCII-valued octets
stand for ASCII characters only at least until the META element is
parsed. Note that there are better ways for a server to obtain
character encoding information, instead of the unreliable META
above.
You might think that if we are setting the Content-Type header
properly with a charset parameter, then the META tag may be redundant.
However consider what happens if the user downloads and stores the
file on disk or emails it to a friend. In that case, the Content-Type
and other header information will likely be discarded. So it is useful
to annotate your documents this way if possible.
For a multilingual template the content may be output in a variety of
encodings. If we use an HTML META HTTP-EQUIV tag, then we must make
sure it refers to the same charset as we are actually serving the
document in. This means we need an API that lets code in the template
ask what character set is being used.
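As a sketch of what that might look like inside an ADP template
(ad_conn_output_charset is a hypothetical helper standing in for whatever API
is eventually chosen, not an existing function), the template could emit a
META tag that matches the charset actually being served:
<meta http-equiv="Content-Type" content="text/html; charset=<%= [ad_conn_output_charset] %>">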
Setting the Default Encoding for the Entire Site
For a web site which is going to be basically monolingual, but using
a different character set than the default ISO-8859-1, it is possible to set the
default encoding for most cases, using the following
ns/parameters
values. For example, to set the default character set encoding to
Japanese ShiftJIS, add the following to the server.ini file:
[ns/parameters]
HackContentType=1
URLCharset=Shift_JIS
OutputCharset=Shift_JIS
This will cause files which have no explicit character set parameter
in their content-type to be treated as ShiftJIS, both for output conversion
and for user input conversion. To ensure that Tcl
script files are loaded using ShiftJIS encoding, you should explicitly set
the MIME type of .tcl files to contain the chosen character set.
The modified version of the request processor, described above, will then
choose the correct encoding conversion to use when loading the file.
[ns/mimetypes]
.tcl=text/plain; charset=Shift_JIS
.adp=text/plain; charset=Shift_JIS
Modifications to the Document API
There is a newly proposed ACS mechanism for document creation and
delivery, the Document API, which will perform most of the default steps
needed to compose and return a document from the server.
The new Document API performs similar steps to the request processor's
default document handlers, using the doc_serve_document
function. This function will need to be modified in a similar way to
how we modified rp_handle_adp_request
in order to become
"encoding aware". The document's source encoding will need to be looked
up, the channel encoding set before reading it, and the output
encoding set to the correct value.
Modifications to ad_return_template
The ad_return_template function in the ACS is used to tie a Tcl file to
a corresponding template from a template library. ad_return_template has
a model of the user's language, based on either a language preference
from the user's preferences table or a cookie called
language_preference, and a scoring system
to choose the best matching template. Although it currently uses
ns_adp_parse with the -file option, thus
limiting it to UTF-8 templates, the code could be extended to
incorporate the character set lookup mechanisms described above for
properly setting the source and output encoding.
Note that while ad_return_template attempts to score files
based on how well they match the user's preferred language, this
feature needs to be used very deliberately. Giving a visitor a link to
a page in a language that they are not expecting should be regarded as
a serious site design problem, and not merely a graceful degradation
of service.