animesh kumar

Running water never grows stale. Keep flowing!

Java i18n

with one comment

Trivia: in “i18n”, 18 is the number of letters between the “i” and the “n” in “internationalization”.

Basics:

Most traditional character encoding standards are 8-bit and can represent only 256 different characters. During internationalization this becomes a bottleneck, since 256 characters can’t accommodate every possible character. The solution to this problem is the adoption of a universal character encoding: Unicode.

Unicode is
a) an industry standard designed
b) to bring together texts and symbols from all the writing systems of the world by
c) providing a unique number (not glyph) for every character.

This means that Unicode represents a character as a number, and the underlying application renders the character (symbol, font, size, or shape) with some rendering/mapping algorithm.

There are several possible representations of Unicode data indicated by the Unicode Transformation Format (UTF). UTF is an algorithmic mapping from every Unicode code point to a unique byte sequence.
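To make the idea of "one code point, several byte sequences" concrete, here is a minimal sketch using only the standard JDK. The class name CodePointDemo and the helper hex() are made up for illustration; the euro sign is just a convenient sample character.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    // Formats a byte array as space-separated uppercase hex, e.g. "E2 82 AC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // EURO SIGN, written as an escape so the source file encoding does not matter.
        String euro = "\u20AC";

        // The code point is the number Unicode assigns to the character.
        System.out.println("code point: U+" + Integer.toHexString(euro.codePointAt(0)).toUpperCase());

        // Each UTF maps that same number to a different byte sequence.
        System.out.println("UTF-8:    " + hex(euro.getBytes(StandardCharsets.UTF_8)));    // E2 82 AC
        System.out.println("UTF-16BE: " + hex(euro.getBytes(StandardCharsets.UTF_16BE))); // 20 AC
    }
}
```

The character stays the same (U+20AC); only its serialized form changes with the chosen UTF.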

Several UTF algorithms are available, such as UTF-8, UTF-16, and UTF-32, but the preferred character encoding in web environments is UTF-8, which is
a) a variable-length character encoding able to represent any character in the Unicode standard, yet
b) byte-for-byte consistent with ASCII for the first 128 characters, though not with Latin-1, because characters greater than 127 are encoded differently.
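The ASCII/Latin-1 point can be checked with a few lines of Java. This is only a sketch; the class name AsciiCompatDemo and the helper sameBytes() are hypothetical names chosen for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCompatDemo {
    // True when the text encodes to identical bytes in both charsets.
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        String ascii = "i18n";
        String accented = "\u00E9"; // e with acute accent, code point 233

        // Pure ASCII text is byte-identical in UTF-8 and US-ASCII.
        System.out.println(sameBytes(ascii, StandardCharsets.UTF_8, StandardCharsets.US_ASCII)); // true

        // Above 127 the two encodings diverge: one byte in Latin-1, two in UTF-8.
        System.out.println(sameBytes(accented, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)); // false
        System.out.println(accented.getBytes(StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);      // 2
    }
}
```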

Enabling full internationalization in a typical Java web system:

A typical flow will look something like below:

Client <-> Internet <-> Web Server <-> Application Server <-> DBMS
Let’s look at each layer and enable i18n in it.

Client: Web browsers (like Internet Explorer, Mozilla Firefox, Safari, and Opera) represent the client side of a web application. The best way to tell a browser about UTF-8 encoding is by putting the character-set information in the HTTP response header:

Server: Apache-Coyote/1.1
Pragma: No-cache
Cache-Control: no-cache
Expires: <date>
Content-Type: text/html;charset=utf-8
Transfer-Encoding: chunked
Date: <date>

Web Server:

1.      Most web servers use the encoding of the operating system, defined in the system property file.encoding.

This property is usually defined as
a) ISO-8859-1 in unix-based systems or
b) Cp1252 in windows systems.

To ensure UTF-8 support, the file.encoding property has to be redefined during system startup.

2.      Apache2 on Windows NT uses UTF-8 for all filename encodings. Otherwise, the recommendation is to change the Tomcat/JBoss startup script (run/catalina) to add the switch

-Dfile.encoding=UTF-8

to the startup call to the JVM to ensure that the HTTP response encoding will be defaulted to UTF-8. However, this can be overridden within the Java Servlet code as needed.

3.      Static hypertext documents should include, at the top of the <head> section:
<meta http-equiv="content-type" content="text/html; charset=utf-8">

4.      A JavaScript block or file reference should include the charset attribute:
<script src="scriptFile.js" type="text/javascript" charset="utf-8"></script>
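Regarding the file.encoding property from items 1 and 2 above: a quick way to see what a given JVM actually picked up is to print it, and a way to stop depending on it is to name the charset explicitly wherever bytes are produced. The sketch below assumes nothing beyond the standard JDK; DefaultCharsetDemo and toUtf8() are hypothetical names. As far as I know, newer JDKs (18 and later) already default to UTF-8, so the switch matters mostly on older runtimes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Encodes with an explicit charset, so the result does not depend on file.encoding.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What the JVM was started with; changes with -Dfile.encoding=UTF-8 on older JDKs.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // Always 2 bytes for U+00E9, whatever the platform default is.
        System.out.println(toUtf8("\u00E9").length);
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 shows whether the startup script change took effect.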

Application Server: Application servers are programs that sit between web server and backend business applications or databases.

1.      Java files do not require any UTF-8 configuration, whereas JSP files enable UTF-8 encoding by placing a page directive at the top of the file that includes the pageEncoding and contentType attributes:
<%@ page contentType="text/html;charset=utf-8" pageEncoding="utf-8" %>

This page directive should be used in all JSP files that are included with the <jsp:include> tag (not the <%@ include %> page directive).

2.      Moreover, if a JSP file contains an (X)HTML <head> tag, it should also include the UTF-8 meta tag:
<meta http-equiv="content-type" content="text/html; charset=utf-8">

3.      When the sendRedirect() method is used, query string parameters should be encoded with the java.net.URLEncoder.encode() method.
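As a rough sketch of that encoding step, the snippet below builds a redirect target with a UTF-8 percent-encoded parameter. The names RedirectUrlDemo, redirectUrl(), "search.jsp" and "q" are all invented for the example; in a servlet the resulting string would be what gets passed to sendRedirect().

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class RedirectUrlDemo {
    // Percent-encodes one value as UTF-8; UTF-8 is always available, so the
    // checked exception is wrapped rather than propagated.
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds "base?name=value" with both name and value encoded.
    static String redirectUrl(String base, String name, String value) {
        return base + "?" + enc(name) + "=" + enc(value);
    }

    public static void main(String[] args) {
        // Prints: search.jsp?q=caf%C3%A9+cr%C3%A8me
        System.out.println(redirectUrl("search.jsp", "q", "caf\u00E9 cr\u00E8me"));
    }
}
```

Note that URLEncoder produces form-style encoding, so a space becomes "+" rather than "%20".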

4.      HTML forms should include the accept-charset attribute:

<form action="processData.jsp" method="post" accept-charset="utf-8">
……..
</form>

The form above submits its data in UTF-8. A filter must also be implemented to set the request character encoding before the form parameters are read:

request.setCharacterEncoding("UTF-8");
response.setContentType("text/html; charset=UTF-8");
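Why does the encoding have to be set before the parameters are read? If the container falls back to ISO-8859-1 (the usual default for request bodies when nothing is specified), the UTF-8 bytes sent by the browser get decoded with the wrong table. The sketch below simulates that with plain JDK classes; MojibakeDemo, misdecode() and repair() are hypothetical names, and no servlet API is involved.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates a server that decodes UTF-8 form bytes as ISO-8859-1.
    static String misdecode(String submitted) {
        byte[] wire = submitted.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage by re-encoding as ISO-8859-1 and decoding as UTF-8.
    // This works because ISO-8859-1 maps every byte value to exactly one character.
    static String repair(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String submitted = "\u00E9";            // one character typed into the form
        String garbled = misdecode(submitted);  // what the application would see

        System.out.println(garbled.length());                  // 2: one character became two
        System.out.println(repair(garbled).equals(submitted)); // true
    }
}
```

Setting the request encoding in a filter avoids the problem at the source, which is preferable to repairing strings afterwards.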

5.      When a request is submitted through JavaScript or with the form’s “GET” method, multilanguage query string parameters should be encoded with the JavaScript encodeURI method, and so should the URLs in all standard HTML hyperlink tags <a href="">.

6.      Java Dictionary Files (message bundles) are key-value maps used to look up internationalized text. They do not provide a mechanism for indicating the encoding.

Therefore they have to be converted manually. Java comes with a native2ascii converter, which takes an -encoding switch to indicate the encoding of the source file, followed by the name of the source file and the name of the target file:

native2ascii -encoding UTF-8 SourceFile TargetFile
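To show roughly what the converter produces, here is a small sketch that replaces every non-ASCII character with a backslash-u escape and then loads the result the way a properties file would be read. It is an approximation for illustration, not the tool's actual implementation; Native2AsciiSketch, escape() and lookup() are hypothetical names. On JDK 9 and later, where (as far as I know) native2ascii is no longer shipped and property resource bundles are read as UTF-8 by default, this step may be unnecessary.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class Native2AsciiSketch {
    // Replaces every UTF-16 unit above 127 with a backslash-u escape of four hex digits.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    // Parses escaped text as a properties file and returns the value stored under key.
    static String lookup(String escapedProperties, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(escapedProperties));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String line = "greeting=Gr\u00FC\u00DFe"; // a hypothetical bundle entry
        String escaped = escape(line);

        System.out.println(escaped);                                               // pure ASCII output
        System.out.println(lookup(escaped, "greeting").equals("Gr\u00FC\u00DFe")); // true
    }
}
```

The escaped file contains only ASCII, so it reads the same under any ASCII-compatible encoding, and the escapes are turned back into the original characters when the bundle is loaded.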

Database: Database management systems (DBMS) require character-set information when a new database or table is created.

1.      For databases that don’t use UTF-8 by default, the default character set has to be defined, for example in MySQL’s configuration file (my.ini):
default-character-set=utf8

2.      Database drivers usually require extra configuration, for example when connecting to a MySQL database using a Java Database Connectivity (JDBC) driver:
Connection db = DriverManager.getConnection("jdbc:mysql://localhost/myDatabase?useUnicode=true&characterEncoding=utf-8", "username", "password");

Written by Animesh

May 2, 2009 at 5:29 am

Posted in Technology

One Response

  1. We’ve been struggling with an encoding issue for a couple of weeks now and this post made it work correctly! Thank you!

    I used the following line in my catalina start script:
    export JAVA_OPTS="-Dfile.encoding=UTF-8"
    (Linux server)

    Thanks again!

    Erik SVensson

    October 10, 2012 at 6:56 pm

