Trivia: in “i18n”, 18 is the number of letters between the “i” and the “n” in “internationalization”.
Most traditional character encoding standards are 8-bit and can represent only 256 different characters. During internationalization this becomes a bottleneck, since 256 characters cannot accommodate every character of every writing system. The solution to this problem is the adoption of a universal character encoding: Unicode.
Unicode is
a) an industry standard designed
b) to bring together texts and symbols from all the writing systems of the world by
c) providing a unique number (not a glyph) for every character.
This means that Unicode represents each character as a number (a code point), and the underlying application renders the character (symbol, font, size, or shape) with some rendering/mapping algorithm.
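The idea that a character is just a number can be seen directly in Java. The following is a minimal sketch; the class name CodePointDemo and the helper toCodePoints are made up for illustration. It prints the code point of each character in the conventional U+XXXX notation.

```java
public class CodePointDemo {

    // Formats every Unicode code point of the text as U+XXXX, separated by spaces.
    // (Hypothetical helper, not part of any library.)
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Latin capital A, e with acute accent, euro sign (written as escapes
        // so the example does not depend on the source file encoding).
        System.out.println(toCodePoints("A\u00E9\u20AC")); // prints U+0041 U+00E9 U+20AC
    }
}
```

The numbers say nothing about how the characters look; choosing a glyph is left to the font and the rendering layer.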
There are several possible representations of Unicode data, defined by the Unicode Transformation Formats (UTF). A UTF is an algorithmic mapping from every Unicode code point to a unique byte sequence.
Several UTF algorithms are available, such as UTF-8, UTF-16, and UTF-32, but the preferred character encoding in web environments is UTF-8, which is
a) a variable-length character encoding able to represent any character in the Unicode standard, and
b) consistent with ASCII in its first 128 byte values and character assignments, though not with Latin-1, because characters greater than 127 are encoded differently.
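The two properties above can be checked with a few lines of Java. This is only a sketch; the class Utf8BytesDemo and its hex helper are illustrative names. It dumps the bytes produced for the same characters under different encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8BytesDemo {

    // Returns the bytes of the text in the given charset as upper-case hex pairs.
    // (Hypothetical helper, not part of any library.)
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII characters keep the same single byte in UTF-8.
        System.out.println(hex("A", StandardCharsets.US_ASCII));       // 41
        System.out.println(hex("A", StandardCharsets.UTF_8));          // 41
        // Characters above 127 differ between Latin-1 and UTF-8.
        System.out.println(hex("\u00E9", StandardCharsets.ISO_8859_1)); // E9
        System.out.println(hex("\u00E9", StandardCharsets.UTF_8));      // C3 A9
        // Variable length: the euro sign needs three bytes in UTF-8.
        System.out.println(hex("\u20AC", StandardCharsets.UTF_8));      // E2 82 AC
    }
}
```

This is why a page that is written as UTF-8 but read as Latin-1 shows two garbage characters in place of every accented letter.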
Enabling full internationalization in a typical Java web system:
A typical flow looks something like this:
Client <–> Internet <–> Web Server <–> Application Server <–> DBMS
Let’s look at each layer and enable i18n in it.
Client: Web browsers (such as Internet Explorer, Mozilla Firefox, Safari, and Opera) represent the client side of a web application. The best way to tell a browser about UTF-8 encoding is to put the character-set information in the HTTP Content-Type response header, for example: Content-Type: text/html; charset=utf-8
Web Server:
1. Most Java-based web servers use the default encoding of the operating system, which is exposed in the system property file.encoding.
This property is usually defined as
a) ISO-8859-1 on Unix-based systems or
b) Cp1252 on Windows systems.
To ensure UTF-8 support, the file.encoding property has to be redefined during system startup.
2. Apache2 on Windows NT uses UTF-8 for all filename encodings. Otherwise, it is recommended to change the Tomcat/JBoss startup script (run/catalina) and add the switch -Dfile.encoding=UTF-8 to the startup call to the JVM, so that the HTTP response encoding defaults to UTF-8. This default can still be overridden within the Java Servlet code as needed.
3. Static hypertext documents should include the following at the top of the <head> section:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
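Before changing any startup script, it can help to check what the JVM is actually using, as discussed in point 1. The sketch below (class and method names are illustrative) prints the file.encoding system property and the resulting default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {

    // True when the JVM default charset is UTF-8.
    // (Hypothetical helper, not part of any library.)
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        // file.encoding is the system property the server inherits at startup.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        if (!defaultIsUtf8()) {
            System.out.println("Default charset is not UTF-8; consider setting it at JVM startup.");
        }
    }
}
```

Running this once with and once without the startup switch shows whether the setting took effect. Note that recent JDKs (18 and later) already default to UTF-8, so the output depends on the Java version as well as on the operating system.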
Application Server: Application servers are programs that sit between the web server and the backend business applications or databases.
1. Java files do not require any UTF-8 configuration, whereas JSP files enable UTF-8 encoding with a page directive at the top of the file that includes the pageEncoding and contentType attributes:
<%@ page contentType="text/html;charset=utf-8" pageEncoding="utf-8" %>
This page directive should also be used in all JSP files that are included with the <jsp:include> tag (not the <%@ include %> directive).
2. Moreover, if a JSP file contains an (X)HTML <head> tag, it should include the UTF-8 meta tag:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
3. When the sendRedirect() method is used, query string parameters should be encoded with the java.net.URLEncoder.encode() method, passing UTF-8 as the encoding.
4. HTML forms should declare the character set with the accept-charset attribute (the enctype attribute does not take a charset parameter):
<form action="processData.jsp" method="post" enctype="multipart/form-data" accept-charset="UTF-8">
The form above submits its data in UTF-8.
5. A filter must be implemented to set the request character encoding before the form parameters are read.
6. Java dictionary files (message bundles) are key-value files used to look up internationalized data. They do not provide a mechanism for indicating their encoding.
Therefore they have to be converted manually. Java comes with a native2ascii converter, which takes an -encoding switch to indicate the encoding of the source file, followed by the name of the source file and the name of the target file:
native2ascii -encoding UTF-8 SourceFile TargetFile
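To make the conversion in point 6 less mysterious: native2ascii essentially leaves ASCII characters alone and rewrites every other character as a \uXXXX escape. The sketch below imitates that idea in plain Java. AsciiEscaper and escapeNonAscii are made-up names, and the code is an approximation of the tool, not a replacement for it.

```java
public class AsciiEscaper {

    // Keeps ASCII characters as they are and replaces every other UTF-16
    // code unit with a backslash-u escape followed by four hex digits.
    // (Hypothetical helper, only an approximation of native2ascii.)
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A message bundle entry with one accented character.
        String line = "greeting=Caf" + (char) 0xE9;
        System.out.println(escapeNonAscii(line));
    }
}
```

The escaped form is pure ASCII, so it survives being read with any ASCII-compatible encoding. On Java 9 and later, ResourceBundle reads .properties files as UTF-8 by default and the native2ascii tool is no longer shipped with the JDK, so this step mainly matters for older runtimes.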
Database: Database management systems (DBMS) require character-set information when a new database or table is created.
1. For databases that don’t support UTF-8 by default, the default character set has to be defined explicitly, for example in MySQL’s configuration file (my.ini).
2. Database drivers usually require extra configuration, for example when connecting to a MySQL database using a Java Database Connectivity (JDBC) driver:
Connection db = DriverManager.getConnection("jdbc:mysql://localhost/myDatabase?useUnicode=true&characterEncoding=utf-8", "username", "password");