Unicode development with DbForms, MySQL 4.1, and friends

Developing a Unicode-enabled application with DbForms and MySQL 4.1

Summary
This article covers Neal Katz's experiences developing a multilingual Website using DbForms and MySQL 4.1. Katz discusses the configuration details and provides tips on how to use the various localization tools available. The article will also explain how to generate reports with JasperReports and POI (Poor Obfuscation Implementation). (3,200 words; September 6, 2004)
By Neal Katz


Developing a Unicode-based application can be a frustrating experience because of the myriad of little "gotchas" involved with making each piece of a complex system work with Unicode data. This article covers my recent experiences in building a college registration system supporting English and the Thai languages. The development was J2EE-based and used the following technologies:

  • Jakarta Tomcat 5.0.19
  • MySQL 4.1.2
  • DbForms 2.4rcX
  • JasperReports 0.5.2
  • Apache POI (Poor Obfuscation Implementation)
  • During the project's initial phases, my team needed to decide on a database to use that would store Thai characters. For maximum flexibility, we decided to use Unicode to store data, instead of a Thai character set. Our database would also store English data, and future versions of this system might need to support other languages as well. Due to our limited budget, we narrowed the database choice to open source products that support Unicode.

    My prior experience with MySQL led me to make the potentially risky choice of using MySQL 4.1, which is clearly marked as a develop release. After doing research on the Web and reading mail archives, I concluded that MySQL 4.1 had some problems with complex queries and complicated collating with fields of different character sets; however, 4.1 had no reported problems with data integrity or database corruption.

    We started development with the MySQL 4.1.1 release and noticed a few complex queries where the ordering did not work as expected; we were forced to rewrite the queries or break a single query into two queries. We were also able to create one complex query that crashed the system (but did not cause any corruption). Upgrading to 4.1.2 solved the problem with the crashing query. To date, we have not experienced any crashing with 4.1.2.

    We are happy with our decision to use 4.1.2, but it was a risky choice, and we were prepared to back out if necessary. Also, given the application type, this risk level was acceptable. For a more mission-critical application, I probably would not have used MySQL 4.1.

    MySQL supports multiple table types, the default is ISAM (indexed sequential access method), which is fast, but does not offer referential integrity through foreign key constraints. Instead, we used InnoDB tables in our system and used the constraint feature for referential integrity. Any potential performance loss was more than offset by the reliability that referential integrity adds to a complex database.

    Using Unicode with MySQL 4.1 is simple, and there are two ways to specify which character set to use. The first approach specifies a character set as the default for a table; the second declares a character set for a specific field. Specifying a field's character set takes precedence over the table's character set.

    Example 1. Define a default character set for a table

    CREATE TABLE `city` (
      `city_id` int(11) NOT NULL default '0',
      `name` varchar(25) NOT NULL default '',
      `name_eng` varchar(25) NOT NULL default '',
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

    Example 2. Define a default character set and specific character set for a field

    CREATE TABLE `test2` (
      `alpha` varchar(5) character set utf8 NOT NULL default '',
      `beta` varchar(10) default NULL
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

    In the first example, all fields are UTF-8: name stores Thai data and name_eng stores English language data. In the second example, the table's character set is latin1 and field alpha is UTF-8. latin1 is the standard character set for tables that don't specify a default character set.

    On my version of Linux (SuSE 9.1), I can display Thai characters using the MySQL shell by starting it with the command mysql --default-character-set=utf8. Unfortunately, this did not work on Windows. Under Windows, we had to wait until we had a Webpage working before we could visually verify that we had Thai data. When using the commands mysqldump to back up a database and mysql to restore the database, you must use --default-character-set=utf8 option, otherwise character set information is lost.

    We used a commercial product MySQL Manager 2 from EMS to design the database and generate our queries. The tool offers a nice visual query builder that can then display the query's text. We then copied the SQL code into our source. The tool is designed to work with MySQL 4.0 and did not support 4.1's character set extensions. If we created a table with the tool, we then had to manually execute an alter table command to change the character set.

    Server platform
    After selecting MySQL, our next choice was deciding which J2EE server to use. Development started with Tomcat 4.1.x, but we quickly switched to 5.0.19 for the better tag library support. As an extra bonus, we were able to take advantage of the EL (expression language) support integrated with 5.0.x.

    The final major decision was our selection of DbForms. DbForms bills itself as a tool that "enables developers to build sophisticated Web-based database-driven applications in very short time and with very little efforts. DbForms applications are built in a manner conceptually similar to RAD [rapid application development]."

    DbForms is best thought of as many different pieces: first, a J2EE framework for writing database-driven applications; second, a rich library of tags to take advantage of the framework; and third, a GUI tool that generates JSP (JavaServer Pages) pages from a database via XSL (Extensible Stylesheet Language) templates. We modified some of these template files to add support for including a common file at the top and generating all pages as UTF-8.

    DBForms's GUI tool is useful in creating initial JSP pages for data views and entry. For simple screens, the code created by the GUI tool can be used unaltered. In our application, approximately 40 percent of the data entry screens were generated by the GUI tool and deployed unaltered. Another 30 percent of the screens required minimal coding to be useful, usually adding DbForms's Select tag so that the user could select a field value from a foreign database.

    One limitation of the GUI tool is that it is not a "lifecycle" tool. If you use it to regenerate a page or the global dbforms-config.xml file, it will overwrite the old file. The GUI tool does not remember changes from previous edits. To get around this limitation, I created a parallel directory structure called "dummy," made new edits there, then copied and pasted from the dummy directory to the real directory. This proved useful in situations where I added a few fields to a table and wanted to regenerate the data entry screen, but did not want to lose any changes I had previously made to the JSP page.

    Now that we have covered the big picture, let's start looking at the individual steps required to get things working.

    Getting Unicode to work
    The first thing we did was write a standalone Java program that would read in Thai characters from a MySQL database and write them out to a file. The test database was populated by hand. The program's key part is the section that opens the connection:

    DriverManager.setLogWriter(new PrintWriter(System.out));
    Class.forName("com.mysql.jdbc.Driver");
    Connection db =
    DriverManager.getConnection(
    "jdbc:mysql:///" + dbname +
    "?requireSSL=false&useUnicode=true&characterEncoding=UTF-8",
       "root", "");

    The setLogWriter() function is used for diagnostics and isn't needed in production, but is helpful for tracking down problems.

    The JDBC (Java Database Connectivity) connect string has three parameters. Their meaning should be obvious. We used requireSSL=false since we had problems building MySQL 4.1.x from scratch using SSL (Secure Socket Layer) on Linux. We didn't try very hard though; in our environment, SSL was not important. The useUnicode and characterSetEncoding parameters tell the JDBC subsystem that the database is Unicode and everything is UTF-8. Note that the string UTF-8 is the Java way of specifying the charset; MySQL uses the string utf8. If changing charsets, you'll have to compare what Java supports with what MySQL supports.

    We also copied the above code to a JSP page and ensured we could display Thai. The only tricky part of that exercise was setting the page output charset. That was accomplished by putting the following line at the top of each page:

    <%@ page contentType="text/html;charset=UTF-8" language="java" %>

    Note: If you use Apache with mod_jk (the Apache module for running Tomcat as a plug-in within Apache), you will have to configure Apache to treat the page as UTF-8 as well; the Apache setting might overwrite the Tomcat setting (depending on which version you use). Our project used Tomcat as a standalone server.

    Our next step was to get things working with DbForms. We started by using the GUI tool and setting the JDBC connect string, jdbc:mysql:///dbtest?requireSSL=false&useUnicode=true&characterEncoding=UTF-8. The tool then created a dbforms-config.xml file. Scroll down to the bottom of the file and you should see:

    <dbconnection
      Name="jdbc:mysql:///stpneal2?requireSSL=false&amp;useUnicode=true
        &amp;characterEncoding=UTF-8"
      isJndi = "false"
      conClass  = "com.mysql.jdbc.Driver"
      username = "root"
      password  = ""/>

    Note: If you do not see &amp;, then you are using an old version of the GUI tool, which does not handle XML escaping correctly.

    In the DbForms GUI tool, click on the XSL Transformation tab and create some JSP files. You can then review the code produced to see what DbForms and the GUI tool can do for you in terms of generating ready-to-use JSP pages.

    We then manually edited the generated JSP files and inserted the <%@ page contentType="text/html;charset=UTF-8" language="java" %> directive. After verifying it worked, we modified the XSL files used by the GUI tool to generate the code. The change will vary file-to-file, but it will look something like this:

    //--file "
      <xsl:value-of select="$fileName" />
      " ------------------------------------------------
      <xsl:text disable-output-escaping="yes">
         <%@ page contentType="text/html;charset=UTF-8" language="java" %>
         <%@ include file="inc_taglibs.jsp" %>
      </xsl:text>

    As you can see, we modified the XSL file to output two directives at the top of each file generated. One sets the page contentType and charset; the other includes a file that we used to define all our taglibs.

    Everything worked fine in terms of displaying data, but updating data did not work correctly. A little digging revealed the problem to be that the form-processing code did not know what charset to use. A limitation of HTML form processing is that the data's charset entered in the form or in the page used to display the form is not passed back to the server. Once we discovered the problem, the answer was simple: we set Tomcat to treat all form data as UTF-8. We did that by defining a filter in the web.xml file. A filter is a code snippet called by the Web server that manipulates the input before passing it on. The filter code is in the class file org.dbforms.util.external.SetCharacterEncodingFilter. This file is a copy of the sample filter that ships with Tomcat, but it has been copied into the DbForms JAR for ease of use:

    <!-- ======== Filter for Set Character Encoding  ======= -->
    <filter>
      <filter-name>Set Character Encoding</filter-name>
      <filter-class>
         org.dbforms.util.external.SetCharacterEncodingFilter
      </filter-class>
    <init-param>
      <param-name>encoding</param-name>
      <param-value>UTF-8</param-value>
      </init-param>
    </filter>

    At this point, our application displayed and updated Thai information in the database, but we still had many tasks to complete in order to make our application fully usable as a multilingual site.

    The next task was to display static content on the page that was localized. The easiest way to perform localization is with tags, and we could choose from two different tag libraries: the standard formatting (prefix FMT) tag library that is part of JSTL (Java Standard Tag Library), or DbForms's db:message tag. We ended up supporting both. In most cases, we used the FMT tag explicitly in our JSP code, but we used the DbForms message tag code implicitly to perform localization on HTML widgets like buttons.

    Fortunately, both tag libraries worked the same way—they used a set of property files in the WEB-INF/ classes directory composed of key-value statements.

    A brief introduction to ApplicationResources files for localization
    Localization is performed through a hierarchy of property files. There is a base file and then descendent files. If a given key cannot be found in a descendent file, its parent files will be checked in ascending order. The hierarchy model is used to support a default file, a specific language, and its dialects.

    Assuming we use the name ApplicationResources for our file naming, we can have four files in WEB-INF/ classes:

    ApplicationResources.properties The base file
    ApplicationResources_en.properties English language
    ApplicationResources_en_UK.properties English, British dialect
    ApplicationResources_th.properties, Thai language

    Depending on our current locale's setting, we start with one of these files. If a key cannot be found, then the next ascending file is checked until we run out of files and generate a not found.

    Here are some sample ApplicationResources files showing how they look and describing how the lookup logic works:

    FILE: ApplicationResources.properties:
    One=first
    Two=second
    Elevator=up thing
    user_name=User Name
    RoomSchedule.font=Lucida-plain-10

    FILE: ApplicationResources_en.properties:
    Two=dos
    Elevator=Elevator

    FILE: ApplicationResources_en_UK.properties:
    Two=double
    Elevator=LIFT

    FILE: ApplicationResources_th.properties:
    user_name=&#3594;&#3639;&#3656;&#3629;&#3612;&#3641;
    &#3657;&#3651;&#3594;&#3657;&#3591;&#3634;&#3609;

    If our locale is en_UK, then searching for "Elevator" would find "LIFT," and searching for "One," would return "first."

    In the Thai example, we must use Unicode escape sequences to encode our strings. Unfortunately, the property files are plain 8-bit ASCII. You can use the native2ascii -encoding UTF-8 command src dtst command to convert a UTF-8 file to ASCII, but in our environment, we just found it easier to use Dreamweaver in design mode, then cut and paste from code view. Microsoft Word can also be used; type in some characters, save as a Webpage, then view code.

    You should adopt a naming convention early in your project to keep things organized. In our environment, we used the prefix bean.xyz.key if a message was used in bean xyz; abc.key if a key was used on JSP page abc; and no prefix for common things like user_name. Ours is not the best example, but better than no naming scheme at all.

    For our application, we used only the default file ApplicationResources.properties and ApplicationResources_th.properties, effectively making English the default language. The files also contained internal configuration information that was locale-specific, such as the Bean.RoomSchedule.font=Lucida-plain-10 setting. (For more details, see Resources).

    Configuration for both localization methods is done in web.xml. For the JSTL, the configuration setting is a top-level context-param:

    <web-app>
    <context-param>
      <param-name>javax.servlet.jsp.jstl.fmt.localizationContext</param-name>
      <param-value>ApplicationResources</param-value>
    </context-param>

    For DbForms, it is an init-param to the servlet:

    <!-- ======= DbForms Configuration Servlet =======   -->
    <servlet>
      <servlet-name>config</servlet-name>
      <servlet-class>org.dbforms.servlets.ConfigServlet</servlet-class>
      <init-param>
        <param-name>resourceBundle</param-name>
        <param-value>ApplicationResources</param-value>
      </init-param>

    Using the FMT tag is very simple: download and copy the file jstl.jar into your WEB-INF/bin directory. Then at the top of your JSP file, include a tag directive:

    <%@ taglib prefix="fmt" uri="http://java.sun.com/jsp/jstl/fmt" %>

    Usage is as easy as <fmt:message key="user_name"/>.

    To use the other localization approach, DbForms's db:message tag:

    <db:message key="user_name" />

    The db:message tag also takes an optional attribute params, which is a comma separated list of values that replaces {0},{1}, etc., in the found string.

    DbForms's localization also works with other DbForms's tags like <db:gotoButton> to enable the set-caption attribute's localization to the key <db:gotoButton caption="user_name" >, and have caption resource="true" in the enclosing dbform tag:

    <db:dbform followUp="/menu.jsp" captionResource="true">

    Note: DbForms's GUI tool sometimes embeds spaces in the caption attribute for the code it generates; you should remove them.

    Setting locale
    Changing locale is performed in the file options.jsp. The key point is that when the user changes the locale, we use the fmt:setLocale tag to set the locale used by JSTL, and we invoke the following code to set the locale used by DbForms:

    <%  session.setAttribute("org.dbforms.LOCALE",
            new Locale( request.getParameter("locale") )); %>

    Dynamic localization
    In our database model, we needed to keep the Thai and English text for various attributes. We did that by having a separate field in the table. We also used a common naming convention: name for Thai and name_eng for English.

    We keep this code snippet in an include file:

    <c:choose>
        <c:when test="${sessionScope.locale=='th'}">
            <c:set var="name_field" value="name"/>
        </c:when>
        <c:otherwise>
            <c:set var="name_field value="name_eng" ""/>
       </c:otherwise>
    </c:choose>

    This sets the variable name_field to the appropriate field name to access depending on our locale setting. We used EL notation ${name_field} to access the variable.

    We then display a single value in the correct language like this:

    <db:dataLabel  fieldName="city_id"  >
       <db:tableData    name="city" foreignTable="city"  
    visibleFields="${name_field}" storeField="city_id" />
    </db:dataLabel>

    This will show the name of the current city in the correct language.

    To let the user select a list of cities using the current language, we use the db:select tag; inside that, we use a db:tabledata tag:

    <db:select pattern="#" fieldName="city_id" size="0"
        selectedIndex="<%=(String)currentRow_person.get("city_id")%>" >
           <db:tableData name="city" foreignTable="city"  
              visibleFields="${name_field}" storeField="city_id" />
    </db:select>

    The selectedIndex line reads city_id's current value for this record in the person table and uses it as the initial selection.

    Reports
    We used JasperReports, to create reports in PDF format. DbForms has integrated support for JasperReports. To use it, you simply create a JasperReports definition file and store it in the WEB-INF/reports directory or change the configuration in web.xml to use some other directory. On your JSP page, include a line similar to the following inside a db:dbforms tag body:

    <db:gotoButton  destination="/reports/city_report" caption="print_report" destTable="city"  singleRow="false" />

    In the db:gotoButton, the destination attribute points to "/reports/city_report". /report is a servlet mapping handled by the StartReportServlet (see the web.xml for the configuration). The report definition file is called city_report.xml (the development release of DbForms uses jxml instead; this change will be incorporated into the next stable release). The report definition file will be compiled on demand to a .jasper file. You can edit report-file definition files at any time—no need to restart the application to test new reports. The testTable attribute should match the enclosing db:dbform tag's tableName attribute. The singleRow attribute is true if you want to display only one record, otherwise false.

    One gotcha we discovered: In the report definition file, make sure that the <jasperReport> tag's name attribute is the same as the file's base name. Otherwise it will not work:

    <jasperReport name="city_report" ... >

    Getting JasperReports to work with Unicode requires textElement sections to be defined like this:

    <textElement textAlignment="Left" verticalAlignment="Top"
       lineSpacing="Single">
          <font fontName="Angsana New"
                pdfFontName="c:\fonts\ANGSA.TTF" pdfEncoding ="Identity-H"
                isPdfEmbedded ="false"  size="12"
                isBold="false" isItalic="false" isUnderline="false"
                isStrikeThrough="false" />
    </textElement>

    As you can see, the <font> settings are tricky and require a lot of research and playing around. pdfEncoding specifies that the font is an external Unicode font being drawn horizontally. pdfFontName is a full path to a font, in this case, ANGSA.TTF, which is a Thai language font. For Unix systems, you will have to change the path. We use a sed script to do that automatically. isPdfEmbedded should be false, even though true seems to make more sense.

    To create our report definition files, we used an excellent tool called iReport. Unfortunately, with iReport, we could not create the <font> settings exactly the way we needed them to be for Thai support. We had to edit the XML file by hand and modify the font setting. Then we used iReport to modify the file further. If we needed another field to be Thai, we just copied the one we edited and modified that. In short, we only needed to hand modify the file once.

    POI
    During my development with DbForms, I wrote a report generator called ExcelReportWriter to generate an Excel spreadsheet using POI. POI turned out to be easy to use; the only change required for supporting Unicode data was the addition of this line: cell.setEncoding(HSSFCell.ENCODING_UTF_16);. This method call is applied to all character data; i.e., I do not bother checking if the incoming data is non-UTF-8, I just make the whole spreadsheet Unicode. This might result in a bigger output file, but in my situation, that was not a concern. The full code for the ExcelReportWriter is in the development release of DbForms.

    Conclusions
    Hopefully this article will help those people interested in writing multilingual applications. Even if you don't plan to support multilingual applications immediately, you can do a few tasks to make the process easier if you plan to add such functionality later:

    1. Use ApplicationResources files. Separating out the messages' text from the code is just good programming practice. Avoid hard code strings in your beans and JSP files.
    2. Use a taglib to display strings from ApplicationResources files.
    3. Handle times, dates, numbers, and currency values, which are all subject to regional changes. Java has a rich API for handling these. Look at the Calendar class for time and date handling, and the Formatter class for handling numbers and currency values.
    4. Set your page output to UTF-8. Sending pages as UTF-8 should not result in any drawbacks, even if all the character data is ASCII.
    5. Set form-handling to treat all data as UTF-8. This is a little trickier since you are now opening yourself up for a user to input Unicode data. If you are not prepared to handle this (or your database is not prepared for it), then do not enable UTF-8 form-handling.

    DbForms is a great tool for generating applications that are data-entry centric. This article did not cover all of the program's aspects. To get started, I suggest downloading the project and the sample codes. Look at the bookstore first, then move onto the other demos. The default installation uses a memory resident database, so you can get started quickly without configuring any database.

    JasperReports is great for generating PDF files; it can create Excel files too, but I found it easier to use the ExcelReportWriter that I developed to generate simple Excel files.

    MySQL 4.1 proved to be a good choice, with no problems from a stability perspective. As I write this, the 4.1 branch has graduated to beta from developer release.


    About the author
    Neal Katz is president of Nechsi Software of Bangkok, Thailand. He has more than 15 years of professional software development work and has been using Java since version 0.9. Katz prefers to use Linux and open source technologies in his work. He is a contributor to DbForms and other open source projects. Prior to Nechsi Software, Katz was a founder of MessageOne technologies and an early member of Mail.Com and Bigfoot.