
  • UTF-8 web-applications using PHP and MySQL

    This document lists things to keep in mind when creating UTF-8 aware web applications using Dojo, PHP and MySQL.

    Although the focus lies on Dojo and PHP, most of the tips below apply to other web (scripting) languages too.

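    1. write files UTF-8 encoded with a UTF-8 capable editor, like PSPad

      Basically, ensure that your text editor handles the byte order mark (BOM) correctly (which has been an issue with some editors in the past). For PHP files, saving without a BOM is usually the safer choice, because any bytes in front of the opening <?php tag are sent to the browser as output.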
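
      To verify what an editor actually wrote, the first three bytes of the file can be compared with the UTF-8 BOM (EF BB BF). The following is a minimal sketch; the helper name hasUtf8Bom and the sample PHP snippet are assumptions made for illustration only.

```javascript
// Returns true when the byte sequence starts with the UTF-8 BOM (EF BB BF).
// hasUtf8Bom is a hypothetical helper name, not part of any library.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

// Example data: the same text encoded without and with a leading BOM.
const encoder = new TextEncoder();
const plain = encoder.encode("<?php echo 'ä'; ?>");
const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, ...plain]);

console.log(hasUtf8Bom(plain));   // false
console.log(hasUtf8Bom(withBom)); // true
```

      The same check works on the bytes of a real file once they have been read into a Uint8Array.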

    2. insert the following in your .htaccess:
      php_value default_charset UTF-8
      

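      This makes PHP announce UTF-8 as the character set in the Content-Type header it sends by default. It does not convert the data itself, so the output still has to be UTF-8 encoded.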

    3. specify the correct character set in the HTML header:
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
      
    4. specify the correct language in the HTML header:
      <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
      
    5. force Dojo to use UTF-8 for dojo.io.bind:
      var djConfig = {isDebug:true, parseOnLoad:false, bindEncoding:"UTF-8"};
      
    6. specify character set for your <form>:
      <form accept-charset="UTF-8">
      

      Hint: This is optional, because browsers automatically use the encoding the server used for the page. Specifying it anyway makes it much more obvious which encoding you intend to use.

    7. define character set to be used for returning data on the server side:
      header("Content-Type: text/html; charset=utf-8");
      
    8. encode data to be sent to the server on the client side:
      encodeURIComponent();
      

      This makes sure that special characters such as “’$%&,… will be transferred correctly.
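
      As a minimal sketch of this step, the hypothetical helper buildQuery below (the name and the sample parameters are assumptions, not part of Dojo) escapes every key and value with encodeURIComponent() before joining them into a query string. Non-ASCII characters are emitted as percent-encoded UTF-8 bytes.

```javascript
// Hypothetical helper: build a query string, escaping every key and value
// with encodeURIComponent() so that special characters survive transport.
function buildQuery(params) {
  return Object.keys(params)
    .map(function (key) {
      return encodeURIComponent(key) + "=" + encodeURIComponent(params[key]);
    })
    .join("&");
}

const query = buildQuery({ name: "Jörg & Söhne", note: "50%" });
console.log(query);
// name=J%C3%B6rg%20%26%20S%C3%B6hne&note=50%25
```

      Note how the "&" inside the value becomes %26 and therefore cannot be mistaken for the separator between parameters.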

    9. decode data received on the server side:
      html_entity_decode(urldecode($p), ENT_QUOTES, "UTF-8");
      
    10. use utf8_general_ci as the collation (and therefore utf8 as the character set) for your MySQL database.

      This ensures that data is actually stored UTF-8 encoded and not converted to, for instance, latin1 with the collation latin1_swedish_ci, which was the default setting in MySQL versions before 8.0.

    11. right before reading from or writing to your database, issue the following query:
      mysql_query("SET NAMES 'utf8'");
      

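      This tells the MySQL server that the client sends UTF-8 encoded data on this connection and expects results in UTF-8 as well. Note that the mysql_* functions were removed in PHP 7; with mysqli, mysqli_set_charset() serves the same purpose.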

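    12. don’t forget to escape quotation marks to avoid code injection when issuing database queries:
      addslashes();
      

      Note that addslashes() is not the optimal way to escape special characters! It just serves as an example. The escaping function of the database driver (for example mysql_real_escape_string()) or prepared statements are safer choices.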

    13. when using PHP’s json_encode() function, be sure to utf8_encode() strings first, as json_encode() will cut off German umlauts (etc.) that were not encoded before:
      json_encode(utf8_encode(rawurlencode($x)));
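
      On the client, a value produced this way has to be parsed as JSON and then percent-decoded. The following is a minimal sketch of that counterpart; the helper name decodeServerValue and the sample payload are assumptions for illustration, and the payload is written out by hand rather than fetched from a server.

```javascript
// Hypothetical helper: parse a JSON string whose content was percent-encoded
// on the server, then undo the percent-encoding.
function decodeServerValue(jsonText) {
  const encoded = JSON.parse(jsonText);
  return decodeURIComponent(encoded);
}

// Sample payload, as rawurlencode() followed by json_encode() would produce
// it for the UTF-8 string "Grüße aus Köln" (assumed example data).
const payload = '"Gr%C3%BC%C3%9Fe%20aus%20K%C3%B6ln"';
console.log(decodeServerValue(payload)); // Grüße aus Köln
```

      Because the percent-encoded text is plain ASCII, it passes through json_encode() unchanged, and decodeURIComponent() restores the original characters.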
      

    Summary:

    Client:

    • encodeURIComponent(), decodeURIComponent()

    Server:

    • html_entity_decode(urldecode($p), ENT_QUOTES, "UTF-8")
    • mysql_query("SET NAMES 'utf8'"), addslashes()
    • header("Content-Type: text/html; charset=utf-8")
    • json_encode(utf8_encode(rawurlencode($x)))

    MySQL:

    • utf8_general_ci