Unicode, UTF-8, UTF-16, UCS-2, UCS-4 and URIs

Unicode can be confusing !

For a start there are a number of different encodings such as :

  • UTF-8 (for example € in UTF-8 is 0xE2 0x82 0xAC)
  • UTF-16 (which uses surrogate pairs to represent "characters" outside the Basic Multilingual Plane (BMP))
  • UCS-2 (a predecessor of UTF-16)
  • UCS-4
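
To make the differences concrete, here is a minimal Python sketch that encodes the euro sign (U+20AC) in several of these encodings. The G clef character (U+1D11E) and the helper name hex_bytes are my own additions for illustration; the clef is used only because it lies outside the BMP and therefore needs a surrogate pair in UTF-16.

```python
# The euro sign is U+20AC, which is inside the Basic Multilingual Plane.
euro = "\u20ac"

utf8 = euro.encode("utf-8")        # three bytes
utf16 = euro.encode("utf-16-be")   # one 16-bit unit (UCS-2 gives the same bytes here)
utf32 = euro.encode("utf-32-be")   # one 32-bit unit (same layout as UCS-4)

# U+1D11E (musical G clef) is outside the BMP, so UTF-16 needs a
# surrogate pair: two 16-bit units, four bytes in total.
clef = "\U0001D11E"
clef_utf16 = clef.encode("utf-16-be")


def hex_bytes(data: bytes) -> str:
    """Format bytes as 0x.. values (hypothetical helper for this sketch)."""
    return " ".join(f"0x{b:02X}" for b in data)


for name, data in [
    ("UTF-8", utf8),
    ("UTF-16", utf16),
    ("UTF-32", utf32),
    ("UTF-16 clef", clef_utf16),
]:
    print(f"{name:12} {hex_bytes(data)}")
```

The first line printed is 0xE2 0x82 0xAC, matching the UTF-8 example above, and the clef comes out as the surrogate pair 0xD8 0x34 0xDD 0x1E.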

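A URI (RFC 2396) may only contain a restricted set of ASCII characters, so anything else has to be encoded / escaped as %hex-values, normally of its UTF-8 bytes, so if you want to access a web page called

  €.php

the URI will be

  %E2%82%AC.php

and different browsers seem to work with Unicode URIs in different ways !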

  • Safari works with both (€.php and %E2%82%AC.php) and helpfully (?) redisplays %E2%82%AC.php as €.php in the address bar
  • Firefox converts €.php (sometimes incorrectly to %80.php, 0x80 being the Windows-1252 byte for €) so you can only use / see %E2%82%AC.php
  • IE works with both (€.php and %E2%82%AC.php) but leaves both versions unchanged in the address bar
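
The escaping itself can be reproduced with Python's standard urllib.parse module. This is only a sketch of the percent-encoding step, not of what any browser does internally; the variable names are mine, and the cp1252 call is my guess at how the %80.php form could arise.

```python
from urllib.parse import quote, unquote

page = "\u20ac.php"  # the page name, i.e. the euro sign followed by .php

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters by default.
escaped = quote(page)
print(escaped)                    # %E2%82%AC.php
print(unquote(escaped) == page)   # True

# Assumption: escaping the Windows-1252 byte instead of the UTF-8 bytes
# produces the form Firefox sometimes showed.
wrong = quote(page, encoding="cp1252")
print(wrong)                      # %80.php
```

A server that expects UTF-8 will not decode %80.php back to the euro sign, since 0x80 on its own is not valid UTF-8.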
