Internationalization, Localization, Encoding, Iceland and Translation of a Japanese Cat Photo Website
There’s a lot of information out there when it comes to internationalization, localization and encoding within web applications. There’s also a lot of misunderstanding about what each provides for a web application. Here’s my take…
When a web browser makes a connection to a web server, it sends quite a few headers along with the request. Here’s an example:
GET / HTTP/1.1
Host: www.dknewmedia.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Internationalization
Internationalization is a composite of many things:
- Localization: The ability to identify what language and location a visitor is visiting from. This is done through HTTP Requests where the visitor is identified by a locale. In my case, that’s en-US. “en” is English and “US” is the United States. This is a setting within my Operating System.
- Time zones: The ability to adjust for time zones. This is normally accomplished by setting your server to Greenwich Mean Time (GMT) and then allowing users to set their local offset from GMT.
- Character Encoding: This is the ability to display language character sets properly. It’s different from localization, since localization can tell me the language and region of the computer making the request, but it won’t tell me what language the reader is requesting… that’s up to the reader!
Notice in my HTTP header that when the browser made the request, it told the server my locale (Accept-Language: en-us,en;q=0.5); however, it also needs to tell the server which character sets it accepts (Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7). ISO-8859-1 and utf-8 are both allowable character sets.
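To make those q-values a little less mysterious, here’s a minimal Python sketch of how a server might read headers like these. The function name parse_accept_header is my own invention, not part of any framework, and real frameworks handle more edge cases than this does.

```python
def parse_accept_header(value):
    """Parse a header such as Accept-Language into (item, quality) pairs,
    sorted from most to least preferred."""
    items = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, params = part.partition(";")
        quality = 1.0  # an item without a q parameter defaults to 1.0
        for param in params.split(";"):
            key, _, raw = param.strip().partition("=")
            if key.lower() == "q":
                try:
                    quality = float(raw)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not wanted"
        items.append((name.strip().lower(), quality))
    # sorted() is stable, so ties keep the order the browser sent them in
    return sorted(items, key=lambda item: item[1], reverse=True)


# The two header values from the request shown earlier
languages = parse_accept_header("en-us,en;q=0.5")
charsets = parse_accept_header("ISO-8859-1,utf-8;q=0.7,*;q=0.7")
print(languages)  # [('en-us', 1.0), ('en', 0.5)]
print(charsets)   # [('iso-8859-1', 1.0), ('utf-8', 0.7), ('*', 0.7)]
```

The higher the q-value, the more the browser prefers that item. Anything without a q-value counts as 1.0.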
Localization
In this fantastic mixed-up world, localization never dictates language anymore. Even though I’m in en-US, I can absolutely read a different language using a different character set… that’s what happens when I use Google Hindi (I don’t really use Google Hindi). My requested locale and character set are identical to the ones I send when I request the Google English page, but I’m actually fed a page that I cannot read because I don’t have the character set. It all comes up ???????????… However, I can load that character set into Firefox (Firefox > Preferences > Advanced > Languages).
If I load that language, I can then request the page in its native character set and display it on my computer even though my default locale is en-US!
So… if I’m a Hindi student, studying English at Purdue and connected via VPN to the school’s server, on vacation in Australia… there are 3 different settings that need to be applied to the application to be truly Internationalized – and none depend on the other.
My locale would come up as en-US and my timezone is Australia, but the language I’m requesting from the website may be Hindi. If I were to program my application to make assumptions based on my computer’s locale, I would be totally wrong – feeding the person English in an Eastern timezone. Ideally, I would program my application to offer both language and timezone settings… but I wouldn’t assume them based on locale.
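Here’s a rough Python sketch of that idea: the reader’s own saved choice wins, and the browser’s Accept-Language value is only a fallback hint. The function choose_language and the SUPPORTED set are hypothetical names I made up for illustration, and I’m ignoring q-values here to keep it short.

```python
def choose_language(saved_preference, accept_language, supported, default="en"):
    """Pick a display language: explicit user choice first, browser hint second."""
    # 1. A language the reader picked on the site always wins.
    if saved_preference and saved_preference.lower() in supported:
        return saved_preference.lower()
    # 2. Otherwise treat Accept-Language as a hint only. Tags are checked in the
    #    order they were sent, which is usually the order of preference.
    for part in accept_language.split(","):
        tag = part.split(";")[0].strip().lower()
        if tag in supported:
            return tag
        primary = tag.split("-")[0]  # fall back from "en-us" to "en"
        if primary in supported:
            return primary
    # 3. Nothing matched, so use the site default.
    return default


SUPPORTED = {"en", "hi", "ja"}  # hypothetical set of translations on the site

# The Hindi student from the example: the browser says en-US, the reader says Hindi.
print(choose_language("hi", "en-us,en;q=0.5", SUPPORTED))  # hi
# No saved choice yet, so the header is the best guess available.
print(choose_language(None, "en-us,en;q=0.5", SUPPORTED))  # en
```

Notice the timezone never enters into it. That’s a separate setting the reader should control on their own.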
Iceland – the ultimate example
We’re ignorant of multi-lingual and multi-locale challenges in the United States where we all speak English [sarcasm implied]. In some countries, such as Iceland, though the native language is Icelandic, the incredible Icelandic people grow up learning 3 languages! Since Iceland sits between Europe and North America, their companies work across multiple continents, languages, dialects of languages and multiple timezones from their desktop!
Many Icelandic websites are built in US English, UK English, Icelandic, Spanish, French and German! Just imagine how challenging it would be to build Icelandair’s website applications and ticketing systems… wow!
DISCLAIMER: I’ve had the absolute pleasure of working with the fine folks from Icelandair and can tell you they are some of the most talented and friendly professionals I’ve had the pleasure of working with. It’s simply an amazing country and people! Go visit… take Icelandair and be sure to visit the Blue Lagoon!
Language versus Encoding
There are even different character encodings within a single language that don’t play well with each other! Example: A Japanese email written in Shift-JIS may render unreadable on a Japanese person’s computer with localization set to ja-JP because their mail server only recognizes EUC-JP. Ideally, a customer should be able to set what encoding they would like as well as what language – simply ensuring that the encoding and language are compatible with what the client is requesting.
If I wish to read Japanese, I may have to select both Japanese as my language AND Shift-JIS for encoding to display that language properly. Here’s some more confusion to add to the mix… some encoding types support several languages. Unicode (UTF-8) supports dozens. The reverse is true as well: some languages can be read in many encoding types. If that doesn’t make sense… I apologize, it’s a very complex issue.
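If it helps, here’s a small Python sketch that shows both halves of that confusion using only the codecs that ship with Python. The helper can_encode is a name I made up. I’ve written the sample words as Unicode escapes so the example itself can’t get mangled by a bad encoding along the way.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


japanese = "\u65e5\u672c\u8a9e"                # the word for "Japanese language"
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word for "Hindi"

# One language, several encodings: the same word becomes different bytes.
shift_jis_bytes = japanese.encode("shift_jis")
euc_jp_bytes = japanese.encode("euc_jp")
print(shift_jis_bytes != euc_jp_bytes)  # True

# Reading Shift-JIS bytes as if they were EUC-JP does not give the word back.
wrong = shift_jis_bytes.decode("euc_jp", errors="replace")
print(wrong == japanese)  # False

# One encoding, several languages: UTF-8 handles both, Shift-JIS only Japanese.
print(can_encode(japanese, "utf-8"), can_encode(hindi, "utf-8"))  # True True
print(can_encode(hindi, "shift_jis"))  # False
```

That second result is the Shift-JIS versus EUC-JP email problem in miniature: the bytes are fine, they’re just being read with the wrong rulebook.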
Someday I believe (hope) this will all change. I think the original designers of localization codes hoped that the language-country combo would be all that is needed… but we’ve become far more sophisticated. Remember, much of this was developed before an Internet existed. With the advent of GIS, perhaps a person can select their encoding and GIS would handle timezone and locale information.
Internationalization
Back to Internationalization support. If you want to provide an Internationalized application, you need to:
- Support multiple encoding types, languages, and have translation files to display those translations.
- Allow the client to set their language, and even perhaps their encoding type, if necessary.
- Support time zones by allowing users to reference their timezone in comparison to GMT.
- Utilize localization codes with caution… they DO NOT accurately depict what your user is actually requesting nor what they can read.
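For the time zone item in that list, a minimal Python sketch might look like this: keep every timestamp in UTC (the modern name for the GMT reference) and apply the reader’s chosen offset only when displaying it. The function to_user_time, the date, and the +10 offset are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_user_time(utc_moment, offset_hours):
    """Convert a UTC timestamp to the fixed offset the user chose in settings."""
    user_zone = timezone(timedelta(hours=offset_hours))
    return utc_moment.astimezone(user_zone)


# Store timestamps in UTC on the server (hypothetical posting time)...
posted_at = datetime(2007, 4, 1, 0, 29, 35, tzinfo=timezone.utc)

# ...and convert only for display. +10 is an assumed setting for a reader
# in eastern Australia; it comes from the user, not from the locale.
local = to_user_time(posted_at, 10)
print(local.isoformat())  # 2007-04-01T10:29:35+10:00
```

A fixed offset like this ignores daylight saving time. If that matters to you, Python’s standard zoneinfo module works with named zones instead of raw offsets.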
Translation
Machine translation is still in its infancy. There are a number of websites out there (and WordPress Plugins) that offer machine translation of your site. Don’t be tempted to do this… there are two reasons why:
- If machine translation works, the user that is checking out your site will already have a translator to work with.
- Machine translation sucks.
Don’t believe me? Here’s a Japanese translation for you:
Pasted from Masatsu File – a dude with a ton of pictures of cats:
Japanese Blog Entry
[The original Japanese text did not survive a character-encoding error; only question marks remained.]
– 00:29:35 by masatsu
Machine Translation:
?? Hann ?????
-00:29:35 by masatsuThe name of the fist saint of an/the elephant that appeared on yesterday’s “a/the beast fist fleet ?? ranger” “?? Hann ????” with a/the radio actor Yutaka Mizushima….
? (?) that does cuttlefish readily.
Because even the radio actor of the fist saint, master ????? of a/the cat are the Chinese quince of a Nagai Ichiro=dragon ball the staff of a/the ?? ranger looks like the people who witty remark knows readily.
??????(The sweat) that I was forgetting completely this year “the secret of the name of the super fleet”, of usual practice every year ?,
??????If I write belatedly this year,
??????? shrine ??? (can how ?) ? headland cullion (?? cut not) Fukami ?? (? or observe able to ?) an/the is it that does
??????With, it becomes “a can/?/?”, in other words, “kung fu”, when line up and change the head of a/the name.
??????Although it is said to increase additional two allegedly it may become what kind of name.
The truth is said to be “age” because the silver element symbol is Ag ?, although the name of “the ???? silver” the Takaoka ? technician of ??????? was thinking that the character called “?” was used with the association of a silver’screen?movie, if it calls it with an addition member.
??????However, I am thinking selfishly, that my theory is not unrelated that “?” is included in a/the Chinese character.
I’m sure reversing the polarity on this translation would provide just as readable a diction in English. You did understand the entry, right?