Unicode, One Code to Grok Them All
Despite being a programmer for years, and despite being old enough to remember the consternation over how different 8-bit computers had different character sets, I've only quite recently started down the pilgrim path towards the grokking of Unicode.
Maybe you're a programmer too. Maybe you're the developer in Joel On Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!):
When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they "couldn't do anything about it." Like many programmers, he just wished it would all blow over somehow.
But it won't.
Damn skippy it won't. I used to wish it would too. After all, UTF-8 is magically delicious, right?
Right?
But now, I'm putting the upgrade on my molawiki software and I'm bound to think about this stuff in some detail. I'm building a Flex/Flash front end to a PHP backend, with everything passed between them in AMF packets. The problem is, Flash likes UTF-8 text and PHP 4.x thinks in ISO-8859-1. You see, according to PHP:
A string is series of characters... a character is the same as a byte, that is, there are exactly 256 different characters possible. (source)
Well, shit.
That's all well and good as long as everyone is typing plain old English from a US or UK keyboard...but if I don't sort it out now, I'm dooming my software to a bad blind date with an angry Latvian.
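To make the byte-versus-character problem concrete, here is a small sketch. It's in Python rather than PHP, purely as an illustration, and the helper name fits_latin1 is made up for this example. It shows a Latvian place name taking more bytes than characters in UTF-8, and not fitting into ISO-8859-1 at all.

```python
# Illustration only (Python, not PHP): characters are not bytes.
word = "R\u012bga"  # "Riga" in Latvian spelling; U+012B is i with macron

utf8_bytes = word.encode("utf-8")

char_count = len(word)        # number of code points
byte_count = len(utf8_bytes)  # number of bytes, which is what a byte-oriented string length reports


def fits_latin1(text):
    """Hypothetical helper: True if every character exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(char_count, byte_count)                       # 4 5
print(fits_latin1("caf\u00e9"), fits_latin1(word))  # True False
```

A language that equates one character with one byte would call that word five characters long, and an ISO-8859-1 pipeline has nowhere to put the ī.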
First Up... Iconv Support in PHP
So it turns out that the spiffy AMF-PHP "uses" libiconv, the GNU Swiss Army Knife of Unicode Conversion. PHP, if it's been compiled correctly, exposes a series of functions that provide some utility for dealing with this stuff. Naturally, my test server doesn't have iconv baked into its copy of PHP. Recompiling, here I come.
Then... the experimenting and programming to find some graceful compromise that will handle character encoding correctly, if in a limited fashion.
Limited? Well, yes. Flash is down with the UTF-8. "Out of the box," PHP only likes ISO-8859-1. That's a camel through the eye of the needle, folks: anything outside those 256 characters has to squeeze through, I hope, as a lot of cryptic escape encodings that reference at least some of the appropriate XHTML character entities. Even then, I'm guessing it ends with "still coming up short."
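Here's roughly what that squeeze could look like, sketched in Python rather than PHP since my PHP doesn't have iconv yet. The function name to_latin1_with_entities is hypothetical, and this is one possible approach, not what AMF-PHP actually does: characters that fit in ISO-8859-1 pass through as single bytes, and everything else becomes a numeric character reference.

```python
import html


def to_latin1_with_entities(text):
    """Hypothetical helper: encode to ISO-8859-1 bytes, replacing any
    character that does not fit with a numeric character reference."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


sample = "caf\u00e9 in R\u012bga"
encoded = to_latin1_with_entities(sample)
print(encoded)  # b'caf\xe9 in R&#299;ga'

# Reading it back: decode the bytes, then expand the references again.
restored = html.unescape(encoded.decode("iso-8859-1"))
print(restored == sample)  # True
```

The é survives as one byte, while the ī turns into &#299;. That's fine when a browser renders the result as XHTML, but it's "still coming up short" anywhere the text needs to be searched, sorted, or measured as actual characters.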
PHP Add-ons Can Help
Even with PHP's limited vision of strings, there are a variety of PHP modules and add-ons beyond iconv that ease dealing with UTF-8. See:
- Notes on PHP UTF-8 possibilities and problems
- PHP multibyte string module
- PHPUTF8 — open source tools for working with UTF-8 even when multibyte module is missing
Uh, Was There A Point?
Oh, right. The point is that you should read Joel's Unicode overview, so you too can start to think about this stuff. Then, dive into Unicode on Wikipedia. If you're a web developer, check out the Unicode and HTML Wikipedia article for some more low-down on dealing with HTML/XHTML and encodings.
