Character Sets in Xinha
If you use UTF-8 encoding across the board then you should not have problems. Almost everybody uses UTF-8 these days, and as long as you do everywhere, you should be fine.
I want to learn more anyway…
Skip to the advice.
What is a Character Set?
A character set (also called character encoding) is a ‘'’code’’’, it maps an integer value (eg 65) to an arbitrary character (eg “A”). The typical character set is ASCII (American Standard Code for Information Interchange), this defines 127 integers mapping to 127 distinct characters ranging from “NUL” as character 0 to “DELETE” as 127.
Everything was so simple back then, when ASCII ruled, but then people realised that not everybody speaks english, and that those 127 characters we have in ASCII are no use for most other languages. So what happened is that people came up with different character sets, now when somebody said “Character 65” you also had to be told what character set you are using so that you could know if they were talking about the character “A” (in the ASCII set) or some completely different character in a different character set. Of course, this is exagerating a little, most character sets “extend” the 7-bit ASCII set using a full 8 bit byte and thus using the upper half of the character space for the “custom” characters, and leaving the standard ASCII code at and below byte 127.
Most european languages have a very small number of characters, English for example has 37 necessary to communicate – 26 letters, 10 numbers and space are all you need to get your message across (unless you’re speaking “txt”, in which case you can throw out the vowels ;).
But what about the much more complex languages of the world, CJK (Chinese, Japanese and Korean) being the most familiar to those of us in the west – these languages need MANY more characters than the 256 avaiable in a single byte, especially if they want to leave the lower 127 as standard ASCII!
To counter this problem, enter the double (and triple, and quadruple, and…) byte sets. Instead of using just 8 bits (one byte) to represent the number which specifys the character (remember, in ASCII the number “65” is represented simply by the 8 bits 1000001 which in the ASCII set represents the letter “A”), they use 2, or 3, or 4. Essentially this means that while ASCII has only 127 different characters available to it (7 bit ASCII this is), multi byte character sets can have thousands, or millions!
So, let’s say that the following string - “à®±” - is an encoding of some character in a multibyte character set now how do you decode that into the numbers that you can use to look up what character it represents?
Enter the character encoding – simply, this is an “algorithm” which you may follow to figure out which bits of the bytes you have to join togethor to get the number representing a character, and where one “character” stops and the next starts, ASCII is a character encoding AND a character set, in the terms of an encoding the “algorithm” is simply “take 8 bits, treat it as an integer, look it up in the ASCII character set”. Once we have the number (say 2993), that’s it, look it up in the character set (say UNICODE) and we know what character it is (say “TAMIL LETTER RRA”).
Ok, if you’ve been reading along you will now see that we have a lot of different character sets, and that without prior knowledge of which character set (and encoding) you have no show of “decoding” the stream of bytes into a list of numbers which you may use to lookup the character in the character set tables.
Enter UNICODE, this is just a character set like any other, except that it’s an ENORMOUS one, I mean big, really big, you might think that the solar system is big but that’s just a walk in the park compared to UNICODE. Put simply, UNICODE contains all the characters from all the languages, and room to spare for growth (and languages that havn’t been documented yet). It’s “compatable” with ASCII, in that character 65 is still capital letter “A” etc, but it contains more than a MILLION other characters.
Remeber multi-byte character sets, well, UNICODE is the ultimate multibyte character set, and as a result we need to have some encoding for those bytes so we can figure out the ‘'’numbers’’’ to lookup. Of course, there are many, but the most important one is….
UTF-8 is just a character encoding, it says “take a string of bytes, do this algorithm over them, and you’ll get a list of numbers which represent characters in the UNICODE character set”. The special thing about UTF-8 is that it leaves the lower (127 characters) of ASCII intact (remembering that these characters are unchanged in UNICODE), so for most english text, UTF-8 encoded UNICODE is identical to plain old 7-bit ASCII (which as you might expect is quite useful).
Slowly but surely the world is progessing to ONE character set (UNICODE) and ONE encoding (UTF-8), gone will be ASCII, BIG-5, SHIFT-JIS, and all those other character sets and encodings, never to darken our doorstep again.
#!html <a name="stopwaffle"></a>
Stop waffling, cut to the chase!
’'’Encoding and character sets are only used to get data IN TO and OUT OF the web browser. Inside the browser, inside Xinha, the data is ALWAYS represented by a list of numbers representing characters in UNICODE.’’’
The way you do this is for your webserver (or application language such as PHP) to set the Character-Encoding header (or maybe you can use a meta tag, but that is very unreliable) on the HTML file that you are putting Xinha into, and ensuring that you are using that same character encoding on the content being edited. There are no exceptions, if you are trying to edit UTF-8 encoded html, then make sure that the editor page itself is being recognized as utf-8. If you are trying to edit ISO-8859-1 (typical european langauges) then make sure that the editor page is being recognized as ISO-8859-1.
Further to that, UTF-8 encoding across the board with your files, with your edited content, with EVERYTHING, is HIGHLY encouraged, avoid using older encodings and character sets, UTF-8 encoded UNICODE isn’t the way of the future, it’s the way of the present - you will have a much more reliable time if you forget about the character sets of old and use UNICODE.
A very handy tool for the Linux users amongst us is gucharmap this program allows you to easily browse (and search) the entire UNICODE character set.
UNICODE is nothing if you don’t have a font with the characters. To my knowledge the most complete UNICODE font is currently Arial Unicode MS which has representations of a very large proportion of the UNICODE standard.
This font used to be downloadable from Microsoft for free but is now included only with MS Office. Emphasis to be placed on ‘‘from microsoft’’ if you know what I mean.