Chinese characters WILL last
Posted: Sun Oct 29, 2006 9:48 am
A language is itself just because of the way it is.
Some foreign friends don't have an adequate understanding of written Chinese, and some of them claim that Chinese characters are destined to die out. That's why I think this topic should be brought up: to make it known to all that Chinese characters will flourish.
Frankly, I'm NOT a language major. I'm 23 and I just graduated with a bachelor's degree in mechanical engineering. However, my primary interest is building electronic systems, including those with LCD user interfaces. Sometimes I need to embed a Chinese input method and a character display function into my system. So you know, I HAVE TO BUILD A CHINESE CHARACTER SUPPORT SYSTEM FROM SCRATCH. What you worry about is not a problem at all. On the contrary, Chinese texts take up less space in a digital system, and less time to type. I'll tell you why.
We all know that our screens are made up of luminescent materials which form tiny pixels, and that characters are a special type of image. Alphabetic letters are images too, just relatively simpler ones. Either way, they are displayed the same way: we draw them on the screen, pixel by pixel, so that they look like the character.
Take the letter "A" and the character "国" (guo/country) for example.
Let's put them both in a small cell 16 pixels high: 8*16 pixels for "A" and 16*16 pixels for "国".
A:
00000000 = 0x00
00000000 = 0x00
00010000 = 0x10
00111000 = 0x38
01101100 = 0x6C
11000110 = 0xC6
11000110 = 0xC6
11111110 = 0xFE
11000110 = 0xC6
11000110 = 0xC6
11000110 = 0xC6
11000110 = 0xC6
00000000 = 0x00
00000000 = 0x00
00000000 = 0x00
00000000 = 0x00
国:
0000000000000100 = 0x00, 0x04
0111111111111110 = 0x7F, 0xFE
0100000000100100 = 0x40, 0x24
0101111111110100 = 0x5F, 0xF4
0100000100000100 = 0x41, 0x04
0100000100000100 = 0x41, 0x04
0100000101000100 = 0x41, 0x44
0100111111100100 = 0x4F, 0xE4
0100000100000100 = 0x41, 0x04
0100000101000100 = 0x41, 0x44
0100000100100100 = 0x41, 0x24
0100000100000100 = 0x41, 0x04
0101111111110100 = 0x5F, 0xF4
0100000000000100 = 0x40, 0x04
0111111111111100 = 0x7F, 0xFC
0100000000000100 = 0x40, 0x04
A digital 1 stands for a visible pixel on the screen, and a digital 0 stands for the background. That's how we draw images on the screen. Each digit takes up one bit in memory, and 8 bits make up a byte, so we can also write the dots in hexadecimal format. That's what those magic 0x.. numbers are. So the image of the letter A takes up 16 bytes, and the image of the character 国 takes up 32 bytes.
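To see that the hex bytes really are the picture, here is a minimal Python sketch that turns them back into dots. The byte values are copied from the two tables above; the names GLYPH_A, GLYPH_GUO and render_glyph are just made up for this illustration.

```python
# Row bytes for the 8*16 letter "A", copied from the table above (one byte per row).
GLYPH_A = bytes([
    0x00, 0x00, 0x10, 0x38, 0x6C, 0xC6, 0xC6, 0xFE,
    0xC6, 0xC6, 0xC6, 0xC6, 0x00, 0x00, 0x00, 0x00,
])

# Row bytes for the 16*16 character 国, copied from the table above (two bytes per row).
GLYPH_GUO = bytes([
    0x00, 0x04, 0x7F, 0xFE, 0x40, 0x24, 0x5F, 0xF4,
    0x41, 0x04, 0x41, 0x04, 0x41, 0x44, 0x4F, 0xE4,
    0x41, 0x04, 0x41, 0x44, 0x41, 0x24, 0x41, 0x04,
    0x5F, 0xF4, 0x40, 0x04, 0x7F, 0xFC, 0x40, 0x04,
])


def render_glyph(data, bytes_per_row):
    """Return text rows for packed glyph bytes: '#' for a 1 bit, '.' for a 0 bit."""
    rows = []
    for start in range(0, len(data), bytes_per_row):
        row_bytes = data[start:start + bytes_per_row]
        bits = "".join(format(b, "08b") for b in row_bytes)
        rows.append(bits.replace("1", "#").replace("0", "."))
    return rows


if __name__ == "__main__":
    print("\n".join(render_glyph(GLYPH_A, 1)))
    print()
    print("\n".join(render_glyph(GLYPH_GUO, 2)))
```

Running it prints 16 rows for each glyph, 8 dots wide for "A" and 16 dots wide for 国.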
Where do the images come from? They come from what we call a library, which is actually a file filled with sequential bytes. There are 128 ASCII codes, each drawn the same way as illustrated above for the letter "A". So an ASCII library takes exactly 128*16=2048 bytes, or 2KB. There are 8192 most frequently used Chinese characters according to the national standard (GBxxxx). These characters are usually arranged together to form a standard library. Given that a single 16*16 Chinese character takes up 32 bytes, the total space 8192 characters consume is exactly 8192*32=262144 bytes. Because 262144/1024=256, we can say it takes 256KB to store such a library. How much space have you guys got in your computer? 40GB? 120GB? 400GB? Or 2TB? I've got 320GB here, so I've got a hell of a lot of space: enough to save over 1.3 million duplicates of this library. So, you see, a 256KB library isn't a big deal nowadays.
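If you want to check the arithmetic yourself, here is a small Python sketch. The character counts are the ones given above; the names are made up for the example, and I'm assuming 1GB = 1024*1024*1024 bytes for the disk.

```python
GLYPH_HEIGHT = 16                          # pixel rows per glyph
ASCII_BYTES = (8 * GLYPH_HEIGHT) // 8      # 8*16 pixels, 1 bit each -> 16 bytes
HANZI_BYTES = (16 * GLYPH_HEIGHT) // 8     # 16*16 pixels, 1 bit each -> 32 bytes


def library_size(glyph_count, bytes_per_glyph):
    """Size in bytes of a library that stores the glyphs back to back."""
    return glyph_count * bytes_per_glyph


ascii_lib = library_size(128, ASCII_BYTES)     # 2048 bytes = 2KB
hanzi_lib = library_size(8192, HANZI_BYTES)    # 262144 bytes = 256KB

# Assumption: a "320GB" disk holds 320 * 1024**3 bytes.
copies_on_disk = (320 * 1024 ** 3) // hanzi_lib

if __name__ == "__main__":
    print(ascii_lib, hanzi_lib, hanzi_lib // 1024, copies_on_disk)
```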
What does the computer do when we input Chinese characters? There are different types of IMEs, or input methods as we call them. I'm not gonna dig into the algorithms they implement. I'm gonna tell you how characters are fetched from the library with a certain index. The 8192 characters are stored inside the library in a fixed order. GB defines an index table for retrieving these characters by giving each of them a unique address code. In Chinese this is called "区位输入法", which means indexing by zone code and serial code. This index determines the order in which the characters are laid out in the library.
When we type Chinese characters, they appear as complicated images on the screen, but what's inside is much simpler. Let's create a *.txt file and type something into it. For example, input the Chinese characters "中国" (China) without any space or line break. Save the file (in the GB encoding, not Unicode) and check its properties. What's its size? 4 bytes, right? How come two complex Chinese characters take only four bytes? I'll tell you how this is done inside the computer. (And actually the four specific bytes will be 0xD6,0xD0,0xB9,0xFA if you're curious enough to look into the file with WinHEX or UltraEdit.)
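If you don't have a hex editor at hand, a few lines of Python show the same four bytes. This is only a sketch: it uses Python's built-in "gb2312" codec as a stand-in for whatever your text editor does when it saves in the GB encoding, and gb_bytes is a name I made up.

```python
def gb_bytes(text):
    """Encode text with Python's built-in gb2312 codec and list the bytes in hex."""
    data = text.encode("gb2312")
    return ["0x%02X" % b for b in data]


if __name__ == "__main__":
    codes = gb_bytes("中国")
    print(len(codes))        # 4
    print(",".join(codes))   # 0xD6,0xD0,0xB9,0xFA
```

Two bytes per Chinese character, and plain ASCII text still comes out as one byte per letter.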
Take 国 as the example again. Like I said, it is stored as 0xB9,0xFA inside the machine, which we call the "internal code". Its zone code is 0xB9-0xA0=0x19=25 (decimal), and its serial code is 0xFA-0xA0=0x5A=90 (decimal). 0xA0, or 160 (decimal), is the threshold the computer uses to decide whether a stored byte belongs to a Chinese character or is an ASCII letter. When it gets 0xB9, it compares the number with 0xA0 and finds that it's greater than 0xA0, so it treats this byte and the one after it as one Chinese character and does the math below:
absolute address of first byte of character in library = 32*[(first byte-161)*94+(second byte-161)]
-- first byte and second byte are the two bytes of the internal code, i.e. zone code+160 and serial code+160. For 国 they are 0xB9=185 and 0xFA=250.
-- 32 means each character takes 32 bytes.
-- 94 means each zone holds 94 characters (there are 94 zones too, so the table has room for 94*94=8836 entries).
-- second byte-161 is the position of this character inside its zone of the library, counting from 0, or what we call the OFFSET. For 国 it is 250-161=89.
-- (first byte-161)*94+OFFSET is the sequence number of the character in the whole library, counting from 0. For 国, it is 24*94+89=2345, which means 2345 characters come before it in the library. Multiply it by 32, and we get the address 75040. This is where we start to pick up 32 sequential bytes. And they will be 0x00,0x04,0x7F,0xFE......0x40,0x04 just as illustrated above.
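The address formula above can be sketched in a few lines of Python, with the two bytes of the internal code plugged in directly. The function names are made up for the example, and read_glyph assumes the whole library has already been loaded into memory as one bytes object, which a small embedded system may not do.

```python
BYTES_PER_GLYPH = 32       # one 16*16 glyph, 1 bit per pixel
POSITIONS_PER_ZONE = 94    # characters stored in each zone


def zone_and_serial(internal_code):
    """Return (zone code, serial code) for a two-byte internal code."""
    first, second = internal_code
    return first - 0xA0, second - 0xA0


def glyph_offset(internal_code):
    """Address of the first glyph byte inside the library, per the formula above."""
    first, second = internal_code
    index = (first - 161) * POSITIONS_PER_ZONE + (second - 161)
    return BYTES_PER_GLYPH * index


def read_glyph(library, internal_code):
    """Pick up the 32 sequential glyph bytes from an in-memory library."""
    start = glyph_offset(internal_code)
    return library[start:start + BYTES_PER_GLYPH]


if __name__ == "__main__":
    guo = b"\xB9\xFA"               # internal code of 国
    print(zone_and_serial(guo))     # (25, 90)
    print(glyph_offset(guo))        # 75040
```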
On the other hand, an ASCII character takes one byte in a *.txt file. This byte is its index in the ASCII table, e.g. A is 0x41. The computer looks it up in the ASCII library pretty much the same way it does with a Chinese character, but the calculation is much simpler: the index is multiplied by 16, instead of 32, because each image takes 16 bytes. For A that gives 0x41*16=1040, provided the library stores all 128 codes in order starting from code 0.
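Putting the two cases together, here is a rough Python sketch of how a program could walk through the bytes of a text file and tell ASCII letters from Chinese characters with the 0xA0 threshold. split_characters is a made-up name, and the handling of a stray high byte at the very end of the data is my own assumption (it is kept as a single byte).

```python
def split_characters(data):
    """Split GB-encoded bytes into one-byte ASCII codes and two-byte Chinese codes."""
    chars = []
    i = 0
    while i < len(data):
        # A byte above 0xA0 starts a two-byte Chinese character.
        if data[i] > 0xA0 and i + 1 < len(data):
            chars.append(data[i:i + 2])
            i += 2
        else:
            # Anything else (or a lone trailing byte) is taken as one byte.
            chars.append(data[i:i + 1])
            i += 1
    return chars


if __name__ == "__main__":
    for code in split_characters(b"A\xD6\xD0\xB9\xFA"):
        print(len(code), " ".join("0x%02X" % b for b in code))
```

For the bytes of "A中国" this gives three pieces: one byte for A, then two bytes each for 中 and 国.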
So we've come all the way to this understanding: a Chinese library of 8192 characters at 16*16 pixels each takes up 256KB of memory inside a digital device, and our texts are stored inside the computer at two bytes per Chinese character and one byte per ASCII code. Most importantly, ONLY ONE LIBRARY IS SUFFICIENT IN A DIGITAL DEVICE TO DISPLAY ALL THE CHARACTERS IT SUPPORTS!! I don't think anyone would mind sparing 256KB of extra space to support 8192 Chinese characters.
Someone said that at most U.N. conferences, printed documents with the same content come in different thicknesses, and the one typed in Chinese is always the thinnest. Written Chinese is brief. I've seen so many English articles translated into Chinese at less than half the original length. So I say, given that one Chinese character takes twice the space an English letter does, a Chinese digital document takes no more space than an English one, or at least not much more. Consequently, space isn't an issue that will cause Chinese characters to die out. Neither is input efficiency. My cousin used to work as a typist, and she could type an article faster than someone could read it aloud. So, my point is, Chinese characters will last. Won't you agree?