BTW, the SimSun fonts include Extension A of Unicode 3.0, so you can display the first 6000 or so characters, which originally showed up as rectangular boxes. Had you not selected the SimSun font, you could have scrolled down and found that one of the first readable characters is the character yi/yat/yit/ichi/itsu/hitotsu/il/one. That's what gave the game away for me.
Cheers,
Dyl.
Encoding the Database
Re: Encoding the Database
Hi,
The description that Dylan gave for opening the file in Open Office is exactly how I tested it. I have no additional directions, except that I tested it in OO v1.1 (this is probably also the version you used).
The font I used is a proprietary font based on Song with ExtA and ExtB. But even fonts such as MingLiU or Arial Unicode MS should enable you to display most of the file (except ExtA).
You are also right about the bug in the last part of the file; it was caused by a scripting error. I corrected it in a newer version:
http://www.chineselanguage.org/CCDICT/S ... 4.0.tar.gz
or the spreadsheet
http://www.chineselanguage.org/CCDICT/S ... -4.4.0.sxc
Sorry for the inconvenience.
Regards
Re: Encoding the Database
Thanks, Dylan, for the additional help.
I toyed around with it for a couple of hours on Open Office, but to no avail.
The best I can get is that the first 6000-plus characters are question marks or blank, but the characters after that are there.
What I did during those couple of hours was change the default language setting used when an application does not support Unicode. I switched it between Japanese (my default), English, and each variety of Chinese. This made a difference in whether the first 6000-plus characters were blanks, boxes or question marks, but regardless of the font (Unicode, SimSun, etc.), those first characters just don't show.
I'm pretty frustrated at this point. I think what I'll do is send the file to a fellow graduate student who is pretty knowledgeable about encoding systems. He said he'd take a look at it for me.
Still, if anyone has any other ideas, I would like to hear them, to help in the future and for other people.
Best
Benjamin Barrett
Graduate Student
Department of Linguistics, University of Washington
Re: Encoding the Database
Hi Benjamin,
Don't worry about the first 6000 chars. They are from ExtA. You probably will not need them. You really need an ExtA-supporting font to view these (most fonts do not support them).
In addition, there is not much pronunciation data in these records. My suggestion is to stick with the viewable part, which is probably the part you will actually need.
Regards,
Re: Encoding the Database
Hi Benjamin,
I pretty much agree with Thomas: the Extension A characters are rarer, or just variants. Some of them are characters that were missing from the original Chinese GB and Big5 standards and are used to provide a direct conversion from traditional to simplified characters.
Unless you are working with ancient texts, you may not need those rare characters. For work on characters in current popular usage, those that appear in Unicode 2.1 (i.e. the 21000 or so characters you can see) are pretty much all you need.
Cheers,
Dyl.
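To make the split between the invisible and the visible rows concrete, here is a minimal VB sketch that names the block a code point falls in. It assumes the file is sorted by code point, which would put Extension A (U+3400 to U+4DB5 as of Unicode 3.0) ahead of the Unicode 2.1 ideographs (U+4E00 to U+9FA5). The function name CjkBlockName is hypothetical and is not part of the database or of Open Office.

```vb
' Hypothetical helper: names the CJK block a code point belongs to.
' The ranges are the assigned ranges as of Unicode 3.0/3.1.
Public Function CjkBlockName(ByVal cp As Long) As String
    Select Case cp
        Case &H3400& To &H4DB5&
            ' Needs an ExtA-capable font such as SimSun
            CjkBlockName = "Extension A"
        Case &H4E00& To &H9FA5&
            ' The 20,902 ideographs already present in Unicode 2.1
            CjkBlockName = "Unified Ideographs (Unicode 2.1)"
        Case &H20000& To &H2A6D6&
            ' 5-digit code points, stored as surrogate pairs on Windows
            CjkBlockName = "Extension B"
        Case Else
            CjkBlockName = "Other"
    End Select
End Function
```

The trailing & on each hex literal forces a Long; without it, VB reads a 4-digit value such as &H9FA5 as a negative 16-bit Integer and the range tests fail. The character for "one" mentioned earlier is U+4E00, the first code point of the second range, which is why it is among the first readable characters.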
Re: Encoding the Database
Doh! Now I feel a lot better.
At this point in time, I'm going to just look at some sound correspondences of the modern languages to get my feet wet.
If I have any updates or additional information, I'll write back to the forum.
Thanks for all the help, Thomas and Dylan.
Re: Encoding the Database
I'm using OpenOffice 1.1.3 and the Chinese characters are missing. All I get are funny-looking symbols instead. How do I fix this? Any ideas?
Port of Unihan to Excel
I've had some success porting the Unihan DB to Excel. So far I'm able to show the 4 digit U+nnnn characters but am having trouble with the 5 digit characters.
I'm using this VB macro to resolve the characters:
Sub aaa()
    '20050626, sunwukong (AT) povn(dot)com (Pat kirol)
    'Put the 4-digit Unicode values in column B,
    'then run this script and it will insert the characters in column C.
    'Here n runs from 1 to 6; change 6 to the last row of
    '4-digit Unicode values.
    Dim n As Long
    For n = 1 To 6
        Debug.Print n   'progress trace in the Immediate window
        Cells(n, 3).Value = ChrW("&H" & Cells(n, 2).Value)
    Next n
End Sub
What is even more interesting is that you cannot get the characters to display consistently and must edit the font used for each character. In my case I am editing in a lookup code for the kTaiwanTelegraph field (CCT or CTC). I'm up against a brick wall because CCT=5983 is a 5-digit Unicode value and this routine will not handle it.
Re: Port of Unihan to Excel
sunwukong wrote: I've had some success porting the Unihan DB to Excel. So far I'm able to show the 4 digit U+nnnn characters but am having trouble with the 5 digit characters.
If you are using an MS Windows system, you need to insert 5-digit (U+nnnnn) characters as a surrogate pair (recent versions of Office support them).
'*----------------------------------------------------------*
'* Name : vbShiftRight *
'*----------------------------------------------------------*
'* Purpose : Shift 32-bit integer value right 'n' bits. *
'*----------------------------------------------------------*
'* Parameters : Value Required. Value to shift. *
'* : Count Required. Number of bit positions to *
'* : shift value. *
'*----------------------------------------------------------*
'* Description: This function is equivalent to the 'C' *
'* : language construct '>>'. *
'*----------------------------------------------------------*
Public Function vbShiftRight(ByVal Value As Long, _
                             Count As Integer) As Long
    Dim i As Integer
    vbShiftRight = Value
    For i = 1 To Count
        vbShiftRight = vbShiftRight \ 2
    Next
End Function
'*----------------------------------------------------------*
'* Name : WriteSurrogate *
'*----------------------------------------------------------*
'* Purpose : Returns a surrogate pair of ISO10646:1993 *
'* : CJK Extension B codepoints *
'*----------------------------------------------------------*
'* Parameters : Codepoint Required. 5-digit string to be *
'* : converted. *
'*----------------------------------------------------------*
'* Description: Based on the C++ conversion algorithm. *
'*----------------------------------------------------------*
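Function WriteSurrogate(Codepoint As String) As String
    Dim Code As Long, highsur As Long, lowsur As Long
    ' The trailing "&" makes Val read the hex digits as a Long
    Code = Val("&H" & Codepoint & "&")
    ' High (leading) surrogate: bits above the low 10, plus &HD7C0
    highsur = vbShiftRight(Code, 10) + &HD7C0&
    ' Low (trailing) surrogate: &HDC00 plus the low 10 bits
    lowsur = &HDC00& Or (Code And &H3FF)
    WriteSurrogate = ChrW(highsur) & ChrW(lowsur)
End Function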
Did not test the code for typos.
Good luck,
Thomas
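For anyone following along, the column macro and the surrogate function above could be combined so that one routine handles both 4-digit and 5-digit values. What follows is only a minimal, untested sketch. It assumes column B holds the hex code points stored as text (otherwise Excel may read an entry such as 4E00 as a number in scientific notation), and the names CodepointToString and FillCharacters are hypothetical, not taken from either post.

```vb
' Hypothetical helper: turns a hex code point string such as "4E00" or
' "20000" into the string Excel stores for that character.
Public Function CodepointToString(ByVal HexCode As String) As String
    Dim cp As Long, offset As Long
    Dim highSur As Long, lowSur As Long

    ' The trailing "&" makes Val read the hex digits as a Long
    cp = Val("&H" & HexCode & "&")

    If cp <= &HFFFF& Then
        ' 4-digit code points fit in one 16-bit unit
        CodepointToString = ChrW(cp)
    Else
        ' 5-digit code points need a surrogate pair.
        ' Worked example for U+20000: offset = &H10000,
        ' highSur = &HD800 + &H40 = &HD840, lowSur = &HDC00
        offset = cp - &H10000
        highSur = &HD800& + (offset \ &H400)
        lowSur = &HDC00& + (offset And &H3FF)
        CodepointToString = ChrW(highSur) & ChrW(lowSur)
    End If
End Function

' Hypothetical macro: fills column C from the hex values in column B.
Sub FillCharacters()
    Dim n As Long, lastRow As Long
    lastRow = Cells(Rows.Count, 2).End(xlUp).Row
    For n = 1 To lastRow
        Cells(n, 3).Value = CodepointToString(CStr(Cells(n, 2).Value))
    Next n
End Sub
```

Subtracting &H10000 and adding &HD800 gives the same high surrogate as shifting right by 10 and adding &HD7C0, so this agrees with WriteSurrogate for 5-digit input. The sketch does no range checking; values above 10FFFF or non-hex text in column B would need handling before real use.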