Does anyone have a Big5-Hokkien input listing? I wish to create an annotator like the ones I have done for Mandarin, Cantonese and Hakka. You can find those at
http://www.sungwh.freeserve.co.uk/misc/dialect.htm
It is best if it is in the form
[char1] [pronunciation1]
[char1] [pronunciation2]
[char1] [pronunciation3]
[char2] [pronunciation1]
[char2] [pronunciation2]
馬 ma1
長 chang1
長 zhang3
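A listing in this shape is straightforward to load into a lookup table. The following is only a minimal sketch, assuming the text has already been decoded (from Big5 or otherwise) and that the character and its pronunciation are separated by whitespace; `load_listing` is a hypothetical helper name, not part of any existing annotator.

```python
from collections import defaultdict

def load_listing(lines):
    """Map each character to its readings, in order of first appearance."""
    readings = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        char, pron = parts
        if pron not in readings[char]:
            readings[char].append(pron)
    return dict(readings)

# The three example lines from the post above.
sample = ["馬 ma1", "長 chang1", "長 zhang3"]
table = load_listing(sample)
print(table["長"])
```

Keeping one reading per line means a character with several pronunciations simply appears on several lines, which is why the table maps to a list.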
Cheers,
Dyl.
Hokkien Input listing request
Re: Hokkien Input listing request
Hi Dylan,
you might already know the monumental Chinese Dialects Database at
http://starling.rinet.ru/cgi-bin/query. ... hina%5Cdoc
Their dialect data for 20 dialects originally came from here:
http://www.lang.cityu.edu.hk/chinese/doc.html
The entire database is available on this site as a single Word document (or in text format). The data format looks a bit scary at first, but according to their explanations in DOCUSE7.doc it's simply
<character>-<dialect>-<number of pron. var.>-<tone register>-<shang/xia>-<initial>-<vowel>-<vowel>-<vowel>-<nasalization>-<final>-<literary or colloq. pronunciation>
Now comes the downside (apart from the parsing to fish out a particular dialect): The pronunciation is encoded in IPA, so the font provided on their web site is needed (you'll need that for the word document, too). But maybe that is not as difficult to get into the code as I imagine it to be.
Hope this helps.
Best regards,
Aurelio
Re: Hokkien Input listing request
Hmm, I do have that file. I was gonna try and separate the stuff, but their encoding of IPA is bound by the codes in their own font, DOCIPA.ttf I recall.
I'll try and separate out the stuff, but the number of unique Big5 characters in the list is small IIRC. If I had a ready listing I could have an annotator out in half a day. The other thing about a recognisable romanisation is that it is a lot easier for its users to read than IPA.
Cheers,
Dyl..
Re: Hokkien Input listing request
After doing some programming, I find there are 2713 character entries (though fewer individual characters than this). There are 19 dialects, plus an entry which gives the fanqie spelling, so counting this as 20 dialects in total gives 54260 lines. However, the download has 55409 lines in total; the extra lines are mostly literary readings.
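The arithmetic can be checked quickly. This sketch uses only the counts quoted in the post; the variable names are mine.

```python
entries = 2713        # character entries reported in the post
varieties = 20        # 19 dialects plus the fanqie entry
total_lines = 55409   # lines in the download

expected = entries * varieties   # one line per entry per variety
extra = total_lines - expected   # lines beyond that baseline
print(expected, extra)  # 54260 1149
```

So roughly 1149 lines are additional readings beyond one per character per variety.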
The information in each line is laid out as described below. First, the dialect codes:
北 Beijing
濟 Jinan
西 Xi'an
太 Taiyuan
漢 Hankou
成 Chengdu
揚 Yangzhou
蘇 Suzhou
溫 Wenzhou
長 Changsha
雙 Shuangfeng
南 Nanchang
梅 Meixian
廣 Guangzhou
廈 Xiamen
潮 Chaozhou
福 Fuzhou
上 Shanghang?
中 Zhongshan?
(I'm not sure about the latter two though, and there seem to be some odd entries in these which are at odds with the following)
[char][dialect][variant][tone][ying/yang][initial][medial][nucleus][offglide][nasalisation][ending][literary]
[char] BIG5 Chinese character
[dialect] as above
[variant] 0,1,2,... 0 if no variant readings, 1, 2 define different readings (see literary)
[tone] 1,2,3,4 representing ping, shang, qu, ru tone classes
[ying/yang] 1,2 representing ying and yang respectively, and blank if there is no split, 3 if further splitting
[initial] initial consonant, blank if zero initial
[medial] medial, blank if no medial
[nucleus] main vowel, blank if syllabic consonant is the rime
[offglide] usually u, though not always, blank if none
[nasalisation]
[ending] final consonant ending, m, n, ng, p, t, k, ? (glottal stop) or other
[literary] Chinese character indicating a literary reading
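To get from records like this to a per-dialect listing, something along these lines might work. It is a sketch only: I am assuming the twelve fields arrive tab-separated (the real file may delimit them differently), the field names are my own labels for the slots above, and the sample record is built from the field descriptions rather than copied from the database.

```python
from collections import namedtuple

# Field order as described in the post; the names are hypothetical labels.
FIELDS = ("char", "dialect", "variant", "tone", "register", "initial",
          "medial", "nucleus", "offglide", "nasalisation", "ending", "literary")
Reading = namedtuple("Reading", FIELDS)

def parse_record(line, sep="\t"):
    """Split one record into named fields. The separator is an assumption."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return Reading(*parts)

def listing_line(r):
    """Render '[char] [pronunciation]' with the tone and ying/yang digits appended.
    The nasalisation and literary fields are ignored in this sketch."""
    syllable = r.initial + r.medial + r.nucleus + r.offglide + r.ending
    return f"{r.char} {syllable}{r.tone}{r.register}"

def listing_for(lines, dialect):
    """Return listing lines for one dialect code, e.g. '廈' for Xiamen."""
    records = (parse_record(line) for line in lines if line.strip())
    return [listing_line(r) for r in records if r.dialect == dialect]

# Illustrative record only, not real data from the database.
sample = "\t".join(["馬", "北", "0", "2", "", "m", "", "a", "", "", "", ""])
print(listing_for([sample], "北"))  # ['馬 ma2']
```

Passing `"廈"` instead of `"北"` would select the Xiamen lines, which is the closest thing in the file to a Hokkien listing.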
Thomas Chin's dictionary on this site has links to DOC entries in this site.
Cheers,
Dyl.
Re: Hokkien Input listing request
I believe 中 is Middle Chinese. Not sure what 上 is, but I seem to recall it is Shanghainese, which is a new addition to the database.
andrew
Re: Hokkien Input listing request
Middle Chinese? That would make a great deal of sense, but after looking at the pronunciations, there are some oddities about it. For instance, it does not give final endings in the Ru tone, yet it still marks the tone as 4, corresponding to Ru. This is rather puzzling. Rummaging around my downloads, though, I found that the file DOCUSE7.DOC has a reference to Zhongyuan Yinyun, which I think must fit the bill, since in ZYYY (written by Zhou Deqing of the Yuan Dynasty) the ru endings had already been lost. I think this is what "Zhong" refers to.
As for "Shang", yes, it's Shanghai dialect. After a quick look at the pronunciations, I recognise some of the vowels that I've seen in phonetic descriptions of the dialect, like the Y vowel.
After converting the pronunciation codes in docmas9.txt (which depend on the docipa.ttf font) into IPA, I now have a 7.5 megabyte file. It is too big to put online, since I only have dial-up access. I was thinking of splitting the whole lot into single-character entry files, making an index, and taking a few days to upload it. Is anyone interested?
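The splitting step could be sketched roughly as follows. This assumes the converted file is UTF-8 text with the character as the first code point of each line; the function name, the `U<codepoint>.txt` file naming and the `index.txt` layout are all my own choices for illustration.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def split_by_character(lines, out_dir):
    """Write one file per character plus index.txt; return {char: filename}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line:
            groups[line[0]].append(line)  # assumption: character comes first
    index = {}
    for char, entries in groups.items():
        name = f"U{ord(char):04X}.txt"  # name each file by code point
        (out_dir / name).write_text("\n".join(entries) + "\n", encoding="utf-8")
        index[char] = name
    index_lines = [f"{c}\t{n}" for c, n in sorted(index.items())]
    (out_dir / "index.txt").write_text("\n".join(index_lines) + "\n", encoding="utf-8")
    return index

# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_index = split_by_character(["馬 a\n", "長 b\n", "長 c\n"], tmp)
    print(demo_index)
```

Small per-character files also mean an interrupted dial-up upload only has to resume from the last file sent.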
Dyl.