Hokkien Input listing request

Discussions on the Hokkien (Minnan) language.
Dylan Sung

Post by Dylan Sung »

Does anyone have a Big5-Hokkien input listing? I would like to create an annotator like the ones I have done for Mandarin, Cantonese and Hakka. You can find those at

http://www.sungwh.freeserve.co.uk/misc/dialect.htm

Ideally the listing would be in the following form, one reading per line, with a Mandarin example after the template:

[char1] [pronunciation1]
[char1] [pronunciation2]
[char1] [pronunciation3]
[char2] [pronunciation1]
[char2] [pronunciation2]

馬 ma3
長 chang2
長 zhang3
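
A listing in this form is easy to load. The following is only a minimal sketch, not the annotator itself: the names `load_readings` and `annotate` are hypothetical, and it assumes one character and one reading per line, separated by whitespace. A file saved in Big5 could be read with Python's built-in `open(path, encoding="big5")`.

```python
from collections import defaultdict


def load_readings(lines):
    """Map each character to the list of its readings, in file order."""
    readings = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        char, pron = parts
        readings[char].append(pron)
    return dict(readings)


def annotate(text, readings):
    """Follow each known character with its readings in parentheses."""
    out = []
    for ch in text:
        if ch in readings:
            out.append(f"{ch}({'/'.join(readings[ch])})")
        else:
            out.append(ch)  # unknown characters pass through unchanged
    return "".join(out)


# Illustrative Mandarin entries in the same "[char] [pronunciation]" form.
sample = ["行 xing2", "行 hang2", "人 ren2"]
table = load_readings(sample)
print(annotate("行人", table))  # prints 行(xing2/hang2)人(ren2)
```

Keeping every reading in a list per character means multiple pronunciations are preserved in file order rather than overwritten.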

Cheers,
Dyl.
Aurelio

Re: Hokkien Input listing request

Post by Aurelio »

Hi Dylan,

you might already know the monumental Chinese Dialects Database at

http://starling.rinet.ru/cgi-bin/query. ... hina%5Cdoc

Their dialect data for 20 dialects originally came from here:

http://www.lang.cityu.edu.hk/chinese/doc.html

The entire database is available on this site as a single Word document (or in text format). The data format looks a bit scary at first, but according to their explanations in DOCUSE7.doc it's simply

<character>-<dialect>-<number of pron. var.>-<tone register>-<shang/xia>-<initial>-<vowel>-<vowel>-<vowel>-<nasalization>-<final>-<literary or colloq. pronunciation>

Now comes the downside (apart from the parsing needed to fish out a particular dialect): the pronunciation is encoded in IPA, so the font provided on their web site is needed (you'll need it for the Word document, too). But maybe that is not as difficult to handle in code as I imagine.

Hope this helps.

Best regards,

Aurelio

Aurelio

Re: Hokkien Input listing request

Post by Aurelio »

And, er, yes, it's in Big5.

Regards,
Aurelio
Dylan Sung

Re: Hokkien Input listing request

Post by Dylan Sung »

Hmm, I do have that file. I was going to try to separate the stuff, but their encoding of IPA is bound to the codes in their own font, DOCIPA.ttf, as I recall.

I'll try to separate it out, but the number of unique Big5 characters in the list is small, IIRC. If I had a ready listing I could have an annotator out in half a day. The other thing is that a recognisable romanisation is a lot easier for its users to read than IPA.

Cheers,
Dyl..
Dylan Sung

Re: Hokkien Input listing request

Post by Dylan Sung »

After some programming, I count 2713 character entries (though fewer distinct characters than that). There are 19 dialects plus an entry giving the fanqie spelling, so treating this as 20 varieties gives 2713 × 20 = 54260 lines. However, the download contains 55409 lines in total; the extra lines are mostly literary readings.
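
As a quick check of that arithmetic, here is a small sketch using only the counts quoted in this post (the variable names are just for illustration):

```python
entries = 2713        # character entries counted above
varieties = 19 + 1    # 19 dialects plus the fanqie entry
expected = entries * varieties
actual = 55409        # total lines in the download
extra = actual - expected

print(expected)  # prints 54260
print(extra)     # prints 1149
```

So about 1149 lines lie beyond the one-line-per-variety baseline.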

Each line identifies its dialect by one of the following characters:

北 Beijing
濟 Jinan
西 Xi'an
太 Taiyuan
漢 Hankou
成 Chengdu
揚 Yangzhou
蘇 Suzhou
溫 Wenzhou
長 Changsha
雙 Shuangfeng
南 Nanchang
梅 Meixian
廣 Guangzhou
廈 Xiamen
潮 Chaozhou
福 Fuzhou
上 Shanghang?
中 Zhongshan?

(I'm not sure about the last two, though, and there seem to be some odd entries for them which are at odds with the following.)

[char][dialect][variant][tone][ying/yang][initial][medial][nucleus][offglide][nasalisation][ending][literary]

[char] BIG5 Chinese character
[dialect] as above
[variant] 0,1,2,... 0 if no variant readings, 1, 2 define different readings (see literary)
[tone] 1,2,3,4 representing ping, shang, qu, ru tone classes
[ying/yang] 1,2 representing ying and yang respectively, and blank if there is no split, 3 if further splitting
[initial] initial consonant, blank if zero initial
[medial] medial, blank if no medial
[nucleus] main vowel, blank if syllabic consonant is the rime
[offglide] usually u, though not always, blank if none
[nasalisation]
[ending] final consonant ending, m, n, ng, p, t, k, ? (glottal stop) or other
[literary] Chinese character indicating a literary reading
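
The field list above could be modelled roughly as follows. This is only a sketch under stated assumptions: the names `Reading`, `make_record` and `only_dialect` are hypothetical, the raw file's delimiters or column widths are not described here, so the code starts from fields that have already been separated, and the sample record uses made-up values rather than data from the database.

```python
from dataclasses import dataclass

# Only the Min varieties from the dialect list; extend as needed.
MIN_DIALECTS = {"廈": "Xiamen", "潮": "Chaozhou", "福": "Fuzhou"}


@dataclass
class Reading:
    char: str
    dialect: str
    variant: str
    tone: str
    register: str       # the [ying/yang] field
    initial: str
    medial: str
    nucleus: str
    offglide: str
    nasalisation: str   # encoding not described above, so kept as-is
    ending: str
    literary: str

    def syllable(self):
        """Join the segmental fields; blank fields simply drop out."""
        return (self.initial + self.medial + self.nucleus
                + self.offglide + self.ending)


def make_record(fields):
    """Build a Reading from 12 already-separated field strings."""
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    return Reading(*(f.strip() for f in fields))


def only_dialect(records, code):
    """Keep only the records for one dialect code, e.g. 廈 for Xiamen."""
    return [r for r in records if r.dialect == code]


# Made-up values for illustration only, not taken from the database.
rec = make_record(["馬", "廈", "0", "2", "", "m", "", "a", "", "", "", ""])
print(MIN_DIALECTS[rec.dialect], rec.syllable())  # prints Xiamen ma
```

Filtering with `only_dialect(records, "廈")` would then fish out the Xiamen readings needed for a Hokkien listing.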


Thomas Chin's dictionary on this site has links to DOC entries in this site.

Cheers,
Dyl.
Andrew Yong

Re: Hokkien Input listing request

Post by Andrew Yong »

I believe 中 is Middle Chinese. Not sure what 上 is, but I seem to recall it is Shanghainese, which is a new addition to the database.

andrew
Dylan Sung

Re: Hokkien Input listing request

Post by Dylan Sung »

Middle Chinese? That would make a great deal of sense. However, after looking at the pronunciations, there are some oddities. For instance, it gives no final endings in the Ru tone, yet still marks the tone as 4, corresponding to Ru. This is rather puzzling. Rummaging through my downloads, though, I found a file called DOCUSE7.DOC which refers to the Zhongyuan Yinyun, and I think that must fit the bill, since in ZYYY (written by Zhou Deqing of the Yuan Dynasty) the ru-tone endings had been lost. I think this is what "Zhong" refers to.

As for "Shang": yes, it's the Shanghai dialect. After a quick look at the pronunciations, I recognise some of the vowels I've seen in phonetic descriptions of the dialect, like the Y vowel.

After converting the pronunciation codes in docmas9.txt (which depend on the docipa.ttf font) into IPA, I now have a 7.5 megabyte file. It is too big to put online in one piece, since I only have dial-up access. I was thinking of splitting the whole lot into single-character entry files, making an index, and taking a few days to upload it. Is anyone interested?
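
The splitting step might look something like this. It is a rough sketch, not what I have actually run: the function name `split_by_character`, the file naming by code point and the tab-separated `index.txt` are all assumptions, and the demo lines are placeholders rather than real entries.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_character(lines, out_dir):
    """Write one file per head character and return {char: filename}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line:
            groups[line[0]].append(line)  # first character is the entry key

    index = {}
    for char, entries in groups.items():
        name = f"U{ord(char):04X}.txt"    # file name from the code point
        (out_dir / name).write_text("\n".join(entries) + "\n", encoding="utf-8")
        index[char] = name

    # Plain-text index: one "character<TAB>filename" pair per line.
    index_text = "".join(f"{c}\t{n}\n" for c, n in sorted(index.items()))
    (out_dir / "index.txt").write_text(index_text, encoding="utf-8")
    return index


# Placeholder lines standing in for real entries.
with tempfile.TemporaryDirectory() as tmp:
    idx = split_by_character(["馬 entry-a", "長 entry-b", "長 entry-c"], tmp)
    print(len(idx))  # prints 2
```

Small per-character files can be uploaded a few at a time, which suits a slow connection better than one 7.5 megabyte file.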

Dyl.