manum, posted August 27, 2005:
Hi, does anyone here know of a programming library or similar that breaks Thai words into syllables? I am planning to make a Finnish-Thai-Finnish dictionary and I would like to have automated transliteration of Thai words. I found something on Chula's pages, but the links were broken. I have already found a Perl library that breaks a sentence into words, but now I need to break each word into syllables. Thanks!
Oswulf, posted August 31, 2005:
If you're using Microsoft Windows, the obvious way to do this is to use the built-in Uniscribe capabilities. For API details see: http://www.microsoft.com/typography/develo...e/uniscribe.htm
Oswulf, posted September 5, 2005:
On reflection, breaking Thai words into syllables that correspond to the spoken language is simply not possible. There are plenty of Thai words (several hundred) where individual letters are "double function", such as ผลไม้ (phǒnlamáay - fruit). This is written with two syllables, but pronounced with three. There are also words in which the vowel elements of the diphthong wrap around the first syllable, such as เขมือบ (kha mʉ̀ʉap - swallow), which therefore can't be split.
Richard W, posted September 5, 2005:
Quote: "I am wondering if anyone here knows any programming libraries or such to break up Thai words into syllables? I am planning to make Finnish-Thai-Finnish dictionary and I would like to have automated transliteration of Thai words."
I'm not sure if there is any significant research currently going on into accurate automatic phonetic transcription. I've seen papers on it - nominally aiming at speech synthesis - and I know Mike (a poster here) of thai2english.com and Glenn Slayden of www.thai-language.com have put a lot of effort into making automatic transcriptions. I suspect the techniques are something of a trade secret.
In one sense it cannot be done - there are a few homographs which are pronounced quite differently, like เพลา - meaning 'time, meal' it is [M]phee[M]laa; meaning 'axis' it is [M]phlao. There are also subtle differences, such as ตนุ - [L]ta[L]nu meaning 'turtle' and [L]ta[H]nu meaning 'stem, I'.
Abandon, posted September 6, 2005:
In theory Thai would be a perfect language for speech recognition, as it is essentially a monosyllabic language and has clear pronunciation (until you move too far out of Bangkok) ... nothing to do with it, I know ...
Richard W, posted September 11, 2005 (edited September 11, 2005 by Richard W):
Quote: "In theory Thai would be a perfect language for speech recognition as it is essentially a mono-syllabic language, and has clear pronunciation (until you move too far out of Bangkok)"
Is it truly much more monosyllabic than English? There are a great many words that fit a simple sesquisyllabic mould (C(C)aC(C)V(C)), and I think that's a better way to look at it.
When you say a clear pronunciation, I presume you mean after the elimination of consonant clusters in /l/ and /r/, and are likewise speaking after the merger of /r/ and /l/, and of the tones of the presyllables of the sesquisyllables above. What about the change /khw/ > /f/ - that's even seeped into the angry English of a Bangkok lady I know. I'm sure you appreciate how commonly phrases get slurred - I was startled when I first heard จะเอา said as เจอา.
Richard.
manum, posted October 6, 2005:
Hi,
As I didn't find any code snippets or anything similar, I started making a Thai syllable breaker of my own. It can be found here (with an example word): http://wk.fi/thai/trans.php?w=%CA%A1%BB%C3%A1
There are still a lot of errors, and not even all the consonants and vowels are in.
Now I am wondering what the rule is for ignoring gaa-ran (ร์ for example) efficiently. There are words where only the consonant below it should be ignored, but also words where two consonants must be ignored.
Examples with one consonant ignored at the end: สัตว์ รถยนต์
And in these cases I need to know how to ignore BOTH consonants: ภาพยนตร์ ถ่ายภาพยนตร์
And many others... help is appreciated.
ProfessorFart, posted October 6, 2005:
Quote: "And in these cases I should know how to ignore BOTH consonants: ภาพยนตร์ ถ่ายภาพยนตร์"
Ignoring the last two consonants when the 'garan' is only over the last one is quite easy to handle, as it is not very common (it only occurs in certain words of Pali and Sanskrit origin), and as far as I can recall it is only words ending in 'ตร์' that do this. I could be mistaken, but I can't think of any other instances.
manum, posted October 6, 2005:
Quote: "it is only words which end with 'ตร์' which do this as far as I can recall. I could be mistaken but I can't think of any other instances."
For example: สัตว์
Anyway, this word could easily be worked out because of the 'a' in สั...
ProfessorFart, posted October 6, 2005:
That is pronounced 'Sat' and the final 'ว' is unpronounced.
manum, posted October 6, 2005:
Quote: "That is pronounced 'Sat' and the final 'ว' is unpronounced."
Yep, I know... but I am trying to think of a rule for how to NOT ignore ต in this word.
ProfessorFart, posted October 6, 2005:
Apologies if I didn't make it clear in my earlier post. As far as I know, the only time two final consonants are unpronounced when there is only one garan is when you find 'ตร์' together at the end of a word. In all other cases where you see a garan, it is only the consonant directly beneath it which is left unpronounced, although I stand to be corrected.
E.g. สัตว์: the consonant beneath the garan is a 'ว', therefore the final 'ต' should be pronounced.
ภาพยนตร์: the final two consonants are 'ตร์' together, therefore both should be unpronounced.
Hope this is a little clearer.
mike_l, posted October 6, 2005 (edited October 6, 2005 by mike_l):
Quote: "As far as I know the only time two final consonants are unpronounced when there is only one garan is when you find 'ตร์' together at the end of a word."
That's the most common, but there are other examples where two consonants are unpronounced too - for instance จันทร์, นิรันดร์, ประชาราษฎร์ etc.
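The rule discussed so far (a garan silences the consonant beneath it, plus a preceding dental when the ending is one of the pairs listed above) can be sketched in a few lines. This is only an illustration under stated assumptions: the names `strip_karan` and `SILENT_PAIRS` are hypothetical, the pair list contains only the endings named in this thread, and only a word-final garan with no vowel sign under it is handled.

```python
THANTHAKHAT = "\u0e4c"  # the garan/karan mark, written above a silent consonant

# Two-consonant endings that posters in this thread report as fully silent
# under a single garan. Assumption: the list is illustrative, not complete.
SILENT_PAIRS = ("ตร", "ทร", "ดร", "ฎร")


def strip_karan(word: str) -> str:
    """Drop the letters silenced by a word-final garan."""
    if not word.endswith(THANTHAKHAT):
        return word  # no word-final garan, nothing to trim
    body = word[:-1]  # remove the mark itself
    for pair in SILENT_PAIRS:
        if body.endswith(pair):
            return body[: -len(pair)]  # both consonants are silent
    return body[:-1]  # default: only the consonant under the mark


if __name__ == "__main__":
    for w in ("สัตว์", "รถยนต์", "ภาพยนตร์", "จันทร์", "นิรันดร์"):
        print(w, "->", strip_karan(w))
```

A fixed pair list like this cannot cover words with three silenced consonants, so it should be read as a starting point rather than a complete rule.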
manum, posted October 6, 2005:
By the way, I am planning to make this open source, since everybody who has made something like this tends to keep it secret. So please... help me if possible.
Richard W, posted October 6, 2005:
And there are words with three silenced consonants - ลักษมณ์! Manum's prototype broke on this one - parse-err(03;์). I got the same error on ลักษณ์.
However, there is a rule that will help in this case. If a final ร after a dental is silent, but the dental is pronounced, then there is no karan on the ร. Examples:
One silent letter: เมตร, ลิตร
Two silent letters: see above! Also add พระอินทร์
I'm not sure if the rule also consistently applies to -กร, e.g. จักร [H]jak.
For สัตว์, there are several ways to tell that the ต is not silent:
- In monomorphemic words, อั cannot occur in an open syllable. The compound word ประวัติศาสตร์ [L]pra[L]wa[L]ti[L]saat (?) 'history' shows that the rule needs some condition, but perhaps its spelling does hold a hint to the syllabification of ประวัติ [L]pra[L]wat. (Shame the tone of the final syllable has to be a guess!)
- Trimming off consonants should not expose a vowel (other than diphthongs in -i and/or -u (<o>)), e.g. เสาร์ 'Saturn'. There are a very few examples where a final consonant is silent for no reason, but you have no chance with an unkaranned silent final consonant immediately after a vowel. คัมภีร์ is a peculiar example with karan.
- Implicit silencing does not trim a word back to one consonant.
Which Artificial Intelligence language are you (Manum) using? Prolog? Lisp?
Have you had a look at the ICU (International Components for Unicode) open source code? It includes a large Thai database just for tasks such as line-breaking.
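The first constraint above (อั must keep a final consonant) can at least bound how far back a karan is allowed to trim. The sketch below enumerates the stems that survive that one check. It is a hedged illustration, not a full rule set: `candidate_stems` and `is_consonant` are hypothetical names, the diphthong exception and the one-consonant rule are not implemented, and choosing among the surviving candidates is left open.

```python
THANTHAKHAT = "\u0e4c"   # karan mark
MAI_HAN_AKAT = "\u0e31"  # the short 'a' sign written as อั


def is_consonant(ch: str) -> bool:
    # Thai consonant letters sit in U+0E01..U+0E2E. The range also holds
    # ฤ and ฦ, which this sketch treats as consonants for simplicity.
    return "\u0e01" <= ch <= "\u0e2e"


def candidate_stems(word: str) -> list[str]:
    """List the trimmings of a karan-final word that keep อั closed."""
    if not word.endswith(THANTHAKHAT):
        return [word]
    body = word[:-1]
    stems = []
    for cut in range(1, len(body)):
        if not is_consonant(body[-cut]):
            break  # only a run of consonants can be silenced
        stem = body[:-cut]
        if stem.endswith(MAI_HAN_AKAT):
            break  # อั would be left in an open syllable
        stems.append(stem)
    return stems


if __name__ == "__main__":
    for w in ("สัตว์", "ลักษณ์", "ลักษมณ์"):
        print(w, "->", candidate_stems(w))
```

For สัตว์ the check leaves a single candidate, which matches the point made above; for ลักษณ์ and ลักษมณ์ it only narrows the options, so a lexicon or further rules would still be needed.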
katana, posted October 7, 2005:
Quote: "As far as I know the only time two final consonants are unpronounced when there is only one garan is when you find 'ตร์' together at the end of a word."
If you look in Haas' dictionary, you should find more examples of 2 or 3 silenced consonants with a GARAN การันต์.
Richard W, posted October 7, 2005:
Quote: "จักร [H]jak"
Correction: จักร [L]jak
manum, posted October 8, 2005:
Quote: "Which Artificial Intelligence language are you (Manum) using? Prolog? Lisp?"
I am actually using PHP only, with loops and regexps.
Anyway, I have a question again. Is there some rule for the unwritten (a)? For example โทรศัพท์: how could I know that it should be syllabified as โท ระ ศัพ rather than โทร ศัพท์?
Richard W, posted October 9, 2005:
Quote: "Would there be some rule for unwritten (a)? For example โทรศัพท์? How could I know it's not "syllablified" to โทร ศัพท์ .. instead of โท ระ ศัพ ?"
Semantics. Having decided it's a compound word, you then ask whether it's formed by Thai rules - head + qualifier, or Indic (actually Indo-European, I think - cf. German, English) qualifier + head. In this case, it's qualifier plus head, so you need a link vowel, whence โท ระ ศัพ. These rules aren't 100% - especially with native words involved, e.g. ผลไม้ = ผน ละ ม้าย. You also need a grammatical analysis to decide between ไม้ (effectively the 'construct' form) and ม้าย (the 'absolute' form).
Of course, in this example you could use the fact that the word โทร 'telephone' is pronounced โท ระ.
I often use the cynical rule 'If you don't insert a vowel, how will your listener know that you know what the consonant is?'. I'm not sure why it doesn't work with มารดา. Is it just that the word is far too different from the original form มาตา, มาตร-? Is the pronunciation with -n- just a spelling pronunciation, ignoring the following double-pricing rule on karans:
- In P/S (Pali/Sanskrit) words, silent letters are marked as such if they are followed by a vowel in the original.
- In European words, silent letters are marked as such.
There are a lot of irritating words like สามารถ, with a silent ร before a dental. Sometimes the word is easily misdiagnosed as not P/S, e.g. เกียรติ pron. เกียด / เกียรติ์ pron. เกียน, from Sanskrit กีรฺติ, Pali กิตฺติ. (It has just occurred to me that this may be due to a transient stage in Khmer as final 'r' faded out, cf. English 'beard', though I think the now silent final 'r' only creates one diphthong in modern Khmer.)
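Since the link vowel depends on meaning rather than spelling, one pragmatic option is to look up known components before applying any spelling rule, as suggested above for โทร. The sketch below assumes the compound has already been split into components. The table `EXCEPTIONS`, the function `syllabify_compound`, and the `fallback` parameter are all hypothetical; the two entries are the respellings given in this thread.

```python
from typing import Callable

# Hypothetical exception table: written form -> syllables as respelled in
# this thread. A real table would need many more entries.
EXCEPTIONS: dict[str, list[str]] = {
    "โทร": ["โท", "ระ"],
    "ผลไม้": ["ผน", "ละ", "ม้าย"],
}


def syllabify_compound(
    parts: list[str], fallback: Callable[[str], list[str]]
) -> list[str]:
    """Syllabify pre-split components, preferring table entries."""
    syllables: list[str] = []
    for part in parts:
        # Use the listed reading if there is one, else the rule-based splitter.
        syllables.extend(EXCEPTIONS.get(part) or fallback(part))
    return syllables


if __name__ == "__main__":
    # Placeholder fallback that leaves unknown components unsplit.
    print(syllabify_compound(["โทร", "ศัพท์"], lambda part: [part]))
```

Looking up whole components rather than matching prefixes avoids misfiring on unrelated words that merely begin with the same letters.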