Breaking Thai Words Into Syllables


manum

Hi,

I am wondering if anyone here knows of any programming libraries or the like for breaking Thai words into syllables. I am planning to make a Finnish-Thai-Finnish dictionary and would like automated transliteration of the Thai words.

I found something on Chula's pages, but the links were broken. I have already found a Perl library that breaks a sentence into words, but now I need to break each word into syllables.

Thanks!

Link to comment
Share on other sites

On reflection, breaking written Thai words into syllables that correspond to the spoken language is simply not possible.

There are plenty of Thai words (several hundred) where individual letters are "double function", such as ผลไม้ (phǒnlamáay - fruit). This is written with two syllables, but pronounced with three.

There are also words in which the vowel elements of a diphthong wrap around the first syllable, such as เขมือบ (kha mʉ̀ʉap - swallow), which therefore can't be split that way.
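One way to see why a naive left-to-right split struggles with เขมือบ is to list its code points. This is only a minimal sketch using Python's standard unicodedata module (the variable names are mine): the leading vowel sign เ is stored before the initial consonant ข, while the rest of the same written vowel (ื and อ) comes after the second consonant ม.

```python
import unicodedata

word = "เขมือบ"

# Thai text is stored in written order, so the leading vowel sign
# comes first even though it belongs to the vowel of the second syllable.
names = [unicodedata.name(ch) for ch in word]

for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}  {name}")
```

So any split has to be decided on the whole word, not by cutting the string at a single position.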

Link to comment
Share on other sites

I am wondering if anyone here knows of any programming libraries or the like for breaking Thai words into syllables. I am planning to make a Finnish-Thai-Finnish dictionary and would like automated transliteration of the Thai words.

I'm not sure whether there is any significant research currently going on into accurate automatic phonetic transcription. I've seen papers on it, nominally aimed at speech synthesis, and I know Mike (a poster here) of thai2english.com and Glenn Slayden of www.thai-language.com have put a lot of effort into making automatic transcriptions. I suspect the techniques are something of a trade secret.

In one sense it cannot be done - there are a few homographs which are pronounced quite differently, like เพลา - meaning 'time, meal' it is [M]phee[M]laa; meaning 'axis' it is [M]phlao, as well as subtle differences such as ตนุ - [L]ta[L]nu meaning 'turtle' and [L]ta[H]nu meaning 'stem, I'.
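For a dictionary project, one hedged way around such homographs is to store every reading against its sense and let the dictionary entry pick the right one, not the transliterator. The sketch below is only an illustration: the `Reading` type, the `HOMOGRAPHS` table and the `readings` function are hypothetical names, and the table holds just the two words mentioned above.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    transcription: str  # notation as used in this thread, e.g. [M] = mid tone
    gloss: str


# Hypothetical exception table: one spelling, several pronunciations.
HOMOGRAPHS = {
    "เพลา": [
        Reading("[M]phee[M]laa", "time, meal"),
        Reading("[M]phlao", "axis"),
    ],
    "ตนุ": [
        Reading("[L]ta[L]nu", "turtle"),
        Reading("[L]ta[H]nu", "stem, I"),
    ],
}


def readings(word: str) -> list[Reading]:
    """Return all known readings; an empty list means 'not a listed homograph'."""
    return HOMOGRAPHS.get(word, [])


for r in readings("เพลา"):
    print(r.transcription, "-", r.gloss)
```

An automatic transcriber would consult a table like this first and only fall back to rules for words that are not listed.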

Link to comment
Share on other sites

In theory Thai would be a perfect language for speech recognition, as it is essentially a monosyllabic language and has clear pronunciation (until you move too far out of Bangkok).

Is it truly much more monosyllabic than English? There are a great many words that fit a simple sesquisyllabic mould (C(C)aC(C)V(C)), and I think that's a better way to look at it. When you say a clear pronunciation, I presume you mean after the elimination of consonant clusters in /l/ and /r/, and are likewise speaking after the merger of /r/ and /l/, and of the tones of the presyllables of the sesquisyllables above. What about the change /khw/ > /f/ - that's even seeped into the angry English of a Bangkok lady I know.

I'm sure you appreciate how commonly phrases get slurred - I was startled when I first heard จะเอา said as เจอา.

Richard.

Edited by Richard W
Link to comment
Share on other sites

  • 4 weeks later...

Hi,

As I didn't find any code snippets or anything similar, I started writing a Thai syllable breaker of my own.

It can be found here (with an example word):

http://wk.fi/thai/trans.php?w=%CA%A1%BB%C3%A1
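For anyone puzzled by the `w=` parameter: it looks like a Thai word percent-encoded in TIS-620 and not UTF-8. That encoding is my assumption, not something stated above. Under that assumption, this short Python sketch decodes it with the standard library:

```python
from urllib.parse import parse_qs, urlsplit

url = "http://wk.fi/thai/trans.php?w=%CA%A1%BB%C3%A1"

# Assumption: the query string is percent-encoded TIS-620, not UTF-8.
query = urlsplit(url).query
word = parse_qs(query, encoding="tis-620")["w"][0]

print(word)
```

If the assumption holds, the example word comes out as สกปรก.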

There are still a lot of errors, and not all the consonants and vowels are in yet.

Now I am wondering what the rule is for ignoring gaa-ran (ร์ for example) efficiently. There are words where only the consonant below it should be ignored, but also words where two consonants must be ignored.

Examples where only one consonant at the end is ignored:

สัตว์

รถยนต์

And in these cases I should know how to ignore BOTH consonants:

ภาพยนตร์

ถ่ายภาพยนตร์

And many others.. help is appreciated :o

Link to comment
Share on other sites

And in these cases I should know how to ignore BOTH consonants:

ภาพยนตร์

ถ่ายภาพยนตร์

Ignoring the last two consonants when the 'garan' is over only the last one is quite easy to handle, as it is not very common (it only occurs in certain words of Pali and Sanskrit origin), and as far as I can recall it is only words ending in 'ตร์' that do this. I could be mistaken, but I can't think of any other instances.

Link to comment
Share on other sites

And in these cases I should know how to ignore BOTH consonants:

ภาพยนตร์

ถ่ายภาพยนตร์

Ignoring the last two consonants when the 'garan' is over only the last one is quite easy to handle, as it is not very common (it only occurs in certain words of Pali and Sanskrit origin), and as far as I can recall it is only words ending in 'ตร์' that do this. I could be mistaken, but I can't think of any other instances.

For example:

สัตว์

Anyway..this word could be easily worked out because of 'a' after สั...

Link to comment
Share on other sites

Apologies if I didn't make it clear in my earlier post.

As far as I know, the only time two final consonants are unpronounced when there is only one garan is when you find 'ตร์' together at the end of a word.

In all other cases where you see a garan, it is only the consonant directly beneath which is left unpronounced although I stand to be corrected.

eg. สัตว์

The consonant beneath the garan is a 'ว' therefore the final 'ต' should be pronounced.

ภาพยนตร์

The final two consonants are 'ตร์' together therefore should both be unpronounced.

Hope this is a little more clear.
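The rule as stated in this post can be sketched with two regular expressions. This is a minimal illustration in Python, not the code behind the prototype; `strip_garan` and the constants are hypothetical names, and it covers only the 'ตร์' case described here.

```python
import re

THANTHAKHAT = "\u0e4c"          # the garan mark itself
CONSONANT = "[\u0e01-\u0e2e]"   # ก .. ฮ


def strip_garan(word: str) -> str:
    """Drop the letters silenced by a garan, per the rule in this post."""
    # Special case: 'ตร' + garan silences both consonants.
    word = re.sub("ตร" + THANTHAKHAT, "", word)
    # Default: only the consonant directly beneath the garan is silent.
    word = re.sub(CONSONANT + THANTHAKHAT, "", word)
    return word


for w in ["สัตว์", "รถยนต์", "ภาพยนตร์"]:
    print(w, "->", strip_garan(w))
```

The special case has to run first, otherwise the default rule would consume only the ร and leave a stray ต behind.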

Link to comment
Share on other sites

As far as I know, the only time two final consonants are unpronounced when there is only one garan is when you find 'ตร์' together at the end of a word.

That's the most common, but there are other examples where two consonants are unpronounced too - for instance จันทร์ , นิรันดร์ , ประชาราษฎร์ etc...

Edited by mike_l
Link to comment
Share on other sites

And there are words with three silenced consonants - ลักษมณ์! Manum's prototype broke on this one - parse-err(03;์). I got the same error on ลักษณ์.

However, there is a rule that will help in this case. If a final ร after a dental is silent, but the dental is pronounced, then there is no karan on the ร. Examples:

One silent letter: เมตร, ลิตร

Two silent letters - see above! Also add พระอินทร์

I'm not sure if the rule also consistently applies to -กร, e.g. จักร [H]jak.

For สัตว์, there are several ways to tell that the ต is not silent:

  • In monomorphemic words, อั cannot occur in an open syllable. The compound word ประวัติศาสตร์ [L]pra[L]wa[L]ti[L]saat (?) 'history' shows that the rule needs some condition, but perhaps its spelling does hold a hint to the syllabification of ประวัติ [L]pra[L]wat. (Shame the tone of the final syllable has to be a guess!)
  • Trimming off consonants should not expose a vowel (other than diphthongs in -i and/or -u (<o>)), e.g. เสาร์ 'Saturn'. There are a very few examples where a final consonant is silent for no reason, but you have no chance with an unkaranned silent final consonant immediately after a vowel. คัมภีร์ is a peculiar example with karan.
  • Implicit silencing does not trim a word back to one consonant.
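The first bullet suggests a heuristic that needs no list of special endings. Since อั must be followed by a pronounced final consonant, everything between that final and the karan can be treated as silent. The Python sketch below is my own reading of the bullet, with hypothetical names, and it additionally assumes that exactly one consonant is kept as the final. It ignores tone marks and words without อั.

```python
import re

MAI_HAN_AKAT = "\u0e31"         # the vowel sign written อั
THANTHAKHAT = "\u0e4c"          # karan
CONSONANT = "[\u0e01-\u0e2e]"   # ก .. ฮ

# อั + one pronounced final, then one or more consonants ending in a karan.
TAIL = re.compile(f"({MAI_HAN_AKAT}{CONSONANT})({CONSONANT}+){THANTHAKHAT}$")


def split_silent_tail(word: str) -> tuple[str, str]:
    """Return (pronounced part, silent letters); no match leaves the word as is."""
    m = TAIL.search(word)
    if m is None:
        return word, ""
    return word[: m.start(2)], m.group(2)


for w in ["สัตว์", "จันทร์", "นิรันดร์", "ลักษณ์", "ลักษมณ์"]:
    print(w, "->", split_silent_tail(w))
```

Words such as ภาพยนตร์, where the last syllable has an unwritten vowel, fall outside this heuristic and still need a rule or a lexicon entry.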

Which Artificial Intelligence language are you (Manum) using? Prolog? Lisp?

Have you had a look at the ICU (Internationalization Components for Unicode) open source code? It includes a large Thai database just for tasks such as line-breaking.

Link to comment
Share on other sites

As far as I know, the only time two final consonants are unpronounced when there is only one garan is when you find 'ตร์' together at the end of a word.

If you look in Haas' dictionary, you should find more examples of 2 or 3 silenced consonants with a GARAN การันต์. :o

Link to comment
Share on other sites

Which Artificial Intelligence language are you (Manum) using?  Prolog?  Lisp?

I am actually using PHP only, with loops and regexps :o

Anyway, I have a question again.

Would there be some rule for the unwritten (a)? For example, โทรศัพท์: how could I know that it is syllabified as โท ระ ศัพ and not as โทร ศัพท์?

Link to comment
Share on other sites

Would there be some rule for the unwritten (a)? For example, โทรศัพท์: how could I know that it is syllabified as โท ระ ศัพ and not as โทร ศัพท์?

Semantics :o . Having decided it's a compound word, you then ask whether it's formed by Thai rules (head + qualifier) or Indic rules (actually Indo-European, I think, cf. German, English), i.e. qualifier + head. In this case, it's qualifier plus head, so you need a link vowel, whence โท ระ ศัพ. These rules aren't 100% reliable, especially when native words are involved, e.g. ผลไม้ = ผน ละ ม้าย. You also need a grammatical analysis to decide between ไม้ (effectively the 'construct' form) and ม้าย (the 'absolute' form).

Of course, in this example you could use the fact that the word โทร 'telephone' is pronounced โท ระ :D
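Since the decision rests on knowing the words involved, a lexicon lookup is probably the practical route in code. The sketch below is a hedged illustration in Python with hypothetical table and function names: whole-word exceptions are tried first, then known first elements that take a link vowel. The remainder is returned untouched, so its silent letters still have to be handled separately.

```python
# Hypothetical tables, filled only with the examples from this thread.
WHOLE_WORDS = {
    "ผลไม้": ["ผน", "ละ", "ม้าย"],
}
COMBINING_FORMS = {
    "โทร": ["โท", "ระ"],   # first element read with a link vowel
}


def syllabify_compound(word: str) -> list[str]:
    """Lexicon-first split; falls back to returning the word unsplit."""
    if word in WHOLE_WORDS:
        return list(WHOLE_WORDS[word])
    # Try longer first elements before shorter ones.
    for prefix in sorted(COMBINING_FORMS, key=len, reverse=True):
        if word.startswith(prefix) and len(word) > len(prefix):
            return COMBINING_FORMS[prefix] + [word[len(prefix):]]
    return [word]


print(syllabify_compound("โทรศัพท์"))
print(syllabify_compound("ผลไม้"))
```

The order of the two lookups matters: ผลไม้ changes its second element as well, so it cannot be produced by a prefix rule alone.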

I often use the cynical rule 'If you don't insert a vowel, how will your listener know that you know what the consonant is?'. I'm not sure why it doesn't work with มารดา. Is it just that the word is far too different from the original form มาตา, มาตร-? Is the pronunciation with -n- just a spelling pronunciation, ignoring the following double-pricing rule on karans:

  • In P/S words, silent letters are marked as such if they are followed by a vowel in the original.
  • In European words, silent letters are marked as such.

There are a lot of irritating words like สามารถ, with a silent ร before a dental. Sometimes the word is easily misdiagnosed as not P/S, e.g. เกียรติ pron. เกียด / เกียรติ์ pron. เกียน, from Sanskrit กีรฺติ, Pali กิตฺติ. (It has just occurred to me that this may be due to a transient stage in Khmer as final 'r' faded out, cf. English 'beard', though I think the now silent final 'r' only creates one diphthong in modern Khmer.)

Link to comment
Share on other sites
