
Rule-Based Thai Transliteration


bytebuster


Hi All,

I've been developing software for fully rule-based Thai transliteration.

This software would not be a translation service (there are quite a few of them, and they are pretty fine).

Why another piece of software? Well, yes, each dictionary contains transliteration as well. The problems are:

  1. They use their own transliteration conventions. It's confusing to use several of them simultaneously;
  2. They are primarily targeted at native English speakers. Although that's not a problem in itself, it leads to well-known confusion over the aspirated English [p-t-k] and, as a result, to invented constructs such as ต -> [dt]. The same applies to vowels: โ◌ -> [oh], ำ -> [um], or rendering both ิ and ี as [ee], which is always long in English.
  3. Some transliterations are simply unreadable, e.g., ขอโทษ -> [K̄hxthos̄ʹ].
  4. For those who study the language, it's important to know which particular rules apply to each word. It's easy to declare that ขอโทษ should be read [khɔ̌ː - thôːt] and hand over a book of Thai grammar, but a beginner would never recall all the necessary rules at once.

What did I do about this?

I use IPA as the standard transliteration convention, while allowing any other, including the Royal system of Transliteration or the Cyrillic system by L.N.Morev.

I provide a full tree-style proof that tells exactly why a particular phrase should be read the way it is. Here's an example from my FAQ section:

  • เจริญ should be pronounced [tɕà - rɤn] because:
    • The best word splitting is จะ - เริญ because:
      • Akson nam rule applied;
      • AND An alternative splitting เจ - ริญ is worse;
      • AND An alternative splitting เจ - ริ - ญะ is worse;
    • AND จะ should be pronounced [tɕà] because:
      • Initial consonant is จ
      • AND Vowel is Sara A
      • AND Final consonant is EMPTY
      • AND The syllable is Dead because:
        • Short vowel + No final consonant
      • AND Tone is Low because:
        • Middle-class consonant + No Final
    • AND เริญ should be pronounced [rɤn] because:
      • Initial consonant is ร
      • AND Vowel is Sara OE
      • AND Final consonant is ญ
      • AND The syllable is Live because:
        • Final consonant is a sonorant
      • AND Tone is Normal because:
        • Akson-nam does not apply because:
          • Exception word
        • AND Low-class consonant + Live syllable
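
A tree like the one above can be modelled as a small recursive data structure. The following is only a minimal sketch in Python (the tool itself is not written in Python); the names `Claim` and `render` are assumptions made for illustration and are not taken from the actual code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """One node of the derivation: a statement plus the reasons supporting it."""
    text: str
    reasons: List["Claim"] = field(default_factory=list)


def render(claim: Claim, depth: int = 0, first: bool = True) -> List[str]:
    """Render the tree as indented bullets; sibling reasons after the first get 'AND'."""
    prefix = "" if first else "AND "
    suffix = " because:" if claim.reasons else ""
    lines = ["  " * depth + "• " + prefix + claim.text + suffix]
    for i, reason in enumerate(claim.reasons):
        lines.extend(render(reason, depth + 1, first=(i == 0)))
    return lines


# The branch for the first syllable, copied from the FAQ example.
tree = Claim("จะ should be pronounced [tɕà]", [
    Claim("Initial consonant is จ"),
    Claim("Vowel is Sara A"),
    Claim("Final consonant is EMPTY"),
    Claim("The syllable is Dead", [Claim("Short vowel + No final consonant")]),
    Claim("Tone is Low", [Claim("Middle-class consonant + No Final")]),
])

print("\n".join(render(tree)))
```

Keeping the reasons as child nodes is what makes it possible to collapse or expand any branch independently when the tree is displayed.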

What do I need? (Points of Interest)

First of all, let me state that I have not posted a link to my Web site, in keeping with topic 10 of the Forum Rules. I'm a bit confused by the phrasing "URLs or addresses to a member's own business". It is not "business" since using the site is completely free. Dear fellow moderators, please advise whether the link can be placed.

I need people who will look at it and provide feedback. They can be students of the Thai language or Thai linguists.

Also, I'm extremely interested to see whether my tool can be integrated into the teaching process.

I need someone who would suggest directions for further development.

Thanks.


You said, "It is not "business" since using the site is completely free." However, on your web page you say, "A standalone version (coming) is distributed on commercial basis".

I like the principle of your application, but I have some criticisms of the implementation.

I thought I had broken your system. I gave your system some hard words -

พรหมพร

เพลา

เสลา

เงิน

กำเนิด

สำเนียง

and then added ธรรมชาติ to the list. From that point on, it wouldn't work when I clicked 'Go', not even when I removed the last word. (I eventually realised that clicking 'transliterate' would reset the transliteration screen - but do you want to stop people from easily modifying and resubmitting text?) The transliterations I got were:

pʰrom - pʰɔn - pʰlau - sà - lǎu - ŋɤʔn - kɑm - nɤ̀t - sɑ̌m - nǐːaŋ

The vowel length notations in pʰɔn, ŋɤʔn and nɤ̀t (and also in the showcase word, tɕà - rɤn) are strange. They should be marked ɔː, ɤ and ɤː.

The phrases 'is (not) a dictionary word' are confusing - especially when เงิน is described both as 'is a dictionary word' and as 'is not a dictionary word'. I suggest changing them to 'is (not) an exception'.

Your explanation does not explain why it discards the silent ห in พรหมพร.

เพลา and เสลา are disproof of principle words, as each is two or three different words with the different pronunciations pʰlau, pʰeːlaː, sàlǎu and sěːlaː. There was no explanation of why the second word in each pair was rejected.

The explanation of the tone of the second syllables is being displayed wrong. It says that the tones of the second syllables of กำเนิด and สำเนียง are not affected by the first syllable, and are therefore falling and mid, but then marks them as low and rising respectively, as they would be if the first syllable affected them. Now, the first syllable of กำเนิด does affect the tone of the second syllable, so the tone was marked correctly in the transliteration. Unusually, the first syllable of สำเนียง does not affect the tone of the second syllable - this word needs to go in your exception dictionary.

The explanation the application gives for the tone of เจริญ is no longer as in the FAQ (and quoted in the post). You reasonably make no comment on the effect of the first syllable, for changing the tone class to mid does not change the tone of the second syllable.

The transliteration of ธรรมชาติ as tʰɑm - tɕʰaː - dà is completely wrong. It should of course, if we accept your choice of vowel symbols, be tʰɑm - má - tɕʰât.


Richard, I greatly appreciate your feedback.

First of all, let me express my admiration for your search skills. Seriously.

As for the commercial standalone version - yes, I plan this for the future, and that's why I was doubtful about placing the link. I decided that once I get in contact with the moderators, I will let them decide whether the link can be placed or not. There were 7 visitors yesterday; such traffic can't be raised through forum advertisement. ;-)

Yes, there's a bug related to session timeout, but I haven't managed to nail it down yet.

Yes, เงิน is a long vowel, thank you for pointing this out. And yes, the glottal stop mark in [ŋɤʔn] is certainly a bug. I wonder how it slipped past the tests... I will get it fixed asap.

Yet another thing is dictionary words. My goal is to avoid them as much as possible, and this leads to having many small dictionaries, each of which is targeted to a certain feature - splitting the words, irregular vowel length, and so on. It's a philosophic concept, if you want to avoid something you have many of it. :) So yes, the word can be absent in "splitting" dictionary but present in a "vowel length" dictionary.
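
The idea of many small, feature-specific dictionaries can be sketched as follows. This is only an illustration under stated assumptions: the structure and the names `EXCEPTIONS` and `describe` are hypothetical, and the two entries are taken from words discussed in this thread rather than from the tool's real data.

```python
from typing import Dict, List

# Hypothetical layout: one small dictionary per feature, not one big word list.
EXCEPTIONS: Dict[str, Dict[str, object]] = {
    # Syllabification that the base rules would not find on their own.
    "splitting": {"เทศบาล": ["เทศ", "ศ", "บาล"]},
    # Vowel length that differs from what the spelling suggests.
    "vowel_length": {"เงิน": "short"},
}


def describe(word: str) -> List[str]:
    """Report, feature by feature, whether the word is listed as an exception."""
    report = []
    for feature, entries in EXCEPTIONS.items():
        status = "is" if word in entries else "is not"
        report.append(f"{word} {status} a '{feature}' exception")
    return report


for line in describe("เงิน"):
    print(line)
```

Naming the feature in each message is one way to avoid the apparent contradiction of the same word being reported both as a dictionary word and as not a dictionary word.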

I will certainly manage the wording to distinguish dictionaries.

As for rejected alternatives... Well, I can't think how that can be explained. In fact, I build all alternatives and then calculate the "quality" of each of them. There is no explanation other than the weight coefficient. I'm not sure I could provide better reasoning.

ธรรมชาติ is a certain exception.

Please give me some time to look deeper into กำเนิด and สำเนียง; they seem to be very interesting examples, and maybe they would require re-arranging the proof.

Once again, thank you for the feedback. This is exactly what I wanted: a critical look of people who have good expertise in language and know how to crash-test the program with tricky tests. :)

Have a good day.

// Vlad


As far as I am aware it's not against the rules to place a link to your website in your Thai Visa profile, then direct people who are interested in helping/beta testing/etc to there.

Please do, because I'd like to have a look, and can't seem to find the url on this topic


Yes, เงิน is a long vowel, thank you for pointing this out.

No! It has a short vowel!

As for rejected alternatives... Well, I can't think how that can be explained. In fact, I build all alternatives and then calculate the "quality" of each of them. There is no explanation other than the weight coefficient. I'm not sure I could provide better reasoning.

Actually, I'm now thinking I must have missed the rejected sequences.

However, wá - lau for เวลา [weːlaː] 'time' is not only wrong but implausible.

Do you have any test cases for double-acting consonants? I just tried out เทศบาล, and it did not even consider the correct syllabification เทศ-ศ-บาล.

There seems to be a problem with the cluster สร-. Your application overlooked the monosyllabic สระ 'pool; to wash (hair)' and only presented the transcription of สระ 'vowel'. It failed completely with สร้าง 'to build', coming up with [sâː - ŋá].

ธรรมชาติ is a certain exception.

Yes, but only because the rule silencing final sara i has many exceptions.

Edited by Richard W

Thank you Richard, this is really important review.

However, wá - lau for เวลา [weːlaː] 'time' is not only wrong but implausible.

The wrong splitting of เวลา is really annoying, what a loss of face... I'm sorry about that. I accidentally disabled one of the watchdogs that should have prevented a low-class (LC) consonant from forming Akson-Nam.

Do you have any test cases for double-acting consonants? I just tried out เทศบาล, and it did not even consider the correct syllabification เทศ-ศ-บาล.

Double-acting consonants are simply placed into an exception dictionary. If a particular word is there, it will be processed correctly. The problem is that I neglected those exception words in favor of the base logic, so many words are simply not there.

But I agree that it's a good time to fill up the dictionaries. I've found several sources containing three hundred words each, so those words will appear here soon.

There seems to be a problem with the cluster สร-. Your application overlooked the monosyllabic สระ 'pool; to wash (hair)' and only presented the transcription of สระ 'vowel'. It failed completely with สร้าง 'to build', coming up with [sâː - ŋá].

Yes, it's another design problem, one that prevented false clusters from working properly in initial position. I will try to get it fixed asap.

Once again, thank you for detailed feedback. I will post here as soon as I upload the new version.


thai2english also used a transliteration program (I am not sure it's still like this).

I think that was the weakest point of this dictionary. I had a look again at thai2english and it has really improved.

I wonder if they improved the algorithm or added a long list of exceptions or just use a dictionary/database now.

thai-language is very good. I think it's based on a dictionary/database.

There are so many exceptions, and exceptions to exceptions, that trying to define a set of rules is a really complex task, and even if you have a 5-page-long algorithm, there would probably still be hundreds of exceptions.

I believe the royal institute published a small book with difficult to read words (all exceptions on the standard reading rules). You might want to put these words in your system.

I like the idea of trying to define an algorithm for transliterations. In combination with an exception list this could be a nice feature for mobile devices and embedded systems where the available memory is limited (although devices with little memory are becoming rare).

PS. I tried สำรับ, it failed to give the correct transliteration.


thai-language is very good. I think it's based on a dictionary/database.

Unless it's changed very recently, it's rule-based but allows hints and complete overrides. The Thai-language scheme also has the advantage that it only works on words. It is marred by Glenn's initially falling for the claim that Thai morphemes are invariant.

(Thai noun morphemes can have three forms - the isolated form, the prefix-form, which I've jokingly called the genitive form, and the qualified form, which I've jokingly called the construct form. For an example of the third form, allowing the Bangkok form to be correct, the first syllable of น้ำใจ has a short vowel, whereas น้ำ normally has a long vowel. Unsurprisingly, Bytebuster ascribes a long vowel to น้ำใจ.)

Bytebuster's scheme allows a phrase-final consonant to form a syllable on its own. Now, for deducing the pronunciation of words, that became ridiculous for the last consonant of a word at some point in the last 50 years, but if the input is connected text rather than isolated words, the downside is minor. (I have a text book on Thai written when นม was an allowed writing of [H]na[H]ma.)

Similarly, I think that handling double acting consonants purely by dictionary entries will change the nature of the scheme because a huge set of dictionary entries will be needed just to handle compound words. The dictionary will be close to a list of compound words with consonant doubling, but perhaps one needs that if one is to handle sentences.

PS. I tried สำรับ, it failed to give the correct transliteration.

That looks like a clear candidate for an exception dictionary, as Bytebuster's conception does not allow for the lack of a word *สรับ to be used.

Thank you kriswillems.

Yes, thai2english is an excellent resource, very visual and informative.

However, any service containing a dictionary tends to simply write down the transcription, and that's it.

I had two major goals in mind: (1) a proof of concept that this is possible at all with minimal use of dictionaries; (2) providing a tree-style mathematical proof for the particular word or phrase.

As for exceptions - you are right, there are about 600 of them so far. I'm hoping to get the tool working reliably first, and then I will try to involve linguists (including members of this forum) in thinking of more rules in order to reduce the number of exception words. I simply have no better idea. :)


Bytebuster's scheme allows a phrase-final consonant to form a syllable on its own.

The code is based on a concept that I call a "chunk". The words are linked and may interact with each other. A word is certainly able to decide whether it is in final/isolated position or not. For instance, consonant reduplication is planned to work in this manner - no reduplication in final position, unless specifically stated.

Some words depend on the previous word. You may check with words like ประโยค. It is not a dictionary word (see the proof), but it acts according to a special rule. There's a known issue with the proof, which does not display the necessary branch, however. You may experiment by separating the words with any non-Thai character, e.g., a dash or a dot.

Yet another problem is that the rules and the dictionary of words that potentially may have reduplication are not yet completely written. This is why I'm here...

What is really interesting with my approach is that any dictionary item may have its custom program code associated. It is not filled up yet, but it is there. :)
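
To make the idea of a dictionary item with its own code concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names `Context`, `Entry`, `read` and `chakra_hook` are hypothetical, and the example entry for จักร uses the two readings reported later in this thread (monosyllabic when final, with an extra syllable as the non-final element of a compound).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Context:
    """What an item knows about its position inside the chunk."""
    is_final: bool  # True when the item is the last one in its chunk


@dataclass
class Entry:
    """A dictionary item: a default reading plus optional custom code."""
    default: List[str]
    hook: Optional[Callable[[Context], List[str]]] = None


def read(word: str, ctx: Context, dictionary: Dict[str, Entry]) -> List[str]:
    """Use the item's own code when it has some, otherwise its default reading."""
    entry = dictionary[word]
    return entry.hook(ctx) if entry.hook else entry.default


def chakra_hook(ctx: Context) -> List[str]:
    # No reduplication in final position; reduplicate inside a compound.
    return ["tɕàk"] if ctx.is_final else ["tɕàk", "krà"]


DICTIONARY = {"จักร": Entry(default=["tɕàk"], hook=chakra_hook)}

print(read("จักร", Context(is_final=True), DICTIONARY))
print(read("จักร", Context(is_final=False), DICTIONARY))
```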

You have revealed most outstanding issues. Most of them are quite possible to fix.

I'm working on an updated version and will post in this topic as soon as it's ready.


Some words depend on the previous word.

Apart from 'construct' forms (e.g. นำใจ [M]nam[M]jai) and some oddities like พระนคร [H]phra [H]na [M]khawn as opposed to (normal) นคร [M]na[M]kawn, this is largely untrue. Do you use 'word' to mean 'syllable', as Thai frequently does with คำ? For example, ประโยค [L]pra[L]yook is an example of a syllable depending on the previous syllable within the same word.


Do you use 'word' to mean 'syllable', as Thai frequently does with คำ? For example, ประโยค [L]pra[L]yook is an example of a syllable depending on the previous syllable within the same word.

Well, I have to learn terminology as well. :) Of course, I mean "syllable". Since there's no dictionary, there is no way to determine that ประโยค is a single meaningful polysyllabic word. They are two syllables belonging to the same text chunk.


Out of curiosity I've just tried รัดประคด and ประคบ

The problem with this kind of word is that there are probably just as many exceptions as words that seem to follow the rule.

Thank you for pointing this out. It's a bug in the code. It was supposed to check the initial of the 2nd syllable for a sonorant, but instead it checked for LC. So it made a mistake for any word starting with ประ and followed by a plosive LC consonant. I will get it fixed ASAP.
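
The difference between the two checks can be shown with a short sketch in Python. The function names are hypothetical and this is not the tool's actual code; the two consonant sets are the standard low-class consonants and, within them, the sonorants.

```python
# All low-class (LC) consonants of the Thai script.
LOW_CLASS = set("คฅฆงชซฌญฑฒณทธนพฟภมยรลวฬฮ")

# The sonorant subset of the low-class consonants.
SONORANTS = set("งญณนมยรลวฬ")


def rule_applies_buggy(second_initial: str) -> bool:
    """The bug as described: any low-class initial triggers the rule."""
    return second_initial in LOW_CLASS


def rule_applies_fixed(second_initial: str) -> bool:
    """The intended check: only a sonorant initial triggers the rule."""
    return second_initial in SONORANTS


# ประโยค: the second syllable starts with ย, a sonorant.
print(rule_applies_buggy("ย"), rule_applies_fixed("ย"))
# ประคบ: the second syllable starts with ค, a low-class plosive.
print(rule_applies_buggy("ค"), rule_applies_fixed("ค"))
```

The two versions agree on sonorants and differ only on the remaining low-class consonants, which is exactly where ประคบ and รัดประคด went wrong. Words such as ประมุข would still need a separate exception entry.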

Here's an excellent explanation for the rule.


Oops, another problem.

I've never really thought about it until now, but for a non-dictionary based computer system it's almost impossible to split a sentence into syllables.

For example:

อีกรูปหนึ่ง (one more picture) will become อี กรูป หนึ่ง

because the Thai writing system only gives you hints about the beginning and end of a syllable, but not 100% certainty in all cases.

(your system uses a left to right approach, but even if you start at the end of the sentence you'll run into similar problems).


Even if you'll correct this there will still be a lot of exceptions, like : ประยุกต์ or ประมุข

Thank you for a good observation.

ประยุกต์ does feel like a loanword, so it can fit the rule.

ประมุข is harder and it might be an exception.

Oops, another problem.

I've never really thought about it until now, but for a non-dictionary based computer system it's almost impossible to split a sentence into syllables.

[...]

(your system uses a left to right approach, but even if you start at the end of the sentence you'll run into similar problems).

You are right about ambiguous splitting.

The simplest example is ตากลม, and no tool, even a dictionary-based one, can help with it.

Sometimes the code makes mistakes, and this is why I'm actually here - to share my humble research with people who are more familiar with the language, and to ask them for help with fine-tuning the rules. I'm not aware of other successful attempts to do that.

However, the parser is not "greedy". It used to be greedy in the first versions, and I had to rewrite it from scratch for exactly this reason.

In fact, the parser is based on a recursive backtracking algorithm. Simply speaking, it (1) builds all grammatically possible variants of splitting, then (2) computes a score called "quality" for each, and finally (3) chooses the chunk with the highest quality. You may see the rejected variants in the proof. They are called "chunk reconstruction", not the best name. :-)
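
The three steps can be sketched in a few lines of Python. This is a toy under loudly stated assumptions, not the real parser: the set of admissible syllables is supplied by hand instead of being derived from grammar rules, and the `quality` function is a made-up heuristic that simply rewards an initial cluster. The names `all_splits`, `quality` and `SYLLABLES` are hypothetical.

```python
from typing import List, Set

# Consonants that can stand second in an initial cluster.
CLUSTER_SECOND = set("รลว")


def all_splits(text: str, syllables: Set[str]) -> List[List[str]]:
    """Step 1: recursively build every split of text into admissible syllables."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Try this head, then come back and try longer heads as well.
            for rest in all_splits(text[end:], syllables):
                results.append([head] + rest)
    return results


def quality(split: List[str]) -> int:
    """Step 2 (toy): count syllables of 3+ characters whose second one is ร, ล or ว."""
    return sum(1 for s in split if len(s) > 2 and s[1] in CLUSTER_SECOND)


# Hand-written stand-in for "grammatically possible" syllables.
SYLLABLES = {"ตา", "ตาก", "กลม", "ลม"}

candidates = all_splits("ตากลม", SYLLABLES)
best = max(candidates, key=quality)  # Step 3: keep the highest-quality split.
print(candidates)
print(best)
```

With this toy scoring the cluster reading ตา - กลม wins over ตาก - ลม. Both are legitimate readings of the same letters, which is why no scoring function alone can settle such cases.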

In the case of อีกรูป, it considered a cluster to be better than a final consonant.

Another example is ตกลง, which does not work properly either.

Frankly, I have no idea how to programmatically decide the correct splitting in this case. Any ideas would be greatly appreciated.


Yes, เงิน is a long vowel, thank you for pointing this out.
No! It has a short vowel!

Listen to it, it's a long vowel.

Well, the sample word is เลิก. It has falling tone. If the vowel was short, it would be LC + V + Dead => High
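
The argument relies on the standard tone rules for syllables written without a tone mark. As a sketch in Python (the function name `tone` and its parameters are chosen here for illustration only):

```python
def tone(consonant_class: str, live: bool, long_vowel: bool) -> str:
    """Tone of a syllable with no tone mark.

    consonant_class is 'low', 'mid' or 'high'; live is False for a dead syllable.
    Vowel length only matters for dead syllables with a low-class initial.
    """
    if consonant_class == "mid":
        return "mid" if live else "low"
    if consonant_class == "high":
        return "rising" if live else "low"
    if consonant_class == "low":
        if live:
            return "mid"
        return "falling" if long_vowel else "high"
    raise ValueError(f"unknown consonant class: {consonant_class!r}")


# เลิก: low-class initial, dead syllable.
print(tone("low", live=False, long_vowel=True))
print(tone("low", live=False, long_vowel=False))
```

So a falling tone on a dead syllable with a low-class initial implies a long vowel. The same reasoning cannot be carried over to เงิน, which is a live syllable: a low-class live syllable has mid tone whatever the vowel length.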


Yes, เงิน is a long vowel, thank you for pointing this out.
No! It has a short vowel!

Listen to it, it's a long vowel.

Well, the sample word is เลิก. It has falling tone. If the vowel was short, it would be LC + V + Dead => High

เงิน is usually pronounced with a short vowel although it's written with a long vowel. So, it's an exception.


เงิน is usually pronounced with a short vowel although it's written with a long vowel. So, it's an exception.

In writing, it's ambiguous rather than an exception. There's no way for standard Thai writing to show the vowel as short in a closed syllable, though some will add a maitaikhu, either diagonally above the sara i (not representable in plain text) or over the following consonant. (I don't know what makes เง็อน unacceptable.)


This looks good to me!

I tried it with this word: เอร็ดอร่อย ([to be] delicious ; tasty), which puzzled me when I first read it (I didn't think of unwritten a). There is no transcription in Thai2English, I finally had to ask a Thai friend to read it.

Your program correctly transcribes it as: ʔà - rèt - ʔà - rɔ̀ːi

Edited by ChristianPFC

Added several transcription plugins, improved dictionary of exception words, improved visual appearance, and fixed several bugs.

There are still a few simple bugs left:

พรหมพร is transliterated as pʰrom - pʰɔn, should be pʰrom - pʰɔːn.

เงิน is transliterated as ŋɤʔn, should be ŋɤn.

On the exception front, you still haven't entered the data so that สำเนียง is transliterated not as sɑ̌m - nǐːaŋ but as sɑ̌m - niːaŋ. Do you have hooks for exception lists so that ผลิกะ and ปลัด will not be mistransliterated as pʰlì - kàʔ and plàt but correctly transliterated as pʰà - lí - kàʔ and pà - làt? At present, the correct syllabification is apparently not even considered.

You are still using the word 'proof' where 'explanation' is a far better word. It is not possible to prove that สระ 'pool' is pronounced sà - rà when it is actually pronounced sà. (At present there is no indication that the possibility has been rejected.)


There are still a few simple bugs left:

Thank you very much Richard.

In the next version I will change some logic regarding the glottal stops, so they don't appear in improper places.

There are, indeed, a lot of exception words still working incorrectly. I'm thinking of adding them from a large vocabulary, but that takes time.

I've been looking into the Lexitron database, but primarily in terms of gathering essential statistics regarding consonant reduplication. My immediate goal is to try implementing some logic that would "guess" possible reduplication.

Yes, the clusters are considered as entire units (unless they aren't followed by blocking rules, e.g. "เ"). So decompositions where they are separate consonants are indeed not considered at all. They are exception words for my approach.

There is good logic behind that. My code uses a backtracking algorithm: each ambiguity creates a tree of possibilities, and the branches are then processed independently until the very end of the input text (or until a blocking rule matches). For instance, the full name of กรุงเทพฯ gives some 14 thousand grammatically possible readings. Not all of them are processed, thanks to lazy computation (I use F#), but there are still too many of them. Cluster optimization is one of many optimizations I had to add to reach fair performance.
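
The growth in the number of readings, and the way lazy evaluation helps, can be illustrated with a Python generator. This is only a sketch: the tool itself is written in F#, and the names `readings` and `count_readings` as well as the placeholder alternatives are hypothetical.

```python
from itertools import islice
from typing import Iterator, List, Sequence


def readings(alternatives: Sequence[Sequence[str]]) -> Iterator[List[str]]:
    """Lazily yield every combination, taking one alternative per ambiguous spot."""
    if not alternatives:
        yield []
        return
    for choice in alternatives[0]:
        for rest in readings(alternatives[1:]):
            yield [choice] + rest


def count_readings(alternatives: Sequence[Sequence[str]]) -> int:
    """Each ambiguity multiplies the total number of readings."""
    total = 1
    for options in alternatives:
        total *= len(options)
    return total


# Placeholder labels standing in for the alternatives at three ambiguous spots.
spots = [["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"]]

print(count_readings(spots))
# Only the readings actually requested are ever built.
print(list(islice(readings(spots), 2)))
```

Treating a cluster as a single unit removes one ambiguous spot altogether, which divides the total by the number of alternatives that spot would have had.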

As for "proof", I don't claim it is the best word, but I don't think "explanation" is the best term either. Maybe you agree that tree-style entities are a bit "geeky" by themselves. "Normal" people would rather understand plain text, and that plain text better fits the term "explanation". Sometimes I make it verbose, and that will be called explanation. What do you think?


  • 4 weeks later...

Another clutch of words that went wrong: เมตร เพชร จักร บุตร are all monosyllabic with short vowels and tones to match. The tool showed them as disyllabic. tɕàk - rá for จักร is clearly wrong - as clause final, it would require the spelling จักระ to have that pronunciation. (As the non-final element of Indic compounds, จักร- is enunciated tɕàk - krà -). The vowel lengths for เมตร and เพชร definitely need to be marked as exceptions.


Maybe you agree that tree-style entities are a bit "geeky" by themselves. "Normal" people would rather understand plain text, and that plain text better fits term of "explanation". Sometimes I make it verbose, and that will be called explanation. What do you think?

Being able to hide parts of the explanation ('derivation' is a suitable word in this context) that are currently not of interest is useful, but there is a problem with the display technology. If one navigates away from the screen (at least, backwards), the tree has collapsed when one returns. This infuriates me.
