Thai language input method


anotheruser

Hello,

I am looking for anybody who can program and wants to help make a new input method for Thai. Many languages such as Japanese, Korean and Chinese have such systems already. You do not even have to be perfect at the Thai script; a background in working with IMEs is what is needed. Changing the way the Thai language is typed, and how foreigners learn to use smartphones and computers natively in Thai, could have a huge impact on schools and the way the alphabet is learned.

It is also the input method of choice for all East Asian countries amongst the younger generation.

The recent method of typing some languages and having stickers and letters of another language on the keyboard is obsolete or at least it should be.

Even the most complex languages in the world use this input system already and require only a native English keyboard to function.

If anybody has any interest in this project please PM me.


I presume you mean a method of input based upon the sound of the word, rather than its actual spelling, with a dictionary lookup and presenting a list of matches. Whilst this makes sense for ideogram-based scripts such as the three you mentioned, does it make sense for an alphabetic (or strictly speaking, in the case of Thai, abugida) script? I guess not, otherwise we would see similar apps for the English language, which is much less regular in spelling than Thai.

From a technical point of view, you'd need a dictionary which contains the pronunciation of each Thai word in computer-readable format. To the best of my knowledge, no such dictionary exists, and to create one would be a significant challenge. Ideally you'd also include word frequency information which is also not readily available.
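
To make the data requirement concrete, here is a minimal Python sketch of such a lookup. The three entries, their romanisations and their frequency counts are invented placeholders, not taken from any real dictionary or corpus.

```python
# Hypothetical entries: (romanised pronunciation, Thai spelling, frequency count).
# The romanisation scheme and the counts are made up for illustration.
ENTRIES = [
    ("maa", "มา", 900),
    ("maa", "ม้า", 120),
    ("maak", "มาก", 700),
]

def candidates(entries, typed):
    """Return Thai spellings whose pronunciation starts with `typed`,
    most frequent first."""
    matches = [(freq, thai) for key, thai, freq in entries
               if key.startswith(typed)]
    matches.sort(key=lambda m: -m[0])  # highest frequency first
    return [thai for _, thai in matches]

if __name__ == "__main__":
    print(candidates(ENTRIES, "maa"))
```

A linear scan like this is fine for a toy list; a real dictionary would want an index, and the ranking is only as good as the frequency data behind it.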

One area where this might have been useful is in mobile devices. However, that time has passed. Go back 10 years and most text messages were written in Latin script because Thai input was too challenging. Fast forward to today, and the vast majority of messages are written in Thai script thanks to advances in touch screen technology.

Finally, one mobile 'phone manufacturer - Nokia I think it was - came up with a very clever way of doing Thai input. It was based upon the symbol ⌘ as if superimposed on the number pad. If a character had its initial loop in the top left corner (e.g. ข) you'd type the top left key on the number pad. If on the bottom right (e.g. ญ) you'd type the bottom left key, etc. And if the character didn't have an initial loop (ก, ธ) you used the central key. Vowels were handled by the top, bottom and side middle keys, according to the vowel position. Quite ingenious, but it never caught on.

Anyway, sorry to be rather negative on your idea. That's my 2 baht's worth.


I presume you mean a method of input based upon the sound of the word, rather than its actual spelling, with a dictionary lookup and presenting a list of matches. Whilst this makes sense for ideogram-based scripts such as the three you mentioned, does it make sense for an alphabetic (or strictly speaking, in the case of Thai, abugida) script? I guess not, otherwise we would see similar apps for the English language, which is much less regular in spelling than Thai.

There is one example I've come across, an 'input method editor' for Mongolian in the Mongolian script (as opposed to the Cyrillic script). However, this is probably because the computerisation of the Mongolian script is an abomination. It has lots of letters that look identical and sound similar. There are four identical-looking ways of spelling the word 'Mongol', and all pass a grammar test. There are another four identical-looking ways of spelling it that clearly fail a grammar test, and all this is just by ringing changes on the vowels.

There is something vaguely similar for English. If one's typing is inaccurate, a spell checker will often present one with a useful pick list.

From a technical point of view, you'd need a dictionary which contains the pronunciation of each Thai word in computer-readable format. To the best of my knowledge, no such dictionary exists, and to create one would be a significant challenge. Ideally you'd also include word frequency information which is also not readily available.

This depends on how low one sets one's sights. If you target foreigners with small vocabularies, a lot of data is available. Big vocabularies are a problem; it is then difficult to get Thai spell checkers to present a good list of choices if one misspells a short word.

For frequency information, there is or was the Orchid corpus, and there is probably more recent stuff. It may be a bit biased, but not fatally so.

A really ambitious system would handle the input of connected text.


One difference from English spell checkers (at least the ones I've seen) is that a Thai version would need to work on the pronunciation of the word, rather than the spelling. English spell checkers simply use a variant of a Least Edit Difference (LED) algorithm weighted for typical typing errors. (E.g. "VAD" is more likely to be "BAD" than "PAD" because "V" and "B" are adjacent on the keyboard and touch-typed by the same finger.) For Thai, at the simplest level one could convert all the non-standard characters in the word to the standard ones (e.g. ค and ฆ would both map to a single standard character) and then perform a LED calculation against the dictionary. However, this isn't ideal: the irregular positioning of vowels would mess up the LED calculation. It would probably be best therefore to rearrange the vowels first to put them in a consistent position, namely where they appear in the pronunciation. So, for example, แดง would become ดแง, ขโมย would become ขมโย. It may also be necessary to do some special handling of tone marks.
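
A rough Python sketch of those two steps, for anyone who wants to experiment. It assumes the simplest possible rule (swap each preposed vowel with the single character after it), which reproduces the two examples above but ignores consonant clusters and tone marks. The distance function is plain unweighted Levenshtein distance, standing in for the LED calculation without any keyboard weighting; the function names are my own.

```python
# The five Thai vowels written before their consonant:
# SARA E, SARA AE, SARA O, SARA AI MAIMUAN, SARA AI MAIMALAI.
PREPOSED_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")

def reorder_vowels(word):
    """Swap each preposed vowel with the single character that follows it."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREPOSED_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the vowel we just moved
        else:
            i += 1
    return "".join(chars)

def edit_distance(a, b):
    """Plain (unweighted) Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(reorder_vowels("\u0e41ดง"), reorder_vowels("ข\u0e42มย"))
```

Both the typed word and every dictionary entry would be passed through `reorder_vowels` before comparing them with `edit_distance`.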

Another consideration would be whether the target user is a native Thai speaker or not. A native speaker is unlikely to confuse similar-sounding consonants such as บ/ป/ผ, whilst a non-native is more likely to. There are similar potential confusions with vowels such as เ/แ. For the non-native speaker it would probably be best, rather than using LED, to implement a sort of Soundex algorithm specifically geared towards the typical confusions of a non-native learner (e.g. giving a low weighting to tone and vowel length, and grouping together similar-ish consonants and vowels).
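
As a sketch of what such a learner-oriented key might look like in Python: the two groups below are just the confusions named above, and tone marks are dropped altogether. A usable table would need many more groups, and vowel length is not handled here at all. The names `GROUPS` and `learner_key` are made up for this illustration.

```python
# Illustrative confusion groups only; every character in a group is
# replaced by the first member of that group.
GROUPS = [
    ("บ", "ป", "ผ"),
    ("\u0e40", "\u0e41"),  # SARA E, SARA AE
]

# MAI EK, MAI THO, MAI TRI, MAI CHATTAWA
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")

# Build a character -> representative lookup table once.
_TABLE = {ch: group[0] for group in GROUPS for ch in group}

def learner_key(word):
    """Collapse easily confused characters and ignore tone marks."""
    return "".join(_TABLE.get(ch, ch) for ch in word if ch not in TONE_MARKS)

if __name__ == "__main__":
    # Two words that differ only in an easily confused initial consonant
    # end up with the same key.
    print(learner_key("ป\u0e49า") == learner_key("บ\u0e49า"))
```

Words with the same key would then be offered together as candidates, ideally ordered by frequency.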

There's a pretty good article on one Thai Soundex implementation at http://linux.thai.net/~thep/soundex/soundex.html which highlights a few other issues.


A few other points:

(1) Final consonants will require different treatment from initials. The application will need to recognise syllable boundaries - but this isn't always unambiguous.

(2) If the user types a word which is not in the dictionary, how will the application know that it's a complete word and not try to continue matching it as the user types the first characters of the next word?

(3) If the user goes back to correct a previously typed word, how will the application know the word boundaries? The best parsing algorithms are only 90-95% accurate at recognising these.


Hello everyone,

As IME helped me a lot to learn to recognize thousands of Chinese characters, I absolutely like your idea.
Basically you need a list of words ordered by frequency as well as their pronunciation.
There are such lists (the 4000 and 5000 most frequent words, including pronunciation), but they still lack too many words.
It's hard to merge different lists when they used different methods to determine frequency.
Best would be an IME which can learn and rearrange the words by frequency of usage.

I've already tried to make a Thai IME in JavaScript: http://thai.riian-thai.com/?top=thai_ime
But I think it would be more useful as an input method which can be installed on the computer and mobile phone.
It's easy to create a program that shows the candidates based on algorithms, but the hard thing is to put the text into any text field of any window.
Unfortunately I can only program in Java, and I haven't managed to access the text fields via JNA/JNI yet; it would be better to do it in C++ with the Text Services Framework.



Edited by sunlinna

I fear I may have derailed the discussion by talking about spell-checkers. Some of what you ask would not apply - the user would not have direct access to Thai characters.

(1) Final consonants will require different treatment from initials. The application will need to recognise syllable boundaries - but this isn't always unambiguous.

That'd be a good trick for a spell-checker. If one's typing in phonetically, they'll mostly be obvious. An interesting trick for the [L]taak[M]lom / [M]taa[M]klom ambiguity is to have a dictionary entry for taaklom. The Haas system actually distinguishes these as taaglom v. taaklom, but that's luck, not design.

(2) If the user types a word which is not in the dictionary, how will the application know that it's a complete word and not try to continue matching it as the user types the first characters of the next word?

Thanks to the efforts of Javier Solá, the Unicode Consortium graciously allowed us to continue to use zero-width space (ZWSP), or 'no-width optional break' as LibreOffice calls it, to mark or store word boundaries. It's often not a problem.
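
For illustration, storing and recovering word boundaries with ZWSP (U+200B) takes only a few lines of Python. The helper names are my own, and the sketch assumes the text contains ZWSP only at word boundaries.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

def join_words(words):
    """Store a list of words as one string with invisible boundaries."""
    return ZWSP.join(words)

def split_words(text):
    """Recover the words, ignoring empty pieces from doubled separators."""
    return [w for w in text.split(ZWSP) if w]

if __name__ == "__main__":
    stored = join_words(["ฉัน", "ไป"])
    print(len(stored), split_words(stored))
```

The stored string displays exactly like unbroken Thai text, but an application that knows about the marker can find the boundaries again without any parsing.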

(3) If the user goes back to correct a previously typed word, how will the application know the word boundaries? The best parsing algorithms are only 90-95% accurate at recognising these.

With a normal spell-checker, if all else fails you can tell it where the word boundaries are. At the moment, you can even tell it where they aren't, though I expect the Unicode Consortium to withdraw this privilege on the 26th when it releases the text of Unicode 8.0.0. The current glue character is word joiner (WJ), which LibreOffice calls 'no-width, no break'. You may need a lot of glue to stick a foreign name together.


I fear I may have derailed the discussion by talking about spell-checkers. Some of what you ask would not apply - the user would not have direct access to Thai characters.

Surely the user would have to have direct access to Thai characters for typing words which aren't in the application's dictionary. Or am I missing something?


Surely the user would have to have direct access to Thai characters for typing words which aren't in the application's dictionary. Or am I missing something?

Yes: "The recent method of typing some languages and having stickers and letters of another language on the keyboard is obsolete or at least it should be."


A couple of years ago I toyed around with something similar: an application to search a Thai dictionary by a word's sound. I could have used an existing Thai Soundex algorithm for this, but decided to write my own, more permissive one taking into account difficulties that foreign learners have with the sounds of Thai. I'll attach the actual code for the algorithm at the end of this post.

Anyway, I didn't have access to an electronic dictionary with correct syllabification and pronunciation for all words. This would definitely be needed for any real world application since Thai spelling is so irregular (irregular final consonants, linker syllables, number of characters "killed" by karan, irregular consonant clusters and so on).

For what it's worth I've uploaded the application for anyone who wants to play with it. It's at http://thai-notes.com/tools/ThaiSoundex.html (You have to run the site's dictionary program first to load the dictionary. That's at http://thai-notes.com/tools/predictionary.shtml.)

I did learn a few things from the experience:

- Performance is a big issue. It's very much a trade-off between initial load time and responsiveness. I found that using a Red-Black tree based NavigableMap with a custom sort sequence worked best for me, though the load time is still not great. That's with 43,000 dictionary entries.

- It was probably a mistake to consolidate DOR DEK, THOR THAHAN and TOR TAO sounds into a single group. DOR DEK might be confused for TOR TAO, but not TOR THAHAN. Ditto TOR THAHAN and DOR DEK.

- Soundex traditionally only considers vowel sounds if they occur at the start of a word. This doesn't work well for Thai with lots of words having similar consonant sequences.

- Reducing consonant clusters to the initial sound increases the number of words with similar consonant sequences in the Soundex encoded form.

- On the input side, it's difficult to make the input unambiguous because of syllable boundary issues.
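
On the performance point, the ordered-map lookup can be approximated in Python as follows. Python's standard library has no red-black tree, so this sketch swaps in a sorted list with `bisect`, which gives the same kind of ordered prefix query at the cost of slow inserts. The keys are hypothetical Soundex-style codes, and the class name is made up.

```python
import bisect

class SortedIndex:
    """Sorted (key, word) pairs supporting prefix range queries."""

    def __init__(self, pairs):
        self._pairs = sorted(pairs)
        self._keys = [key for key, _ in self._pairs]

    def with_prefix(self, prefix):
        """Return all words whose key starts with `prefix`, in key order."""
        lo = bisect.bisect_left(self._keys, prefix)
        # Appending the highest code point gives an upper bound that sorts
        # after every key beginning with `prefix`.
        hi = bisect.bisect_left(self._keys, prefix + "\U0010ffff")
        return [word for _, word in self._pairs[lo:hi]]

if __name__ == "__main__":
    index = SortedIndex([("k1", "กา"), ("k12", "กับ"), ("k2", "ขา")])
    print(index.with_prefix("k1"))
```

Sorting happens once at load time, so this has the same load-versus-responsiveness trade-off described in the first point.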

It was a fun few hours' experimentation, but it ultimately led nowhere.

Anyway, here's the algorithm (in Java), should anyone be interested. The code is pretty rough and ready and almost certainly not bug free:

ThaiSoundex.txt

Edited by ThaiNotes

I have finally managed to make a Thai IME in C#.

It's not finished yet, but you can already try out very basic functionality.
Just double-click the exe file and an icon appears in the system tray.
You can toggle the IME with the Caps Lock key, select the first entry with the space key and other entries with the number keys.

http://www.riian-thai.com/files/ThaiIME_0.01.zip

Edited by sunlinna

A few other points:

(1) Final consonants will require different treatment from initials. The application will need to recognise syllable boundaries - but this isn't always unambiguous.

(2) If the user types a word which is not in the dictionary, how will the application know that it's a complete word and not try to continue matching it as the user types the first characters of the next word?

(3) If the user goes back to correct a previously typed word, how will the application know the word boundaries? The best parsing algorithms are only 90-95% accurate at recognising these.

Hit the space bar and that finalises the word or letter. Sunlinna has already shown me a very simple web-based JavaScript site that she made that already does much of what I knew could be done. I will leave it to that poster whether to share it at this point. I am on a Mac so haven't been able to see what she has done with that zip file.

Edited by anotheruser

Surely the user would have to have direct access to Thai characters for typing words which aren't in the application's dictionary. Or am I missing something?

Yes: "The recent method of typing some languages and having stickers and letters of another language on the keyboard is obsolete or at least it should be."

You might just be missing something. Here is some random garble just to show you the idea of the space bar: พฟหดวอทยุ. Here _ represents the space bar, used as a stop to let a consonant or vowel go. To get this I typed p_f_h_d_w_a_t_y_uu_. So you can see it is possible either to enter any character separately or to use a dictionary. Here is a simple word not in the database yet: แคว. I used ae_k_w_ and it reads keaw. Given that the database is only days old and there hasn't been time to figure out the romanization, this should serve as early proof of concept that anything can be typed with this method.
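
The stop-key idea can be sketched in a few lines of Python. The key table below is inferred only from the two examples in this post and covers nothing else; tokens it does not know are passed through unchanged. The names `KEYS` and `convert` are my own.

```python
# Key table inferred from the examples p_f_h_d_w_a_t_y_uu_ and ae_k_w_.
# It is deliberately incomplete.
KEYS = {
    "p": "พ", "f": "ฟ", "h": "ห", "d": "ด", "w": "ว", "a": "อ",
    "t": "ท", "y": "ย", "k": "ค",
    "uu": "\u0e38",  # SARA U, as it appears in the example output
    "ae": "\u0e41",  # SARA AE
}

def convert(typed, stop="_"):
    """Turn stop-separated romanised tokens into Thai characters."""
    out = []
    for token in typed.split(stop):
        if token:  # the trailing stop leaves an empty token; skip it
            out.append(KEYS.get(token, token))
    return "".join(out)

if __name__ == "__main__":
    print(convert("p_f_h_d_w_a_t_y_uu_"))
    print(convert("ae_k_w_"))
```

Because every token is closed by the stop key, there is never any doubt whether `a` followed by `e` means two characters or the single key `ae`.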

Edited by anotheruser

Foreign names would be typed a bit differently. Say your name is Gary: แกรี่. I typed the letters in the exact order they appear rather than phonetically: ae_g_rii_ gives gaerii. Hope this helps show what I mean. Edited to say: not the best example, but hopefully it shows a bit of the workings.

Edited by anotheruser

It now sounds rather like the M17n 'ispell' input method that seems to come with ibus on Ubuntu. The controlling file is in Debian package m17n-db. I don't seem to have the complete set of packages, though - I can't get it to do anything useful for me.

Depending on your operating system, it can be straightforward to set up a keyboard mapping so that typing in transliteration generates the Thai characters. There is a Unicode character available for marking the boundaries between Thai words. The tricky bit is deciding what order the characters are coming in, but if you have an escape sequence (like '_' in your examples) to show it, the difference can be handled.


I made a mistake about the web-based thing. I think Sunlinna just showed me an existing one and meant for me to use it to help make a database. So here it is for those curious.

http://thai.riian-thai.com/?top=thai_ime

Play around with it a bit and maybe you can notice why I am thinking it could be useful. Hint is when you type the word and the drop down gives numbers you can just type the number and it moves it to the text box on the left. So what I want is this built into the keyboard selection of an operating system.


I made some improvements:
- The input window automatically adjusts its size and is placed at the text caret position (doesn't work for all applications yet).
- The IME is triggered by Ctrl + CapsLock, so CapsLock is still usable; Shift also works for writing Latin letters while the IME is active.
- Suggestions can also be selected with the NumPad numbers.
- The focus is no longer lost when toggling the IME.

Download:
http://thai.riian-thai.com/files/ThaiIME_0.02.zip


A new version is available now:
- Different romanizations have been added (Royal Thai, Paiboon+) as well as some custom options (e.g. ignore vowel length, and b/d/g instead of p/t/k for final stops).

- The input string is automatically converted (accents and spaces removed), so that is no longer necessary in the CSV database.
Paiboon+ special characters can be entered as Q = ɔ, V = ʉ, X = ə, Z = ɛ
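
A minimal Python sketch of that kind of conversion, assuming accents are stored as Unicode combining marks once the string is decomposed. This is not the IME's actual code; the function names and the sample romanisation are invented.

```python
import unicodedata

# Key substitutions listed above: Q = ɔ, V = ʉ, X = ə, Z = ɛ
PAIBOON_KEYS = {"Q": "\u0254", "V": "\u0289", "X": "\u0259", "Z": "\u025b"}

def expand_keys(typed):
    """Replace the capital-letter shortcuts with the special characters."""
    return "".join(PAIBOON_KEYS.get(ch, ch) for ch in typed)

def normalise(romanised):
    """Strip accents (combining marks) and whitespace from a romanisation."""
    decomposed = unicodedata.normalize("NFD", romanised)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch) and not ch.isspace())

if __name__ == "__main__":
    # A database entry with tone accents and a typed key meet on the same string.
    print(normalise("k\u0254\u030c\u0254 t\u00f4ot") == expand_keys("kQQtoot"))
```

Applying `normalise` to the database side at load time means the CSV only has to hold one accented form of each romanisation.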

Download:

http://thai.riian-thai.com/files/ThaiIME_0.03.zip


  • 2 weeks later...
Changes in version 0.05:

- The database is now ten times larger and consists of approx. 50,000 entries.
- The IME can now learn: the candidates are reordered depending on how often you select them.
- Positioning of the candidate window now also works on mouse actions.

Download:

http://thai.riian-thai.com/files/ThaiIME_0.05.zip
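
For readers curious how reordering by selection count might work, here is a small Python sketch. It is a guess at the general idea, not the IME's actual code; the class and method names are invented.

```python
from collections import Counter

class LearningRanker:
    """Reorder candidates by how often the user has picked them."""

    def __init__(self):
        self._picks = Counter()

    def record(self, key, word):
        """Remember that `word` was chosen for the typed `key`."""
        self._picks[(key, word)] += 1

    def rank(self, key, candidates):
        """Most-picked first; sorted() is stable, so ties keep the
        original (e.g. dictionary frequency) order."""
        return sorted(candidates, key=lambda w: -self._picks[(key, w)])

if __name__ == "__main__":
    ranker = LearningRanker()
    ranker.record("maa", "ม้า")
    print(ranker.rank("maa", ["มา", "ม้า"]))
```

Counting per (key, word) pair rather than per word keeps a choice made under one romanisation from affecting the candidate list for another.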


Not that your system isn't fine, but here are some other ideas for transliteration using the standard QWERTY keyboard:
Benjawan Becker uses O = ɔ, U = ʉ, A = ə, E = ɛ
for the input method when looking up Thai words by pronunciation in her Talking Thai dictionary app.
Some ASCII Sanskrit transliterations use capital letters to indicate retroflex consonants, so that would make
Ch = ฉ , D = ฎ , N = ณ , Th = ฒ
J winds up being ญ, but it's probably easier for people to remember as Y = ญ
Although it's not retroflex, you could apply capital letters to other characters:
Kh = ฆ
Again, I'm only mentioning this in case you want to use a system that's already been around. The transliteration used for Sanskrit still has some problems when typing Thai, but perhaps you might find some things useful?
I think your program for a learner IME is great, and it is likely to be a hit, but I also think: C'mon, just learn to type in Thai. It's not that hard.
On a related note, I went to a linguistics conference in 1999, and someone there presented a learner IME for Japanese that compensates for specific mistakes that learners make. His IME offered options that included words with both long and short vowels when only a short vowel had been entered, etc. I believe JustSystems (maker of the ATOK input method) purchased the IME from him, and now it is a standard feature of the IME for Japanese. I could see many Thai schools being interested in what you're developing.