Looking For A Programmer To Create Thai Reading Tool

Posted

I'm looking for someone to create a tool similar to this

http://womenlearnthai.com/index.php/fltr-the-foreign-language-text-reader/

that works for Thai. I am willing to pay.

Basically, I want to create a working mouse-over dictionary for Thai. I want a user to be able to go to the site, paste text, and read. Mouse over when you don't know a word. The site will remember which words you know, and highlight the ones you don't. It will create a list of your unknown words if you want. I plan to put this on a website, and let everybody use it for free.

A site that does something similar (but not for Thai):

http://www.lingq.com

A free version of something similar:

http://lwt.sourceforge.net

The closest thing that I can find to what I want today:

http://www.thai2english.com/online/

What other features would you like to see? Do you think the programmer will need to be a fellow Thai speaker?

Posted

It's an interesting challenge. However, I can see a number of big challenges:

(1) How do you parse the text to split it into words? There is no software out there that does better than about 90% accuracy. It's a very difficult problem.

(2) Where do you get your look-up dictionary from? There's nothing out there (paid for or free) that is comprehensive - particularly if you include a need to support slang/text-speak/regional dialects.

(3) To work on-line would require the user to have a fast Internet connection (a rarity in Thailand) and for the server to be even faster. How can that be provided for free? I rather doubt advertising would anywhere near cover the cost.

Unless you're an extremely wealthy philanthropist this just isn't going to work with the current state of technology.

And no, the programmer doesn't need to be a Thai speaker. In fact, I rather doubt you'd easily find a Thai programmer with the requisite technical expertise. (You might have noticed that, for the most part, Thai language websites are atrocious.)

Posted

1) Very carefully :) I think Thai to English proves that it can be done.

2) There are dictionaries that are good enough, IMO

3) Again, Thai to English proves that it can be done, unless he's an extremely wealthy philanthropist.

Know any programmers who'd be interested?

Posted

(1) How do you parse the text to split it into words? There is no software out there that does better than about 90% accuracy. It's a very difficult problem.

Not that difficult. In my project (the link is in my profile), a backtracking, weight-based algorithm achieves syllabification accuracy significantly over 90%, which is good considering that the solution is purely algorithmic, i.e. there is no dictionary or corpus apart from ~500 lexemes for loanwords.

(2) Where do you get your look-up dictionary from? There's nothing out there (paid for or free) that is comprehensive - particularly if you include a need to support slang/text-speak/regional dialects.

Lexitron and ORCHID. One could also contact Thai2English or ThaiLanguage to see whether they are interested in such a collaboration.

And no, the programmer doesn't need to be a Thai speaker. In fact, I rather doubt you'd easily find a Thai programmer with the requisite technical expertise. (You might have noticed that, for the most part, Thai language websites are atrocious.)

Agreed. Rather, the programmer needs to be something of a linguist: not necessarily fluent in the language, but able to understand the basic linguistic principles that apply to it.

Posted

@leosmith, the biggest challenge is not the dictionary or async retrieval of results. There are many sites that do translation (Google) and async retrieval via AJAX (ThaiLanguage).

To my understanding, a word-by-word translation is useless (or sometimes harmful as it may produce a rude/offensive result).

Consider something like ขี้เกียจ or รุ้งกินน้ำ to understand what I mean.

Yet another funny example:

ไม่ได้เจอกันตั้งนานนม โตขึ้นเยอะเลย (long time no see; {you} have grown so much)

versus

ไม่ได้เจอกันตั้งนาน นมโตขึ้นเยอะเลย (long time no see; {your} breast has grown so much)

Instead, a good translation service has to translate the phrases/sentences. This looks extremely difficult due to a broad use of indirect or idiomatic constructs in isolating languages. This has to be the biggest challenge.

Another important point: you have to figure out how your site would differ from the existing ones.

Please don't hesitate to PM me. I'm not sure if I can write it for you, but I will do my best to share what I have learned so far.

And good luck with your startup.

Posted

ไม่ได้เจอกันตั้งนาน นมโตขึ้นเยอะเลย (long time no see; {your} breast has grown so much)

link?

a good translation service has to translate the phrases/sentences. This looks extremely difficult due to a broad use of indirect or idiomatic constructs in isolating languages. This has to be the biggest challenge.

Although I'm not familiar with the programming issues, I suspect you are right. And the fact is, sometimes one wants the cursor to highlight single words, sometimes short phrases, and sometimes whole sentences. So we have a challenge ahead of us, which is a very good thing :)

You have to figure out how your site would differ from the existing ones.

This is another good point. To answer it, I have to ask myself why I'm not completely satisfied with Thai to English. It's a wonderful free tool, and I hope people won't be offended by what I have to say.

1) there is no highlighting words and keeping track of what I know

2) the dictionary lacks many basic words

3) parsing issues, the worst of which is it often forces parsing into long, incorrect, phrases

I want my site to include, and improve on, all of these things. They are all important, but I wouldn't do this project without 1). To see what I'm talking about, please check out LingQ on YouTube (sorry about the annoying advertising). FYI, there is no LingQ for Thai, and there never will be. Having words highlighted when you don't know them, and highlighted in a different color once you've looked them up, is a big help. This involves saving each user's vocabulary data. I also want a library of public lessons and the ability to create private lessons. I'd like to save audio for each lesson too, but wonder if that will make it much more expensive. What do you think about the expense of storing/streaming audio?

So for Thai, 1) will be unique.
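
To make the word-tracking idea in 1) concrete, here is a minimal Python sketch of the per-user bookkeeping. Everything in it is an assumption for illustration: the three statuses, the `Vocabulary` class and its method names are hypothetical, and a real site would store this per account in a database rather than in memory.

```python
from enum import Enum


class Status(Enum):
    NEW = "new"              # never looked up: highlight
    LOOKED_UP = "looked-up"  # looked up at least once: second colour
    KNOWN = "known"          # marked as known: no highlight


class Vocabulary:
    """Hypothetical per-user word statuses, kept in memory only."""

    def __init__(self):
        self._status = {}

    def status(self, word):
        # Words we have never seen default to NEW.
        return self._status.get(word, Status.NEW)

    def look_up(self, word):
        # A lookup promotes a new word but never demotes a known one.
        if self.status(word) is Status.NEW:
            self._status[word] = Status.LOOKED_UP

    def mark_known(self, word):
        self._status[word] = Status.KNOWN

    def unknown_words(self, words):
        """Unique words not yet marked as known, in reading order."""
        return [w for w in dict.fromkeys(words)
                if self.status(w) is not Status.KNOWN]


vocab = Vocabulary()
lesson = ["ไม่", "ได้", "เจอ", "กัน", "ไม่"]  # an already-segmented text
vocab.mark_known("ไม่")
vocab.look_up("เจอ")
print(vocab.unknown_words(lesson))   # ['ได้', 'เจอ', 'กัน']
```

The list returned by `unknown_words` is the "list of your unknown words" from the first post; the status of each word decides its highlight colour.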

Something else I want to do is have the option to translate into three different languages: English, Thai and Russian. Some language learners don't like using translation, so they can toggle to Thai. There are so many Russian-speaking farang in Thailand now, and I speak Russian, so I'd like that option to increase our audience.

So let me shift gears a little bit here. I bring capital and ideas to the table. I need more information and ideas from users and IT people. I'm anxious to pull the trigger, but it would be nice if we could come up with a rough plan before I start hiring people. I haven't forgotten you innerspace and bytebuster. Hey, are there any Russians learning Thai here? Interested to hear what you have to say.

Posted

ไม่ได้เจอกันตั้งนาน นมโตขึ้นเยอะเลย (long time no see; {your} breast has grown so much)

link?

One, two. In fact, all top 20 Google results are fine.

This is another good point. To answer it, I have to ask myself why I'm not completely satisfied with Thai to English. It's a wonderful free tool, and I hope people won't be offended by what I have to say.

There are many tools on the Web. Altogether, they cover all my needs; however, you are right that no single tool does everything.

Color highlighting, user's history, and rich multimedia seem to be nice features, but IMHO they alone would not turn "yet another site" into a unique one, which is your primary goal.

Something else I want to do is have the option to translate into three different languages: English, Thai and Russian. Some language learners don't like using translation, so they can toggle to Thai. There are so many Russian-speaking farang in Thailand now, and I speak Russian, so I'd like that option to increase our audience.

I do speak several Slavic languages, including Russian.

In my opinion, there is little to no demand in the Russian-speaking community for learning Thai. Solvent demand is even lower, IYKWIM.

I have added Cyrillic transcription to my project (in fact, several variants, including the official transcription by L.N. Morev), and, according to my stats, about 2-5% of visitors use it.

I don't want to rant about them, but it seems that people who are interested in languages would study English first; after that, studying Thai through English would not be a problem for them.

Posted

One, two. In fact, all top 20 Google results are fine.

I see my joke failed.

There are many tools on the Web. Altogether, they cover all my needs, however you are right, there is no single tool doing everything.

Color highlighting, user's history, and rich multimedia seem to be nice features, but IMHO they alone would not turn "yet another site" into a unique one, which is your primary goal.

I make do with what's available, as you and everyone else do. But I'd like to provide some better tools. My primary goal is to be useful, rather than unique. But I'm open to suggestions...

Any of you Thai language learners have ideas for tools that you'd like to use?

Posted

I was intrigued by some of the technical issues involved in creating such a site, and since I've had a few spare hours over the last couple of days I created a "proof of concept" application - http://thai-notes.co...ordSearch.shtml. In the process I learned a few things:

(1) The LEXiTRON dictionary is missing a lot of very basic words such as เอา, ว่า, อังกฤษ, เริ่ม and เป็น (there are dozens more) - though the dictionary does have them in compounds. Though bytebuster mentioned ORCHID as an alternative I can't find a downloadable version of the dictionary and suspect it's only available on-line.

(2) My approach is to download a Thai-English dictionary file once when the application is first run, rather than making lots of requests for individual words. With my (not very fast) Internet connection, this didn't take too long - about 20 seconds. This is too long to be acceptable, but this could be improved upon. (For example, using TIS-620 encoding, rather than the UTF-8 that I used, the dictionary size would be roughly halved - 10 seconds, but still too long.)

(3) The flip side of having a local dictionary is that the word look up is blazingly fast - I was surprised, particularly because I put no effort into optimising my code for speed - it's really inefficient. Of course, it's a trade-off between a slow first-time load with fast look up and a faster initial load with slower (AJAX) look up.

(4) My approach to word look up is pretty simplistic: I parse the text for possible syllable boundaries, then I perform a "greedy match" for the longest words/phrases. This works reasonably well, though a "maximal match" approach would probably produce marginally better results.

(5) Thai2English undoubtedly does a better matching job than I, and that's primarily down to their having a better dictionary to work with. For example, their dictionary contains words such as "Obama", though they don't have "Mitt". (But then everyone's forgotten Mitt already, so that doesn't really matter.) Of course, they're a commercial site, so they've got the resources to keep their dictionary maintained. I used the LEXiTRON 2.0 dictionary data - NECTEC hasn't made a later version of the dictionary available for download, but I suspect a newer version of the dictionary wouldn't make that much of a difference. (And, yes, I know that one can hack the dictionary in later versions of LEXiTRON, but I don't currently have access to a computer running Microsoft Windows.)
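
For anyone curious what the "greedy match" in point (4) looks like, here is a minimal Python sketch. It is not the application's actual code: it works directly on characters rather than on pre-computed syllable boundaries, and `greedy_segment` and the toy dictionary are made up for illustration.

```python
def greedy_segment(text, dictionary):
    """Split text left to right, always taking the longest dictionary entry."""
    max_len = max(map(len, dictionary))
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched here: emit one character as an unknown token.
            words.append(text[i])
            i += 1
    return words


# Toy dictionary, purely illustrative (not LEXiTRON data).
TOY_DICT = {"ดู", "ดวง", "ดูดวง", "ตา", "ตาก", "กลม", "ลม"}
print(greedy_segment("ดูดวง", TOY_DICT))   # ['ดูดวง']
print(greedy_segment("ตากลม", TOY_DICT))   # ['ตาก', 'ลม']
```

The second call shows the weakness of a purely greedy pass: once it commits to the longest prefix it never reconsiders, so it cannot produce ตา-กลม.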

Anyway, if you were to have a look at the application and let us know what you think, that would be good. It's at http://thai-notes.co...ordSearch.shtml. At the very least it might stimulate a few thoughts to help the OP with his mission.

Posted

ThaiNotes - that's awesome! I posted some text from a VOA article. Most words were defined, but some weren't. A simple one that wasn't defined - ด้า

The word list looks nice, and I can cut & paste into a text file, from which I can load anki. Perfect.

note: I couldn't get the wordlist to grow after the first piece of text I input

Posted

(4) My approach to word look up is pretty simplistic: I parse the text for possible syllable boundaries, then I perform a "greedy match" for the longest words/phrases. This works reasonably well, though a "maximal match" approach would probably produce marginally better results.

You should definitely support non-greedy match. Consider ตากลม, it can be parsed ตาก-ลม (which your app does) or ตา-กลม.

Otherwise, a good start!
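
To illustrate the ตากลม ambiguity, here is a small Python sketch that lists every way of splitting a string into dictionary words instead of committing to the first greedy choice. The function name and the four-word dictionary are assumptions for illustration only, not code from either project.

```python
def all_segmentations(text, dictionary):
    """Return every way to split text into dictionary words."""
    if not text:
        return [[]]  # one way to split the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this prefix and recurse on whatever is left.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


TOY_DICT = {"ตา", "ตาก", "กลม", "ลม"}
for parse in all_segmentations("ตากลม", TOY_DICT):
    print("-".join(parse))
# ตา-กลม
# ตาก-ลม
```

Plain recursion like this can blow up on long inputs, so a real implementation would probably cache results per position or prune unlikely branches.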

  • 2 months later...
Posted

The proof-of-concept application I wrote has evolved into a fully-fledged dictionary application.

* It now supports single word or bulk lookup as well as browsing the dictionary alphabetically

* The parsing of the bulk lookup is much improved and now handles ambiguity in parsing words

* The application works both Thai to English and English to Thai for almost all functionality.

* The dictionary has been upgraded to the LEXiTRON 2.6 version which is more comprehensive than version 2.1 (though still has its limitations)

I have though, for the moment at least, dropped the "personal word list" feature.

I (perhaps rather immodestly) think that in some respects this is one of the best on-line Thai-English dictionaries out there.

* It is blazingly fast (after the initial dictionary loading, which can take a few seconds)

* It's easy to use

* No need to register and log in to use it (as with LEXiTRON itself)

* For any Thai word you only need to hover over the word to see a brief definition. (Clicking on any word, Thai or English, retrieves a full definition from the server which may take a second or two, depending on your network connection speed.)

Anyway, before I release the dictionary to the public I'd appreciate any feedback. In particular, I'd like to know if it works properly in Internet Explorer. (I've tested on Chrome, Firefox, Opera and Safari - but don't have access to a Windows PC to test Microsoft's browser.)

The link is http://thai-notes.com/tools/predictionary.shtml

You can give feedback either here, via PM or by the form on the website.

Thanks for reading.

TN

Posted

Seems to function on Explorer, but not in Chrome.

Not in Chrome? That really surprises me. I did all my primary testing on Chrome, and it's working fine for me. What happens when you try to use it?

Good to hear it seems OK in Internet Explorer.

Thanks.

Posted

I get the result, "Nothing found" for every and any Thai entry. I should also mention this is using Chrome in a Vista Home environment.

Posted

Got it to work in Firefox.

You need a way to close a text bubble once read since it blocks out further text behind it.

Posted

Anyway, before I release the dictionary to the public I'd appreciate any feedback.

Looks great so far. I tried the bulk lookup on Safari, for this text:

Tom Donilon ที่ปรึกษา ปธน. โอบามา ด้านความมั่นคงแห่งชาติประกาศมาตรการลงโทษธนาคารของเกาหลีเหนือเพื่อตัดช่องทางการเงินในการพัฒนานิวเคลียร์และอาวุธร้ายแรงอื่นๆ เขากล่าวว่าสหรัฐต้องการการเจรจาอย่างจริงใจกับกรุงเปียงยาง แต่ที่ผ่านมาท่าทีดังกล่าวถูกตอบรับด้วยความก้าวร้าวและยั่วยุจากรัฐบาลเกาหลีเหนือ

The dictionary incorrectly parsed โอบามา, นิวเคลียร์, อื่นๆ, กรุงเปียงยาง, and didn't find สหรัฐ at all.

That being said, I feel the parsing is now as good as or better than Thai to English. I'm glad you are shooting for a more literal translation, at the word level, than at the sentence level, as it's much more useful to language learners. If you ever decide to go to the sentence level, I hope you will keep the word level translation as an option. This is probably the biggest problem with Thai to English right now - nonsensical long phrases.

As another poster mentioned - the pop-up definitions often interfere with being able to highlight the next word.

And finally, as much as I hate romanization, even as a pretty good reader there are some words that elude me, so it would be nice if you added pronunciation. I prefer Thai phonetic script, but romanization would be ok.

Posted

Katana, Leosmith: The tooltip/text bubble problem has been fixed, I believe. I've also added tooltips to the buttons in the Browse Dictionary tab since the button font is rather on the small side.

DavidHouston: I can only think that the "Not found" problem in Chrome is because the dictionaries haven't loaded correctly for some reason. I'll have a think and add some extra checking to pick this up.

Leosmith: There is Haas/AUA-style IPA for 41203 out of 51477 entries (80%). Unfortunately, I'm limited in what I can do by the quality of the LEXiTRON data. The latest publicly available dictionary data are a few years old now - NECTEC hasn't released the data for its latest version, which is on-line only. This also explains why words like โอบามา aren't found. สหรัฐ wasn't found because it's not spelled correctly (at least according to the dictionary) - it should be สหรัฐฯ. Some very common words are also missing, such as (in English) "you". อื่นๆ, however, sounds like a programming mistake. I'll look into it.

Thank you all for your feedback.

Posted

Are you going to include the other Lexitron data, that is, sample sentences, synonyms, and related words? It would be very helpful if you could include translations of all the sample sentences.

Posted

Are you going to include the other Lexitron data, that is, sample sentences, synonyms, and related words? It would be very helpful if you could include translations of all the sample sentences.

That's all already there. Hover over an orange word for a brief definition. Click on a word and all the other information appears on the right hand side of the screen. And the sample sentences are also parsed with hover over for definitions, and clicking for full definition.

Posted

Are you going to include the other Lexitron data, that is, sample sentences, synonyms, and related words? It would be very helpful if you could include translations of all the sample sentences.

Sorry, I am unable to see more than just the basic definition, even in Internet Explorer. Must be my machine. Great job on what you did!

  • 2 weeks later...
Posted

FWIW, I've added "My Wordlist" functionality to the dictionary.

I've also made all functionality fully bi-directional. So, for example, hovering over an English word now brings up definitions in Thai.

I've made several hundred corrections to the LEXiTRON data (mostly for English words).

And I've fixed all known bugs (not that any was particularly significant IMO).

Anyway, that's it. I've done all I want to do with the application. Work over.

http://thai-notes.com/tools/predictionary.shtml

Posted

Thanks for all the work on that. Would you ever consider releasing an offline installable version for use, in case the website ever went down?

Posted

(4) My approach to word look up is pretty simplistic: I parse the text for possible syllable boundaries, then I perform a "greedy match" for the longest words/phrases. This works reasonably well, though a "maximal match" approach would probably produce marginally better results.

You should definitely support non-greedy match. Consider ตากลม, it can be parsed ตาก-ลม (which your app does) or ตา-กลม.

Otherwise, a good start!

I think that in most cases a greedy or maximal match will work better than a purely analytic (non-dictionary-based) approach. A simple example is ดูดวง. Also, as the dictionary grows, a greedy match will gradually give better results.

The analytic approach you use is, from an academic point of view, a beautiful piece of work (also from a software design point of view).

Probably setting up a tree structure, like you did in your analytic approach and giving a score to each branch can also be done when using greedy (dictionary based) matches. You could build a tree of greedy match possibilities, give a score to each branch and show a list of the highest scoring translations.
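
A rough Python sketch of that idea: build the tree of dictionary matches, add up a score along each branch, and return the best few parses. The weights below are invented numbers, not corpus frequencies, and `ranked_parses`, `WEIGHTS` and `UNKNOWN_PENALTY` are hypothetical names; treat it as one possible scoring scheme under those assumptions.

```python
from functools import lru_cache

# Hypothetical per-word weights (higher = more plausible); not real data.
WEIGHTS = {"ตา": 5.0, "ตาก": 1.0, "กลม": 4.0, "ลม": 3.0}
UNKNOWN_PENALTY = -10.0  # assumed cost of skipping one unmatched character


def ranked_parses(text, weights, top_n=3):
    """Return up to top_n (score, words) parses, best score first."""

    @lru_cache(maxsize=None)
    def branches(start):
        if start == len(text):
            return ((0.0, ()),)
        found = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in weights:
                for score, rest in branches(end):
                    found.append((weights[word] + score, (word,) + rest))
        if not found:
            # Dead end: skip one character, at a penalty, and carry on.
            for score, rest in branches(start + 1):
                found.append((UNKNOWN_PENALTY + score, (text[start],) + rest))
        # Scores are additive, so keeping the best top_n per node is enough.
        return tuple(sorted(found, reverse=True)[:top_n])

    return [(score, list(words)) for score, words in branches(0)]


for score, words in ranked_parses("ตากลม", WEIGHTS):
    print(score, "-".join(words))
# 9.0 ตา-กลม
# 4.0 ตาก-ลม
```

With these made-up weights ตา-กลม comes out on top, but the ranking is only as good as the weights, which would have to come from real usage data.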

Posted

The bulk lookup doesn't seem to work on either windows or linux here (chrome).

That's a bit of a surprise.

Could you be a bit more specific, please, so I can track down the problem? For example, does the Bulk Lookup tab load? Do you see the text area into which to paste? When you paste into the area, does anything happen? If so, what? Any sort of pop-up error message?

Have you tried doing a hard reload of the page (Ctrl + F5)?

You could also try clearing your local storage. This is best done by right-clicking on the page and selecting "Inspect Element", then clicking on "Resources", then "http://thai-notes.com" under "Local Storage". You'll then need to wait a few seconds and should see three entries - dict.en, dict.th and dict.version. (The latter should have the value 53.) Right-click on one of the first two entries and select "delete". This will make the dictionary "broken", so the next time you reload the page you should see an error message and be prompted to redownload the dictionary.

If this solves the problem I'd be particularly interested because it suggests that somehow inconsistent dictionary data are loaded, which is an area I've had problems reported before and I thought I'd fixed.
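
For readers wondering what "inconsistent dictionary data" means in practice, the check can be modelled roughly like this. It is a Python sketch with a plain dict standing in for the browser's local storage; the real application runs in the browser and its internal logic isn't shown in this thread, so the function names and the re-download step are assumptions. Only the three key names and the version value 53 come from the post above.

```python
EXPECTED_VERSION = "53"  # value quoted above; stored as a string in this model
REQUIRED_KEYS = ("dict.en", "dict.th", "dict.version")


def cache_is_usable(storage):
    """True only if all three entries exist and the version matches."""
    return (all(key in storage for key in REQUIRED_KEYS)
            and storage["dict.version"] == EXPECTED_VERSION)


def ensure_dictionary(storage, download):
    """Re-download when the cache is broken; return True if that happened."""
    if cache_is_usable(storage):
        return False
    # Drop any partial leftovers so the three entries can never disagree.
    for key in REQUIRED_KEYS:
        storage.pop(key, None)
    storage.update(download())
    return True


def fake_download():
    # Hypothetical stand-in for fetching the dictionary files from the server.
    return {"dict.en": "...", "dict.th": "...", "dict.version": EXPECTED_VERSION}


storage = fake_download()
del storage["dict.th"]                             # simulate the manual delete
print(ensure_dictionary(storage, fake_download))   # True  (cache was broken)
print(ensure_dictionary(storage, fake_download))   # False (cache now intact)
```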

Posted

Sorry for my last post. I should have looked more carefully. It works fine. I didn't realise that the translation would pop up when you move the cursor over the words. I expected to see a translation and I didn't see it. I feel really stupid and apologize for the time you lost trying to solve this.

You made a very nice tool. We once worked on a project to put the FSI Thai language course online and add Thai script. The system works in a wiki (look for "FSI Thai language wiki" in Google). It's free, without advertising, and paid for by some of the volunteers. It would be nice if the transliterations popped up when you move your cursor over the Thai words. Our dictionary is very small, as the course contains very few words. None of the volunteers working on the wiki managed to implement a system with popup transliterations. We are lacking web-programming skills. If you would be interested in helping or giving suggestions, you can contact me or davidhouston.

Posted

Do you mean something like http://thai-notes.com/test/ThaiTips.html?

It was pretty trivial for me to produce a cut-down version of the Bulk Lookup functionality of my dictionary and to use it to go through a web page to add translation popups.

To implement it would need a single line of code to be added to each page - though I presume that your site uses templates, so the code could be added to the template(s) as required. And there are a few files that would need to be hosted on your server.

I've used a different technology for storing the dictionary locally, which means that the dictionary is deleted when the browser is closed. This allows the code to be simpler and hence smaller. The dictionary is small, so the performance hit when downloading it is negligible. (26 kB is enough for the vocab for the first 15 lessons.)

In Chrome there's an annoying flash of the text when the tooltips are added. Not sure if there's anything I can do about that. Firefox and Opera, however, don't seem to have the same problem.
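
As a rough idea of what adding the popups involves, here is a Python sketch that wraps every glossary word found in a piece of text in a `<span>` whose `title` attribute holds the gloss, which browsers show as a tooltip on hover. The real version would be JavaScript working on the page itself; `add_tooltips` and the two-word glossary are hypothetical and only model the matching and markup step.

```python
import html


def add_tooltips(text, glossary):
    """Wrap each glossary word in a <span title="..."> (longest match first)."""
    max_len = max(map(len, glossary))
    out = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            word = text[i:j]
            if word in glossary:
                out.append('<span title="{}">{}</span>'.format(
                    html.escape(glossary[word], quote=True),
                    html.escape(word)))
                i = j
                break
        else:
            # Not a glossary word: copy the character through, escaped.
            out.append(html.escape(text[i]))
            i += 1
    return "".join(out)


# Tiny illustrative glossary, not the course vocabulary.
GLOSSARY = {"ตา": "eye", "กลม": "round"}
print(add_tooltips("ตา กลม", GLOSSARY))
# <span title="eye">ตา</span> <span title="round">กลม</span>
```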

If you're interested, PM me, or send me a message via the website, and I'll tidy up the code and email it to you along with installation instructions.
