New front end to RI dictionary (alpha)

ThaiNotes · May 31, 2014

I get very frustrated using the Royal Institute's on-line dictionary. For a start, every time I use it I have to change the encoding manually. Plus, the alphabetic listings of words are incomplete. Anyway, rather than just stay frustrated, I decided to write a new front end to the data. The results of my efforts are at:

http://thai-notes.com/dictionaries/RIDictionary.html

This is still very much work in progress, but it's definitely usable. However, I was wondering whether anyone would like to suggest any specific features to add. Thoughts I have include:

Link terms in definitions to where they refer to
Hover-over pop-up box to explain any abbreviations
Incorporating the RI's Dictionary of New Words
Reverse lookup (i.e. find all definitions containing a given word)

I have also wondered about providing an option to make it more friendly for native English speakers. For example, translating all abbreviations and providing IPA pronunciation, rather than the Thai system. Though perhaps someone advanced enough in their studies to use the RID doesn't need this sort of hand holding. Thoughts?

What else would be good?

This is early, alpha software, so there are quite a few known issues, including:

There are some errors in parsing the RI data resulting in a few (<0.5%) erroneous entries and truncated definitions. A few lookups result in "not found".
The sequence of the word suggestions is a little off. (Entries for prefixes are all listed before the first non-prefix.)
Fonts aren't being properly loaded from the Internet; local fonts are being used.
No bold/italics in definitions.
If you resize your browser window the application display doesn't change.
When there's a long definition a scrollbar should appear. It doesn't always do so, and may be less than usual width.
No integration with the rest of the site for customisable font size and color theme.

There may be some browser-related issues. I don't use Microsoft Windows or MacOS, so I have to test Internet Explorer and Safari using emulation software. If anyone is using these browsers, I'd love to know whether they really do have the problems I experienced:

Safari: letters typed into the word input box don't display.
Internet Explorer (tested with version 8) loads the data and displays the keyboard, but every word lookup fails.

Thank you for reading so far.

As always, any and all feedback much appreciated.

kriswillems · May 31, 2014

Nice. Please have a look at the faranglearnthai Facebook group. There are other developers there and I think they could help you or you could help them.

I miss the functionality the RID has, like using * and just browsing through all words starting with a certain consonant.

katana · May 31, 2014

Nice work. It just needs an offline, installable version now! (If no one's already done this.)

ThaiNotes · June 1, 2014

I miss the functionality the RID has, like using * and just browsing through all words starting with a certain consonant.

Thanks for the feedback. Both those features are on the "to do" list.

As for the use of *, a couple of questions:

(1) If you search for ก* do you expect to return words such as เก่ง and ไก่?

(2) If you search for ง* do you expect to return words such as หงุบ and หงำ?

(3) Are wildcards in the middle of an expression actually useful (e.g. หญิง*การ)? What about multiple wildcards in an expression?

Alphabetic browsing will come after I've managed to sort out all the data issues. No point in creating the (static) pages until I can parse all the RID entries correctly. (I'm now at a point where I have to track down the problems of the dodgy entries one by one. It's very time-consuming.)

kriswillems · June 1, 2014

My opinion:

(1) No

(2) No

(3) Yes, more wildcards are also useful: wildcards at the first position, in the middle of a word and at the end of a word. Even multiple wildcards would be nice to have.

Even ? (single character) would be very useful!

You could also consider allowing unix regular expressions, but they might be a bit hard to use for most.

Sealang also supports regular expressions but they don't really work. I am still waiting for the first dictionary with regular expressions that work.

I think your web page would be visited often, because the RI dictionary website is really terrible.

It's a shame that such a huge piece of work can't be made more useful.

Thank you for the good work.

AyG · June 1, 2014

Thanks once more for the thoughts.

I think real regular expressions would be a bit of overkill, and beyond the ken of most users.

I've come up with a simplified system which I've got working.

v matches 0 or more vowels
V matches 1 or more vowels (but in practice there'll only ever be one)

c matches 0 or more consonants
C matches 1 or more consonants

? matches a single character
+ matches 1 or more characters
* matches 0 or more characters

h matches 0 or more hor hip

i means ignore all tone marks
I means ignore all tone marks plus thanthakat and maithaiku

(i and I can be placed anywhere in the expression to work)

Any of these wildcards can be used anywhere in the query. Multiple wildcards fully supported.

So, to find all words beginning with kor kai: ก*
To find all words where the first consonant is kor kai: vก* (results will include เก่ง and ไก่ as well as กด and ก้น)

To ignore leading hor hip just start the query with h.
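For anyone curious how a scheme like this can work behind the scenes, here is a minimal Python sketch that turns each wildcard into a piece of a regular expression. It is not the code the site runs. The character ranges chosen for "vowel" and "consonant", and the names compile_query and search, are assumptions made for illustration only.

```python
import re

# Assumed character classes from the Unicode Thai block; the real tool may differ.
CONSONANT = "[\u0E01-\u0E2E]"            # ก to ฮ
VOWEL = "[\u0E30-\u0E39\u0E40-\u0E44]"   # ะ to ู, plus เ แ โ ใ ไ
TONE_MARKS = "\u0E48\u0E49\u0E4A\u0E4B"  # mai ek, mai tho, mai tri, mai chattawa
MAITAIKHU = "\u0E47"
THANTHAKHAT = "\u0E4C"

# Each wildcard maps to a regular-expression fragment.
TOKENS = {
    "v": VOWEL + "*",
    "V": VOWEL + "+",
    "c": CONSONANT + "*",
    "C": CONSONANT + "+",
    "?": ".",
    "+": ".+",
    "*": ".*",
    "h": "\u0E2B*",  # zero or more hor hip
}


def compile_query(query):
    """Return a function that tests whether a word matches the query."""
    # i and I switch on mark-stripping wherever they appear in the query.
    if "I" in query:
        ignored = TONE_MARKS + THANTHAKHAT + MAITAIKHU
    elif "i" in query:
        ignored = TONE_MARKS
    else:
        ignored = ""

    parts = []
    for ch in query:
        if ch in "iI" or ch in ignored:
            continue  # flags and ignored marks add nothing to the pattern
        parts.append(TOKENS.get(ch, re.escape(ch)))
    pattern = re.compile("".join(parts))
    strip_table = {ord(mark): None for mark in ignored}

    def matches(word):
        # The whole word must match, after removing any ignored marks.
        return pattern.fullmatch(word.translate(strip_table)) is not None

    return matches


def search(query, words):
    """Return the words that match the query, in their original order."""
    matches = compile_query(query)
    return [word for word in words if matches(word)]
```

With a small word list such as กด, ก้น, เก่ง, ไก่ and ขา, search("ก*", words) should keep only กด and ก้น, while search("vก*", words) should also pick up เก่ง and ไก่. Building on a regular-expression engine keeps the sketch short, but it is only one possible design.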

I would have uploaded the new version of the software, but unfortunately my server has technical problems and is down at the moment. The hosting company is unfortunately unable to give a time to fix. I'll post again once things are back to normal and I've uploaded the new version with wildcard support.

kriswillems · June 1, 2014

Looks OK. So, you're ThaiNotes?

AyG · June 1, 2014

No. ThaiNotes is my partner. We share a computer. Mistakes get made.

Richard W · June 2, 2014

(1) If you search for ก* do you expect to return words such as เก่ง and ไก่?

(2) If you search for ง* do you expect to return words such as หงุบ and หงำ?

(3) Are wildcards in the middle of an expression actually useful (e.g. หญิง*การ)? What about multiple wildcards in an expression?

Question 1 is an interesting can of worms, which I've thrown over to the public Unicode email list. I can formally justify it including the words with preposed vowels on the basis of comparing by collation, and until 3 years ago the Unicode recommendations for regular expressions would have required it as the Level 3 interpretation of [[ก-ข]-[ข]].* I've a feeling that, except for the set differencing, that should also find the words with the preposed vowels for POSIX. However, this is at the level where the behaviour of apparently compliant implementations was becoming too bizarre, and the recommendations were withdrawn. Unfortunately, collation can contain a lot of bizarre features which are fine for putting things in an alphabetical order, but which are very bizarre for expressing word structure.

Question 2 just shows that collation as used for ordering and collation as used for searching can be quite different things. The answer should probably be 'sometimes'.
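To make the difference between ordering and plain character comparison concrete, here is a rough Python sketch. It is a deliberately simplified stand-in for real Thai collation: it only swaps each preposed vowel with the character that follows it before comparing, and it gives tone marks no special treatment. The name dictionary_key is made up for this example.

```python
# The five preposed vowels: เ แ โ ใ ไ
PREPOSED_VOWELS = set("\u0E40\u0E41\u0E42\u0E43\u0E44")


def dictionary_key(word):
    """Swap each preposed vowel with the following character.

    This is a simplification of Thai dictionary ordering, not a full
    collation algorithm.
    """
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREPOSED_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return "".join(chars)


words = ["ไก่", "กด", "ขา", "เก่ง"]

# Plain code point order puts the words written with a leading vowel last.
by_code_point = sorted(words)

# The simplified dictionary key files them under kor kai instead.
by_dictionary = sorted(words, key=dictionary_key)
```

Here by_code_point comes out as กด, ขา, เก่ง, ไก่, while by_dictionary comes out as กด, เก่ง, ไก่, ขา. The same swap that helps with ordering is what makes a search for ก* "find" เก่ง, which is why the two uses can pull in different directions.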

Your notations for optional vowels and optional ho nam provide practical solutions.

I suggest you keep your matching logic at the codepoint level - anything else is very much a deluxe feature.

ThaiNotes · June 2, 2014

I suggest you keep your matching logic at the codepoint level - anything else is very much a deluxe feature.

Could you please explain what you mean by "codepoint level"? I'm not a professional programmer - very much an amateur hobbyist - and I'm not familiar with all the technical terms.

ThaiNotes · June 2, 2014

OK, the wildcard feature is now live.

Summary of the wildcards with examples at http://thai-notes.com/dictionaries/wildcards.html

The dictionary itself remains at http://thai-notes.com/dictionaries/RIDictionary.html

kriswillems · June 2, 2014

This is great.

Did you manage to parse the 0.5% remaining entries?

If you give this a nice layout, and put the instructions on the first page, nobody will use the original RID website anymore.

Maybe also leave a small (vertical) space between the (matching) entries

Richard W · June 2, 2014

Could you please explain what you mean by "codepoint level"?

Just do the comparisons character by character, as you would if you were comparing ASCII strings.
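As a small illustration of what character-by-character comparison means for Thai, here is a Python sketch. The helper names code_points and starts_with are invented for this example.

```python
def code_points(word):
    """List the Unicode code points of a word in stored order."""
    return [f"U+{ord(ch):04X}" for ch in word]


def starts_with(word, prefix):
    """Compare code point by code point, as with ASCII strings."""
    return word[:len(prefix)] == prefix


# เก่ง is stored as sara e, kor kai, mai ek, ngor ngu:
# code_points("เก่ง") gives ['U+0E40', 'U+0E01', 'U+0E48', 'U+0E07']
```

Because the stored string begins with the vowel เ rather than with ก, starts_with("เก่ง", "ก") is False at this level, whereas starts_with("กด", "ก") is True. That is consistent with ก* not returning เก่ง unless a wildcard such as v is put in front.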

ThaiNotes · June 2, 2014

This is great.
Did you manage to parse the 0.5% remaining entries?
If you give this a nice layout, and put the instructions on the first page, nobody will use the original RID website anymore.
Maybe also leave a small (vertical) space between the (matching) entries

Thanks.

The remaining 0.5% of entries are a pain. To be honest, I'm putting it off. Programming is far more fun than trying to make sense of the horrible, hand-coded mess that is the RID website. (I find it quite shocking that the dictionary isn't maintained in a database with the webpages automatically generated. But perhaps I shouldn't be so shocked when I consider that the committee is a group of elderly experts, meeting occasionally, and working with a card filing system.)

The layout at the moment isn't high priority. That's easy to tweak at the end. I agree about the vertical space between entries. That occurred to me this morning once I got the wildcards working properly and started returning more than a handful of entries on each query. I also really hate the font that's being used - not crisp. I'm struggling to work out why Droid Sans Thai isn't being used, which is what it should be.

By "instructions" do you mean the wildcard information? My plan at the moment is to put a "Help" link just below the text input box which will bring up the wildcard information on the right - just like a dictionary entry. I rather suspect that the vast majority of users isn't interested in wildcards - for them the predictive input is enough. And beyond the wildcards I'm not sure that there's really any need for instructions on how to use the page, though I'm happy to concede I'm wrong.

kriswillems · June 2, 2014

Very strange that the dictionary isn't available in database format ....

DavidHouston · June 4, 2014

Friends,

I know this is an off-the-wall comment, but would it not be more useful to have a full translation of the RID entries than merely tweaking the access mechanism? The RID has lots of useful definitional and explanatory content, as well as sample sentences and phrases. A translation to English of this content would be really beneficial to the foreign-learning community, don't you think?

kriswillems · June 5, 2014

That's true David. But giving the RID a new interface is a few days work, translating every entry ... a few years?

And I like the RID because the explanations are in Thai. By just reading the explanation of the word, you get an idea in which context this word is used and what the surrounding vocabulary is.

ThaiNotes · June 7, 2014

Alpha #2 now available.

(1) I think I'm parsing the RID data better now. I suspect there's only one entry that's behaving badly, but need to do further checking to make sure I'm right on that.

(2) I've added a "browse" function, allowing you to list all words starting with a given consonant. At the moment this is done by database query. If this proves to be too much of a performance hit I'll change it to static pages.

(3) I've added a "reverse lookup" function. In other words you can search for all definitions containing a given word. It's very crude, just matching the characters entered, rather than matching exact words. I had hoped to rely upon third party software to handle the parsing, but it's not up to the job. I can improve the results, but this is low priority for me at the moment.

(4) I've improved the formatting of the definition, with better spacing between entries.

(5) A few other, minor bugs have been fixed and enhancements made.

It's still at http://thai-notes.com/dictionaries/RIDictionary.html
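To show what "just matching the characters entered" means for the reverse lookup, here is a minimal Python sketch. The entries are invented sample data with placeholder headwords, not real RID content, and the real site uses a database query rather than an in-memory dictionary.

```python
def reverse_lookup(term, entries):
    """Return headwords whose definition contains the term as a raw substring.

    There is no word segmentation here: any run of matching characters counts.
    """
    return [head for head, definition in entries.items() if term in definition]


# Invented sample data: placeholder headwords mapped to short definition snippets.
entries = {
    "sample-1": "ขึ้นในน้ำ",
    "sample-2": "น้ำตาไหล",
    "sample-3": "ไม้ล้มลุก",
    "sample-4": "ตามลำดับ",
}

hits_for_water = reverse_lookup("น้ำ", entries)
hits_for_ta = reverse_lookup("ตา", entries)
```

Searching for ตา returns sample-4 as well as sample-2, because ตา also appears inside ตาม. Avoiding that kind of false hit needs proper Thai word segmentation, which is the part third-party software was meant to handle.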

ThaiNotes · June 7, 2014

Friends,

I know this is an off-the-wall comment, but would it not be more useful to have a full translation of the RID entries than merely tweaking the access mechanism? The RID has lots of useful definitional and explanatory content, as well as sample sentences and phrases. A translation to English of this content would be really beneficial to the foreign-learning community, don't you think?

An interesting idea. My doubts:

(1) It's a lot of work (over 40,000 main entries). Are there enough people willing to lend a hand in translating to get it done in a reasonable timescale?

(2) What would be the advantage over a Thai-English dictionary such as Wong Wattanaphichet's dictionary (which I consider excellent)? Albeit, that's a dictionary written on sliced, dead trees, not on-line.

(3) If someone needs a dictionary as comprehensive as the RID, do they really need translations?

Thinking from a technical point of view, I'm reminded of http://www.thaisubtitle.com/ which is a community site for translating movie subtitles. It sort of works. A similar approach could be used for translating the RID.

There are issues, particularly of vandalism. Perhaps a wiki-based solution with security measures and the ability easily to revert to earlier versions of definitions could be used.

As for my

merely tweaking the access mechanism

all I can say is hmmmph.

DavidHouston · June 7, 2014

Thank you so much for that thoughtful and detailed reply. I am not aware of any current or historical effort to translate the Royal Institute dictionary itself, rather than the many attempts at translating the Thai words contained therein.

My English comprehension is fairly limited so I do not understand the technical term "hmmmph". Is this an IT term of art or is it, perhaps, Norwegian?

Thanks.

David in Phuket

kriswillems · June 7, 2014

I really love the new front-end. It's wonderful.
I like how you added the horizontal lines between words.

It would be nice to promote this work, it's in my opinion one of the best online Thai-Thai dictionaries. It has several features most others don't have.

If it would become popular, would your server be able to handle many requests?

Are you planning to promote it?
A simple message in the "Farang can learn Thai" Facebook group with 12000 members, would give you many users....
And how about the copyright? Is there any?

kriswillems · June 7, 2014

May I give one remark? I don't know if it's technically possible to solve it ....

There are many newlines in the explanations. If I narrow down the width of my browser window and I go to the dictionary I get something like this:

น. ชื่อไม้ล้มลุกชนิด Typha angustifolia L. ในวงศ์ Typhaceae ขึ้น
ใน
น้ำ ช่อดอกคล้ายธูปขนาดใหญ่, กกธูป ธูปฤๅษี ปรือ หรือ เฟื้อ ก็เรียก.

Do you notice the newline after ใน?
It kinda messes up the layout.

Enlarging the browser window does not redraw the content correctly. I have to close the browser first, then open a new larger browser window and open the dictionary again.

I think it might be technically hard to filter out the newlines? So you might consider changing it to a fixed width webpage in your css file?

Also, can you change the height of the wildcard reference window?

ThaiNotes · June 8, 2014

If it would become popular, would your server be able to handle many requests?

Are you planning to promote it?
A simple message in the "Farang can learn Thai" Facebook group with 12000 members, would give you many users....
And how about the copyright? Is there any?

The current website hosting is a very cheap, shared server plan. Undoubtedly I'd hit problems if the volume of requests was high. If the problems were from the reverse lookups I'd probably disable that feature, or somehow limit it. (The queries are very database intensive.) If I needed to switch to a more expensive server plan I might add advertising to the site (though I'm not sure how much that would raise), or try to solicit donations to cover costs.

I'm afraid I haven't learnt how to use Facebook. I'm not really sure what it does. I'll look into that later, once the dictionary is out of alpha. I don't want people visiting and finding something's broken and never returning to the site.

I'd also like to find some way later of letting Thai people know the dictionary is there, since I think it would be useful to them, as well as to non-native speakers.

The copyright issue is an interesting one. There is no copyright statement on the RI dictionary website itself that I can see (though there is one on the RI website front page). I'm hoping that by just spidering the RI dictionary website and caching the results I fall into the same category as a search engine such as Google. If the Royal Institute isn't happy with what I've done, then I'll have to take the site down. However, I'd hope they'd see it as something positive. And if they wanted the code, I'd be happy to give it to them to incorporate into their own website.

ThaiNotes · June 8, 2014

May I give one remark? I don't know if it's technically possible to solve it ....

There are many newlines in the explanations. If I narrow down the width of my browser window and I go to the dictionary I get something like this:

น. ชื่อไม้ล้มลุกชนิด Typha angustifolia L. ในวงศ์ Typhaceae ขึ้น
ใน
น้ำ ช่อดอกคล้ายธูปขนาดใหญ่, กกธูป ธูปฤๅษี ปรือ หรือ เฟื้อ ก็เรียก.

Do you notice the newline after ใน?
It kinda messes up the layout.

Enlarging the browser window does not redraw the content correctly. I have to close the browser first, then open a new larger browser window and open the dictionary again.

I think it might be technically hard to filter out the newlines? So you might consider changing it to a fixed width webpage in your css file?

Also, can you change the height of the wildcard reference window?

With the new lines, what you're seeing is the hard coded new lines (after ใน). The break after ขึ้น is being added by your browser because it can't fit all the text onto a single line.
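On the question of whether the hard-coded breaks could be filtered out, here is a rough Python sketch of one way it might be done while the scraped data is being processed. It assumes the breaks arrive either as literal newline characters or as HTML br tags, which has not been checked against the actual RID pages. The function name remove_hard_breaks is made up.

```python
import re

# Assumption: a hard-coded break is either an HTML <br> tag or a raw newline.
HARD_BREAK = re.compile(r"<br\s*/?>|\r?\n", re.IGNORECASE)


def remove_hard_breaks(definition):
    """Delete hard-coded line breaks so the browser alone decides where to wrap.

    The break is removed rather than replaced with a space, because Thai
    words are not separated by spaces. Spaces already in the text are kept.
    """
    return HARD_BREAK.sub("", definition)
```

For example, remove_hard_breaks("ขึ้นใน\nน้ำ ช่อดอก") gives "ขึ้นในน้ำ ช่อดอก". One caveat: if a break in the source stands in for a genuine space (say, between two Latin words), this approach would run the two words together, so a real implementation might need to treat such cases separately.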

You don't need to close your browser to get the text to display properly in a larger browser window. Just reload the page (Ctrl-r) after you resize. Getting the application to redraw the window on resize is already on the "to do" list. (The quickest fix would simply be automatically to reload the page on resize, but that would lose any previous query data. Can't decide whether that would be a problem or not.)

I'll fix the wildcard reference window height with the next release.

ThaiNotes · June 9, 2014

Just a very minor update:

(1) I've added the code to handle when the user resizes the browser window. If you go to a very small window it looks a mess, but that can't really be avoided.

(2) I've added links to the reference pages from the RID covering things such as alphabetic order and etymology. However, the results don't display correctly yet. (Only discovered that after installing the new software version.) It's a pain having to deal with all the idiosyncrasies of the RID's HTML.

(3) Thai font handling should be better now - though it's not as consistent as I'd like.

kriswillems · June 9, 2014

Looks great. You might give a better description if there's no match for a query, for example ก*ผ

Now I get a database error.

I never had issues with the fonts ...I also don't see any difference in the font handling....

One question: whenever I open the webpage it starts with "requesting wordlist". This list is probably used for the input prediction.

Is it technically possible to store this list in cache, so it doesn't have to be loaded every time I open the webpage?

ThaiNotes · June 10, 2014

The database error is because I wasn't checking whether there were zero matches. My bad. Will fix.
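For illustration, here is a minimal Python sketch of checking for zero matches before trying to display results. The site's actual language, database and schema are not stated in this thread, so the SQLite table entries, its headword column and the function names below are all assumptions.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with a hypothetical 'entries' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (headword TEXT)")
    conn.executemany(
        "INSERT INTO entries (headword) VALUES (?)",
        [("กด",), ("ก้น",), ("ขา",)],
    )
    return conn


def lookup(conn, like_pattern):
    """Return all headwords matching a SQL LIKE pattern (may be empty)."""
    rows = conn.execute(
        "SELECT headword FROM entries WHERE headword LIKE ? ORDER BY headword",
        (like_pattern,),
    ).fetchall()
    return [row[0] for row in rows]


def format_results(headwords):
    """Turn a result list into display text, handling the empty case first."""
    if not headwords:
        return "No entries match this query."
    return "\n".join(headwords)
```

With this demo data, format_results(lookup(conn, "ก%ผ")) returns the friendly message instead of failing, because the empty list is handled before anything tries to use the first row.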

The font issue is complex. The page is supposed to be using Google Web Fonts' Droid Sans Thai, however, it doesn't. Not sure whether this is a known bug from Chrome 33, or a problem with the font itself. (A similar issue has been reported for a Hebrew web font from Google.) This resulted in the browser using its default font, which in some browsers (Opera 12, that's you) looks dreadful. Now I encourage the browser to use one of a list of fonts of my choosing. If you're a Linux user, the chances are that your default font was Waree, and that's the font I'm specifying for such users.

It's possible to store the word list locally. I already do this for the LEXiTRON-based dictionary. There are a number of ways to do this and I'll look into it. Not 100% certain it will be faster to load from disk, though, since the data from the Internet is compressed for faster transfer.

kriswillems · June 10, 2014

It takes about 6 seconds from my home in Thailand with a 3BB internet connection for the webpage to become active.

That's not very long, but it's a bit annoying if you want to look up a single word.

Another possibility is to first open the webpage and show the input box, but load the wordlist in background and start with input prediction only when the wordlist is loaded.

Again, I am not sure that's technically possible.

Anyway, the website is now already much better than the original website, because on the original website I lost lots of time with changing the encoding to Thai.

I've been using Chrome on linux. For me the Waree font looks good enough....

kriswillems · June 10, 2014

About the font issue ... do you use Droid Sans Thai?

Support for this font has been discontinued because the Thai font is now included in "Droid Sans". Maybe you could try this font?

Also, the font I see on Linux with Chrome is actually Droid Sans Thai, I think. How can I recognize the font? There doesn't seem to be a problem for me.

kriswillems · June 10, 2014

I tried 3 different html files based on this post.

https://groups.google.com/forum/#!topic/early-access-fonts/07QH6KWRZnU

File 1 : like in the post

File 2: I removed 'Droid Sans Thai'

File 3: I replaced 'Droid Sans Thai' with Waree

All 3 files gave a different output on linux, chrome. The most crisp font I get from File1.

This is also the font I see when I open your webpage.

I tried with firefox on linux: also the same crisp font.

Then I tried on an old Windows XP machine with Chrome. The font looked different, but I don't know if that's because of the font rendering on XP.

Then I tried on internet explorer 8 on XP - it didn't work. I got only Database errors (which is no problem for me because I never use internet explorer).

You might want to check if your interface works on IE9 or what is the latest version?
