New front end to RI dictionary (alpha)

ThaiNotes · June 10, 2014

The Droid Sans version available as a web font is a subset; it doesn't include the Thai character set. (This is to keep the download size small.)

For the first one, I think if you right-click on the text, select "inspect element", then click on the "computed" tab and scroll to the bottom, you'll see what font is actually being displayed. For me it's "OTS derived font" - not Droid Sans Thai as it should be.

Unfortunately, Firefox and Opera don't have a comparable feature (at least that I can find), so what fonts they're actually displaying is a bit of a mystery.

Anyway, unless I get reports of problems I think I'll leave the fonts as they are for the moment.

kriswillems · June 10, 2014

OK, now I see it:

OTS derived font—92 glyphs

Arial—5 glyphs

kriswillems · June 10, 2014

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<link href='http://fonts.googleapis.com/css?family=Droid+Sans+Thai' rel='stylesheet' type='text/css'>
<style>
h1, p {font-family: 'Droid Sans Thai', 'Source Sans Pro', sans-serif;}
</style>
</head>
<body>
<h1>ก๊ก</h1>
<p>น. พวก, หมู่, เหล่า, โดยปรกติมักใช้เข้าคู่กันว่า เป็นก๊กเป็นพวก เป็นก๊กเป็นหมู่ เป็นก๊กเป็นเหล่า.</p>
</body>
</html>

This works for me. Rendered fonts:

Droid Sans Thai—92 glyphs

Arial—5 glyphs

I think "," and "." are rendered in Arial.

kriswillems · June 10, 2014

I suspect you were the person asking the font question on Google Groups. If not, this answer from Google Groups also solves the problem:

Try changing

http://fonts

to

//fonts

ThaiNotes · June 12, 2014

Just released an updated version of the program. Changes include:

(1) Locally storing the word list to save time on subsequent startups. (The local copy of the list will be updated if the master list changes.)

(2) Added ability to close dictionary entries by clicking on the x in the top right corner of each entry.

(3) Added suggested words for wildcard lookups. (I had worried this might slow things down, but it seems OK to me.)

(4) Fixed a problem with displaying entries with multiple subdefinitions.

(5) Changed the alphabetic sort order - leading hyphens are now ignored. (Previously it wasn't possible to select the entry for "ก" because the entries beginning "-ก" sorted before it.)
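As an illustration of change (5), here is a minimal sketch of a sort that ignores leading hyphens. It is not the program's actual code: the function name `sort_headwords` is made up for this example, and the comparison is plain Unicode code point order rather than true Thai dictionary order.

```python
def sort_headwords(headwords):
    """Sort headwords while ignoring any leading hyphens (hypothetical helper)."""
    # Key: the word without leading hyphens, then a flag so that the plain
    # form ("ก") sorts just before its hyphenated form ("-ก").
    return sorted(headwords, key=lambda w: (w.lstrip("-"), w.startswith("-")))


print(sort_headwords(["-กก", "กก", "-ก", "ก"]))
```

With this key, "ก" is no longer pushed below every entry that starts with a hyphen.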

kriswillems · June 12, 2014

All your changes work fine for me.

Also the font issue seems to be solved.

When I type ก* and hit Enter, I get one (the first) entry.

It's supposed to be like that now, right?

The wildcard lookup looks very fast to me - like instantly.

Tested on Chrome and Firefox on Ubuntu.

I think it's a great and useful piece of work. Hope you'll keep this online forever.

kriswillems · June 12, 2014

Just one little thing. When I look up กรก I get "Not found". Same for กรกฎ. I think it has something to do with the "," after the entry in the RID.

ThaiNotes · June 13, 2014

Literally a couple of minutes after I put that version live I realised that there was an ambiguity: if you are using wildcards and press enter, do you want to run the wildcard query and retrieve all matches? Or do you want the currently selected suggestion (which defaults to the first suggestion)?

It's been a struggle, but I've changed things so now when you enter a wildcard expression, no matches are shown, just the total number of matches. Then when you hit Enter or click on the Enter button all the matches are retrieved.
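The two-step behaviour described above (show only a count while typing, retrieve the matches on Enter) could be sketched roughly as follows. This is only an illustration under assumptions: it treats "*" as the sole wildcard, the names `wildcard_to_regex`, `count_matches` and `fetch_matches` are invented here, and it does nothing about matching or ignoring tone marks.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern in which '*' means 'any run of characters'."""
    # Escape every literal piece so stray characters such as '[' are harmless.
    parts = (re.escape(piece) for piece in pattern.split("*"))
    return re.compile(".*".join(parts))


def count_matches(pattern, words):
    """Step run while typing: report only how many words match."""
    regex = wildcard_to_regex(pattern)
    return sum(1 for w in words if regex.fullmatch(w))


def fetch_matches(pattern, words):
    """Step run on Enter: return the matching words themselves."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]


WORDS = ["ก", "กก", "ก๊ก", "บ้าน", "กลับบ้าน"]  # tiny stand-in word list
print(count_matches("ก*", WORDS))
print(fetch_matches("*บ้าน", WORDS))
```

Both steps work from the same locally held word list, so showing the count first costs nothing extra on the server.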

There may be a bug in the code, or possibly duplicates in the database, so the number of matches is occasionally out. I'll look into this when I have time.

There's also (I think) a problem with the matching/non-matching of tone marks for wildcards. Again, to be looked into.

ThaiNotes · June 13, 2014

kriswillems wrote: Just one little thing. When I look up กรก I get "Not found". Same for กรกฎ. I think it has something to do with the "," after the entry in the RID.

Yup. Another problem. Though the RID itself has กรก, กรก- as the head word, I create separate index entries for กรก and กรก-. Both should appear in the suggestions (only the first does), and both should result in the กรก, กรก- definition being retrieved (they don't).

I'll look into this tomorrow.
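The intended behaviour, where each form of a combined head word gets its own index entry pointing at the same definition, might look something like the sketch below. The function `build_index` and the data layout are assumptions made for illustration, not the application's real structures.

```python
def build_index(headwords):
    """Map each individual lookup key to the full RID head word it belongs to."""
    index = {}
    for headword in headwords:
        # A head word such as "กรก, กรก-" lists several forms separated by commas.
        for form in headword.split(","):
            key = form.strip()
            if key:
                # Keep the first head word seen for a key.
                index.setdefault(key, headword)
    return index


index = build_index(["กรก, กรก-", "ก"])
print(index["กรก"])
print(index["กรก-"])
```

Looking up either form then leads back to the single กรก, กรก- entry.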

kriswillems · June 14, 2014

Actually, I like the instant wildcard matching - pretty amazing it worked so fast.

My remark was just a question :)

The last problem has been fixed.

I think everything looks fine now.

You might want to find more testers ...

kriswillems · June 14, 2014

OK, I was able to find some irregularities:

Try: *บ้าน*

Some of the matches don't make sense.

Also when I forget to change my keyboard layout after * I get a strange error message.

Try: *[hko*

ThaiNotes · June 14, 2014

The problem with *บ้าน* is the RI data. What they usually do is put the head word in the first column of a table, then put the definitions in the second column. The junk that's coming back is because for those words they've put both the head word and the definitions all in the first column. The nonsensical matches do contain "บ้าน", but in the definitions. I can possibly fix this with even more rigorous parsing of the stuff coming back from the RI website.

I'm not surprised *[hko* is a problem. The presence of * and h makes the application think it's a regular expression. However, the inclusion of [ (which can also be part of a valid regular expression) messes things up since it's not terminated by ]. I probably should validate to make sure that the only non-Thai characters entered are those I recognise as part of my pseudo-regular expressions.
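A validation step of the kind mentioned above could be as simple as the following sketch. The set of accepted non-Thai characters ("*" and "-") is an assumption for the example; the real pseudo-regular expressions may allow others, and `is_valid_query` is a hypothetical name.

```python
import re

# Assumption: '*' (wildcard) and '-' (used in head words such as "กรก-")
# are the only non-Thai characters the application accepts.
ALLOWED_EXTRA = "*-"

# Thai Unicode block (U+0E00 to U+0E7F) plus the extra characters above.
VALID_QUERY = re.compile(r"[\u0E00-\u0E7F" + re.escape(ALLOWED_EXTRA) + r"]+")


def is_valid_query(query):
    """Return True if the query contains only Thai or recognised characters."""
    return VALID_QUERY.fullmatch(query) is not None


print(is_valid_query("*บ้าน*"))
print(is_valid_query("*[hko*"))
```

Rejecting the query up front means a mistyped keyboard layout produces a clear message instead of a regular expression error.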

For an initial release I think I'm pretty much functionality-complete. I now need to make sure I can handle all the RI's random data formats and can handle 100% of the dictionary entries. Then I'll move on to the next stage which is trying to entice more people to try the application and give feedback.

kriswillems · June 15, 2014

Maybe you could parse the text in the RID based on the color of the words? I see all words are in blue, while the explanations are in black.

DavidHouston · June 19, 2014

If I might make a small suggestion, I would like to recommend the inclusion of another feature of the Royal Institute website beyond the dictionary which might be incorporated into your "front-end".

This is the listing of ลักษณะนาม found at http://www.royin.go.th/th/profile/index.php?SystemModuleKey=265&SystemMenuID=1&SystemMenuIDS= and subsequent pages. This listing, pages 1 - 22, can be accessed by clicking on the page number. This listing is in alphabetical order in two columns, the first column is the noun and the second its classifier. This list is not referenced in the dictionary itself. I wonder if it would be possible for your front end to have a link to this listing so that whenever a particular noun is chosen by the user, its classifier would show up as well. I apologize for not using correct IT language in this note, but perhaps you can understand what I mean.

The current array is difficult to use because the site contains no information regarding which range of words by alphabet are included in each page. For each word you wish to look up you need to guess the appropriate page. This process requires a bit of trial and error. This listing is ripe for technical improvement.

Thank you for your consideration.

ThaiNotes · June 19, 2014

Perhaps you're not familiar with http://thai-notes.com/tools/classifiers.shtml ?

It uses the data you refer to and allows you to search by word (to find classifier) and by classifier (to find words which use that classifier).

The classifier data are limited to 3900 entries. That seems rather small compared with the 40,000+ entries in the RID. Of course, only nouns have classifiers, but the large discrepancy suggests the classifier data are incomplete.

My longer term goal is to allow integrated querying of all three dictionaries (LEXiTRON, RID and classifiers). The user interface will be similar to what I have now for RID & LEXiTRON. (Can't remember why I used a slightly different interface for the classifiers.) The user will be able to select which dictionaries to retrieve matches from, then display the results - though I can't picture yet how the results will be displayed on the right.

DavidHouston · June 19, 2014

Thank you for all that effort. My mistake; I saw only the dictionary, not the classifier list. Excellent job!

ThaiNotes · July 11, 2014

It's taken a lot longer (and been much tougher) than I had anticipated, but the latest update to the program is now live.

There's not a lot to see on the surface; there's no new functionality. However, I now believe all the RID entries are being displayed correctly (with the exception of two words).

There is one small bug that I know of: when using wildcards and more than 500 entries match, there's no longer a warning that only the first 500 are displayed. I'll fix this when I get a chance.

I'm now fairly certain that it's usable. Next step, add new functionality.

kriswillems · July 11, 2014

The new version doesn't work at all on my system. The only thing I get is a blank webpage. (Tested on Firefox and Chromium on Ubuntu.)

ThaiNotes · July 11, 2014

It looks like there's a temporary problem with Google Web Fonts.

If you wait about 10-15 seconds the request for the fonts will time out and the dictionary should then display OK. (At least, that's what's happening for me at the moment.) This should fix itself in time when Google sorts out the problem at their end.

If there's still a problem I'll have another look on Monday to see if I can sort something out.

DavidHouston · July 11, 2014

The revised site works very well for me. Thank you for this new presentation.

kriswillems · July 11, 2014

Thanks. If I wait long enough the site appears, as you said, and it works fine.

Maybe just one small remark: if a wildcard entry doesn't exist, for instance กกก*, I get no message telling me this (or any other feedback).

PS. I don't know if you know this online dictionary: http://dict.longdo.com/

It is similar to what you made, but the layout is not so nice and it lacks wildcard functionality.

It also doesn't remember words previously looked up.

It also does a multi dictionary lookup (including the RID, and it supports French and German).

desi · July 13, 2014

I'd also like to find some way later of letting Thai people know the dictionary is there, since I think it would be useful to them, as well as to non-native speakers.

The Farang Can Learn Thai FB group has over 14,000 members, and a great many of them are Thai. If you post there, Thais will make sure it's known.

ThaiNotes · July 14, 2014

kriswillems wrote: Maybe just one small remark: if a wildcard entry doesn't exist, for instance กกก*, I get no message telling me this (or any other feedback).

That was a mistake on my part. The previous message saying how many matches there were got accidentally dropped when I did some major refactoring and I didn't notice. It's back there now.

I'll work on the slow loading next.

ThaiNotes · July 14, 2014

The slow loading has been fixed.

Stop reading here if you're not interested in technical details.

I use MySQL to store the list of words and to cache individual entries. My hosting provider migrated me from 5.5 to 5.6. This didn't cause problems until I needed to reload the database. Now if you want to use Unicode in 5.6 the collation has changed. Before I was using utf8_unicode_ci collation throughout. Now one is forced to use utf8mb4_unicode_ci collation for tables just to be able to reload them. It didn't occur to me at the time that I also needed to change the collation for the indices. When I did change it, I'd hoped that would fix the problem. It didn't. The query

SELECT COUNT(DISTINCT headword) FROM entries

still ran incredibly slowly - around 25-30 seconds. What is more, it returned the wrong answer - which wasn't particularly a surprise since MySQL's handling of Unicode is atrocious. I tried a couple of things such as forcing a binary query, which gave the correct answer. However, the query which downloads the wordlist, namely

SELECT DISTINCT headword FROM entries

completely ignored all tone marks, so rather than returning ก, ก็, กก, ก๊ก... it just returned ก, กก... In other words, Unicode in MySQL 5.6 is now even more broken than it was before.

The only solution I could think of was to include all duplicates, which slows the download (which isn't a big deal, since it should be a one off event), but also uses more browser local storage, which is finite.
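To make the effect concrete, here is a small sketch that imitates a comparison which ignores certain Thai marks, next to an exact de-duplication that an application could apply to a list downloaded with duplicates. The exact set of ignored marks is an assumption, the function names are invented, and this is not a description of how MySQL implements its collations.

```python
# Assumption: the marks treated as ignorable are U+0E47 to U+0E4C
# (mai taikhu, the four tone marks, and thanthakhat).
IGNORED_MARKS = {chr(cp) for cp in range(0x0E47, 0x0E4D)}


def collation_key(word):
    """Imitate a comparison that ignores the marks above."""
    return "".join(ch for ch in word if ch not in IGNORED_MARKS)


def distinct_ignoring_marks(words):
    """Roughly what the DISTINCT query returned: one word per key."""
    seen = {}
    for w in words:
        seen.setdefault(collation_key(w), w)
    return list(seen.values())


def distinct_exact(words):
    """Exact de-duplication, comparing code point by code point."""
    return list(dict.fromkeys(words))


# ก, ก็, กก, ก๊ก, กก (marks written as escapes to make them visible)
rows = ["ก", "ก\u0e47", "กก", "ก\u0e4aก", "กก"]
print(distinct_ignoring_marks(rows))
print(distinct_exact(rows))
```

The first call collapses the list to ก and กก, as in the post; the second keeps all four distinct words and drops only the true duplicate.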

Anyway, everything is back to normal and running fast again. Curse you, MySQL.

I'm not sure whether there are going to be any other knock-on problems after the database "upgrade". If you spot anything odd, please do let me know.

ThaiNotes · July 14, 2014

desi wrote: The Farang Can Learn Thai FB group has over 14,000 members, and a great many of them are Thai. If you post there, Thais will make sure it's known.

Thanks for the suggestion. Once the program is a bit more stable (and I've learned how to use Facebook) I'll give that a go.

ThaiNotes · July 14, 2014

kriswillems wrote: PS. I don't know if you know this online dictionary: http://dict.longdo.com/
It is similar to what you made, but the layout is not so nice and it lacks wildcard functionality.
It also doesn't remember words previously looked up.
It also does a multi-dictionary lookup (including the RID, and it supports French and German).

I'd been vaguely aware of it, but never looked at it seriously.

It uses a rather different approach from mine in that I store the word list locally; Longdo doesn't. That means I can provide suggestions much faster and can also reasonably support wildcard searches.
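As a rough illustration of why a locally held word list makes suggestions cheap, the sketch below finds words starting with the typed text by binary search over a sorted list. It assumes the list is sorted in plain code point order; the function `suggest` and the sample words are invented for the example.

```python
from bisect import bisect_left


def suggest(sorted_words, prefix, limit=10):
    """Return up to `limit` words from a sorted list that start with `prefix`."""
    # All words sharing the prefix sit together, starting at this position.
    i = bisect_left(sorted_words, prefix)
    suggestions = []
    while (i < len(sorted_words)
           and len(suggestions) < limit
           and sorted_words[i].startswith(prefix)):
        suggestions.append(sorted_words[i])
        i += 1
    return suggestions


WORDS = sorted(["กก", "ก", "กรก", "กะวะ", "บ้าน"])
print(suggest(WORDS, "กร"))
print(suggest(WORDS, "ก", limit=2))
```

No network round trip is involved, so the suggestions can be refreshed on every keystroke.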

Rather oddly, I think, when searching for a term, Longdo's suggestions are words including the letters typed, not words starting with what's typed. Not sure why they decided to do this.

The RID lookup appears to be from the previous version of the RID (though there is a link to look up a word with the latest version at the RID website).

And one other big difference: my site has no advertising and I'm not trying to sell anyone anything.

Anyway, interesting to have a look at it.

As for linking to looking up words in other dictionaries, that's something I've thought about. In particular, linking to thai-language.com would be useful. There used to be a way to do that, but I don't think it's available any more. Something to look at in the future.

ThaiNotes · August 6, 2014

I've added a number of new features to the dictionary.

- Abbreviations are coloured. Hover over them for a popup showing the expansion and an English explanation. (No English where the reference is to a Thai language book.)

- Where there is a reference to another word in the dictionary (for example, กก ๗ - ดู กะวะ) you can click on the word (in this case กะวะ) to see that word's definition.

- After you search for a word, you can search for that word on four other dictionaries directly: thai-language.com, thai2english.com, LEXiTRON and Thai language Wikipedia. To do this, position the mouse over one of the terms and a popup will appear. Select the appropriate dictionary and the search result will appear in a new tab.
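The cross-reference feature in the list above (clicking กะวะ in "กก ๗ - ดู กะวะ") needs some way of spotting references in the definition text. One possible approach, shown purely as a sketch, is to look for "ดู" followed by a Thai word and keep only targets that exist in the word list. The pattern and the name `find_cross_references` are assumptions, and a pattern this simple will sometimes produce spurious matches.

```python
import re

# "ดู" ("see") followed by whitespace and a run of Thai characters.
CROSS_REF = re.compile(r"ดู\s+([\u0E00-\u0E7F]+)")


def find_cross_references(definition, headwords):
    """Return referenced words that are actually present in the dictionary."""
    return [t for t in CROSS_REF.findall(definition) if t in headwords]


HEADWORDS = {"กก", "กะวะ"}  # tiny stand-in for the real word list
print(find_cross_references("กก ๗ - ดู กะวะ", HEADWORDS))
```

Checking candidates against the word list removes some false links, though not those where the matched text happens to be a real head word.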

There are a number of known issues:

- Performance for some queries can be poor, particularly browsing by letter and reverse lookup for short, common words. This I think I can improve with further work, but it's not a straightforward fix.

- Some of the links added are spurious. That's more of an irritation than a problem. I don't expect to be able to fix this. (The problem is inconsistent formatting of the RID's entries.)

- Previously the application downloaded the dictionary word list once, and then used a locally stored copy. For some reason it's now downloading the word list every time you start it. This should be easy to fix, if I can track down the cause.

- Font sizes are inconsistent in the results displayed.

- The popups sometimes hide themselves when they shouldn't. The workaround for this is to move the mouse cursor over the popup itself. Then all will be fine.

- For the popups that link to third party sites, there's no guarantee that the word will be available at that site. (If I can get lists of all words at the third party sites I can fix this.)

- LEXiTRON searches are done via my webserver, so can be rather slow. (All other external searches are done directly.)

As ever, all and any feedback much appreciated.

The dictionary remains at http://thai-notes.com/dictionaries/RIDictionary.html

DavidHouston · August 7, 2014

Thai-notes,

What about common words like "ว่า", whose most common definition is "that", which are not in the Lexitron dictionary? (Lexitron shows the much less common "quarrel".) Will you be making manual additions to add these?

Also, I am having difficulty getting the box which shows third party reference dictionaries. Must my browser turn off "pop-ups"? Where is the pop-up turn off switch in Chrome?

Thanks for your assistance and all the excellent work you are doing on your site.

ThaiNotes · August 7, 2014

DavidHouston wrote: What about common words like "ว่า", whose most common definition is "that", which are not in the Lexitron dictionary? (Lexitron shows the much less common "quarrel".) Will you be making manual additions to add these?

I'm not sure if you're referring to the RID front end, or to the Thai-English dictionary on the same site which is based upon the LEXiTRON data.

If it's the former, (i.e. using the RID front end and selecting LEXiTRON from the popup), then all I do is direct you to LEXiTRON itself, so I have no control over what LEXiTRON shows. In practice a lot of the links will return "not found" - particularly for Wikipedia.

And if the latter, I probably won't be adding in extra words. It simply would be too much effort for me to maintain any dictionary. However, the longer term plan is to try to produce an integrated dictionary which will pull back results from a number of dictionaries, rendering both of the current dictionaries obsolete. That, however, is going to be quite a challenge, so will be well into the future.

DavidHouston wrote: Also, I am having difficulty getting the box which shows third party reference dictionaries. Must my browser turn off "pop-ups"? Where is the pop-up turn off switch in Chrome?

The popups aren't popups as such, so should be unaffected by any browser setting.

Are you seeing any highlighting in the results you are getting? There should be lots of light blue links (not to be confused with the darker blue links in "reverse lookup"). If not, you probably need to reload the web page and try again. Otherwise, I'm stumped as to what the problem might be. Let me know if you still have a problem after you've tried reloading.

ThaiNotes · August 7, 2014

- Previously the application downloaded the dictionary word list once, and then used a locally stored copy. For some reason it's now downloading the word list every time you start it. This should be easy to fix, if I can track down the cause.

This is now sort of fixed. With any luck the dictionary word list should be downloaded one final time, then not again until it changes. However, the database is now ignoring tone marks (again). E.g. searching for ว่า now brings back วา and ว้า as well.

For the technically minded, this is another problem with MySQL's hopelessly broken Unicode handling. The behaviour has changed between version 5.6.12 (which I test on) and 5.6.17 (which the live database has just been upgraded to by my hosting provider). Nothing in the MySQL release notes to explain the change in behaviour, so I'm not sure if this is a new "feature" that's been introduced and will be fixed, or whether this now is the intended behaviour.

Now all I need to decide is whether to rewrite all my database queries to work around the new behaviour, or update my application to filter out the spurious responses.
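The second option, filtering in the application, could be as small as the sketch below: keep only rows whose head word is exactly the string that was searched for. The row layout (head word, entry id) and the name `exact_matches` are hypothetical.

```python
def exact_matches(query, rows):
    """Keep only rows whose head word equals the query code point for code point."""
    return [row for row in rows if row[0] == query]


# Rows as (head word, entry id); the database returned วา and ว้า as well as ว่า.
rows = [("วา", 1), ("ว\u0e48า", 2), ("ว\u0e49า", 3)]
print(exact_matches("ว\u0e48า", rows))
```

This leaves the queries untouched at the cost of transferring a few unwanted rows per lookup.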
