
ThaiNotes

Member
  • Posts

    123
  • Joined

  • Last visited

Posts posted by ThaiNotes

  1. I wasn't sure whether to post this under Thai language or Internet, but I'm hoping I'll get a better response here.

    The Thai language websites I visit are all (without exception) poorly designed, and often broken. For example:

    • truevisions.com - the TV guide didn't work for years, and now, though it works, it only covers terrestrial channels
    • pantip.com - purple text on a purple background
    • bloggang.com - ugly blogs customised by using raw HTML/CSS
    • immigration.go.th - very amateurish design (particularly the animated graphics/links on the left), and parts of it (the appointment booking system) only work in Internet Explorer
    • rirs3.royin.go.th/dictionary.asp (the Royal Institute Dictionary) - a one line change would make this site work 1000% better (by including the page encoding in the HTML) - now when many people visit it they are presented with gibberish and manually have to change the browser's page encoding.
    Generally speaking, there's an over-reliance on Flash (meaning that content isn't being indexed by the search engines), far too much in the way of movement (animated .gifs, rotating banners), uninspiring fonts, and overlong pages requiring one to page down again, and again, and again.

    There are a lot of gateway pages using large Flash images with no HTML link to the content, meaning that the entire site is potentially not indexed. A particularly horrible gateway page at the moment is the one at tescolotus.com, which includes loud football chanting. In the Occident, website designers dropped gateway pages as a bad idea years ago.

    Where an English language version is "provided", often a lot of the content is provided in the form of graphics with Thai text and no translated graphic (e.g. tescolotus.com, bigc.co.th), and sometimes it simply isn't there at all.

    Interestingly, a number of very popular sites have adopted a similar style: very long pages full of tiled images with subtitles. For example, sanook.com, teenee.com, truelife.com, siamsport.com, kapook.com.

    I find the situation quite difficult to understand. Are Thai people not visually literate (which is hard for me to believe), or do the designs actually appeal to them? Or is it that design is irrelevant, and they only care about content? Do the owners not care that the sites don't work properly, or is it that their IT experts lack the skills to produce properly working sites?

    The problems are so pervasive that I'm led to ask: is there even a single, well-designed, Thai-produced, Thai language website out there? If so, I'd love to see it. Please post a link.

  2. ok, I was able to find some irregularities:
     
    Try : *บ้าน*
     
    Some of the matches don't make sense.
     
    Also when I forget to change my keyboard layout after * I get a strange error message.
    Try: *[hko*

     
    The problem with *บ้าน* is the RI data.  What they usually do is put the head word in the first column of a table, then put the definitions in the second column.  The junk that's coming back is because for those words they've put both the head word and the definitions all in the first column.  The nonsensical matches do contain "บ้าน", but in the definitions.  I can possibly fix this with even more rigorous parsing of the stuff coming back from the RI website.
     
    I'm not surprised *[hko* is a problem.  The presence of * and h makes the application think it's a regular expression.  However, the inclusion of [ (which can also be part of a valid regular expression) messes things up since it's not terminated by ].  I probably should validate to make sure that the only non-Thai characters entered are those I recognise as part of my pseudo-regular expressions.
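
    A minimal sketch of that kind of validation, not the application's actual code. It assumes that * is the only wildcard symbol (the real pseudo-regular expressions may allow more), and the class and method names are made up for illustration. Every other character must come from the Unicode Thai block, so a query typed with the wrong keyboard layout is rejected before it reaches the matcher.

```java
public class QueryValidator {
    // Assumption: '*' is the only wildcard symbol the application recognises.
    private static final String WILDCARDS = "*";

    // True if the code point belongs to the Unicode Thai block (U+0E00..U+0E7F).
    static boolean isThai(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.THAI;
    }

    // A query is accepted only if every character is Thai or a known wildcard.
    static boolean isValidQuery(String query) {
        if (query == null || query.isEmpty()) {
            return false;
        }
        return query.codePoints()
                    .allMatch(cp -> isThai(cp) || WILDCARDS.indexOf(cp) >= 0);
    }

    public static void main(String[] args) {
        System.out.println(isValidQuery("*บ้าน*"));  // Thai text plus wildcards
        System.out.println(isValidQuery("*[hko*"));  // wrong keyboard layout
    }
}
```

    Rejecting the input up front avoids having to guess whether a stray [ was meant as part of an expression.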
     
    For an initial release I think I'm pretty much functionality-complete.  I now need to make sure I can handle all the RI's random data formats and can handle 100% of the dictionary entries.  Then I'll move on to the next stage which is trying to entice more people to try the application and give feedback.
  3. Just one little thing. When I lookup กรก I get "Not found". Same for กรกฎ. I think it has something to do with the ,  after the entry in the RID.

     

    Yup.  Another problem.  Though the RID itself has กรก, กรก- as the head word, I create separate index entries for กรก and กรก-.  Both should appear in the suggestions (only the first does), and both should result in the กรก, กรก- definition being retrieved (they don't).
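
    The intended behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the application's code: it supposes that variants within a head word are separated by commas, and the class and method names are hypothetical. Each variant becomes its own index key pointing back at the full head word, so either spelling retrieves the same definition.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class HeadWordIndex {
    // Maps every comma-separated variant to the complete head word it came from.
    static Map<String, String> indexVariants(Iterable<String> headWords) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String headWord : headWords) {
            for (String variant : headWord.split(",")) {
                String key = variant.trim();
                if (!key.isEmpty()) {
                    index.put(key, headWord);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> index = indexVariants(Arrays.asList("กรก, กรก-"));
        // Both keys should lead to the same head word.
        System.out.println(index.get("กรก"));
        System.out.println(index.get("กรก-"));
    }
}
```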
     

    I'll look into this tomorrow.

  4. All your changes work fine for me.

    Also the font issue seems to be solved.

    When I type ก* and hit Enter, I get one (the first) entry.

    It's supposed to be like that now, right?

     

    The wildcard lookup looks very fast to me - like instantly.

    Tested on Chrome and firefox on ubuntu.

     

    I think it's a great and useful piece of work. Hope you'll keep this online forever.

     

     

    Literally a couple of minutes after I put that version live I realised that there was an ambiguity:  if you are using wildcards and press enter, do you want to run the wildcard query and retrieve all matches? Or do you want the currently selected suggestion (which defaults to the first suggestion)?

     

    It's been a struggle, but I've changed things so now when you enter a wildcard expression, no matches are shown, just the total number of matches.  Then when you hit Enter or click on the Enter button all the matches are retrieved.

     

    There may be a bug in the code, or possibly duplicates in the database, so the number of matches is occasionally out.  I'll look into this when I have time.

     

    There's also (I think) a problem with the matching/non-matching of tone marks for wildcards.  Again, to be looked into.

  5. Just released an updated version of the program.  Changes include:

     

    (1) Locally storing the word list to save time on subsequent startups.  (The local copy of the list will be updated if the master list changes.)

     

    (2) Added ability to close dictionary entries by clicking on the x in the top right corner of each entry.

     

    (3) Added suggested words for wildcard lookups.  (I had worried this might slow things down, but it seems OK to me.)

     

    (4) Fixed a problem with displaying entries with multiple subdefinitions.

     

    (5) Changed the alphabetic sort order - leading hyphens are now ignored.  (Previously it wasn't possible to select the entry for "ก" because the entries beginning "-ก" sorted before it.)
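
    One way to express the hyphen rule in change (5) is a comparator that strips leading hyphens before comparing. This is only a sketch: the names are hypothetical, the tie-break (unhyphenated entry first) is my assumption, and it uses plain String comparison rather than proper Thai collation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class HyphenInsensitiveSort {
    // Returns the entry with any leading hyphens removed.
    static String sortKey(String entry) {
        int start = 0;
        while (start < entry.length() && entry.charAt(start) == '-') {
            start++;
        }
        return entry.substring(start);
    }

    // Compares by hyphen-free key; on a tie, the shorter entry
    // (i.e. the one with fewer leading hyphens) goes first.
    static final Comparator<String> IGNORE_LEADING_HYPHENS = (a, b) -> {
        int byKey = sortKey(a).compareTo(sortKey(b));
        return byKey != 0 ? byKey : Integer.compare(a.length(), b.length());
    };

    static List<String> sorted(List<String> entries) {
        List<String> copy = new ArrayList<>(entries);
        copy.sort(IGNORE_LEADING_HYPHENS);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList("-กะ", "กก", "-ก", "ก")));
    }
}
```

    With a plain sort, every entry starting with a hyphen would come before "ก", which is the problem described in (5).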

     

  6. The Droid Sans version available as a web font is a subset; it doesn't include the Thai character set.  (This is to keep the download size small.)

     

    For the first one, I think if you right click on the text, then select "inspect element", then click on the "computed" tab and scroll to the bottom you'll see what font is actually being displayed.  For me it's "OTS derived font" - not Droid Sans Thai as it should be.

     

    Unfortunately, Firefox and Opera don't have a comparable feature (at least that I can find), so what fonts they're actually displaying is a bit of a mystery.

     

    Anyway, unless I get reports of problems I think I'll leave the fonts as they are for the moment.

  7. The database error is because I wasn't checking whether there were zero matches.  My bad.  Will fix.

     

    The font issue is complex.  The page is supposed to be using Google Web Fonts' Droid Sans Thai, but it doesn't.  Not sure whether this is a known bug in Chrome 33, or a problem with the font itself.  (A similar issue has been reported for a Hebrew web font from Google.)  This resulted in the browser using its default font, which in some browsers (Opera 12, that's you) looks dreadful.  Now I encourage the browser to use one of a list of fonts of my choosing.  If you're a Linux user, the chances are that your default font was Waree, and that's the font I'm specifying for such users.

     

    It's possible to store the word list locally.  I already do this for the LEXiTRON-based dictionary.  There are a number of ways to do this and I'll look into it.  Not 100% certain it will be faster to load from disk, though, since the data from the Internet is compressed for faster transfer.

  8. I've been getting this error after trying to submit a post. It's been reported a few times before, and there hasn't been any answer. What's going on?

    I'm guessing it's a problem with two people sharing a PC, both with accounts here.

    There's a possibly related issue that even after explicitly signing out, I can come back and find myself still logged on.

    Any explanation? Solution?

  9. Just a very minor update:

    (1) I've added the code to handle when the user resizes the browser window. If you go to a very small window it looks a mess, but that can't really be avoided.

    (2) I've added links to the reference pages from the RID covering things such as alphabetic order and etymology. However, the results don't display correctly yet. (Only discovered that after installing the new software version.) It's a pain having to deal with all the idiosyncrasies of the RID's HTML.

    (3) Thai font handling should be better now - though it's not as consistent as I'd like.
  10. May I give one remark? I don't know if it's technically possible to solve it ....

    There are many newlines in the explanations. If I narrow down the width of my browser window and I go to the dictionary I get something like this:

    น. ชื่อไม้ล้มลุกชนิด Typha angustifolia L. ในวงศ์ Typhaceae ขึ้น
    ใน
    น้ำ ช่อดอกคล้ายธูปขนาดใหญ่, กกธูป ธูปฤๅษี ปรือ หรือ เฟื้อ ก็เรียก.

    Do you notice the newline after ใน?
    It kinda messes up the layout.

    Enlarging the browser window does not redraw the content correctly. I have to close the browser first, then open a new larger browser window and open the dictionary again.

    I think it might be technically hard to filter out the newlines? So you might consider changing it to a fixed width webpage in your css file?

    Also, can you change the height of the wildcard reference window?


    With the new lines, what you're seeing is the hard coded new lines (after ใน). The break after ขึ้น is being added by your browser because it can't fit all the text onto a single line.

    You don't need to close your browser to get the text to display properly in a larger browser window. Just reload the page (Ctrl-r) after you resize. Getting the application to redraw the window on resize is already on the "to do" list. (The quickest fix would simply be automatically to reload the page on resize, but that would lose any previous query data. Can't decide whether that would be a problem or not.)

    I'll fix the wildcard reference window height with the next release.
  11. If it would become popular, would your server be able to handle many requests?

    Are you planning to promote it?
    A simple message in the "Farang can learn Thai" Facebook group with 12000 members, would give you many users....
    And how about the copyright? Is there any?


    The current website hosting is a very cheap, shared server plan. Undoubtedly I'd hit problems if the volume of requests was high. If the problems were from the reverse lookups I'd probably disable that feature, or somehow limit it. (The queries are very database intensive.) If I needed to switch to a more expensive server plan I might add advertising to the site (though I'm not sure how much that would raise), or try to solicit donations to cover costs.

    I'm afraid I haven't learnt how to use Facebook. I'm not really sure what it does. I'll look into that later, once the dictionary is out of alpha. I don't want people visiting and finding something's broken and never returning to the site.

    I'd also like to find some way later of letting Thai people know the dictionary is there, since I think it would be useful to them, as well as to non-native speakers.

    The copyright issue is an interesting one. There is no copyright statement on the RI dictionary website itself that I can see (though there is one on the RI website front page). I'm hoping that by just spidering the RI dictionary website and caching the results I fall into the same category as a search engine such as Google. If the Royal Institute isn't happy with what I've done, then I'll have to take the site down. However, I'd hope they'd see it as something positive. And if they wanted the code, I'd be happy to give it to them to incorporate into their own website.
  12. Friends,
     
    I know this is an off-the-wall comment, but would it not be more useful  to have a full translation of the RID entries than merely tweaking the access mechanism? The RID has lots of useful definitional and explanatory content, as well as sample sentences and phrases. A translation to English of this content would be really beneficial to the foreign-learning community, don't you think?

     
    An interesting idea.  My doubts:
     
    (1) It's a lot of work (over 40,000 main entries).  Are there enough people willing to lend a hand in translating to get it done in a reasonable timescale?
     
    (2) What would be the advantage over a Thai-English dictionary such as Wong Wattanaphichet's dictionary (which I consider excellent)? Albeit, that's a dictionary written on sliced, dead trees, not on-line.
     
    (3) If someone needs a dictionary as comprehensive as the RID, do they really need translations?
     
    Thinking from a technical point of view, I'm reminded of http://www.thaisubtitle.com/ which is a community site for translating movie subtitles.  It sort of works.  A similar approach could be used for translating the RID.
     
    There are issues, particularly of vandalism.  Perhaps a wiki-based solution with security measures and the ability easily to revert to earlier versions of definitions could be used.

    As for my "merely tweaking the access mechanism", all I can say is hmmmph.

  13. Alpha #2 now available.

     

    (1) I think I'm parsing the RID data better now.  I suspect there's only one entry that's behaving badly, but need to do further checking to make sure I'm right on that.

     

    (2) I've added a "browse" function, allowing you to list all words starting with a given consonant.  At the moment this is done by database query.  If this proves to be too much of a performance hit I'll change it to static pages.

     

    (3) I've added a "reverse lookup" function.  In other words you can search for all definitions containing a given word.  It's very crude, just matching the characters entered, rather than matching exact words.  I had hoped to rely upon third party software to handle the parsing, but it's not up to the job.  I can improve the results, but this is low priority for me at the moment.

     

    (4) I've improved the formatting of the definition, with better spacing between entries.

     

    (5) A few other, minor bugs have been fixed and enhancements made.

     

    It's still at http://thai-notes.com/dictionaries/RIDictionary.html
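
    To show what "crude" means for the reverse lookup in (3), here is a rough sketch of plain character matching. It is not the site's code (which queries a database); the names are hypothetical and the sample definitions are invented placeholders, not RID text. Because Thai is written without spaces between words, a substring test also hits longer words that merely contain the query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReverseLookup {
    // Returns the keys of all entries whose definition contains the query text.
    static List<String> findHeadWords(Map<String, String> definitions, String query) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : definitions.entrySet()) {
            if (entry.getValue().contains(query)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    // Invented sample data, keyed by placeholder labels.
    static Map<String, String> sample() {
        Map<String, String> definitions = new LinkedHashMap<>();
        definitions.put("A", "ที่อยู่ เช่น บ้าน");   // contains the word itself
        definitions.put("B", "บ้านเมือง");          // contains it inside a longer word
        definitions.put("C", "น้ำ");                // no match
        return definitions;
    }

    public static void main(String[] args) {
        System.out.println(findHeadWords(sample(), "บ้าน"));
    }
}
```

    Matching exact words instead would need a Thai word segmenter, which is the parsing step the third party software was meant to handle.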

  14. This is great.
    Did you manage to parse the 0.5% remaining entries?
    If you give this a nice layout, and put the instructions on the first page, nobody will use the original RID website anymore.
    Maybe also leave a small (vertical) space between the (matching) entries


    Thanks.

    The remaining 0.5% of entries are a pain. To be honest, I'm putting it off. Programming is far more fun than trying to make sense of the horrible, hand-coded mess that is the RID website. (I find it quite shocking that the dictionary isn't maintained in a database with the webpages automatically generated. But perhaps I shouldn't be so shocked when I consider that the committee is a group of elderly experts, meeting occasionally, and working with a card filing system.)

    The layout at the moment isn't high priority. That's easy to tweak at the end. I agree about the vertical space between entries. That occurred to me this morning once I got the wildcards working properly and started returning more than a handful of entries on each query. I also really hate the font that's being used - not crisp. I'm struggling to work out why Droid Sans Thai isn't being used, which is what it should be.

    By "instructions" do you mean the wildcard information? My plan at the moment is to put a "Help" link just below the text input box which will bring up the wildcard information on the right - just like a dictionary entry. I rather suspect that the vast majority of users isn't interested in wildcards - for them the predictive input is enough. And beyond the wildcards I'm not sure that there's really any need for instructions on how to use the page, though I'm happy to concede I'm wrong.

  15. I suggest you keep your matching logic at the codepoint level - anything else is very much a deluxe feature.


    Could you please explain what you mean by "codepoint level"? I'm not a professional programmer - very much an amateur hobbyist - and am not familiar with all the technical terms.
  16. I miss the functionality the RID has, like using * and just browsing through all words starting with a certain consonant.


    Thanks for the feedback. Both those features are on the "to do" list.

    As for the use of *, a couple of questions:

    (1) If you search for ก* do you expect to return words such as เก่ง and ไก่?

    (2) If you search for ง* do you expect to return words such as หงุบ and หงำ?

    (3) Are wildcards in the middle of an expression actually useful (e.g. หญิง*การ)? What about multiple wildcards in an expression?
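
    Question (1) comes down to what the stored text actually starts with. The sketch below is an illustration only; the method names are hypothetical and nothing here is taken from the application. A leading vowel such as เ or ไ is stored before its consonant, so a plain prefix test for ก misses เก่ง, while a test that skips one leading vowel (U+0E40 to U+0E44) finds it.

```java
public class CodePointPrefix {
    // Leading vowels occupy U+0E40 (SARA E) to U+0E44 (SARA AI MAIMALAI).
    static boolean isLeadingVowel(char c) {
        return c >= 0x0E40 && c <= 0x0E44;
    }

    // Plain prefix test on the characters exactly as they are stored.
    static boolean startsWithStored(String word, String prefix) {
        return word.startsWith(prefix);
    }

    // Prefix test that ignores a single leading vowel at the start of the word.
    static boolean startsWithConsonant(String word, String prefix) {
        if (word.length() >= 2 && isLeadingVowel(word.charAt(0))) {
            word = word.substring(1);
        }
        return word.startsWith(prefix);
    }

    public static void main(String[] args) {
        // Show the stored order of the characters in เก่ง.
        "เก่ง".codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
        System.out.println(startsWithStored("เก่ง", "ก"));     // stored text begins with the vowel
        System.out.println(startsWithConsonant("เก่ง", "ก"));  // vowel skipped before matching
    }
}
```

    Which behaviour ก* should have is exactly the design question being asked; the code only makes the two options concrete.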

    Alphabetic browsing will come after I've managed to sort out all the data issues. No point in creating the (static) pages until I can parse all the RID entries correctly. (I'm now at a point where I have to track down the problems of the dodgy entries one by one. It's very time-consuming.)
  17. I get very frustrated using the Royal Institute's on-line dictionary. For a start, every time I use it I have manually to change the encoding. Plus the alphabetic listings of words are incomplete. Anyway, rather than just stay frustrated I decided to write a new front end to the data. The results of my efforts are at:

    http://thai-notes.com/dictionaries/RIDictionary.html

    This is still very much work in progress, but it's definitely usable. However, I was wondering whether anyone would like to suggest any specific features to add. Thoughts I have include:

    • Link terms in definitions to where they refer to
    • Hover-over pop-up box to explain any abbreviations
    • Incorporating the RI's Dictionary of New Words
    • Reverse lookup (i.e. find all definitions containing a given word)

    I have also wondered about providing an option to make it more friendly for native English speakers.  For example, translating all abbreviations and providing IPA pronunciation, rather than the Thai system.  Though perhaps someone advanced enough in their studies to use the RID doesn't need this sort of hand holding.  Thoughts?

     

    What else would be good?

     

    This is early, alpha software, so there are quite a few known issues, including:

    1. There are some errors in parsing the RI data resulting in a few (<0.5%) erroneous entries and truncated definitions.  A few lookups result in "not found".
    2. The sequence of the word suggestions is a little off.  (Entries for prefixes are all listed before the first non-prefix.)
    3. Fonts aren't being properly loaded from the Internet; local fonts are being used.
    4. No bold/italics in definitions.
    5. If you resize your browser window the application display doesn't change.
    6. When there's a long definition a scrollbar should appear.  It doesn't always do so, and may be less than usual width.
    7. No integration with the rest of the site for customisable font size and color theme.

    There may be some browser-related issues.  I don't use Microsoft Windows or MacOS so have to test Internet Explorer and Safari using emulation software.  If anyone is using these browsers I'd love to know if these browsers really do have the problems I experienced.

    • Safari, letters typed into the word input box don't display.
    • Internet Explorer (tested with version 8) loads the data and displays the keyboard, but every word lookup fails.

    Thank you for reading so far.

     

    As always, any and all feedback much appreciated.

  18. hi, i just had a go at the describing food flash cards and i ticked the type answer box at the top, and once i've flipped the card after typing it only gives me the option of pressing the red cross confirming i got it wrong, which i didn't, it did it on everything….

    Thanks for the feedback. Sorry for the problem. Unfortunately, I'm not sure why this isn't working for you.

    I presume that what you typed (displayed next to the red cross) is exactly matching the larger word displayed above to the right.

    All I can think of is that you might not be typing the above/below characters in the correct sequence. (It should be consonant, then above/below vowel, then tone mark.) It may look correct on the screen if you don't use this order, but technically it's wrong and it won't match.
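
    A small illustration of why the order matters, assuming the answer is checked with an ordinary character-by-character string comparison (I am using Java here purely for illustration; the class and method names are made up). The two strings below contain the same three characters, but with the vowel and the tone mark swapped, so they are not equal even if they render alike.

```java
import java.util.stream.Collectors;

public class TypingOrder {
    // Lists the code points of a string, e.g. "U+0E01 U+0E35 U+0E48".
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String expected = "\u0E01\u0E35\u0E48"; // consonant, above vowel, tone mark
        String swapped  = "\u0E01\u0E48\u0E35"; // consonant, tone mark, above vowel

        System.out.println(describe(expected));
        System.out.println(describe(swapped));
        // Same characters, different order: the comparison fails.
        System.out.println(expected.equals(swapped));
    }
}
```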

    Do you only get this problem with "describing food"? Or is it with all packs?

    If you change the language direction so you're typing English, do you still get the same problem?

    Which browser and version number are you using?

    If anybody else has the same problem, could they let me know so I can try and find some pattern with what's going wrong?

    Thanks

  19. I've completely rewritten the flashcard program on my website adding a number of new features, including the ability to type your answer (good for practising typing I thought), automatically flipping cards after a few seconds, and recycling of wrong answers. You can also create your own packs of cards. Before I fully release the program it would be great if a few people could check whether the beta version works for them. (I've tested Safari 5, IE 11, Firefox 28, Opera 12 and Chrome 33, but there are so many combinations of browser version and operating system that it's impossible to test them all.) The program is at:

    http://thai-notes.com/flashcards/flashcards_plus.shtml

    The Editor to create your own packs is at:

    http://thai-notes.com/flashcards/flashcard_editor.shtml

    Any and all feedback much appreciated, either here or via the Feedback page on the website.

    Known issues are:

    (1) Problems with scrollbars displaying correctly

    (2) Occasional ugly font handling - particularly with IE and Opera

    Thanks in anticipation.

  20. Nice! Thanks.

    What does it do with ๆ and ฯ ?

    It gets them wrong. It treats them as ordinary characters collating according to their Unicode value.

    For example,

    เบะปาก
    เบา
    เบาฯ
    เบาๆ
    เบ้า
    เบาความ
    

    (which I think is probably the correct alphabetic order. เบาฯ is a made-up word for testing purposes)

    sorts as

    เบะปาก
    เบา
    เบ้า
    เบาความ
    เบาฯ
    เบาๆ
    

    Fortunately, that's good enough for my web application.

    LibreOffice (which I presume uses the Linux sorting functionality) sorts as

    เบะปาก
    เบา
    เบ้า
    เบาฯ
    เบาๆ
    เบาความ
    

    which seems wrong to me.

  21. I don't think vowel and consonants are swapped by the algorithm, they are just seen as one "unit".

    Looking at the code, you're right that leading vowels and consonants are seen as one unit. However, they are actually swapped. Line 493 includes the comment

    % Thai consonants, with leading vowels rearrangement
    

    The following lines of code define the swapped vowel/consonants.

    BTW, following on from an earlier post, does anyone know how ๆ and ฯ should be sorted? All the references I've looked at ignore these characters.

  22. I appreciate this is of very limited interest, but I recently faced an interesting challenge: how to sort lists of Thai words alphabetically. For reasons that are unimportant here, I had to code this from scratch. The way I sort as a human is poorly suited to computers. I found an old algorithm (Londe & Warotamasikkhadit, 1969) and implemented it in Java. I give my code here (i) because I think the algorithm is both clever and interesting, and (ii) because in the future someone facing the same challenge might stumble across this page.

    First the static variables

        static final char SARA_E = 0x0E40;
        static final char SARA_AI_MAIMALAI = 0x0E44;
        static final char MAITAIKHU = 0x0E47;
        static final char THANTHAKHAT = 0x0E4C;  // a.k.a. "garan"		
    

    A couple of utility functions

        boolean isLeadingVowel(char c) {
            // Returns true if character is in the range from SARA E to SARA Ai MAIMALAI, 
            // i.e. if the character is a leading vowel
            
            return (c >= SARA_E && c <= SARA_AI_MAIMALAI);
        }
        boolean isToneMark(char c) {
            // Returns true if character is in the range from MAITAIKHU to THANTHAKHAT,
            // i.e. MAITAIKHU, the four tone marks, and THANTHAKHAT
            
            return (c >= MAITAIKHU && c <= THANTHAKHAT);
        }
    

    Finally the code to produce Strings that can be compared directly.

        String getThaiComparisonString(String s) {
        
            // Convert String to separate characters so we can manipulate them
            char[] chars = s.toCharArray();
                    
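            // Swap each leading vowel with the character that follows it.
            // The loop stops one short of the end so that a stray leading
            // vowel in last position cannot index past the array.
            for (int i = 0; i < chars.length - 1; i++) {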
                if (isLeadingVowel(chars[i])) {
                    char c = chars[i];
                    chars[i] = chars[i + 1];
                    chars[i + 1] = c;
                    i++;
                }
            }
            
            // The String for comparison is built in two parts, here referred to
            // as "head" and "tail".  "tail" always begins with "00".
            String tail = "00";
            
            // For each tone mark, Mai Tai Khoo, or Thantakhat found, remove it
            // from "head", add a 2 digit String to "tail" representing 
            // its original position from the END of the original String, 
            // then append the mark itself to "tail"
            String head = "";
            for (int i = 0; i < chars.length; i++) {
                if (isToneMark(chars[i])) {
                    int pos = chars.length - i;
                    if (pos >= 10)
                        tail += "" + pos;
                    else
                        tail+= "0" + pos;
                    tail += chars[i];                                
                }
                else {
                    head += chars[i];
                }
            }
    
            // Return the String for comparison
            return head + tail;
        }
    

    Finally, to sort, all you need is:

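    // "words" is a List<String>; for a list of other objects, compare
    // the Thai text extracted from each object instead
    Collections.sort(words, new Comparator<String>() {
                public int compare(String s1, String s2) {
                        return getThaiComparisonString(s1).compareTo(getThaiComparisonString(s2));
                }});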
    

    So, for example, in the process เมษายน becomes มเษายน00 for comparison purposes, and กุมภาพันธ์ becomes กุมภาพันธ0001์.
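
    For anyone who wants to try it, here is a compact, self-contained restatement of the steps above as one runnable class. It is a sketch rather than a drop-in replacement: the class and method names are my own, it uses StringBuilder instead of string concatenation, and it guards against a leading vowel in last position. It sorts the test list from the ๆ and ฯ post above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThaiSortDemo {
    static boolean isLeadingVowel(char c) { return c >= 0x0E40 && c <= 0x0E44; }
    static boolean isToneMark(char c)     { return c >= 0x0E47 && c <= 0x0E4C; }

    // Builds the comparison string: leading vowels swapped with the next
    // character, marks moved to a tail that records their position from the end.
    static String key(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length - 1; i++) {
            if (isLeadingVowel(chars[i])) {
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++; // skip the vowel we just moved
            }
        }
        StringBuilder head = new StringBuilder();
        StringBuilder tail = new StringBuilder("00");
        for (int i = 0; i < chars.length; i++) {
            if (isToneMark(chars[i])) {
                int pos = chars.length - i;   // position counted from the end
                if (pos < 10) {
                    tail.append('0');         // pad to two digits
                }
                tail.append(pos).append(chars[i]);
            } else {
                head.append(chars[i]);
            }
        }
        return head.append(tail).toString();
    }

    // Returns a sorted copy, ordering words by their comparison strings.
    static List<String> sortThai(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> key(a).compareTo(key(b)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "เบาๆ", "เบาความ", "เบ้า", "เบาฯ", "เบา", "เบะปาก");
        for (String w : sortThai(words)) {
            System.out.println(w + "  ->  " + key(w));
        }
    }
}
```

    Printing the key next to each word makes it easy to see why a word with a tone mark sorts directly after the same word without one.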

    Pretty neat, huh?
