ThaiNotes Posted March 13, 2014

I appreciate this is of very limited interest, but I recently faced an interesting challenge: how to sort lists of Thai words alphabetically. For reasons that are unimportant here, I had to code this from scratch. The way I do it as a human is poorly suited to computers. I found an old algorithm (Londe & Warotamasikkhadit, 1969) and implemented it in Java. I give my code here (i) because I think the algorithm is both clever and interesting, and (ii) because in the future someone facing the same challenge might stumble across this page.

First, the static variables:

    static final char SARA_E = 0x0E40;
    static final char SARA_AI_MAIMALAI = 0x0E44;
    static final char MAITAIKHU = 0x0E47;
    static final char THANTHAKHAT = 0x0E4C; // a.k.a. "garan"

A couple of utility functions:

    boolean isLeadingVowel(char c) {
        // Returns true if character is in the range from SARA E to SARA AI MAIMALAI,
        // i.e. if the character is a leading vowel
        return (c >= SARA_E && c <= SARA_AI_MAIMALAI);
    }

    boolean isToneMark(char c) {
        // Returns true if character is in the range from MAITAIKHU to THANTHAKHAT,
        // which includes the four tone marks, i.e. all "above" symbols
        return (c >= MAITAIKHU && c <= THANTHAKHAT);
    }

Finally, the code to produce Strings that can be compared directly:

    String getThaiComparisonString(String s) {
        // Convert String to separate characters so we can manipulate them
        char[] chars = s.toCharArray();

        // Swap all leading vowels with the next character.
        // The bound i + 1 < chars.length avoids running off the end
        // when a leading vowel is the last character.
        for (int i = 0; i + 1 < chars.length; i++) {
            if (isLeadingVowel(chars[i])) {
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++;
            }
        }

        // The String for comparison is built in two parts, here referred to
        // as "head" and "tail". "tail" always begins with "00".
        String tail = "00";

        // For each tone mark, MAITAIKHU, or THANTHAKHAT found, remove it
        // from "head", add a 2 digit String to "tail" representing
        // its original position from the END of the original String,
        // then append the mark itself to "tail"
        String head = "";
        for (int i = 0; i < chars.length; i++) {
            if (isToneMark(chars[i])) {
                int pos = chars.length - i;
                if (pos >= 10) tail += "" + pos;
                else tail += "0" + pos;
                tail += chars[i];
            } else {
                head += chars[i];
            }
        }

        // Return the String for comparison
        return head + tail;
    }

Finally, to sort, all you need is:

    // "words" stands for whatever List<String> is being sorted;
    // adapt the element type if your list holds something other than String
    Collections.sort(words, new Comparator<String>() {
        public int compare(String s1, String s2) {
            return getThaiComparisonString(s1).compareTo(getThaiComparisonString(s2));
        }
    });

So, for example, in the process เมษายน becomes มเษายน00 for comparison purposes, and กุมภาพันธ์ becomes กุมภาพันธ0001์.

Pretty neat, huh?
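For anyone who wants to try the approach from this post in isolation, here is a minimal self-contained sketch. The class name ThaiSortKey and the helper sorted() are hypothetical names chosen for this example, not part of the original code; the key construction follows the description above, and the keys are computed once per word (rather than on every comparison) purely as a convenience.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper class for the comparison-key idea described in the post.
final class ThaiSortKey {
    // U+0E40..U+0E44: the five preposed (leading) vowels
    static boolean isLeadingVowel(char c) { return c >= 0x0E40 && c <= 0x0E44; }

    // U+0E47..U+0E4C: MAITAIKHU, the four tone marks, THANTHAKHAT
    static boolean isAboveMark(char c) { return c >= 0x0E47 && c <= 0x0E4C; }

    static String key(String s) {
        char[] chars = s.toCharArray();
        // Swap each leading vowel with the character that follows it.
        for (int i = 0; i + 1 < chars.length; i++) {
            if (isLeadingVowel(chars[i])) {
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++; // skip the vowel we just moved
            }
        }
        // Move the above-marks into a tail, each preceded by its
        // two-digit distance from the end of the word.
        StringBuilder head = new StringBuilder();
        StringBuilder tail = new StringBuilder("00");
        for (int i = 0; i < chars.length; i++) {
            if (isAboveMark(chars[i])) {
                int pos = chars.length - i;
                if (pos < 10) tail.append('0');
                tail.append(pos).append(chars[i]);
            } else {
                head.append(chars[i]);
            }
        }
        return head.append(tail).toString();
    }

    // Returns a sorted copy; each key is computed once and looked up during the sort.
    static List<String> sorted(List<String> words) {
        Map<String, String> keys = new HashMap<>();
        for (String w : words) keys.put(w, key(w));
        Comparator<String> byKey = (a, b) -> keys.get(a).compareTo(keys.get(b));
        List<String> copy = new ArrayList<>(words);
        copy.sort(byKey);
        return copy;
    }
}
```

Calling ThaiSortKey.sorted(...) on a list of words leaves the input untouched and returns a new list ordered by the comparison keys.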
kriswillems Posted March 13, 2014 (edited)

Nice! Thanks. What does it do with ๆ and ฯ?
kriswillems Posted March 13, 2014 (edited)

PS. For those that are into C, you can find a library of functions related to Thai here: http://linux.thai.net/websvn/wsvn/software.libthai?

There's another generic sorting algorithm in trunk/src/thcoll. Some of the header files used are in trunk/include/thai
Richard W Posted March 15, 2014

kriswillems said: "PS. For those that are into C, ..."

Standard functions strcoll() and strxfrm() should do the comparison. If you want elaborate solutions, there is also ICU (International Components for Unicode).
kriswillems Posted March 15, 2014

Richard W said: "Standard functions strcoll() and strxfrm() should do the comparison. If you want elaborate solutions, there is also ICU (International Components for Unicode)."

Interesting. To set the language used by strcoll(), I have to call

    char * setlocale (int category, const char *locale)

But if I ask for a list of supported languages on my system, like this:

    locale -a

I don't immediately see anything referring to Thai. Any suggestions?
kriswillems Posted March 15, 2014

OK, I've found the answer. I had to add a new locale for Thai like this:

    sudo localedef -f TIS-620 -i th_TH th_TH

After that, th_TH.tis620 shows up when executing

    locale -a

After that, this little program seems to work:

    #include <string.h>
    #include <stdio.h>
    #include <locale.h>

    int main (void)
    {
        /* setlocale() returns NULL if the requested locale is not installed */
        char * result = setlocale(LC_COLLATE, "th_TH.tis620");
        printf ("setlocale returned %s\n", result ? result : "(null)");
        /* negative: first string sorts first; positive: second string sorts first */
        printf ("compare result: %d\n", strcoll("กาน", "ขาร"));
        return 0;
    }
Richard W Posted March 15, 2014

kriswillems said: "I had to add a new locale for Thai like this: sudo localedef -f TIS-620 -i th_TH th_TH"

Useful to know. For Thai, I only had th_TH.utf8 on my machine until I executed that command.

kriswillems said: "After that this little program seems to work. <snip> printf ("compare result: %d\n", strcoll("กาน", "ขาร"));"

One needs a bigger battery of tests to ensure that the preposed characters are being swapped properly. 10 years ago the standards (e.g. ISO 14651) relied on preposed vowels being preprocessed as in the original post. My test results (adapting your code) were:

Compare กาน v. ขาร: -1
Compare เกา v. คน: -3
Compare การ v. คน: -3
Compare คน v. เด็ก: -16
Compare คน v. ดัด: -16

I get the same difference values whether I use th_TH.utf8 or th_TH.tis620, though obviously the strings and selected locale have to be in the right encoding.
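For readers working in Java rather than C, a roughly analogous check can be sketched with java.text.Collator, which plays a role similar to strcoll(). This is only a sketch: the class name ThaiCollatorCheck is made up for the example, and whether the Thai-specific rules (including the handling of preposed vowels) are actually applied depends on the locale data shipped with the Java runtime in use, so the printed signs should be inspected rather than assumed.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical test harness: runs the same five pairs as the C test above
// through the JDK's Thai collator and prints only the sign of each result.
final class ThaiCollatorCheck {
    static final String[][] PAIRS = {
        {"กาน", "ขาร"},
        {"เกา", "คน"},
        {"การ", "คน"},
        {"คน", "เด็ก"},
        {"คน", "ดัด"},
    };

    // -1 if a sorts before b, 1 if after, 0 if the collator treats them as equal.
    static int sign(String a, String b) {
        Collator thai = Collator.getInstance(Locale.forLanguageTag("th-TH"));
        return Integer.signum(thai.compare(a, b));
    }

    public static void main(String[] args) {
        for (String[] p : PAIRS) {
            System.out.println("Compare " + p[0] + " v. " + p[1] + ": " + sign(p[0], p[1]));
        }
    }
}
```

Unlike strcoll(), the magnitude of Collator.compare() carries no meaning, which is why only the sign is reported here.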
kriswillems Posted March 16, 2014

I found other useful commands:

    locale -m

gives an overview of the character sets.

    localedef --help

shows the locale path (in my case /usr/lib/locale:/usr/share/i18n). The files in this path are used as a basis for the compare algorithm (strcoll).

If you look at the file /usr/share/i18n/locales/th_TH you can guess how the algorithm works. It compares collating elements, and each possible consonant with each possible preposed vowel is defined as a certain collating element. I don't think vowels and consonants are swapped by the algorithm; they are just seen as one "unit". This system would also work for other languages that might see 3 or more characters as being one collating element.
ThaiNotes Posted March 16, 2014

kriswillems said: "I don't think vowel and consonants are swapped by the algorithm, they are just seen as one "unit"."

Looking at the code, you're right that leading vowels and consonants are seen as one unit. However, they are actually swapped. Line 493 includes the comment

    % Thai consonants, with leading vowels rearrangement

The lines of code that follow define the swapped vowel/consonant pairs.

BTW, following on from an earlier post, does anyone know how ๆ and ฯ should be sorted? All the references I've looked at ignore these characters.
Richard W Posted March 16, 2014

kriswillems said: "If you look at this file: /usr/share/i18n/locales/th_TH you can guess how the algorithm works. It compares collating elements and each possible consonant with each possible preposed vowel is defined as a certain collating element. I don't think vowel and consonants are swapped by the algorithm, they are just seen as one "unit"."

That's how it does the swapping - see http://www.unicode.org/reports/tr10/#Many_To_Many for details. The ISO 14651 'common template' ordering (the ancestor of the file /usr/share/i18n/locales/iso14651_t1_common that is present in many GNU-based systems) and the Default Unicode Collation Element Table (DUCET) are generated by the same program. The difference is that the common template uses symbols while DUCET just has numbers. The use of symbols makes tailoring for specific languages easier, though it requires the use of 'copy' in the locale definition files in a way not permitted by POSIX. The en_CA locale is one example where 'copy' is used in this very useful prohibited (and seemingly undocumented) manner. The swapping mechanism is now present in the 'common template' ordering, though the file iso14651_t1_common in Ubuntu 12.04 (Precise Pangolin) lacks it.

kriswillems said: "This system would also work for other languages that might see 3 or more characters as being one collating element."

Be very careful what you say. These collating elements are artifices; you do not want a search for ฒ to reject a sequence เฒ because it does not contain the collating element of ฒ. The Common Locale Data Repository (CLDR) has special 'collations' for searching. Most of them, including the one for Thai, switch these order-reversing 'contractions' off for searching.

But yes, this trick is used for Burmese, for although its preposed vowels are stored after the consonant, it effectively swaps final consonants and vowels around for ordering purposes. At least in Burmese one can easily identify final consonants.

Lao has the worst of both worlds - it needs cluster-aware swapping of preposed vowels and, for the better dictionaries, (further) swapping of vowels and final consonants. The simple (= greedy) rules for splitting a word into collating elements do not quite work for Lao unless you mark the syllable boundaries even when they're perfectly obvious. The CLDR Collation Algorithm allows tables to include the intelligence to spot these obvious cases in Lao, but one ends up with enormous tables.
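The search caveat above can be illustrated with a small Java sketch. It is a deliberately simplified model, not how glibc, ICU or CLDR are implemented: the class CollatingElements and its methods are hypothetical, and the only contraction it knows is "preposed vowel + following Thai consonant", split greedily from left to right.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: a preposed vowel plus the consonant after it form one collating element.
final class CollatingElements {
    static boolean isLeadingVowel(char c) { return c >= 0x0E40 && c <= 0x0E44; }

    // Simplification: treats the whole block U+0E01..U+0E2E (ก to ฮ) as consonants.
    static boolean isThaiConsonant(char c) { return c >= 0x0E01 && c <= 0x0E2E; }

    // Greedy left-to-right split into collating elements.
    static List<String> split(String s) {
        List<String> elements = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            boolean pair = isLeadingVowel(s.charAt(i))
                    && i + 1 < s.length()
                    && isThaiConsonant(s.charAt(i + 1));
            int len = pair ? 2 : 1;
            elements.add(s.substring(i, i + len));
            i += len;
        }
        return elements;
    }

    // A naive element-level search: looks for the needle as a whole element.
    static boolean containsElement(String text, String element) {
        return split(text).contains(element);
    }
}
```

In this model the word เฒ่า splits into the elements เฒ, ่ and า, so an element-level search for ฒ finds nothing even though the character is plainly present in the string. That is the kind of false negative that switching the contractions off for searching is meant to avoid.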
kriswillems Posted March 16, 2014 (edited)

I am still confused about why the sorting worked for me after:

    sudo localedef -f TIS-620 -i th_TH th_TH

because if I do a dump of my memory, I clearly see the string is UTF-8 encoded. So I don't understand why

    setlocale(LC_COLLATE, "th_TH.tis620");

worked. Anyway, I tried:

    sudo localedef -f UTF-8 -i th_TH th_TH

Now th_TH.utf8 shows up when I do locale -a (like on your system, Richard), and I can change this line in my program from:

    result = setlocale(LC_COLLATE, "th_TH.tis620");

to:

    result = setlocale(LC_COLLATE, "th_TH.utf8");

There seems to be one guy behind all Thai language support in Linux (C programming). His homepage is: http://linux.thai.net/~thep/

Here you can read what he says about ๆ and ฯ: http://linux.thai.net/~thep/tsort.html

On your Linux system you can find them in the th_TH file under the punctuation marks:

    % punctuation marks, ordered after ISO/IEC 14651
    <U0E2F> IGNORE;IGNORE;<U0E2F>;IGNORE % THAI CHARACTER PAIYANNOI
    <U0E46> IGNORE;IGNORE;<U0E46>;IGNORE % THAI CHARACTER MAIYAMOK

And here you see the weight he assigns to the ๆ and ฯ marks in thcoll: http://linux.thai.net/websvn/wsvn/software.libthai/trunk/src/thcoll/cweight.c
Richard W Posted March 16, 2014

kriswillems said: "I am still confused why the sorting working for me after: sudo localedef -f TIS-620 -i th_TH th_TH Because if do a dump of my memory, I clearly see the string is UTF-8 encoded. So, I don't understand why setlocale(LC_COLLATE, "th_TH.tis620"); worked."

Did it work? For my test cases, with UTF-8 strings sorted according to the 8-bit locale, I get:

Compare กาน v. ขาร: -1
Compare เกา v. คน: 1
Compare การ v. คน: 1
Compare คน v. เด็ก: -1
Compare คน v. ดัด: -1

The two results with '1' as the result are wrong. So, it works for the one example you gave, but the results are almost random. In particular, the results for the comparison กาน v. ขาร depend on the results of comparing the final letter.

kriswillems said: "Here you can read about what he says about ๆ and ฯ http://linux.thai.net/~thep/tsort.html On your linux system you can find them in the th_TH file under the punctuation marks: % punctuation marks, ordered after ISO/IEC 14651 <U0E2F> IGNORE;IGNORE;<U0E2F>;IGNORE % THAI CHARACTER PAIYANNOI <U0E46> IGNORE;IGNORE;<U0E46>;IGNORE % THAI CHARACTER MAIYAMOK"

The comment in the file is very misleading. ISO/IEC 14651 and DUCET still treat paiyannoi (ฯ) as though it were the final consonant in alphabetical order, coming between ฮ and ะ, and mai yamok (ๆ) as the first letter. As to the treatment of the third and fourth level weights, ISO/IEC 14651, at least up to 2008, seems to make a total mess out of a complicated system.

I couldn't find any evidence in the RID for the ordering of paiyannoi, and I suspect it doesn't exist. Evidence for mai yamok may exist, but I don't yet intend to flick through the RID looking for it. Part of the complication is that some claim that mai yamok should always be preceded by a space, in which case its sorting properties may simply flow from those of the character SPACE.

Now, there is an option in the collation rules which says: ignore punctuation (or ignore punctuation and symbols - there are conflicting views) until it comes to tie-breaking. Thep puts punctuation at the 3rd level in his tables and capitalisation at the 4th level, whereas the standard is to put capitalisation at the third level and 'ignored' punctuation at the fourth level. For sorting pure Thai the difference in ordering does not matter.

This option to ignore punctuation etc. is called 'variable weighting'. I am not sure what should happen to paiyannoi and maiyamok when sorting Thai with this rule in effect. The question is whether paiyannoi and maiyamok should be ignored to the same extent as spaces when one ignores spaces in sorting.
kriswillems Posted March 16, 2014 (edited)

My result (using UTF-8 strings) worked both with

    setlocale(LC_COLLATE, "th_TH.tis620");

and

    setlocale(LC_COLLATE, "th_TH.utf8");

Compare กาน v. ขาร: -1
Compare เกา v. คน: -3
Compare การ v. คน: -3
Compare คน v. เด็ก: -33
Compare คน v. ดัด: -24

But I don't understand why the first one (th_TH.tis620) worked.
kriswillems Posted March 16, 2014 (edited)

Sorry, I seem to have made a mistake. The last 2 results are:

Compare คน v. เด็ก: -16
Compare คน v. ดัด: -16

Looks like there's something wrong with tis620 sorting on my system (it incorrectly assumes the characters are in UTF-8). Normally I just use UTF-8 anyway.
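To make the encoding question in the last few posts more concrete, the following Java sketch prints the bytes of a Thai string in both encodings. The class ThaiBytes is a hypothetical helper; the TIS-620 conversion is written by hand on the assumption that TIS-620 places U+0E01..U+0E5B at bytes 0xA1..0xFB (so that the example does not depend on which extra charsets a given Java runtime ships).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for looking at the raw bytes a C program would hand to strcoll().
final class ThaiBytes {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Hand-rolled TIS-620 encoder: ASCII is unchanged, Thai U+0E01..U+0E5B maps to 0xA1..0xFB.
    static byte[] tis620(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out[i] = (byte) c;
            } else if (c >= 0x0E01 && c <= 0x0E5B) {
                out[i] = (byte) (c - 0x0E00 + 0xA0);
            } else {
                throw new IllegalArgumentException("not representable in TIS-620: " + c);
            }
        }
        return out;
    }

    // Lower-case hex dump with spaces between bytes.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }
}
```

For example, ThaiBytes.hex(ThaiBytes.utf8("ก")) gives "e0 b8 81" while ThaiBytes.hex(ThaiBytes.tis620("ก")) gives "a1". Every Thai character in UTF-8 starts with the byte 0xE0, which in TIS-620 is the preposed vowel เ, so a collation routine that really did read UTF-8 bytes as TIS-620 would be comparing quite different "words" from the ones intended.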
Richard W Posted March 16, 2014

kriswillems said: "Looks like there's something wrong with tis620 sorting on my system (it incorectly assumes the characters are in utf-8)."

What is your system? It's just conceivable that your system is checking the consistency of the locale settings - they're generally not required to work if thoroughly inconsistent. It may be worth setting locale for category LC_ALL rather than just LC_COLLATE and seeing what happens.
kriswillems Posted March 17, 2014 (edited)

Ubuntu 12.04 (64-bit). LC_ALL gives the same results.
ThaiNotes Posted March 18, 2014

kriswillems said: "Nice! Thanks. What does it do with ๆ and ฯ ?"

It gets them wrong. It treats them as ordinary characters collating according to their Unicode value. For example,

    เบะปาก เบา เบาฯ เบาๆ เบ้า เบาความ

(which I think is probably the correct alphabetical order; เบาฯ is a made-up word for testing purposes) sorts as

    เบะปาก เบา เบ้า เบาความ เบาฯ เบาๆ

Fortunately, that's good enough for my web application. LibreOffice (which I presume uses the Linux sorting functionality) sorts as

    เบะปาก เบา เบ้า เบาฯ เบาๆ เบาความ

which seems wrong to me.
Richard W Posted March 18, 2014

I agree that

    1) เบา 2) เบ้า 3) เบาๆ

is the wrong order. After looking through the RID from ก to ณ I found two clear examples, the orderings

    1) งัก ๆ, งั่ก ๆ 2) งั่ก

and

    1) งัง ๆ 2) งั่ง

Now this argues that the most significant weight of ๆ is secondary or less. The simplest argument is that one should be using a 'variable' weighting (alternate=shifted) as opposed to (alternate=non-ignorable). LibreOffice is probably using (alternate=non-ignorable).

I can't find any evidence for the weighting of paiyannoi.
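As a rough illustration of the 'variable weighting' idea discussed above, here is a Java sketch that ignores space, ฯ and ๆ on the first pass and only uses them to break ties. It is an approximation for experimenting, not an implementation of the Unicode Collation Algorithm: the class ShiftedThai is hypothetical, the set of 'variable' characters is an assumption made for this example, and the underlying ordering is whatever base comparator is passed in.

```java
import java.util.Comparator;

// Approximation of "shifted" handling: variable characters only matter as a tie-breaker.
final class ShiftedThai {
    // Assumption for this sketch: SPACE, PAIYANNOI (U+0E2F) and MAIYAMOK (U+0E46) are variable.
    static boolean isVariable(char c) {
        return c == ' ' || c == 0x0E2F || c == 0x0E46;
    }

    // Returns s with every variable character removed.
    static String strip(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isVariable(c)) sb.append(c);
        }
        return sb.toString();
    }

    // First compare with variable characters removed; if that ties,
    // fall back to comparing the original strings with the same base ordering.
    static Comparator<String> shifted(Comparator<String> base) {
        return (a, b) -> {
            int first = base.compare(strip(a), strip(b));
            return first != 0 ? first : base.compare(a, b);
        };
    }
}
```

With plain code-point order as the base, ShiftedThai.shifted(Comparator.naturalOrder()) places เบาๆ before เบาความ, whereas plain code-point order on its own places it after. Passing a comparator built on the comparison strings from the first post as the base should, by the same reasoning, give the order เบา, เบาฯ, เบาๆ, เบ้า, เบาความ, though that combination is not shown here.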