Computer Sorting in Thai


ThaiNotes

I appreciate this is of very limited interest, but I recently faced an interesting challenge: how to sort lists of Thai words alphabetically. For reasons that are unimportant here, I had to code this from scratch. The way I sort as a human is poorly suited to computers, so I found an old algorithm (Londe & Warotamasikkhadit, 1969) and implemented it in Java. I give my code here (i) because I think the algorithm is both clever and interesting, and (ii) because in the future someone facing the same challenge might stumble across this page.

First the static variables

    static final char SARA_E = 0x0E40;
    static final char SARA_AI_MAIMALAI = 0x0E44;
    static final char MAITAIKHU = 0x0E47;
    static final char THANTHAKHAT = 0x0E4C;  // a.k.a. "garan"		

A couple of utility functions

    boolean isLeadingVowel(char c) {
        // Returns true if character is in the range from SARA E to SARA AI MAIMALAI,
        // i.e. if the character is a leading vowel

        return (c >= SARA_E && c <= SARA_AI_MAIMALAI);
    }

    boolean isToneMark(char c) {
        // Returns true if character is in the range from MAITAIKHU to THANTHAKHAT,
        // which also includes the four tone marks, i.e. the "above" symbols
        // that get moved to the tail of the comparison String

        return (c >= MAITAIKHU && c <= THANTHAKHAT);
    }

Finally the code to produce Strings that can be compared directly.

    String getThaiComparisonString(String s) {

        // Convert String to separate characters so we can manipulate them
        char[] chars = s.toCharArray();

        // Swap each leading vowel with the character that follows it.
        // The loop stops one short of the end so that a leading vowel in
        // the last position cannot cause an ArrayIndexOutOfBoundsException.
        for (int i = 0; i + 1 < chars.length; i++) {
            if (isLeadingVowel(chars[i])) {
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++;  // skip over the vowel we have just moved
            }
        }

        // The String for comparison is built in two parts, here referred to
        // as "head" and "tail".  "tail" always begins with "00".
        StringBuilder head = new StringBuilder();
        StringBuilder tail = new StringBuilder("00");

        // For each tone mark, MAITAIKHU, or THANTHAKHAT found, leave it out
        // of "head", add a 2 digit String to "tail" representing
        // its position counted from the END of the String,
        // then append the mark itself to "tail"
        for (int i = 0; i < chars.length; i++) {
            if (isToneMark(chars[i])) {
                int pos = chars.length - i;
                if (pos < 10)
                    tail.append('0');
                tail.append(pos);
                tail.append(chars[i]);
            }
            else {
                head.append(chars[i]);
            }
        }

        // Return the String for comparison
        return head.toString() + tail.toString();
    }

Finally, to sort, all you need is:

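    // "this" is assumed to be a List<String>; the Comparator's type
    // parameter has to match the arguments of compare()
    Collections.sort(this, new Comparator<String>() {
        public int compare(String s1, String s2) {
            return getThaiComparisonString(s1).compareTo(getThaiComparisonString(s2));
        }
    });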

So, for example, in the process เมษายน becomes มเษายน00 for comparison purposes, and กุมภาพันธ์ becomes กุมภาพันธ0001์.
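For anyone who wants to check those two examples, here is a minimal self-contained sketch of the same steps. The class name ThaiSortKeyDemo, the method name sortKey and the sample word list are made up for illustration, and the helpers are static only so that the whole thing runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class: a compact restatement of the steps above.
final class ThaiSortKeyDemo {
    static final char SARA_E = 0x0E40;
    static final char SARA_AI_MAIMALAI = 0x0E44;
    static final char MAITAIKHU = 0x0E47;
    static final char THANTHAKHAT = 0x0E4C;

    static boolean isLeadingVowel(char c) {
        return c >= SARA_E && c <= SARA_AI_MAIMALAI;
    }

    static boolean isToneMark(char c) {
        return c >= MAITAIKHU && c <= THANTHAKHAT;
    }

    static String sortKey(String s) {
        char[] chars = s.toCharArray();
        // Step 1: move each leading vowel behind the character that follows it
        for (int i = 0; i + 1 < chars.length; i++) {
            if (isLeadingVowel(chars[i])) {
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++;
            }
        }
        // Step 2: pull the marks out of the head and record them in the tail,
        // each one preceded by its 2 digit position counted from the end
        StringBuilder head = new StringBuilder();
        StringBuilder tail = new StringBuilder("00");
        for (int i = 0; i < chars.length; i++) {
            if (isToneMark(chars[i])) {
                int pos = chars.length - i;
                if (pos < 10) {
                    tail.append('0');
                }
                tail.append(pos).append(chars[i]);
            } else {
                head.append(chars[i]);
            }
        }
        return head.toString() + tail.toString();
    }

    public static void main(String[] args) {
        // The two worked examples from the text
        System.out.println(sortKey("เมษายน"));
        System.out.println(sortKey("กุมภาพันธ์"));

        // Sorting a small sample list by comparison key
        List<String> words = new ArrayList<>(Arrays.asList("เมษายน", "กุมภาพันธ์", "คน", "เกา"));
        words.sort(Comparator.comparing(ThaiSortKeyDemo::sortKey));
        System.out.println(words);
    }
}
```

Computing the key once per word and sorting on it avoids rebuilding both keys on every comparison, which may matter for long lists.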

Pretty neat, huh?


PS. For those that are into C, ...

Standard functions strcoll() and strxfrm() should do the comparison. If you want elaborate solutions, there is also ICU (International Components for Unicode).

Interesting. To set the language used by strcoll(), I have to call

char * setlocale (int category, const char *locale)

But if I ask for a list of supported languages on my system, like this:

locale -a

I don't immediately see anything referring to Thai.

Any suggestions?


Ok, I've found the answer.

I had to add a new locale for Thai like this:

sudo localedef -f TIS-620 -i th_TH th_TH

After that th_TH.tis620 shows up when executing locale -a

After that this little program seems to work.

#include <string.h>
#include <stdio.h>
#include <locale.h>

int main(void)
{
    /* setlocale() returns NULL if the requested locale is not installed */
    char *result = setlocale(LC_COLLATE, "th_TH.tis620");
    printf("setlocale returned %s\n", result ? result : "(null)");
    printf("compare result: %d\n", strcoll("กาน", "ขาร"));
    return 0;
}

I had to add a new locale for Thai like this:

sudo localedef -f TIS-620 -i th_TH th_TH

Useful to know. For Thai, I only had th_TH.utf8 on my machine until I executed that command.

After that this little program seems to work.

<snip>

printf ("compare result: %d\n", strcoll("กาน", "ขาร"));

One needs a bigger battery of tests to ensure that the preposed characters are being swapped properly. 10 years ago the standards (e.g. ISO 14651) relied on preposed vowels being preprocessed as in the original post. My test results (adapting your code) were:

Compare กาน v. ขาร: -1

Compare เกา v. คน: -3

Compare การ v. คน: -3

Compare คน v. เด็ก: -16

Compare คน v. ดัด: -16

I get the same difference values whether I use th_TH.utf8 or th_TH.tis620, though obviously the strings and selected locale have to be in the right encoding.
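For comparison on the Java side, the same battery of pairs could be pushed through java.text.Collator. This is only a sketch: it assumes the runtime actually ships Thai collation data (in recent JDKs that lives, as far as I know, in the optional jdk.localedata module), and ThaiCollatorBattery and thaiCollator() are names made up for illustration. Only the sign of each result is printed, since the magnitudes are implementation details. Whether the pairs with preposed vowels come out negative is exactly the thing to verify on a given runtime.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical test harness: runs the five pairs above through the
// JDK's locale-sensitive Collator instead of strcoll().
final class ThaiCollatorBattery {

    // Collator for Thai; falls back to default rules if no Thai data is available
    static Collator thaiCollator() {
        return Collator.getInstance(Locale.forLanguageTag("th-TH"));
    }

    public static void main(String[] args) {
        Collator collator = thaiCollator();
        String[][] pairs = {
            {"กาน", "ขาร"},
            {"เกา", "คน"},
            {"การ", "คน"},
            {"คน", "เด็ก"},
            {"คน", "ดัด"},
        };
        for (String[] pair : pairs) {
            // A negative sign means the first word sorts before the second
            int sign = Integer.signum(collator.compare(pair[0], pair[1]));
            System.out.println("Compare " + pair[0] + " v. " + pair[1] + ": " + sign);
        }
    }
}
```

All five pairs should print -1 if preposed vowels are handled; a 1 for เกา v. คน would suggest the strings are being compared by raw character value.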


I found other useful commands:

locale -m

Gives an overview of the character sets.

localedef --help

shows the locale path. The files in this path are used as the basis for the comparison algorithm (strcoll).

(in my case /usr/lib/locale:/usr/share/i18n)

If you look at this file:

/usr/share/i18n/locales/th_TH

you can guess how the algorithm works

It compares collating elements and each possible consonant with each possible preposed vowel is defined as a certain collating element. I don't think vowel and consonants are swapped by the algorithm, they are just seen as one "unit". This system would also work for other languages that might see 3 or more characters as being one collating element.


I don't think vowel and consonants are swapped by the algorithm, they are just seen as one "unit".

Looking at the code, you're right that leading vowels and consonants are seen as one unit. However, they are actually swapped. Line 493 includes the comment

% Thai consonants, with leading vowels rearrangement

The following lines of code define the swapped vowel/consonants.

BTW, following on from an earlier post, does anyone know how ๆ and ฯ should be sorted? All the references I've looked at ignore these characters.


If you look at this file:

/usr/share/i18n/locales/th_TH

you can guess how the algorithm works

It compares collating elements and each possible consonant with each possible preposed vowel is defined as a certain collating element. I don't think vowel and consonants are swapped by the algorithm, they are just seen as one "unit".

That's how it does the swapping - see http://www.unicode.org/reports/tr10/#Many_To_Many for details. The ISO 14651 'common template' ordering (the ancestor of the file /usr/share/i18n/locales/iso14651_t1_common that is present in many GNU-based systems) and the Default Unicode Collation Element Table (DUCET) are generated by the same program. The difference is that the common template uses symbols while DUCET just has numbers. The use of symbols makes tailoring for specific languages easier, though it requires the use of 'copy' in the local definition files in a way not permitted by POSIX. The en_CA locale is one example where 'copy' is used in this very useful prohibited (and seemingly undocumented) manner.

The swapping mechanism is now present in the 'common template' ordering, though the file iso14651_t1_common in Ubuntu 12.04 (precise pangolin) lacks it.

This system would also work for other languages that might see 3 or more characters as being one collating element.

Be very careful what you say. These collating elements are artifices; you do not want a search for ฒ to reject a sequence เฒ because it does not contain the collating element of ฒ. The Common Locale Data Repository (CLDR) has special 'collations' for searching. Most of them, including the one for Thai, switch these order reversing 'contractions' off for searching.

But yes, this trick is used for Burmese, for although its preposed vowels are stored after the consonant, it effectively swaps final consonants and vowels around for ordering purposes. At least in Burmese one can easily identify final consonants. Lao has the worst of both worlds - it needs cluster aware swapping of preposed vowels and, for the better dictionaries, (further) swapping of vowels and final consonants. The simple (= greedy) rules for splitting a word into collating elements do not quite work for Lao unless you mark the syllable boundaries even when they're perfectly obvious. The CLDR Collation Algorithm allows tables to include the intelligence to spot these obvious cases in Lao, but one ends up with enormous tables.
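To make the collating-element point concrete, here is a rough sketch of the idea, not the actual glibc or CLDR implementation. It splits a word greedily into elements, treating a leading vowel plus the following consonant as one two-character element, and then expands that element as consonant followed by vowel. The class and method names are hypothetical, and real tables carry multi-level weights rather than plain characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "contractions" for Thai leading vowels.
final class CollatingElementSketch {

    static boolean isLeadingVowel(char c) {
        return c >= 0x0E40 && c <= 0x0E44;   // SARA E .. SARA AI MAIMALAI
    }

    static boolean isThaiConsonant(char c) {
        return c >= 0x0E01 && c <= 0x0E2E;   // KO KAI .. HO NOKHUK
    }

    // Greedy split: leading vowel + consonant is one element,
    // every other character is an element on its own.
    static List<String> elements(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (i + 1 < s.length()
                    && isLeadingVowel(s.charAt(i))
                    && isThaiConsonant(s.charAt(i + 1))) {
                out.add(s.substring(i, i + 2));
                i += 2;
            } else {
                out.add(s.substring(i, i + 1));
                i++;
            }
        }
        return out;
    }

    // Stand-in for the weight sequence: a two-character element
    // contributes its consonant first and its vowel second.
    static String weights(String s) {
        StringBuilder sb = new StringBuilder();
        for (String e : elements(s)) {
            if (e.length() == 2) {
                sb.append(e.charAt(1)).append(e.charAt(0));
            } else {
                sb.append(e);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "One unit" and "swapped" turn out to be the same thing:
        System.out.println(elements("เกา") + " -> " + weights("เกา"));

        // The search pitfall: the element list for เฒ่า holds เฒ as a
        // unit, so a naive element-level search for ฒ finds nothing.
        System.out.println(elements("เฒ่า").contains("ฒ"));
    }
}
```

The second line of output is why search collations switch these contractions off: matching has to see ฒ inside เฒ.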


I am still confused why the sorting worked for me after:

sudo localedef -f TIS-620 -i th_TH th_TH

Because if I do a dump of my memory, I clearly see the string is UTF-8 encoded.

So, I don't understand why

setlocale(LC_COLLATE, "th_TH.tis620");

worked.

Anyway, I tried:

sudo localedef -f UTF-8 -i th_TH th_TH

Now, th_TH.utf8 shows up when I do locale -a

(like on your system, Richard).

And I can change this line in my program from:

result = setlocale(LC_COLLATE, "th_TH.tis620");

to:
result = setlocale(LC_COLLATE, "th_TH.utf8");
There seems to be one guy behind all Thai language support in Linux (C programming).
His homepage is :
Here you can read about what he says about ๆ and ฯ: http://linux.thai.net/~thep/tsort.html
On your linux system you can find them in the th_TH file under the punctuation marks:
% punctuation marks, ordered after ISO/IEC 14651
<U0E2F> IGNORE;IGNORE;<U0E2F>;IGNORE % THAI CHARACTER PAIYANNOI
<U0E46> IGNORE;IGNORE;<U0E46>;IGNORE % THAI CHARACTER MAIYAMOK
And here you see the weight he assigns to ๆ and ฯ marks in thcoll:
Edited by kriswillems

I am still confused why the sorting worked for me after:

sudo localedef -f TIS-620 -i th_TH th_TH

Because if I do a dump of my memory, I clearly see the string is UTF-8 encoded.

So, I don't understand why

setlocale(LC_COLLATE, "th_TH.tis620");

worked.

Did it work? For my test cases, with UTF-8 strings sorted according to the 8-bit locale, I get:

Compare กาน v. ขาร: -1
Compare เกา v. คน: 1
Compare การ v. คน: 1
Compare คน v. เด็ก: -1
Compare คน v. ดัด: -1
The two results with '1' as the result are wrong.

So, it works for the one example you gave, but the results are almost random. In particular, the results for the comparison กาน v. ขาร depend on the results of comparing the final letter.

Here you can read about what he says about ๆ and ฯ

http://linux.thai.net/~thep/tsort.html

On your linux system you can find them in the th_TH file under the punctuation marks:

% punctuation marks, ordered after ISO/IEC 14651

<U0E2F> IGNORE;IGNORE;<U0E2F>;IGNORE % THAI CHARACTER PAIYANNOI

<U0E46> IGNORE;IGNORE;<U0E46>;IGNORE % THAI CHARACTER MAIYAMOK

The comment in the file is very misleading. ISO/IEC 14651 and DUCET still treat paiyannoi (ฯ) as though it were the final consonant in alphabetical order, coming between ฮ and ะ, and mai yamok (ๆ) as the first letter. As to the treatment of the third and fourth level weights, ISO/IEC 14651, at least up to 2008, seems to make a total mess out of a complicated system.

I couldn't find any evidence in the RID for the ordering of paiyannoi, and I suspect it doesn't exist. Evidence for mai yamok may exist, but I don't yet intend to flick through the RID looking for it. Part of the complication is that some claim that mai yamok should always be preceded by a space, in which case its sorting properties may simply flow from those of the character SPACE.

Now, there is an option in the collation rules which says, ignore punctuation (or ignore punctuation and symbols - there are conflicting views) until it comes to tie-breaking. Thep puts punctuation at the 3rd level in his tables and capitalisation at the 4th level, whereas the standard is to put capitalisation at the third level and 'ignored' punctuation at the fourth level. For sorting pure Thai the difference in ordering does not matter. This option to ignore punctuation etc. is called 'variable weighting'. I am not sure what should happen to paiyannoi and maiyamok when sorting Thai with this rule in effect. The question is whether paiyannoi and maiyamok should be ignored to the same extent as spaces when one ignores spaces in sorting.


My result (using UTF-8 strings) worked both with

setlocale(LC_COLLATE, "th_TH.tis620");

and

result = setlocale(LC_COLLATE, "th_TH.utf8");

Compare กาน v. ขาร: -1
Compare เกา v. คน: -3
Compare การ v. คน: -3
Compare คน v. เด็ก: -33
Compare คน v. ดัด: -24

But I don't understand why the first one (th_TH.tis620) worked.

Edited by kriswillems

Sorry, I seem to have made a mistake.

The last 2 results are:

Compare คน v. เด็ก: -16
Compare คน v. ดัด: -16

Looks like there's something wrong with tis620 sorting on my system (it incorrectly assumes the characters are in utf-8).

Normally I just use utf-8 anyway.

Edited by kriswillems

Looks like there's something wrong with tis620 sorting on my system (it incorrectly assumes the characters are in utf-8).

What is your system?

It's just conceivable that your system is checking the consistency of the locale settings - they're generally not required to work if thoroughly inconsistent. It may be worth setting locale for category LC_ALL rather than just LC_COLLATE and seeing what happens.


Nice! Thanks.

What does it do with ๆ and ฯ ?

It gets them wrong. It treats them as ordinary characters collating according to their Unicode value.

For example,

เบะปาก
เบา
เบาฯ
เบาๆ
เบ้า
เบาความ

(which I think is probably the correct alphabetical order. เบาฯ is a made-up word for testing purposes)

sorts as

เบะปาก
เบา
เบ้า
เบาความ
เบาฯ
เบาๆ

Fortunately, that's good enough for my web application.

LibreOffice (which I presume uses the Linux sorting functionality) sorts as

เบะปาก
เบา
เบ้า
เบาฯ
เบาๆ
เบาความ

which seems wrong to me.


I agree that
1) เบา 2) เบ้า 3) เบาๆ
is the wrong order. After looking through the RID from ก to ณ I found two clear examples, the orderings
1) งัก ๆ, งั่ก ๆ 2) งั่ก
and
1) งัง ๆ 2) งั่ง

Now this argues that the most significant weight of mai yamok (ๆ) is secondary or less. The simplest argument is that one should be using a 'variable' weighting (alternate=shifted) as opposed to (alternate=non-ignorable). LibreOffice is probably using (alternate=non-ignorable).

I can't find any evidence for the weighting of paiyannoi.
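One possible reading of the 'shifted' idea, sketched on top of the comparison key from the original post: drop ๆ and ฯ before building the key, and only use them to break ties between otherwise identical words. Treating both marks this way is an assumption on my part (as noted, the evidence for paiyannoi is missing), and ShiftedMarksSketch, baseKey, shiftedMarks and SHIFTED_ORDER are made-up names. This is not how glibc, ICU or LibreOffice implement variable weighting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: mai yamok and paiyannoi as tie-breakers only.
final class ShiftedMarksSketch {
    static final char PAIYANNOI = 0x0E2F;
    static final char MAIYAMOK = 0x0E46;

    static boolean isShifted(char c) {
        return c == PAIYANNOI || c == MAIYAMOK;
    }

    // Comparison key as in the original post, computed with the
    // shifted marks left out.
    static String baseKey(String s) {
        StringBuilder kept = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (!isShifted(c)) {
                kept.append(c);
            }
        }
        char[] chars = kept.toString().toCharArray();
        for (int i = 0; i + 1 < chars.length; i++) {
            if (chars[i] >= 0x0E40 && chars[i] <= 0x0E44) {   // leading vowel
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++;
            }
        }
        StringBuilder head = new StringBuilder();
        StringBuilder tail = new StringBuilder("00");
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] >= 0x0E47 && chars[i] <= 0x0E4C) {   // MAITAIKHU .. THANTHAKHAT
                int pos = chars.length - i;
                if (pos < 10) {
                    tail.append('0');
                }
                tail.append(pos).append(chars[i]);
            } else {
                head.append(chars[i]);
            }
        }
        return head.toString() + tail.toString();
    }

    // The shifted marks on their own, in their original order.
    static String shiftedMarks(String s) {
        StringBuilder marks = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (isShifted(c)) {
                marks.append(c);
            }
        }
        return marks.toString();
    }

    // Letters and tone marks decide first; ๆ and ฯ only break ties.
    static final Comparator<String> SHIFTED_ORDER =
        Comparator.comparing(ShiftedMarksSketch::baseKey)
                  .thenComparing(ShiftedMarksSketch::shiftedMarks);

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList(
            "เบาความ", "เบ้า", "เบาๆ", "เบาฯ", "เบา", "เบะปาก"));
        words.sort(SHIFTED_ORDER);
        for (String w : words) {
            System.out.println(w);
        }
    }
}
```

Under these assumptions เบาๆ lands between เบา and เบ้า, because the tone mark outranks the tie-breaker. Whether ฯ belongs before ๆ is just an accident of their character values here.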
