Computer Sorting in Thai

March 13, 2014

I appreciate this is of very limited interest, but I recently faced an interesting challenge: how to sort lists of Thai words alphabetically. For reasons that are unimportant here, I had to code this from scratch. The way a human reader sorts Thai words does not translate well into code. I found an old algorithm (Londe & Warotamasikkhadit, 1969) and implemented it in Java. I give my code here (i) because I think the algorithm is both clever and interesting, and (ii) because someone facing the same challenge in future might stumble across this page.

First the static variables

    static final char SARA_E = 0x0E40;
    static final char SARA_AI_MAIMALAI = 0x0E44;
    static final char MAITAIKHU = 0x0E47;
    static final char THANTHAKHAT = 0x0E4C;  // a.k.a. "garan"

A couple of utility functions

    boolean isLeadingVowel(char c) {
        // Returns true if the character is in the range from SARA_E to SARA_AI_MAIMALAI,
        // i.e. if the character is a leading vowel
        
        return (c >= SARA_E && c <= SARA_AI_MAIMALAI);
    }

    boolean isToneMark(char c) {
        // Returns true if the character is in the range from MAITAIKHU to THANTHAKHAT,
        // which also covers the four tone marks that lie between them
        
        return (c >= MAITAIKHU && c <= THANTHAKHAT);
    }

Next, the code to produce Strings that can be compared directly.

    String getThaiComparisonString(String s) {
    
        // Convert String to separate characters so we can manipulate them
        char[] chars = s.toCharArray();
                
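        // Swap each leading vowel with the character that follows it.
        // The loop stops one short of the end so that a leading vowel in
        // the last position cannot index past the array.
        for (int i = 0; i < chars.length - 1; i++) {
            if (isLeadingVowel(chars[i])) {
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++;
            }
        }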
        
        // The String for comparison is built in two parts, here referred to
        // as "head" and "tail".  "tail" always begins with "00".
        String tail = "00";
        
        // For each tone mark, MAITAIKHU, or THANTHAKHAT found, remove it
        // from "head", add a 2-digit String to "tail" representing
        // its position counted from the END of the String,
        // then append the mark itself to "tail"
        String head = "";
        for (int i = 0; i < chars.length; i++) {
            if (isToneMark(chars[i])) {
                int pos = chars.length - i;
                if (pos >= 10)
                    tail += "" + pos;
                else
                    tail+= "0" + pos;
                tail += chars[i];                                
            }
            else {
                head += chars[i];
            }
        }

        // Return the String for comparison
        return head + tail;
    }

Finally, to sort, all you need is:

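    // The Comparator's type parameter must match the arguments of compare();
    // here the list being sorted ("this") is assumed to hold Strings.
    Collections.sort(this, new Comparator<String>() {
        public int compare(String s1, String s2) {
            return getThaiComparisonString(s1).compareTo(getThaiComparisonString(s2));
        }
    });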

So, for example, in the process เมษายน becomes มเษายน00 for comparison purposes, and กุมภาพันธ์ becomes กุมภาพันธ0001์.

Pretty neat, huh?
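
For anyone who wants to try this without my application around it, here is a minimal self-contained sketch of the same key construction. The class name ThaiKeyDemo and the method name key are made up for this post, the range checks are inlined as hex constants, and the source file is assumed to be saved and compiled as UTF-8.

```java
class ThaiKeyDemo {
    // Leading vowels: SARA E (U+0E40) to SARA AI MAIMALAI (U+0E44)
    static boolean isLeadingVowel(char c) { return c >= 0x0E40 && c <= 0x0E44; }

    // MAITAIKHU (U+0E47) to THANTHAKHAT (U+0E4C), tone marks included
    static boolean isToneMark(char c) { return c >= 0x0E47 && c <= 0x0E4C; }

    static String key(String s) {
        char[] chars = s.toCharArray();

        // Step 1: swap each leading vowel with the character after it
        for (int i = 0; i < chars.length - 1; i++) {
            if (isLeadingVowel(chars[i])) {
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++;
            }
        }

        // Step 2: move the marks into a tail, each prefixed with its
        // two-digit position counted from the end of the string
        StringBuilder head = new StringBuilder();
        StringBuilder tail = new StringBuilder("00");
        for (int i = 0; i < chars.length; i++) {
            if (isToneMark(chars[i])) {
                int pos = chars.length - i;
                tail.append(pos < 10 ? "0" : "").append(pos).append(chars[i]);
            } else {
                head.append(chars[i]);
            }
        }
        return head.append(tail).toString();
    }

    public static void main(String[] args) {
        System.out.println(key("เมษายน"));     // มเษายน00
        System.out.println(key("กุมภาพันธ์"));  // กุมภาพันธ0001์
    }
}
```

Because the tail starts with "00", a word without marks sorts before the same letters with a mark, and marks only matter once the heads are equal.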

March 13, 2014

Nice! Thanks.

What does it do with ๆ and ฯ ?

March 13, 2014

PS. For those that are into C, you can find a library of functions related to Thai here:

http://linux.thai.net/websvn/wsvn/software.libthai?

There's another generic sorting algorithm in trunk/src/thcoll. Some of the header files used are in trunk/include/thai

March 15, 2014

PS. For those that are into C, ...

Standard functions strcoll() and strxfrm() should do the comparison. If you want elaborate solutions, there is also ICU (International Components for Unicode).
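
For readers following the Java code in the original post, the closest standard-library counterpart to strcoll() is probably java.text.Collator. The sketch below is only an illustration: the class name ThaiCollatorDemo is made up, and whether Thai-specific rules are actually applied depends on the locale data shipped with the JDK in use, so no particular output is promised beyond ก sorting before ข.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class ThaiCollatorDemo {
    // Ask the JDK for a collator for Thai as used in Thailand
    static Collator thaiCollator() {
        return Collator.getInstance(Locale.forLanguageTag("th-TH"));
    }

    public static void main(String[] args) {
        Collator collator = thaiCollator();

        // compare() plays the role of strcoll(): negative, zero or positive
        System.out.println(collator.compare("กาน", "ขาร"));

        // A Collator is also a Comparator, so it can drive a sort directly
        String[] words = {"คน", "เกา", "ขาร", "กาน"};
        Arrays.sort(words, collator);
        System.out.println(Arrays.toString(words));
    }
}
```

If the printed order puts เกา after คน, the runtime is most likely falling back to generic rules rather than Thai ones.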

March 15, 2014

PS. For those that are into C, ...

Standard functions strcoll() and strxfrm() should do the comparison. If you want elaborate solutions, there is also ICU (International Components for Unicode).

Interesting. To set the language used by strcoll(), I have to call

char * setlocale (int category, const char *locale)

But if I ask a list of supported languages on my system, like this :

locale -a

I don't immediately see anything referring to Thai.

Any suggestions?

March 15, 2014

Ok, I've found the answer.

I had to add a new locale for Thai like this:

sudo localedef -f TIS-620 -i th_TH th_TH

After that th_TH.tis620 shows up when executing locale -a

After that this little program seems to work.

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* setlocale() returns NULL if the requested locale is not installed */
        char *result = setlocale(LC_COLLATE, "th_TH.tis620");

        printf("setlocale returned %s\n", result ? result : "(null)");
        printf("compare result: %d\n", strcoll("กาน", "ขาร"));
        return 0;
    }

March 15, 2014

I had to add a new locale for Thai like this:

sudo localedef -f TIS-620 -i th_TH th_TH

Useful to know. For Thai, I only had th_TH.utf8 on my machine until I executed that command.

After that this little program seems to work.

<snip>

printf ("compare result: %d\n", strcoll("กาน", "ขาร"));

One needs a bigger battery of tests to ensure that the preposed characters are being swapped properly. 10 years ago the standards (e.g. ISO 14651) relied on preposed vowels being preprocessed as in the original post. My test results (adapting your code) were:

Compare กาน v. ขาร: -1

Compare เกา v. คน: -3

Compare การ v. คน: -3

Compare คน v. เด็ก: -16

Compare คน v. ดัด: -16

I get the same difference values whether I use th_TH.utf8 or th_TH.tis620, though obviously the strings and selected locale have to be in the right encoding.
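
The same battery can be run against the comparison-string approach from the original post. This is a sketch only: ThaiKeyBattery, PAIRS and sign are names invented here, and the key method is a compact re-statement of the original Java, not the author's exact code. Only the sign of each result is meaningful, since compareTo() and strcoll() use unrelated magnitudes.

```java
class ThaiKeyBattery {
    // The five test pairs quoted above
    static final String[][] PAIRS = {
        {"กาน", "ขาร"}, {"เกา", "คน"}, {"การ", "คน"}, {"คน", "เด็ก"}, {"คน", "ดัด"}
    };

    // Compact version of the key from the original post
    static String key(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length - 1; i++) {
            if (chars[i] >= 0x0E40 && chars[i] <= 0x0E44) {   // leading vowel
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++;
            }
        }
        StringBuilder head = new StringBuilder();
        StringBuilder tail = new StringBuilder("00");
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] >= 0x0E47 && chars[i] <= 0x0E4C) {   // maitaikhu..thanthakhat
                int pos = chars.length - i;
                tail.append(pos < 10 ? "0" : "").append(pos).append(chars[i]);
            } else {
                head.append(chars[i]);
            }
        }
        return head.append(tail).toString();
    }

    // -1, 0 or 1 depending on how the two keys compare
    static int sign(String a, String b) {
        return Integer.signum(key(a).compareTo(key(b)));
    }

    public static void main(String[] args) {
        for (String[] pair : PAIRS) {
            System.out.println("Compare " + pair[0] + " v. " + pair[1] + ": " + sign(pair[0], pair[1]));
        }
    }
}
```

All five pairs come out negative with this key, which agrees in sign with the strcoll() results listed above.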

March 16, 2014

I found other useful commands:

locale -m

Gives an overview of the character sets.

localedef --help

shows the locale path; the files on this path are used as the basis for the comparison algorithm (strcoll).

(in my case /usr/lib/locale:/usr/share/i18n)

If you look at this file:

/usr/share/i18n/locales/th_TH

you can guess how the algorithm works

It compares collating elements, and each possible consonant with each possible preposed vowel is defined as a certain collating element. I don't think vowels and consonants are swapped by the algorithm; they are just seen as one "unit". This system would also work for other languages that might see 3 or more characters as being one collating element.

March 16, 2014

Author

I don't think vowel and consonants are swapped by the algorithm, they are just seen as one "unit".

Looking at the code, you're right that leading vowels and consonants are seen as one unit. However, they are actually swapped. Line 493 includes the comment

% Thai consonants, with leading vowels rearrangement

The following lines of code define the swapped vowel/consonant pairs.

BTW, following on from an earlier post, does anyone know how ๆ and ฯ should be sorted? All the references I've looked at ignore these characters.

March 16, 2014

If you look at this file:

/usr/share/i18n/locales/th_TH

you can guess how the algorithm works

It compares collating elements and each possible consonant with each possible preposed vowel is defined as a certain collating element. I don't think vowel and consonants are swapped by the algorithm, they are just seen as one "unit".

That's how it does the swapping - see http://www.unicode.org/reports/tr10/#Many_To_Many for details. The ISO 14651 'common template' ordering (the ancestor of the file /usr/share/i18n/locales/iso14651_t1_common that is present in many GNU-based systems) and the Default Unicode Collation Element Table (DUCET) are generated by the same program. The difference is that the common template uses symbols while DUCET just has numbers. The use of symbols makes tailoring for specific languages easier, though it requires the use of 'copy' in the locale definition files in a way not permitted by POSIX. The en_CA locale is one example where 'copy' is used in this very useful prohibited (and seemingly undocumented) manner.

The swapping mechanism is now present in the 'common template' ordering, though the file iso14651_t1_common in Ubuntu 12.04 (precise pangolin) lacks it.
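
To make the "one unit, yet swapped" point concrete, here is a toy sketch in Java. It is not how glibc or the UCA tables are implemented: ThaiElements, elements, weight and primaryKey are invented names, and the weights are simply the characters themselves. Each preposed vowel plus the following character is taken greedily as one collating element, and that element is given the weight "consonant, then vowel".

```java
import java.util.ArrayList;
import java.util.List;

class ThaiElements {
    static boolean isLeadingVowel(char c) { return c >= 0x0E40 && c <= 0x0E44; }

    // Greedy split: a preposed vowel and the character after it form one element
    static List<String> elements(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (isLeadingVowel(s.charAt(i)) && i + 1 < s.length()) {
                out.add(s.substring(i, i + 2));
                i += 2;
            } else {
                out.add(s.substring(i, i + 1));
                i++;
            }
        }
        return out;
    }

    // Toy weight of one element: for a two-character element, consonant first
    static String weight(String element) {
        if (element.length() == 2) {
            return "" + element.charAt(1) + element.charAt(0);
        }
        return element;
    }

    // Concatenating the weights yields the same string as swapping in place
    static String primaryKey(String s) {
        StringBuilder sb = new StringBuilder();
        for (String e : elements(s)) {
            sb.append(weight(e));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(elements("เกา"));    // [เก, า]
        System.out.println(primaryKey("เกา"));  // กเา
        System.out.println(primaryKey("เกา").compareTo(primaryKey("คน")) < 0);  // true
    }
}
```

So defining เก as a single element with a consonant-first weight and swapping the two characters before comparing end up producing the same ordering key in this simplified model.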

This system would also work for other languages that might see 3 or more characters as being one collating element.

Be very careful what you say. These collating elements are artifices; you do not want a search for ฒ to reject a sequence เฒ because it does not contain the collating element of ฒ. The Common Locale Data Repository (CLDR) has special 'collations' for searching. Most of them, including the one for Thai, switch these order reversing 'contractions' off for searching.
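
A small Java sketch of that pitfall, under the same toy model of greedy two-character elements (ThaiSearchPitfall and its method names are invented for illustration): a search over raw characters finds ฒ inside เฒ่า, while a search over collating elements does not, because ฒ only occurs inside the element เฒ.

```java
import java.util.ArrayList;
import java.util.List;

class ThaiSearchPitfall {
    // Greedy split into toy collating elements: preposed vowel + next character
    static List<String> elements(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean leadingVowel = c >= 0x0E40 && c <= 0x0E44;
            int len = (leadingVowel && i + 1 < s.length()) ? 2 : 1;
            out.add(s.substring(i, i + len));
            i += len;
        }
        return out;
    }

    // Search on the raw character sequence
    static boolean rawContains(String text, String needle) {
        return text.contains(needle);
    }

    // Search for the needle as a whole collating element
    static boolean elementContains(String text, String needle) {
        return elements(text).contains(needle);
    }

    public static void main(String[] args) {
        System.out.println(rawContains("เฒ่า", "ฒ"));      // true
        System.out.println(elementContains("เฒ่า", "ฒ"));  // false
    }
}
```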

But yes, this trick is used for Burmese, for although its preposed vowels are stored after the consonant, it effectively swaps final consonants and vowels around for ordering purposes. At least in Burmese one can easily identify final consonants. Lao has the worst of both worlds - it needs cluster aware swapping of preposed vowels and, for the better dictionaries, (further) swapping of vowels and final consonants. The simple (= greedy) rules for splitting a word into collating elements do not quite work for Lao unless you mark the syllable boundaries even when they're perfectly obvious. The CLDR Collation Algorithm allows tables to include the intelligence to spot these obvious cases in Lao, but one ends up with enormous tables.

March 16, 2014

I am still confused why the sorting worked for me after:

sudo localedef -f TIS-620 -i th_TH th_TH

Because if I do a dump of my memory, I clearly see the string is UTF-8 encoded.

So, I don't understand why

setlocale(LC_COLLATE, "th_TH.tis620");

worked.

Anyway, I tried:

sudo localedef -f UTF-8 -i th_TH th_TH

Now, th_TH.utf8 shows up when I do locale -a

(like on your system, Richard).

And I can change this line in my program from:

result = setlocale(LC_COLLATE, "th_TH.tis620");

to:

result = setlocale(LC_COLLATE, "th_TH.utf8");

There seems to be one guy behind all Thai language support in Linux (C programming).

His homepage is :

http://linux.thai.net/~thep/

Here you can read about what he says about ๆ and ฯ

http://linux.thai.net/~thep/tsort.html

On your linux system you can find them in the th_TH file under the punctuation marks:

% punctuation marks, ordered after ISO/IEC 14651

<U0E2F> IGNORE;IGNORE;<U0E2F>;IGNORE % THAI CHARACTER PAIYANNOI

<U0E46> IGNORE;IGNORE;<U0E46>;IGNORE % THAI CHARACTER MAIYAMOK

And here you see the weight he assigns to ๆ and ฯ marks in thcoll:

http://linux.thai.net/websvn/wsvn/software.libthai/trunk/src/thcoll/cweight.c

March 16, 2014

I am still confused why the sorting worked for me after:

sudo localedef -f TIS-620 -i th_TH th_TH

Because if I do a dump of my memory, I clearly see the string is UTF-8 encoded.

So, I don't understand why

setlocale(LC_COLLATE, "th_TH.tis620");

worked.

Did it work? For my test cases, with UTF-8 strings sorted according to the 8-bit locale, I get:

Compare กาน v. ขาร: -1
Compare เกา v. คน: 1
Compare การ v. คน: 1
Compare คน v. เด็ก: -1
Compare คน v. ดัด: -1

The two results with '1' as the result are wrong.

So, it works for the one example you gave, but the results are almost random. In particular, the results for the comparison กาน v. ขาร depend on the results of comparing the final letter.
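
The mismatch is easy to see by dumping the bytes. The Java sketch below is illustrative only: ThaiEncodings and its methods are invented names, and the TIS-620 encoder is hand-rolled from the fact that TIS-620 places U+0E01..U+0E3A at bytes 0xA1..0xDA and U+0E3F..U+0E5B at 0xDF..0xFB, so it does not depend on the JDK's optional charsets.

```java
import java.nio.charset.StandardCharsets;

class ThaiEncodings {
    // Hand-rolled TIS-620 encoder for ASCII and Thai characters only
    static byte[] toTis620(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean thai = (c >= 0x0E01 && c <= 0x0E3A) || (c >= 0x0E3F && c <= 0x0E5B);
            if (c < 0x80) {
                out[i] = (byte) c;
            } else if (thai) {
                out[i] = (byte) (c - 0x0E00 + 0xA0);
            } else {
                throw new IllegalArgumentException("not representable in TIS-620: " + c);
            }
        }
        return out;
    }

    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Space-separated upper-case hex dump
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(toUtf8("กาน")));    // E0 B8 81 E0 B8 B2 E0 B8 99
        System.out.println(hex(toTis620("กาน")));  // A1 D2 B9
    }
}
```

A th_TH.tis620 locale treats every byte as one character, so UTF-8 input looks like a run of 0xE0 (เ in TIS-620) and 0xB8 (ธ) bytes with the distinguishing third byte often falling outside the assigned Thai range, which would fit the near-random results.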

Here you can read about what he says about ๆ and ฯ

http://linux.thai.net/~thep/tsort.html

On your linux system you can find them in the th_TH file under the punctuation marks:

% punctuation marks, ordered after ISO/IEC 14651

<U0E2F> IGNORE;IGNORE;<U0E2F>;IGNORE % THAI CHARACTER PAIYANNOI

<U0E46> IGNORE;IGNORE;<U0E46>;IGNORE % THAI CHARACTER MAIYAMOK

The comment in the file is very misleading. ISO/IEC 14651 and DUCET still treat paiyannoi (ฯ) as though it were the final consonant in alphabetical order, coming between ฮ and ะ, and mai yamok (ๆ) as the first letter. As to the treatment of the third and fourth level weights, ISO/IEC 14651, at least up to 2008, seems to make a total mess out of a complicated system.

I couldn't find any evidence in the RID for the ordering of paiyannoi, and I suspect it doesn't exist. Evidence for mai yamok may exist, but I don't yet intend to flick through the RID looking for it. Part of the complication is that some claim that mai yamok should always be preceded by a space, in which case its sorting properties may simply flow from those of the character SPACE.

Now, there is an option in the collation rules which says, ignore punctuation (or ignore punctuation and symbols - there are conflicting views) until it comes to tie-breaking. Thep puts punctuation at the 3rd level in his tables and capitalisation at the 4th level, whereas the standard is to put capitalisation at the third level and 'ignored' punctuation at the fourth level. For sorting pure Thai the difference in ordering does not matter. This option to ignore punctuation etc. is called 'variable weighting'. I am not sure what should happen to paiyannoi and maiyamok when sorting Thai while this rule is in effect. The question is whether paiyannoi and maiyamok should be ignored to the same extent as spaces when one ignores spaces in sorting.

March 16, 2014

My result (using UTF-8 strings) worked both with

setlocale(LC_COLLATE, "th_TH.tis620");

and

result = setlocale(LC_COLLATE, "th_TH.utf8");

Compare กาน v. ขาร: -1
Compare เกา v. คน: -3
Compare การ v. คน: -3
Compare คน v. เด็ก: -33
Compare คน v. ดัด: -24

But I don't understand why the first one (th_TH.tis620) worked.

March 16, 2014

Sorry, I seem to have made a mistake.

The last 2 results are:

Compare คน v. เด็ก: -16
Compare คน v. ดัด: -16

Looks like there's something wrong with tis620 sorting on my system (it incorrectly assumes the characters are in utf-8).

Normally I just use utf-8 anyway.

March 16, 2014

Looks like there's something wrong with tis620 sorting on my system (it incorrectly assumes the characters are in utf-8).

What is your system?

It's just conceivable that your system is checking the consistency of the locale settings - they're generally not required to work if thoroughly inconsistent. It may be worth setting locale for category LC_ALL rather than just LC_COLLATE and seeing what happens.

March 17, 2014

Ubuntu 12.04 (64 bit).

LC_ALL gives the same results.

March 18, 2014

Author

Nice! Thanks.

What does it do with ๆ and ฯ ?

It gets them wrong. It treats them as ordinary characters collating according to their Unicode value.

For example,

เบะปาก
เบา
เบาฯ
เบาๆ
เบ้า
เบาความ

(which I think is probably the correct alphabetical order; เบาฯ is a made-up word for testing purposes)

sorts as

เบะปาก
เบา
เบ้า
เบาความ
เบาฯ
เบาๆ

Fortunately, that's good enough for my web application.
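
For anyone who wants to reproduce this, here is a self-contained sketch. ThaiMarksDemo, key and sortByKey are names chosen for this example, and key is a compact re-statement of the comparison-string method from the first post. Neither ๆ (U+0E46) nor ฯ (U+0E2F) falls in the leading-vowel or tone-mark ranges, so both stay in the head and compare by code unit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ThaiMarksDemo {
    static String key(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length - 1; i++) {
            if (chars[i] >= 0x0E40 && chars[i] <= 0x0E44) {   // leading vowel
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++;
            }
        }
        StringBuilder head = new StringBuilder();
        StringBuilder tail = new StringBuilder("00");
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] >= 0x0E47 && chars[i] <= 0x0E4C) {   // maitaikhu..thanthakhat
                int pos = chars.length - i;
                tail.append(pos < 10 ? "0" : "").append(pos).append(chars[i]);
            } else {
                head.append(chars[i]);
            }
        }
        return head.append(tail).toString();
    }

    // Returns a sorted copy, ordered by comparison string
    static List<String> sortByKey(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparing(ThaiMarksDemo::key));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("เบาๆ", "เบาความ", "เบ้า", "เบาฯ", "เบา", "เบะปาก");
        // Prints: [เบะปาก, เบา, เบ้า, เบาความ, เบาฯ, เบาๆ]
        System.out.println(sortByKey(words));
    }
}
```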

LibreOffice (which I presume uses the Linux sorting functionality) sorts as

เบะปาก
เบา
เบ้า
เบาฯ
เบาๆ
เบาความ

which seems wrong to me.

March 18, 2014

I agree that
1) เบา 2) เบ้า 3) เบาๆ
is the wrong order. After looking through the RID from ก to ณ I found two clear examples, the orderings
1) งัก ๆ, งั่ก ๆ 2) งั่ก
and
1) งัง ๆ 2) งั่ง

Now this argues that the most significant weight of ๆ is secondary or less. The simplest argument is that one should be using a 'variable' weighting (alternate=shifted) as opposed to (alternate=non-ignorable). LibreOffice is probably using (alternate=non-ignorable).
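
As a rough illustration of what a 'shifted' treatment would do to the six-word test list from the previous post, here is a Java sketch. It is a crude stand-in, not an implementation of the Unicode Collation Algorithm: ThaiShiftedSketch, stripMarks and compare are invented names, the key is the comparison string from the first post, and ๆ and ฯ are simply dropped for the main comparison and only consulted to break ties. Spaces are not treated specially.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ThaiShiftedSketch {
    static String key(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length - 1; i++) {
            if (chars[i] >= 0x0E40 && chars[i] <= 0x0E44) {   // leading vowel
                char c = chars[i];
                chars[i] = chars[i + 1];
                chars[i + 1] = c;
                i++;
            }
        }
        StringBuilder head = new StringBuilder();
        StringBuilder tail = new StringBuilder("00");
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] >= 0x0E47 && chars[i] <= 0x0E4C) {   // maitaikhu..thanthakhat
                int pos = chars.length - i;
                tail.append(pos < 10 ? "0" : "").append(pos).append(chars[i]);
            } else {
                head.append(chars[i]);
            }
        }
        return head.append(tail).toString();
    }

    // Drop paiyannoi (U+0E2F) and mai yamok (U+0E46)
    static String stripMarks(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c != 0x0E2F && c != 0x0E46) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Main comparison ignores the two marks; the full key breaks ties
    static int compare(String a, String b) {
        int primary = key(stripMarks(a)).compareTo(key(stripMarks(b)));
        return primary != 0 ? primary : key(a).compareTo(key(b));
    }

    static List<String> sort(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(ThaiShiftedSketch::compare);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("เบาๆ", "เบาความ", "เบ้า", "เบาฯ", "เบา", "เบะปาก");
        // Prints: [เบะปาก, เบา, เบาฯ, เบาๆ, เบ้า, เบาความ]
        System.out.println(sort(words));
    }
}
```

With the marks demoted to tie-breakers, เบาฯ and เบาๆ land directly after เบา and before เบ้า, which is the order proposed as probably correct in the previous post.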

I can't find any evidence for the weighting of paiyannoi.
