Posted

How consistent are Thais at counting the number of words in a sentence or a paragraph? How natural is the task?

I ask because of the problem of splitting Thai text into words, which many word processors and line layout systems attempt to do. One can tell these systems where word boundaries are by inserting what LibreOffice calls a 'no-width break', and can often tell them where they aren't by inserting what LibreOffice calls a 'no-width no-break'. Techies will know these characters as ZERO WIDTH SPACE (ZWSP, U+200B) and WORD JOINER (WJ, U+2060). From next week, the Unicode standard will only allow WJ to mark places where a new line cannot be started; it will definitely not be a valid indicator that the letters on either side belong to the same word. The Unicode Technical Committee (UTC) claim that it is already not such an indicator, but that was not the understanding of many of the people who write the software. Consequently, this change is not going through a process of international approval, but has been decided upon by the UTC, largely on the basis that it doesn't upset algorithms which don't work on Thai.
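To make the two characters concrete, here is a rough Python sketch of counting words in text where the writer has marked every boundary by hand. It assumes every word boundary carries a ZWSP (or an ordinary space), and it simply drops WJ as invisible; the function name and the sample string are made up for illustration, and no real word processor is claimed to work this way.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: "a word boundary is here"
WJ = "\u2060"    # WORD JOINER: "do not break the line here"

def split_on_marked_boundaries(text):
    """Split text at ZWSP and ordinary whitespace, dropping WJ.

    Assumption: every word boundary has been marked by hand, so no
    dictionary lookup or guessing is done here.
    """
    pieces = []
    # ZWSP is not whitespace as far as str.split() is concerned,
    # so turn it into a plain space first.
    for chunk in text.replace(ZWSP, " ").split():
        cleaned = chunk.replace(WJ, "")
        if cleaned:
            pieces.append(cleaned)
    return pieces

# Hypothetical sample: WJ inside one intended word, ZWSP between words.
sample = "เร็ว" + WJ + "ด่วน" + ZWSP + "มาก"
words = split_on_marked_boundaries(sample)
print(len(words), words)
```

The interesting case is text with no ZWSP at all, where the software has to guess the boundaries; that is where a "no boundary here" hint would change the count.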

One argument for preserving the functionality is that it is needed so that the number of words can be counted correctly when software attempts to find word boundaries without any help from these 'no width' indicators. However, I am not sure that Thais agree on how many words there are in a paragraph. Europeans get regular training through the use of spaces, but Thais do not.

Posted

I think with educated people the counts are going to be pretty consistent. With the less educated, less so.

One of the major problems is the treatment of คำคู่ and other words that may be a single word in their own right or may be a collocation. For example, how many words is เร็วด่วน?

Thankfully the issue isn't as much of a problem as it is with Vietnamese, where it has traditionally been taught that Vietnamese is a monosyllabic language, even though there clearly are polysyllabic words.

As for "software ... to find word boundaries without any help from these ... indicators", the reality is that even when backed by a dictionary, the available parsing algorithms are not particularly good, especially where the text contains words which are not in the dictionary. See, for example, thai-language.com's and thai-notes.com's "bulk lookup", and thai2english.com's equivalent function.
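For anyone who wants to see why the dictionary matters so much, below is a minimal Python sketch of greedy longest-match segmentation. This is a textbook approach and only an assumption on my part; I am not claiming it is what any of the sites above actually use. The toy dictionaries and the function name are invented for the example.

```python
def longest_match(text, dictionary):
    """Greedy longest-match segmentation against a set of known words.

    Characters that match no dictionary entry are collected into a
    single 'unknown' piece rather than being guessed at.
    """
    max_len = max(map(len, dictionary), default=0)
    words, unknown, i = [], "", 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                if unknown:
                    words.append(unknown)
                    unknown = ""
                words.append(text[i:i + n])
                i += n
                break
        else:
            # No entry starts here: extend the unknown run by one character.
            unknown += text[i]
            i += 1
    if unknown:
        words.append(unknown)
    return words

text = "เร็วด่วนมาก"
with_compound = {"เร็ว", "ด่วน", "เร็วด่วน", "มาก"}
without_compound = {"เร็ว", "ด่วน", "มาก"}

print(longest_match(text, with_compound))     # compound kept as one word
print(longest_match(text, without_compound))  # compound split in two
```

The same string comes out as two words or three depending only on whether the dictionary lists เร็วด่วน as an entry, so the software's count inherits whatever decision the dictionary compiler made.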

Posted

My fundamental question, though, is whether the counting error due to unrecognised or misidentified words is significant compared to the variation resulting from different people doing the counting.
