Thai In Openoffice On Ubuntu Lucid Lynx


Richard W

I can't get Thai spell-checking to work in the word processor of OpenOffice.org 3.2.0 on Ubuntu 10.04 Lucid Lynx. I have installed the Thai dictionary package myspell-th Version 1:3.2.0-3ubuntu3.1, and its installation set it up for hunspell. The problem seems to be that the words provided to the spell checker are defined as columns, so that the spell-checker tries to check the 4-character word for 'island' as three or four words! The default language for complex-text-layout is set to Thai, and I get 'Thai justification' of Thai text (i.e. spaces are clearly inserted between characters rather than just growing interphrase gaps).

Am I missing a trick, or does Thai spell-checking not work for this combination of OS and OpenOffice.org? The interface language is set up for English - I do not yet need it to be set up for Thai.

Separating words by ZWSP, plain space and even having one word per paragraph all fail to help.


  • 2 months later...

openoffice.org-l10n-th: office productivity suite -- Thai language package

Does this help or not? Please give specifics; the devs need a little help.

The labelling on the tin says not. To quote from http://packages.ubuntu.com/lucid/openoffice.org-l10n-th :

This package contains the localization of OpenOffice.org in Thai. It contains the user interface, the templates and the autotext features. (please note that not all this is available for all possible languages). You can switch user interface language using the locales system.

Spelling dictionaries, hyphenation patterns, thesauri and help are not included in this package. There are some available in separate packages (myspell-*, openoffice.org-hyphenation-*, openoffice.org-thesaurus-*, openoffice.org-help-*)

If you just want to be able to spellcheck etc. in other languages, you can install extra dictionaries/hyphenation patterns/thesauri independently of the language packs.

I gave it a try, but it didn't help. What was dispiriting was that almost none of the spell-checking interface was localised to Thai. I have myspell-th installed, but it doesn't provide anything useful.

I hope it is permitted to give an example in this forum.

As an example, I tried the line ไม่ รู้ ว่า จะ ทำ อย่าง ไหน with three variations - spaces between words, 'zero-width space' (ZWSP) between words, and nothing between words. In each case, the spell-checker saw it as a sequence like ม่ รู้ ว่ ย่ , of which only รู้ is a word - it was recognised as spelt correctly.

Now, Thai may just be a tricky problem because no-one expects to be able to force Thais to use ZWSP, although I get the impression that Cambodians have accepted it. Thai needs a dictionary to split text between lines at word boundaries, but that doesn't help with misspelt words. The task isn't impossible - Word manages it. However, at least one version of OpenOffice seems to have addressed the issue by defining Thai words to be consonants plus associated marks, and these seem to be the words supplied to the spell-checking system.
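The symptom can be imitated outside OpenOffice with a small sketch. It rests on an assumption of mine, not on OpenOffice's actual code: that a 'word' is taken to be a base character followed by at least one combining mark (Unicode category Mn). The function name mark_clusters is hypothetical.

```python
import unicodedata


def mark_clusters(text):
    """Return each base character that carries one or more combining marks.

    Assumption: a combining mark is any character of Unicode category Mn,
    and the text is well formed (marks always follow a base character).
    """
    clusters = []
    current = ""
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            # Attach the mark to the character collected so far.
            current += ch
        else:
            # A new base character starts; keep the previous one only
            # if it picked up at least one mark.
            if len(current) > 1:
                clusters.append(current)
            current = ch
    if len(current) > 1:
        clusters.append(current)
    return clusters


sample = "ไม่ รู้ ว่า จะ ทำ อย่าง ไหน"
print(" ".join(mark_clusters(sample)))
```

On the sample line this should pick out the four mark-bearing clusters, which is the same shape as the sequence the spell-checker reported.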


I did some more digging, and this time I came up with a partial answer - Spellchecker API isn't appropriate for languages without any space between words like Thai. Interestingly it's being worked on for Khmer as well, so it looks as though they may not be working with ZWSP. I still suspect there are other text-breaking issues fouling up Thai spell-checking. So, OpenOffice does not substitute for Word in Thai!

You may want to check the successor to OpenOffice, LibreOffice. A Google search shows it has an add-on for the Thai language. Both have deb packages. I'm not sure OpenOffice has a future any more.


  • 3 weeks later...

The only Thai-specific improvement I can see in LibreOffice 3.3 over OpenOffice.org 3.2 is that more of the spelling interface has been translated to Thai - and that might be in OpenOffice.org 3.3. The font sizing seems worse for LibreOffice than OpenOffice - as though someone had assumed one pitch (as opposed to x-size) fitted all scripts. It's actually slightly worrying that I couldn't confirm that LibreOffice has inherited OpenOffice's bug and to-do lists. Apart from that, Thai spell-checking is just as dysfunctional.


  • 6 months later...

I thought I had a solution, but I need some help.

My thought for getting some spell-checking was as follows:

1) Separate my Thai words with zero-width space.

2) Extract content.xml and insert line breaks outside paragraphs and headings, e.g.

unzip -p file.odt content.xml | spaceit > content.xml

where spaceit is a program of my own.

3) Edit content.xml with emacs using a Thai spell checker.

4) Put the edited file back

zip file.odt content.xml

When I edit the file in LibreOffice, the line breaks inserted in Step 2 are removed.
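For anyone who would rather script Steps 2 and 4 than use unzip and zip, here is a minimal sketch using Python's standard zipfile module. The function names read_member and replace_member are hypothetical. Unlike zip, zipfile cannot update a member in place, so the sketch writes a new archive; copying each entry's original ZipInfo keeps the member order and compression settings of the source file.

```python
import zipfile


def read_member(odt_path, member="content.xml"):
    """Return the raw bytes of one member of an .odt (zip) file."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read(member)


def replace_member(src_path, dst_path, new_bytes, member="content.xml"):
    """Copy src_path to dst_path, substituting new_bytes for one member."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            # Read the old data before the entry is written to the new archive.
            data = src.read(info.filename)
            if info.filename == member:
                data = new_bytes
            # Reusing the original ZipInfo keeps its name and compression type.
            dst.writestr(info, data)
```

A typical round trip would be to call read_member, spell-check the result in an editor, and then call replace_member with the corrected bytes and a new output file name.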

I've been having trouble with Step 3. I've proceeded as follows:

a) Install myspell-th, which is installed in a hunspell directory.

b) Remove the qualifier from TIS620 in the th_TH.aff file. (Grrr!)

c) Add a dictionary definition to ispell-local-dictionary-alist using customize-variable. It took me a while to realise that I should enter the characters in Thai script and specify the encoding as utf-8. (I probably need to tweak the definition to make a combined Thai/English dictionary.)

c) Hunspell then wasn't playing nicely with Emacs. I had to fix Hunspell bug 'Bad UTF-8 char count in pipe mode' (ID: 3178449), originally raised as Emacs bug #7781, '23.2.91; ispell problem with hunspell and UTF-8 file', in the GNU bug report logs.

d) I can then step through the spelling errors in Emacs, but I don't get correction suggestions. This looks like a Thai script or UTF-8 problem. I do get the prompts when I use an English dictionary.

Can anyone help me with problem (d)?

A possible solution is to avoid emacs and use the Hunspell spelling correction program directly, but it isn't very friendly when its suggested corrections omit the correct correction.


  • 2 weeks later...

hunspell -a is messed up for UTF-8 input. Fixing the problem in emacs is complicated. I've tried dropping the -a qualifier, but when I do that, Hunspell uses a subtly different interface, so though it works with Emacs in a simple test case, it fails with content.xml.

Using Hunspell on its own fails because it can't display very long lines. I also get the impression I hit some of its capacity limits.


I've now identified the relevant problem with Hunspell, at least for Versions 1.2.8 and 1.3.2. The bugs are all in the pipe_interface() function in the hunspell.cxx for the stand-alone program. Firstly, the method get_tokenpos returns an offset in bytes, but it needs to be converted to an offset in characters. Secondly, when generating suggestions the word checked is converted to the dictionary's encoding (typically TIS-620 for Thai, as in Version 1:3.2.0-3ubuntu3.1 of myspell-th) if filter_mode is NORMAL, but not if it is PIPE. It should be converted in both cases.
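The first of those two bugs is easy to illustrate. The sketch below is Python, not the Hunspell C++ code, and the function name byte_to_char_offset is hypothetical; it only shows the conversion from a byte offset to a character offset that the pipe interface needs for UTF-8 input.

```python
def byte_to_char_offset(line: bytes, byte_offset: int,
                        encoding: str = "utf-8") -> int:
    """Convert a byte offset into an encoded line to a character offset."""
    # Decoding only the prefix counts the characters that precede the token.
    return len(line[:byte_offset].decode(encoding))


line = "กา เกาะ".encode("utf-8")
byte_pos = line.index("เกาะ".encode("utf-8"))
print(byte_pos, byte_to_char_offset(line, byte_pos))
```

Each Thai character takes three bytes in UTF-8, so the second word starts at byte 7 but at character 3; a client such as Emacs that counts characters will highlight the wrong text if it is handed the byte offset.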

There's a slight fault on the Emacs side, in ispell.el (from Version 1.4.0ubuntu2 of package dictionaries-common) - function ispell-show-choices needs to call fit-window-to-buffer just after the call to switch-to-buffer so it will display Thai choices properly.

I've now used the scheme outlined above to do some Thai spell-checking. Code is available on request. I'm not sure I'll have formalised the bug reports before the new year. There is some odd behaviour in the generation of suggestions for Thai, so I may have more than just the matters mentioned here to report.


  • 2 weeks later...

I now have a nice collection of bug reports:

On Hunspell:

Bad UTF-8 char count in pipe mode - ID: 3178449

No Encoding of Word for Suggestions in Piped Mode

Multidictionary guesses dictionary for suggestions

Hunspell 1.2.8 Groups Thai TIS-620 Chars in Lower/Upper Case Pairs

On the Thai dictionary:

th_TH Affix File Inadequate for Hunspell

Corrected code (at least, as far as Hunspell for spell-checking via Emacs is concerned) is pointed to by the last of the four Hunspell bug reports.


  • 1 month later...

I have long bemoaned lack of an effective Thai spell checker in LibreOffice in Ubuntu. So I was very happy to see that you were addressing this bug. I am not a technical type. I understand though from your last post that you have fixed the problem, and if I downloaded the files you posted and ran the configure file that it would fix the problem for me too. I tried this but no change. Is there a way to fix it? I am using Ubuntu 11.10. Will Ubuntu eventually make an update available to fix this problem?

JiangWade, how far did you get with the process? When you say you 'ran the configure file', I take it you rebuilt Hunspell Version 1.2.8 with my changes, for which you would need to run configure and then make, and then look for the hunspell executable in the src/tools subdirectory. (If you didn't get this far, I will happily walk you through the process. It may seem intimidating, but I don't recall any complexities.)

The next thing you need to do is to ensure that your setting of PATH picks up the new executable. (I presume you're loath to mess about with stuff in /usr/share.) For example, on my machine the environment variable PATH has the value

/home/richard/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

and in /home/richard/bin hunspell is set up as a link to my corrected src/tools/hunspell.
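If you are not sure which hunspell your PATH actually picks up, a quick check along the following lines may help. It is only a sketch: the function name resolve_on_path is hypothetical, and it relies on nothing beyond Python's standard shutil.which and os.path.realpath.

```python
import os
import shutil


def resolve_on_path(program="hunspell", path=None):
    """Return the real file a command name resolves to, or None.

    path=None means "use the PATH environment variable".
    """
    found = shutil.which(program, path=path)
    # realpath follows symbolic links, so a link in ~/bin is traced
    # back to the rebuilt executable it points at.
    return os.path.realpath(found) if found else None


print(resolve_on_path())
```

If the printed path is not your rebuilt src/tools/hunspell, the directory holding your link is probably missing from PATH or comes after the system directories.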

As a test, try issuing the following two lines at a terminal followed by ctrl/C or ctrl/D:

hunspell -a -d th_TH
อไร

I get the outputs

@(#) International Ispell Version 3.2.06 (but really Hunspell 1.2.8) hacked by JRW
& อไร 4 0: อุไร, อะไร, อมร, อรไท

Note that any ad lib changes in the identification code have to go after the closing parenthesis - that threw me for a while.
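For reading that output by program rather than by eye, here is a minimal sketch of a parser for the two result-line shapes I am confident of in the ispell-style pipe format: '& word count offset: suggestions' and '# word offset'. The function name parse_ispell_line is hypothetical, and other line types (including the version banner) are simply ignored.

```python
def parse_ispell_line(line):
    """Parse one result line of ispell-style pipe output.

    Returns (word, offset, suggestions) for '&' and '#' lines, else None.
    """
    line = line.rstrip("\n")
    if line.startswith("& "):
        # Shape: & original count offset: miss, miss, ...
        head, _, tail = line.partition(": ")
        _, word, _count, offset = head.split(" ")
        return word, int(offset), tail.split(", ")
    if line.startswith("# "):
        # Shape: # original offset   (misspelt, no suggestions)
        _, word, offset = line.split(" ")
        return word, int(offset), []
    return None


print(parse_ispell_line("& อไร 4 0: อุไร, อะไร, อมร, อรไท"))
```

The offset returned here is whatever Hunspell reports, so with an unpatched Hunspell and UTF-8 input it may be a byte count rather than a character count.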

What is your locale set to? That might be a cause of problems. To find your locale, issue the commands:

env | grep LC
env | grep LANG

On my machine, they yield:

LANG=en_GB.utf8
GDM_LANG=en_GB.utf8
LANGUAGE=en_GB:en

The important bit is the '.utf8'.
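The same check can be scripted. This sketch assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) for the character-set part of the locale; the function names effective_ctype_locale and uses_utf8 are hypothetical.

```python
import os


def effective_ctype_locale(env):
    """Return the locale name that governs character encoding."""
    # Assumed POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return ""


def uses_utf8(env):
    """True if the effective locale names a UTF-8 codeset."""
    locale_name = effective_ctype_locale(env)
    # Take the part between '.' and any '@modifier', e.g. 'utf8' in en_GB.utf8.
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


print(effective_ctype_locale(os.environ), uses_utf8(os.environ))
```

Both spellings of the codeset, '.utf8' and '.UTF-8', are treated as equivalent here.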

For running emacs, what do you have in your .emacs file relating to ispell? One thing I didn't mention in my README.txt (because I wasn't aware of it) was that to set up ispell-related variables for emacs, you first need to start ispell (M-x ispell). The emacs commands I used (commands and responses) are:

M-x find-file content.xml          # Load file to edit
M-x ispell                         # Load ispell functions (needed for
                                   # M-x customize-variable ispell-program-name)
q y                                # The initial spell process will have to be killed
                                   # at some point if it is using the wrong dictionary.
M-x ispell-change-dictionary thai
M-x load-file ispell.el            # To get window resizing - your set-up may not need this.
M-x ispell                         # Away we go - x to break out of spell checking,
                                   # q to force a restart.

As to who will fix what, I don't think Ubuntu will fix much directly. Moving fixes downstream is the best I can hope for. This may result in 'Hunspell 1.2.8 Groups Thai TIS-620 Chars in Lower/Upper Case Pairs' being fixed, possibly by simply moving to a later version of Hunspell. What version is configured for Oneiric? Getting 'th_TH Affix File Inadequate for Hunspell' fixed is trickier - I may have to find another way of getting that fixed. Adding your name to those affected may help.

As for fixing Thai spell-checking in LibreOffice directly, I shall check on related progress. I thought Javier Solá's work for Khmer would fix it, but I am suddenly not so sure.

Edited by Richard W

For LibreOffice, it looks as though someone is going to have to popularise a fully working Thai spell checker for LibreOffice. Professionally, it's doable - SIL implemented Graphite font support for a version of OpenOffice, a Burmese company maintained it, and finally it was adopted by OpenOffice. The Khmer spell checker (depending on ZWSP) is already functional. There appears to be a licensing issue with the relevant ICU tool, from what I can glean from various discussions around the internet.

Apparently similar problems arise with Tibetan.


This may result in Hunspell 1.2.8 Groups Thai TIS-620 Chars in Lower/Upper Case Pairs being fixed, possibly by simply moving to a later version of Hunspell.

This happened, apparently today (Thursday 9 February 2012), for Lucid (10.04) at least, presumably as part of the upgrade of the LibreOffice supported in Lucid from 3.3.2 to 3.4.5.

Unfortunately for me, LibreOffice Version 3.4.5 drops support for Graphite fonts using Version 1.0 of the Silf table - 'SIL: Are there any such fonts in the wild?' and mangles fonts using 'pseudoglyphs' (bug and fix reported to SIL). On the other hand, a bug that stopped my upgrading my old fonts to use later table versions has now been fixed in the trunk version of the Graphite compiler.


Thanks, I am a little lost in the developers language though. Is there a way we can "fix" the Thai spellchecker now or will we have to wait until LibreOffice 3.5.1 is released?

Yes, I believe, in principle. 'Just' download the Ubuntu package libreoffice-core (corresponding to the version you have), make the change in breakiterator_unicode.cxx, rebuild the package, and substitute the i18npool.uno.so it creates for the /usr/lib/libreoffice/basis3.4/program/i18npool.uno.so. There may be complications - you may have to install a lot of packages, you may have to choose between stripped and unstripped i18npool.uno.so (the difference may not matter), and, just possibly, you may find you have the wrong compiler version. Back up any files you replace until you know it is working fine!

This process should be repeated if a security patch comes out.

I'll give it a try, and report back. I've been using a full set of source code and GIT repository, and that takes up 8GB!


I have to report total failure. The command

apt-get source libreoffice-core

failed - 'E: Unable to find a source package for libreoffice'.

I therefore downloaded the Version 3.4.5 source code (the current version of LibreOffice for Ubuntu 10.04 being 1:3.4.5-0ubuntu1~lucid1) from the LibreOffice site and compiled and linked that, but when I substituted the recompiled i18npool.uno.so, Ubuntu-supplied LibreOffice crashed at start-up. I've now got a slight modification of Version 3.4.5 that seems to have working Thai spell-checking, but any Ubuntu tweaks are missing.


JiangWade pointed out that one can download 'source files' from Oneiric, and I was able to do that from Lucid by adding the line

deb-src http://archive.ubuntu.com/ubuntu/ oneiric-updates main restricted

to /etc/apt/sources.list.

The structure of what was downloaded by running

sudo apt-get update
apt-get source libreoffice

is unusual. The configure file only works if run from the downloaded directory libreoffice-3.4.4/libreoffice-build . It then diverges from the usual pattern - the next instruction should be ./download, not make. Unfortunately, ./download does not work - it fails to download a couple of files.

It looks as though the debianised source is no longer available in an easily usable form. However, as part of the build process was to download most of the source code from the LibreOffice git repository, the loss is perhaps not too great, especially as LibreOffice 3.5.1 appears to be scheduled for April this year. Instructions for a straight source download are available at http://www.libreoffice.org/developers-2/ . It may be necessary to package up dictionaries as extensions - I'm checking that out now.


The instructions at http://www.libreoffi...g/developers-2/ need some comment.

I don't know if

sudo apt-get build-dep libreoffice

works. I now seem to have all the tools I need, so I merely get an error message. There is a list of tools needed at http://wiki.document...ld_Dependencies . The fall-back list there omitted g++ and ant, and for the configuration options I have tried, I also needed junit4.

The instructions don't say what to do with the .bz2 files you download. You untar the libreoffice-bootstrap-*.bz2 file you download, and move the other *.bz2 files to the libreoffice-bootstrap-*/src directory.

The instruction for autogen should be enhanced, if you have a multi-core processor, to

./autogen.sh --with-max-jobs=2 --with-num-cpus=2

with 2 replaced by the number of cores in your machine.

After successfully running make for the first time, you should then apply the correction to breakiterator_unicode.cxx (http://cgit.freedesk...c146f06c0e26932) and rerun make.

You may need to package up the Thai (and other) spelling dictionaries as an extension (.oxt file). There ought to be a simpler way to do things! The refinements to th_TH.aff I mentioned earlier in the thread are relevant.

If anyone needs the complete corrected breakiterator_unicode.cxx, or the Thai spelling dictionaries as an extension, just PM me. Note that .oxt files are just zip files with unpublished(?) rules as to what files go in them.

I suspect that dictionary or interface language support should be available by use of, for example, --with-lang="en-GB th" on the autogen.sh command, but all methods of getting Version 3.4.5 are broken in that respect.


  • 2 weeks later...

Just a note to confirm that LibreOffice Version 3.5.1 Release Candidate 1 contains a workable Thai spelling checker. Note though that my proposed changes to th_TH.aff do improve it significantly. Also, you can build 3.5.1 (just after the release candidate) with the options

./autogen.sh  --with-max-jobs=2 --with-lang="en-GB th"

This gives you a Thai dictionary, a British English dictionary, and the option of having menus in Thai, English, or American. You don't need to install the C changes to enable Thai spelling.


  • 2 weeks later...

Installed 3.5.1 and the Thai spell check worked. Just having any spell-checker is great, but there are still some problems. Some words (e.g. แคลิฟอร์เนืย) that you would think would be in the dictionary either aren't there or aren't being seen by the spell-checker. There doesn't seem to be any way to automatically add longer words (more than 3-4 characters) like this to the dictionary, and even when I cut and paste them into the standard dictionary they still show up as misspelled.


There are still quite a few minor problems to iron out - so far we've had the LibreOffice experts doing the work and making the massive difference in functionality.

แคลิฟอร์เนืย is a nasty problem. There is a simple solution, but I will show you the brute force approach, for you may need that in other cases. However, it seems you do need to fix th_TH.aff - see th_TH Affix File Inadequate for Hunspell . You've already downloaded my fixed version of th_TH.aff.

The first problem is that you're fighting the ICU word-break locator ('break iterator'), which quite rightly doesn't know this word, but breaks it up. The solution is to join the bits by respelling it as แค<WJ>ลิ<WJ>ฟ<WJ>อร์เนืย where '<WJ>' is U+2060 WORD JOINER, which you enter in LibreOffice through the sequence insert, formatting mark, No-width no break. (That's what the WORD JOINER character is for - overruling word-break locators.) Be wary of accidentally inserting No-width optional break, which does display differently if you enable the display of formatting marks. If you then want to keep this weird spelling, you can then add the sequence to your personal dictionary in the normal fashion. The down side is that you have to enter the WJs whenever you want to use that weird spelling. Alternatively, you can accept the offer to correct it to แคลิฟอร์เนียม (the chemical element), take off the final ม and discover that แคลิฟอร์เนีย is in the spelling dictionary!

The simple solution was to notice that เนืย is impossible in Thai - you typed อื for อี!


Sorry I misspelled my example in my last posting. I meant แคลิฟอร์เนีย. I went back to my document and, to my surprise, LibreOffice is not marking it as misspelled even without WJ inserted. Maybe I was dreaming. Anyway, when I intentionally misspell it, it shows the entire word as misspelled, but does not give me the correct spelling option. However, just having the indication that it is misspelled is enough for me to figure out the correct spelling. I have typed เฮมมิงเวย์ (the author) and not surprisingly it shows up as misspelled. I would guess it is not in the dictionary. There is, as you note, no way to put it in the personal dictionary automatically without using the WJ, and it is not very practical for me to use the WJ. When I cut and paste it in, the word still shows up as misspelled. I don't understand why the spell checker can see แคลิฟอร์เนีย from the spelling dictionary but can't see เฮมมิงไวย์ from my personal dictionary.

Another strange problem, if I type ๕ เปอร์เซ็นต์ของ..., เปอร์เซ็นต์ shows misspelled. If I take the space out after ๕ then it shows as spelled correctly.


The big problem with Thai spell-checking in LibreOffice is that segmenting the text into words and then checking the spelling are two independent processes. Although the segmentation logic primarily works from a long list of words, this list is currently independent of the spell checker's knowledge of the language. This is not good, but it needs some careful programming work and possibly changing of LibreOffice concepts to ensure segmentation and spell-checking use the same list. This is why it is sometimes necessary to tell the segmentation logic where the word breaks are (by inserting ZWSP - menu sequence insert, formatting mark, no-width optional break, short cut ctrl+/) or are not (by inserting WJ - menu sequence insert, formatting mark, no-width no break, no predefined short cut).
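For text prepared outside LibreOffice, the two formatting marks are just ordinary characters, U+200B and U+2060. The sketch below shows one way to insert and remove them; the helper names are hypothetical, and the split of the example name into pieces is my own guess for illustration, not what the break iterator actually produces.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: says a word break is allowed here
WJ = "\u2060"    # WORD JOINER: says no break may be placed here


def mark_breaks(words):
    """Join separate words with ZWSP so the break positions are explicit."""
    return ZWSP.join(words)


def glue(pieces):
    """Join the pieces of one word with WJ so it is not split up."""
    return WJ.join(pieces)


def strip_marks(text):
    """Remove both marks, giving back the plain spelling."""
    return text.replace(ZWSP, "").replace(WJ, "")


# Hypothetical split of the name into pieces, for illustration only.
pieces = ["เฮม", "มิง", "เวย์"]
glued = glue(pieces)
print(len("".join(pieces)), len(glued), strip_marks(glued) == "".join(pieces))
```

Stripping the marks before comparing or searching text is worth remembering, since a glued spelling and a plain spelling are different strings even though they look the same.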

For example, both segmentation and spell-checking know แคลิฟอร์เนีย, so that word is accepted without complaint. However, when นี is mistyped as นื, the segmentation does not recognise the misspelt word, and unsurprisingly comes up with an unhelpful segmentation. I can't see any way round this problem but telling the system where the word breaks should be.

With a long foreign name such as เฮมมิงเวย์, what you can reasonably hope for is to be able to override the erroneous segmentation, add the word to the user's personal dictionary, and then have subsequent occurrences accepted as correct without further ado. Until the segmentation code uses the personal dictionary, we are stuck with adding WJ to all the occurrences.

๕ เปอร์เซ็นต์ของ is another matter. First, if you delete the spaces, the string of characters is taken to be a word containing a number, and is therefore not spell checked. (You can override this by the tick box at tools, options, language settings, writing aids, check words with numbers.) Secondly, there is an error in the dictionary file th_TH.aff. เปอร์เซ็นต์ and 62 other entries have trailing spaces in that file. Consequently, เปอร์เซ็นต์ gets passed to the spell-checker, which suggests adding a trailing space. If the trailing space is added, the spell-checker is again passed เปอร์เซ็นต์, and the spell-checker then suggests adding yet another trailing space!
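Entries like that are easy to hunt down mechanically. Here is a minimal sketch that reports every line ending in a space or tab; the function name entries_with_trailing_space is hypothetical, and it works on any iterable of text lines so that it does not depend on how the file is stored.

```python
def entries_with_trailing_space(lines):
    """Return (line_number, entry) pairs for lines ending in space or tab."""
    hits = []
    for number, raw in enumerate(lines, start=1):
        # Drop only the line terminator, keeping any trailing blanks.
        entry = raw.rstrip("\r\n")
        if entry and entry != entry.rstrip(" \t"):
            hits.append((number, entry))
    return hits


sample_lines = ["alpha\n", "beta \n", "gamma\t\n", "\n", "delta"]
for number, entry in entries_with_trailing_space(sample_lines):
    print(number, repr(entry))
```

To run it over the real dictionary you would pass an open file object instead of the sample list. If the file is TIS-620 encoded, as earlier posts in the thread suggest, opening it with encoding="tis_620" ought to work, though I have not checked that against the packaged file.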


Richard, thanks, this explains a lot. You really understand this well. I hope these problems can be worked out. Thailand really needs a good open alternative to Word, which is probably 90 percent pirated here. Out of curiosity, does the Thai version of Word have problems too? I used it only a couple of times but seem to remember the spell-checker wasn't all that great.


My comparisons are probably unfair. The latest version of Word I've used for Thai is Word 2002, and in 2003 that must have been better than Star Office, because when I had to type letters or faxes in Thai I used Word rather than Star Office. I do remember having to fight the line-breaker - Word 2002 didn't understand WJ or ZWSP, so my only tool was plain space. Working WJ and ZWSP enable much robuster victories when fighting LibreOffice's line breaker - the battle comes nearer to simply being me helping out the line-breaker.

I've finally just dug out my Word installation disk and installed Word 2002 on a machine that has an idle Windows XP OS so I can compare spell checking. (The machine normally serves as a Youtube viewer running under Ubuntu.) Word 2002 does know the word แคลิฟอร์เนีย but without understanding of WJ it cannot handle เฮมมิงเวย์ or the misspelling แคลิฟอร์เนืย at all. It splits the words up and I know no way of fixing that in Word 2002. U+FEFF ZWNBSP doesn't work any better, and actually renders!
