
Sizing And Encoding Mixed Language Text


Richard W


I've broken this loose from 'Using Thai With Your Computer' as I fear it will detract from the utility of that thread.

Is there a way to increase only the size of the Thai font in your internet browser (and not the Roman font)? I am using Firefox and IE on Windows, and Konqueror and Firefox on Linux.

Doesn't Mike's plug-in work?

Yet if the Thai fonts were any smaller they would be difficult to read.

I am also trying to install Khmer fonts and they are less well supported than Thai.

I guess that it is just a problem which has not been addressed yet.

<rant>The information needed is there in a TrueType font. Several distances are stored in a TrueType font (and in its extension, OpenType). Relative to the notional width of the letter 'm', these include both the distance between closely spaced lines - the 'pitch' - and the height of the letter 'x'. Layout programs generally allow a 'pitch' to be selected. Unfortunately, what matters for readability is the 'x-height'. It would be a vast improvement if browsers allowed one to select font size by 'x-height' rather than by pitch.</rant>

One day I am sure that there will be a way to adjust the two fonts independently.
There is in Word XP, or at least the Thai edition. When you use the font menu (not the drop-down selection) you can choose different sizes for different fonts. Unfortunately, you are still selecting the pitch, so the ratio depends on the fonts.

You can do a similar thing in HTML - for example, in one of my pages I have the following fragment:

<style type="text/css">
.normal {font-family: Arial, Tahoma, Helvetica; font-size: 100%}
.kh {font-size: 140%}
</style>
</head>
<body class="normal">

I put 'class="kh"' in a tag 'enclosing' text containing Khmer and it automatically increases the size of that text by 40%. The 40% is a bit of a guess, but as I only have one good Khmer font (KhmerOS) it works fine for me, and should be useful when the text goes on the net. Of course, you do need a version of Uniscribe that works with Khmer - the standard issue with Windows XP does not.

<meta http-equiv="Content-Type" content="text/html; charset=windows-874">

That's just saying that it's an HTML file and that the bytes it contains are to be interpreted according to the encoding known as 'windows-874', which is just an extension of 'TIS-620', the Thai national standard.
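To make the 'extension' point concrete, here is a small Python sketch. It assumes Python's built-in `cp874` and `tis_620` codecs are fair stand-ins for windows-874 and TIS-620; the sample word and the variable names are my own, not anything from the page in question.

```python
# Thai letters get the same byte values in both encodings.
thai = "ไทย"
as_windows_874 = thai.encode("cp874")   # Python's codec name for windows-874
as_tis_620 = thai.encode("tis_620")
same_bytes = as_windows_874 == as_tis_620

# windows-874 adds a few extras, such as the euro sign, that TIS-620 lacks.
euro_in_windows_874 = "€".encode("cp874")
try:
    "€".encode("tis_620")
    euro_in_tis_620 = True
except UnicodeEncodeError:
    euro_in_tis_620 = False

print(as_windows_874.hex(), same_bytes)
print(euro_in_windows_874.hex(), euro_in_tis_620)
```

So a page of plain Thai and English text comes out as the same bytes under either label; only the extra punctuation and symbols differ.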

and unicode.

That's an encoding which aims to cover all languages. It's still being extended - at the moment there are arguments about how to represent the Lanna script. Unicode gets complicated because it is basically an association between characters and numbers, and there are different ways of organising the numbers into sequences of 8-bit chunks - bytes to most of us, octets to the nit-pickers. UTF-8 uses a single byte per number when it can, e.g. for all the characters in this post; UTF-16 uses two bytes per number when it can (it needs four bytes for obscure Chinese characters and ancient scripts, such as Linear B and Cuneiform); and UTF-32 always uses four bytes per number. UTF-16 and UTF-32 are complicated by there being two obvious orders in which the bytes may be transmitted, and both are used, though that problem is usually solved by a little bit of ingenuity. The two orders are called little-endian (favoured by Microsoft) and big-endian (favoured by Sun and, I think, IBM and Apple).
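The byte counts above are easy to check for yourself. The following is only a sketch in Python; the sample characters and the helper name `bytes_per_character` are my own choices. The last part shows one common form of the 'ingenuity' (I am assuming the byte order mark, U+FEFF, is what is meant): its bytes come out in a different order depending on the byte order, so a reader can tell which one was used.

```python
import codecs

samples = {
    "Latin A": "A",
    "Thai ko kai (U+0E01)": "\u0e01",
    "Khmer ka (U+1780)": "\u1780",
    "Linear B (U+10000)": "\U00010000",
}

def bytes_per_character(ch):
    # The explicit "-le" codecs keep a byte order mark out of the count.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for name, ch in samples.items():
    print(name, bytes_per_character(ch))

# The same number, U+0E01, in the two byte orders.
little = "\u0e01".encode("utf-16-le")
big = "\u0e01".encode("utf-16-be")
print(little.hex(), big.hex())

# The byte order mark U+FEFF written in each order.
bom_le = "\ufeff".encode("utf-16-le")
bom_be = "\ufeff".encode("utf-16-be")
print(bom_le == codecs.BOM_UTF16_LE, bom_be == codecs.BOM_UTF16_BE)
```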

Now with HTML, UTF-8 works well because most mark-up then needs just one byte per character. However, Thai and Khmer need three bytes per character with UTF-8. For Thai HTML, Windows-874 (or TIS-620, depending on how you're feeling about Microsoft) needs only one byte for each Thai or English character. If your page truly had a lot of Khmer text - more Khmer than mark-up - UTF-16 would probably be the best encoding for compactness. This is the encoding that Windows confusingly describes as 'Unicode'. Saving plain text as Windows-874 is tricky on Windows - with the right settings for your computer, it is the encoding described as 'ANSI' by Notepad. If you're providing pictures, they will dominate download time, making the choice of encoding irrelevant.
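For anyone who wants to see where the break-even point lies, here is a rough Python sketch. The one-paragraph 'page' and the names `encoded_sizes` and `page` are made up for illustration; a real page has far more mark-up, which is the whole point.

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le")):
    """Return {encoding: byte count}; "-le" avoids counting a byte order mark."""
    return {enc: len(text.encode(enc)) for enc in encodings}

def page(letter, count):
    # 18 ASCII characters of mark-up around `count` copies of one letter.
    return '<p class="kh">' + letter * count + "</p>"

khmer_ka = "\u1780"
thai_ko_kai = "\u0e01"

# More mark-up than Khmer: UTF-8 is smaller.
mostly_markup = encoded_sizes(page(khmer_ka, 5))
# More Khmer than mark-up: UTF-16 is smaller.
mostly_khmer = encoded_sizes(page(khmer_ka, 50))
# Thai text: one byte per character in windows-874 (cp874 in Python).
thai_page = encoded_sizes(page(thai_ko_kai, 50), ("utf-8", "cp874"))

print(mostly_markup)
print(mostly_khmer)
print(thai_page)
```

With UTF-8 at one byte per mark-up character and three per Khmer letter, against two bytes for everything in UTF-16, the two come out equal when the Khmer letters and the mark-up characters are equal in number.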
