Posted
Can you not post with the encoding set to ISO-8859-1? That is what the encoding of TV pages is explicitly set to. The characters should be stored in these pages as 'numeric entities', not as UTF-8 byte sequences.

Thanks Richard. I use Apple's Safari browser, which doesn't have that encoding option. But then I checked it out on Firefox, which I guess I will use when posting on this forum.

I think you do have the option on Safari - but it may be known as '(ISO) Latin-1' or even 'Western European'. I can't guarantee that the keyboard isn't tied to the encoding.

Does anyone have the knowledge, tools and time to figure out what to do to make Safari work? I'd like to see exactly what the different browsers transmit in the 'post' request sent to the server, depending on the encoding in force when the message is posted. We may also want to compare Windows systems with different default codepages: the Firefox Conquery add-in using thai-language.com as a dictionary behaves differently depending on whether the codepage for 'non-Unicode' applications is Thai (presumably Windows-874) or Western European! (And that's with basically English-language installations!)

I don't knowingly have access to server-side scripting, unless it's easy to temporarily (e.g. for a few hours at a time) set up a Windows PC to act as a server. (We might fall foul of security checks if the HTML 'form' is in a webpage hosted on a different server to the one the completed form is directed to.)

HTTP allows a browser to tag a 'post' request with the character encoding it is using, by means of a 'charset' parameter in the 'Content-Type' header, but perhaps Safari isn't tagging the request and perhaps the TV server code is ignoring any such tagging.
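For anyone who does get to log the requests, the tag to look for would be a charset parameter on the Content-Type header. A minimal sketch of pulling it out, assuming the header value is already available as a string; the function name charsetFromContentType is made up for illustration, and nothing here says what Safari or Firefox actually send:

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Returns the charset name in lower case, or null if the request is untagged.
function charsetFromContentType(headerValue) {
  // Accept both charset=value and charset="value", case-insensitively.
  const match = /;\s*charset\s*=\s*"?([^\s;"]+)"?/i.exec(headerValue);
  return match ? match[1].toLowerCase() : null;
}

console.log(charsetFromContentType("application/x-www-form-urlencoded; charset=TIS-620"));
console.log(charsetFromContentType("multipart/form-data; boundary=abc"));
```

The first call prints tis-620 and the second prints null, which is the untagged case the server would have to guess at.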

One solution would be for ThaiVisa to switch to UTF-8, but I don't see it happening. Theoretically the contents of untagged incoming posts could be examined to distinguish between ISO-8859-1, UTF-8 and TIS-620, but I don't see that appealing either. It should be feasible, for Thai and English are the only languages allowed in the forums.
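To give an idea of what examining untagged posts might involve, here is a rough sketch. It assumes the raw bytes of the message are available as an array of numbers, and its rules are guesses of mine, not anything ThaiVisa does: well-formed multi-byte UTF-8 is taken as UTF-8, runs of adjacent high bytes that all fall in the TIS-620 Thai ranges are taken as Thai, and anything else is treated as ISO-8859-1. The names looksLikeUtf8, isTis620Thai and guessEncoding are illustrative.

```javascript
// Heuristic only: well-formed UTF-8 containing at least one multi-byte sequence.
// (Not a strict validator: overlong forms and encoded surrogates are not rejected.)
function looksLikeUtf8(bytes) {
  let multibyte = false;
  for (let i = 0; i < bytes.length; ) {
    const b = bytes[i];
    let extra;
    if (b < 0x80) extra = 0;
    else if (b >= 0xc2 && b <= 0xdf) extra = 1;
    else if (b >= 0xe0 && b <= 0xef) extra = 2;
    else if (b >= 0xf0 && b <= 0xf4) extra = 3;
    else return false; // not a valid lead byte
    for (let k = 1; k <= extra; k++) {
      const c = bytes[i + k];
      if (c === undefined || c < 0x80 || c > 0xbf) return false; // bad continuation
    }
    if (extra > 0) multibyte = true;
    i += extra + 1;
  }
  return multibyte;
}

// Byte values that TIS-620 assigns to Thai characters.
function isTis620Thai(b) {
  return (b >= 0xa1 && b <= 0xda) || (b >= 0xdf && b <= 0xfb);
}

function guessEncoding(bytes) {
  let sawHigh = false;
  let sawHighRun = false;
  let allThai = true;
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] < 0x80) continue;
    sawHigh = true;
    if (!isTis620Thai(bytes[i])) allThai = false;
    if (i > 0 && bytes[i - 1] >= 0x80) sawHighRun = true;
  }
  if (!sawHigh) return "ascii";
  if (looksLikeUtf8(bytes)) return "utf-8";
  // Assumption: Thai words produce runs of adjacent high bytes, whereas the
  // odd accented letter or symbol in English text stands alone.
  return allThai && sawHighRun ? "tis-620" : "iso-8859-1";
}

console.log(guessEncoding([0xe0, 0xb8, 0x88])); // U+0E08 encoded as UTF-8
console.log(guessEncoding([0xa8, 0xd0]));       // U+0E08 U+0E30 encoded as TIS-620
console.log(guessEncoding([0x63, 0x61, 0x66, 0xe9])); // "café" in ISO-8859-1
```

Short messages can still be ambiguous (some TIS-620 byte pairs are also valid UTF-8), so this could only ever be a fallback for untagged posts.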

I believe Safari users have similar problems with www.thai-language.com, which uses TIS-620 as its normal character encoding. This long-standing problem does not just irritate users of Safari.

Richard.

Posted (edited)
None of the above show up ok for me unfortunately. Any more options available in Safari?

The last one looks ok to me (not that I can read it) when I set character encoding to Thai.

post-5469-1181153969_thumb.jpg

Sophon

Edited by Sophon
Posted (edited)
Right, but the problem we need to figure out is how to make the Thai script readable without having to change the character encodings. Cheers.

Edited by mangkorn
Posted

Here's an experiment. I've modified a Thai Visa reply-posting page to UTF-8, declaring UTF-8 and ISO-8859-1 to be acceptable encodings. จะรับหรือเปล่า

Richard W.

Posted

Here's an experiment. I've modified a Thai Visa reply-posting page to UTF-8, declaring, not UTF-8 and ISO-8859-1 in that order, but ISO-8859-1 and TIS-620, to be acceptable encodings for transmitting the post. จะรับหรือเปล่า

Posted (edited)

The key part of the page when posting is the following stretch of HTML:

<form id="postingform" action="http://www.thaivisa.com/forum/index.php?" method="post" name="REPLIER" onsubmit="return ValidateForm()" enctype="multipart/form-data">

Now, being slightly archaic, the HTML 4.0 specification defines two significant attributes:

  • enctype = content-type [CI]
    This attribute specifies the content type used to submit the form to the server (when the value of method is "post"). The default value for this attribute is "application/x-www-form-urlencoded". The value "multipart/form-data" should be used in combination with the INPUT element, type="file".
  • accept-charset = charset list [CI]
    This attribute specifies the list of character encodings for input data that is accepted by the server processing this form. The value is a space- and/or comma-delimited list of charset values. The client must interpret this list as an exclusive-or list, i.e., the server is able to accept any single character encoding per entity received.
    The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.

Since the form has no 'accept-charset' attribute, it seems to be saying that the reply should be transmitted using the character set of the page. Thai cannot be transmitted using Latin-1 or MacOS Roman, but with a Thai encoding it is straightforward to send. In the first two cases Safari decides 'no can do', and converts the Thai to question marks.
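The effect can be imitated in a few lines. This is only a sketch of what a browser presumably does when it must squeeze text into a single-byte encoding, not Safari's actual code; encodeLatin1 and encodeTis620 are illustrative names. It relies on the fact that TIS-620 places the Thai block at byte = code point - 0x0E00 + 0xA0, while ISO-8859-1 bytes equal the code points up to U+00FF.

```javascript
// Encode to ISO-8859-1: code points up to U+00FF map to the same byte value,
// everything else falls back to "?" (0x3F).
function encodeLatin1(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    return cp <= 0xff ? cp : 0x3f;
  });
}

// Encode to TIS-620: ASCII is unchanged, Thai U+0E01..U+0E3A and U+0E3F..U+0E5B
// map to 0xA1..0xDA and 0xDF..0xFB, everything else falls back to "?".
function encodeTis620(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) return cp;
    const isThai = (cp >= 0x0e01 && cp <= 0x0e3a) || (cp >= 0x0e3f && cp <= 0x0e5b);
    return isThai ? cp - 0x0e00 + 0xa0 : 0x3f;
  });
}

const sample = "\u0e08\u0e30"; // the first two characters of the test text
console.log(encodeLatin1(sample)); // [ 63, 63 ], i.e. "??"
console.log(encodeTis620(sample)); // [ 168, 208 ], i.e. 0xA8 0xD0
```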

You will see two experiments above, where I changed the encoding of the page to tis-620 and, in the latter case, the opening form tag to:

<form id='postingform' action='http://www.thaivisa.com/forum/index.php?' method='post' name='REPLIER' 
onsubmit='return ValidateForm()' enctype='multipart/form-data' accept-charset = "iso-8859-1 tis-620">

(The encoding of the page was changed because of some non-ASCII characters lurking on the page.) In each case I used Firefox 1.5.0.6 under Windows XP SP2 with British preferences and Thai as the codepage for non-Unicode applications.

I wish I could log what was being sent to the server. I suspect that Firefox's treatment of text encodings is as follows:

1) If the encoding used is the same as the first listed in 'accept-charset', there is no header to indicate the charset (strictly speaking, character encoding rather than set, but like the standards, let us stick to the terminology used in the MIME e-mail standard). This encoding is different to the 'transfer-encoding', which will probably be 'quoted printable' or 'base64'.

2) Otherwise, the encoding is listed in one of the 'entity headers'.

The question now is, what does Safari do if the 'accept-charset' attribute is added to offer a choice of encodings? To do the experiment, the process is as follows:

1) Start a reply.

2) View source.

3) Save it as a text file with extension .htm or .html. You may have to use a character set different to ISO-8859-1. If so, at the next step amend the character set recorded in the tag starting <meta http-equiv="content-type" - it will be at or near line 5 - to record the actual character set used.

4) Edit the 'form' element (the one labelled 'postingform'), and save the edits.

5) Open the edited file in the browser, and try and post some Thai.

If this experiment works, we can then see if we can get ThaiVisa to add an appropriate 'accept-charset' to the form, e.g. accept-charset="iso-8859-1 tis-620". It would be too much work to go through the rigmarole above each time.
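Step 4 can be done by hand in a text editor, but for repeated trials a small script saves effort. The following is a sketch only: addAcceptCharset is a made-up helper, it assumes the saved page source is held in a string, and it assumes the form id contains no characters that are special in a regular expression (true of 'postingform').

```javascript
// Hypothetical helper: add an accept-charset attribute to the opening tag of
// the form with the given id, leaving the tag alone if it already has one.
function addAcceptCharset(html, formId, charsets) {
  // Match the opening <form ...> tag whose id is quoted with ' or ".
  const openTag = new RegExp(
    "<form\\b[^>]*\\bid\\s*=\\s*[\"']" + formId + "[\"'][^>]*>",
    "i"
  );
  return html.replace(openTag, (tag) => {
    if (/\baccept-charset\s*=/i.test(tag)) return tag;
    // Drop the closing ">" and append the new attribute before restoring it.
    return tag.slice(0, -1) + ' accept-charset="' + charsets.join(" ") + '">';
  });
}

const before = '<form id="postingform" action="index.php?" method="post">';
console.log(addAcceptCharset(before, "postingform", ["iso-8859-1", "tis-620"]));
```

The charset in the page's meta tag (step 3) still has to be checked by hand.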

Edited by Richard W
Posted (edited)

Another possibility would be to convert the text to numeric character references (NCRs) as suggested at http://m10lmac.blogspot.com/2006/12/s-when...ari.html. The author tells me that a suitable program for doing the conversions is UnicodeChecker:

PS Here is UnicodeChecker, in case you don't have it. When installed, you have Unicode item in the Services menu that lets you convert anything to NCR's and lots of other stuff.

http://earthlingsoft.net/UnicodeChecker/

Let's see what happens when I do a semi-manual conversion (using Word 2002) with the same text as before: จะรับ

หรือเปล่า

Previewing post displays the numeric character references as such in the preview, and converts them to Thai in the editing window! I've converted the text back to numeric character references, and now let's see what happens when I post without previewing.

//Edited to put a space and thus an optional line break in the failed Thai Text.

Edited by Richard W
Posted (edited)
Decimal NCR's seem to work but Hex NCR's not?

I'll give it a whirl - same text as before, but in decimal NCRs. NB: From Firefox on Windows.

จะรับหรือเปล่า

By Jove! Decimals work and hex doesn't! Does UnicodeChecker generate decimal NCRs or do we need an ancillary web page to do the conversions? The web page'll be a piece of cake so long as one doesn't want to handle Linear B, cuneiform or other obscure characters, i.e. no problem for codes below 65,536. In theory one could get the posting page to do the conversion.
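If a tool only produces hex NCRs, turning them into the decimal form that does work is a one-line substitution. A small sketch (hexNcrsToDecimal is an illustrative name):

```javascript
// Rewrite hexadecimal NCRs such as &#x0E08; as decimal NCRs such as &#3592;.
// Decimal NCRs and ordinary text pass through untouched.
function hexNcrsToDecimal(text) {
  return text.replace(/&#x([0-9a-f]+);/gi, (whole, hex) => "&#" + parseInt(hex, 16) + ";");
}

console.log(hexNcrsToDecimal("&#x0E08;&#x0E30;")); // &#3592;&#3632;
```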

Incidentally, I've been looking at some Firefox bug reports. It definitely looks as though there is no character set tagging going on - too many servers get confused by it!

Edited by Richard W
Posted
Does UnicodeChecker generate decimal NCRs or do we need an ancillary web page to do the conversions? The web page'll be a piece of cake so long as one doesn't want to handle Linear B, cuneiform or other obscure characters, i.e. no problem for codes below 65,536.

Here's the conversion code. Just save it as a web page and keep it handy if you want a visibly Trojan-free converter. The procedure for use is:

1) Compose text of post in normal fashion. (I don't know if preview works with Safari.)

2) Copy text to top pane of conversion page.

3) Click button on conversion page.

4) Copy converted text from bottom pane to editing window for post, replacing previous text.

5) Submit post.

Let me know if there are any problems using it with Safari. I've tested it with Firefox and IE6, using a mixture of English, French accented characters, Thai and Linear B.

The next step is to persuade TV to incorporate the conversion in the site's web page so that no manual intervention is needed. (An alternative is to switch the site to using UTF-8, but that might not be trivial.)

Remember that Firefox and Windows IE6 (and presumably also IE7) users do not have these troubles - just problems with bracketed letters being converted to smilies.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Text to Decimal NCR Converter</TITLE>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META http-equiv=Content-Script-Type content=text/javascript>
</HEAD>
<BODY>
<FORM NAME="myform">
<H2>Raw Text</H2>
<TEXTAREA NAME="wordsin" ROWS=15 COLS=80 style="font-size: 12pt">
</TEXTAREA>
<script>document.myform.wordsin.value += "";</script>
<p>Paste normal text into top window.  Click the button below to convert to use
numeric character references for characters outside the Latin-1 range.
The generated form, in the box below, is then suitable for use in HTML pages
encoded in Latin-1.
<p><input type="button" value="Convert to Latin-1 + NCR now"
onClick="convert('Latin-1')">
<H2>Converted Text</H2>
<TEXTAREA NAME="scratch" ROWS=15 COLS=80 style="font-size: 12pt">
</TEXTAREA>
</FORM>
<script type="text/javascript"><!--
// Copy the text from the top pane to the bottom pane, replacing every
// character outside ASCII and the printable Latin-1 range with a decimal NCR.
// The 'target' argument is currently unused; only Latin-1 is supported.
function convert(target) {
	var text = document.myform.wordsin.value;
	var out = "";
	var held = 0; // High surrogate awaiting its low surrogate, if non-zero.
	var pi, onch, code;

	for (pi = 0; pi < text.length; pi++) {
		onch = text.charAt(pi);
		code = text.charCodeAt(pi);
		if (held) {
			if (0xdc00 <= code && code < 0xe000) {
				// Combine the surrogate pair into a single code point.
				code = (held - 0xd800) * 1024 + (code - 0xdc00) + 0x10000;
			} else {
				out += String.fromCharCode(held); // Unpaired high surrogate: GIGO
			}
			held = 0;
		}
		if (code <= 127) { // ASCII
			out += onch;
		} else if (160 <= code && code <= 255) { // ISO-8859-1
			out += onch;
		} else if (0xd800 <= code && code < 0xdc00) { // High surrogate
			held = code;
		} else { // Everything else becomes a decimal NCR.
			out += "&#" + code + ";";
		}
	}
	if (held) { // Trailing high surrogate!
		out += String.fromCharCode(held); // GIGO
	}
	document.myform.scratch.value = out;
}
// -->
</SCRIPT>
</BODY></HTML>
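As for building the conversion into the posting page itself, so that no manual copying is needed, the hook might look roughly like the sketch below. Everything about the wiring is an assumption: the form id 'postingform' is taken from the page source quoted earlier, but the choice of the first textarea in that form is a guess, and toDecimalNcr is an illustrative name. The conversion is written with code-point iteration, which needs a more recent browser than those discussed here.

```javascript
// Replace everything outside ASCII and U+00A0..U+00FF with a decimal NCR.
function toDecimalNcr(text) {
  let out = "";
  // for...of walks the string by code point, so surrogate pairs stay together.
  // (A lone surrogate is emitted as an NCR of its own value here.)
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const keep = cp <= 0x7f || (cp >= 0xa0 && cp <= 0xff);
    out += keep ? ch : "&#" + cp + ";";
  }
  return out;
}

// Hypothetical wiring: convert the message just before the form is submitted.
// Skipped when there is no DOM, e.g. when run from a command line.
if (typeof document !== "undefined") {
  const form = document.getElementById("postingform");
  const box = form && form.querySelector("textarea"); // assumed to be the message box
  if (form && box) {
    form.addEventListener("submit", () => {
      box.value = toDecimalNcr(box.value);
    });
  }
}

console.log(toDecimalNcr("caf\u00e9 \u0e08\u0e30")); // café &#3592;&#3632;
```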

Posted
Does UnicodeChecker generate decimal NCRs or do we need an ancillary web page to do the conversions?

UnicodeChecker can be set to generate either hex or decimal NCR's in its preferences.

Test of Linear B (decimal ncr's):

???????
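For reference, Linear B lies above 65,535, so in JavaScript each character is stored as a surrogate pair and the decimal NCR has to be computed from the two halves. A small worked check of that arithmetic, using U+10000 (the first code point of the Linear B Syllabary block); combineSurrogates is an illustrative name:

```javascript
// Combine a high and a low surrogate into the code point they represent.
function combineSurrogates(high, low) {
  return (high - 0xd800) * 1024 + (low - 0xdc00) + 0x10000;
}

const linearB = String.fromCodePoint(0x10000);
const high = linearB.charCodeAt(0); // 0xD800
const low = linearB.charCodeAt(1);  // 0xDC00
const ncr = "&#" + combineSurrogates(high, low) + ";";

console.log(ncr); // &#65536;
console.log(combineSurrogates(high, low) === linearB.codePointAt(0)); // true
```

Whether such an NCR then displays properly depends on the font and the forum software, which this sketch cannot tell us.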
