HELP: Unicode in Java 1.3.1 vs 1.4.2

Post by modes » Wed, 16 Feb 2005 21:58:17

Hi All,

according to :

"If a byte array contains non-Unicode text, you can convert the text to
Unicode with one of the String constructor methods. Conversely, you can
convert a String object into a byte array of non-Unicode characters
with the String.getBytes method. When invoking either of these methods,
you specify the encoding identifier as one of the parameters."

This works fine in Java 1.3.1:

// Convert ASCII to Unicode
str_uni = new String(str_ascii.getBytes(), "ISO8859_8");

// Convert Unicode to ASCII
str_ascii = new String(str_uni.getBytes("ISO8859_8"));

In Java 1.4.2 it returns question marks only.

What is the difference, and how can it be fixed?

I need the solution URGENTLY.


Post by John C. Bo » Thu, 17 Feb 2005 00:15:07

You are not using the canonical name of the charset, which is
"ISO-8859-8". Which charsets are available and how they are configured
depends on your Java installation. On my Sun JDK 1.4.2_05 installation,
the charset in question has no defined aliases and therefore can only be
referred to by its canonical name. I don't know why you are getting
anything at all in this case (you would get an
UnsupportedEncodingException if the charset name were unknown).
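
To see which names a particular installation accepts, a small check along
these lines may help. It is only a sketch: it relies on
java.nio.charset.Charset (added in 1.4), and the class name CharsetLookup
and the helper canonicalName are made up for illustration. The result
will differ between installations.

```java
import java.nio.charset.Charset;

public class CharsetLookup {
    // Hypothetical helper: returns the canonical name for a charset name or
    // alias, or null if this JVM does not recognise it.
    static String canonicalName(String nameOrAlias) {
        try {
            return Charset.forName(nameOrAlias).name();
        } catch (IllegalArgumentException e) {
            // UnsupportedCharsetException and IllegalCharsetNameException
            // are both subclasses of IllegalArgumentException.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-8 -> " + canonicalName("ISO-8859-8"));
        System.out.println("ISO8859_8  -> " + canonicalName("ISO8859_8"));

        // List the aliases only if the charset exists on this installation.
        if (Charset.isSupported("ISO-8859-8")) {
            System.out.println("aliases: " + Charset.forName("ISO-8859-8").aliases());
        }
    }
}
```

A null result for a name means that name cannot be passed to the String
constructor or to getBytes on that installation.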

That said, your code is deeply flawed. If you have data in a Java
String then it is already Unicode, *that is a fundamental characteristic
of Java Strings*. It does not make sense to talk about changing the
encoding / charset of a String -- the concept just doesn't apply (and
the i18n tutorial you refer to doesn't suggest otherwise). If you have
taken a byte sequence and created a String from it without accounting
for the bytes' charset then you are already hosed. This may be your
real problem, and it has not changed from 1.3 to 1.4 (or 1.5).
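
As a rough illustration of the point above, the charset should be named
exactly once in each direction: when raw bytes become a String, and when
a String becomes raw bytes again. The sketch below assumes the external
data is ISO-8859-8, as in the original question, and that the charset is
installed. The class name HebrewRoundTrip and its helpers are made up
for the example.

```java
import java.io.UnsupportedEncodingException;

public class HebrewRoundTrip {
    // Assumption: the bytes coming from outside are ISO-8859-8 encoded.
    static final String CHARSET = "ISO-8859-8";

    // bytes -> String: the one place where the charset of the raw data matters.
    static String decode(byte[] raw) {
        try {
            return new String(raw, CHARSET);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Charset not available: " + CHARSET);
        }
    }

    // String -> bytes: name the charset that the receiver expects.
    static byte[] encode(String text) {
        try {
            return text.getBytes(CHARSET);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Charset not available: " + CHARSET);
        }
    }

    // Renders each char as a four-digit hex escape so that the output does
    // not depend on what the console is able to display.
    static String escape(String text) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < text.length(); i++) {
            String hex = Integer.toHexString(text.charAt(i)).toUpperCase();
            sb.append("\\u");
            for (int pad = hex.length(); pad < 4; pad++) {
                sb.append('0');
            }
            sb.append(hex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Shin, lamed, vav, final mem as ISO-8859-8 bytes.
        byte[] raw = { (byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED };

        String text = decode(raw);     // now ordinary Unicode chars
        byte[] back = encode(text);    // back to the original four bytes

        System.out.println(escape(text));
        System.out.println(back.length + " bytes after the round trip");
    }
}
```

Note that there is no String-to-String "conversion" step anywhere. Once
the bytes have been decoded with the right charset, the String needs no
further treatment until it is written out again.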

In addition, it might be relevant to you that ASCII, Unicode, and all
the ISO-8859 nationalized charsets all assign the same codes to the
characters covered by ASCII. The UTF-8 encoding of Unicode likewise
encodes each ASCII character as a single byte whose value is the same as
the character code itself.
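
A quick way to convince yourself of this is to encode the same pure-ASCII
text under several charsets and compare the bytes. The following is a
minimal sketch. AsciiSubset, encode and sameBytes are illustrative names,
and only charsets that every Java platform is required to support are
used.

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class AsciiSubset {
    // Encodes text with the named charset, turning the checked exception
    // into an unchecked one to keep the example short.
    static byte[] encode(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("Charset not available: " + charsetName);
        }
    }

    // True if both charsets produce exactly the same bytes for the text.
    static boolean sameBytes(String text, String charsetA, String charsetB) {
        return Arrays.equals(encode(text, charsetA), encode(text, charsetB));
    }

    public static void main(String[] args) {
        String ascii = "Hello, world";
        System.out.println(sameBytes(ascii, "US-ASCII", "ISO-8859-1"));
        System.out.println(sameBytes(ascii, "US-ASCII", "UTF-8"));

        // A Hebrew letter is outside ASCII, so the encodings diverge here.
        String shin = "\u05E9";
        System.out.println(sameBytes(shin, "US-ASCII", "UTF-8"));
    }
}
```

The first two comparisons should come out true and the last one false.
Characters that the target charset cannot represent are, in practice,
typically replaced with '?' by getBytes, which is one common source of
question marks in output.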

John Bollinger


Post by Chris Uppa » Thu, 17 Feb 2005 00:26:24

Well spotted.

-- chris