Encoding conversion problem

This is a discussion on Encoding conversion problem within the JDBC JAVA forums in Framework and Interface Programming category; Andrea wrote: > Hi Roedy, > the database (DB2) has this configuration: > ... > Database territory = US > Database code page = 850 > Database code set = IBM-850 > ... > > I've exported to a file the content of a table with a CHAR(N) field > containing the EURO currency character, then I've opened the file with > EncodingRecognizer: if I choose IBM850 I see a strange character (like > a small X), if I choose ISO-8859-1 I see a square. > > I tried a translation with: > > String problematicString = rs.getString(index); > problematicString ...

Go Back   Application Development Forum > Framework and Interface Programming > JDBC JAVA

Object Mix

Register FAQ Calendar Search Today's Posts Mark Forums Read
Reply

 

LinkBack Thread Tools Display Modes
  #11  
Old 02-13-2008, 06:36 AM
Silvio Bierman
Guest
 
Default Re: Encoding conversion problem

Andrea wrote:
> Hi Roedy,
> the database (DB2) has this configuration:
> ...
> Database territory = US
> Database code page = 850
> Database code set = IBM-850
> ...
>
> I've exported to a file the content of a table with a CHAR(N) field
> containing the EURO currency character, then I've opened the file with
> EncodingRecognizer: if I choose IBM850 I see a strange character (like
> a small X), if I choose ISO-8859-1 I see a square.
>
> I tried a translation with:
>
> String problematicString = rs.getString(index);
> problematicString = new String(problematicString, "IBM850"); // Am I
> correct?
>
> but I still get garbage :-(
>
>
> Thanks,
> Andrea


Those are quite interesting database configuration parameters. Sounds
like a pre-Unicode setup to me...

Silvio Bierman

Reply With Quote
  #12  
Old 02-13-2008, 06:48 AM
Silvio Bierman
Guest
 
Default Re: Encoding conversion problem

Andrea wrote:
> Hi Roedy,
> the database (DB2) has this configuration:
> ...
> Database territory = US
> Database code page = 850
> Database code set = IBM-850
> ...
>
> I've exported to a file the content of a table with a CHAR(N) field
> containing the EURO currency character, then I've opened the file with
> EncodingRecognizer: if I choose IBM850 I see a strange character (like
> a small X), if I choose ISO-8859-1 I see a square.
>
> I tried a translation with:
>
> String problematicString = rs.getString(index);
> problematicString = new String(problematicString, "IBM850"); // Am I
> correct?
>
> but I still get garbage :-(
>
>
> Thanks,
> Andrea


You should look at the numeric byte values in problematicString. That
could give you an idea of what you are dealing with although it will
only disclose what your JDBC driver has made of it. It might have
already done an incorrect interpretation of a byte sequence.
Things could be even worse, the data could already have been mutilated
during insertion in the database when some program (possibly the same
program + JDBC driver?) put the data in. If the database encoding does
not support all characters that where in the original data then that is
what most likely happened.

Really, as I said in my previous post you should consider (a) going to a
different database that supports Unicode (b) refrain from
using/supporting non ASCII characters in your application or (c) do what
others have suggested and do your own translation from Unicode -> ASCII
-> DB -> ASCII -> Unicode. The latter option is only realistic if you
have wrapped all your JDBC code in some generic wrappers (which is
usually a good idea) so you can handle this locally.

Good luck,

Silvio
Reply With Quote
  #13  
Old 02-13-2008, 09:22 AM
Andrea
Guest
 
Default Re: Encoding conversion problem

Hi Silvio,
the settings are taken from the DB2 instance of a customer (and I
can't change them). The very same code works, of course, without
problems with a DB2 instance configured with ISO-8859-1.

The problem arises also when a C program stores to DB a string with
non-IBM850 valid characters: another C program can read the string
without problems while Java can't; so the string is not corrupted when
saved to DB but someone (JDBC driver? Java I/O?) looses something when
I read the field with Resultset.getString(int index) and I can't
convert it correctly (or I haven't found the right way to do it yet,
if it exists...).

BTW: a test has been made on DB2 with a table with a field declared
CHAR(n) FOR BIT DATA and Java code works without problems reading and
writing non-IBM850 characters.

Having read your feedback (THANK YOU EVERYONE!) I would say that
there's no way to read back those characters in Java in my
application.

Thanks again,
Andrea
Reply With Quote
  #14  
Old 02-13-2008, 10:28 AM
Andrea
Guest
 
Default Re: Encoding conversion problem

Hi Silvio,
the settings are taken from the DB2 instance of a customer (and I
can't change them). The very same code works, of course, without
problems with a DB2 instance configured with ISO-8859-1.

The problem arises also when a C program stores to DB a string with
non-IBM850 valid characters: another C program can read the string
without problems while Java can't; so the string is not corrupted when
saved to DB but someone (JDBC driver? Java I/O?) looses something when
I read the field with Resultset.getString(int index) and I can't
convert it correctly (or I haven't found the right way to do it yet,
if it exists...).

A test has been made on DB2 with a table with a field declared CHAR(n)
FOR BIT DATA and Java code works without problems reading and writing
non-IBM850 characters.

Having read your feedback (THANK YOU EVERYONE!) I would say that
there's no way to read back those characters in Java in my
application.

Thanks again,
Andrea
Reply With Quote
  #15  
Old 02-13-2008, 10:39 AM
Silvio Bierman
Guest
 
Default Re: Encoding conversion problem

Hello Andrea,


Andrea wrote:
> Hi Silvio,
> the settings are taken from the DB2 instance of a customer (and I
> can't change them). The very same code works, of course, without
> problems with a DB2 instance configured with ISO-8859-1.
>

I already expected you could not change this but it was worth suggesting...

I do not really understand why a Euro sign would work with 8859-1 since
that does not contain that character as far as I am aware of.

> The problem arises also when a C program stores to DB a string with
> non-IBM850 valid characters: another C program can read the string
> without problems while Java can't; so the string is not corrupted when
> saved to DB but someone (JDBC driver? Java I/O?) looses something when
> I read the field with Resultset.getString(int index) and I can't
> convert it correctly (or I haven't found the right way to do it yet,
> if it exists...).
>


It will probably depend on how you access the DB from C (which I have
never done with DB2) but it does not sound surprising that the C binding
would just pass through byte sequences. As long as you stay inside 8-bit
character encoding and always interpret them in the same codepage then
this usually works (which is why many programmers are totally unaware of
character encoding issues).

Since Java programs handle strings as sequences of characters from the
Unicode character set all interfaces with external character storage
needs to be done encoding aware. This means that when doing input Java
String objects are created from byte sequences ALWAYS assuming an
encoding, if not explicitly specified defaulting to the platform
default. The other way around String objects are converted to byte
sequences when doing output as well ALWAYS using an encoding, again
defaulting to the platform default.

> BTW: a test has been made on DB2 with a table with a field declared
> CHAR(n) FOR BIT DATA and Java code works without problems reading and
> writing non-IBM850 characters.
>


That makes sense. What I said before means that if you do not specify
different encodings the same default will be used for input and output.
If the storage medium leaves the intermediate bytes alone (which any
binary database column type would do, just like a binary file would)
then output will again match input, again allowing encoding unaware
programmers to get away with it.

> Having read your feedback (THANK YOU EVERYONE!) I would say that
> there's no way to read back those characters in Java in my
> application.
>


The problem is probably that the default encoding in Java turns your
Unicode character into something the database is unwilling to store as
is in this encoding and therefore mutilates trying to fix something.
When queried the database then returns a byte sequence different from
what was put in.
After that modification the JDBC driver has no chance of restoring the
original value.

> Thanks again,
> Andrea



Best regards,

Silvio

Reply With Quote
  #16  
Old 02-13-2008, 11:36 AM
Roedy Green
Guest
 
Default Re: Encoding conversion problem

On Wed, 13 Feb 2008 06:22:22 -0800 (PST), Andrea
<tol7481@iperbole.bologna.it> wrote, quoted or indirectly quoted
someone who said :

>Hi Silvio,
>the settings are taken from the DB2 instance of a customer (and I
>can't change them). The very same code works, of course, without
>problems with a DB2 instance configured with ISO-8859-1.


Is is possible to ask the database driver to do the conversions for
you? Perhaps internally it is Unicode or some other encoding that can
deal with Euros. We have the clue that C++ programs seem to store
euro s and get them back out.

I am puzzled. I thought JDBC always talked to you in Unicode.
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Reply With Quote
  #17  
Old 02-13-2008, 11:38 AM
Roedy Green
Guest
 
Default Re: Encoding conversion problem

On Wed, 13 Feb 2008 16:39:59 +0100, Silvio Bierman
<sbierman@jambo-software.com> wrote, quoted or indirectly quoted
someone who said :

>I do not really understand why a Euro sign would work with 8859-1 since
>that does not contain that character as far as I am aware of.


You could do an experiment. Try feeding your database all possible
unicode chars in a set of 1-char records, and see which ones come back
unmangled. This is a kludge, but you could preconvert your Euro to
one of those invariant unused chars.
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Reply With Quote
  #18  
Old 02-14-2008, 11:00 AM
Andrea
Guest
 
Default Re: Encoding conversion problem

Hi everyone,
sorry for my previous double-post (a mistake).

>Is is possible to ask the database driver to do the conversions for
>you? Perhaps internally it is Unicode or some other encoding that can
>deal with Euros.

I've checked the properties of the JDBC driver I use (http://
publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/
com.ibm.db2.udb.doc/ad/rjvdsprp.htm) but there's nothing concerning
encoding conversions.

>We have the clue that C++ programs seem to store euro s and get them back out.

Yes we have C and COBOL programs that can store and write non-IBM850
chars without problems too.
As pointed out by Sabine in her post the reason may be that C programs
work with the pure sequences of bytes, without performing any encoding
conversion.

>>I do not really understand why a Euro sign would work with 8859-1 since
>>that does not contain that character as far as I am aware of.


SORRY SORRY SORRY SORRY SORRY
I tried to insert (through JDBC) the EURO character in a DB2
configured with
....
Database territory = C
Database code page = 819
Database code set = ISO8859-1
....
and I can't neither write nor read in Java the EURO character
correctly :-(
A COBOL program works instead correctly.

Then I tried the same thing on a SQL-Server 2000 instance with
collation compatibility_51_409_30003 (correponding to a 1252 codepage,
i.e. Latin 1) and I can store and read the EURO character via
Java&JDBC.

That doesn't work in Java with Oracle 10g configured with
....
NLS_LANGUAGE = AMERICAN
NLS_TERRITORY = AMERICA
NLS_CHARACTERSET = US7ASCII
NLS_LENGTH_SEMANTICS = BYTE
....
store&read through COBOL is ok, and in Java I can even write&read
accented vowels... even if those characters are outside USASCII7...

>You could do an experiment. Try feeding your database all possible
>unicode chars in a set of 1-char records, and see which ones come back
>unmangled. This is a kludge, but you could preconvert your Euro to
>one of those invariant unused chars.

The EURO character is just an example and part of the problem, I can't
use this type of kludges.
The specific problem is much more complex: a password is crypted and
stored to DB with a C program but the crypted chars fall outside
IBM850 range and in Java I'm unable to read and decrypt back the
string... this works if the database is ISO-8859-1 (that's why I
though I were able to write another 'weird' char, the euro char, on an
ISO-8859-1 DB, sorry...). I've also the more general problem of data
entry: I don't know wich characters users will insert so I can't
substitute chars.
I've found a workaround for my crypting problem but I'm just trying to
understand the reason of the problem.

Now it's clear to me that with a CHAR field Java performs an encoding
conversion using the encodings of the JVM and of the DBMS: if some
characters fall outside the destination encoding then they are lost
(i.e. converted in something completely different).
The only 'mysterious' thing for me now is the behavior on Oracle (JDBC
can read&write accented vowels even if they are outside ascii7)... any
idea? Maybe the Oracle driver is smarter than the DB2 Universal
Driver...

Thanks everyone,
Andrea
Reply With Quote
  #19  
Old 02-14-2008, 02:11 PM
Lothar Kimmeringer
Guest
 
Default Re: Encoding conversion problem

Andrea wrote:

>> What is source? How do you create that from the JDBC-
>> resultset?

>
> I tried:
> InputStream source = new
> ByteArrayInputStream(stringFetchedFromDB.getBytes( ));


getBytes() uses the system-encoding for generating the
byte-array. Why do you generate an InputStream anyway?

What you want to do is
OutputStreamWriter osw = new OutputStreamWriter(
output, "8859_1");
osw.write(resultset.getString("mycolumn"));

BTW: The Euro is not part of ISO-8859-1, so it will get
lost that way anyway.


Regards, Lothar
--
Lothar Kimmeringer E-Mail: spamfang@kimmeringer.de
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
Reply With Quote
  #20  
Old 02-14-2008, 03:44 PM
Silvio Bierman
Guest
 
Default Re: Encoding conversion problem

Andrea wrote:
> Hi everyone,
> sorry for my previous double-post (a mistake).
>
>> Is is possible to ask the database driver to do the conversions for
>> you? Perhaps internally it is Unicode or some other encoding that can
>> deal with Euros.

> I've checked the properties of the JDBC driver I use (http://
> publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/
> com.ibm.db2.udb.doc/ad/rjvdsprp.htm) but there's nothing concerning
> encoding conversions.
>
>> We have the clue that C++ programs seem to store euro s and get them back out.

> Yes we have C and COBOL programs that can store and write non-IBM850
> chars without problems too.
> As pointed out by Sabine in her post the reason may be that C programs
> work with the pure sequences of bytes, without performing any encoding
> conversion.
>
>>> I do not really understand why a Euro sign would work with 8859-1 since
>>> that does not contain that character as far as I am aware of.

>
> SORRY SORRY SORRY SORRY SORRY
> I tried to insert (through JDBC) the EURO character in a DB2
> configured with
> ...
> Database territory = C
> Database code page = 819
> Database code set = ISO8859-1
> ...
> and I can't neither write nor read in Java the EURO character
> correctly :-(
> A COBOL program works instead correctly.
>
> Then I tried the same thing on a SQL-Server 2000 instance with
> collation compatibility_51_409_30003 (correponding to a 1252 codepage,
> i.e. Latin 1) and I can store and read the EURO character via
> Java&JDBC.
>
> That doesn't work in Java with Oracle 10g configured with
> ...
> NLS_LANGUAGE = AMERICAN
> NLS_TERRITORY = AMERICA
> NLS_CHARACTERSET = US7ASCII
> NLS_LENGTH_SEMANTICS = BYTE
> ...
> store&read through COBOL is ok, and in Java I can even write&read
> accented vowels... even if those characters are outside USASCII7...
>
>> You could do an experiment. Try feeding your database all possible
>> unicode chars in a set of 1-char records, and see which ones come back
>> unmangled. This is a kludge, but you could preconvert your Euro to
>> one of those invariant unused chars.

> The EURO character is just an example and part of the problem, I can't
> use this type of kludges.
> The specific problem is much more complex: a password is crypted and
> stored to DB with a C program but the crypted chars fall outside
> IBM850 range and in Java I'm unable to read and decrypt back the
> string... this works if the database is ISO-8859-1 (that's why I
> though I were able to write another 'weird' char, the euro char, on an
> ISO-8859-1 DB, sorry...). I've also the more general problem of data
> entry: I don't know wich characters users will insert so I can't
> substitute chars.
> I've found a workaround for my crypting problem but I'm just trying to
> understand the reason of the problem.
>
> Now it's clear to me that with a CHAR field Java performs an encoding
> conversion using the encodings of the JVM and of the DBMS: if some
> characters fall outside the destination encoding then they are lost
> (i.e. converted in something completely different).
> The only 'mysterious' thing for me now is the behavior on Oracle (JDBC
> can read&write accented vowels even if they are outside ascii7)... any
> idea? Maybe the Oracle driver is smarter than the DB2 Universal
> Driver...
>
> Thanks everyone,
> Andrea



Hello Andrea,

Even if you set a database encoding to ASCII it is very unlikely that
the DB will strip non-ASCII characters. Actually, most databases treat
every byte-size (ie 8-bit) encoding almost identically internally. They
may sometimes have different default collations but that is about it.
The codepage attribute is mostly important for programs interfacing with
the DB. As most of those (especially older ones) are encoding unaware
also bytes pass in and out inharmed. In the end all 8-bit encodings are
equal until actually interpreted to represent characters, aren't they?

I have seen application running on cp-1252 platforms using 8859-1
encoded databases for years without anyone noticing. Same for cp-1257 on
a cp-1252 database. Nobody realy cares when the same data that was put
in comes out again.

This is not unlike SMTP which is supposed to be 7-bit only but since the
transport encoding passes 8-bit characters freely people are used to
sending non-ascii characters in plain-text emails although this is not
supported. This all works great until someone from Lithuania sends me an
email (I am in the Netherlands).

Regards,

Silvio
Reply With Quote
Reply


Thread Tools
Display Modes


All times are GMT -5. The time now is 06:08 PM.


Powered by vBulletin® Version 3.7.2
Copyright ©2000 - 2009, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
vB Ad Management by =RedTyger=

In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.