|
#1
| Hi,

I am creating a PDF from the output received from a servlet. The output contains special Swiss/German accented characters. One such character is received from the servlet as &#E1;, for which the Unicode escape is "\u00E1".

What I do is replace all occurrences of &#E1; with \u00E1, and the character then displays properly. However, this only works on Windows. When I do the same thing on Linux machines, I get garbage characters, which look like they come from the KOI8 character set.

Can anyone help me, please? |
|
#3
| winlin wrote:
> Hi,
>
> I am creating a PDF from the output received from servlet. There are
> special swiss/german accented characters.
> One such character's output from servlet is received as &#E1; for
> which the unicode is "\u00E1".
>
> What I do here, is replace all the occurrences of &#E1; to \u00E1 and
> thus, it displays properly.
> However, this only happens on windows. When I try to do the same thing
> on Linux machines, it gives me garbage characters.
> Those garbage characters look like from the KOI8 character set.
>
> Can anyone help me please?

Without more details, I can only guess that maybe your Linux box does not have the right locales installed.

--
Sabine Dinis Blochberger
Op3racional
www.op3racional.eu |
|
#4
| On Wed, 27 Aug 2008 23:10:23 -0700 (PDT), winlin <bg.winlin@gmail.com> wrote, quoted or indirectly quoted someone who said :
>Can anyone help me please?

For background, see http://mindprod.com/jgloss/encoding.html

--
Roedy Green
Canadian Mind Products
The Java Glossary http://mindprod.com |
|
#5
| On Aug 28, 3:45 pm, Sabine Dinis Blochberger <no.s...@here.invalid> wrote:
> winlin wrote:
> > Hi,
> >
> > I am creating a PDF from the output received from servlet. There are
> > special swiss/german accented characters.
> > One such character's output from servlet is received as &#E1; for
> > which the unicode is "\u00E1".
> >
> > What I do here, is replace all the occurrences of &#E1; to \u00E1 and
> > thus, it displays properly.
> > However, this only happens on windows. When I try to do the same thing
> > on Linux machines, it gives me garbage characters.
> > Those garbage characters look like from the KOI8 character set.
> >
> > Can anyone help me please?
>
> Without more details, I can only guess that maybe your Linux box does
> not have the right locales installed.
> --
> Sabine Dinis Blochberger
>
> Op3racional www.op3racional.eu

Hi,

I checked with the locale command, and all the locales appear to be installed. The locale command shows the LANG environment variable set to 'en_US.UTF-8'. I tried changing it to de_DE.UTF-8, with no success. If you need any other details, please let me know.

# output for locale -a (gives all the locales installed) af_ZA af_ZA.iso88591 an_ES an_ES.iso885915 ar_AE ar_AE.iso88596 ar_AE.utf8 ar_BH ar_BH.iso88596 ar_BH.utf8 ar_DZ ar_DZ.iso88596 ar_DZ.utf8 ar_EG ar_EG.iso88596 ar_EG.utf8 ar_IN ar_IN.utf8 ar_IQ ar_IQ.iso88596 ar_IQ.utf8 ar_JO ar_JO.iso88596 ar_JO.utf8 ar_KW ar_KW.iso88596 ar_KW.utf8 ar_LB ar_LB.iso88596 ar_LB.utf8 ar_LY ar_LY.iso88596 ar_LY.utf8 ar_MA ar_MA.iso88596 ar_MA.utf8 ar_OM ar_OM.iso88596 ar_OM.utf8 ar_QA ar_QA.iso88596 ar_QA.utf8 ar_SA ar_SA.iso88596 ar_SA.utf8 ar_SD ar_SD.iso88596 ar_SD.utf8 ar_SY ar_SY.iso88596 ar_SY.utf8 ar_TN ar_TN.iso88596 ar_TN.utf8 ar_YE ar_YE.iso88596 ar_YE.utf8 be_BY be_BY.cp1251 be_BY.utf8 bg_BG bg_BG.cp1251 bg_BG.utf8 bokmal bokmål br_FR br_FR.iso88591 bs_BA bs_BA.iso88592 C ca_ES ca_ES@euro ca_ES.iso88591 ca_ES.iso885915@euro ca_ES.utf8 ca_ES.utf8@euro catalan croatian cs_CZ cs_CZ.iso88592 cs_CZ.utf8 cy_GB cy_GB.iso885914 cy_GB.utf8 czech da_DK da_DK.iso88591 da_DK.iso885915 da_DK.utf8 danish dansk de_AT de_AT@euro de_AT.iso88591 de_AT.iso885915@euro de_AT.utf8 de_AT.utf8@euro de_BE de_BE@euro de_BE.iso88591 de_BE.iso885915@euro de_BE.utf8 de_BE.utf8@euro de_CH de_CH.iso88591 de_CH.utf8 de_DE de_DE@euro de_DE.iso88591 de_DE.iso885915@euro de_DE.utf8 de_DE.utf8@euro de_LU de_LU@euro de_LU.iso88591 de_LU.iso885915@euro de_LU.utf8 de_LU.utf8@euro deutsch dutch eesti el_GR el_GR.iso88597 el_GR.utf8 en_AU en_AU.iso88591 en_AU.utf8 en_BW en_BW.iso88591 en_BW.utf8 en_CA en_CA.iso88591 en_CA.utf8 en_DK en_DK.iso88591 en_DK.utf8 en_GB en_GB.iso88591 en_GB.iso885915 en_GB.utf8 en_HK en_HK.iso88591 en_HK.utf8 en_IE en_IE@euro en_IE.iso88591 en_IE.iso885915@euro en_IE.utf8 en_IE.utf8@euro en_IN en_IN.utf8 en_NZ en_NZ.iso88591 en_NZ.utf8 en_PH en_PH.iso88591 en_PH.utf8 en_SG en_SG.iso88591 en_SG.utf8 en_US en_US.iso88591 en_US.iso885915 en_US.utf8 en_ZA en_ZA.iso88591 en_ZA.utf8 en_ZW en_ZW.iso88591 en_ZW.utf8 es_AR es_AR.iso88591 es_AR.utf8 es_BO es_BO.iso88591 es_BO.utf8 es_CL 
es_CL.iso88591 es_CL.utf8 es_CO es_CO.iso88591 es_CO.utf8 es_CR es_CR.iso88591 es_CR.utf8 es_DO es_DO.iso88591 es_DO.utf8 es_EC es_EC.iso88591 es_EC.utf8 es_ES es_ES@euro es_ES.iso88591 es_ES.iso885915@euro es_ES.utf8 es_ES.utf8@euro es_GT es_GT.iso88591 es_GT.utf8 es_HN es_HN.iso88591 es_HN.utf8 es_MX es_MX.iso88591 es_MX.utf8 es_NI es_NI.iso88591 es_NI.utf8 es_PA es_PA.iso88591 es_PA.utf8 es_PE es_PE.iso88591 es_PE.utf8 es_PR es_PR.iso88591 es_PR.utf8 es_PY es_PY.iso88591 es_PY.utf8 es_SV es_SV.iso88591 es_SV.utf8 estonian es_US es_US.iso88591 es_US.utf8 es_UY es_UY.iso88591 es_UY.utf8 es_VE es_VE.iso88591 es_VE.utf8 et_EE et_EE.iso88591 et_EE.utf8 eu_ES eu_ES@euro eu_ES.iso88591 eu_ES.iso885915@euro eu_ES.utf8 eu_ES.utf8@euro fa_IR fa_IR.utf8 fi_FI fi_FI@euro fi_FI.iso88591 fi_FI.iso885915@euro fi_FI.utf8 fi_FI.utf8@euro finnish fo_FO fo_FO.iso88591 fo_FO.utf8 français fr_BE fr_BE@euro fr_BE.iso88591 fr_BE.iso885915@euro fr_BE.utf8 fr_BE.utf8@euro fr_CA fr_CA.iso88591 fr_CA.utf8 fr_CH fr_CH.iso88591 fr_CH.utf8 french fr_FR fr_FR@euro fr_FR.iso88591 fr_FR.iso885915@euro fr_FR.utf8 fr_FR.utf8@euro fr_LU fr_LU@euro fr_LU.iso88591 fr_LU.iso885915@euro fr_LU.utf8 fr_LU.utf8@euro ga_IE ga_IE@euro ga_IE.iso88591 ga_IE.iso885915@euro ga_IE.utf8 ga_IE.utf8@euro galego galician german gl_ES gl_ES@euro gl_ES.iso88591 gl_ES.iso885915@euro gl_ES.utf8 gl_ES.utf8@euro greek gv_GB gv_GB.iso88591 gv_GB.utf8 hebrew he_IL he_IL.iso88598 he_IL.utf8 hi_IN hi_IN.utf8 hr_HR hr_HR.iso88592 hr_HR.utf8 hrvatski hu_HU hu_HU.iso88592 hu_HU.utf8 hungarian icelandic id_ID id_ID.iso88591 id_ID.utf8 is_IS is_IS.iso88591 is_IS.utf8 italian it_CH it_CH.iso88591 it_CH.utf8 it_IT it_IT@euro it_IT.iso88591 it_IT.iso885915@euro it_IT.utf8 it_IT.utf8@euro iw_IL iw_IL.iso88598 iw_IL.utf8 ja_JP ja_JP.eucjp ja_JP.ujis ja_JP.utf8 japanese japanese.euc ka_GE ka_GE.georgianps kl_GL kl_GL.iso88591 kl_GL.utf8 ko_KR ko_KR.euckr ko_KR.utf8 korean korean.euc kw_GB kw_GB.iso88591 kw_GB.utf8 lithuanian lo_LA 
lo_LA.utf8 lt_LT lt_LT.iso885913 lt_LT.utf8 lv_LV lv_LV.iso885913 lv_LV.utf8 mi_NZ mi_NZ.iso885913 mk_MK mk_MK.iso88595 mk_MK.utf8 mr_IN mr_IN.utf8 ms_MY ms_MY.iso88591 ms_MY.utf8 mt_MT mt_MT.iso88593 mt_MT.utf8 nb_NO nb_NO.ISO-8859-1 nl_BE nl_BE@euro nl_BE.iso88591 nl_BE.iso885915@euro nl_BE.utf8 nl_BE.utf8@euro nl_NL nl_NL@euro nl_NL.iso88591 nl_NL.iso885915@euro nl_NL.utf8 nl_NL.utf8@euro nn_NO nn_NO.iso88591 nn_NO.utf8 no_NO no_NO.iso88591 no_NO.utf8 norwegian nynorsk oc_FR oc_FR.iso88591 pl_PL pl_PL.iso88592 pl_PL.utf8 polish portuguese POSIX pt_BR pt_BR.iso88591 pt_BR.utf8 pt_PT pt_PT@euro pt_PT.iso88591 pt_PT.iso885915@euro pt_PT.utf8 pt_PT.utf8@euro romanian ro_RO ro_RO.iso88592 ro_RO.utf8 ru_RU ru_RU.iso88595 ru_RU.koi8r ru_RU.utf8 russian ru_UA ru_UA.koi8u ru_UA.utf8 se_NO se_NO.utf8 sk_SK sk_SK.iso88592 sk_SK.utf8 slovak slovene slovenian sl_SI sl_SI.iso88592 sl_SI.utf8 spanish sq_AL sq_AL.iso88591 sq_AL.utf8 sr_YU sr_YU@cyrillic sr_YU.iso88592 sr_YU.iso88595@cyrillic sr_YU.utf8 sr_YU.utf8@cyrillic sv_FI sv_FI@euro sv_FI.iso88591 sv_FI.iso885915@euro sv_FI.utf8 sv_FI.utf8@euro sv_SE sv_SE.iso88591 sv_SE.iso885915 sv_SE.utf8 swedish ta_IN ta_IN.utf8 te_IN te_IN.utf8 tg_TJ tg_TJ.koi8t thai th_TH th_TH.tis620 th_TH.utf8 tl_PH tl_PH.iso88591 tr_TR tr_TR.iso88599 tr_TR.utf8 turkish uk_UA uk_UA.koi8u uk_UA.utf8 ur_PK ur_PK.utf8 uz_UZ uz_UZ.iso88591 vi_VN vi_VN.tcvn vi_VN.utf8 wa_BE wa_BE@euro wa_BE.iso88591 wa_BE.iso885915@euro yi_US yi_US.cp1255 zh_CN zh_CN.gb18030 zh_CN.gb2312 zh_CN.gbk zh_CN.utf8 zh_HK zh_HK.big5hkscs zh_HK.utf8 zh_TW zh_TW.big5 zh_TW.euctw zh_TW.utf8 |
|
#6
| winlin @ Thursday 28 August 2008 08:10:
> Hi,
>
> I am creating a PDF from the output received from servlet. There are
> special swiss/german accented characters.
> One such character's output from servlet is received as &#E1; for
> which the unicode is "\u00E1".
>
> What I do here, is replace all the occurrences of &#E1; to \u00E1 and
> thus, it displays properly.
> However, this only happens on windows. When I try to do the same thing
> on Linux machines, it gives me garbage characters.
> Those garbage characters look like from the KOI8 character set.
>
> Can anyone help me please?

Technically, KOI-8 isn't a character set; it's a character encoding. But I assume what you mean is that Cyrillic characters appear in the output. Since the KOI-8 encoding (as well as Windows-1251, BTW) maps codepoints to Cyrillic characters that in Unicode (and ISO-Latin1 et al.) are mapped to the accented characters you want, it seems likely that whatever it is you're using to generate the PDFs gets confused about what encoding is in effect. Maybe you could tell us what PDF generator you're using.

m. |
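The point about one byte value meaning different characters in different encodings can be checked directly. The following is only a minimal sketch, not something from the thread: the class and method names are made up for illustration, and it assumes a Java 8 or later runtime on which the KOI8-R and windows-1251 charsets are available (unavailable ones are skipped).

```java
import java.nio.charset.Charset;

final class SingleByteDecodings {

    /** Decodes the single byte value b using the named charset. */
    static String decode(int b, String charsetName) {
        byte[] bytes = { (byte) b };
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Formats the first code point of s in the usual U+XXXX notation. */
    static String codePoint(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // The same byte, 0xE1, interpreted under three single-byte encodings.
        String[] names = { "ISO-8859-1", "KOI8-R", "windows-1251" };
        for (String name : names) {
            if (Charset.isSupported(name)) {
                // ISO-8859-1 gives U+00E1 (a with acute); KOI8-R gives U+0410 and
                // windows-1251 gives U+0431, both Cyrillic letters.
                System.out.println(name + ": 0xE1 -> " + codePoint(decode(0xE1, name)));
            } else {
                System.out.println(name + ": not available on this JVM");
            }
        }
    }
}
```

If the garbage in the PDF matches the KOI8-R or windows-1251 result, that would suggest a correct byte is being interpreted under the wrong encoding somewhere, rather than the byte itself being wrong.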
|
#7
| On 28 Aug, 07:10, winlin <bg.win...@gmail.com> wrote:
> I am creating a PDF from the output received from servlet. There are
> special swiss/german accented characters.
> One such character's output from servlet is received as &#E1; for
> which the unicode is "\u00E1".

The "Unicode" for this is just U+00E1, any way you wish to represent it. The difference between the two examples you quote is that they're different syntactic representations of this same Unicode character, one for SGML / HTML / XML and the other for Java.

There's also the question of "encoding": how to represent characters as a stream of bytes or octets. This isn't a problem here because both of the forms you describe use a syntactic escaping mechanism at an even higher level, such that Unicode characters can be represented in an encoding (like ASCII's encoding) that doesn't support that character.

It's good practice to use UTF-8 encoding throughout (if you can enforce it on the rest of the team, stuff starts to "just work"). However, this isn't always permissible, owing to limitations of some tools. Java properties files are one example.

HTML and Java always use Unicode characters for these numeric entities, no matter what the encoding.

> What I do here, is replace all the occurrences of &#E1; to \u00E1 and
> thus, it displays properly.

That's to go from HTML to Java. Same character set (i.e. Unicode), and the overall encoding doesn't matter because you're not dependent on it (while these characters are wrapped up as numeric entities).

> However, this only happens on windows. When I try to do the same thing
> on Linux machines, it gives me garbage characters.

That sounds like you're generating the right content, with correctly encoded characters (probably as UTF-8), but the servlet is mislabelling this encoding as something else. Very easily done, and the most common error of this type. However, your particular results would suggest that it is being mislabelled as KOI, which sounds unlikely.

Alternatively, the encoding process is broken (rare, but possible). Your Unicode characters are being pulled out of their safe references and converted to encoded characters, which are then getting mangled. When looked at as UTF-8, their mangled remains look like a radically different set of characters, i.e. KOI.

It's hard to diagnose this stuff. Really what you need is a clear understanding of the concepts and of your workflow, then to check each step and ensure that it's valid in each intermediate format (i.e. the encoding used to create the content always matches the encoding used to read it). Life is also simpler if you ignore ISO-8859-* in favour of consistent Unicode / UTF-8 throughout. Wikipedia is quite readable on these topics. |
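Checking each step, as suggested above, is easier with a way to see exactly which code points a String holds and which bytes it becomes. The following is a minimal sketch under assumptions: the helper names are hypothetical, it needs Java 7 or later for StandardCharsets, and it only illustrates the mislabelling case (bytes written in one encoding and read back in another), not the actual servlet or PDF code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingTrace {

    /** Lists the code points of s as "U+XXXX U+XXXX ...". */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    /** Lists the bytes as two-digit hex values separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Simulates mislabelling: text is encoded with one charset and decoded with another. */
    static String mislabel(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String original = "\u00E1";
        System.out.println("code points : " + codePoints(original));                                // U+00E1
        System.out.println("UTF-8 bytes : " + hex(original.getBytes(StandardCharsets.UTF_8)));      // C3 A1
        String wrong = mislabel(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("read as Latin-1 : " + codePoints(wrong));                               // U+00C3 U+00A1
    }
}
```

Logging the output of codePoints and hex at each stage of the workflow (servlet response, after replacement, just before the PDF library call) should show at which step the data stops being U+00E1.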
|
#8
| On Aug 29, 2:14 pm, magloca <magl...@mailinater.com> wrote:
> winlin @ Thursday 28 August 2008 08:10:
>
> > Hi,
> >
> > I am creating a PDF from the output received from servlet. There are
> > special swiss/german accented characters.
> > One such character's output from servlet is received as &#E1; for
> > which the unicode is "\u00E1".
> >
> > What I do here, is replace all the occurrences of &#E1; to \u00E1 and
> > thus, it displays properly.
> > However, this only happens on windows. When I try to do the same thing
> > on Linux machines, it gives me garbage characters.
> > Those garbage characters look like from the KOI8 character set.
> >
> > Can anyone help me please?
>
> Technically, KOI-8 isn't a character set; it's a character encoding. But
> I assume what you mean is that Cyrillic characters appear in the
> output. Since the KOI-8 encoding (as well as Windows-1251, BTW) maps
> codepoints to Cyrillic characters that in Unicode (and ISO-Latin1 et
> al.) are mapped to the accented characters you want, it seems likely
> that whatever it is you're using to generate the PDFs gets confused
> about what encoding is in effect. Maybe you could tell us what PDF
> generator you're using.
>
> m.

Hi All,

First of all, thank you all for the effort you are taking to help me out.

I have reduced the problem to a small program, which gives different output on Windows and Linux (both running the same Java version, 1.4.2_16). On Windows the output shows the actual character as expected; on Linux it shows garbage, probably decoded as KOI8-R. Please see if the program helps you get to the bottom of the problem.

I also read in the documentation of Character (version 5.0) that String and char arrays use UTF-16 encoding, and I hope that is not a problem.

import java.io.UnsupportedEncodingException;

public class TestCharSet
{
    /**
     * Default Constructor
     */
    public TestCharSet()
    {
        super();
    }

    /**
     * @param args
     * @throws UnsupportedEncodingException
     */
    public static void main( String[] args ) throws UnsupportedEncodingException
    {
        //System.out.println("Special Characters:" + "á â ä è é ê ë ï ò ó ö ú ü") ;
        String stateProvince = "&#xE1;"; //This is the character á
        System.out.println("State Province before conversion : " + stateProvince) ;
        String stateProvince_post = unescapeXMLSpecialCharacters( stateProvince );
        System.out.println("State Province after conversion : " + stateProvince_post) ;
    }

    /**
     * Replaces all occurrences of the substring in the data string with the
     * replacement string.
     *
     * @param data the string to check.
     * @param substring the substring to replace.
     * @param replacement the string the substring is replaced with.
     * @return the result of the replacement(s).
     */
    // @PMD:REVIEWED:AvoidReassigningParameters: Bajrang Gupta
    private static String replace( String data, final String substring, final String replacement )
    {
        int index = data.indexOf( substring, 0 ) ;
        while ( index >= 0 )
        {
            data = data.substring( 0, index ) + replacement + data.substring( index + substring.length() ) ;
            // Continue searching after the text that was just inserted.
            index += replacement.length() ;
            index = data.indexOf( substring, index ) ;
        }
        return data ;
    }

    /**
     * Replaces the numeric character reference for a-acute with the
     * character U+00E1 itself.
     *
     * @param xmlData the data string received from the servlet.
     * @return the data string with the reference replaced.
     */
    public static String unescapeXMLSpecialCharacters( String xmlData ) throws UnsupportedEncodingException
    {
        xmlData = replace( xmlData, "&#xE1;", "\u00E1" ) ;
        return xmlData ;
    }
} |
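One thing that may be worth ruling out with a program like this is the console output step itself: System.out encodes characters using the platform default encoding, which can differ between a Windows and a Linux machine even for the same Java version. The sketch below is an assumption-laden illustration, not a diagnosis of the poster's system: the class and helper names are made up, and Charset.defaultCharset() requires Java 5 or later (on 1.4.2 only the file.encoding property is available).

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

final class ConsoleEncodingCheck {

    /** Returns the bytes a PrintStream emits for text in the named encoding. */
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    /** Lists the bytes as two-digit hex values separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // What the JVM believes the platform encoding is.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // The same character produces different bytes depending on the stream's encoding.
        String text = "\u00E1";
        System.out.println("as UTF-8      : " + hex(printedBytes(text, "UTF-8")));
        System.out.println("as ISO-8859-1 : " + hex(printedBytes(text, "ISO-8859-1")));

        // Printing through a stream with an explicit encoding removes the platform dependency.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("State Province after conversion : " + text);
    }
}
```

If the two machines report different values for file.encoding, the difference in console output may come from the printing step and the terminal, not from the replacement logic. Whether the same applies to the PDF path depends on how the wrapper converts String to byte[], which the thread does not show.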
|
#9
| On 29 Aug, 14:36, winlin <bg.win...@gmail.com> wrote:
>         String stateProvince = "&#xE1;"; //This is the character á

It isn't.

It's an SGML(and friends)-only numeric entity that represents that character in an SGML context. It makes no sense in Java or text files (it's valid, it just doesn't mean anything).

It also _only_ works as "&#xE1;" in SGML. It doesn't work as "&amp;#xE1;" any more (that should render as the literal string "&#xE1;"). If you have "SGML" content that uses entities, then you not only shouldn't but in fact must not run an SGML-entity-encoder over it. Entity encoding like this isn't idempotent. Do it to things that are either already entity-encoded, or are a deliberate use of entities, and you'll break stuff (the symptom is that you see "entities" appearing in the browser). |
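The non-idempotence described above is easy to reproduce. This is a minimal sketch with a deliberately tiny, hypothetical escaper (it is not the encoder used by the poster's servlet, which the thread does not show); it only demonstrates what happens when already-encoded text is encoded again.

```java
final class EntityEscapeTwice {

    /**
     * Minimal markup escaper for illustration only: escapes the ampersand first,
     * then the angle brackets.
     */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String alreadyEncoded = "&#xE1;";           // a reference a browser would render as one character
        String encodedAgain = escape(alreadyEncoded);
        System.out.println(encodedAgain);           // prints &amp;#xE1;

        // A browser decodes one level only, so the reader now sees the literal
        // text of the reference instead of the accented character.
    }
}
```

Applying escape to content exactly once, at the point where raw text enters the markup, avoids this.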
|
#10
| On Aug 29, 7:30 pm, Andy Dingley <ding...@codesmiths.com> wrote:
> On 29 Aug, 14:36, winlin <bg.win...@gmail.com> wrote:
>
> >         String stateProvince = "&#xE1;"; //This is the character á
>
> It isn't.
>
> It's an SGML(and friends)-only numeric entity that represents that
> character in an SGML context. It makes no sense in Java or text files
> (it's valid, it just doesn't mean anything).
>
> It also _only_ works as "&#xE1;" in SGML. It doesn't work as
> "&amp;#xE1;" any more (that should render to the literal string
> "&#xE1;"). If you have "SGML" content that uses entities, then you
> not only shouldn't but in fact must not run an SGML-entity-encoder
> over them. Entity encoding like this isn't idempotent. Do it to
> things that are either already entity-encoded, or are a deliberate use
> of entities, and you'll break stuff (Symptom is that you see
> "entities" appearing in the browser).

Hi Andy,

I understand that &#xE1; has no meaning in Java and that it's an SGML entity. However, the SGML entity is generated by a servlet, and I need to pass it to our custom PDFWrapper over pdflib (version 7.0.2). The wrapper parses the content sent by the servlet and displays it in the PDF (using content type application/pdf and writing it to the output stream). The whole conversion process is String and byte[] based. Thus, if I do not replace &#xE1; with its Unicode equivalent, the PDF shows the literal text &#xE1;, WHICH IS UNDESIRABLE.

So I replace each String matching &#xE1; with \u00E1, which should display the proper character. However, this behaves differently on different platforms, which is the problem I have tried to summarise in the small program above.

Please let me know if you were able to get the problem.

Cheers |
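Since the servlet output may contain more than one accented character, replacing references one at a time does not scale. A general decoder for numeric character references could look like the sketch below. It is an illustration under assumptions: the class name is made up, it assumes Java 5 or later (for Character.toChars), it handles only the hexadecimal form and the decimal form, and it leaves named entities and malformed references (such as one without the x prefix before hex digits) untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntityDecoder {

    // Matches hexadecimal references (group 1) and decimal references (group 2).
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    /** Replaces every numeric character reference in input with the character it names. */
    static String decode(String input) {
        Matcher m = REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Values outside the Unicode range are left exactly as they were.
            String replacement = Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String fromServlet = "Z&#xFC;rich, Gen&#232;ve";
        String decoded = decode(fromServlet);
        // Compare against escaped literals so the check does not depend on console encoding.
        System.out.println(decoded.equals("Z\u00FCrich, Gen\u00E8ve"));   // prints true
    }
}
```

The decoded String is encoding-independent inside the JVM; whichever step later converts it to byte[] for the PDF library still has to name an encoding explicitly for the result to be the same on every platform.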