Reading XML Encoding errors

AGP

I am programming an XML reader in VB.NET 2005 and it works fairly well.
Once in a while, though, I encounter an old XML file without the header
<?xml version="1.0" encoding="UTF-8"?>
It fails on the Load with an error similar to "Invalid character in the
given encoding. Line 3, position 5475070".
After some research, the character in question turns out to be the copyright
character. My question is: how can I force the reader to assume UTF-8?
My newer files do not seem to have this problem, just my older files. I want
to be able to catch this error and then attempt to load the file again. The
older file also does not have a BOM, so I'm assuming the XML reader has no
idea how to interpret it. I'm hoping I can force a UTF-8 read of the XML file.

As a secondary question, it seems like these older XML files were
originally written out as one or two huge lines. Is there a way to output a
copy that is more readable, indented node by node with line breaks?

Thanks for any help
AGP
 
Herfried K. Wagner [MVP]

AGP said:
Once in a while though I encounter an old XML file without the header
<?xml version="1.0" encoding="UTF-8"?>
It craps out on the Load with an error similar to "Invalid character in
the given encoding. Line 3, position 5475070".
After some research the character in question is the copyright character.
My question is how can i force the reader to assume UTF-8?
It seems like my other newer files do not have this problem, just my older
files. I want to be able to catch this error
and then attempt to load the file. It also seems like this older file does
not have a BOM so Im assuming the XML reader has no idea how to interpret
it. Im hoping i can force a UTF-8 read of the XML file.

IIRC UTF-8 is the default encoding for XML files.
 
Martin Honnen

AGP said:
I am programming an XML reader in VB.NET 2005 and it works fairly well.
Once in a while though I encounter an old XML file without the header
<?xml version="1.0" encoding="UTF-8"?>
It craps out on the Load with an error similar to "Invalid character in the
given encoding. Line 3, position 5475070".
After some research the character in question is the copyright character. My
question is how can i force the reader to assume UTF-8?
It seems like my other newer files do not have this problem, just my older
files. I want to be able to catch this error
and then attempt to load the file. It also seems like this older file does
not have a BOM so Im assuming the XML reader has no idea how to interpret
it. Im hoping i can force a UTF-8 read of the XML file.


Without a BOM, without an XML declaration (and without any protocol like
HTTP declaring an encoding), an XML parser assumes UTF-8, so I doubt that
"enforcing" UTF-8 solves the problem.
You will need to find out which encoding those XML documents have; then
you can "enforce" it, for instance with
Using Reader As XmlReader = XmlReader.Create( _
    New StreamReader("file.xml", Encoding.GetEncoding("encoding-name")))
    ' read from Reader here
End Using
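
The StreamReader approach can be combined with the original wish to catch the error and then retry. The following is only a minimal sketch under assumptions: the helper name LoadXmlWithFallback is hypothetical, the fallback encoding is whatever the caller guesses (Windows-1252 is a common guess for old Windows files), and it targets the .NET Framework, where code page 1252 is available without extra setup.

```vbnet
Imports System.IO
Imports System.Text
Imports System.Xml

Module XmlFallbackLoader
    ' Hypothetical helper (not from the thread): load normally first, and only
    ' if the parser rejects the file, retry with an explicitly chosen encoding.
    Function LoadXmlWithFallback(ByVal fileName As String, ByVal fallback As Encoding) As XmlDocument
        Dim doc As New XmlDocument()
        Try
            ' Normal path: the parser picks the encoding from the BOM or the
            ' XML declaration, and assumes UTF-8 when neither is present.
            doc.Load(fileName)
        Catch ex As XmlException
            ' Retry through a StreamReader: the reader decodes the bytes with
            ' the fallback encoding, so the parser no longer assumes UTF-8.
            doc = New XmlDocument()
            Using reader As New StreamReader(fileName, fallback)
                doc.Load(reader)
            End Using
        End Try
        Return doc
    End Function
End Module
```

A call might look like LoadXmlWithFallback("file.xml", Encoding.GetEncoding(1252)). Note that the Catch also fires for ordinary well-formedness errors; in that case the second Load simply throws again, so real syntax errors are not hidden. The original file on disk is never modified.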

As a secondary question, it seems like these older XML files were
originally written out as one or two huge lines. is there a way to output a
copy
that is more user readable in the node-type format with line breaks and all?

Yes, that should be possible by passing an XmlReader that reads the XML to
the WriteNode method of an XmlWriter created with XmlWriterSettings where
Indent = True.
So the code is roughly (with Imports System.Xml at the top of the file):
Dim WriterSettings As New XmlWriterSettings()
WriterSettings.Indent = True
Using Reader As XmlReader = XmlReader.Create("input.xml")
    Using Writer As XmlWriter = XmlWriter.Create("output.xml", WriterSettings)
        ' Copy every node from the reader to the writer; the writer adds
        ' the line breaks and indentation.
        Writer.WriteNode(Reader, False)
    End Using
End Using
 
AGP

Herfried K. Wagner said:
IIRC UTF-8 is the default encoding for XML files.

OK, so then why do I get the error? As a test I loaded the XML into Notepad
and saved it as text with UTF-8 encoding, and that file is read correctly.
So my question is: why does the original not load properly?

AGP
 
AGP

Martin Honnen said:
Without a BOM, without an XML declaration (and without any protocol like
HTTP declaring an encoding), an XML parser assumes UTF-8, so I doubt that
"enforcing" UTF-8 solves the problem.
You will need to find out which encoding those XML documents have; then
you can "enforce" it, for instance with
Using Reader As XmlReader = XmlReader.Create( _
    New StreamReader("file.xml", Encoding.GetEncoding("encoding-name")))

The encoding is UTF-8. The only "non-standard" character in the file is the
copyright character.
I've loaded the file into Notepad and saved it as text with UTF-8 encoding,
and the resulting file loads just fine. So I'm missing something and I'm not
really sure what.

Martin Honnen said:
Yes, should be possible by passing an XmlReader reading the XML to the
WriteNode method of an XmlWriter that uses the XmlWriterSettings with
Indent = True.
So pseudo code is e.g.
Dim WriterSettings As New XmlWriterSettings()
WriterSettings.Indent = True
Using Reader As XmlReader = XmlReader.Create("input.xml")
    Using Writer As XmlWriter = XmlWriter.Create("output.xml", WriterSettings)
        Writer.WriteNode(Reader, False)
    End Using
End Using

I'll try this.

AGP
 
Herfried K. Wagner [MVP]

AGP said:
OK, so then why do I get the error? As a test I loaded the XML into Notepad
and saved it as text with UTF-8 encoding, and that file is read correctly.
So my question is: why does the original not load properly?

Maybe it's stored in an encoding other than UTF-8, for example Windows
ANSI.
 
Martin Honnen

AGP said:
the encoding is UTF-8. the only "non-standard" character in the file is the
copyright character.

The copyright character is part of Unicode so there is nothing
"non-standard" about it in terms of Unicode or UTF-8. If you get an
error then the document is not UTF-8 encoded. If you get the error for
that character only then the document is perhaps Windows-1252 encoded.

AGP said:
I've loaded the file into Notepad and saved it as text with UTF-8 encoding,
and the resulting file loads just fine. So I'm missing something and I'm not
really sure what.

Loading in an editor and saving as UTF-8 changes the encoding to UTF-8.
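
To make the point about the copyright character concrete, here is a small sketch showing how the same character is stored in the two encodings. The helper name HexBytes is hypothetical, and the use of code page 1252 assumes the .NET Framework on Windows.

```vbnet
Imports System
Imports System.Text

Module CopyrightBytes
    ' Hypothetical helper: hex dump of a string in a given encoding.
    Function HexBytes(ByVal text As String, ByVal enc As Encoding) As String
        Return BitConverter.ToString(enc.GetBytes(text))
    End Function

    Sub Main()
        Dim copyright As String = ChrW(&HA9)   ' U+00A9 COPYRIGHT SIGN
        Console.WriteLine(HexBytes(copyright, Encoding.GetEncoding(1252)))  ' prints A9
        Console.WriteLine(HexBytes(copyright, Encoding.UTF8))               ' prints C2-A9
    End Sub
End Module
```

A lone A9 byte is not a valid UTF-8 sequence, which matches the "Invalid character in the given encoding" error. Saving from an editor as UTF-8 presumably rewrites that single byte as C2 A9, which is why the saved copy loads.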
 
DIOS

Martin Honnen said:
The copyright character is part of Unicode so there is nothing
"non-standard" about it in terms of Unicode or UTF-8. If you get an
error then the document is not UTF-8 encoded. If you get the error for
that character only then the document is perhaps Windows-1252 encoded.

Loading in an editor and saving as UTF-8 changes the encoding to UTF-8.

That makes sense to me, but I'm still struggling with some questions:
How would I determine what encoding is used in this file?
If the original file comes from a Zip file and I extract it, that
would not change its encoding, correct? It should retain its original
encoding?
Once I figure this out, how can I load the XML file without altering
the original? I eventually want to load it, read it, parse it, then
output it to another XML file, but at this point I can't even load it.

AGP
 
Martin Honnen

DIOS said:
That makes sense to me but I'm still struggling with the questions...
How would I determine what encoding is used on this file?

Ask the author of the XML document how he/she encoded it. There is no
way to check that programmatically in general, which is why there should
be an XML declaration declaring the encoding if the document is not
UTF-8 or UTF-16 encoded.
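
Although the actual encoding cannot be detected in general, it is possible to test the narrower question "is this file valid UTF-8?". A minimal sketch, assuming a hypothetical helper named IsValidUtf8 and a file small enough to read into memory:

```vbnet
Imports System.Text

Module EncodingCheck
    ' Hypothetical helper: True when the bytes decode cleanly as UTF-8.
    Function IsValidUtf8(ByVal bytes As Byte()) As Boolean
        ' Second argument True = throw on invalid byte sequences instead of
        ' silently substituting a replacement character.
        Dim strict As New UTF8Encoding(False, True)
        Try
            strict.GetString(bytes)
            Return True
        Catch ex As DecoderFallbackException
            Return False
        End Try
    End Function
End Module
```

It could be called as IsValidUtf8(System.IO.File.ReadAllBytes("file.xml")). A False result only rules UTF-8 out; it does not say which encoding was used. A True result does not prove the author intended UTF-8 either, since plain ASCII text passes the check too.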
 
AGP

Martin Honnen said:
Ask the author of the XML document how he/she encoded it, there is no way
to check that programmatically in general, that is why there should be an
XML declaration declaring the encoding if the document is not UTF-8 or
UTF-16 encoded.

I will ask, but the source could be any of a variety of providers, so I may
end up with no concrete answer as to what encoding is used. However, I did
open the file in Notepad, added the XML declaration, and did a plain save,
and the file still errors out. In my mind, if there is no declaration then
the functions assume UTF-8, correct? If that is the case, I'm not sure why
I still get an error.

AGP
 
Martin Honnen

AGP said:
I will ask, but the source could be any of a variety of providers, so I may
end up with no concrete answer as to what encoding is used. However, I did
open the file in Notepad, added the XML declaration, and did a plain save,
and the file still errors out. In my mind, if there is no declaration then
the functions assume UTF-8, correct? If that is the case, I'm not sure why
I still get an error.

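Check whether loading the files through
New StreamReader("file.xml", Encoding.Default)
works. On the .NET Framework, Encoding.Default is the system's ANSI code
page, which is Windows-1252 on US and Western European systems.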
 
AGP

AGP said:
I will ask, but the source could be any of a variety of providers, so I may
end up with no concrete answer as to what encoding is used. However, I did
open the file in Notepad, added the XML declaration, and did a plain save,
and the file still errors out. In my mind, if there is no declaration then
the functions assume UTF-8, correct? If that is the case, I'm not sure why
I still get an error.

I finally got hold of the provider and, sure enough, as you guys stated,
there was one character that was written in the wrong encoding. So when the
XML reader assumes UTF-8 and encounters this one character (an ISO
character), it errors out. Thanks for all the advice.

AGP
 
