Reading, writing XML, and encoding question

billsahiker

I am researching for an upcoming C# .NET 2.0 project that may require
reading and writing XML files. I don't want to use XmlTextReader/
XmlTextWriter as I prefer to have lower-level file access, more
control, and a smaller memory footprint, and I won't need most of their
functionality.

I do need to handle UTF-8 and UTF-16 encodings, and I need to insert
some tags.

Should I use the StreamReader/StreamWriter classes and the Encoding
property, or are there other I/O classes better suited for this?
Should I read and write bytes to avoid the whole encoding issue?
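
For concreteness, I mean something along these lines - the file names and the
particular choice of encodings here are only placeholders:

using System.IO;
using System.Text;

class EncodingSketch
{
    static void Main()
    {
        // read a UTF-16 file and write it back out as UTF-8;
        // paths and encodings are just examples
        using (StreamReader reader = new StreamReader(@"c:\in.xml", Encoding.Unicode))
        using (StreamWriter writer = new StreamWriter(@"c:\out.xml", false, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line); // plus whatever tag insertion is needed
            }
        }
    }
}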

Bill
 
Marc Gravell

First off: have you investigated whether you need the complexity of all
of this? How big is the XML?

> I am researching for an upcoming C# .NET 2.0 project that may require
> reading and writing XML files. I don't want to use XmlTextReader/
> XmlTextWriter as I prefer to have lower-level file access, more
> control, and a smaller memory footprint, and I won't need most of their
> functionality.

Fine; if you want a smaller footprint, then don't use XmlDocument - but
XmlReader / XmlWriter (and the concrete implementations) are streaming
APIs - there is nothing very costly about using them, other than they
are harder to get right (than XmlDocument).

> I do need to handle UTF-8 and UTF-16 encodings, and I need to insert
> some tags.

Well, you'll still need to rewrite the entire file (i.e. have two files:
one in (XmlReader), one out (XmlWriter)) - and blitz through the data,
normally writing verbatim. Alternatively, consider an XSLT transform.
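
As a very rough sketch of that kind of copy loop - the file names, the "Item"
element being matched and the "Inserted" tag are all placeholders, and DTDs /
entity references are ignored:

using System.Xml;

class CopyWithInsert
{
    static void Main()
    {
        using (XmlReader reader = XmlReader.Create(@"c:\in.xml"))
        using (XmlWriter writer = XmlWriter.Create(@"c:\out.xml"))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName,
                            reader.NamespaceURI);
                        writer.WriteAttributes(reader, true);
                        if (isEmpty)
                        {
                            writer.WriteEndElement();
                        }
                        else if (reader.LocalName == "Item")
                        {
                            // inject an extra child just after the start tag
                            writer.WriteElementString("Inserted", "value");
                        }
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }
        }
    }
}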

> Should I use the StreamReader/StreamWriter classes and the Encoding
> property, or are there other I/O classes better suited for this?
> Should I read and write bytes to avoid the whole encoding issue?

To be honest, unless you are /very/ good you will struggle to do better
than XmlReader/XmlWriter; plus they are pre-tested. I'd just use them
(less risky, and more effective development). Anything more hardcore is
just making work for the sake of it.

Marc
 
Jon Skeet [C# MVP]

> I am researching for an upcoming C# .NET 2.0 project that may require
> reading and writing XML files. I don't want to use XmlTextReader/
> XmlTextWriter as I prefer to have lower-level file access, more
> control, and a smaller memory footprint, and I won't need most of their
> functionality.

Do you have any evidence to suggest that avoiding using XmlTextReader/
XmlTextWriter will actually give you a significantly smaller memory
footprint? What file access do you need that these classes won't give
you? Could you possibly achieve the same level of access by providing
your own stream to XmlTextReader/XmlTextWriter?
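
For example, something along these lines - the path and buffer size are
purely illustrative:

using System.IO;
using System.Xml;

class OwnStreamSketch
{
    static void Main()
    {
        // hand XmlTextReader a stream you control yourself
        using (FileStream fs = new FileStream(@"c:\big.xml", FileMode.Open,
            FileAccess.Read, FileShare.Read, 64 * 1024))
        using (XmlTextReader reader = new XmlTextReader(fs))
        {
            while (reader.Read())
            {
                // process nodes as usual
            }
        }
    }
}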

Getting XML stuff right at a low level can be a tricky business. Are
you sure your time is best spent by reinventing this particular wheel
instead of concentrating on actual business value?

If you have real concerns about performance etc, it's definitely worth
*measuring* the different options available - see whether there's
actually a problem if you use the built-in classes, rather than
assuming there is (and that you could do better yourself).

Jon
 
Marc Gravell

> If you have real concerns about performance etc, it's definitely worth
> *measuring* the different options available - see whether there's
> actually a problem if you use the built-in classes, rather than
> assuming there is (and that you could do better yourself).

To put that into context - here is a crude test that uses XmlWriter to
write a large file with all the cost of encoding etc, then uses a Stream
to write raw binary (of repeated simple data, moderate length - and yes
I tried the buffered stream too).

The results suggest no real benefit (in fact, many tests were
significantly quicker with XmlWriter) - almost certainly because the CPU
can run rings around the HDD. So: do you really want to add layers of
complexity (and bugs) here? I wouldn't...

c:\xml.xml: 5598ms, 34MB, 6.07MB/s
c:\bin.bin: 4941ms, 34MB, 6.88MB/s

using System.IO;
using System.Xml;
using System.Diagnostics;
using System;

class Program
{
    static void Main()
    {
        const string XML_PATH = @"c:\xml.xml", BIN_PATH = @"c:\bin.bin";

        if (File.Exists(XML_PATH)) File.Delete(XML_PATH);
        Stopwatch watch = Stopwatch.StartNew();
        using (XmlWriter writer = XmlWriter.Create(XML_PATH))
        {
            writer.WriteStartElement("Foo");
            for (int i = 0; i < 1500000; i++)
            {
                writer.WriteStartElement("Bar");
                writer.WriteAttributeString("val", i.ToString());
            }
            writer.WriteEndElement();
            writer.Close();
        }
        watch.Stop();
        long bytes = WriteWatch(watch.ElapsedMilliseconds, XML_PATH);

        if (File.Exists(BIN_PATH)) File.Delete(BIN_PATH);
        watch = Stopwatch.StartNew();
        // note: also tried BufferedStream here...
        using (Stream stream = File.Create(BIN_PATH))
        {   // write any old garbage to the same file size
            byte[] buffer = new byte[200];
            while (bytes > 0)
            {
                stream.Write(buffer, 0, buffer.Length);
                bytes -= buffer.Length;
            }
            stream.Close();
        }
        watch.Stop();
        WriteWatch(watch.ElapsedMilliseconds, BIN_PATH);
    }

    static long WriteWatch(long ms, string path)
    {
        long bytes = new FileInfo(path).Length, mb = bytes / (1024 * 1024);
        Console.WriteLine("{0}: {1}ms, {2}MB, {3:###.##}MB/s",
            path, ms, mb, mb / (ms / 1000M));
        return bytes;
    }
}
 
billsahiker

On Apr 30, 3:01 pm, (e-mail address removed) wrote:
> Do you have any evidence to suggest that avoiding using XmlTextReader/
> XmlTextWriter will actually give you a significantly smaller memory
> footprint?

Jon,

Yes, I tested them extensively. The reader slows to a crawl with very
large files and memory usage skyrockets (somewhere over 500 MB,
depending on the XML structure and machine). I can read pretty much any
size file with StreamReader. If I could find an XML editor that can
handle such files and checks well-formedness, I could use it for some
of the work.

Bill
 
Jon Skeet [C# MVP]

> Yes, I tested them extensively. The reader slows to a crawl with very
> large files and memory usage skyrockets (somewhere over 500 MB,
> depending on the XML structure and machine). I can read pretty much any
> size file with StreamReader. If I could find an XML editor that can
> handle such files and checks well-formedness, I could use it for some
> of the work.

That seems very odd, if you're *just* using XmlTextReader and not (for
instance) loading it into an XmlDocument. Could you provide a short
but complete program demonstrating that? Does it depend on the XML
within the file, for instance? I wouldn't expect XmlTextReader to be
doing anything particularly significant.

(Aside from anything else, having a test case for XmlTextReader being
slow will prove very valuable if you *do* write your own reader...)

Jon
 
Jon Skeet [C# MVP]

(e-mail address removed) wrote:

> I am working on a project in .NET 1.1 in which we are using XmlTextReader to
> read XML files which are gigabytes in size.
>
> We are able to read files up to 2 GB in size, but for files greater than
> 2 GB XmlTextReader is not working and gives the error "Index out of range".
>
> Is this a .NET 1.1 problem? Is there any workaround to read XML files
> larger than 2 GB? We are in the completion stage of the project and looking
> for a quick workaround.
>
> Please help us. I am also including the stack trace of the exception for
> your reference.

It could well be something that's been fixed in 2.0. Presumably you can
reproduce this easily - why not just run it under 2.0 and see whether
that fixes the problem?
 
billsahiker

> ... (for instance) loading it into an XmlDocument. Could you provide a short
> but complete program demonstrating that? Does it depend on the XML
> within the file, for instance? I wouldn't expect XmlTextReader to be
> doing anything particularly significant.
>
> (Aside from anything else, having a test case for XmlTextReader being
> slow will prove very valuable if you *do* write your own reader...)
>
> Jon

Jon,

Below is the content of a console app file, program.cs. Most of the
files I tested I created with another C# program that autogenerates
them, with options for structure and an optional short text string.
Even with files having no text or attributes, it seems to hang on the
big files. Task Manager shows the available memory dropping from
about 850 MB to as low as 3 MB and CPU usage dropping to near zero and
staying there, and it reports the program as not responding.

Bill

program.cs

using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.IO;

namespace ReadXml
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length > 0)
            {
                Console.WriteLine(args[0].ToString());
                Console.ReadLine();
                ReadXml r = new ReadXml();
                r.ReadFile(args[0].ToString());
            }
            else
            {
                Console.WriteLine("file name required");
                Console.ReadLine();
            }
        }
    }

    public class ReadXml
    {
        private XmlTextReader reader = null;
        private XmlTextReader reader1 = null;
        private int totalNodeCount = 0;

        public void ReadFile(string fname)
        {
            Console.WriteLine(fname);
            Console.ReadLine();
            reader = new XmlTextReader(fname);
            reader.WhitespaceHandling = WhitespaceHandling.All;
            TimeSpan t1 = DateTime.Now.TimeOfDay;
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        //Console.Write("<{0}", reader.Name);
                        //while (reader.MoveToNextAttribute())
                        //{
                        //    Console.Write(" " + reader.Name + " " + reader.Value + " ");
                        //}
                        totalNodeCount++;
                        break;
                    case XmlNodeType.Text:
                        //Console.Write(reader.Value);
                        //totalNodeCount++;
                        break;
                    case XmlNodeType.CDATA:
                        //Console.Write("<![CDATA[{0}]]>", reader.Value);
                        totalNodeCount++;
                        break;
                    case XmlNodeType.ProcessingInstruction:
                        //Console.Write("<?{0} {1}?>", reader.Name, reader.Value);
                        totalNodeCount++;
                        break;
                    case XmlNodeType.Comment:
                        //Console.Write("<!--{0}-->", reader.Value);
                        totalNodeCount++;
                        break;
                    case XmlNodeType.XmlDeclaration:
                        //Console.Write("<?xml version='1.0'?>");
                        totalNodeCount++;
                        break;
                    case XmlNodeType.EndElement:
                        //Console.Write("</{0}>", reader.Name);
                        break;
                    case XmlNodeType.Whitespace:
                        //Console.Write(reader.Value);
                        break;
                }
                if (totalNodeCount >= 100000) break;
            }
            TimeSpan t2 = DateTime.Now.TimeOfDay;
            Console.WriteLine(t2.Subtract(t1));
            Console.WriteLine(totalNodeCount);
            Console.ReadLine();
        }
    }
}
 
neerajb

Hi Jon,

Thanks for the quick reply. I am trying to run it in 2.0 as well, but it will
take time to set up and run the application.

Since we are in the completion phase of the project, we are not able to move
the code to .NET 2.0. Is there any other way to parse files greater
than 2 GB in .NET 1.1?
 
Jon Skeet [C# MVP]

> Below is the content of a console app file, program.cs. Most of the
> files I tested I created with another C# program that autogenerates
> them, with options for structure and an optional short text string.
> Even with files having no text or attributes, it seems to hang on the
> big files. Task Manager shows the available memory dropping from
> about 850 MB to as low as 3 MB and CPU usage dropping to near zero and
> staying there, and it reports the program as not responding.

If you run it in a debugger and break into it, what does it show?

For what it's worth, I've just tried it with a big file and it had no
problems at all. If you could give the generator code as well, we might
be able to get further.

With the very simple generation program listed below, I generated a 3GB
file. I removed the "break" from your test program to make it read the
whole thing, and it did so in a couple of minutes (during which the
disk was constantly active). Memory stayed constant (and low)
throughout.


using System;
using System.IO;

class Generate
{
    static void Main(string[] args)
    {
        int iterations = int.Parse(args[0]);
        using (TextWriter writer = new StreamWriter("test.xml"))
        {
            writer.WriteLine("<root>");
            for (int i = 0; i < iterations; i++)
            {
                writer.WriteLine("<element><nested /></element>");
            }
            writer.WriteLine("</root>");
        }
    }
}
 
billsahiker

> With the very simple generation program listed below, I generated a 3GB
> file. I removed the "break" from your test program to make it read the
> whole thing, and it did so in a couple of minutes (during which the
> disk was constantly active). Memory stayed constant (and low)
> throughout.

Jon,

I made some good progress, thanks to your generate program. Seems that
mine was the problem. I put the code below. Not sure what the
underlying problem with it is. One difference is that it writes with
the StreamWriter, not TextWriter. Another is that it writes each
element on a separate line.
I used your program but changed the tag names to what I was using,
added spaces for indents, added two more levels, but put it all on one
line, and I have no problem reading the output file up to 3 GB so far.

I should add that using my CreateRawXml method, I generated a small
file with the same structure and IE had no problem with it - I did this
to see if I had violated a tag name rule or other rule of some sort.

So what do you think caused the problem I had?

Bill


public class BuildRawXml
{
    private UInt64 maxNodes = 10000;
    private UInt64 maxChildren = 150;
    private UInt64 maxGrandchildren = 4;
    private UInt64 maxGreatGC = 1;
    private string root = "<root>";
    private string rootEndTag = "</root>";

    public void CreateRawXml()
    {
        string fname = @"C:\Development\BuildXml\BuildXml\bin\Debug\big10K 100 4 1.xml";

        StreamWriter sw = new StreamWriter(fname);
        sw.WriteLine(root);
        AddNodes(sw);
        sw.WriteLine(rootEndTag);
        sw.Close();
    }

    private void AddNodes(StreamWriter sw)
    {
        string child = string.Empty;
        string startTag = "<";
        string closeTag = ">";
        string EndTag = " />";
        string endElementTag = "</";
        string grandchild = string.Empty;
        string Greatgrandchild = string.Empty;
        string GreatGC = string.Empty;
        string text = "text for ";

        for (UInt64 i = 0; i < maxNodes; i++)
        {
            child = "Child" + Convert.ToString(i);
            sw.WriteLine(Indent(2) + startTag + child + closeTag);
            for (UInt64 j = 0; j < maxChildren; j++)
            {
                grandchild = child + "_GC" + Convert.ToString(j);
                sw.WriteLine(Indent(4) + startTag + grandchild + closeTag);
                //XmlNode textnode = grandchild.OwnerDocument.CreateTextNode(text);
                //grandchild.AppendChild(textnode);
                for (UInt64 k = 0; k < maxGrandchildren; k++)
                {
                    Greatgrandchild = grandchild + "_GGC" + Convert.ToString(k);
                    sw.WriteLine(Indent(6) + startTag + Greatgrandchild + closeTag);
                    for (UInt64 l = 0; l < maxGreatGC; l++)
                    {
                        GreatGC = Greatgrandchild + "_GGGC" + Convert.ToString(l);
                        if (string.IsNullOrEmpty(text))
                        {
                            sw.WriteLine(Indent(8) + startTag + GreatGC + EndTag);
                        }
                        else
                        {
                            sw.WriteLine(Indent(8) + startTag + GreatGC + closeTag
                                + text + GreatGC + endElementTag + GreatGC + closeTag);
                        }
                    }
                    sw.WriteLine(Indent(6) + endElementTag + Greatgrandchild + closeTag);
                }
                sw.WriteLine(Indent(4) + endElementTag + grandchild + closeTag);
            }
            sw.WriteLine(Indent(2) + endElementTag + child + closeTag);
        }
    }

    private string Indent(int offset)
    {
        return (new string(Convert.ToChar(" "), offset));
    }

    public void Generate()
    {
        string fname = @"C:\Development\BuildXml\BuildXml\bin\Debug\test4.xml";
        TextWriter tr = new StreamWriter(fname);
        tr.WriteLine("<root>");
        for (int i = 0; i < 40000000; i++)
        {
            tr.WriteLine("<Child> <gc> <ggc> <gggc>some text</gggc></ggc></gc></Child>");
        }
        tr.WriteLine("</root>");
        tr.Close();
    }
}
 
Jon Skeet [C# MVP]

> I made some good progress, thanks to your generate program.

Good.

> Seems that mine was the problem. I put the code below. Not sure what the
> underlying problem with it is. One difference is that it writes with
> the StreamWriter, not TextWriter.

Mine was using StreamWriter as well - the variable was of type
TextWriter, but the actual object was a StreamWriter.

> Another is that it writes each element on a separate line.
> I used your program but changed the tag names to what I was using,
> added spaces for indents, added two more levels, but put it all on one
> line, and I have no problem reading the output file up to 3 GB so far.

Excellent.

> I should add that using my CreateRawXml method, I generated a small
> file with the same structure and IE had no problem with it - I did this
> to see if I had violated a tag name rule or other rule of some sort.
>
> So what do you think caused the problem I had?

Not sure yet - but I'll look into it further. I've now seen the issue
as well, using your CreateRawXml.
 
Jon Skeet [C# MVP]

> I made some good progress, thanks to your generate program. Seems that
> mine was the problem. I put the code below. Not sure what the
> underlying problem with it is. One difference is that it writes with
> the StreamWriter, not TextWriter. Another is that it writes each
> element on a separate line.
> I used your program but changed the tag names to what I was using,
> added spaces for indents, added two more levels, but put it all on one
> line, and I have no problem reading the output file up to 3 GB so far.
>
> I should add that using my CreateRawXml method, I generated a small
> file with the same structure and IE had no problem with it - I did this
> to see if I had violated a tag name rule or other rule of some sort.
>
> So what do you think caused the problem I had?

I now know exactly what caused the problem. Your code generates a
different element name for *every element*. Now, XmlReader has an
XmlNameTable - the idea being to avoid generating too many strings,
instead using a dictionary of element and attribute names that it's
already seen. That's great for almost all real-world XML, which uses a
small set of element/attribute names throughout even large documents.
It breaks completely on your XML though.

As far as I can tell, you can't ask XmlReader not to use an
XmlNameTable at all, but you can provide a no-op one which *doesn't*
cache things:

class NoOpNameTable : XmlNameTable
{
    public override string Add(string array)
    {
        return array;
    }

    public override string Add(char[] array, int offset, int length)
    {
        if (length == 0)
        {
            return string.Empty;
        }
        return new string(array, offset, length);
    }

    public override string Get(string array)
    {
        return array;
    }

    public override string Get(char[] array, int offset, int length)
    {
        if (length == 0)
        {
            return string.Empty;
        }
        return new string(array, offset, length);
    }
}

Pass an instance of that into the constructor to XmlTextReader, and I
think you'll find everything works fine.
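
That is, something like:

XmlTextReader reader = new XmlTextReader(fname, new NoOpNameTable());

where fname is the path of the file you're reading.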

Alternatively, try to avoid creating such pathological XML
documents :)

Jon
 
billsahiker

Jon,

Glad you found it. Thanks. The tag names also get very long as the
counter increments; I did this to determine where I was in the XML
when inspecting a fragment.

I will give your NoOpNameTable class a try and let you know how it
goes.

Bill


PS: I do not suffer from being pathological - I enjoy every minute of
it.

 
Jon Skeet [C# MVP]

> Glad you found it. Thanks. The tag names also get very long as the
> counter increments; I did this to determine where I was in the XML
> when inspecting a fragment.

Obviously I don't know the details of your situation, but a much more
reasonable approach (IMO) would be to set the *value* of an attribute
to be the counter, rather than changing the element name.
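
Something along these lines, say - the path, element name and attribute name
here are made up:

using System.Xml;

class AttributeCounter
{
    static void Main()
    {
        // same element name every time; the counter lives in an attribute value,
        // so the reader's name table only ever sees a handful of distinct names
        using (XmlWriter writer = XmlWriter.Create(@"c:\counter.xml"))
        {
            writer.WriteStartElement("root");
            for (int i = 0; i < 10000; i++)
            {
                writer.WriteStartElement("Child");
                writer.WriteAttributeString("id", i.ToString());
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
        }
    }
}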
 
Jon Skeet [C# MVP]

> Yes, that would work better. The NoOpNameTable class worked like a
> charm. Thanks again.

No problem - I'm just glad we managed to avoid you writing your own XML
parser :)
 
