XML parsing with streamed XML

ginguene · Mar 5, 2006

I am sending a continuous stream of XML like this:
_____________________

<stream>
<balise>
test
</balise>
<balise>
test2
</balise>

[..........etc..]

</stream>
_____________________

While reading this stream (into a StringBuilder), I need to extract the
<balise> tag in order to get this:
<balise>
test
</balise>
and remove it from the stream as we parse.
Example:

_____________________

<stream>
<balise>
test
</balise>
<balise> [waiting for the rest of the stream...]

_____________________
With this in the stream, we should extract:
<balise>
test
</balise>
so at the end we have:
_____________________

<stream>
<balise> [waiting for the rest of the stream...]

_____________________

And so on as we read the stream!
The thing is, I can receive this stream one byte at a time, or in larger chunks.

Currently, I am using Regex.
But it's a bit tricky with CDATA.
For example, if we have something like this in our stream:
<balise ><![CDATA[<balise ></balise >
we should keep waiting for the real end of <balise >, but I can't do that
with regex (or maybe you have tips?).

So I thought of maybe using some XmlReader, or xmlstreamreader, or
whatever...

I need the fastest processing solution.

Thanks

Jon Skeet [C# MVP] · Mar 5, 2006

XmlReader would certainly be the way to go for simplicity. When you say
you need the fastest processing solution - presumably you only need it
to go *acceptably* fast - it's very rare that you need the absolutely
fastest solution. Such a solution would almost certainly involve
writing your own custom parsing code which would be *extremely*
complicated.

I would strongly recommend trying XmlReader and seeing whether that's
good enough for your needs. I suspect it will be.

ginguene · Mar 5, 2006

OK for XmlReader, but how?
I don't want to be testing for XML validation each time I receive a new
byte in the stream...

And I need it reasonably fast, which means I don't want to be creating XML
objects every time, for example... let's just say I need it to be efficient,
as I'll be processing a lot of data, on multiple streams.

Thanks Jon

Jon Skeet [C# MVP] · Mar 5, 2006

OK for XmlReader, but how?

Well, we'd need more information about what you need to do in order to
give you sample code, but the normal thing is to create an XmlReader
(of some description, eg XmlTextReader) and then just let it read nodes
as you ask for them.
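
To make "read nodes as you ask for them" concrete, here is a minimal sketch. It assumes a reader built with XmlReader.Create (the factory method, rather than constructing XmlTextReader directly), and the NodeDump class and ListNodes method are names made up for this illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeDump
{
    // Lists every node the reader reports, one entry per Read() call.
    public static List<string> ListNodes(string xml)
    {
        var nodes = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // Read() moves to the next node and returns false at the end of input.
            while (reader.Read())
            {
                // Text-like nodes carry a Value; elements and end elements carry a Name.
                string detail = reader.HasValue ? reader.Value : reader.Name;
                nodes.Add(reader.NodeType + " " + detail);
            }
        }
        return nodes;
    }
}
```

For "<stream><balise>test</balise></stream>" this reports the two start elements, the text node, and the two end elements, in document order.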

I don't want to be testing for XML validation each time I receive a new
byte in the stream...

Well, for one thing you can turn some of the validation off.
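
As a rough sketch of what turning things off can look like on current .NET versions: XmlReaderSettings lets you disable schema/DTD validation and skip nodes you don't care about. The ReaderFactory class and CreateLightweight method are hypothetical names, and which switches actually matter for this workload is an assumption to measure, not a given. Well-formedness checking itself cannot be switched off.

```csharp
using System.IO;
using System.Xml;

public static class ReaderFactory
{
    // Builds a reader that does no validation and skips nodes we never look at.
    public static XmlReader CreateLightweight(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            ValidationType = ValidationType.None,   // no schema or DTD validation
            DtdProcessing = DtdProcessing.Ignore,   // skip any DOCTYPE instead of parsing it
            CheckCharacters = false,                // skip the legal-character check
            IgnoreComments = true,
            IgnoreProcessingInstructions = true,
            IgnoreWhitespace = true                 // no whitespace-only nodes between elements
        };
        return XmlReader.Create(input, settings);
    }
}
```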

And I need it reasonably fast, which means I don't want to be creating XML
objects every time, for example... let's just say I need it to be efficient,
as I'll be processing a lot of data, on multiple streams.

You can give the XmlTextReader a stream of data. You don't need to
manually feed it each individual byte, although it will have to process
each byte in turn.
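
A small sketch of that point, assuming XmlReader.Create over a Stream. TrickleStream is a made-up helper that hands out at most one byte per Read call, standing in for data that arrives byte after byte; the reader simply pulls from the stream whenever it needs more input. How promptly each node surfaces on a live socket depends on the reader's internal buffering, so that part should be tested against the real connection.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper: returns at most one byte per Read, to mimic slow arrival.
public sealed class TrickleStream : Stream
{
    private readonly Stream inner;
    public TrickleStream(Stream inner) { this.inner = inner; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        return count == 0 ? 0 : inner.Read(buffer, offset, 1);
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

public static class StreamDemo
{
    // Collects element names; the caller never feeds bytes to the reader by hand.
    public static List<string> ElementNames(Stream input)
    {
        var names = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    names.Add(reader.Name);
            }
        }
        return names;
    }
}
```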

You *will* end up creating XML objects, but they're likely to be
short-lived unless you actually *need* to hold onto them for a long
time. You should definitely try the simple solution, see whether it
performs well enough for you. I'm sure you'll find it performs a lot
quicker than using regular expressions!

ginguene · Mar 5, 2006

alright...

but how can I validate something like this:
// Beginning of the stream
<stream>
<balise>
test
</balise>
<balise>

All I want here, in this example, is to extract
<balise>
test
</balise>

Currently, I do this using regex, but if you tell me that using
XmlReader would be faster, then I have to find a way to do what I want
(extracting tags from an unfinished XML file) with XML objects.

To be more precise: the stream will evolve as we receive new bytes.
Example:
________
Step 1
<stream>
________
Step 2
<stream>
<balis
________
Step 3
<stream>
<balise>
test
________
Step 4
<stream>
<balise>
test
</balise>
<balise>
________
etc...

We should test at every step of the stream, and in this example, we
should be able to extract something (the first <balise>) at Step 4 only.

I hope I am clear enough, thank you for your help.

Jon Skeet [C# MVP] · Mar 5, 2006

alright...

but how can I validate something like this:
// Beginning of the stream
<stream>
<balise>
test
</balise>
<balise>

All I want here, in this example, is to extract
<balise>
test
</balise>

So you ask the XmlReader for the next node (from the start) and it will
return the <stream> element. After that, you'll keep asking for nodes,
keeping track of any text nodes you're given (the "test" part here) and
when you see an element which is an end "balise" element, do whatever
processing you need.
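
The loop described above could be sketched roughly like this. It assumes XmlReader.Create over a TextReader; BaliseReader and ReadBalises are names invented for the example, and "do whatever processing you need" is reduced to yielding the collected text.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class BaliseReader
{
    // Yields the text of each <balise> element once its end tag has been read.
    public static IEnumerable<string> ReadBalises(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            var text = new StringBuilder();
            bool insideBalise = false;

            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.Name == "balise")
                        {
                            text.Length = 0;
                            if (reader.IsEmptyElement)
                                yield return string.Empty; // <balise/> has no end element node
                            else
                                insideBalise = true;
                        }
                        break;

                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        // Keep track of the text nodes seen inside the current element.
                        if (insideBalise) text.Append(reader.Value);
                        break;

                    case XmlNodeType.EndElement:
                        if (reader.Name == "balise")
                        {
                            insideBalise = false;
                            yield return text.ToString();
                        }
                        break;
                }
            }
        }
    }
}
```

Because the method is an iterator, each element is handed to the caller as soon as the reader has seen its end tag; there is no need to re-test the accumulated text at every step.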

Currently, I do this using regex, but if you tell me that using
XmlReader would be faster, then I have to find a way to do what I want
(extracting tags from an unfinished XML file) with XML objects.

We should test at every step of the stream, and in this example, we
should be able to extract something (the first <balise>) at Step 4 only.

I hope I am clear enough, thank you for your help.

I suggest you read up on XmlTextReader, including the examples in MSDN.
I'm sure you'll find it useful.

ginguene · Mar 5, 2006

Good, I am looking at XmlTextReader, and it seems to do the trick (once I
have completely figured out how it actually works).

But I am still wondering if it is able to handle this:

<balise ><![CDATA[</balise > <---- here we are supposed to wait for
the REAL end "</balise>"

It's an extreme case, unlikely to happen, but you can never be too
careful!

I have to test it to check, but as I am still discovering it, I will try
this later.

Have a good day and thanks for the help!

Jon Skeet [C# MVP] · Mar 5, 2006

Good, I am looking at XmlTextReader, and it seems to do the trick (once I
have completely figured out how it actually works).

But I am still wondering if it is able to handle this:

<balise ><![CDATA[</balise > <---- here we are supposed to wait for
the REAL end "</balise>"

It's an extreme case, unlikely to happen, but you can never be too
careful!

XmlTextReader is a proper XML parser, so it should cope fine with it. As
you say, though, the best way is to test it.
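
A quick way to run that test, sketched with XmlReader.Create and two standard XmlReader methods, ReadToFollowing and ReadElementContentAsString. CDataCheck and FirstBaliseContent are made-up names, and the input closes the CDATA section and the elements so that the document is complete.

```csharp
using System;
using System.IO;
using System.Xml;

public static class CDataCheck
{
    // Returns the content of the first <balise> element in the given XML.
    public static string FirstBaliseContent(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            if (!reader.ReadToFollowing("balise"))
                throw new InvalidOperationException("No <balise> element found.");

            // Text and CDATA inside the element are concatenated into one string.
            return reader.ReadElementContentAsString();
        }
    }
}
```

Called with "<stream><balise ><![CDATA[<balise ></balise >]]></balise ></stream>", the method returns the literal text inside the CDATA section; the tags inside it are treated as character data, and only the end tag after "]]>" closes the element.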
