Merging XML Files

simon · Feb 15, 2010

Hi all,

I'm hoping that someone could help me with a relatively simple problem
I'm having. I have a number of quite large XML files - say 150meg each,
and I need to perform a merge operation on them. However, because each
individual document has its own XML node and root element, I need to
find the most efficient way of stripping those elements out where
appropriate so that the relevant data can be merged into one large file.

On average, there will be three files to be merged in this fashion. Can
anyone advise on what the best approach to this would be? I've been
attempting to read the relevant files into memory and do text
manipulation to eliminate the header and footer elements that I don't
need, but I'm not having much luck. I'm beginning to wonder if it would
be better to try and use some of the XML APIs in .NET rather than trying
to treat it as purely a text manipulation and file I/O problem.

Thanks to anyone who can advise on a possible approach.

Best regards

Simon

Arne Vajhøj · Feb 15, 2010

Sounds as if just treating them as text files and using
StreamReader and StreamWriter would work. To do the same thing
in an XML-aware way, use XmlTextReader and XmlTextWriter.
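
A minimal sketch of the plain-text route, written in Python rather than .NET purely as an illustration (StreamReader/StreamWriter would play the same role). The names `body_span`, `copy_range` and `merge_as_text` are hypothetical. It assumes the files are UTF-8 or another ASCII-compatible encoding, that every file uses the same root tag, that the root's opening tag sits within the first 4096 bytes and its closing tag within the last 4096, and that no attribute value on the root contains ">". It never holds more than one chunk in memory.

```python
import os


def body_span(path, root_tag, probe=4096):
    """Return (start, end) byte offsets of the content inside the root element."""
    open_tag = b"<" + root_tag.encode("ascii")
    close_tag = b"</" + root_tag.encode("ascii")
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(probe)
        tag_at = head.find(open_tag)
        start = head.find(b">", tag_at) + 1  # first byte after "<root ...>"
        tail_at = max(0, size - probe)
        f.seek(tail_at)
        tail = f.read()
        end = tail_at + tail.rfind(close_tag)  # first byte of "</root>"
    if tag_at < 0 or start == 0 or end < tail_at or end < start:
        raise ValueError(f"{path}: could not locate <{root_tag}> ... </{root_tag}>")
    return start, end


def copy_range(src, dst, start, end, chunk=1 << 20):
    """Copy bytes [start, end) from src to dst, one chunk at a time."""
    src.seek(start)
    remaining = end - start
    while remaining > 0:
        block = src.read(min(chunk, remaining))
        if not block:
            break
        dst.write(block)
        remaining -= len(block)


def merge_as_text(input_paths, output_path, root_tag):
    """Concatenate the root contents of each input under a single root element."""
    with open(output_path, "wb") as out:
        for index, path in enumerate(input_paths):
            start, end = body_span(path, root_tag)
            with open(path, "rb") as src:
                # First file: keep the header and "<root ...>" as well.
                # Later files: skip everything up to and including "<root ...>".
                copy_range(src, out, 0 if index == 0 else start, end)
        # The closing root tag is written exactly once, after the last file.
        out.write(b"</" + root_tag.encode("ascii") + b">\n")
```

A call might look like `merge_as_text(["a.xml", "b.xml"], "merged.xml", "document")`. The number of assumptions listed above is the main argument for the XML-aware alternative.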

Arne

Peter Duniho · Feb 16, 2010

simon said:
Hi all,

I'm hoping that someone could help me with a relatively simple problem
I'm having. I have a number of quite large XML files - say 150meg each,
and I need to perform a merge operation on them. However, because each
individual document has its own XML node and root element, I need to
find the most efficient way of stripping those elements out where
appropriate so that the relevant data can be merged into one large file.

To be clear, do you mean this:

File A:

<document>
<dataA />
<dataB />
<dataC />
</document>

File B:

<document>
<dataD />
<dataE />
<dataF />
</document>

Output:

<document>
<dataA />
<dataB />
<dataC />
<dataD />
<dataE />
<dataF />
</document>

?

Are there possibly duplicated elements between the input files? If so,
is it okay for them to be duplicated in the output, or does the
duplication itself need to be merged somehow?

On average, there will be three files to be merged in this fashion. Can
anyone advise on what the best approach to this would be? I've been
attempting to read the relevant files into memory and do text
manipulation to eliminate the header and footer elements that I don't
need, but I'm not having much luck. I'm beginning to wonder if it would
be better to try and use some of the XML APIs in .NET rather than trying
to treat it as purely a text manipulation and file I/O problem.

Thanks to anyone who can advise on a possible approach.

I agree with Arne that it seems like simply reading the files and
writing a new one would be fine. I'd prefer the XML-specific
reader/writer approach, to avoid any possibility of breaking the
structure of each XML document (which it sounds like you are already
having trouble with).

If you can be more specific about what you believe it means to "merge"
two or more XML documents, it's possible you could get even better, more
specific advice.

Pete

Simon · Feb 16, 2010

Hi guys,

Thanks for that. Pete - you're right - the output of the merge should be:

<document>
<dataA />
<dataB />
<dataC />
<dataD />
<dataE />
<dataF />
</document>

Duplicates shouldn't occur - it should be as simple as snapping the
<xml> and root nodes off and then slapping them together.

The issue I'm having is in solving the problem without reading whole
files into memory, because they are too large.

The bit I'm struggling with is where I need to remove the last
</rootNode></xml> node from any given document, as I can't find an
effective way of reading to that part of the file and snapping the end
off. If I read the whole document in, I run out of memory. If I use some
sort of "chunking" mechanism (which I think is exactly what I want), I
can't figure out how to detect the </rootNode> portion and then snap it off.

If anyone can advise on an easy approach - I'd be very grateful.

Thanks

Simon

Peter Duniho · Feb 16, 2010

Simon said:
[...]
The issue I'm having is in solving the problem without reading whole
files into memory, because they are too large.

The bit I'm struggling with is where I need to remove the last
</rootNode></xml> node from any given document, as I can't find an
effective way of reading to that part of the file and snapping the end
off. If I read the whole document in, I run out of memory. If I use some
sort of "chunking" mechanism (which I think is exactly what I want), I
can't figure out how to detect the </rootNode> portion and then snap it off.

If anyone can advise on an easy approach - I'd be very grateful.

Arne's suggestion to use XmlTextReader/Writer should work fine. Have
you looked at those classes? It should be no more difficult than:

- for the first file, read and copy the outer-most document element
- for every other file, read past the outer-most document element
  without copying it to the output
- for every file, read and copy every bit of content within the
  outer-most document element
- for all files except the last, read past without copying the
  element end for the outer-most document element
- finally, for the last file, read and copy the element end for the
  outer-most document element
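
The steps above could be sketched roughly as follows. This is an illustration in Python rather than .NET, using the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for the streaming reader; the same read-and-copy loop should map onto XmlTextReader/XmlTextWriter. The function name `merge_xml_files` is hypothetical, and the sketch assumes every input has the same root element, no text directly under the root, and no namespaces. Comments and processing instructions are not copied.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def merge_xml_files(input_paths, output_path):
    """Copy the children of every input's root element into one output root."""
    root_tag = None
    with open(output_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="utf-8"?>\n')
        for path in input_paths:
            depth = 0
            root = None
            for event, elem in ET.iterparse(path, events=("start", "end")):
                if event == "start":
                    depth += 1
                    if depth == 1:
                        root = elem
                        if root_tag is None:
                            # First file: copy the opening tag of the root element.
                            root_tag = elem.tag
                            attrs = "".join(
                                f" {name}={quoteattr(value)}"
                                for name, value in elem.attrib.items()
                            )
                            out.write(f"<{root_tag}{attrs}>\n")
                        elif elem.tag != root_tag:
                            # Later files: the root is read past, never copied.
                            raise ValueError(f"{path}: unexpected root <{elem.tag}>")
                else:
                    depth -= 1
                    if depth == 1:
                        # A complete child of the root: copy it, then drop it
                        # from the tree so memory use stays bounded.
                        elem.tail = None
                        out.write(ET.tostring(elem, encoding="unicode"))
                        out.write("\n")
                        root.clear()
                    # depth == 0 is the root's end tag: read past it.
        if root_tag is not None:
            # After the last file: write the root's end tag exactly once.
            out.write(f"</{root_tag}>\n")
```

A call might look like `merge_xml_files(["a.xml", "b.xml", "c.xml"], "merged.xml")`. Only one child of the root is held in memory at a time, which is the point of the streaming approach for 150meg inputs.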

If you have tried the above and cannot get it to work, you should post a
concise-but-complete code example showing what you've tried and how it
doesn't work for you. Then some specific advice with respect to your
attempt can be offered.

If you have not tried the above, well...you should.

Pete

Andy O'Neill · Feb 16, 2010

simon said:
Hi all,

I'm hoping that someone could help me with a relatively simple problem
I'm having. I have a number of quite large XML files - say 150meg each,

On average, there will be three files to be merged in this fashion. Can
anyone advise on what the best approach to this would be? I've been

I would be inclined to look at SQL Server and SSIS myself.
Maybe that's not an option for you, but servers, big files and overnight
batch processing kind of go together in my mind.

SSIS is pretty efficient and can do some dead clever stuff.
Large files are often the kind of thing that arrives overnight, where
having the result some time tomorrow is fine.
It may not be appropriate for some reason, but I just thought I'd run it
past you.
