Merging Xml Files

simon

Hi all,

I'm hoping that someone could help me with a relatively simple problem
I'm having. I have a number of quite large XML files, say 150 MB each,
and I need to perform a merge operation on them. However, because each
individual document has its own XML declaration and root element, I need to
find the most efficient way of stripping those elements out where
appropriate so that the relevant data can be merged into one large file.

On average, there will be three files to be merged in this fashion. Can
anyone advise on what the best approach to this would be? I've been
attempting to read the relevant files into memory and do text
manipulation to eliminate the header and footer elements that I don't
need, but I'm not having much luck. I'm beginning to wonder if it would
be better to try and use some of the XML APIs in .NET rather than trying
to treat it as purely a text manipulation and file I/O problem.

Thanks to anyone who can advise on a possible approach.

Best regards

Simon
 
Arne Vajhøj

It sounds as if just treating them as text files and using
StreamReader and StreamWriter would work. For the same approach,
but XML-aware, use XmlTextReader and XmlTextWriter.
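
A minimal sketch of the plain-text variant, for illustration only. It assumes the XML declaration, the root start tag and the root end tag each sit on a line of their own, and that the root start tag carries no attributes; neither is guaranteed for real files. `TextMerger`, `Merge` and the `rootName` parameter are hypothetical names, not part of .NET. Because it copies line by line, only one line is held in memory at a time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class TextMerger
{
    // Hypothetical helper: copies every line of every input to the output,
    // dropping the XML declaration and the root start/end tag lines, then
    // wraps the result in a single root element.
    // ASSUMPTION: those three items each occupy a whole line by themselves.
    public static void Merge(IList<string> inputPaths, string outputPath, string rootName)
    {
        string open = "<" + rootName + ">";
        string close = "</" + rootName + ">";

        using (var writer = new StreamWriter(outputPath))
        {
            writer.WriteLine(open);
            foreach (string path in inputPaths)
            {
                using (var reader = new StreamReader(path))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        string trimmed = line.Trim();
                        bool isHeaderOrFooter =
                            trimmed.StartsWith("<?xml", StringComparison.Ordinal) ||
                            trimmed == open ||
                            trimmed == close;
                        if (!isHeaderOrFooter)
                        {
                            writer.WriteLine(line);
                        }
                    }
                }
            }
            writer.WriteLine(close);
        }
    }
}
```

If the layout assumption does not hold (for example, the whole file is on one line), this silently produces a broken document, which is the main argument for the XML-aware classes.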

Arne
 
Peter Duniho

simon said:
Hi all,

I'm hoping that someone could help me with a relatively simple problem
I'm having. I have a number of quite large XML files, say 150 MB each,
and I need to perform a merge operation on them. However, because each
individual document has its own XML declaration and root element, I need to
find the most efficient way of stripping those elements out where
appropriate so that the relevant data can be merged into one large file.

To be clear, do you mean this:

File A:

<document>
<dataA />
<dataB />
<dataC />
</document>

File B:

<document>
<dataD />
<dataE />
<dataF />
</document>

Output:

<document>
<dataA />
<dataB />
<dataC />
<dataD />
<dataE />
<dataF />
</document>

?

Are there possibly duplicated elements between the input files? If so,
is it okay for them to be duplicated in the output, or does the
duplication itself need to be merged somehow?

I agree with Arne that it seems like simply reading the files and
writing a new one would be fine. I'd prefer the XML-specific
reader/writer approach, to avoid any possibility of breaking the
structure of each XML document (which it sounds like you are already
having trouble with).

If you can be more specific about what you believe it means to "merge"
two or more XML documents, it's possible you could get even better, more
specific advice.

Pete
 
Simon

Hi guys,

Thanks for that. Pete - you're right - the output of the merge should be:

<document>
<dataA />
<dataB />
<dataC />
<dataD />
<dataE />
<dataF />
</document>

Duplicates shouldn't occur - it should be as simple as snapping the
<xml> and root nodes off and then slapping them together.

The issue I'm having is in solving the problem whilst not having to read
the whole file into memory, because they are too large.

The bit I'm struggling with is where I need to remove the last
</rootNode></xml> node from any given document, as I can't find an
effective way of reading to that part of the file and snapping the end
off. If I read the whole document in, I run out of memory. If I use some
sort of "chunking" mechanism (which I think is exactly what I want), I
can't figure out how to detect the </rootNode> portion then snap it off.

If anyone can advise on an easy approach, I'd be very grateful.

Thanks

Simon
 
Peter Duniho

Simon said:
[...]
The issue I'm having is in solving the problem whilst not having to read
the whole file into memory, because they are too large.

The bit I'm struggling with is where I need to remove the last
</rootNode></xml> node from any given document, as I can't find an
effective way of reading to that part of the file and snapping the end
off. If I read the whole document in, I run out of memory. If I use some
sort of "chunking" mechanism (which I think is exactly what I want), I
can't figure out how to detect the </rootNode> portion then snap it off.

If anyone can advise on an easy approach, I'd be very grateful.

Arne's suggestion to use XmlTextReader/Writer should work fine. Have
you looked at those classes? It should be no more difficult than:

– for the first file, read and copy the outer-most document element
– for every other file, read past the outer-most document element
without copying it to the output
– for every file, read and copy every bit of content within the
outer-most document element
– for all files except the last, read past without copying the
element end for the outer-most document element
– finally, for the last file, read and copy the element end for the
outer-most document element
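
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not tested against 150 MB inputs: it uses `XmlReader.Create` and `XmlWriter.Create` (the factory methods Microsoft recommends over constructing `XmlTextReader`/`XmlTextWriter` directly since .NET Framework 2.0), it assumes every input has the same root element, and `XmlMerger` and `Merge` are hypothetical names. The reader pulls one node at a time, so no input file is loaded into memory as a whole.

```csharp
using System.Collections.Generic;
using System.Xml;

static class XmlMerger
{
    // Hypothetical helper: streams the children of each input's root element
    // into a single output document with one root element.
    public static void Merge(IList<string> inputPaths, string outputPath)
    {
        var readerSettings = new XmlReaderSettings { IgnoreWhitespace = true };
        var writerSettings = new XmlWriterSettings { Indent = true };

        using (XmlWriter writer = XmlWriter.Create(outputPath, writerSettings))
        {
            bool rootWritten = false;
            foreach (string path in inputPaths)
            {
                using (XmlReader reader = XmlReader.Create(path, readerSettings))
                {
                    // Skip the XML declaration and stop on the root start tag.
                    reader.MoveToContent();

                    if (!rootWritten)
                    {
                        // First file only: copy the root start tag and its attributes.
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        rootWritten = true;
                    }

                    // A root written as <document /> has nothing to copy.
                    if (reader.IsEmptyElement) continue;

                    // Step inside the root, then copy each child subtree.
                    // WriteNode leaves the reader on the next sibling, or on
                    // the root's end tag when there are no more children.
                    reader.Read();
                    while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                    {
                        writer.WriteNode(reader, false);
                    }
                    // The root end tag is deliberately not copied here.
                }
            }

            // After the last file: close the single shared root element.
            if (rootWritten) writer.WriteEndElement();
        }
    }
}
```

A call might look like `XmlMerger.Merge(new[] { "a.xml", "b.xml", "c.xml" }, "merged.xml");`, where the file names are placeholders. Note that the XML declaration has no closing tag, so the only thing to drop at the end of each input is the root element's end tag.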

If you have tried the above and cannot get it to work, you should post a
concise-but-complete code example showing what you've tried and how it
doesn't work for you. Then some specific advice with respect to your
attempt can be offered.

If you have not tried the above, well…you should. :)

Pete
 
Andy O'Neill

simon said:
Hi all,

I'm hoping that someone could help me with a relatively simple problem
I'm having. I have a number of quite large XML files, say 150 MB each,
[...]
On average, there will be three files to be merged in this fashion. Can
anyone advise on what the best approach to this would be? I've been
[...]

I would be inclined to look at SQL Server and SSIS myself.
Maybe that's not an option for you, but servers, big files and overnight
batch processing kind of go together in my mind.

SSIS is pretty efficient and can do some dead clever stuff.
Large files are often one of those things that arrive overnight, or where
some time tomorrow is fine.
It may not be appropriate for some reason, but I just thought I'd run it past
you.
 
