Parsing a complex XML file with C#

Pir8

I have a complex XML file that contains the stories within a magazine. The
structure of the file is as follows:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<magazine>
<story>
<story_id>112233</story_id>
<pub_name>Puleen's Publication</pub_name>
<pub_code>PP</pub_code>
<edition_date>20031201</edition_date>
<edition_name></edition_name>
<section_name></section_name>
<page_id></page_id>
<headline>My Story Headline</headline>
<subhead>Sub head</subhead>
<byline>Puleen</byline>
<source></source>
<dateline></dateline>
<storytype></storytype>
<column>Search</column>
<company_list></company_list>
<keyword_list></keyword_list>
<text><p>In other news....</p><p>second paragraph</p></text>
<photo>
<caption></caption>
<photo_filename>197943-96068.jpg</photo_filename>
<photocredit></photocredit>
</photo>
<photo>
<caption></caption>
<photo_filename>197943-96069.jpg</photo_filename>
<photocredit></photocredit>
</photo>
<photo>
<caption></caption>
<photo_filename>197943-96067.jpg</photo_filename>
<photocredit></photocredit>
</photo>
</story>
</magazine>

So there can be multiple <story> elements in each magazine. On the back end,
the data is stored in an Oracle database, but the photo data is stored in a
separate table from the story itself. What's the best way to parse the story
contents and build a query from them, and then parse the photo contents and
build a query from those?

Any ideas are welcome. I've been trying to parse the XML file, but I can't
think of a quick way of doing it. Maybe someone out there can point me in the
right direction or suggest a quick solution.
 
Guest

Serialize and deserialize for persistent storage.

Then play with the object :D Much easier
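
A minimal sketch of what that suggestion could look like with XmlSerializer,
assuming only a handful of the elements are mapped. The class and property
names (Magazine, Story, Photo, MagazineLoader) are made up for illustration;
elements that are not mapped, including <text> with its embedded markup, are
simply skipped here and would need separate handling.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical classes mirroring part of the posted format.
[XmlRoot("magazine")]
public class Magazine
{
    [XmlElement("story")]
    public List<Story> Stories { get; set; } = new List<Story>();
}

public class Story
{
    [XmlElement("story_id")]
    public string StoryId { get; set; }

    [XmlElement("headline")]
    public string Headline { get; set; }

    // Each repeated <photo> element becomes one list entry.
    [XmlElement("photo")]
    public List<Photo> Photos { get; set; } = new List<Photo>();
}

public class Photo
{
    [XmlElement("caption")]
    public string Caption { get; set; }

    [XmlElement("photo_filename")]
    public string FileName { get; set; }
}

public static class MagazineLoader
{
    // Deserializes a whole <magazine> document into objects.
    public static Magazine Load(TextReader reader)
    {
        var serializer = new XmlSerializer(typeof(Magazine));
        return (Magazine)serializer.Deserialize(reader);
    }
}
```

Because each Photo hangs off its Story object, the story id is available
whenever a photo row is written out.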
 
Daniel O'Connell

If I may ask, what kind of problems are you having? Serialization is
probably not your only answer (it could have flexibility issues). My
immediate idea would be to use XPath. Also, is this XML format set in
stone?
 
Pir8

My main worry is that the <text> element will contain HTML tags such as <p>,
<strong>, <a href=""> and so on. I realize I could use the node's InnerXml
property to retrieve this, but will there be any other complications in the
future?

The XML format I pasted is pretty much what I'm working with. There are some
other tags I did not include, which will also go into a separate Oracle table
of their own.

My other concern is that, when parsing the <story> and <photo> elements
separately, I need to associate the <story_id> with the data from the <photo>
section, so that the rows I enter into the database keep the appropriate
relationships for the application that will be using this data.

I will read more about XPath and how it can be helpful. I appreciate your
suggestions.
 
Daniel O'Connell

Pir8 said:
The main problem that I am concerned with is that within the <text> there
might be and will be html tags i.e. <p><strong><a href=""> and so on. I do
realize that I could use the node's innerxml property to retrieve this
but will there be any other complications in the future?

It *should* work, but it depends on your HTML. I wouldn't want to throw
non-XHTML HTML at an XML parser; it's just not particularly safe. You might
want to consider wrapping the body of text in a CDATA section.

Pir8 said:
The xml format that I pasted is pretty much the same...There are some other
tags that I did not include, which also will go into a separate table of its
own into oracle.

I asked about the format because, personally, I would have used an id
attribute instead of a <story_id> element.

Pir8 said:
My main concern is that, when parsing the <story>, <photo> separately, I
need to associate the <story_id> along with the data from the <photo>
section, so as to enter it into the database to keep the appropriate
relationships for the application that will be using this data.

I will read more about XPath and how it can be helpful. I appreciate your
suggestions.

Well, as a very basic approach I would probably query the XML document with
XPathNavigator using the XPath query /magazine/story, use the resulting
XPathNodeIterator to grab each story, and use subsequent queries to pull out
the various pieces.
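
The approach described above might be sketched roughly as follows. StoryRow
and StoryReader are hypothetical names, and only a few of the fields are
pulled out; treat it as a starting point rather than a finished importer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

// Hypothetical holder for the values that would feed the two tables.
public class StoryRow
{
    public string StoryId { get; set; }
    public string Headline { get; set; }
    public List<string> PhotoFiles { get; } = new List<string>();
}

public static class StoryReader
{
    public static List<StoryRow> ReadStories(TextReader xml)
    {
        var rows = new List<StoryRow>();
        XPathNavigator nav = new XPathDocument(xml).CreateNavigator();

        // One iteration per <story> element in the magazine.
        XPathNodeIterator stories = nav.Select("/magazine/story");
        while (stories.MoveNext())
        {
            XPathNavigator story = stories.Current;
            var row = new StoryRow
            {
                // Relative queries are evaluated against the current story.
                StoryId = story.SelectSingleNode("story_id")?.Value,
                Headline = story.SelectSingleNode("headline")?.Value
            };

            // Photos are collected while the story is still in scope,
            // so they stay tied to its story_id.
            XPathNodeIterator photos = story.Select("photo/photo_filename");
            while (photos.MoveNext())
            {
                row.PhotoFiles.Add(photos.Current.Value);
            }

            rows.Add(row);
        }
        return rows;
    }
}
```

Each StoryRow then carries everything needed for one story insert plus one
photo insert per entry in PhotoFiles.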
 
Nick Malik

Assuming that the file is valid XML (even with the embedded HTML), you can
easily extract components of the structure using XPath queries, and even
iterate over the structure, pulling out each photo and each item.

The query for story_id is literally: /magazine/story/story_id

--- Nick
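
As a rough illustration of those queries with XmlDocument: the sketch below
selects every <photo> directly and uses the relative path ../story_id to find
the story it belongs to, which is one way to keep the relationship between
the two tables. PhotoRow and PhotoRows are made-up names, and the database
side is deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical holder for one row of the photo table.
public class PhotoRow
{
    public string StoryId { get; set; }
    public string FileName { get; set; }
}

public static class PhotoRows
{
    // Returns one row per <photo>, each tagged with its parent story's id.
    public static List<PhotoRow> Extract(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var rows = new List<PhotoRow>();
        foreach (XmlNode photo in doc.SelectNodes("/magazine/story/photo"))
        {
            // ".." steps from the photo up to its enclosing <story>.
            XmlNode id = photo.SelectSingleNode("../story_id");
            XmlNode file = photo.SelectSingleNode("photo_filename");
            rows.Add(new PhotoRow
            {
                StoryId = id?.InnerText,
                FileName = file?.InnerText
            });
        }
        return rows;
    }

    // Returns the markup inside the first story's <text> element, tags included.
    public static string FirstStoryText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode text = doc.SelectSingleNode("/magazine/story/text");
        return text?.InnerXml;
    }
}
```

This only works while the embedded markup is well-formed XML; otherwise
LoadXml throws before any query runs.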
 
