Parsing complex xml file with C#

Pir8 · Jan 18, 2004

I have a complex xml file, which contains stories within a magazine. The
structure of the xml file is as follows:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<magazine>
<story>
<story_id>112233</story_id>
<pub_name>Puleen's Publication</pub_name>
<pub_code>PP</pub_code>
<edition_date>20031201</edition_date>
<edition_name></edition_name>
<section_name></section_name>
<page_id></page_id>
<headline>My Story Headline</headline>
<subhead>Sub head</subhead>
<byline>Puleen</byline>
<source></source>
<dateline></dateline>
<storytype></storytype>
<column>Search</column>
<company_list></company_list>
<keyword_list></keyword_list>
<text>In other news....second paragraph</text>
<photo>
<caption></caption>
<photo_filename>197943-96068.jpg</photo_filename>
<photocredit></photocredit>
</photo>
<photo>
<caption></caption>
<photo_filename>197943-96069.jpg</photo_filename>
<photocredit></photocredit>
</photo>
<photo>
<caption></caption>
<photo_filename>197943-96067.jpg</photo_filename>
<photocredit></photocredit>
</photo>
</story>
</magazine>

So there could be multiple <story> elements for each magazine. On the back
end, the data gets stored in an Oracle database. However, the data for the
photos is stored in a separate table from the actual story. What's the best
way to approach parsing the story contents and building a query out of them,
and then parsing the photo contents and building a query out of those?

Any ideas are welcome. I've been trying to parse the xml file, but I cannot
think of a quick way of doing this. Maybe someone out there can guide me in
the right direction and/or suggest a quick solution.

Guest · Jan 18, 2004

Serialize and deserialize for persistent storage.

Then play with the object

Much easier
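
For what it's worth, a rough sketch of that idea with XmlSerializer might
look like the following. The class and member names (Magazine, Story, Photo,
SerializerSketch) are made up for illustration, and only a handful of the
elements from the posted file are mapped. Elements without a mapping,
including <text>, are simply skipped here, so this is a starting point
rather than a complete mapping.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical classes mirroring part of the posted structure.
[XmlRoot("magazine")]
public class Magazine
{
    // Repeated <story> elements directly under <magazine>.
    [XmlElement("story")]
    public List<Story> Stories = new List<Story>();
}

public class Story
{
    [XmlElement("story_id")] public string StoryId;
    [XmlElement("headline")] public string Headline;

    // Repeated <photo> elements directly under <story>.
    [XmlElement("photo")]
    public List<Photo> Photos = new List<Photo>();
}

public class Photo
{
    [XmlElement("caption")] public string Caption;
    [XmlElement("photo_filename")] public string FileName;
    [XmlElement("photocredit")] public string Credit;
}

public static class SerializerSketch
{
    // Deserializes a <magazine> document into the object graph above.
    public static Magazine Load(TextReader reader)
    {
        var serializer = new XmlSerializer(typeof(Magazine));
        return (Magazine)serializer.Deserialize(reader);
    }

    public static void Main()
    {
        const string xml =
            "<magazine><story><story_id>112233</story_id>" +
            "<headline>My Story Headline</headline>" +
            "<photo><caption></caption>" +
            "<photo_filename>197943-96068.jpg</photo_filename>" +
            "<photocredit></photocredit></photo>" +
            "</story></magazine>";

        Magazine magazine = Load(new StringReader(xml));
        foreach (Story story in magazine.Stories)
            foreach (Photo photo in story.Photos)
                Console.WriteLine(story.StoryId + " -> " + photo.FileName);
    }
}
```

Because each Photo hangs off its Story, the story id is always at hand when
the photo rows are written out.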

Daniel O'Connell · Jan 19, 2004

If I may ask, what kind of problems are you having? Serialization is
probably not your only answer (it could have flexibility issues). My
immediate idea would be to use xpath. Also, is this xml format set in
stone?

Pir8 · Jan 19, 2004

The main problem that I am concerned with is that within the <text> element
there will be html tags, e.g. <a href=""> and so on. I do realize that I
could use the node's InnerXml property to retrieve this, but will there be
any other complications in the future?

The xml format that I pasted is pretty much the same...There are some other
tags that I did not include, which will also go into a separate table of
their own in Oracle.

My main concern is that, when parsing the <story> and <photo> sections
separately, I need to associate the <story_id> with the data from the
<photo> section, so that it is entered into the database with the
appropriate relationships for the application that will be using this data.

I will read more about XPath and how it can be helpful. I appreciate your
suggestions.

Daniel O'Connell · Jan 19, 2004

Pir8 said:
The main problem that I am concerned with is that within the <text> element
there will be html tags, e.g. <a href=""> and so on. I do realize that I
could use the node's InnerXml property to retrieve this, but will there be
any other complications in the future?

It *should* work, but it depends on your html. I wouldn't want to throw
non-xhtml html at an xml parser; it's just not particularly safe. You might
want to consider wrapping the body of text in a CDATA section.
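
To make the difference concrete, here is a small sketch using XmlDocument.
The helper name ReadBody, the class name TextBodySketch, and the trimmed-down
<story> fragments are assumptions for illustration only. When the html is
well-formed markup, InnerXml returns it with the tags intact; when it is
wrapped in CDATA, the parser treats it as plain characters and InnerText
hands it back untouched.

```csharp
using System;
using System.Xml;

public static class TextBodySketch
{
    // Returns the body of /story/text as a raw html string, whether it
    // was stored as well-formed child markup or inside a CDATA section.
    public static string ReadBody(string storyXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(storyXml);

        XmlNode text = doc.SelectSingleNode("/story/text");
        if (text == null) return "";

        // A CDATA child means the html was stored as opaque character
        // data, so InnerText gives the content without the CDATA wrapper.
        if (text.FirstChild is XmlCDataSection) return text.InnerText;

        // Otherwise the html was parsed as xml; InnerXml keeps the tags.
        return text.InnerXml;
    }

    public static void Main()
    {
        // Well-formed markup parses fine as child nodes.
        Console.WriteLine(ReadBody(
            "<story><text>See <a href=\"x\">this</a></text></story>"));

        // Loose html (unclosed <br>, unquoted attribute) only loads
        // because the CDATA section hides it from the parser.
        Console.WriteLine(ReadBody(
            "<story><text><![CDATA[Loose <br> and <a href=x>link</a>]]>" +
            "</text></story>"));
    }
}
```

Without the CDATA wrapper, the second fragment would make LoadXml throw an
XmlException, which is the safety issue with non-xhtml html.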

Pir8 said:
The xml format that I pasted is pretty much the same...There are some other
tags that I did not include, which will also go into a separate table of
their own in Oracle.

I asked about the format because, personally, I would have used an id
attribute instead of a <story_id> element.

Pir8 said:
My main concern is that, when parsing the <story> and <photo> sections
separately, I need to associate the <story_id> with the data from the
<photo> section, so that it is entered into the database with the
appropriate relationships for the application that will be using this data.

I will read more about XPath and how it can be helpful. I appreciate your
suggestions.

Well, as a very basic concept, I would probably query the xml document with
XPathNavigator using the xpath query /magazine/story, use the resultant
XPathNodeIterator to grab each story, and use subsequent queries to pull
out the various pieces.
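
Roughly, and only as a sketch of that approach, it could look like this.
The class and method names (NavigatorSketch, Describe) and the output format
are invented for the example; the element names come from the posted file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class NavigatorSketch
{
    // Walks every story and returns one line per story plus one line
    // per photo belonging to that story.
    public static List<string> Describe(TextReader xml)
    {
        var lines = new List<string>();
        XPathNavigator nav = new XPathDocument(xml).CreateNavigator();

        // Outer query: one node per <story>.
        XPathNodeIterator stories = nav.Select("/magazine/story");
        while (stories.MoveNext())
        {
            XPathNavigator story = stories.Current;

            // Subsequent queries are relative to the current story.
            string id = story.SelectSingleNode("story_id")?.Value ?? "";
            string headline = story.SelectSingleNode("headline")?.Value ?? "";
            lines.Add(id + ": " + headline);

            XPathNodeIterator photos = story.Select("photo/photo_filename");
            while (photos.MoveNext())
                lines.Add("  photo " + photos.Current.Value);
        }
        return lines;
    }

    public static void Main()
    {
        const string xml =
            "<magazine><story><story_id>112233</story_id>" +
            "<headline>My Story Headline</headline>" +
            "<photo><photo_filename>197943-96068.jpg</photo_filename></photo>" +
            "<photo><photo_filename>197943-96069.jpg</photo_filename></photo>" +
            "</story></magazine>";

        foreach (string line in Describe(new StringReader(xml)))
            Console.WriteLine(line);
    }
}
```

Since the photo queries run inside the loop for a single story, the story id
is already in scope when each photo is read.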

Nick Malik · Jan 19, 2004

Assuming that the file is valid XML (even with the embedded HTML), you can
easily extract components of the structure using XPath queries, and even
iterate over the structure, pulling out each photo and each item.

The query for story_id is literally: /magazine/story/story_id
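
As an illustration, here is a minimal sketch with XmlDocument that selects
every photo and walks back up to the parent story for its id, giving one
(story_id, photo_filename) pair per row of the photo table. The names
PhotoRowsSketch, StoryIds, and PhotoRows are hypothetical, and the database
insert itself is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PhotoRowsSketch
{
    // Every story id in the document, using the literal query above.
    public static List<string> StoryIds(XmlDocument doc)
    {
        var ids = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("/magazine/story/story_id"))
            ids.Add(node.InnerText);
        return ids;
    }

    // One { story_id, photo_filename } pair per <photo> element.
    public static List<string[]> PhotoRows(XmlDocument doc)
    {
        var rows = new List<string[]>();
        foreach (XmlNode photo in doc.SelectNodes("/magazine/story/photo"))
        {
            // "../story_id" steps from the photo up to its own story.
            string storyId = photo.SelectSingleNode("../story_id")?.InnerText ?? "";
            string file = photo.SelectSingleNode("photo_filename")?.InnerText ?? "";
            rows.Add(new[] { storyId, file });
        }
        return rows;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<magazine><story><story_id>112233</story_id>" +
            "<photo><photo_filename>197943-96068.jpg</photo_filename></photo>" +
            "<photo><photo_filename>197943-96069.jpg</photo_filename></photo>" +
            "</story></magazine>");

        foreach (string id in StoryIds(doc))
            Console.WriteLine("story " + id);
        foreach (string[] row in PhotoRows(doc))
            Console.WriteLine(row[0] + " | " + row[1]);
    }
}
```

Each pair could then be bound to the parameters of an insert statement for
the photo table, which keeps the relationship to the story intact.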

--- Nick
