Regex replace question

Hardy Wang

Hi,
I have XML like this:
<?xml version="1.0" ?>
<object>
<comments>www.site.com/page.aspx?param1=value1&param2=value2</comments>
</object>

Since a bare "&" is invalid in XML, I need to replace every "&" with "&amp;", but only
within the <comments> tag. So I need to build a Regex pattern that replaces "&"
only between <comments> and </comments>.
Does anybody have an idea how to do this?


Thanks!
 
Eric Gunnerson [MS]

I'd try something like:

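(?<start>\<comments\>.+?)&(?<end>.+?\</comments\>)

The .+? is a non-greedy match, so each group takes as little text as it can
instead of running on past the closing tag.
You'll need to refer to the start and end captures in your replacement
string (for example ${start}&amp;${end}) so that that part of the text ends up
back in the string. Because the match consumes the whole element, a single
Replace call fixes only one & per <comments> element.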
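
A minimal sketch of the named-group approach above, not a definitive implementation. The class and method names are hypothetical. Two things here are assumptions added for the sketch and are not part of Eric's pattern: [^<]* replaces .+ so an & directly after the opening tag is also caught, and a (?!amp;) guard lets the replacement be repeated safely until nothing changes. Other existing entities such as &lt; are not recognised by this guard.

```csharp
using System.Text.RegularExpressions;

public static class NamedGroupSketch
{
    // Matches one bare '&' inside a <comments> element. 'start' and 'end'
    // capture the text on either side so the replacement can put it back.
    // (?!amp;) is an added guard: it skips '&' that is already "&amp;".
    private static readonly Regex BareAmp = new Regex(
        @"(?<start><comments>[^<]*?)&(?!amp;)(?<end>[^<]*</comments>)");

    public static string Escape(string xml)
    {
        string previous;
        do
        {
            // Each pass fixes at most one '&' per <comments> element,
            // so repeat until the text stops changing.
            previous = xml;
            xml = BareAmp.Replace(xml, "${start}&amp;${end}");
        } while (xml != previous);
        return xml;
    }
}
```

Calling NamedGroupSketch.Escape on the sample document should leave everything outside <comments> untouched.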

--
Eric Gunnerson

Visit the C# product team at http://www.csharp.net
Eric's blog is at http://weblogs.asp.net/ericgu/

This posting is provided "AS IS" with no warranties, and confers no rights.
 
BMermuys

Hi

Because .NET supports variable-length lookbehind (which most other regex
engines do not), you can do something like this:
string output = Regex.Replace(input,
"(?<=\\<comments\\>[^\\<\\>]*?)&(?=[^\\<\\>]*\\</comments\\>)", "&amp;");

This uses lookahead and lookbehind to make sure the & is inside comments tags.
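
To show the effect of the lookaround pattern on a document that also has an & outside <comments>, here is a small sketch. The class and method names are hypothetical, and the pattern is written as a verbatim string without the redundant backslashes. It assumes the comment text contains no nested tags and that the file is cleaned only once, since a second run would escape the & in "&amp;" again.

```csharp
using System.Text.RegularExpressions;

public static class LookaroundSketch
{
    // A '&' preceded by "<comments>" plus tag-free text and followed by
    // tag-free text plus "</comments>". The lookarounds consume nothing,
    // so every '&' in the element is matched in a single pass.
    private static readonly Regex AmpInComments = new Regex(
        @"(?<=<comments>[^<>]*)&(?=[^<>]*</comments>)");

    public static string Escape(string xml)
    {
        return AmpInComments.Replace(xml, "&amp;");
    }
}
```

Because the lookarounds do not consume the surrounding text, no capture groups are needed in the replacement string.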

HTH,
greetings
 
Nick Malik

You wrote: "I have a XML like"
Clearly, you don't have XML, because the string is not well formed. I
assume, therefore, that you are actually CREATING the XML in your code.

So you are creating the XML document in code, and you want to replace all of
the & characters because the resulting XML would otherwise be invalid.

Why not just create an XML object, add the <object> node, under it add the
<comments> node, and in that provide the text? The XML object will escape
the characters for you when you output the document.

Creating the object in a string is the problem.
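
A rough sketch of that first suggestion, assuming XmlDocument from System.Xml is the "XML object" meant here. The class and method names are hypothetical.

```csharp
using System.Xml;

public static class XmlBuilderSketch
{
    public static string BuildObjectXml(string commentText)
    {
        XmlDocument doc = new XmlDocument();

        // <object> is the root node.
        XmlElement root = doc.CreateElement("object");
        doc.AppendChild(root);

        // The raw text goes in unescaped; the '&' is written as "&amp;"
        // when the document is serialized.
        XmlElement comments = doc.CreateElement("comments");
        comments.InnerText = commentText;
        root.AppendChild(comments);

        return doc.OuterXml;
    }
}
```

Passing the URL from the original post as commentText should produce a <comments> element whose & is written as &amp;.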

On the other hand, if you are building it as a string in code, you can replace all of
the invalid characters BEFORE placing the text in the XML tags. I believe that
there is a method similar to HTMLEncode that will do this for you... and
then you can add the resulting string to the tags.
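
The post does not name the method. One candidate in the .NET Framework is System.Security.SecurityElement.Escape, which replaces <, >, ", ' and & with their XML entities. A minimal sketch under that assumption, with hypothetical class and method names:

```csharp
using System.Security;

public static class EscapeBeforeWrapSketch
{
    public static string WrapComment(string rawText)
    {
        // Escape the text first, then place it between the tags.
        return "<comments>" + SecurityElement.Escape(rawText) + "</comments>";
    }
}
```

This only helps when you build the string yourself. It does not repair a file that already arrives with bare & characters.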

So, two solutions... neither requiring difficult Regex programming.

Hope this helps,
--- Nick
 
Hardy Wang

Thanks, man. My program needs to receive a text file passed from a
third party, and unfortunately we cannot control the output from the other side. The
text file SHOULD be just an XML document, but sadly there are some "&" in it. So
that is the reason I need to clean them up.
 
mortb

I think this would do it for you:

string goodXML = Regex.Replace(badXML,
@"(?<=\<comments\>.*)&(?=.*\</comments\>)", "&amp;");

Regular expressions are in a strange but seemingly beautiful domain ;-)

cheers,

mortb
 
