Regex help

Stephen Brown

I have a simple regex need and I've already wasted too much time on it
spinning in circles. Can a regex god help a stranded soul? I just need to
replace all non-escaped ampersands in a file. It needs to skip escaped
ampersands such as &amp; and &nbsp;.

&[a-zA-Z0-9]+; will match the escaped ampersands (some improper escapes will
slip by, but that's good enough for my purposes), but I need to replace all the
ampersands that aren't escaped.

example:
&abc;&def& ghi&_jkl
after replace:
&abc;*def* ghi*_jkl
 
Paul E Collins

Stephen said:
&[a-zA-Z0-9]+; will get the escaped ampersands
(some improper escapes will slip by, but good enough
for my purposes), but I need to replace all the ampersands
that aren't escaped
example:
&abc;&def& ghi&_jkl
after replace:
&abc;*def* ghi*_jkl

You can't do this unambiguously. If you've got a file that's somehow
been *partially* escaped, it's no longer in a state that makes any
sense, and you can't tell "&123;" (intended to be an escaped
character) from the identical "&123;" (an unescaped ampersand that
just happens to be followed by the string "123"). Where are you
getting this input from?

Eq.
 
Jesse Houwing

* Stephen Brown wrote, On 18-8-2006 1:13:
I have a simple regex need and I've already wasted too much time on it
spinning in circles. Can a regex god help a stranded soul? I just need to
replace all non-escaped ampersands in a file. It needs to skip escaped
ampersands such as &amp; and &nbsp;.

&[a-zA-Z0-9]+; will get the escaped ampersands (some improper escapes will
slip by, but good enough for my purposes), but I need to replace all the
ampersands that aren't escaped

example:
&abc;&def& ghi&_jkl
after replace:
&abc;*def* ghi*_jkl

This will probably do for most circumstances, though Paul's remark still
applies, of course.

&(?![a-z0-9]+;)

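This finds every '&' that is not directly followed by one or more letters
or digits and then a ';'. Note that the character class is lower-case only,
so either match case-insensitively or use [a-zA-Z0-9] as in your own pattern.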
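
For anyone who wants to try the lookahead quickly, here is a minimal sketch in Python (the thread does not say which language Stephen is using, so the language, the function name replace_unescaped and the use of re.IGNORECASE are my assumptions, not part of Jesse's answer). It applies the pattern above to Stephen's example.

```python
import re

# Jesse's pattern: an '&' that is NOT followed by letters/digits and a ';'.
# re.IGNORECASE is an assumption, added so upper-case names are skipped too.
UNESCAPED_AMP = re.compile(r"&(?![a-z0-9]+;)", re.IGNORECASE)


def replace_unescaped(text, replacement="*"):
    """Replace every '&' that does not start something shaped like '&name;'."""
    # A function is used as the replacement so that backslashes in
    # `replacement` are inserted literally rather than treated as escapes.
    return UNESCAPED_AMP.sub(lambda match: replacement, text)


print(replace_unescaped("&abc;&def& ghi&_jkl"))  # &abc;*def* ghi*_jkl
```

As Paul points out, this cannot tell a real escape from an unescaped ampersand that merely looks like one, so "&abc;" is left alone even though it is not a real entity.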

To really ensure you're only escaping unescaped '&'s you'll need to
write a very long regex that looks like:

&(?!([0-9]+|copy|euml|amp|......);)

Where you'll have to fill the dots with all allowed escape sequences and
optimize afterwards (so that 'amp' and 'auml' become 'a(?:uml|mp)') for
improved speed. But my guess is that in most cases the first option will
suffice.
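
A rough sketch of the stricter variant, again in Python and again only under assumptions: ALLOWED_NAMES is a deliberately tiny, illustrative list (it is not the full set of escape sequences, which you would have to supply yourself), and build_pattern and replace_unescaped_strict are hypothetical helper names. It generates the alternation from the list instead of typing it by hand and skips the manual 'a(?:uml|mp)' optimization.

```python
import re

# Illustrative subset only (an assumption); a real list would need every
# escape sequence you want to treat as valid.
ALLOWED_NAMES = ["copy", "euml", "amp", "auml"]


def build_pattern(names):
    """Build '&(?!(?:[0-9]+|name1|name2|...);)' from a list of names."""
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"&(?!(?:[0-9]+|" + alternatives + r");)")


STRICT_AMP = build_pattern(ALLOWED_NAMES)


def replace_unescaped_strict(text, replacement="*"):
    """Replace every '&' that does not start one of the allowed escapes."""
    return STRICT_AMP.sub(lambda match: replacement, text)


print(replace_unescaped_strict("&amp; &abc; &copy; &x"))
```

With this version "&abc;" is no longer skipped, because "abc" is not in the allowed list, while "&amp;" and "&copy;" are left untouched.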

Jesse Houwing
 
