HowTo? RegEx - pattern to exclude the whole word

shonend

I am trying to extract text that matches a pattern like this:

"SUB: some text LOT: one-word"

In other words, "SUB:" and "LOT:" are keywords; I want those keywords,
everything in between, and the one word following "LOT:". The source text
may contain multiple "SUB: ... LOT:" blocks.

For example, this is my source text:

SUB: this text I want to extract LOT: 2345 , something in between, new
SUB: again something I want to extract LOT: 2145 and more text here,
the end

When I apply this pattern:

SUB:\s+[^\r\n]+\s+LOT:\s+[^\r\n\s]+

in .NET's Regex.Matches(...), I only get one match:

SUB: this text I want to extract LOT: 2345 , something in between, new
SUB: again something I want to extract LOT: 2145

Obviously, something in this regex tells it to be "greedy", and I need
the partial matches too.

I thought this pattern would return ALL matches, which are:
1) SUB: this text I want to extract LOT: 2345
2) SUB: again something I want to extract LOT: 2145
3) SUB: this text I want to extract LOT: 2345 , something in between,
new SUB: again something I want to extract LOT: 2145

The last one I don't need of course, but I can handle it - ignore it,
and use only the first two.
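
A note on the behaviour described above: Regex.Matches returns successive, non-overlapping matches, so it never reports results 1) and 2) alongside 3). Because `[^\r\n]+` is greedy, the single match runs to the last "LOT:" it can reach. A minimal sketch of the difference a lazy quantifier (`+?`) makes is below. It is written in Python rather than .NET, and it assumes the sample text is one physical line (the line breaks in the post look like newsreader wrapping); greedy and lazy quantifiers work the same way in both engines.

```python
import re

# Sample text from the post, assumed here to be a single physical line.
text = ("SUB: this text I want to extract LOT: 2345 , something in between, new "
        "SUB: again something I want to extract LOT: 2145 and more text here, "
        "the end")

# Original pattern: the greedy [^\r\n]+ runs to the LAST "LOT:" on the line.
greedy = r"SUB:\s+[^\r\n]+\s+LOT:\s+[^\r\n\s]+"

# Only change: +? makes the middle part stop at the FIRST "LOT:" it can reach.
lazy = r"SUB:\s+[^\r\n]+?\s+LOT:\s+[^\r\n\s]+"

greedy_matches = re.findall(greedy, text)
lazy_matches = re.findall(lazy, text)

print(len(greedy_matches))   # one long match spanning both blocks
for m in lazy_matches:       # one match per SUB: ... LOT: block
    print(m)
```

The lazy version only guarantees that each match ends at the first "LOT:" after its "SUB:"; it does not by itself forbid other keywords in between.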

So my idea was to modify my pattern to read like this:
give me all matches resembling text between "SUB:" and "LOT:",
including those keywords, plus one word after "LOT:", but (!) the text
between cannot contain "LOT:"

If I manage to compose such a RegEx pattern, it would even eliminate
result 3) and return only what I really need. But the problem is how
to define a pattern that will exclude a whole word. I
tried the "[^ ... ]" pattern, but that works only for the single characters
listed between the brackets.
For example:

SUB:\s+[^\r\n(LOT:)]+\s+LOT:\s+[^\r\n\s]+

is not working. I thought that the "( )" brackets would group the
characters and tell the regex not to match any occurrence of the whole
word "LOT:". But instead, it rejects any text that contains any of
these characters:
) ( : L T O
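
To illustrate why this happens: inside `[...]` every character is a literal member of a set, so `[^\r\n(LOT:)]` means "any one character except CR, LF, `(`, `L`, `O`, `T`, `:` or `)`". The sketch below is in Python (character classes behave the same way in .NET), and the two input strings are made up for illustration; they are not from the post.

```python
import re

pattern = r"SUB:\s+[^\r\n(LOT:)]+\s+LOT:\s+[^\r\n\s]+"

# Hypothetical inputs, chosen to show the effect of the character class.
ok = "SUB: plain lowercase words LOT: 2345"   # no ( ) : L O T in the middle
blocked = "SUB: Total cost (net) LOT: 2345"   # the 'T' of "Total" is excluded

ok_match = re.search(pattern, ok)
blocked_match = re.search(pattern, blocked)

# Which single characters does the class refuse? Each one is tested on its own.
excluded = [c for c in "()LOT:" if re.fullmatch(r"[^\r\n(LOT:)]", c) is None]

print(ok_match is not None)       # the class never meets an excluded character
print(blocked_match is not None)  # the very first character after "SUB: " is rejected
print(excluded)
```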

So if you could answer at least one of the following questions, I would
appreciate it very much:

1) generally, how do you compose a regex pattern that does not match
text containing a certain word?
2) if there is no easy solution for 1), or there is a better solution
for the problem I described above, what is it?

Thank you so much!

Shone
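
For question 1), one common technique (not mentioned in the thread itself) is a negative lookahead applied before every character: `(?:(?!LOT:)[^\r\n])+` consumes one character at a time, but only at positions where "LOT:" does not start. The sketch below is in Python and assumes the sample text is a single physical line; .NET supports the same `(?!...)` syntax, so the pattern should carry over unchanged.

```python
import re

# Sample text from the post, assumed here to be a single physical line.
text = ("SUB: this text I want to extract LOT: 2345 , something in between, new "
        "SUB: again something I want to extract LOT: 2145 and more text here, "
        "the end")

# (?!LOT:) is checked before each character of the middle part,
# so the text between "SUB:" and the closing "LOT:" cannot contain "LOT:".
pattern = r"SUB:\s+(?:(?!LOT:)[^\r\n])+\s+LOT:\s+[^\r\n\s]+"

matches = re.findall(pattern, text)
for m in matches:
    print(m)
```

Listing more words in the lookahead, for example `(?!LOT:|SUB:)`, would also keep a match from running past a "SUB:" that has no "LOT:" of its own.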
 
Nanda Lella[MSFT]

Hello,

Have you tried something like this?

(sub:(.)*LOT:(.)+?\s)

Let me know if it solved your problem.

--------------------
From: (e-mail address removed)
Newsgroups: microsoft.public.dotnet.general
Subject: HowTo? RegEx - pattern to exclude the whole word
Date: 8 Feb 2006 11:01:28 -0800

--

Thank You,
Nanda Lella,

This Posting is provided "AS IS" with no warranties, and confers no rights.
 
shonend

Unfortunately not.
It returns the same result (only one match) as my original pattern.

thanks
 
shonend

Found it!
Actually it's more a workaround than a solution, but as long as it
works...

The approach: natural ingenuity and ancient wisdom: "if you can't beat
them - join them!".
As far as I can tell, RegEx doesn't have the pattern I need, i.e. a [^...]
class only excludes single characters. OK, I'll give it a single character:
simply replace all occurrences of the keyword "LOT:" in the source string
with one single character. You only have to be careful to pick one that
certainly will not appear as a regular character in the original
text. In this case ~ (tilde) is fine. So the regex pattern is:

SUB:\s+[^\r\n~]+\s+~\s+[^\r\n\s]+

Applied to the modified source text, it returns exactly what I want.
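
The workaround can be sketched roughly as follows. This is Python rather than .NET, the name `PLACEHOLDER` is made up for the example, and the sample text is assumed to be one physical line; the final step that puts "LOT:" back into each match is an addition not described in the post.

```python
import re

# Sample text from the post, assumed here to be a single physical line.
text = ("SUB: this text I want to extract LOT: 2345 , something in between, new "
        "SUB: again something I want to extract LOT: 2145 and more text here, "
        "the end")

PLACEHOLDER = "~"  # hypothetical name; must not occur anywhere in the source text
if PLACEHOLDER in text:
    raise ValueError("placeholder already occurs in the text; pick another one")

# Step 1: collapse the keyword into a single character.
modified = text.replace("LOT:", PLACEHOLDER)

# Step 2: now a plain negated character class can exclude the "keyword".
pattern = r"SUB:\s+[^\r\n~]+\s+~\s+[^\r\n\s]+"

# Step 3 (assumption): restore the keyword in each match.
matches = [m.replace(PLACEHOLDER, "LOT:") for m in re.findall(pattern, modified)]
for m in matches:
    print(m)
```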

Thanks for looking, take care...
Shone
 
Nanda Lella[MSFT]

Glad to see that you figured it out.

And by the way, (SUB:(.)*LOT:(.)+?\s) works for me.
I don't know how you are applying it. I used the case-insensitive and
multiline options, and it returned the following results:

+ Match [0] SUB: this text I want to extract LOT: 2345
+ Match [1] SUB: again something I want to extract LOT: 2145
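
One possible explanation for the differing results, which the thread does not confirm, is that the two tests used differently shaped input. `.` does not match a newline, so the greedy `(.)*` cannot cross a line break; if the sample text is pasted with the line breaks shown in the post, each "SUB:" block sits on its own line and is matched separately. On a single physical line the same pattern runs to the last "LOT:". The sketch below checks this in Python (the helper `find_blocks` is a made-up name); `.` behaves the same way in .NET, where the multiline option only changes `^` and `$`.

```python
import re

line1 = "SUB: this text I want to extract LOT: 2345 , something in between, new"
line2 = "SUB: again something I want to extract LOT: 2145 and more text here,"
line3 = "the end"

pattern = re.compile(r"(SUB:(.)*LOT:(.)+?\s)", re.IGNORECASE | re.MULTILINE)

def find_blocks(text):
    """Return the full text of each match, without the trailing whitespace
    that the final \\s in the pattern captures."""
    return [m.group(0).strip() for m in pattern.finditer(text)]

wrapped = "\n".join([line1, line2, line3])  # line breaks as displayed in the post
single = " ".join([line1, line2, line3])    # everything on one physical line

wrapped_matches = find_blocks(wrapped)
single_matches = find_blocks(single)

print(len(wrapped_matches))  # one match per line that contains a block
print(len(single_matches))   # greedy (.)* reaches the last "LOT:" on the line
```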


--------------------
From: (e-mail address removed)
Newsgroups: microsoft.public.dotnet.general
Subject: Re: HowTo? RegEx - pattern to exclude the whole word
Date: 9 Feb 2006 08:03:36 -0800

 
