Regular Expression Help - when the components are optional

erxuan

The situation is like this:

Given a paragraph, there are 3 subsections that I want to capture and
they have subtitles "Restaurant Description", "Best Known For" and
"Reviews" respectively.

The paragraph can be something like this: "blah blah0... Restaurant
Description blah blah1... Best Known For blah blah2... Reviews blah
blah3..."

I was trying to use

Pattern = "Restaurant Description(.*?)Best Known For(.*?)Reviews(.*)";

to capture the 3 subsections and then assign the values to different
labels:

string Description = m.Groups[1].Value;
string Specialities = m.Groups[2].Value;
string Reviews = m.Groups[3].Value;

However, any subsection can be optional. The paragraph can be "blah
blah0 ...Best Known For blah blah1..."

What regex should I use so that I capture the right subsections in
such a situation?

Thanks a lot!
 
Jesse Houwing

Hello erxuan,

You could use:

"(?:Restaurant Description(?<description>.*))?(?:Best Known For(?<knownfor>.*))?(?:Reviews(?<review>.*))?"

I figured there is no real need for reluctant matching (.*?), but it
might be faster depending on the length of each section.

Now, in your code, use this:

if (m.Groups["description"].Success)
{
    string description = m.Groups["description"].Value;
}

Jesse
 
erxuan

Thanks Jesse for the reply.
However, it does not give the expected result.

string text = @"Restaurant Description Data1 Best Known For Data2 Reviews Data3";
string pattern = "(?:Restaurant Description(?<description>.*))?(?:Best Known For(?<specialities>.*))?(?:Reviews and Awards(?<review>.*))?";
Match m = Regex.Match(text, pattern);
if (m.Groups["description"].Success)
{
    string description = m.Groups["description"].Value;
    Console.WriteLine("Description: {0}", description);
}

It prints "Description: Data1 Best Known For Data2 Reviews Data3".

If we use the reluctant match .*? instead:

string pattern = "(?:Restaurant Description(?<description>.*?))?(?:Best Known For(?<specialities>.*))?(?:Reviews and Awards(?<review>.*))?";

the description is an empty string.
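
For anyone following along, here is a minimal sketch that reproduces both results side by side. It is not from the thread itself: the class name and the shortened two-group patterns are assumptions made for illustration. The greedy .* runs to the end of the input, while the reluctant .*? starts with zero characters and never grows, because everything after it is optional and the overall match already succeeds.

```csharp
using System;
using System.Text.RegularExpressions;

class GreedyVersusReluctant
{
    static void Main()
    {
        string text = "Restaurant Description Data1 Best Known For Data2 Reviews Data3";

        // Greedy: .* consumes the rest of the line, so the later optional
        // group has nothing left to match.
        string greedy =
            @"(?:Restaurant Description(?<description>.*))?" +
            @"(?:Best Known For(?<specialities>.*))?";

        // Reluctant: .*? tries zero characters first. The following group is
        // optional, so the match succeeds immediately and .*? never expands.
        string reluctant =
            @"(?:Restaurant Description(?<description>.*?))?" +
            @"(?:Best Known For(?<specialities>.*))?";

        // Brackets make leading spaces and empty captures visible.
        Console.WriteLine("[{0}]", Regex.Match(text, greedy).Groups["description"].Value);
        Console.WriteLine("[{0}]", Regex.Match(text, reluctant).Groups["description"].Value);
        // Prints:
        // [ Data1 Best Known For Data2 Reviews Data3]
        // []
    }
}
```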
 
G

GS

Naturally, .* is greedy and gobbles up everything.
Assuming the sections contain words only, \b\w\w*(\s\w*)* instead of .*
may work better.

Of course, you may wish to allow some punctuation marks as well as the
\s.

However, if that does not work, try a negative lookahead after the \s\w
part that rejects the "next thing", such as "Best Known For".

Good luck

Jesse Houwing

Hello erxuan,

You could try making the whole groups reluctant:

"(?:Restaurant Description(?<description>.*?))??(?:Best Known For(?<knownfor>.*?))??(?:Reviews(?<review>.*?))??"

I just reinstalled Windows Vista after a defective driver and haven't
had time to reinstall Visual Studio, so I'm unable to test this right
now.

Alternatively,

(?:Restaurant Description(?<description>((?!Best Known For|Reviews).)*))?
(?:Best Known For(?<knownfor>((?!Reviews).)*))?
(?:Reviews(?<review>((?!Reviews).)*))?

should surely do the trick. It makes sure each item stops as soon as the
start of the next item is found.
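
A minimal sketch of how the lookahead version might be exercised in C#, assuming the three lines are concatenated into one pattern with no whitespace between them and that the third group is named "review". The class name is made up for the example, and the last group simply uses .* since nothing follows it.

```csharp
using System;
using System.Text.RegularExpressions;

class LookaheadSections
{
    static void Main()
    {
        string text = "Restaurant Description Data1 Best Known For Data2 Reviews Data3";

        // Each (?:(?!next heading).)* consumes one character at a time,
        // but only while the next heading does not start at that position.
        string pattern =
            @"(?:Restaurant Description(?<description>(?:(?!Best Known For|Reviews).)*))?" +
            @"(?:Best Known For(?<knownfor>(?:(?!Reviews).)*))?" +
            @"(?:Reviews(?<review>.*))?";

        Match m = Regex.Match(text, pattern);
        foreach (string name in new[] { "description", "knownfor", "review" })
        {
            Group g = m.Groups[name];
            // Success is false when the optional section did not take part in the match.
            Console.WriteLine("{0}: {1}", name, g.Success ? g.Value.Trim() : "(not present)");
        }
        // Prints:
        // description: Data1
        // knownfor: Data2
        // review: Data3
    }
}
```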

Jesse
 
Jesse Houwing

Hello GS,
> Naturally .* is greedy and gobbles up everything
> assuming words only \b\w\w*(\s\w*)* instead of .* may work better
> of course you may wish to include some punctuation marks instead the
> \s* following \s

The problem is that you'll be adding new options for every new input
you're parsing. It'll be a large try/retry/re-retry fest and it'll
never be quite right.

> However if it does not, you try look ahead at end of \s\w for not
> "next thing" like "Best Known For"
> Good luck

This should work better. I should've thought of that before ;) I already
added a sample expression to my post of about 2 minutes ago.

Jesse

erxuan

Thanks for all the replies! Yes, the last one works really well (with
one trivial change: I assume we don't need "(?!Reviews)" for the last
group).

(Restaurant Description(?<description>((?!Best Known For|Reviews).)*))?
(Best Known For(?<knownfor>((?!Reviews).)*))?
(Reviews(?<review>.*))?
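
One caveat worth checking: because every group is optional, Regex.Match can succeed with an empty match at position 0 when the paragraph starts with other text ("blah blah0 ..."), leaving all groups unset. The sketch below is one possible way around that, not something posted in the thread: it assumes the headings always appear in the same order and adds a leading part that skips everything before the first heading. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class OptionalSections
{
    // The final expression from this thread, preceded by a skip over any
    // text that comes before the first heading (an added assumption).
    static readonly Regex Sections = new Regex(
        @"^(?:(?!Restaurant Description|Best Known For|Reviews).)*" +
        @"(?:Restaurant Description(?<description>(?:(?!Best Known For|Reviews).)*))?" +
        @"(?:Best Known For(?<knownfor>(?:(?!Reviews).)*))?" +
        @"(?:Reviews(?<review>.*))?",
        RegexOptions.Singleline); // let . match newlines inside a paragraph

    static void Show(string text)
    {
        Match m = Sections.Match(text);
        foreach (string name in new[] { "description", "knownfor", "review" })
        {
            Group g = m.Groups[name];
            Console.WriteLine("{0}: {1}", name, g.Success ? g.Value.Trim() : "(not present)");
        }
    }

    static void Main()
    {
        // All three sections, with leading text.
        Show("blah blah0... Restaurant Description blah blah1... Best Known For blah blah2... Reviews blah blah3...");
        // Only the middle section.
        Show("blah blah0 ...Best Known For blah blah1...");
    }
}
```

With these inputs the first call reports all three sections and the second reports only knownfor, with the other two marked as not present. Headings that appear out of order, or heading words that occur inside a section's own text, are not handled by this sketch.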
 