Regex puzzle


Alan Pretre

Can anyone help me figure out a regex pattern for the following input
example:

xxx:a=b,c=d,yyy:e=f,zzz:www:g=h,i=j,l=m

I would want four matches from this:
1. xxx a=b,c=d
2. yyy e=f
3. zzz (empty)
4. www g=h,i=j,l=m

The letters here are not literal single characters; they are placeholders
for arbitrary words. For example,

LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP

Would result in:
1. LTG LTG=2-41-53-57
2. JOB JN=113&&116&125&&127
3. CPT CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP

Everything I've come up with so far would require me to iterate over
substrings. It'd be nice to have just a single matching operation. TIA.

-- Alan
 

Brian Davis

Give this one a try:

(?n)((?<item>[A-Za-z]+):(?<value>[A-Za-z]+=.*?)?(?=((,[A-Za-z]+:)|$)))+

For your input, it gives 3 matches, each with "item"
and "value" groups for what comes before and after the
colon.


Brian Davis
www.knowdotnet.com
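
For anyone who wants to try the pattern above from code, here is a minimal sketch that runs it with Regex.Matches and reads the "item" and "value" groups from each match. The class and method names (ItemValueDemo, Parse) are made up for illustration and are not from the post; only the pattern string is Brian's.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ItemValueDemo
{
    // Brian's pattern from the post above, joined onto one line.
    public const string Pattern =
        @"(?n)((?<item>[A-Za-z]+):(?<value>[A-Za-z]+=.*?)?(?=((,[A-Za-z]+:)|$)))+";

    // Returns one "item value" string per match, in input order.
    // (Hypothetical helper, named here only for the example.)
    public static List<string> Parse(string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            // (?n) turns on explicit capture, so only the named groups exist.
            results.Add(m.Groups["item"].Value + " " + m.Groups["value"].Value);
        }
        return results;
    }
}
```

Called on the LTG/JOB/CPT sample string, Parse should return the three entries listed in the original question. By my reading, the empty zzz: item in the placeholder example is not reported by this pattern, because no comma separates it from the www: tag that follows.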
 

Alan Pretre

Brian Davis said:
Give this one a try:

(?n)((?<item>[A-Za-z]+):(?<value>[A-Za-z]+=.*?)?(?=((,[A-Za-z]+:)|$)))+

For your input, it gives 3 matches, each with "item"
and "value" groups for what comes before and after the
colon.

Brian, *very* impressive. It works beautifully. I changed the last term to
(?=(,[A-Za-z]+:)|$))+
since it looked like there were extraneous parentheses. You gave me much to
study. Thanks again.

-- Alan
 

Slava Frid

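Something like this:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string constant = "LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP";
        // 'code' is the three-character tag; 'value' repeats over each key=value pair.
        // Note: match.Value keeps the trailing comma when another item follows.
        Regex reg = new Regex(@"(?'code'\w\w\w):(?'value'[a-zA-Z0-9\-&/]+=[a-zA-Z0-9\-&/]+,?)*");
        MatchCollection coll = reg.Matches(constant);
        int i = 0;
        foreach (Match match in coll)
        {
            Console.WriteLine(i++ + ". " + match.Groups["code"] + " -- " + match.Value);
        }
    }
}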
 

Dino Chiesa [MSFT]

How about?
(\w+):([^:]+)?,(\w+):([^:]+)?,(\w+):([^:]+)?

Go to http://www.organicbit.com/regex/fog0000000019.html and get the regex
tool, it's handy for building these things.

The tool helps when you are coding the regex, but it is cumbersome when you
want to verify the correctness of the regex and match, across a large set of
input. For this you would be better off with a unit test app, where you
store an array of (input,output) pairs. Then run the regex on each input
and compare it to the expected output. (Example below)

-Dino


//
// emailValidation.cs
//
// uses a regexp to validate emails.
// This test program uses xml serialization to get the test input,
// including the regexp string and the various emails to test.
//
// references:
// http://homepage.stts.edu/~agushen/script/emailvalidation.html
//
// Fri, 15 Aug 2003 11:28
//

using Ionic.Test.EmailValidation;

namespace Ionic.Test.EmailValidation {

/// <remarks>
/// Represents all the input for the test, including the regex to test,
/// and an array of test cases.
/// </remarks>
[System.Xml.Serialization.XmlRootAttribute("Email.Validation.Input",
Namespace="", IsNullable=false)]
public class TestInput {

/// <remarks/>

[System.Xml.Serialization.XmlElementAttribute(Form=System.Xml.Schema.XmlSchemaForm.Unqualified)]
public string Regexp;

/// <remarks/>

[System.Xml.Serialization.XmlArrayAttribute(Form=System.Xml.Schema.XmlSchemaForm.Unqualified)]
[System.Xml.Serialization.XmlArrayItemAttribute("Case",
Form=System.Xml.Schema.XmlSchemaForm.Unqualified, IsNullable=false)]
public TestCase[] TestList;
}


/// <remarks>
/// This is the type that stores a single test case.
/// We need a bunch of these to verify that the regex works as
/// expected. Each test case has an input and an output. In our
/// case, the input is a string, and the output is a bool value,
/// which indicates whether the Regex should match or not.
/// Other tests will have different input and output.
/// </remarks>
public class TestCase {

/// <remarks/>

[System.Xml.Serialization.XmlElementAttribute(Form=System.Xml.Schema.XmlSchemaForm.Unqualified)]
public string Input;

/// <remarks/>

[System.Xml.Serialization.XmlElementAttribute(Form=System.Xml.Schema.XmlSchemaForm.Unqualified)]
public bool ExpectedOutput;
}


/// <remarks>
/// This is the test app. The main routine de-serializes from
/// an XML file, then runs the tests, comparing the expected
/// (or desired) output with the actual result.
/// </remarks>
public class Tester {

public static void Main() {
string InputPath= "EmailValidationInput.xml";

System.IO.FileStream fs = new System.IO.FileStream(InputPath,
System.IO.FileMode.Open);
System.Xml.Serialization.XmlSerializer s= new
System.Xml.Serialization.XmlSerializer(typeof(TestInput));
TestInput Input= (TestInput) s.Deserialize(fs);
fs.Close();

System.Text.RegularExpressions.Regex regex= new
System.Text.RegularExpressions.Regex (Input.Regexp);

foreach (TestCase tc in Input.TestList) {
System.Console.WriteLine(tc.Input +"\n " + tc.ExpectedOutput + " \\ " +
regex.IsMatch(tc.Input));
}
}
}
}


This is input data. Store this in the XML file that is de-serialized for
this test.

<Email.Validation.Input>
<TestList>
<!-- ================================================================== -->
<!-- =================== True test cases ============================== -->
<!-- ================================================================== -->

<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>true</ExpectedOutput>
</Case>

<!-- ================================================================== -->
<!-- =================== False test cases ============================= -->
<!-- ================================================================== -->

<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected].</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected].</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected].</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>elmo@cloud9</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>9Lives.club.org</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>@club.org</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>
<Case>
<Input>[email protected]</Input>
<ExpectedOutput>false</ExpectedOutput>
</Case>

</TestList>
<Regexp>^(\w([\.\-\w]*\w)?)@(\w([\.\-\w]*\w)*\.\w([\.\-\w]*\w)?)$</Regexp>
</Email.Validation.Input>
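
If the XML file feels heavy for a quick check, the same (input, output) idea can be sketched with an in-code table. This is only an illustration under a few assumptions: the class name InlineRegexTable and the method CountFailures are made up, the three false cases are taken from the data above, and "elmo@cloud9.org" is an invented positive case that is not part of the original data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineRegexTable
{
    // The pattern from the <Regexp> element above.
    public const string EmailPattern =
        @"^(\w([\.\-\w]*\w)?)@(\w([\.\-\w]*\w)*\.\w([\.\-\w]*\w)?)$";

    // (input, expected) pairs. The false cases come from the XML above;
    // "elmo@cloud9.org" is a made-up positive case added for illustration.
    public static readonly (string Input, bool Expected)[] Cases =
    {
        ("elmo@cloud9.org", true),
        ("elmo@cloud9", false),
        ("9Lives.club.org", false),
        ("@club.org", false),
    };

    // Runs every case and returns how many disagree with the expected result.
    public static int CountFailures()
    {
        var regex = new Regex(EmailPattern);
        int failures = 0;
        foreach (var (input, expected) in Cases)
        {
            bool actual = regex.IsMatch(input);
            if (actual != expected)
            {
                Console.WriteLine("FAIL: " + input + " expected " + expected + ", got " + actual);
                failures++;
            }
        }
        return failures;
    }
}
```

Keeping the table next to the pattern makes it cheap to add a case whenever the regex changes; the XML approach above is the better fit once the list grows or needs to be shared.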
 

Alan Pretre

Dino Chiesa said:
How about?
(\w+):([^:]+)?,(\w+):([^:]+)?,(\w+):([^:]+)?

Dino,

Your regex fails (no match) with a simple test, CMD:pARM=X, and I didn't
have much luck with others I tried. For example, my OP had this example,

LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP

Your regex gives this result:
1 matches.
Match 1 has 7 groups.
Group 1 = "LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP"
Group 2 = "LTG"
Group 3 = "LTG=2-41-53-57"
Group 4 = "JOB"
Group 5 = "JN=113&&116&125&&127"
Group 6 = "CPT"
Group 7 = "CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP"

But I was looking for something more along the lines of (Group 2 & 3 in each
match are the desired values):
3 matches.
Match 1 has 3 groups.
Group 1 = "LTG:LTG=2-41-53-57"
Group 2 = "LTG"
Group 3 = "LTG=2-41-53-57"
Match 2 has 3 groups.
Group 1 = "JOB:JN=113&&116&125&&127"
Group 2 = "JOB"
Group 3 = "JN=113&&116&125&&127"
Match 3 has 3 groups.
Group 1 = "CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP"
Group 2 = "CPT"
Group 3 = "CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP"

But thanks for your advice. I will study what you supplied to try to
understand it as well. Thanks!

-- Alan
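
The single-item failure reported above follows from the shape of the pattern: it spells out three tag:value units joined by two literal commas, so any input with fewer than three items cannot match. A minimal sketch that checks this, with the class and method names (FixedCountCheck, Matches) invented for the example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FixedCountCheck
{
    // Dino's pattern: three "(\w+):([^:]+)?" units joined by two literal commas.
    public const string ThreeItemPattern =
        @"(\w+):([^:]+)?,(\w+):([^:]+)?,(\w+):([^:]+)?";

    // True when the pattern finds at least one match in the input.
    public static bool Matches(string input)
    {
        return Regex.IsMatch(input, ThreeItemPattern);
    }
}
```

Matches("CMD:pARM=X") is false because the input has no commas at all, while the three-item LTG/JOB/CPT string matches once as a whole, which is the single seven-group match shown above.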
 

Eric Gunnerson [MS]

Try the following:

Regex regex = new Regex(@"
(                # overall repetition
  (?<Item>       # capture to Item
    (?<Tag>.*?)  # any character, zero or more times, non-greedy
    :            # literal :
    .*?          # any character, zero or more times, non-greedy
  )              # end of capture
  ,?             # optional "","". This eats the comma between the Items
  (?=            # zero-width positive lookahead. This must match at this spot
    (\w+:        # one or more word characters, followed by a literal :
    |            # or
    $            # end of string
    )
  )
)+               # one or more times",
RegexOptions.ExplicitCapture |
RegexOptions.Compiled |
RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);

The key to this is the zero-width lookahead. It ensures that the part after
the match is either <xxx>:, or the end of the string, without eating any of
the characters. As you've probably found, without this there's no way to
know whether you should include a comma or break on it.
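
To see the lookahead on its own, here is a minimal sketch that uses it with Regex.Split instead of a single match. That is a different technique from the pattern above, shown only because it isolates the "comma followed by a tag" test. The names LookaheadSplit and SplitItems are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadSplit
{
    // Split only on commas that are immediately followed by "word:".
    // The lookahead (?=\w+:) checks for the tag without consuming it,
    // so the tag stays attached to the item that follows.
    public static string[] SplitItems(string input)
    {
        return Regex.Split(input, @",(?=\w+:)");
    }
}
```

On the LTG/JOB/CPT sample this yields three strings, and the comma before TRATYP= is left alone because TRATYP is followed by = rather than a colon. It does not separate zzz: from www: in the placeholder example, since no comma sits between them.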



Here's the output I get from my regex workbench:

Matching:
LTG:LTG=2-41-53-57,JOB:JN=113&&116&125&&127,CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP
Item => LTG:LTG=2-41-53-57
Item => JOB:JN=113&&116&125&&127
Item => CPT:CODE=09789,TRATYP=AMBINC-7-AMBINC/CPTGRP-0-CPTGRP
Tag => LTG
Tag => JOB
Tag => CPT
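
The workbench output lists all Items and then all Tags because the pattern produces a single match whose named groups each hold several captures. A minimal sketch of reading them through Group.Captures follows; the class and method names (CaptureReader, CapturesOf) are assumptions made for the example, and the pattern is the one above written on one line without comments.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureReader
{
    // The pattern from above, collapsed onto one line (comments removed).
    private static readonly Regex ItemRegex = new Regex(
        @"((?<Item>(?<Tag>.*?):.*?),?(?=(\w+:|$)))+",
        RegexOptions.ExplicitCapture | RegexOptions.Singleline);

    // Returns every capture recorded for the named group, in input order.
    public static List<string> CapturesOf(string input, string groupName)
    {
        var values = new List<string>();
        Match match = ItemRegex.Match(input);
        if (match.Success)
        {
            // Each pass through the outer (...)+ adds one capture per group.
            foreach (Capture c in match.Groups[groupName].Captures)
            {
                values.Add(c.Value);
            }
        }
        return values;
    }
}
```

CapturesOf(input, "Item") and CapturesOf(input, "Tag") line up by index, so the pair at position i gives one tag and its full item. Reading Groups["Item"].Value alone would return only the last capture.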

--
Eric Gunnerson

Visit the C# product team at http://www.csharp.net
Eric's blog is at http://blogs.gotdotnet.com/ericgu/

This posting is provided "AS IS" with no warranties, and confers no rights.
 
