Regular Expressions Help


Xarky

Hi,

I have this single line of text and need to extract data from it:

* 15 FETCH (FLAGS (\Seen) BODYSTRUCTURE ((("TEXT" "PLAIN" ("CHARSET"
"ISO-8859-1" "FORMAT" "flowed") NIL NIL "7BIT" 44 1 NIL NIL NIL
NIL)("TEXT" "HTML" ("CHARSET" "ISO-8859-1") NIL NIL "7BIT" 330 10 NIL
NIL NIL NIL) "ALTERNATIVE" ("BOUNDARY"
"------------040101030402070003040902") NIL NIL NIL)("IMAGE" "JPEG"
("NAME" "Blue hills.jpg") NIL NIL "BASE64" 39084 NIL ("INLINE"
("FILENAME" "Blue hills.jpg")) NIL NIL)("IMAGE" "JPEG" ("NAME"
"Sunset.jpg") NIL NIL "BASE64" 97556 NIL ("INLINE" ("FILENAME"
"Sunset.jpg")) NIL NIL)("IMAGE" "JPEG" ("NAME" "Water lilies.jpg") NIL
NIL "BASE64" 114830 NIL ("INLINE" ("FILENAME" "Water lilies.jpg")) NIL
NIL)("IMAGE" "JPEG" ("NAME" "Winter.jpg") NIL NIL "BASE64" 144632 NIL
("INLINE" ("FILENAME" "Winter.jpg")) NIL NIL) "MIXED" ("BOUNDARY"
"------------090206040706060704050905") NIL NIL NIL))

This line can be divided into similar parts:

* 15 FETCH (FLAGS (\Seen) BODYSTRUCTURE
(
(
("TEXT" "PLAIN" ("CHARSET" "ISO-8859-1" "FORMAT" "flowed") NIL NIL
"7BIT" 44 1 NIL NIL NIL
NIL)
("TEXT" "HTML" ("CHARSET" "ISO-8859-1") NIL NIL
"7BIT" 330 10 NIL NIL NIL
NIL) "ALTERNATIVE" ("BOUNDARY" "------------040101030402070003040902")
NIL NIL NIL)
("IMAGE" "JPEG" ("NAME" "Blue hills.jpg") NIL NIL
"BASE64" 39084 NIL ("INLINE" ("FILENAME" "Blue hills.jpg")) NIL
NIL)
("IMAGE" "JPEG" ("NAME" "Sunset.jpg") NIL NIL
"BASE64" 97556 NIL ("INLINE" ("FILENAME" "Sunset.jpg")) NIL
NIL)
("IMAGE" "JPEG" ("NAME" "Water lilies.jpg") NIL NIL
"BASE64" 114830 NIL ("INLINE" ("FILENAME" "Water lilies.jpg")) NIL
NIL)
("IMAGE" "JPEG" ("NAME" "Winter.jpg") NIL NIL
"BASE64" 144632 NIL ("INLINE" ("FILENAME" "Winter.jpg")) NIL
NIL) "MIXED" ("BOUNDARY" "------------090206040706060704050905") NIL
NIL NIL))

Now, from each 'line', I need to extract some data (arguments):
argument numbers 1, 2, 3 and 6.

In the data given, these are 1=TEXT, 2=PLAIN, 3=("CHARSET" "ISO-8859-1"
"FORMAT" "flowed") and 6=7BIT for the first part,
1=IMAGE, 2=JPEG, 3=("NAME" "Blue hills.jpg") and 6=BASE64 for the first image,
etc.

Is this possible to do with regex?
Can someone help me out?
Thanks in advance.
 
I would do two string splits.

string[] lines = YourLine.Split(new[] { ")(" }, StringSplitOptions.None);

Then, for each element in lines, do another split, .Split(' '), because to
me each line looks like a row of fields delimited by a space.

You now have a two-dimensional array that you can easily extract the values
from. If you want to make it cleaner, delete all the remaining (s and )s
from the array values.
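
To make the two-split idea concrete, here is a minimal C# sketch. The names SplitSketch and SplitFields are made up for illustration. The Split overloads that take StringSplitOptions are used because they accept a string delimiter on older .NET Framework versions too. One thing it shows on data like the sample: stripping the parentheses flattens nested lists, and a quoted value containing a space, such as "Blue hills.jpg", ends up as two fields, so field positions can differ from part to part.

```csharp
using System;

public static class SplitSketch
{
    // Splits the line into parts on ")(" and each part into space-delimited fields.
    public static string[][] SplitFields(string line)
    {
        string[] parts = line.Split(new[] { ")(" }, StringSplitOptions.None);
        var rows = new string[parts.Length][];
        for (int i = 0; i < parts.Length; i++)
        {
            // Delete the remaining parentheses, then split the part on spaces.
            string cleaned = parts[i].Replace("(", "").Replace(")", "");
            rows[i] = cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        }
        return rows;
    }
}
```

Calling SplitSketch.SplitFields(line) returns one string[] per part. Because of the caveat above, it is worth checking each row's length before indexing into it.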
 
ps -- someone who knows LISP could probably do it much easier ;)



 
Xarky said:
Hi,

I have this single line of text, and need to extract data from it
Is this possible to do with regex?

Yes.

No doubt a regex guru could write a single monster expression which
would pull all of the values out in a useful way.

I'm not a regex guru, so I'll tell you how I'd approach it. You seem to
have repeating groups, each group containing a set of data you want to
extract. As a first step, I'd work out a regex which matches each of
those. i.e.

("TEXT" "PLAIN" ("CHARSET" "ISO-8859-1" "FORMAT" "flowed") NIL NIL
"7BIT" 44 1 NIL NIL NIL
NIL)

("TEXT" "HTML" ("CHARSET" "ISO-8859-1") NIL NIL
"7BIT" 330 10 NIL NIL NIL
NIL) "ALTERNATIVE" ("BOUNDARY" "------------040101030402070003040902")
NIL NIL NIL)


("IMAGE" "JPEG" ("NAME" "Blue hills.jpg") NIL NIL
"BASE64" 39084 NIL ("INLINE" ("FILENAME" "Blue hills.jpg")) NIL NIL)


("IMAGE" "JPEG" ("NAME" "Sunset.jpg") NIL NIL
"BASE64" 97556 NIL ("INLINE" ("FILENAME" "Sunset.jpg")) NIL
NIL)


("IMAGE" "JPEG" ("NAME" "Water lilies.jpg") NIL NIL
"BASE64" 114830 NIL ("INLINE" ("FILENAME" "Water lilies.jpg")) NIL NIL)


("IMAGE" "JPEG" ("NAME" "Winter.jpg") NIL NIL
"BASE64" 144632 NIL ("INLINE" ("FILENAME" "Winter.jpg")) NIL
NIL) "MIXED" ("BOUNDARY" "------------090206040706060704050905") NIL NIL
NIL))

I would then iterate through those matches and use another regex to
parse the values out of each of them.

The difficult bit is working out how to match the start and end of each
group, which needs more knowledge of what can occur in the file. The
obvious thing that occurs to me is to match ("TEXT" or ("IMAGE", followed
by any sequence of characters that does not contain ("TEXT" or ("IMAGE".

So, and this is air code, you want something along the lines of

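using System.Collections;
using System.Text.RegularExpressions;

class Groups
{
    // One Group object per repeating group found in the source text.
    ArrayList groupsCollection = new ArrayList();

    // Placeholder: must match one whole group (described below).
    const string GROUP_PATTERN = "";

    public Groups(string sourceText)
    {
        foreach (Match m in Regex.Matches(sourceText, GROUP_PATTERN))
        {
            Group group = new Group(m.Value);
            this.groupsCollection.Add(group);
        }
    }
}

// Note: this Group is our own class, not System.Text.RegularExpressions.Group.
class Group
{
    string arg1;
    string arg2;
    string arg3;
    string arg6;

    // Placeholder: must match one argument within a group (described below).
    const string PARAM_PATTERN = "";

    public Group(string groupText)
    {
        MatchCollection matches = Regex.Matches(groupText, PARAM_PATTERN);
        // A Match is not a string, so take .Value; indexes are zero-based.
        this.arg1 = matches[0].Value;
        this.arg2 = matches[1].Value;
        this.arg3 = matches[2].Value;
        this.arg6 = matches[5].Value;
    }
}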

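GROUP_PATTERN needs to be something along the lines of "x[^x]*" where x
matches the start of a group. Since x is longer than one character here,
the "not x" part needs a negative lookahead rather than a character class.
PARAM_PATTERN needs to match a quoted string, the string "NIL", or a whole
parenthesised list such as ("CHARSET" "ISO-8859-1"), so that argument 3
counts as a single match.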

That's how I'd do it, anyway.
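
For anyone who wants to try the two-pattern approach described above, here is one possible way to fill in the patterns, as a sketch rather than a definitive parser. The names BodyPart, BodyStructureSketch, GroupPattern and TokenPattern are made up for illustration. It assumes that groups only ever start with ("TEXT" or ("IMAGE", that quoted strings contain no escaped quotes, and it relies on .NET balancing groups to treat a nested parenthesised list as one argument.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical holder for the four requested arguments of one part.
public sealed class BodyPart
{
    public string Arg1, Arg2, Arg3, Arg6;
}

public static class BodyStructureSketch
{
    // "(" + "TEXT"/"IMAGE", then any characters up to the next such start.
    const string GroupPattern =
        @"\((?<body>""(?:TEXT|IMAGE)""(?:(?!\(""(?:TEXT|IMAGE)"").)*)";

    // One argument: a quoted string, a balanced (...) list, or a bare atom.
    const string TokenPattern =
        @"""[^""]*""" +
        @"|\((?>""[^""]*""|[^()""]+|\((?<d>)|\)(?<-d>))*(?(d)(?!))\)" +
        @"|[^\s()]+";

    public static List<BodyPart> Parse(string line)
    {
        var parts = new List<BodyPart>();
        // Singleline lets "." cross line breaks in a wrapped response.
        foreach (Match g in Regex.Matches(line, GroupPattern, RegexOptions.Singleline))
        {
            MatchCollection t = Regex.Matches(g.Groups["body"].Value, TokenPattern);
            if (t.Count < 6) continue; // too few arguments, skip this group

            parts.Add(new BodyPart
            {
                // Trim only strips surrounding quotes; lists keep their parentheses.
                Arg1 = t[0].Value.Trim('"'),
                Arg2 = t[1].Value.Trim('"'),
                Arg3 = t[2].Value.Trim('"'),
                Arg6 = t[5].Value.Trim('"')
            });
        }
        return parts;
    }
}
```

BodyStructureSketch.Parse(line) returns one BodyPart per group it finds, in the order they appear in the line. Real IMAP responses can contain other part types and escaped characters, so the patterns would need widening for anything beyond data shaped like the sample.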
 