RegEx - allowing for line-breaks

Olaf Rabbachin

Hi everyone,

I'm assembling a RegEx that I can't seem to finish.

I'm receiving a list of records. The fields within are not consistently
enclosed in quotes - some are, some aren't.
The last field of every record starts after a <CR><LF> and can itself
contain one or more rows of text (the rows being separated by <LF>).
Each record is terminated by another <CR><LF>:

+CMGL: 2,"REC READ","+49123123123",,"09/08/26,11:42:39+08"<CR><LF>
Text-Row1<LF>
Text-Row2<CR><LF>
+CMGL: 3,"REC READ","+49234234234",,"09/08/27,12:23:35+08"<CR><LF>
Text-Row1<LF>
Text-Row2<LF>
Text-Row3<CR><LF>
+CMGL: 4,"REC UNREAD","+49234234234",,"09/08/26,15:27:49+08"<CR><LF>
Text-Row1<LF>
Text-Row2<CR><LF>
+CMGL: 1,"REC UNREAD","+49123123123",,"09/08/26,15:27:49+08"<CR><LF>
Text-Row1<LF>
Text-Row2<CR><LF>

Using Expresso I've built the following RegEx (line breaks added for better
readability):

(?:\+CMGL\:\s)
(?<Index>\d)
(?:\,\")
(?<Status>REC\sUNREAD|REC\sREAD|STO\sUNSENT|STO\sSENT|ALL)
(?:\"\,\")
(?<PhoneNumber>[+1234567890]*)
(?:\"\,\,\")
(?<Timestamp>
(?<Year>\d{1,2})\/(?<Month>\d{1,2})\/(?<Day>\d{1,2})\,
(?<Hour>\d{1,2})\:(?<Minute>\d{1,2})\:(?<Second>\d{1,2})
\+(?<Timezone>\d{1,2})
)
(?:\"\r\n)
(?<Text>.+)

This works for all groups except the last one (<Text>) - that group only
captures the first row of text.
Any suggestions as to how I can get the <Text> group to contain
everything found in the string up to either the start of the next record
(beginning with "+CMGL: " in the above sample) or the end of the
record list?

Cheers,
Olaf
 
Jesse Houwing

Hello Olaf,

The . matches every character except a newline (\n). There is a special
RegexOptions value (Singleline) that makes . match newlines as well. The
danger, though, is that with this option set the .+ in your Text group will
consume everything that follows, including all the remaining records.
It would be better to say that it should match everything until it finds
a <CR>, which seems to fit your format description. You can do this by
replacing the .+ with [^\r]+ in your last group.
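
As a rough illustration of the three variants, here is a minimal sketch in Python, whose re module stands in for the .NET engine (an assumption on my part; Python writes named groups as (?P<Name>...) and calls the option re.DOTALL instead of Singleline). The shortened header pattern and the variable names are made up for the example and are not part of the original RegEx.

```python
import re

# Two records in the format described above: header line, CR LF,
# text rows separated by LF, and a closing CR LF.
sample = (
    '+CMGL: 2,"REC READ","+49123123123",,"09/08/26,11:42:39+08"\r\n'
    'Text-Row1\nText-Row2\r\n'
    '+CMGL: 3,"REC READ","+49234234234",,"09/08/27,12:23:35+08"\r\n'
    'Text-Row1\nText-Row2\nText-Row3\r\n'
)

# Hypothetical shortened header, only meant to keep the example small.
header = r'\+CMGL: \d+,[^\r\n]*\r\n'

# 1) Plain dot: stops at the first LF, so only the first row is captured.
dot_default = re.search(header + r'(?P<Text>.+)', sample)

# 2) Dot with DOTALL (the counterpart of Singleline): runs to the end of
#    the input and swallows the following record as well.
dot_singleline = re.search(header + r'(?P<Text>.+)', sample, re.DOTALL)

# 3) Negated class: takes everything, LF included, up to the next CR.
not_cr = re.search(header + r'(?P<Text>[^\r]+)', sample)

print(repr(dot_default.group('Text')))
print(repr(dot_singleline.group('Text')))
print(repr(not_cr.group('Text')))
```

Only the third variant yields exactly the text rows of the first record. This relies on the message text never containing a <CR> of its own.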

Kind Regards,

Jesse Houwing
 
Olaf Rabbachin

Hi Jesse,

Jesse said:
The . matches everything, except NewLine. There is a special RegexOption
(SingleLine), to make . match newlines as well. The danger is though, that
the Regex will consume everything after Text as soon as it finds that .*
when you specify this option. This leads to very funny situations sometimes.
It might be better to say that it should match everything, until you find
a <CR>, which seems to match your format description. You can do this by
replacing the . with [^\r]+ in your last group.

Thanks - I tried that. However, the result doesn't change; still only the
first row is captured ... :-(

Cheers,
Olaf
 
Jesse Houwing

Hello Olaf,

Can you send me the regex and a sample file?
 
Olaf Rabbachin

Hi,

Jesse said:
Can you send me the regex and a sample file?

No need to anymore. :)
It turned out that I simply hadn't realized I can only test that RegEx in
code: I have no way of telling Expresso to keep the \n characters in the
sample text instead of converting them to \r when running tests.
In code, your recommendation actually does the trick.
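
For anyone wanting to reproduce the test in code, where the \r and \n characters are under explicit control, here is a minimal sketch. It uses Python's re module rather than .NET (an assumption made purely for illustration), so the named groups are written (?P<Name>...) and the pattern is laid out with re.VERBOSE. The names RECORD, records and parsed are invented for the example, and \d+ is used for the index on the assumption that it can exceed 9.

```python
import re

# The full pattern with [^\r]+ as the last group. re.VERBOSE ignores the
# layout whitespace, which is why spaces are written as \s.
RECORD = re.compile(r'''
    \+CMGL:\s
    (?P<Index>\d+)                 # assumption: indexes may have several digits
    ,"
    (?P<Status>REC\sUNREAD|REC\sREAD|STO\sUNSENT|STO\sSENT|ALL)
    ","
    (?P<PhoneNumber>[+0-9]*)
    ",,"
    (?P<Timestamp>
        (?P<Year>\d{1,2})/(?P<Month>\d{1,2})/(?P<Day>\d{1,2}),
        (?P<Hour>\d{1,2}):(?P<Minute>\d{1,2}):(?P<Second>\d{1,2})
        \+(?P<Timezone>\d{1,2})
    )
    "\r\n
    (?P<Text>[^\r]+)               # everything up to the closing CR
''', re.VERBOSE)

# The four sample records, with CR LF and LF spelled out explicitly.
records = (
    '+CMGL: 2,"REC READ","+49123123123",,"09/08/26,11:42:39+08"\r\n'
    'Text-Row1\nText-Row2\r\n'
    '+CMGL: 3,"REC READ","+49234234234",,"09/08/27,12:23:35+08"\r\n'
    'Text-Row1\nText-Row2\nText-Row3\r\n'
    '+CMGL: 4,"REC UNREAD","+49234234234",,"09/08/26,15:27:49+08"\r\n'
    'Text-Row1\nText-Row2\r\n'
    '+CMGL: 1,"REC UNREAD","+49123123123",,"09/08/26,15:27:49+08"\r\n'
    'Text-Row1\nText-Row2\r\n'
)

# One tuple per record: index, status, and the text split into its rows.
parsed = [
    (m.group('Index'), m.group('Status'), m.group('Text').split('\n'))
    for m in RECORD.finditer(records)
]

for index, status, rows in parsed:
    print(index, status, rows)
```

Each match ends just before the closing <CR><LF>, so the search for the next record resumes there and all four records are found.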

Thank you very much!

Cheers,
Olaf
 
Jesse Houwing

Hello Olaf,

You're welcome :)
 