Regex: Finding strings in a source file

  • Thread starter: Bob

Bob

I need to create a Regex to extract all strings (including quotations) from
a C# or C++ source file. After being unsuccessful myself, I found this
sample on the internet:

@"@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).'"

I am inputting the entire source file string and using it with
RegexOptions.Singleline. This works OK, unless the string ends with a
back-slash. For example: "This is a test\\". Can anybody see how to fix
this sample so that back-slashes are considered?

Thanks
 
Bob said:
I need to create a Regex to extract all strings (including
quotations) from a C# or C++ source file.

Well, it's not possible. You'd need a complete C# parser to extract
strings in a foolproof way. Here's one of the simpler examples that
can't be distinguished from a real string using a regex alone:

// "I am a comment but I look like a string"

Eq.
 
Nope, it very well is possible...

Regex regex = new Regex(
    @"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')",
    RegexOptions.Singleline);

// Group 1 is the comment alternative, group 2 the string literal alternative.
String result = regex.Replace(input, new MatchEvaluator(MatchEval));

public String MatchEval(Match match)
{
    if (match.Groups[1].Success) { } // comment
    if (match.Groups[2].Success) { } // string literal
    ...
}

Back to my original question, if anybody knows why the regex isn't correctly
watching for back-slashes followed by a quotation, any input is appreciated.
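
A note on why the back-slash check misfires, with a sketch of one common alternative. `(?!\\).` is a lookahead, so it tests the character that `.` is about to consume. The pattern therefore asks for "a character that is not a back-slash, followed by a quote", which rejects the genuine closing quote after `\\`. The sketch below instead consumes every `\x` pair as a unit. It assumes C-style escapes only (no `@` strings, no comments), and the names `EscapeDemo`, `Old` and `Fixed` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // The string part of the pattern from the question, unchanged.
    public static readonly Regex Old = new Regex(
        @"@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).'",
        RegexOptions.Singleline);

    // Escape-aware sketch: an opening quote, then any run of escape pairs (\x)
    // or characters that are neither the quote nor a back-slash, then the
    // closing quote.
    public static readonly Regex Fixed = new Regex(
        @"""(?:\\.|[^""\\])*""|'(?:\\.|[^'\\])*'",
        RegexOptions.Singleline);

    public static void Main()
    {
        // Text being scanned:  s = "This is a test\\" + "next";
        string source = @"s = ""This is a test\\"" + ""next"";";

        foreach (Match m in Old.Matches(source))
            Console.WriteLine("old:   " + m.Value);
        foreach (Match m in Fixed.Matches(source))
            Console.WriteLine("fixed: " + m.Value);

        // Prints:
        // old:   "This is a test\\" + "
        // fixed: "This is a test\\"
        // fixed: "next"
    }
}
```

The old pattern runs past the real closing quote and stops at the next quote that follows a non-back-slash character, which is why the failure only shows up on strings ending in a back-slash.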
 
Without examples of desired behaviour, here's what I came up with, using backreferences:

Regex regex = new Regex(@"(([""']).+\<2>)", RegexOptions.None);

Sample input:
"This is a test\\"
This is also a test
Here's another "test"
'Now for another\\'
Using 'single quotes'
// Here 's a comment.
// And a "quoted" one.

Sample output:
Matching: "This is a test\\"
1 =»"This is a test\\"«=
2 =»"«=

Matching: This is also a test
No Match

Matching: Here's another "test"
1 =»"test"«=
2 =»"«=

Matching: 'Now for another\\'
1 =»'Now for another\\'«=
2 =»'«=

Matching: Using 'single quotes'
1 =»'single quotes'«=
2 =»'«=

Matching: // Here 's a comment.
No Match

Matching: // And a "quoted" one.
1 =»"quoted"«=
2 =»"«=

You'd want the group 1....
 
Your Regex works very well Ken, thanks. Can you explain what exactly the
<2> does? It looks like a grouping construct, but it isn't in the format of
(?<group>.*?). I couldn't find any reference to this at
http://msdn.microsoft.com/library/en-us/cpgenref/html/cpconregularexpressionslanguageelements.asp.

Thanks again.


 
Also, I prepended your pattern to test for comments first:

@"(/\*.*?\*/|//.*?(?=\r|\n))|(([""']).+\<2>)"

After prefixing the commenting part, comments are picked up but your literal
string part is completely ignored. For example:

Nothing is matched (should have gotten the "C"):
String str = "extern \"C\"\r\n";

The whole line is correctly matched for a comment:
String str = "//extern \"C\"\r\n";

Strangely enough the old pattern did work in this aspect:
@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?""""|@?"".*?(?!\\).""|''|'.*?(?!\\).')"

Unfortunately it fails to correctly end literal strings ending with a
back-slash (unlike yours, which does work).

Thanks

 
I figured out what it is... the \<2> is a back reference to group number 2,
and my prefixing the comment group shifted the numbering, so it no longer
pointed at the quote group. I went ahead and named it and now I have this:

@"(/\*.*?\*/|//.*?(?=\r|\n))|(@?(?<comment>[""']).+?\<comment>)"

The only problem now is that it doesn't take into account escaped quotations
and double quotations when using the @ string literal prefix in C# files.
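
To illustrate the numbering shift described above, here is a minimal sketch. Once a comment group is placed in front, the quote character is captured by group 3, and group 2 becomes the enclosing string group, which has not captured anything yet when the back reference is evaluated, so in .NET that branch can never match. A named group avoids the problem. The names `BackrefDemo`, `Numbered`, `Named` and the group name `q` are chosen for illustration, the comment part is simplified to `//` comments only, and `\k<q>` is the documented .NET form of a named back reference.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackrefDemo
{
    // Numbered: group 1 = comment, group 2 = whole string, group 3 = quote.
    // \2 now points at the enclosing string group instead of the quote.
    public static readonly Regex Numbered = new Regex(
        @"(//[^\r\n]*)|(([""']).+?\2)");

    // Named: \k<q> keeps pointing at the quote character no matter how many
    // groups are added in front of it.
    public static readonly Regex Named = new Regex(
        @"(//[^\r\n]*)|((?<q>[""']).+?\k<q>)");

    public static void Main()
    {
        // Text being scanned:  x = "abc"; // note
        string source = "x = \"abc\"; // note";

        // The string branch never matches, so the first hit is the comment.
        Console.WriteLine("numbered: " + Numbered.Match(source).Value);
        // The string branch works again and matches first.
        Console.WriteLine("named:    " + Named.Match(source).Value);

        // Prints:
        // numbered: // note
        // named:    "abc"
    }
}
```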


 
So here is what I've gotten so far:

@"(/\*.*?\*/|//.*?(?=\r|\n))|((?:@(?<c1>[""'])(?:""""|.)*?\<c1>)|(?:(?<c2>[""'])(?:\\.|.)*?\<c2>))"

I am using non-capturing groups for a specific reason not seen here, just
ignore those.

Anyway, the first part is for comments, the second part is for literal
strings starting with @, the third part is for literal strings with
potential escape characters. Everything seems to work now except for
supporting double-quotation marks in literal strings starting with @. For
example, this input sample:

String str = "before @\"a\"\"b\"\"c\" after \"ok\"";

Captures:
@"a"
"b"
"c"
"ok"

When it should capture:
@"a""b""c"
"ok"

I tested making the capture non-lazy, but then it captures:
@"a""b""c" after "ok"

It is like it is going to the second option instead of doing the first, even
though the first is available:
(?:""""|.)*?

If you know why this might be, please share...
 
Bob said:
It is like it is going to the second option instead of doing the first, even
though the first is available:
(?:""""|.)*?

I'm out of ideas on this one. Probably something to do with not considering groups/patterns available for backreferencing if they're in an OR statement.
What I'd do is try to simplify the processing -- break your parsing into more than one pass to make the resulting strings more digestible. You might even find that regex isn't the best option -- string functions could wind up being more appropriate.
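
One likely explanation for the behaviour discussed above, offered as a sketch rather than a definitive answer: the alternation is tried in the written order, but `*?` is lazy, so after every repetition the engine first checks whether the closing quote can match. In `@"a""b""c"`, once `a` is consumed the next character is a quote, so the match ends at `@"a"` before the `""` alternative is tried at that position. A greedy loop over "a doubled quote, or any single non-quote character" always consumes a doubled quote as a pair. The names `VerbatimDemo` and `Literal` are hypothetical, and the sketch handles double-quoted literals only (no char literals, no comments).

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimDemo
{
    // First branch: verbatim literal, where "" is an escaped quote.
    // Second branch: regular literal, where \x is an escape pair.
    public static readonly Regex Literal = new Regex(
        @"@""(?:""""|[^""])*""|""(?:\\.|[^""\\])*""");

    public static void Main()
    {
        // Text being scanned:  before @"a""b""c" after "ok"
        string source = "before @\"a\"\"b\"\"c\" after \"ok\"";

        foreach (Match m in Literal.Matches(source))
            Console.WriteLine(m.Value);

        // Prints:
        // @"a""b""c"
        // "ok"
    }
}
```

Because the character class excludes the quote, the loop can only stop at a quote that is not part of a doubled pair, so no back reference or lazy quantifier is needed.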
 
