Can Regex do this?

  • Thread starter Juan Gabriel Del Cid

Juan Gabriel Del Cid

I have the sudden need to split a text that may have any of the
following tokens:

- Words with quotes or double quotes.
- Words with no quotes at all.
- Numbers with and without decimal points,
no commas allowed, but may contain parentheses
which I would like to keep apart to drop later.

They may be separated by commas, spaces or semicolons.

OK, let's suppose you leave out grouping functionality (i.e. quotes and double
quotes are not grouping operators). In that case, this regular
expression will split the string for you:

// Split on any run of whitespace, commas or semicolons.
Regex splitter = new Regex("[\\s,;]+");
string[] splitItems = splitter.Split(myString);
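
To see what that split does with quoted text, here is a small sketch. The `SplitDemo` class and `SplitOnSeparators` name are made up for illustration, and the sample string is the one from the original question:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Splits on runs of whitespace, commas and semicolons.
    // Quotes and parentheses get no special treatment here.
    public static string[] SplitOnSeparators(string input)
    {
        return new Regex("[\\s,;]+").Split(input);
    }
}
```

Calling `SplitDemo.SplitOnSeparators("ProductNumber; (1234.44), \"The Name\", 'This, that and more'")` gives eight pieces: `ProductNumber`, `(1234.44)`, `"The`, `Name"`, `'This`, `that`, `and`, `more'`. The separators are gone, the parentheses stay glued to the number, and both quoted phrases are broken apart.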

This is without grouping. When you throw in grouping functionality, you need
a parser. Regular expressions won't cut it. You need to think of:

- unbalanced grouping chars (e.g. an unclosed quote)
- escaping grouping chars (e.g. if you want the name O'Neal in a word)
- double quotes inside single quotes and vice versa

For this to work you need to write a parser. It's really not that hard, but
it's not as easy as a regex, :).
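
A minimal sketch of such a hand-written scanner, to show how the three cases above can be handled. Everything here is an assumption for illustration, not a finished design: the `Tokenizer` class name, the choice of backslash as the escape character, treating a quote in the middle of an unquoted word (as in O'Neal) as a plain character, and throwing `FormatException` on an unclosed quote.

```csharp
using System;
using System.Collections.Generic;

public static class Tokenizer
{
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (char.IsWhiteSpace(c))
            {
                i++; // whitespace only separates tokens
            }
            else if (c == ',' || c == ';' || c == '(' || c == ')')
            {
                tokens.Add(c.ToString()); // punctuation becomes its own token
                i++;
            }
            else if (c == '"' || c == '\'')
            {
                tokens.Add(ReadQuoted(input, ref i));
            }
            else
            {
                tokens.Add(ReadBare(input, ref i));
            }
        }
        return tokens;
    }

    // Reads a quoted token starting at input[i]; the quotes are kept in the token.
    private static string ReadQuoted(string input, ref int i)
    {
        char quote = input[i];
        int start = i;
        i++; // skip the opening quote
        while (i < input.Length)
        {
            if (input[i] == '\\' && i + 1 < input.Length)
            {
                i += 2; // assumed escape: a backslash protects the next character
            }
            else if (input[i] == quote)
            {
                i++; // consume the closing quote
                return input.Substring(start, i - start);
            }
            else
            {
                i++; // ordinary character, including the other kind of quote
            }
        }
        throw new FormatException("Unclosed quote starting at position " + start);
    }

    // Reads an unquoted word or number up to whitespace or punctuation.
    private static string ReadBare(string input, ref int i)
    {
        int start = i;
        while (i < input.Length && !char.IsWhiteSpace(input[i]) && ",;()".IndexOf(input[i]) < 0)
        {
            i++;
        }
        return input.Substring(start, i - start);
    }
}
```

With these assumptions, `Tokenizer.Tokenize("ProductNumber; (1234.44), \"The Name\", 'This, that and more'")` returns nine tokens: `ProductNumber`, `;`, `(`, `1234.44`, `)`, `,`, `"The Name"`, `,` and `'This, that and more'`.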

Hope this helps,
-JG
 

Craig Kenisston

I have the sudden need to split a text that may have any of the
following tokens:

Words with quotes or double quotes.
Words with no quotes at all.
Numbers with and without decimal points, no commas allowed, but may
contain parentheses which I would like to keep apart to drop later.

They may be separated by commas, spaces or semicolons.

So my string may have this content:
ProductNumber; (1234.44), "The Name", 'This, that and more'

And I would like to get these strings:

ProductNumber
;
(
1234.44
)
,
"The Name"
,
'This, that and more'

I have spent two days on this with no luck. I've tried several
combinations, and each time I either get one feature working or break
another.

Thanks in advance for your help.
 

Peter Koen

(e-mail address removed) (Craig Kenisston) wrote in

[...]
So my string may have this content:
ProductNumber; (1234.44), "The Name", 'This, that and more'

And I would like to get these strings:

ProductNumber
;
(
1234.44
)
,
"The Name"
,
'This, that and more'


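What about

// First split on the quote and parenthesis characters...
string[] result = s.Split(new char[] { '\"', '\'', '(', ')' });

foreach (string str in result)
{
    // ...then split each piece on the separators.
    string[] nextresult = str.Split(new char[] { ';', ',' });
    // do some further processing
}

That would be much faster than a regex.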
 

100

Hi Craig,
Grammars fall into different classes. Regular expressions describe the
smallest one.
IMHO your case cannot be described with regular expressions. Rather, you
should use a context-free grammar.
So my suggestion is to stop wasting your time. You can still use regex for
tokens like strings, numbers and identifiers, but the overall structure of
the input has to be described by some context-free grammar.
There are several techniques for parsing text that is described by
context-free grammars. All of them but one need tools for generating the
parser. Recently I read a post in this newsgroup where someone was looking
for *lex* and *yacc* for C#. Those are the kind of tools you need. However,
they are hard to use if you don't have experience with compiler design. They
have their own programming language and generate code (a class) for parsing
the input text.
What I would suggest is to use the method that can be programmed by hand.
It is called "recursive descent parsing" and is pretty straightforward.
Unfortunately I can't point you to good sources, but hopefully someone on
the group can do so.
There is the *Interpreter design pattern* covering this method. In the GoF
book about design patterns you can find an example of using it, so you can
start there.
You can check out this article for more details as well.
http://www.cuj.com/documents/s=8230/cuj0301niebler/
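
To give an idea of what a hand-written recursive descent parser looks like, here is a small sketch with one method per grammar rule. The grammar in the comment, the `ListParser` class and all method names are assumptions made up for this example, not something taken from the article above. It also assumes the input has already been cut into tokens (words, numbers, quoted strings and single punctuation characters).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical grammar, assumed for illustration only:
//   list  := item ((',' | ';') item)*
//   item  := '(' item ')' | value
//   value := any token that is not punctuation
public sealed class ListParser
{
    private readonly IList<string> tokens;
    private int pos;

    public ListParser(IList<string> tokens)
    {
        this.tokens = tokens;
    }

    // list := item ((',' | ';') item)*
    public List<string> ParseList()
    {
        var values = new List<string>();
        values.Add(ParseItem());
        while (pos < tokens.Count && (tokens[pos] == "," || tokens[pos] == ";"))
        {
            pos++; // consume the separator
            values.Add(ParseItem());
        }
        if (pos < tokens.Count)
        {
            throw new FormatException("Unexpected token: " + tokens[pos]);
        }
        return values;
    }

    // item := '(' item ')' | value
    private string ParseItem()
    {
        if (Peek() == "(")
        {
            pos++; // consume '('
            string inner = ParseItem(); // the rule calls itself: this is the "recursive" part
            Expect(")");
            return inner; // the parentheses are dropped here
        }
        return ParseValue();
    }

    // value := any token that is not punctuation
    private string ParseValue()
    {
        string token = Peek();
        if (token == null || token == "," || token == ";" || token == "(" || token == ")")
        {
            throw new FormatException("Expected a value but found: " + (token ?? "end of input"));
        }
        pos++;
        return token;
    }

    // Returns the current token without consuming it, or null at end of input.
    private string Peek()
    {
        return pos < tokens.Count ? tokens[pos] : null;
    }

    private void Expect(string expected)
    {
        if (Peek() != expected)
        {
            throw new FormatException("Expected " + expected);
        }
        pos++;
    }
}
```

Given the tokens `ProductNumber`, `;`, `(`, `1234.44`, `)`, `,`, `"The Name"`, `ParseList()` returns `ProductNumber`, `1234.44` and `"The Name"`. A missing `)` or a doubled separator ends in a `FormatException`.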

HTH
B\rgds
100
 
