Regex: handling unescaped quotes while parsing a CSV file

Mikus Sleiners

I'm using the following regular expression in C#:

Regex regex = new Regex("(\"(?:[^\"]|\"\")*\");?");
return regex.Match(input);

I use it to split up input like this:

"LV27HABA0551004004467";"20";"01.07.2009";"RIMI Kr.Valdemara 112
Riga";"PIRKUMS 4205734302189310 29.06.2009 2.48 LVL (502500) RIMI
Kr.Valdemara 112 Riga
";"2.48";"LVL";"D";"2009070100101359";"CTX";"";"";

As you can see, this is a line of semicolon-separated values. The regex
is very fast and does the job, except that I have one major problem.

Sometimes the input comes like this:

"some value"; "some value"; "supermarket "The best" rozmarine street
222";"some value"; "some value";

The problem is that my current regex cannot handle quotes inside
quotes. In this example, "The best" is enclosed in quotes, which
causes the regex engine to split the line incorrectly and leads to
an exception later in my code.

I need a way to correct the expression so that a quote ends a value only
if it is followed by ";". That way it would work correctly even with
quotes inside a value.

Is there an easy way to fix this?

P.S. Sorry for posting this in the C# group. I didn't find any regex-related
place, so I picked this one.
 
Göran Andersson

If the input comes like that, it's invalid.

The quotation marks in the value should be escaped, so the data should
look like this:

"some value"; "some value"; "supermarket ""The best"" rozmarine street
222";"some value"; "some value";

Are you really expected to parse data that is not correct? If so, what
can you assume about the data?

Can you assume that a quotation mark in the value that is not escaped
correctly isn't followed by a semicolon? In that case you could use a
negative look-ahead so that you match anything that is not a quotation
mark, correctly escaped quotation marks, or an unescaped quotation mark
that is not followed by a semicolon:

"(\"(?:[^\"]|\"\"|\"(?!;))*\");?"
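
To see the difference between the two patterns, here is a minimal sketch that runs both over a shortened version of the problem line. The class name `LenientCsvDemo`, the helper `SplitFields`, and the sample line are made up for illustration; only the two patterns come from this thread. It assumes, as discussed above, that an unescaped quote inside a value is never directly followed by a semicolon.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LenientCsvDemo
{
    // Pattern from the question: inside a value, only doubled quotes are accepted.
    public static readonly Regex Strict =
        new Regex("(\"(?:[^\"]|\"\")*\");?");

    // Pattern from this reply: a lone quote is also accepted,
    // unless a semicolon follows it (negative look-ahead).
    public static readonly Regex Lenient =
        new Regex("(\"(?:[^\"]|\"\"|\"(?!;))*\");?");

    // Hypothetical helper: returns the values with the enclosing quotes
    // removed and doubled quotes collapsed to a single quote.
    public static List<string> SplitFields(Regex pattern, string line)
    {
        var fields = new List<string>();
        foreach (Match m in pattern.Matches(line))
        {
            string quoted = m.Groups[1].Value;  // still has both enclosing quotes
            string inner = quoted.Substring(1, quoted.Length - 2);
            fields.Add(inner.Replace("\"\"", "\""));
        }
        return fields;
    }

    public static void Main()
    {
        // Made-up sample: three values, the second one has unescaped quotes.
        string line =
            "\"some value\"; \"supermarket \"The best\" rozmarine street 222\";\"\";";

        Console.WriteLine(SplitFields(Strict, line).Count);   // prints 4 (value split in two)
        Console.WriteLine(SplitFields(Lenient, line).Count);  // prints 3

        foreach (string field in SplitFields(Lenient, line))
        {
            Console.WriteLine("[" + field + "]");
        }
    }
}
```

The order of the alternatives matters: the doubled quote is tried before the lone quote, so correctly escaped data is still read the same way as before. A value that really contains a quote directly followed by a semicolon would still be split in the wrong place, so this only works as long as that assumption holds for the data.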
 
ambidexterous

Thanks, Göran.

Your suggestion works perfectly.

P.S. Yes, I am aware that it is not a valid CSV format, but I am afraid I
cannot change the format of the export files produced by the internet
banking I'm using :)