Regular expressions to parse a CSV line

M

MW

Dear All

Does anyone have a regular expression to parse a comma delimited line with
some fields optionally having string delimiters (text qualifiers)
I am currently testing with this regular expression and it works in almost
all my test cases. I found this on the internet in a C# solution.

,(?=([^\"]*"[^"]*")*(?![^"]*"))

However in some of my test cases it fails and I am having difficulty
interpreting it.

The VB.NET function I used is

Public Function parseCSVLine(ByVal sInputString As String) As ArrayList
Dim r As New Regex(",(?=([^\" & Chr(34) & "]*" & Chr(34) & "[^" &
Chr(34) & "]*" & Chr(34) & ")*(?![^" & Chr(34) & "]*" & Chr(34) & "))")
Dim iStart As Integer, m As Match
Dim oArrayList As New ArrayList()

For Each m In r.Matches(sInputString)
oArrayList.Add(sInputString.Substring(iStart, m.Index - iStart))
iStart = m.Index + 1
Next
oArrayList.Add(sInputString.Substring(iStart, sInputString.Length -
iStart))

Return oArrayList
End Function

My test cases are as follows:


#
CSV
Value 1
Value 2
Value 3
Value 4
Results

1
a,b,c
a
b
c

P

2
"a",b,c
a
b
c

P

3
'a',b,c
'a'
b
c

P

4
a , b , c
a
b
c

P

5
aa,bb;cc
aa
bb;cc


P

6





P

7
a
a



P

8
,b,

b


P

9
,,c


c

P

10
,,




P

11
"",b

b


P

12
" ",b
[SPACE]
b


P

13
"a,b"
a,b



P

14
"a,b",c
a,b
c


P

15
" a , b ", c
a , b
c


P

16
a b,c
a b
c


P

17
a"b,c
a"b
C


P

18
"a""b",c
a"b
c


P

19
a""b,c
a""b
c


P

20
a,b",c
a
b"
c

O

21
a,b"",c
a
b""
c

P

22
a,"B: ""Hi, I'm B""",c
a
B: "Hi, I'm B"
c

P

23
a,"b,c
a
"b
c

O

24
a,bc"d,e
a
bc"d
e


O

25
a,bc"d",e
a
bc"d"
e

O

26
a,"bc"d,e
a
"bc"d
e

O



Many thanks,
Wazir
 
M

MW

Apologies for the formatting of test cases, I didnt realise I was posting in
Plain Text.

Here they are again, I hope it is more readable this time.

Some of them like case 20 doesnt work with the regular expression

# CSV Value 1 Value 2 Value 3 Value 4
1 a,b,c a b c
2 "a",b,c a b c
3 'a',b,c 'a' b c
4 a , b , c a b c
5 aa,bb;cc aa bb;cc
6
7 a a
8 ,b, b
9 ,,c c
10 ,,
11 "",b b
12 " ",b [SPACE] b
13 "a,b" a,b
14 "a,b",c a,b c
15 " a , b ", c a , b c
16 a b,c a b c
17 a"b,c a"b c
18 "a""b",c a"b c
19 a""b,c a""b c
20 a,b",c a b" c
21 a,b"",c a b"" c
22 a,"B: ""Hi, I'm B""",c a B: "Hi, I'm B"
c
23 a,"b,c a "b c
24 a,bc"d,e a bc"d e
25 a,bc"d",e a bc"d" e
26 a,"bc"d,e a "bc"d e
 
G

Guest

You can use the OleDbCommand class to read a csv file

Tu-Thac
www.ongtech.co

----- MW wrote: ----

Dear Al

Does anyone have a regular expression to parse a comma delimited line wit
some fields optionally having string delimiters (text qualifiers
I am currently testing with this regular expression and it works in almos
all my test cases. I found this on the internet in a C# solution

,(?=([^\"]*"[^"]*")*(?![^"]*")

However in some of my test cases it fails and I am having difficult
interpreting it

The VB.NET function I used i

Public Function parseCSVLine(ByVal sInputString As String) As ArrayLis
Dim r As New Regex(",(?=([^\" & Chr(34) & "]*" & Chr(34) & "[^"
Chr(34) & "]*" & Chr(34) & ")*(?![^" & Chr(34) & "]*" & Chr(34) & "))"
Dim iStart As Integer, m As Matc
Dim oArrayList As New ArrayList(

For Each m In r.Matches(sInputString
oArrayList.Add(sInputString.Substring(iStart, m.Index - iStart)
iStart = m.Index +
Nex
oArrayList.Add(sInputString.Substring(iStart, sInputString.Length
iStart)

Return oArrayLis
End Functio

My test cases are as follows



CS
Value
Value
Value
Value
Result


a,b,







"a",b,







'a',b,
'a






a , b ,







aa,bb;c
a
bb;c





















,b







,,






1
,






1
"",






1
" ",
[SPACE





1
"a,b
a,





1
"a,b",
a,





1
" a , b ",
a ,





1
a b,
a





1
a"b,
a"





1
"a""b",
a"





1
a""b,
a""





2
a,b",

b




2
a,b"",

b"




2
a,"B: ""Hi, I'm B""",

B: "Hi, I'm B




2
a,"b,

"




2
a,bc"d,

bc"





2
a,bc"d",

bc"d




2
a,"bc"d,

"bc"






Many thanks
Wazi
 
S

Sergei Gnezdov

Dear All

Does anyone have a regular expression to parse a comma delimited line with
some fields optionally having string delimiters (text qualifiers)

I just replied to a similar post.

String object has Split call:

"a,b,c,de,fg,hegk".Split(',')
 
M

Mihai N.

Does anyone have a regular expression to parse a comma delimited line with
I just replied to a similar post.

String object has Split call:

"a,b,c,de,fg,hegk".Split(',')

This is a problem that is known for a long time to the Perl programmers.
It cannot be solved with regular expressions only.
The simple string above is works, but what about the next one, also valid
CSV:

Item 1,"string with comma, not to split",1234,"now, does it work?",done

This should be split into
Item 1
string with comma, not to split
1234
now, does it work?
done
 
M

MW

Sergei,

Split will only work for a comma delimited file and will not work with text
qualifiers (string delimiters)
 
M

MW

Uri,

The split itself will be incorrect if there is a comma in a column with a
string delimiter

For example:
123, "Mathew, Dale", 24, "london"

Thanks,
Wazir
 
S

Sergei Gnezdov

so when you iterate Split's result, replace (^"|$") with nothing...

This is true. You could parse CVS line in like this:
1) replace qualifiers with some unique id in a hashtable
2) run split
3) replace ids with with their content

Since the description above is no longer that simple, I'd go for parse
it yourself. It is not as hard as it sounds. It will involve one loop,
and a couple variables: start of the substring and a qualifier state
(on/off). When qualifier is on you search for the end of the qualifier.

I think that regular expressions are too hard to work with in this case,
but that's me.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top