Parsing Comma-delimited records?

  • Thread starter Thread starter Guest
  • Start date Start date
G

Guest

I'm parsing a comma-delimited record but I want it to do something if some of
the string is between "". How can I do this? With the Excel import it does
it correct. I'm using String.Split().
Basically, this is what I want to do: Use string.Split() on the whole string
UNLESS the string is in between double-quotes. The part of the string
in-between the "" will be ignored by String.Split

Thanks.
 
VMI said:
I'm parsing a comma-delimited record but I want it to do something if some of
the string is between "". How can I do this? With the Excel import it does
it correct. I'm using String.Split().
Basically, this is what I want to do: Use string.Split() on the whole string
UNLESS the string is in between double-quotes. The part of the string
in-between the "" will be ignored by String.Split

I would abandon String.Split and split the string myself, coping with
quotation marks and escape characters as necessary. It's a very simple
little loop to do it, easy to write and maintain. Sometimes the
Framework-supplied methods aren't the best way to solve a problem,
IMHO.

Unless someone else has a clever trick...?
 
VMI,

You will have to cycle through the string character by character and
take note of the state, or use regular expressions in this case.

Hope this helps.
 
Here is a method on my blog.
http://staceyw.spaces.msn.com/blog/cns!F4A38E96E598161E!352.entry?

--
William Stacey [MVP]

| I'm parsing a comma-delimited record but I want it to do something if some
of
| the string is between "". How can I do this? With the Excel import it
does
| it correct. I'm using String.Split().
| Basically, this is what I want to do: Use string.Split() on the whole
string
| UNLESS the string is in between double-quotes. The part of the string
| in-between the "" will be ignored by String.Split
|
| Thanks.
 
VMI said:
I'm parsing a comma-delimited record but I want it to do something if some
of
the string is between "". How can I do this? With the Excel import it
does it correct. I'm using String.Split().
Basically, this is what I want to do: Use string.Split() on the whole
string UNLESS the string is in between double-quotes. The part of the
string in-between the "" will be ignored by String.Split

Thanks.

Hi,

I wrote a wee lexer that did this, somewhere earlier in the newsgroup. I've
found it, and here it is:

///
public string[] GetStringParts ( string inputString )
{
List<string> retVal = new List<string>();
string currentPart = string.Empty;
int lexerState = 0;

for ( int i = 0; i < inputString.Length; i++ )
{
switch ( lexerState )
{
case 0:
if ( inputString == ',' )
{
retVal.Add( currentPart.Trim() );
currentPart = string.Empty;
}
else if ( inputString == '"' )
lexerState = 1;
else
currentPart += inputString;
break;

case 1:
if ( inputString == '"' )
lexerState = 0;
else
currentPart += inputString;
break;
}
}

return retVal.ToArray();
}
///
 
Tom said:
I wrote a wee lexer that did this, somewhere earlier in the newsgroup. I've
found it, and here it is:

Tom / VMI:

If you don't mind a bit of reworking, this version of Tom's lexer uses
StringBuilder rather than String, and so will not create so many
intermediate strings on the heap that later have to be
garbage-collected. For large volumes of data it will make a significant
difference:

public string[] GetStringParts ( string inputString )
{
List<string> retVal = new List<string>();
StringBuilder currentPart = new StringBuilder();
bool withinQuotes = false;

for ( int i = 0; i < inputString.Length; i++ )
{
char c = inputString;
if (withinQuotes)
{
if (c == '"')
{
withinQuotes = false;
}
else
{
currentPart.Append(c);
}
}
else
{
if (c == ',')
{
retVal.Add( currentPart.ToString().Trim() );
currentPart.Length = 0;
}
else if ( c == '"' )
{
withinQuotes = true;
}
else
{
currentPart.Append(c);
}
}
}
retVal.Add( currentPart.ToString().Trim() );

return retVal.ToArray();
}

This version also fixes a bug whereby the last item in the
comma-separated list wasn't being added to the return array.

Anyway, this is the kind of simple solution I was talking about: easy
to read, easy to maintain. It's also easy to add refinements like
backslash-escapes for quote characters, etc.
 
Bruce said:
Tom said:
I wrote a wee lexer that did this, somewhere earlier in the newsgroup.
I've found it, and here it is:

Tom / VMI:

If you don't mind a bit of reworking, this version of Tom's lexer uses
StringBuilder rather than String, and so will not create so many
intermediate strings on the heap that later have to be
garbage-collected. For large volumes of data it will make a significant
difference:

public string[] GetStringParts ( string inputString )
{
List<string> retVal = new List<string>();
StringBuilder currentPart = new StringBuilder();
bool withinQuotes = false;

for ( int i = 0; i < inputString.Length; i++ )
{
char c = inputString;
if (withinQuotes)
{
if (c == '"')
{
withinQuotes = false;
}
else
{
currentPart.Append(c);
}
}
else
{
if (c == ',')
{
retVal.Add( currentPart.ToString().Trim() );
currentPart.Length = 0;
}
else if ( c == '"' )
{
withinQuotes = true;
}
else
{
currentPart.Append(c);
}
}
}
retVal.Add( currentPart.ToString().Trim() );

return retVal.ToArray();
}

This version also fixes a bug whereby the last item in the
comma-separated list wasn't being added to the return array.

Anyway, this is the kind of simple solution I was talking about: easy
to read, easy to maintain. It's also easy to add refinements like
backslash-escapes for quote characters, etc.


Hi Bruce,

Thanks for spotting the bug, I actually spotted it myself, when I posted
this lexer a while ago, and posted the fix, but I neglected to include it
when I copy-and-pasted it here.
 
Hello Tom,

I have tried to use the ODBC CSV driver before to some success, but I have
used this to much success:

http://www.heikniemi.net/jhlib/


Thanks,
Shawn Wildermuth
Speaker, Author and C# MVP
http://adoguy.com
Bruce said:
Tom said:
VMI wrote:

I'm parsing a comma-delimited record but I want it to do something
if
some of
the string is between "". How can I do this? With the Excel import
it
does it correct. I'm using String.Split().
Basically, this is what I want to do: Use string.Split() on the
whole
string UNLESS the string is in between double-quotes. The part of
the
string in-between the "" will be ignored by String.Split
I wrote a wee lexer that did this, somewhere earlier in the
newsgroup. I've found it, and here it is:
Tom / VMI:

If you don't mind a bit of reworking, this version of Tom's lexer
uses StringBuilder rather than String, and so will not create so many
intermediate strings on the heap that later have to be
garbage-collected. For large volumes of data it will make a
significant difference:

public string[] GetStringParts ( string inputString )
{
List<string> retVal = new List<string>();
StringBuilder currentPart = new StringBuilder();
bool withinQuotes = false;
for ( int i = 0; i < inputString.Length; i++ )
{
char c = inputString;
if (withinQuotes)
{
if (c == '"')
{
withinQuotes = false;
}
else
{
currentPart.Append(c);
}
}
else
{
if (c == ',')
{
retVal.Add( currentPart.ToString().Trim() );
currentPart.Length = 0;
}
else if ( c == '"' )
{
withinQuotes = true;
}
else
{
currentPart.Append(c);
}
}
}
retVal.Add( currentPart.ToString().Trim() );
return retVal.ToArray();
}
This version also fixes a bug whereby the last item in the
comma-separated list wasn't being added to the return array.

Anyway, this is the kind of simple solution I was talking about: easy
to read, easy to maintain. It's also easy to add refinements like
backslash-escapes for quote characters, etc.

Hi Bruce,

Thanks for spotting the bug, I actually spotted it myself, when I
posted this lexer a while ago, and posted the fix, but I neglected to
include it when I copy-and-pasted it here.

Google first, ask later.
 
* William Stacey [MVP] wrote, On 17-7-2006 22:21:

I've got a regex lying around that does basically the same:

^((((?<value>[^|""\r\n]*)|""(?<value>([^""]|\\"")*)"")[|])+((?<value>[^|""\r\n]*)|""(?<value>([^""]|\\"")*)""))\r?$

It captures one line, separated by | (quoted strings may contain
newlines, unquoted ones may not).

Use this as follows:

static void Main(string[] args)
{
ParsePipeSeparatedLine(input);
}

static string input =
@"1|2
3|4";

public static List<string[]> ParsePipeSeparatedLine(string input)
{
Regex rx = new
Regex(@"^((((?<value>[^|""\r\n]*)|""(?<value>([^""]|\\"")*)"")[|])+((?<value>[^|""\r\n]*)|""(?<value>([^""]|\\"")*)""))\r?$",
RegexOptions.Compiled | RegexOptions.Multiline);

Match m = rx.Match(input);

List<string[]> lines = new List<string[]>();
while (m.Success)
{
int elemCount = m.Groups["value"].Captures.Count;
string[] values = new string[elemCount];

for (int i = 0; i < elemCount; i++)
{
values = m.Groups["value"].Captures.Value;
}
lines.Add(values);
m = m.NextMatch();
}
return lines;
}
 
Gotta love regex. I'll have to look at that harder. Looks interesting.

--
William Stacey [MVP]

|* William Stacey [MVP] wrote, On 17-7-2006 22:21:
| > Here is a method on my blog.
| > http://staceyw.spaces.msn.com/blog/cns!F4A38E96E598161E!352.entry?
| >
|
| I've got a regex lying around that does basically the same:
|
|
^((((?<value>[^|""\r\n]*)|""(?<value>([^""]|\\"")*)"")[|])+((?<value>[^|""\r\n]*)|""(?<value>([^""]|\\"")*)""))\r?$
|
| It captures one line, separated by | (quoted strings may contain
| newlines, unquoted ones may not).
|
| Use this as follows:
|
| static void Main(string[] args)
| {
| ParsePipeSeparatedLine(input);
| }
|
| static string input =
| @"1|2
| 3|4";
|
| public static List<string[]> ParsePipeSeparatedLine(string input)
| {
| Regex rx = new
|
Regex(@"^((((?<value>[^|""\r\n]*)|""(?<value>([^""]|\\"")*)"")[|])+((?<value>[^|""\r\n]*)|""(?<value>([^""]|\\"")*)""))\r?$",
| RegexOptions.Compiled | RegexOptions.Multiline);
|
| Match m = rx.Match(input);
|
| List<string[]> lines = new List<string[]>();
| while (m.Success)
| {
| int elemCount = m.Groups["value"].Captures.Count;
| string[] values = new string[elemCount];
|
| for (int i = 0; i < elemCount; i++)
| {
| values = m.Groups["value"].Captures.Value;
| }
| lines.Add(values);
| m = m.NextMatch();
| }
| return lines;
| }
 
Shawn said:
Hello Tom,

I have tried to use the ODBC CSV driver before to some success, but I have
used this to much success:

http://www.heikniemi.net/jhlib/


Thanks,
Shawn Wildermuth
Speaker, Author and C# MVP
http://adoguy.com
Bruce said:
Tom Spink wrote:

VMI wrote:

I'm parsing a comma-delimited record but I want it to do something
if
some of
the string is between "". How can I do this? With the Excel import
it
does it correct. I'm using String.Split().
Basically, this is what I want to do: Use string.Split() on the
whole
string UNLESS the string is in between double-quotes. The part of
the
string in-between the "" will be ignored by String.Split
I wrote a wee lexer that did this, somewhere earlier in the
newsgroup. I've found it, and here it is:

Tom / VMI:

If you don't mind a bit of reworking, this version of Tom's lexer
uses StringBuilder rather than String, and so will not create so many
intermediate strings on the heap that later have to be
garbage-collected. For large volumes of data it will make a
significant difference:

public string[] GetStringParts ( string inputString )
{
List<string> retVal = new List<string>();
StringBuilder currentPart = new StringBuilder();
bool withinQuotes = false;
for ( int i = 0; i < inputString.Length; i++ )
{
char c = inputString;
if (withinQuotes)
{
if (c == '"')
{
withinQuotes = false;
}
else
{
currentPart.Append(c);
}
}
else
{
if (c == ',')
{
retVal.Add( currentPart.ToString().Trim() );
currentPart.Length = 0;
}
else if ( c == '"' )
{
withinQuotes = true;
}
else
{
currentPart.Append(c);
}
}
}
retVal.Add( currentPart.ToString().Trim() );
return retVal.ToArray();
}
This version also fixes a bug whereby the last item in the
comma-separated list wasn't being added to the return array.

Anyway, this is the kind of simple solution I was talking about: easy
to read, easy to maintain. It's also easy to add refinements like
backslash-escapes for quote characters, etc.

Hi Bruce,

Thanks for spotting the bug, I actually spotted it myself, when I
posted this lexer a while ago, and posted the fix, but I neglected to
include it when I copy-and-pasted it here.

Google first, ask later.


Hi Shawn,

Interesting, but I've provided a small code-sample that you can embed
directly into your own code, that doesn't require you to download and
reference libraries.
 
you could try this:

public static ArrayList Read(string file, bool hasHeader)
{
//returns an array list of rows which in turn contains an
arraylist of fields
ArrayList csvData = new ArrayList();
string path = Path.GetDirectoryName(file);
OleDbConnection con = new OleDbConnection();
if (hasHeader)
con.ConnectionString =
@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + path + @";Extended
Properties=""Text;HDR=Yes;FMT=Delimited""";
else
con.ConnectionString =
@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + path + @";Extended
Properties=""Text;HDR=No;FMT=Delimited""";
con.Open();
OleDbCommand cmd = new OleDbCommand("SELECT * FROM " +
Path.GetFileName(file), con);
OleDbDataReader rs = cmd.ExecuteReader();
while (rs.Read())
{
ArrayList csvRow = new ArrayList();
// loop through every field
for (int i = 0; i < rs.FieldCount; i++)
csvRow.Add(rs.GetValue(i));
csvData.Add(csvRow);
}
con.Close();
return csvData;
}

It's what I use. It's simple but can be extended to use your own
custom collection types, etc...
 
* William Stacey [MVP] wrote, On 18-7-2006 4:03:
Gotta love regex. I'll have to look at that harder. Looks interesting.

I give lectures in Regex, it's become a way of life ;)

Jesse
 
Hello VMI,

I assume the CSV file lives on disk somewhere. If so, then how about using
OleDb to talk to it..

Connection String:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\TxtFilesFolder\;Extended
Properties='text;HDR=Yes;FMT=Delimited'

CommandText:
SELECT * FROM file.csv

-Boo
 
Your sick! :-)

--
William Stacey [MVP]

|* William Stacey [MVP] wrote, On 18-7-2006 4:03:
| > Gotta love regex. I'll have to look at that harder. Looks interesting.
| >
|
| I give lectures in Regex, it's become a way of life ;)
|
| Jesse
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Back
Top