Suggestions to reduce memory use when splitting a string

K

klineb

Good Day,

I have written and utility to convert our DOS COBOL data files to a SQL
Server database. Part of the process requires parsing each line into a
sql statement and validting the data to keep the integrity of the
database. We are parsing roughl 81 files and range in size 1 kb to 65
MB files (Average of 400,000 lines in the larger files).

I have written this utility with VB.NET 2003 and when I parse all of
the files I run out of memory. The following functions seems to be the
main source of my leak. Any help optimizing this code is appreciated.

Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder,
_
ByRef SplitString() As String) As Boolean
'7-25-2005
'BJK
'
'--removed the use of Char replaced with: CurrentLine.Chars(i)
'
'---------------------------------------
Dim i As Integer
Dim CountDelimiter As Boolean
Dim Total As Integer
Dim lbResult As Boolean
Dim Section As New StringBuilder
Dim liLen As Integer
Dim liCommaPos As Integer
Dim liDQuotePos As Integer

Try
'We want to count the delimiter unless it is within the
text qualifier
CountDelimiter = True
Total = 0
liLen = CurrentLine.Length - 1
For i = 0 To liLen
Select Case CurrentLine.Chars(i)
Case gsDoubleQoute
If CountDelimiter Then
CountDelimiter = False
Else
CountDelimiter = True
End If
Case gsComma
If CountDelimiter Then
' Add current section to collection
SplitString(Total) = Section.ToString.Trim
Section = Nothing
Section = New StringBuilder
Total = Total + 1
Else
Section.Append(CurrentLine.Chars(i))
End If
Case Else
Section.Append(CurrentLine.Chars(i))
End Select
Next
' Get the last field - as most files will not have an
ending delimiter
If CountDelimiter Then
' Add current section to collection
SplitString(Total) = Section.ToString
End If
lbResult = True
Catch ex As Exception
ps_LastErrSource = ex.Source
ps_LastErrDesc = ex.ToString
lbResult = False
Dim loSB As New StringBuilder(ps_LastErrDesc)
UpdateLog(loSB)
End Try
Return lbResult
End Function

This function is stored in a class and is called from other function
within this class.
Thanks
Brian
 
T

tommaso.gastaldi

hi Brian,

I would do that using a buffer. You open a StreamWriter and when the
processed text (which you can store in a StringBuilder (sb) ) reaches a
given size, for instance 2MB (or whatever you like, the sb holds more
than 2GB), you can flush it to disk an start again with an empty
buffer. To empty a sb it's sufficient to set the length to 0. When
input ends, you flush to disk what remains in the sb. Remember to close
the StreamWriter.

-t

klineb ha scritto:
 
M

Marina Levit [MVP]

Have you thought about regular expressions to parse the fields out of each
line? Going character by character seems really inefficient. At the very
last, use functions like IndexOf, to find your delimiters, and parse out the
pieces that way.
 
D

David Browne

klineb said:
Good Day,

I have written and utility to convert our DOS COBOL data files to a SQL
Server database. Part of the process requires parsing each line into a
sql statement and validting the data to keep the integrity of the
database. We are parsing roughl 81 files and range in size 1 kb to 65
MB files (Average of 400,000 lines in the larger files).

I have written this utility with VB.NET 2003 and when I parse all of
the files I run out of memory. The following functions seems to be the
main source of my leak. Any help optimizing this code is appreciated.

There are a few tweaks you can apply here to reduce memory utilization (in
particular you can perhaps reuse the StringBuilders). However there's
nothing obvious and terrible here. In short this function should not cause
an out of memory error.

David
 
T

tomb

klineb said:
Good Day,

I have written and utility to convert our DOS COBOL data files to a SQL
Server database. Part of the process requires parsing each line into a
sql statement and validting the data to keep the integrity of the
database. We are parsing roughl 81 files and range in size 1 kb to 65
MB files (Average of 400,000 lines in the larger files).

I have written this utility with VB.NET 2003 and when I parse all of
the files I run out of memory. The following functions seems to be the
main source of my leak. Any help optimizing this code is appreciated.

Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder,
_
ByRef SplitString() As String) As Boolean
'7-25-2005
'BJK
'
'--removed the use of Char replaced with: CurrentLine.Chars(i)
'
'---------------------------------------
Dim i As Integer
Dim CountDelimiter As Boolean
Dim Total As Integer
Dim lbResult As Boolean
Dim Section As New StringBuilder
Dim liLen As Integer
Dim liCommaPos As Integer
Dim liDQuotePos As Integer

Try
'We want to count the delimiter unless it is within the
text qualifier
CountDelimiter = True
Total = 0
liLen = CurrentLine.Length - 1
For i = 0 To liLen
Select Case CurrentLine.Chars(i)
Case gsDoubleQoute
If CountDelimiter Then
CountDelimiter = False
Else
CountDelimiter = True
End If
Case gsComma
If CountDelimiter Then
' Add current section to collection
SplitString(Total) = Section.ToString.Trim
Section = Nothing
Section = New StringBuilder
Total = Total + 1
Else
Section.Append(CurrentLine.Chars(i))
End If
Case Else
Section.Append(CurrentLine.Chars(i))
End Select
Next
' Get the last field - as most files will not have an
ending delimiter
If CountDelimiter Then
' Add current section to collection
SplitString(Total) = Section.ToString
End If
lbResult = True
Catch ex As Exception
ps_LastErrSource = ex.Source
ps_LastErrDesc = ex.ToString
lbResult = False
Dim loSB As New StringBuilder(ps_LastErrDesc)
UpdateLog(loSB)
End Try
Return lbResult
End Function

This function is stored in a class and is called from other function
within this class.
Thanks
Brian
It looks to me like your field separator is always a comma.
So why not just use SplitString = split(CurrentLine,","), then use
Replace on each string in the resultant array if you want to get rid of
the quotes. I think this will be much faster than going char by char.
The parameter SplitString will have to be declared as byRef SplitString
as Array, rather than SplitString() as String.

Tom
 
M

Marina Levit [MVP]

Actually, arrays are reference types. So modifying the contents of the
array, will work just fine as far as filling it. It doesn't need to be
ByRef.
 
K

klineb

Tom,

Some of the fields are comment field that contain comma.

Ex. 1,2,"This is some, sample text",19.90,

Using SplitSting will not handle this.
 
S

Stephany Young

Try this:

Public Function SplitDelimitedLine(ByVal CurrentLine As String) As
String()

Try
Dim _wl As String = String.Empty ' work string
' Create a local copy of CurrentLine
' We don't really need to but I have for the sake of clarity
' The 3 Trim's peel of any extraneous whitespace then any leading
and/or trailing commas and then any extraneous whitespace thet might have
been any leading and/or trailing commas that might have been present
Dim _cl As String = CurrentLine.Trim.Trim(","c).Trim
' Find the first " character
Dim _pos As Integer = _cl.IndexOf(""""c)
' If _pos = -1 then there weren't any
' Loop until there are no more " characters
While _pos > -1
' Append every thing before the first " character to the work string
_wl &= _cl.Substring(0, _pos)

' Remove every thing before the first " character from the local
copy
_cl = _cl.Remove(0, _pos)
' Find the next " character
' Note that we start the find from the 2nd position because we know
that there is a " character in position 1 (index 0)
_pos = _cl.IndexOf(""""c, 1)
If _pos > -1 Then
' If we find one then we append every thing from the first " to
the 2nd " inclusive to the work string, replacing any commas in the
substring with a tilde and remove the same number of characters from the
local copy
_wl &= _cl.Substring(0, _pos + 1).Replace(","c, "~"c)
_cl = _cl.Remove(0, _pos + 1)

' Find the next " character
' Note that we are now back to finding from the beginning of what
is left of the local copy
_pos = _cl.IndexOf(""""c)
End If
End While
' There are no moe " characters so append the remainder of the local
copy to the work string
_wl &= _cl
' Split the work string using the comma as the delimiter
Dim _ss As String() = _wl.Replace("""", String.Empty).Split(","c)
' Work through the elements of the array
For _i As Integer = 0 To _ss.Length - 1
If the element contains any tilde characters then replace them with
commas
If _ss(_i).IndexOf("~"c) > -1 Then _ss(_i) = _ss(_i).Replace("~"c,
","c)
'Trim any whitespaces from the element
_ss(_i) = _ss(_i).Trim
Next
' Return the string array
Return _ss
Catch
' We hit a problem of some description so return Nothing (null)
Return Nothing
End Try

End Function

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As
System.EventArgs) Handles Button1.Click

' Measure the time for 1 million calls to the function

Dim _start As DateTime = DateTime.Now

For _i As Integer = 1 To 1000000
Dim _ss() As String = SplitDelimitedLine("1,2,""This is some, sample
text"",19.90,")
Next

Console.WriteLine(DateTime.Now.Subtract(_start).TotalSeconds)

End Sub

On my machine it takes 3.281166 seconds and I do not see any significant
impact on the memory resources.
 
B

Branco Medeiros

klineb wrote:
I have written this utility with VB.NET 2003 and when I parse all of
the files I run out of memory. The following functions seems to be the
main source of my leak. Any help optimizing this code is appreciated.

Public Function SplitDelimitedLine(ByVal CurrentLine As StringBuilder,
_
ByRef SplitString() As String) As Boolean
<snip, snip, snip>

I don't know if this is of any help, but you really don't need a
StringBuilder to "capture" the String slices. Whenever you find the
delimiter, you just need to know where the slice begins.

Disclaimer: I *didn't* test the code bellow

<AirCode>
Function SplitDelimitedLine( _
CurrentLine As StringBuilder, _
SplitString() As String _
) As Boolean

Dim Text As String = CurrentLine.ToString
Dim Max As Integer = Text.Length - 1
Dim SliceStart As Integer
DIm Total As Integer

For Index As Integer = 0 To Max

Dim IgnoreComma As Boolean

Select Case Text(Index)
Case gsDoubleQuote
IgnoreComma = Not IgnoreComma
Case gsComma
If Not IgnoreComma Then
Dim Count As Integer = Index - SliceStart
SplitString(Total) = Text.Substring(SliceStart, Count).Trim
SliceStart = Index + 1
Total += 1
End If
End Select
Next

If SliceStart <= Max Then
Dim Count As Integer = Max - SliceStart + 1
SplitString(Total) = Text.Substring(SliceStart, Count).Trim
End If

Return True
</AirCode>

Regards,

Branco
 
C

Cor Ligthert [MVP]

Brian,

I cannot imagen that this routine takes much memory. And if it would do,
what so ever.

It is a one time operation. The last thing you would think about is in my
opinion the amount of memory you use. Even if you have not enough than add
it. As I assume that you live in an North Atlantic Country, than one hour
thinking of the problem will cost probably more than 1Gb.

So you can only become in problem as you go over the 800Mb.

Just my thought,

Cor
 
K

klineb

Thank you everybody for your suggestions.

This really is not a one time issue. We still have 60 DOS clients to
convert. All of the clients are running different hardware so buying
memory is not really an option. I will try all of you suggestions and
post my results back.

Does anyone know if running some of the class in a seperate process
would help any?

Thanks again for all of the feedback.

Brian
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top