Splitting a 31 GB text file

Guest

I have a 31 GB file that was exported from a Unix box. I was trying to
bulk insert it into SQL Server, but there seems to be something wrong with
the file. In order to isolate the issue I have tried splitting it into
smaller files. I get an out-of-memory error on the 7th chunk of 10,000,000
lines using the code below. The exe only seems to be consuming about 20 MB
of memory. I have tried several techniques, but all yield the same results.
Any suggestions?

I know the best answer is not to have a 31 GB text file to begin with, but
then I'd have to deal with the Unix guys knocking Windows' shortcomings.

Sub processfile(ByVal ReadFile As String)

    Dim oread As IO.StreamReader
    Dim owrite As IO.StreamWriter
    Dim x As Int32
    Dim readline As String

    oread = IO.File.OpenText(ReadFile)
    owrite = IO.File.CreateText("c:\output\" & _
        System.IO.Path.GetFileName(ReadFile) & "_0")

    While oread.Peek <> -1
        readline = oread.ReadLine()
        owrite.WriteLine(readline)
        x = x + 1

        ' Roll over to a new output file every 10,000,000 lines.
        If x Mod 10000000 = 0 Then
            owrite.Close()
            owrite.Dispose()
            owrite = IO.File.CreateText("c:\output\" & _
                System.IO.Path.GetFileName(ReadFile) & "_" & (x \ 10000000).ToString())
        End If

        ' Flush output and update the progress display every 1,000,000 lines.
        If x Mod 1000000 = 0 Then
            owrite.Flush()
            oread.DiscardBufferedData()
            TextBox1.Text = x.ToString()
            TextBox1.Refresh()
        End If

    End While

    owrite.Close()
    oread.Close()

End Sub
 
Jon Skeet [C# MVP]

web33 said:
I have a 31 GB file that was exported from a Unix box. I was trying to
bulk insert it into SQL Server, but there seems to be something wrong with
the file. In order to isolate the issue I have tried splitting it into
smaller files. I get an out-of-memory error on the 7th chunk of 10,000,000
lines using the code below. The exe only seems to be consuming about 20 MB
of memory. I have tried several techniques, but all yield the same results.
Any suggestions?

Is it possible that there's a "line" at about that point which is
actually vast (bigger than can fit into memory)?

Try writing a loop which *just* reads the lines, and doesn't write them
anywhere, maybe writing out a warning if the line is greater than a few
thousand characters in length.
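
Something along these lines would do it - a rough sketch (the
32,000-character threshold and the 1,000,000-line progress interval are
arbitrary, and ScanFile is just an illustrative name):

Sub ScanFile(ByVal fileName As String)

    ' Read-only pass: find out whether the file contains a pathological "line".
    Dim reader As IO.StreamReader = IO.File.OpenText(fileName)
    Dim lineText As String
    Dim lineNumber As Long = 0

    While reader.Peek <> -1
        lineText = reader.ReadLine()
        lineNumber += 1

        If lineText.Length > 32000 Then
            Console.WriteLine("Line " & lineNumber.ToString() & " is " & _
                lineText.Length.ToString() & " characters long")
        End If

        ' Periodic progress, so if ReadLine itself blows up you still
        ' know roughly where the bad line is.
        If lineNumber Mod 1000000 = 0 Then
            Console.WriteLine(lineNumber.ToString() & " lines read OK")
        End If
    End While

    reader.Close()
    Console.WriteLine("Total lines: " & lineNumber.ToString())

End Sub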

Are you doing all of this within the UI thread, by the way? If so, it's
a bit nasty - and if not, you should be using Control.Invoke to access
the text box.
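
For the text box, a sketch of the usual Invoke pattern (assuming the split
runs on a background thread and this helper lives on the form -
ReportProgress is just an illustrative name):

' Marshals the progress update onto the UI thread if necessary.
Private Sub ReportProgress(ByVal linesDone As Int32)
    If TextBox1.InvokeRequired Then
        TextBox1.Invoke(New Action(Of Int32)(AddressOf ReportProgress), linesDone)
    Else
        TextBox1.Text = linesDone.ToString()
    End If
End Sub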

I'm also very unclear why you're calling DiscardBufferedData on the
StreamReader. That will mean you skip some of the data in the file.
 
Cowboy (Gregory A. Beamer)

I have worked with files about this large on a project and had no
problems. The first step you have to take is to determine where the problem
is. Is a line too long? Are certain fields too long, so they would truncate?
Did the Unix output miss a line terminator (yes, it happens, despite what
the Unix guys might tell you)?

Until you know why the SQL load failed, you will not be able to continue.

I am not against splitting files, as you can set up BULK INSERT to
multi-thread by handling multiple files. Another option is BCP, where you
can specify a number of errors to bypass. If the file stopped adding line
terminators, you will still overload the pipeline and it will fail, however,
so it is still wise to find out what is going wrong.
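
For instance, something like this at the command line (database, table, and
server names are placeholders; -c loads character data, -m sets how many
errors to tolerate before aborting, -T uses a trusted connection):

bcp MyDatabase.dbo.MyTable in "c:\output\bigfile_0" -c -m 100 -S MyServer -T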

Finally, why are you putting the information in a textbox? You can split the
file without putting the data in the UI.

--
Gregory A. Beamer
MVP; MCP: +I, SE, SD, DBA

*************************************************
Think outside of the box!
*************************************************
 
