Slow re-use of datatable - Why??

J

John Dann

Using VB2005, I have a class that retrieves some data from a file and
returns a datatable (typically 5000 - 50000 rows in size). The first
time this runs it works very quickly - small fraction of a second. But
if I call the same process again on the same data it now takes 2-3
seconds (5000 rows) to return and I can't see why this should be the
case.

In outline the class code is:

=================
Public class MyClass
Dim dt as New Datatable

Public function MakeDT(args) as datatable
dt.reset ' Ensure datatable is blank before reuse

' Process to create datatable schema
dt.columns.add("col1", GetType(Byte))
' etc for 65 columns

For i as integer = 0 to NumberRowBlocks
PopulateRowBlock
Next

Return dt
End Function

Private Function PopulateRowBlock()
For j as integer = 0 to RowsInBlock
Dim dr as DataRow = dt.NewRow
PopulateDataRow(i) ' Detailed code not shown
dt.Rows.Add(dr)
Next
End Function

End Class
==================

The context is that MyClass is instantiated in the calling code
elsewhere and then the public function MakeDT is called as required,
returning a different datatable each time dependent on the args
supplied.

Timing different steps reveals that the step that takes the time
second time round (and subsequently) is creating the schema.

I don't claim to understand the details of datatables too well. Maybe
using dt.reset isn't the best way to clear the datatable of previous
data/schema. Maybe I should declare the datatable inside MakeDT and
not at class level and pass the datatable to my PopulateRowBlock
function?

I'd like to understand what's going slow/wrong here. Any thoughts
please?
 
G

Göran Andersson

John said:
Using VB2005, I have a class that retrieves some data from a file and
returns a datatable (typically 5000 - 50000 rows in size). The first
time this runs it works very quickly - small fraction of a second. But
if I call the same process again on the same data it now takes 2-3
seconds (5000 rows) to return and I can't see why this should be the
case.

In outline the class code is:

=================
Public class MyClass
Dim dt as New Datatable

Public function MakeDT(args) as datatable
dt.reset ' Ensure datatable is blank before reuse

' Process to create datatable schema
dt.columns.add("col1", GetType(Byte))
' etc for 65 columns

For i as integer = 0 to NumberRowBlocks
PopulateRowBlock
Next

Return dt
End Function

Private Function PopulateRowBlock()
For j as integer = 0 to RowsInBlock
Dim dr as DataRow = dt.NewRow
PopulateDataRow(i) ' Detailed code not shown
dt.Rows.Add(dr)
Next
End Function

End Class
==================

The context is that MyClass is instantiated in the calling code
elsewhere and then the public function MakeDT is called as required,
returning a different datatable each time dependent on the args
supplied.

Well, actually it doesn't. It returns the same data table repopulated
with new data. That means that if you call the method twice to populate
two tables, you will only get two references to the same data table, and
the first data will be overwritten by the second.
Timing different steps reveals that the step that takes the time
second time round (and subsequently) is creating the schema.

I don't claim to understand the details of datatables too well. Maybe
using dt.reset isn't the best way to clear the datatable of previous
data/schema.

Well, it's pretty much the only way, but there isn't really any reason
to hang on to the data table only to reuse it. That means among other
things that you keep it from being garbage collected when it's not
needed any more.
Maybe I should declare the datatable inside MakeDT and
not at class level and pass the datatable to my PopulateRowBlock
function?
Definitely.

I'd like to understand what's going slow/wrong here. Any thoughts
please?

I looked a bit at how the object is implemented, but I couldn't see
anything obvious.
 
J

John Dann

Definitely.

Many thanks for the reply.

I think what I was concerned about was that by passing the datatable
as an argument to another function then I'd might end up with multiple
copies of the growing and potentially large datatable in memory, which
would in itself be inefficient and maybe very slow.

This obviously reveals my lack of knowledge about how passing such
arguments actually works. So if I rewrote my code as:

=================
Public class MyClass

Public function MakeDT(args) as datatable
Dim dt as New Datatable

' Process to create datatable schema
dt.columns.add("col1", GetType(Byte))
' etc for 65 columns

For i as integer = 0 to NumberRowBlocks
PopulateRowBlock(dt)
Next

Return dt
End Function

Private Function PopulateRowBlock(ByRef dt as datatable)
For j as integer = 0 to RowsInBlock
Dim dr as DataRow = dt.NewRow
PopulateDataRow(j) ' Detailed code not shown
dt.Rows.Add(dr)
Next
End Function

End Class
==================

ie now declaring the datatable inside the main function and passing it
to the PopulateRowBlock function, then this would be preferable? (Not
sure whether ByRef is appropriate - I'm leaning in that direction to
ensure that the original datatable keeps growing as intended, but
maybe I've misunderstood how this works?)
 
G

Göran Andersson

John said:
Many thanks for the reply.

I think what I was concerned about was that by passing the datatable
as an argument to another function then I'd might end up with multiple
copies of the growing and potentially large datatable in memory, which
would in itself be inefficient and maybe very slow.

No need to fear that. There is no automatic copying of objects in .NET.
To create objects you need to use the 'new' keyword (or call a method
that does it for you, like the Clone method).
This obviously reveals my lack of knowledge about how passing such
arguments actually works. So if I rewrote my code as:

=================
Public class MyClass

Public function MakeDT(args) as datatable
Dim dt as New Datatable

' Process to create datatable schema
dt.columns.add("col1", GetType(Byte))
' etc for 65 columns

For i as integer = 0 to NumberRowBlocks
PopulateRowBlock(dt)
Next

Return dt
End Function

Private Function PopulateRowBlock(ByRef dt as datatable)
For j as integer = 0 to RowsInBlock
Dim dr as DataRow = dt.NewRow
PopulateDataRow(j) ' Detailed code not shown
dt.Rows.Add(dr)
Next
End Function

End Class
==================

ie now declaring the datatable inside the main function and passing it
to the PopulateRowBlock function, then this would be preferable?
Yes.

(Not
sure whether ByRef is appropriate - I'm leaning in that direction to
ensure that the original datatable keeps growing as intended, but
maybe I've misunderstood how this works?)

No, it's not appropriate. You would only need to use ByRef if you were
replacing the data table with a new data table in the method. You should
use ByVal instead (which is the default way of passing parameters).

A DataTable is a reference type, so what you are passing to the method
is the reference to the object, not a copy of the object. If you use
ByRef you will be passing the reference by reference, which means that
you are passing a reference to the variable that is a reference to the
object instead of a reference to the object itself.
 
T

Trevor Benedict

Have you tried this, instead of using the reset method.

Public function MakeDT(args) as datatable
'dt.reset ' Ensure datatable is blank before reuse
dt = new DataTable

I don't fully understand your logic for defining the structure of the
database, especially where args is not used in your pseudo code... If you
try to add the same column name twice, it would throw an exception [see this
line dt.columns.add("col1", GetType(Byte))].

or did you plan to use datatable.clear instead.

Regards,

Trevor Benedict
MCSD
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top