Mike S
I'm working on a .NET application that was under development by a
previous developer, who is no longer working with the company. The
program basically extracts records from a source database and outputs
them in an XML document, mapping the source records into a third-party
XML schema. The mapping process is handled by C# code that was
auto-generated by Altova MapForce -- code was added to extract a subset
of records from the source database into a temporary database, and the
MapForce code is then executed against this subset, to prevent
unnecessary processing. When this subset contains tens of thousands of
records (which is common for the databases the application will run
against), the performance of the auto-generated mapping code is
horrendous. The developer who had originally worked on this application
knew this was unacceptable and he modified parts of the auto-generated
code to increase performance -- this sped things up considerably.
However, when I look at the untouched auto-generated code and compare
it to his optimized version, I'm wondering how it actually runs better.
The (much simplified) original auto-generated code did something like
this:
// code starts here
using System.Xml;

// this method kicks off the mapping process
public void Run(string targetFilename)
{
    XmlDocument targetDoc = new XmlDocument();
    targetDoc.LoadXml("<root></root>");
    // call functions to do mapping
    loopCustomers(targetDoc);
    loopOrders(targetDoc);
    // ...corresponding loop*() functions for the other DB tables...
}

public void loopCustomers(XmlDocument doc)
{
    // CustomerNodeSet is a collection of XmlNodes filled by a DB query
    CustomerNodeSet cs = new CustomerNodeSet("SELECT * FROM [Customers]");
    while (cs.hasNextNode())
    {
        doc.DocumentElement.AppendChild(cs.getNextNode());
    }
}

public void loopOrders(XmlDocument doc)
{
    // same business as loopCustomers()
}
In the optimized version, he changed the loop*() methods. For example,
loopCustomers() was modified as follows:
public void loopCustomers(XmlDocument doc)
{
    // CustomerNodeSet is a collection of XmlNodes filled by a DB query
    CustomerNodeSet cs = new CustomerNodeSet("SELECT * FROM [Customers]");
    // keep track of records we've seen
    int numRecords = 0;
    // for naming temp files
    int tempFileID = 0;
    while (cs.hasNextNode())
    {
        doc.DocumentElement.AppendChild(cs.getNextNode());
        // where the magic happens
        ++numRecords;
        if (numRecords == 2000)
        {
            // save what's in the XmlDocument so far to a temp file...
            ++tempFileID;
            doc.Save(@"C:\tmpCustomer" + tempFileID + ".xml");
            // ...and try to reclaim the memory those 2000 records were occupying
            doc = null;
            GC.Collect();
            // reset numRecords so that this can happen again after the next 2000 records are read
            numRecords = 0;
        } // end if (numRecords == 2000)
    } // end while (cs.hasNextNode())
}
Now, from my understanding of how garbage collection works in .NET, I
would expect that the lines that attempt to reclaim memory (doc = null;
GC.Collect();) will never actually reclaim the memory doc was using. I
base this on the simple observation that the parameter "doc" is passed
by value. Because of this, the call
loopCustomers(targetDoc);
that appears in Run() passes a copy of the reference to targetDoc as
the "doc" argument of loopCustomers(). So when doc is set to null in
loopCustomers(), only the *copy* of the reference (the one on the
stack) is set to null; targetDoc is still non-null. Thus, when garbage
collection is forced with GC.Collect(), the document will never be
collected -- in fact, it can't be collected at any time inside the
loopCustomers() method, because targetDoc in Run() still holds a live
reference.
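To illustrate that pass-by-value reference semantics in isolation, here is a minimal standalone sketch (the Box class and Nullify method are made up for the example, not from the real code):

```csharp
using System;

class Box { public int Value = 42; }

class Program
{
    // "box" receives a *copy* of the caller's reference
    public static void Nullify(Box box)
    {
        box = null; // only the local copy is cleared
    }

    static void Main()
    {
        Box b = new Box();
        Nullify(b);
        // b still refers to the original object, which therefore
        // cannot be collected here even if GC.Collect() were called
        Console.WriteLine(b == null); // False
        Console.WriteLine(b.Value);   // 42
    }
}
```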
The obvious solution is to pass the XmlDocument to loopCustomers() by
reference instead of by value, so that only one reference to it will
exist while loopCustomers() is running, thereby allowing it to be
collected inside loopCustomers().
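For the record, that would look something like the sketch below (the method name is mine, not from the real code): with "ref", the callee and caller share the same storage location, so nulling the parameter really does drop the caller's reference.

```csharp
using System;
using System.Xml;

class RefDemo
{
    // the "ref" keyword aliases the caller's variable itself,
    // rather than copying the reference onto the callee's stack
    public static void NullifyByRef(ref XmlDocument doc)
    {
        doc = null;
    }

    static void Main()
    {
        XmlDocument targetDoc = new XmlDocument();
        targetDoc.LoadXml("<root></root>");
        NullifyByRef(ref targetDoc);
        // now nothing references the document, so it is eligible for collection
        Console.WriteLine(targetDoc == null); // True
    }
}
```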
However... since my colleague's code *did* in fact use pass-by-value
semantics, shouldn't his "optimized" version have run just as slowly as
(or slower than) the unoptimized version, since his XmlDocument object
would never be collected while inside one of the loop*() functions?
The XmlDocument would keep growing with each successive call to
AppendChild(), just as it did in the unoptimized version, but it would
never be freed, since targetDoc in Run() would still be holding a
reference.
Am I missing something here? I would definitely expect his code to run
at least as poorly as the unoptimized code, if not worse -- I simply
can't see how it turns out faster...