Performance over COM

L

Linan

Recently I wrote a program to help my wife to extract data from about
500 excel files. For each file, around 1000 cells are visited and the
texts are retrieved.

Firstly I wrote a python script, used win32com module. The program ran
well, but took 78 mins to accomplished the task. Of course it was a
little bit slow, so I gave csharp a go. The csharp program took 72
mins.

Both Excel and .Net are MS product, run on a microsoft operating
system. However, I don't see any performance advantages. I don't want
to fire a script/compile laugange or opensource/ms war. My question is,
any tricks to speed COM related operation up in csharp, without using
multiple threading?

My code:
<pre>
Application excelApp=new Application();

// Go through 500+ excel files
foreach(string file in fileList){

excelApp.Workbooks.Open(file,Type.Missing, Type.Missing, Type.Missing,
Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing,
Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing,
Type.Missing);

// Go through 1000+ cells in each file
foreach(DictionaryEntry de in cellHash){
string id=(string)de.Key;
string[] val=(string[])de.Value;
string
v=(string)excelApp.Sheet[val[1]].get_Range(val[0],Missing.Value).Text;
Console.WriteLine("{0}\t{1}\t{2}\t{3}",id,day,time,v);
}

excelApp.Workbooks.Close();
}
</pre>
 
S

Stephany Young

If what you have shown us is, in fact, the code for your actual process then
some factors jump out and hit one between the eyes, including:

the retrieval of the value from the Excel cell is executed 500,000+
times

the Console.Writeline instructuion is also executed 500,000+ times

If you run the program with the Console.Writeline instruction commented out,
you will get a feel for just how long the interaction with Excel takes.

Each interaction with Excel is an expensive operation so the ideal situation
is to reduce the number of interactions to the absolute minimum.

You are reading from the same cells in each Workbook. Instead of reading
form each cell, try reading the the entire range into an array and then
reading the individual 'cells' from the array. This will reduce the number
of interactions with Excel some 1000 fold.
 
S

Sylvain Lafontaine

I will add to that that instead of opening Excell with Automation (something
that will always be slow whatever the scripting language you are using); you
can try reading the file using the JET OLEDB provider or the Excel ODBC
Provider:

http://support.microsoft.com/default.aspx?scid=kb;en-us;278973

http://msdn2.microsoft.com/en-us/library/ms175866.aspx

http://www.carlprothman.net/Default.aspx?tabid=87#OLEDBProviderForMicrosoftJetExcel

http://www.carlprothman.net/Default.aspx?tabid=90#ODBCDriverForExcel

Finally, for this type of work, probably that you should use a database
(Access or SQL-Server) instead of Excel. Excel is something like a big
calculator, not a database.

--
Sylvain Lafontaine, ing.
MVP - Technologies Virtual-PC
E-mail: http://cerbermail.com/?QugbLEWINF


Stephany Young said:
If what you have shown us is, in fact, the code for your actual process
then some factors jump out and hit one between the eyes, including:

the retrieval of the value from the Excel cell is executed 500,000+
times

the Console.Writeline instructuion is also executed 500,000+ times

If you run the program with the Console.Writeline instruction commented
out, you will get a feel for just how long the interaction with Excel
takes.

Each interaction with Excel is an expensive operation so the ideal
situation is to reduce the number of interactions to the absolute minimum.

You are reading from the same cells in each Workbook. Instead of reading
form each cell, try reading the the entire range into an array and then
reading the individual 'cells' from the array. This will reduce the
number of interactions with Excel some 1000 fold.



Linan said:
Recently I wrote a program to help my wife to extract data from about
500 excel files. For each file, around 1000 cells are visited and the
texts are retrieved.

Firstly I wrote a python script, used win32com module. The program ran
well, but took 78 mins to accomplished the task. Of course it was a
little bit slow, so I gave csharp a go. The csharp program took 72
mins.

Both Excel and .Net are MS product, run on a microsoft operating
system. However, I don't see any performance advantages. I don't want
to fire a script/compile laugange or opensource/ms war. My question is,
any tricks to speed COM related operation up in csharp, without using
multiple threading?

My code:
<pre>
Application excelApp=new Application();

// Go through 500+ excel files
foreach(string file in fileList){

excelApp.Workbooks.Open(file,Type.Missing, Type.Missing, Type.Missing,
Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing,
Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing,
Type.Missing);

// Go through 1000+ cells in each file
foreach(DictionaryEntry de in cellHash){
string id=(string)de.Key;
string[] val=(string[])de.Value;
string
v=(string)excelApp.Sheet[val[1]].get_Range(val[0],Missing.Value).Text;
Console.WriteLine("{0}\t{1}\t{2}\t{3}",id,day,time,v);
}

excelApp.Workbooks.Close();
}
</pre>
 
O

Otis Mukinfus

Recently I wrote a program to help my wife to extract data from about
500 excel files. For each file, around 1000 cells are visited and the
texts are retrieved.

Firstly I wrote a python script, used win32com module. The program ran
well, but took 78 mins to accomplished the task. Of course it was a
little bit slow, so I gave csharp a go. The csharp program took 72
mins.

Both Excel and .Net are MS product, run on a microsoft operating
system. However, I don't see any performance advantages. I don't want
to fire a script/compile laugange or opensource/ms war. My question is,
any tricks to speed COM related operation up in csharp, without using
multiple threading?

My code:
<pre>
Application excelApp=new Application();

// Go through 500+ excel files
foreach(string file in fileList){

excelApp.Workbooks.Open(file,Type.Missing, Type.Missing, Type.Missing,
Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing,
Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing,
Type.Missing);

// Go through 1000+ cells in each file
foreach(DictionaryEntry de in cellHash){
string id=(string)de.Key;
string[] val=(string[])de.Value;
string
v=(string)excelApp.Sheet[val[1]].get_Range(val[0],Missing.Value).Text;
Console.WriteLine("{0}\t{1}\t{2}\t{3}",id,day,time,v);
}

excelApp.Workbooks.Close();
}
</pre>
Ask your wife to save her files as CSV files and process them by reading the
text files produced by the Excel save to CSV. You will be able to read them
much faster and process them faster too.
Good luck with your project,

Otis Mukinfus
http://www.arltex.com
http://www.tomchilders.com
 
L

Linan

The OleDb solution speeds up the program. Now it takes only 35 mins.

Thank you very much!
I will add to that that instead of opening Excell with Automation (something
that will always be slow whatever the scripting language you are using); you
can try reading the file using the JET OLEDB provider or the Excel ODBC
Provider:

http://support.microsoft.com/default.aspx?scid=kb;en-us;278973

http://msdn2.microsoft.com/en-us/library/ms175866.aspx

http://www.carlprothman.net/Default.aspx?tabid=87#OLEDBProviderForMicrosoftJetExcel

http://www.carlprothman.net/Default.aspx?tabid=90#ODBCDriverForExcel

Finally, for this type of work, probably that you should use a database
(Access or SQL-Server) instead of Excel. Excel is something like a big
calculator, not a database.

--
Sylvain Lafontaine, ing.
MVP - Technologies Virtual-PC
E-mail: http://cerbermail.com/?QugbLEWINF


Stephany Young said:
If what you have shown us is, in fact, the code for your actual process
then some factors jump out and hit one between the eyes, including:

the retrieval of the value from the Excel cell is executed 500,000+
times

the Console.Writeline instructuion is also executed 500,000+ times

If you run the program with the Console.Writeline instruction commented
out, you will get a feel for just how long the interaction with Excel
takes.

Each interaction with Excel is an expensive operation so the ideal
situation is to reduce the number of interactions to the absolute minimum.

You are reading from the same cells in each Workbook. Instead of reading
form each cell, try reading the the entire range into an array and then
reading the individual 'cells' from the array. This will reduce the
number of interactions with Excel some 1000 fold.



Linan said:
Recently I wrote a program to help my wife to extract data from about
500 excel files. For each file, around 1000 cells are visited and the
texts are retrieved.

Firstly I wrote a python script, used win32com module. The program ran
well, but took 78 mins to accomplished the task. Of course it was a
little bit slow, so I gave csharp a go. The csharp program took 72
mins.

Both Excel and .Net are MS product, run on a microsoft operating
system. However, I don't see any performance advantages. I don't want
to fire a script/compile laugange or opensource/ms war. My question is,
any tricks to speed COM related operation up in csharp, without using
multiple threading?

My code:
<pre>
Application excelApp=new Application();

// Go through 500+ excel files
foreach(string file in fileList){

excelApp.Workbooks.Open(file,Type.Missing, Type.Missing, Type.Missing,
Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing,
Type.Missing, Type.Missing, Type.Missing, Type.Missing, Type.Missing,
Type.Missing);

// Go through 1000+ cells in each file
foreach(DictionaryEntry de in cellHash){
string id=(string)de.Key;
string[] val=(string[])de.Value;
string
v=(string)excelApp.Sheet[val[1]].get_Range(val[0],Missing.Value).Text;
Console.WriteLine("{0}\t{1}\t{2}\t{3}",id,day,time,v);
}

excelApp.Workbooks.Close();
}
</pre>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top