Dataset question

  • Thread starter: Guest (Bas Hamer)
I guess I don't know how to word it better than that.

Our company has machines that generate log files in our own proprietary
language. A while back I wrote a class that took one of these files and
loaded all the data into a DataSet made up of some predefined tables and some
dynamic tables. This worked well for a while and gave me the ability to do
searches, although each search ended up being a lot of custom code.

Now I'm getting to the point where I need to revisit this code and I'm
trying to create a hierarchy of classes.

So the base class is Log, and that is inherited by a class called
MachineXLog. Since I know what information will appear in MachineXLog, I
can create a strongly typed DataSet that defines all the fields of this new
table.

So the problem comes in at this point. I have a large table of dynamic
fields. I want to run a select statement on this table, grab a subset of
those fields, and put that data in a strongly typed table for easy coding
access.
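
To make that concrete, here is roughly the shape of what I have in mind
(just a sketch; "RawLog", "MachineXLogDataSet", and the column names are
placeholders, not my real schema):

// Sketch only: "RawLog", "MachineXLogDataSet", and the columns are placeholders.
DataTable raw = logDataSet.Tables["RawLog"];           // the big dynamic table
MachineXLogDataSet typed = new MachineXLogDataSet();    // the strongly typed dataset

// Filter the dynamic table, then copy just the columns the typed table defines.
foreach (DataRow src in raw.Select("EventType = 'Fault'", "TimeStamp ASC"))
{
    MachineXLogDataSet.MachineXLogRow dest = typed.MachineXLog.NewMachineXLogRow();
    dest.TimeStamp = (DateTime)src["TimeStamp"];
    dest.Station   = (string)src["Station"];
    dest.Message   = (string)src["Message"];
    typed.MachineXLog.AddMachineXLogRow(dest);
}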

One of the big sticking points is performance. I need to move ~12 GB of
compressed data through this process (converted to XML, it turns into roughly
600 GB).

.NET 2.0 answers are fine as well.
 
Bas,

It would seem that the DataSet is not a good fit for you. The first
option I would recommend is creating your own objects instead of using typed
DataSets, and handling the parsing yourself into those objects, in some
format that is more easily searchable. While I am a big advocate of typed
DataSets, they are clearly inadequate for some situations (like this one).
Querying DataSets in general is a pain. Also, loading this amount of data
into memory is going to kill your performance.
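
As a rough illustration of that first option (the type and field names below
are just placeholders, and the parsing of your proprietary format would go in
ParseLine):

using System;
using System.Collections.Generic;
using System.IO;

// Rough sketch of the "plain objects" approach; field names are placeholders.
public class LogEntry
{
    public DateTime TimeStamp;
    public string Station;
    public string Message;
}

public class MachineXLogFile
{
    private List<LogEntry> entries = new List<LogEntry>();

    // Parse one proprietary log file straight into lightweight objects
    // instead of loading everything into a DataSet.
    public void Load(string path)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                LogEntry entry = ParseLine(line);
                if (entry != null)
                    entries.Add(entry);
            }
        }
    }

    // Searching becomes a simple predicate over the list.
    public List<LogEntry> Find(Predicate<LogEntry> match)
    {
        return entries.FindAll(match);
    }

    private LogEntry ParseLine(string line)
    {
        // Placeholder: decode one line of the proprietary format here.
        return null;
    }
}

Searching a List<T> of small objects like this avoids the per-row overhead a
DataTable carries, and the filter is just a delegate instead of custom code
per search.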

The second option (and I don't know how feasible it is) is to store the
log information in a database. Database servers are meant to handle this
amount of data and to run queries over it. It would save you a ton of
development time if your store were a database (with regular backups, for
obvious reasons).
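
If that route ever opens up, the loading side is fairly painless in .NET 2.0
with SqlBulkCopy. A minimal sketch, assuming SQL Server and made-up connection
string and table names:

using System.Data;
using System.Data.SqlClient;

// Sketch: bulk-load one parsed file's rows into SQL Server.
// The connection string and destination table are placeholders.
static void BulkLoad(DataTable parsedRows)
{
    using (SqlConnection conn = new SqlConnection(
        "Data Source=(local);Initial Catalog=MachineLogs;Integrated Security=true"))
    {
        conn.Open();
        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "dbo.MachineXLog";
            bulk.BatchSize = 5000;            // commit in chunks rather than one huge batch
            bulk.WriteToServer(parsedRows);   // column names must match, or add ColumnMappings
        }
    }
}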

As for the bloat you see when storing the data as XML, there is no way
around it. That is simply the nature of XML: the persistence format for the
infoset is text, and text is not an efficient representation of data.

Hope this helps.
 
Oops, I should have clarified this.

The individual sets are ~18 MB of XML, or about 600 KB in the compressed
proprietary format. I only handle ~4 of these at a time. There are, however,
many files, and some searches require that I look at all of them. That is
where the 300 GB comes from.

Parsing them all once and just storing them in a database would be a
nice solution, and it would remove the time constraints since it would be a
one-time process, but that idea was shut down from above.



Bas,

Is there any way that you could alter that thinking? What was the
reason for not doing it? After all, the performance you are going to get
out of using a DB is going to trump whatever you do in this area.

If you have to look at all of the files, the only way I can think of to
search through all of them is to open one up, filter out the records you want,
copy the filtered records into an object or database, load the next file,
filter, copy to the result set, and so on.

If possible, you should write your logs directly into the database (in
addition to writing them to your log files, if you wish). That would be the
optimal solution.
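
In rough C#, that loop would look something like this (reusing the placeholder
LogEntry/MachineXLogFile types from my earlier sketch; logDirectory and
MatchesSearchCriteria are stand-ins for your folder and your filter):

// One file at a time: only the matches stay in memory.
List<LogEntry> results = new List<LogEntry>();

foreach (string path in Directory.GetFiles(logDirectory, "*.log"))
{
    MachineXLogFile log = new MachineXLogFile();
    log.Load(path);                                      // decompress/parse a single file
    results.AddRange(log.Find(MatchesSearchCriteria));   // keep only the filtered records
    // the rest of the parsed file goes out of scope here and can be collected
}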


--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)

In effect, that is what I do now. I spawn a number of searcher threads that
fetch, open, and search files, and just write the results to a file. I got it
down to ~6 hours on a 3.2 GHz P4 with Hyper-Threading, running 4 threads.

I guess the database approach is the most feasible, as our files are starting
to grow both in number and in size.
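
For reference, the current searchers are shaped roughly like this (simplified;
logDirectory is a placeholder, and SearchOneFile stands in for the real
decompress/parse/filter step, which serializes its own writes to the result
file):

using System.Collections.Generic;
using System.IO;
using System.Threading;

// Simplified shape of the searchers: 4 threads pulling file paths off a shared queue.
Queue<string> work = new Queue<string>(Directory.GetFiles(logDirectory, "*.log"));
object workLock = new object();

ThreadStart searcher = delegate
{
    while (true)
    {
        string path;
        lock (workLock)
        {
            if (work.Count == 0) return;   // queue drained, thread exits
            path = work.Dequeue();
        }
        SearchOneFile(path);   // decompress, parse, filter, append matches to the result file
    }
};

Thread[] threads = new Thread[4];
for (int i = 0; i < threads.Length; i++)
{
    threads[i] = new Thread(searcher);
    threads[i].Start();
}
foreach (Thread t in threads)
    t.Join();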
