Architecture question on parsing a large text file


RayLopez99

I have a text file that has five pieces of information in it per
line: ASCII text and numbers delimited by commas and a line break on
each line.

What is the best way to organize this data so I can analyse it? The
files are hundreds of MB large, so loading them all into memory via
XML is impractical (I think--or is it? I have a 32 bit system with 4
GB RAM). Better to read them "serially" using XML? Or better to set
up a database that can be queried?

Which is better--or is there an even better way? I'm leaning towards
a database in SQL Server.

I plan to code in C# using Visual Studio 2010.
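
For reference, the sort of line-by-line read I have in mind is roughly
this (the field meanings are just placeholders at this stage):

// rough sketch: read one day's file line by line without loading it all
static void ReadOneFile(string path)
{
    using (var reader = new System.IO.StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(',');   // five comma-separated values
            // ... parse and accumulate here ...
        }
    }
}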

RL
 

Registered User

I have a text file that has five pieces of information in it per
line: ASCII text and numbers delimited by commas and a line break on
each line.

What is the best way to organize this data so I can analyse it?

What is wrong with the way the data is already organized?

The files are hundreds of MB large, so loading them all into memory via
XML is impractical (I think--or is it? I have a 32 bit system with 4
GB RAM).
http://lmgtfy.com/?q=how+do+computers+work

Better to read them "serially" using XML?

XML? You'll have to read the CSV files to create the XML files. Then
you'll have to read the XML files before performing any analysis.
Unless changing the format of the data adds value why bother?
Or better to set up a database that can be queried?

Which is better--or is there an even better way? I'm leaning towards
a database in SQL Server.
A transactional database might be suitable for layout-led analysis. If
the intent is to extract business intelligence by data-led discovery,
use SQL Server Analysis Services and an OLAP database.
Import the CSV data using SQL Server Integration Services. SQL Server
Reporting Services is another valuable tool.
I plan to code in C# using Visual Studio 2010.

Everything can be done using SQL Server's Business Intelligence Design
Studio.
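
If you would rather do the load step from C# instead of SSIS, a rough
SqlBulkCopy sketch might look like this (connection string, table and
column names are all placeholders):

using System.Data;
using System.Data.SqlClient;
using System.IO;

class CsvLoader
{
    static void Main()
    {
        // in-memory staging table matching the (placeholder) target schema
        var table = new DataTable();
        foreach (var name in new[] { "Field1", "Field2", "Field3", "Field4", "Field5" })
            table.Columns.Add(name, typeof(string));

        // one CSV line per row, five comma-separated values
        foreach (var line in File.ReadLines(@"C:\data\sales.csv"))
            table.Rows.Add(line.Split(','));

        // push everything into SQL Server in one bulk operation
        using (var bulk = new SqlBulkCopy(
            "Data Source=.;Initial Catalog=Staging;Integrated Security=True"))
        {
            bulk.DestinationTableName = "dbo.RawSales";
            bulk.WriteToServer(table);
        }
    }
}

For files in the hundreds of MB you would feed SqlBulkCopy in batches
(or hand it an IDataReader) rather than build one huge DataTable, but
the shape is the same.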

regards
A.G.
 

James A. Fortune

I have a text file that has five pieces of information in it per
line:  ASCII text and numbers delimited by commas and a line break on
each line.

What is the best way to organize this data so I can analyse it?  The
files are hundreds of MB large, so loading them all into memory via
XML is impractical (I think--or is it?  I have a 32 bit system with 4
GB RAM).  Better to read them "serially" using XML?  Or better to set
up a database that can be queried?

Which is better--or is there an even better way?  I'm leaning towards
a database in SQL Server.

I plan to code in C# using Visual Studio 2010.

RL

I would import the CSV file into Excel and then use the 'dynamic'
typing of C# 4.0 to grab the data from the Excel object to put into
SQL Server. It might go along the lines of some modified Microsoft
sample PDC09 code that is almost certain not to work 'as is' for your
situation :):

dynamic _excel = null;

try
{
    // AutomationFactory is the Silverlight automation helper; on the
    // desktop you would typically use Type.GetTypeFromProgID plus
    // Activator.CreateInstance instead
    _excel = AutomationFactory.CreateObject("Excel.Application");
}
catch { }

if (_excel != null)
{
    _excel.Visible = true;

    dynamic workbooks = _excel.Workbooks;
    workbooks.Open(@"C:\MyXL.xlsm");

    dynamic sheet = _excel.ActiveSheet;

    // transfer spreadsheet data to SQL Server
}

If possible, try to use Excel Automation (perhaps by assigning a Range
object to an array) to grab the entire sheet of data with one command
for performance. Hopefully, this gets you looking in the right
direction.
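
For example, something along these lines (again, not tested against
your workbook) pulls the whole used range in a single call:

dynamic sheet = _excel.ActiveSheet;
object[,] values = (object[,])sheet.UsedRange.Value2;  // 1-based, rows x columns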

James A. Fortune
(e-mail address removed)
 

James A. Fortune

I have a text file that has five pieces of information in it per
line:  ASCII text and numbers delimited by commas and a line break on
each line.

What is the best way to organize this data so I can analyse it?  The
files are hundreds of MB large, so loading them all into memory via
XML is impractical (I think--or is it?  I have a 32 bit system with 4
GB RAM).  Better to read them "serially" using XML?  Or better to set
up a database that can be queried?

Which is better--or is there an even better way?  I'm leaning towards
a database in SQL Server.

I plan to code in C# using Visual Studio 2010.

RL

Also see PDC09 session FT11:

Future Directions for C# and Visual Basic - Luca Bolognese

He shows a demo for reading CSV files. All this is just in case you
can't simply import a CSV file directly into SQL Server :).

BTW, by serendipitously finding an alternate URL at Microsoft and
getting a propitious alignment of the stars during pre-dawn hours
Pacific time, I was able to download the PDC10 sessions. I'm about
40% done with viewing the PDC09 sessions. Maybe by 2012 I'll be able
to attend the real thing.

James A. Fortune
(e-mail address removed)
 

RayLopez99

What is wrong with the way the data is already organized?

I want to grab the data, mash it up with other data, etc etc etc.
Where to store said data--should it be stored on the hard drive as
ASCII text files? Or in a database? If a database I can do stuff
like SQL queries on Linq-to-SQL. If a text or XML file I cannot do
SQL queries but it's easier I guess to code.



Save me the work bro. Please tell me how many text files I can load
into 3.6 GB of memory. Please tell me. And btw I don't know how
you can monitor your memory in Visual Studio so that you know when you
are running out of RAM--I just will let the OS figure it out and swap
stuff, etc, but I have a feeling that's not optimal from a user's
point of view. Any pointers as to how to get a program to throttle
itself if it is running out of RAM on a PC the program is loaded on
would be greatly appreciated.

My thanks in advance.
XML? You'll have to read the CSV files to create the XML files. Then
you'll have to read the XML files before performing any analysis.
Unless changing the format of the data adds value why bother?

See the above. I'm thinking (out loud) that perhaps changing the
format might add value. For example I'm thinking of converting the
string to a binary representation so I can manipulate it faster, but
perhaps string/StringBuilder is fast enough...I'll have to
experiment. But the main thing is that I think it might be helpful to
get the text data into a database so I can run SQL and/or Linq-to-SQL
queries.
A transactional database might be suitable for layout-led analysis. If
the intent is to extract business intelligence by data-led discovery,
use SQL Server Analysis Services and an OLAP database.
Import the CSV data using SQL Server Integration Services. SQL Server
Reporting Services is another valuable tool.

OK. Is that shit free? I'm a penniless student for purposes of this
project.
Everything can be done using SQL Server's Business Intelligence Design
Studio.

regards
A.G.

Thank you A.G.

RL
 

RayLopez99

So why don't you just save them as XML files on the hard drive and query
the XML files using Linq-2-XML? Or if you are using SQL Server 2005 or
better, you can load the XML into a SQL server table and query the XML
that way too.

http://msdn.microsoft.com/en-us/library/ms345117(v=sql.90).aspx

OK, that's one line of attack, but I'm trying to get all points of
view. Since there will be hundreds of thousands of files per
transaction period, I'm thinking the I/O access will waste too much
time--is SQL faster? That is, if you have 100000 pieces of info in a
database, querying it will be faster (my supposition) than loading
100000 XML (or .CSV, or .txt) files?

BTW thanks for that "Do Factory" book recommendation on Gang of Four
templates--I bought the book, went through the exercises, and it was
useful, though I still claim 95% of the time you will not use 95% of
the Gang of Four templates.

RL
 

RayLopez99

I would import the CSV file into Excel and then use the 'dynamic'
typing of C# 4.0 to grab the data from the Excel object to put into
SQL Server.  It might go along the lines of some modified Microsoft
sample PDC09 code that is almost certain not to work 'as is' for your
situation :):

dynamic _excel = null;

try
{
    _excel = AutomationFactory.CreateObject("Excel.Application");
}
catch { }

if (_excel != null)
{
    _excel.Visible = true;

    dynamic workbooks = _excel.Workbooks;
    workbooks.Open(@"C:\MyXL.xlsm");

    dynamic sheet = _excel.ActiveSheet;

    // transfer spreadsheet data to SQL Server
}

If possible, try to use Excel Automation (perhaps by assigning a Range
object to an array) to grab the entire sheet of data with one command
for performance.  Hopefully, this gets you looking in the right
direction.

James A. Fortune
(e-mail address removed)

This is invaluable "low level" stuff that might come in handy later.
But since my data is in a text file, it might be easier just to read
each file, and import the data into a SQL table, no? If you go the
table route. If you go the XML route you just save to XML. Why
bother with the Excel intermediary step?

RL
 

Registered User

I want to grab the data, mash it up with other data, etc etc etc.
Where to store said data--should it be stored on the hard drive as
ASCII text files? Or in a database? If a database I can do stuff
like SQL queries on Linq-to-SQL. If a text or XML file I cannot do
SQL queries but it's easier I guess to code.
Why not start a new thread and clearly state what you're trying to do.
The best way to store the data might depend upon the etc. etc.etc.

SQL queries can't be done against an XML file but LINQ-to-XML can be
used to query the document.
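
For example, a minimal LINQ-to-XML sketch against a made-up document
(needs System.Xml.Linq and System.Linq):

XDocument doc = XDocument.Load(@"C:\data\sales.xml");
var bigSales = from sale in doc.Descendants("Sale")
               where (decimal)sale.Element("Price") > 100m
               select sale;
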
Save me the work bro. Please tell me how many text files I can load
into 3.6 GB of memory. Please tell me.

Anywhere from none to ~infinity depending upon the file size. How many
files actually need to be loaded at once?
And btw I don't know how
you can monitor your memory in Visual Studio so that you know when you
are running out of RAM--I just will let the OS figure it out and swap
stuff, etc, but I have a feeling that's not optimal from a user's
point of view. Any pointers as to how to get a program to throttle
itself if it is running out of RAM on a PC the program is loaded on
would be greatly appreciated.

research performance monitor counters
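
For example, reading one of the standard counters (nothing custom
registered; System.Diagnostics):

var available = new System.Diagnostics.PerformanceCounter("Memory", "Available MBytes");
Console.WriteLine("Available RAM: {0} MB", available.NextValue());
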
My thanks in advance.


See the above. I'm thinking (out loud) that perhaps changing the
format might add value. For example I'm thinking of converting the
string to a binary representation so I can manipulate it faster, but
perhaps string/StringBuilder is fast enough...I'll have to
experiment.

Fast enough for what? Functionality that doesn't exist can't be
optimized. With a proper enterprise architecture, separation of
concerns (SoC) can be used to isolate this functionality. Then
different implementations of the functionality can be used to provide
performance-related metrics.
But the main thing is that I think it might be helpful to
get the text data into a database so I can run SQL and/or Linq-to-SQL
queries.
Consider the volume of data to be queried and how long it will take to
execute a complex query against that data in a transactional database.
For serious data analysis purposes, data management is needed that a
transactional database cannot provide.
OK. Is that shit free? I'm a penniless student for purposes of this
project.
There is an SQL Server 2008 R2 trial version which is good for a
180-day evaluation period.
http://www.microsoft.com/sqlserver/en/us/get-sql-server/try-it.aspx

A good text to follow is Delivering Business Intelligence with MS SQL
2008
http://www.amazon.com/Delivering-Business-Intelligence-Microsoft-Server/dp/0071549447

Even if an OLAP database is not used SSIS can be used to gather,
extract, manipulate, and import data into an OLTP database.

regards
A.G.
 

Arne Vajhøj


So you have learned how to use lmgtfy.

Next exercise: learn when it is relevant to post.
XML? You'll have to read the CSV files to create the XML files. Then
you'll have to read the XML files before performing any analysis.
Unless changing the format of the data adds value why bother?

XML has an in-memory DOM structure.

I have never heard of an in-memory DOM structure for CSV.

Arne
 

Arne Vajhøj

Save me the work bro. Please tell me how many text files I can load
into 3.6 GB of memory?

You will only be able to use 2 GB of virtual address space in 32-bit
Windows (and you can do that with 512 MB of RAM; having 2+ GB of RAM
just makes it perform decently - or at least potentially so)
Please tell me. And btw I don't know how
you can monitor your memory in Visual Studio so that you know when you
are running out of RAM--I just will let the OS figure it out and swap
stuff, etc, but I have a feeling that's not optimal from a user's
point of view. Any pointers as to how to get a program to throttle
itself if it is running out of RAM on a PC the program is loaded on
would be greatly appreciated.

You can watch memory usage from the outside with the Performance tab
of Windows Task Manager.

You can watch memory usage from the inside by using the Process class
(in System.Diagnostics).
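
For example, just a sketch:

long privateBytes = System.Diagnostics.Process.GetCurrentProcess().PrivateMemorySize64;
Console.WriteLine("Private bytes: {0:N0}", privateBytes);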

Arne
 

Arne Vajhøj

OK, that's one line of attack, but I'm trying to get all points of
view. Since there will be hundreds of thousands of files per
transaction period, I'm thinking the I/O access will waste too much
time--is SQL faster? That is, if you have 100000 pieces of info in a
database, querying it will be faster (my supposition) than loading
100000 XML (or .CSV, or .txt) files?

The database will most likely be faster.

Opening files is a relatively expensive operation.

Arne
 

Arne Vajhøj

I have a text file that has five pieces of information in it per
line: ASCII text and numbers delimited by commas and a line break on
each line.

What is the best way to organize this data so I can analyse it? The
files are hundreds of MB large, so loading them all into memory via
XML is impractical (I think--or is it? I have a 32 bit system with 4
GB RAM). Better to read them "serially" using XML? Or better to set
up a database that can be queried?

Which is better--or is there an even better way? I'm leaning towards
a database in SQL Server.

If you just have a few hundred MB, then you should be able to load
everything in memory.

300 MB on disk with a 1:5 ratio disk:memory will become 1.5 GB
(the 1:5 is just a guess, but the 5 is better than 10 and 2!).

There are many possibilities, but I think I would go for
one of these:

A) If data is <300 MB (now and in the future) and speed is
critical and access paths are well known (now and in the future),
then load everything into memory in a custom data structure
utilizing Dictionaries for fast lookup.

B) If you want a flexible solution both regarding data size and
access paths, then stuff it in a database and query that.

If in doubt pick #B.
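
For #A the skeleton could be as simple as this (field names are
placeholders, just a sketch):

// one object per CSV line; the five fields are whatever your file holds
class Record
{
    public string Key;       // e.g. an ID you look things up by
    public string[] Fields;  // the five comma-separated values
}

// key -> all records with that key (System.Collections.Generic)
Dictionary<string, List<Record>> index = new Dictionary<string, List<Record>>();
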
I plan to code in C# using Visual Studio 2010.

The optimal solution is most likely language independent
and certainly IDE independent.

Arne
 

RayLopez99

If you just have a few hundred MB, then you should be able to load
everything in memory.

300 MB on disk with a 1:5 ratio disk:memory will become 1.5 GB
(the 1:5 is just a guess, but the 5 is better than 10 and 2!).

There are many possibilities, but I think I would go for
one of these:

A) If data is <300 MB (now and in the future) and speed is
    critical and access paths are well known (now and in the future),
    then load everything into memory in a custom data structure
    utilizing Dictionaries for fast lookup.

B) If you want a flexible solution both regarding data size and
    access paths, then stuff it in a database and query that.

If in doubt pick #B.


The optimal solution is most likely language independent
and certainly IDE independent.

Arne

This is good !

Turns out the data is in ASCII text files that are about 250 KB to 400 KB
per transaction period (a day's worth of sales), and each line in the
file is a sale (in the exact same sequence of customer ID, date, sale
price, discount, etc), with about 5000 to 10000 sales per day.

The thing I am trying to do is statistical manipulation on each line of
the data, comparing it to previous days for, say, the same customer ID
(to spot trends in the customer's buying habits).

So I need to do complex covariance and statistical stuff: use home-grown
methods or SQL? I am leaning towards the latter since this does not have
to be "real time" analysis, so speed is not important--the program can
run overnight to find an answer. On the other hand, a dictionary is as
fast as greased lightning--10 ms per lookup as I recall--so perhaps a
custom data structure in memory is best. Or maybe a hybrid approach
(which is maybe best): have your sales transactions in the database,
load select ones into a custom data structure in memory, do your
complex statistical stuff in memory, then output the results either
back to the database or to another place.
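
For the statistics part I'm picturing a small helper like this, running
over rows already pulled out of the database (just a sketch; needs
System.Linq and System.Collections.Generic):

// covariance (population form, divides by n) of two equal-length series,
// e.g. price vs. discount for one customer
static double Covariance(IList<double> xs, IList<double> ys)
{
    double meanX = xs.Average();
    double meanY = ys.Average();
    return xs.Zip(ys, (x, y) => (x - meanX) * (y - meanY)).Average();
}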

RL
 

RayLopez99

On 8/27/2011 11:17 AM, RayLopez99 wrote:

Programmers take bits and pieces of things they have learned over the
years and apply them in development. It's more along the lines that 95%
of programmers don't even know who or what the GOF is.

Yes, I feel smarter having read the DoFactory book and going through
the GOF patterns. Now I feel superior, and if anybody talks badly about
the GOF or, worse, does not know what it is, I can look on them with
disdain.

RL
 

RayLopez99

Why not start a new thread and clearly state what you're trying to do.
The best way to store the data might depend upon the etc. etc.etc.

See my reply to Arne in this thread.
SQL queries can't be done against an XML file but LINQ-to-XML can be
used to query the document.





Anywhere from none to ~infinity depending upon the file size. How many
files actually need to be loaded at once?

Up to 10 years of data, each data file per day is about 300 KB on
average, so that's 365*10*300 KB = ~1.1 GB. So I guess I'm OK with it
all in memory, right? At least my machine is OK.
research performance monitor counters

OK thanks. I'll do that someday. Right now what I do is I figure a
"worst case" scenario, a "worst case" user's machine, then hard-code
stuff to work within those limits and work backwards, if I expect my
app to be run on somebody else's machine. Pretty piss-poor solution
but probably not uncommon.
Fast enough for what? Functionality that doesn't exist can't be
optimized. With a proper enterprise architecture, separation of
concerns (SoC) can be used to isolate this functionality. Then
different implementations of the functionality can be used to provide
performance-related metrics.


Consider the volume of data to be queried and how long it will take to
execute a complex query against that data in a transactional database.
For serious data analysis purposes, data management is needed that a
transactional database cannot provide.

Ah yes, but I was expecting a rule of thumb like what Arne helpfully
provided. You sound like a consultant: "pay me and I'll tell you".
LOL.

I have no idea what you are talking about. Let me Google this...
http://en.wikipedia.org/wiki/Online_analytical_processing
Holy shit that's exactly what I want to do, but the language they are
using is completely foreign--and I've read many a book on relational
databases. Seems to me they are using different words for simple
relational database stuff like 'inner join'. In any event, I am not
going to take a 3 to 6 month hiatus to relearn to speak in their kooky
language, since this project is for me not for a customer. But thanks
for the heads up.
There is an SQL Server 2008 R2 trial version which is good for a
180-day evaluation period.
http://www.microsoft.com/sqlserver/en/us/get-sql-server/try-it.aspx

I have SQL Server 2008 R2 that came with Visual Studio 2010. But I just
noticed my other machine, an XP, is running the trial version which
means within 6 months it will expire. When that happens I'm going to
piratebay to get a new copy or reinstall.
A good text to follow is Delivering Business Intelligence with MS SQL 2008
http://www.amazon.com/Delivering-Business-Intelligence-Microsoft-Serv....

Even if an OLAP database is not used SSIS can be used to gather,
extract, manipulate, and import data into an OLTP database.


Thanks but I'm not going there. I mastered David Sceppa's book on
databases, took 6 months to understand it, and that's sufficient for
me.

RL
 

Registered User

See my reply to Arne in this thread.


Up to 10 years of data, each data file per day is about 300 KB on
average, so that's 365*10*300 KB = ~1.1 GB. So I guess I'm OK with it
all in memory, right? At least my machine is OK.
Is it really necessary to have 3650 files in memory at the same time?
Such a design doesn't pass the smell test.
OK thanks. I'll do that someday. Right now what I do is I figure a
"worst case" scenario, a "worst case" user's machine, then hard-code
stuff to work within those limits and work backwards, if I expect my
app to be run on somebody else's machine. Pretty piss-poor solution
but probably not uncommon.
You have no idea of what you're talking about and couldn't be further
from the truth.
Ah yes, but I was expecting a rule of thumb like what Arne helpfully
provided. You sound like a consultant: "pay me and I'll tell you".
LOL.
Rule of thumb: There is a direct relationship between the size of the
database and the time it takes complex queries to execute. Complex
database queries can execute in seconds when the database is lightly
populated. As more and more data gets added it will take longer and
longer for complex queries to execute. Near the end of the long tail a
complex query can take hours to execute.
I have no idea what you are talking about. Let me Google this...
http://en.wikipedia.org/wiki/Online_analytical_processing
Holy shit that's exactly what I want to do, but the language they are
using is completely foreign--and I've read many a book on relational
databases. Seems to me they are using different words for simple
relational database stuff like 'inner join'.

The databases you've learned about have all been designed for
transactional purposes. Databases used for analytical processing are
highly denormalized by using a star or snowflake schema.
In any event, I am not
going to take a 3 to 6 month hiatus to relearn to speak in their kooky
language, since this project is for me not for a customer. But thanks
for the heads up.
Come back in six months and let us know how you're progressing on your
own. And why chatter about a user's machine being the worst case
scenario when you will be the only user.
I have SQL Server 2008 R2 that came with Visual Studio 2010. But I just
noticed my other machine, an XP, is running the trial version which
means within 6 months it will expire. When that happens I'm going to
piratebay to get a new copy or reinstall.
Pirate huh?


Thanks but I'm not going there. I mastered David Sceppa's book on
databases, took 6 months to understand it, and that's sufficient for
me.

All three of Sceppa's books are concerned with programming ADO.NET
which is much different from learning how to design, create and
maintain transactional or analytical databases.

regards
A.G.
 

RayLopez99

Is it really necessary to have 3650 files in memory at the same time?
Such a design doesn't pass the smell test.

??? really now. We are not on the same page or you are incompetent.
Why would it not pass the smell test to have data in memory comprising
10 years of sales? What passes the smell test for you, a week's worth
of sales? Maybe you still code for 1970s era PCs with limited memory?

You have no idea of what you're talking about and couldn't be further
from the truth.

Nope. Wrong again. Why don't you research why nobody writes articles
for 'research performance monitor counters' in all the usual online
programming tutorial websites? Because for 99% of the time you let the
OS handle memory problems. The program expands until the OS throttles
it. Even my very demanding chess program, which is one of the hardest
type programs to write, and written by a team of expert developers,
and which hogs the PC's resources when running, does that. I bet only
1% of programs will "self-throttle" by checking memory.

Rule of thumb: There is a direct relationship between the size of the
database and the time it takes complex queries to execute. Complex
database queries can execute in seconds when the database is lightly
populated. As more and more data gets added it will take longer and
longer for complex queries to execute. Near the end of the long tail a
complex query can take hours to execute.

Too general to be of any use, but thanks for the conversation.

The databases you've learned about have all been designed for
transactional purposes. Databases used for analytical processing are
highly denormalized by using a star or snowflake schema.

I see. So that's the little secret that you transactional database
freaks like to brag about and keep close to your vest. Your databases
are not Codd-normalized and therefore there is redundant information
in them, hence the "star" schema where I suppose (without Googling it)
that you attempt to normalize and/or set up some sort of central
depository or tree where the redundant data can be filtered to and
found easily.

Come back in six months and let us know how you're progressing on your
own. And why chatter about a user's machine being the worst case
scenario when you will be the only user.

Two questions: two answers: I might, and it was an off-topic
discussion about performance monitor counters, see above, now
resolved.
Pirate huh?

Yes, sometimes the best and only kind of software in 'one licensed
copy per country' areas like SE Asia where I'm at now. You pay for
development, since you are in a developed country. I reap the
benefits. Life is unfair.
All three of Sceppa's books are concerned with programming ADO.NET
which is much different from learning how to design, create and
maintain transactional or analytical databases.

'Much different' meaning Codd-normalized vs. non-normalized. Seems
you are impressed by relatively minor distinctions.
regards
A.G.

Regards,

RL
 

Registered User

??? really now. We are not on the same page or you are incompetent.
The sentence above actually makes sense.
Why would it not pass the smell test to have data in memory comprising
10 years of sales? What passes the smell test for you, a week's worth
of sales? Maybe you still code for 1970s era PCs with limited memory?
Can you explain the design that makes it necessary to have all the
files open at once?

I've done this before but on somewhat larger scale with daily sales
data from over five thousand locations. Having the proper tools and
knowing how to use them correctly made the task relatively simple.
There is no need to have all the files open at once.
Nope. Wrong again. Why don't you research why nobody writes articles
for 'research performance monitor counters' in all the usual online
programming tutorial websites?

Because the subject is well beyond the scope of the introductory
tutorials.
Because for 99% of the time you let the
OS handle memory problems. The program expands until the OS throttles
it. Even my very demanding chess program, which is one of the hardest
type programs to write, and written by a team of expert developers,
and which hogs the PC's resources when running, does that. I bet only
1% of programs will "self-throttle" by checking memory.
You're making things up again but that is no surprise.
Too general to be of any use, but thanks for the conversation.
Rules of thumb are generalizations.
I see. So that's the little secret that you transactional database
freaks like to brag about and keep close to your vest. Your databases
are not Codd-normalized and therefore there is redundant information
in them, hence the "star" schema where I suppose (without Googling it)
that you attempt to normalize and/or set up some sort of central
depository or tree where the redundant data can be filtered to and
found easily.
Don't try to explain the hows and whys of OLAP databases without an
understanding of how they work and are used.
Two questions: two answers: I might, and it was an off-topic
discussion about performance monitor counters, see above, now
resolved.
Zero answers.
Yes, sometimes the best and only kind of software in 'one licensed
copy per country' areas like SE Asia where I'm at now. You pay for
development, since you are in a developed country. I reap the
benefits. Life is unfair.
S.E. Asia huh? Yesterday in your "which is better" thread you claimed
Phoenix AZ. Did you move overnight so there would be an excuse for
your thievery?
'Much different' meaning Codd-normalized vs. non-normalized. Seems
you are impressed by relatively minor distinctions.
No, much different as in
programming ADO.NET
versus
designing, creating and maintaining databases

I've tried to point you in the right direction to accomplish your
task. The response shows a snot-nosed, know-it-all attitude. Proceed
with your project as you wish. It's your time, waste it as you wish.

regards
A.G.
 

Arne Vajhøj

Turns out the data is in ASCII text files that are about 250 KB to 400 KB
per transaction period (a day's worth of sales), and each line in the
file is a sale (in the exact same sequence of customer ID, date, sale
price, discount, etc), with about 5000 to 10000 sales per day.

The thing I am trying to do is statistical manipulation on each line of
the data, comparing it to previous days for, say, the same customer ID
(to spot trends in the customer's buying habits).

So I need to do complex covariance and statistical stuff: use home-grown
methods or SQL? I am leaning towards the latter since this does not have
to be "real time" analysis, so speed is not important--the program can
run overnight to find an answer. On the other hand, a dictionary is as
fast as greased lightning--10 ms per lookup as I recall--so perhaps a
custom data structure in memory is best. Or maybe a hybrid approach
(which is maybe best): have your sales transactions in the database,
load select ones into a custom data structure in memory, do your
complex statistical stuff in memory, then output the results either
back to the database or to another place.

You can easily write a ton of code yourself to get this
project off the ground.

Much better to bet on something COTS.

A standard relational database plus a reporting tool.

Or a specialized statistical analysis package like SAS or SPSS.

Arne
 

RayLopez99

On 8/29/2011 2:40 PM, RayLopez99 wrote:

You can easily write a ton of code yourself to get this
project off the ground.

Much better to bet on something COTS.

A standard relational database plus a reporting tool.

Or a specialized statistical analysis package like SAS or SPSS.

Arne

Hey thanks Arne. If you have links handy and *if* these packages and
Off-The-Shelf libraries are free, please do feel free to post them
here on your free time, pun intended.

I'm not doing this commercially so I don't want to pay for anything.
If the OTS stuff is popular I might get a copy from Piratebay.org if I
can find them being seeded, since again this program is for my own
internal use only and not commercial.

RL
 
