PC Review


Reply
Thread Tools Rating: Thread Rating: 2 votes, 1.00 average.

Architecture question on parsing a large text file

 
 
RayLopez99
Guest
Posts: n/a
 
      26th Aug 2011
I have a text file that has five pieces of information in it per
line: ASCII text and numbers delimited by commas and a line break on
each line.

What is the best way to organize this data so I can analyse it? The
files are hundreds of MB large, so loading them all into memory via
XML is impractical (I think--or is it? I have a 32 bit system with 4
GB RAM). Better to read them "serially" using XML? Or better to set
up a database that can be queried?

Which is better--or is there an even better way? I'm leaning towards
a database in SQL Server.

I plan to code in C# using Visual Studio 2010.

RL
 
Reply With Quote
 
 
 
 
Registered User
Guest
Posts: n/a
 
      26th Aug 2011
On Thu, 25 Aug 2011 21:46:58 -0700 (PDT), RayLopez99
<(E-Mail Removed)> wrote:

>I have a text file that has five pieces of information in it per
>line: ASCII text and numbers delimited by commas and a line break on
>each line.
>
>What is the best way to organize this data so I can analyse it?


What is wrong with the way the data is already organized?

>The
>files are hundreds of MB large, so loading them all into memory via
>XML is impractical (I think--or is it? I have a 32 bit system with 4
>GB RAM).


http://lmgtfy.com/?q=how+do+computers+work

>Better to read them "serially" using XML?


XML? You'll have to read the CSV files to create the XML files. Then
you'll have to read the XML files before performing any analysis.
Unless changing the format of the data adds value why bother?

>Or better to set up a database that can be queried?
>
>Which is better--or is there an even better way? I'm leaning towards
>a database in SQL Server.
>

A transactional database might be suitable for layout-led analysis,
If the intent is to extract business intelligence by data-led
discovery use SQL Server Analysis Services and an OLAP database.
Import the CSV data using SQL Server Integration Services. SQL Server
Reporting Services is another valuable tool.

>I plan to code in C# using Visual Studio 2010.


Everything can be done using SQL Server's Business Intelligence Design
Studio.

regards
A.G.
 
Reply With Quote
 
James A. Fortune
Guest
Posts: n/a
 
      26th Aug 2011
On Aug 26, 12:46*am, RayLopez99 <raylope...@gmail.com> wrote:
> I have a text file that has five pieces of information in it per
> line: *ASCII text and numbers delimited by commas and a line break on
> each line.
>
> What is the best way to organize this data so I can analyse it? *The
> files are hundreds of MB large, so loading them all into memory via
> XML is impractical (I think--or is it? *I have a 32 bit system with 4
> GB RAM). *Better to read them "serially" using XML? *Or better to set
> up a database that can be queried?
>
> Which is better--or is there an even better way? *I'm leaning towards
> a database in SQL Server.
>
> I plan to code in C# using Visual Studio 2010.
>
> RL


I would import the CSV file into Excel and then use the 'dynamic'
typing of C# 4.0 to grab the data from the Excel object to put into
SQL Server. It might go along the lines of some modified Microsoft
sample PDC09 code that is almost certain not to work 'as is' for your
situation :-):

dynamic _excel = null;

try
{
_excel = AutomationFactory.CreateObject("Excel.Application");
}
catch {}
_excel.visible = true;
If (null != _excel)
{
dynamic workbook = _excel.workbooks;
workbook.Open("C:\MyXL.xlsm");

dynamic sheet = _excel.ActiveSheet;

//transfer spreadsheet data to SQL Server

}

If possible, try to use Excel Automation (perhaps by assigning a Range
object to an array) to grab the entire sheet of data with one command
for performance. Hopefully, this gets you looking in the right
direction.

James A. Fortune
(E-Mail Removed)
 
Reply With Quote
 
James A. Fortune
Guest
Posts: n/a
 
      26th Aug 2011
On Aug 26, 12:46*am, RayLopez99 <raylope...@gmail.com> wrote:
> I have a text file that has five pieces of information in it per
> line: *ASCII text and numbers delimited by commas and a line break on
> each line.
>
> What is the best way to organize this data so I can analyse it? *The
> files are hundreds of MB large, so loading them all into memory via
> XML is impractical (I think--or is it? *I have a 32 bit system with 4
> GB RAM). *Better to read them "serially" using XML? *Or better to set
> up a database that can be queried?
>
> Which is better--or is there an even better way? *I'm leaning towards
> a database in SQL Server.
>
> I plan to code in C# using Visual Studio 2010.
>
> RL


Also see PDC09 session FT11:

Future Directions for C# and Visual Basic - Luca Bolognese

He shows a demo for reading CSV files. All this is just in case you
can't simply import a CSV file directly into SQL Server :-).

BTW, by serendipitously finding an alternate URL at Microsoft and
getting a propitious alignment of the stars during pre-dawn hours
Pacific time, I was able to download the PDC10 sessions. I'm about
40% done with viewing the PDC09 sessions. Maybe by 2012 I'll be able
to attend the real thing.

James A. Fortune
(E-Mail Removed)
 
Reply With Quote
 
RayLopez99
Guest
Posts: n/a
 
      27th Aug 2011
On Aug 26, 7:11*pm, Registered User <n4...@ix.netcom.com> wrote:
> On Thu, 25 Aug 2011 21:46:58 -0700 (PDT), RayLopez99
>
> <raylope...@gmail.com> wrote:
> >I have a text file that has five pieces of information in it per
> >line: *ASCII text and numbers delimited by commas and a line break on
> >each line.

>
> >What is the best way to organize this data so I can analyse it? *

>
> What is wrong with the way the data is already organized?


I want to grab the data, mash it up with other data, etc etc etc.
Where to store said data--should it be stored on the hard drive as
ASCII text files? Or in a database? If a database I can do stuff
like SQL queries on Linq-to-SQL. If a text or XML file I cannot do
SQL queries but it's easier I guess to code.



>
> >The
> >files are hundreds of MB large, so loading them all into memory via
> >XML is impractical (I think--or is it? *I have a 32 bit system with 4
> >GB RAM). *

>
> http://lmgtfy.com/?q=how+do+computers+work


Save me the work bro. Please tell me how much (many) text files can I
load into 3.6 GB of memory? Please tell me. And btw I don't know how
you can monitor your memory in Visual Studio so that you know when you
are running out of RAM--I just will let the OS figure it out and swap
stuff, etc, but I have a feeling that's not optimal from a user's
point of view. Any pointers as to how to get a program to throttle
itself if it is running out of RAM on a PC the program is loaded on
would be greatly appreciated.

My thanks in advance.

>
> >Better to read them "serially" using XML? *

>
> XML? You'll have to read the CSV files to create the XML files. Then
> you'll have to read the XML files before performing any analysis.
> Unless changing the format of the data adds value why bother?


See the above. I'm thinking (out loud) that perhaps changing the
format might add value. For example I'm thinking of converting the
string to a binary representation so I can manipulate it faster, but
perhaps string/ Stringbuilder is fast enough...I'll have to
experiment. But the main thing is that I think it might be helpful to
get the text data into a database so I can run SQL and/or Linq-to-SQL
queries.

>
> >Or better to set up a database that can be queried?

>
> >Which is better--or is there an even better way? *I'm leaning towards
> >a database in SQL Server.

>
> A transactional database might be suitable for layout-led analysis,
> If the intent is to extract business intelligence by data-led
> discovery use SQL Server Analysis Services and an OLAP database.
> Import the CSV data using SQL Server Integration Services. SQL Server
> Reporting Services is another valuable tool.


OK. Is that **** free? I'm a penniless student for purposes of this
project.

>
> >I plan to code in C# using Visual Studio 2010.

>
> Everything can be done using SQL Server's Business Intelligence Design
> Studio.
>
> regards
> A.G.


Thank you A.G.

RL
 
Reply With Quote
 
RayLopez99
Guest
Posts: n/a
 
      27th Aug 2011
On Aug 26, 8:12*pm, Big Steel <bigonea...@big1aaaaa.com> wrote:
> On 8/26/2011 12:46 AM, RayLopez99 wrote:
>
> > I have a text file that has five pieces of information in it per
> > line: *ASCII text and numbers delimited by commas and a line break on
> > each line.

>
> > What is the best way to organize this data so I can analyse it? *The
> > files are hundreds of MB large, so loading them all into memory via
> > XML is impractical (I think--or is it? *I have a 32 bit system with 4
> > GB RAM). *Better to read them "serially" using XML? *Or better to set
> > up a database that can be queried?

>
> > Which is better--or is there an even better way? *I'm leaning towards
> > a database in SQL Server.

>
> > I plan to code in C# using Visual Studio 2010.

>
> So why don't you just save them as XML files on the hard drive and query
> the XML files using Linq-2-XML? Or if you are using SQL Server 2005 or
> better, you can load the XML into a SQL server table and query the XML
> that way too.
>
> http://msdn.microsoft.com/en-us/libr...v=sql.90).aspx


OK, that's one line of attack, but I'm trying to get all points of
view. Since there will be hundreds of thousands of files per
transaction period, I'm thinking the I/O access will waste too much
time--is SQL faster? That is, if you have 100000 pieces of info in a
database, querying it will be faster (my supposition) than loading
100000 XML (or .CSV, or .txt) files?

BTW thanks for that "Do Factory" book recommendation on Gang of Four
templates--I bought the book, went through the exercises, and it was
useful, though I still claim 95% of the time you will not use 95% of
the Gang of Four templates.

RL
 
Reply With Quote
 
RayLopez99
Guest
Posts: n/a
 
      27th Aug 2011
On Aug 27, 1:30*am, "James A. Fortune" <CDMAPos...@FortuneJames.com>
wrote:
> On Aug 26, 12:46*am, RayLopez99 <raylope...@gmail.com> wrote:
>
>
>
>
>
>
>
>
>
> > I have a text file that has five pieces of information in it per
> > line: *ASCII text and numbers delimited by commas and a line break on
> > each line.

>
> > What is the best way to organize this data so I can analyse it? *The
> > files are hundreds of MB large, so loading them all into memory via
> > XML is impractical (I think--or is it? *I have a 32 bit system with 4
> > GB RAM). *Better to read them "serially" using XML? *Or better to set
> > up a database that can be queried?

>
> > Which is better--or is there an even better way? *I'm leaning towards
> > a database in SQL Server.

>
> > I plan to code in C# using Visual Studio 2010.

>
> > RL

>
> I would import the CSV file into Excel and then use the 'dynamic'
> typing of C# 4.0 to grab the data from the Excel object to put into
> SQL Server. *It might go along the lines of some modified Microsoft
> sample PDC09 code that is almost certain not to work 'as is' for your
> situation :-):
>
> dynamic _excel = null;
>
> try
> {
> * _excel = AutomationFactory.CreateObject("Excel.Application");}
>
> catch {}
> _excel.visible = true;
> If (null != _excel)
> {
> * dynamic workbook = _excel.workbooks;
> * workbook.Open("C:\MyXL.xlsm");
>
> * dynamic sheet = _excel.ActiveSheet;
>
> * //transfer spreadsheet data to SQL Server
>
> }
>
> If possible, try to use Excel Automation (perhaps by assigning a Range
> object to an array) to grab the entire sheet of data with one command
> for performance. *Hopefully, this gets you looking in the right
> direction.
>
> James A. Fortune
> CDMAPos...@FortuneJames.com


This is invaluable "low level" stuff that might come in handy later.
But since my data is in a text file, it might be easier just to read
each file, and import the data into a SQL table, no? If you go the
table route. If you go the XML route you just save to XML. Why
bother with the Excel intermediary step?

RL
 
Reply With Quote
 
Registered User
Guest
Posts: n/a
 
      27th Aug 2011
On Sat, 27 Aug 2011 08:13:51 -0700 (PDT), RayLopez99
<(E-Mail Removed)> wrote:

>On Aug 26, 7:11*pm, Registered User <n4...@ix.netcom.com> wrote:
>> On Thu, 25 Aug 2011 21:46:58 -0700 (PDT), RayLopez99
>>
>> <raylope...@gmail.com> wrote:
>> >I have a text file that has five pieces of information in it per
>> >line: *ASCII text and numbers delimited by commas and a line break on
>> >each line.

>>
>> >What is the best way to organize this data so I can analyse it? *

>>
>> What is wrong with the way the data is already organized?

>
>I want to grab the data, mash it up with other data, etc etc etc.
>Where to store said data--should it be stored on the hard drive as
>ASCII text files? Or in a database? If a database I can do stuff
>like SQL queries on Linq-to-SQL. If a text or XML file I cannot do
>SQL queries but it's easier I guess to code.
>

Why not start a new thread and clearly state what you're trying to do.
The best way to store the data might depend upon the etc. etc.etc.

SQL queries can't be done against an XML file but LINQ-to-XML can be
used to query the document.
>
>>
>> >The
>> >files are hundreds of MB large, so loading them all into memory via
>> >XML is impractical (I think--or is it? *I have a 32 bit system with 4
>> >GB RAM). *

>>
>> http://lmgtfy.com/?q=how+do+computers+work

>
>Save me the work bro. Please tell me how much (many) text files can I
>load into 3.6 GB of memory? Please tell me.


Anywhere from none to ~infinity depending upon the file size. How many
files actually need to be loaded at once?

>And btw I don't know how
>you can monitor your memory in Visual Studio so that you know when you
>are running out of RAM--I just will let the OS figure it out and swap
>stuff, etc, but I have a feeling that's not optimal from a user's
>point of view. Any pointers as to how to get a program to throttle
>itself if it is running out of RAM on a PC the program is loaded on
>would be greatly appreciated.
>


research performance monitor counters

>My thanks in advance.
>
>>
>> >Better to read them "serially" using XML? *

>>
>> XML? You'll have to read the CSV files to create the XML files. Then
>> you'll have to read the XML files before performing any analysis.
>> Unless changing the format of the data adds value why bother?

>
>See the above. I'm thinking (out loud) that perhaps changing the
>format might add value. For example I'm thinking of converting the
>string to a binary representation so I can manipulate it faster, but
>perhaps string/ Stringbuilder is fast enough...I'll have to
>experiment.


Fast enough for what? Functionality that doesn't exist can't be
optimized. With a proper enterprise architecture SOC can be used to
isolate this functionality. Then different implementations of the
functionality can be used to provide performance-related metrics,

> But the main thing is that I think it might be helpful to
>get the text data into a database so I can run SQL and/or Linq-to-SQL
>queries.
>

Consider the volume of data to be queried and how long it will take to
execute a complex query against that data in a transactional database.
For serious data analysis purposes, data management is needed that a
transactional database cannot provide.

>>
>> >Or better to set up a database that can be queried?

>>
>> >Which is better--or is there an even better way? *I'm leaning towards
>> >a database in SQL Server.

>>
>> A transactional database might be suitable for layout-led analysis,
>> If the intent is to extract business intelligence by data-led
>> discovery use SQL Server Analysis Services and an OLAP database.
>> Import the CSV data using SQL Server Integration Services. SQL Server
>> Reporting Services is another valuable tool.

>
>OK. Is that **** free? I'm a penniless student for purposes of this
>project.
>

There is an SQL Server 2008 R2 trial version which is good for a
180-day evaluation period.
http://www.microsoft.com/sqlserver/e...er/try-it.aspx

A good text to follow is Delivering Business Intelligence with MS SQL
2008
http://www.amazon.com/Delivering-Bus.../dp/0071549447

Even if an OLAP database is not used SSIS can be used to gather,
extract, manipulate, and import data into an OLTP database.

regards
A.G.
 
Reply With Quote
 
Arne Vajhøj
Guest
Posts: n/a
 
      28th Aug 2011
On 8/26/2011 8:11 AM, Registered User wrote:
> On Thu, 25 Aug 2011 21:46:58 -0700 (PDT), RayLopez99
> <(E-Mail Removed)> wrote:
>
>> The
>> files are hundreds of MB large, so loading them all into memory via
>> XML is impractical (I think--or is it? I have a 32 bit system with 4
>> GB RAM).

>
> http://lmgtfy.com/?q=how+do+computers+work


So you have learned how to use lmgtfy.

Next exercise: learn when it is relevant to post.

>> Better to read them "serially" using XML?

>
> XML? You'll have to read the CSV files to create the XML files. Then
> you'll have to read the XML files before performing any analysis.
> Unless changing the format of the data adds value why bother?


XML has a DOM in memory structure.

I have never heard about a CSV DOM in memory structure.

Arne
 
Reply With Quote
 
Arne Vajhøj
Guest
Posts: n/a
 
      28th Aug 2011
On 8/27/2011 11:13 AM, RayLopez99 wrote:
> On Aug 26, 7:11 pm, Registered User<n4...@ix.netcom.com> wrote:
>> On Thu, 25 Aug 2011 21:46:58 -0700 (PDT), RayLopez99
>> <raylope...@gmail.com> wrote:
>>> The
>>> files are hundreds of MB large, so loading them all into memory via
>>> XML is impractical (I think--or is it? I have a 32 bit system with 4
>>> GB RAM).

>>
>> http://lmgtfy.com/?q=how+do+computers+work

>
> Save me the work bro. Please tell me how much (many) text files can I
> load into 3.6 GB of memory?


You will only be able to us 2 GB of virtual address space in 32 bit
Windows (and you can do that with 512 MB RAM having 2+ GB of RAM just
make it perform decent - or at least potentially perform decent)

> Please tell me. And btw I don't know how
> you can monitor your memory in Visual Studio so that you know when you
> are running out of RAM--I just will let the OS figure it out and swap
> stuff, etc, but I have a feeling that's not optimal from a user's
> point of view. Any pointers as to how to get a program to throttle
> itself if it is running out of RAM on a PC the program is loaded on
> would be greatly appreciated.


You can watch memory usage from the outside with Windows task manager
performance.

You can watch memory usage from the inside by using the Process class
(in System.Diagnostics).

Arne
 
Reply With Quote
 
 
 
Reply

Thread Tools
Rate This Thread
Rate This Thread:

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Question about parsing a column of cells with a comma in text Larry C Microsoft Excel Programming 3 8th Oct 2007 01:38 PM
Large text file import: MVP question =?Utf-8?B?QXZp?= Microsoft VB .NET 12 8th May 2006 07:28 PM
In file parsing, taking the first few characters of a text file after a readfile or streamreader file read... .Net Sports Microsoft ASP .NET 11 17th Jan 2006 12:44 AM
Help - parsing large text files ArunPrakash Microsoft C# .NET 6 23rd Aug 2004 06:45 PM
Parsing Large Text file Rahul Microsoft C# .NET 1 13th Dec 2003 11:49 PM


Features
 

Advertising
 

Newsgroups
 


All times are GMT +1. The time now is 02:50 PM.