Parsing Data Problem ...

shapper

Hello,

I am trying to parse some data that comes in the following format:

One DataEntry contains N Dimensions. In this case N = 2;

Dimension class contains two properties: Name and Value:

IList<Object> dims = new List<Object>();

foreach (DataEntry entry in feed.Entries) {
  Int32 i = entry.Dimensions.Count; // Equals 2
  foreach (Dimension dim in entry.Dimensions)
    dims.Add(new { Name = dim.Name, Value = dim.Value });
}

When I loop through the dimensions, e.g. "foreach (Dimension dim in
entry.Dimensions)", I get the following:

1st item: ROW 1 with Dim1.Name, Dim1.Value

2nd item: ROW 1 with Dim2.Name, Dim2.Value

3rd item: ROW 2 with Dim1.Name, Dim1.Value

4th item: ROW 2 with Dim2.Name, Dim2.Value

This is why I am having trouble parsing ...


This is because the data should be:

ROW 1: Dim1.Name + Dim1.Value AND Dim2.Name and Dim2.Value

ROW 2: Dim1.Name + Dim1.Value AND Dim2.Name and Dim2.Value

My objective would be to have N columns (2 in this case)

Column 1: Dimension 1

Column 2: Dimension 2

....

And, for example, in Dimension 1 column each row would be an anonymous
object:

{ Name = Dimension.Name, Value = Dimension.Value }

Just like a matrix ...

And I need to use this as follows:

I am doing this in my class constructor.

So I would like to save this data in a property or variable of the
same class so that the methods of the class can access it.

How can I do this?

Thank You,

Miguel
 
shapper

 > I am trying to parse some data that comes in the following format:
 > One DataEntry contains N Dimensions. In this case N = 2;
 > Dimension class contains two properties: Name and Value:
 > IList<Object> dims = new List<Object>();

List<T> where T is System.Object is pointless.  The whole point of using
a generic type like List<T> is so that you can specify a real type.  If
the type is System.Object, then anything can go into the list, and you
have no type safety at all.
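
To illustrate the type-safety point, here is a minimal sketch that swaps the anonymous type for a small named class, so the list can be declared as List<NameValue>. The NameValue class and the Collect helper are hypothetical names, and the Dimension class below is only a stand-in for the real feed type (its Value is assumed to be a string):

```csharp
using System.Collections.Generic;

// Stand-in for the feed's Dimension type (assumption: Value is a string).
public class Dimension
{
    public string Name { get; set; }
    public string Value { get; set; }
}

// Hypothetical named type that replaces the anonymous { Name, Value } object.
public class NameValue
{
    public string Name { get; set; }
    public string Value { get; set; }
}

public static class TypedListExample
{
    // Copies each dimension into a strongly typed list.
    public static List<NameValue> Collect(IEnumerable<Dimension> dimensions)
    {
        var result = new List<NameValue>();
        foreach (Dimension dim in dimensions)
        {
            result.Add(new NameValue { Name = dim.Name, Value = dim.Value });
        }
        return result;
    }
}
```

With a declared element type, the compiler rejects anything that is not a NameValue, and callers can read list[0].Name without casting.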
 > foreach (DataEntry entry in feed.Entries) {
 >    Int32 i = entry.Dimensions.Count;  // Equals 2
 >    foreach (Dimension dim in entry.Dimensions)
 >      dims.Add(new { Name = dim.Name, Value = dim.Value });
 > }
 > [...]
 > This is because the data should be:
 >    ROW 1: Dim1.Name + Dim1.Value AND Dim2.Name and Dim2.Value
 >    ROW 2: Dim1.Name + Dim1.Value AND Dim2.Name and Dim2.Value

I assume you mean, for example, "Dim1.Name + Dim1.Value AND Dim2.Name +
Dim2.Value", whatever that means.

It's hard to say for sure, but it looks to me as though you really
should be using a projection and then a GroupBy().

For example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace TestGroupByRowNumber
{
     class Program
     {
         class Dimension
         {
             public string Name { get; set; }
             public int Value { get; set; }

             public Dimension(string name, int value)
             {
                 Name = name;
                 Value = value;
             }
         }

         class DataEntry
         {
             public List<Dimension> Dimensions { get; private set; }

             public DataEntry(IEnumerable<Dimension> dimensions)
             {
                 Dimensions = new List<Dimension>(dimensions);
             }
         }

         static void Main(string[] args)
         {
             List<DataEntry> entries = new List<DataEntry>
             {
                 new DataEntry(new Dimension[] { new Dimension("A", 1), new Dimension("B", 2) }),
                 new DataEntry(new Dimension[] { new Dimension("C", 3), new Dimension("D", 4) }),
                 new DataEntry(new Dimension[] { new Dimension("E", 5), new Dimension("F", 6) }),
                 new DataEntry(new Dimension[] { new Dimension("G", 7), new Dimension("H", 8) })
             };

             var result = entries.SelectMany((entry, i) =>
               entry.Dimensions.Select(dim => new { Row = i, Name = dim.Name, Value = dim.Value }))
               .GroupBy(a => a.Row, (row, g) =>
                   {
                       StringBuilder sb = new StringBuilder();

                       sb.AppendFormat("Row: {0}", row);
                       foreach (var a in g)
                       {
                           sb.AppendFormat(";{0}+{1}", a.Name, a.Value);
                       }

                       return sb.ToString();
                   });

             foreach (string str in result)
             {
                 Console.WriteLine(str);
             }

             Console.ReadLine();
         }
     }

}

The above just emits an IEnumerable<string>, but of course you can
construct the GroupBy() result selector to return whatever kind of
object that aggregates the data the way you want.
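
For instance, a result selector that builds a typed row object instead of a string might look like the sketch below. RowData and GroupByRows are hypothetical names, and Dimension and DataEntry are simplified versions of the classes in the sample above:

```csharp
using System.Collections.Generic;
using System.Linq;

public class Dimension
{
    public string Name { get; set; }
    public int Value { get; set; }
}

public class DataEntry
{
    public List<Dimension> Dimensions { get; set; }
}

// Hypothetical row type: the row index plus that row's dimensions, in order.
public class RowData
{
    public int Index { get; set; }
    public List<Dimension> Columns { get; set; }
}

public static class GroupByRows
{
    public static List<RowData> Build(IEnumerable<DataEntry> entries)
    {
        return entries
            // Flatten, tagging every dimension with the index of its entry.
            .SelectMany((entry, i) => entry.Dimensions.Select(dim => new { Row = i, Dim = dim }))
            // Regroup by that index and project each group into a RowData.
            .GroupBy(a => a.Row, (row, g) => new RowData
            {
                Index = row,
                Columns = g.Select(a => a.Dim).ToList()
            })
            .ToList();
    }
}
```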

All that said, it's not clear why the LINQ would be preferable to just
enumerating the data and building up whatever data you want.  I find the
LINQ less readable than something more straight-forward, such as:

             StringBuilder sb = new StringBuilder();
             List<string> output = new List<string>(entries.Count);

             for (int index = 0; index < entries.Count; index++)
             {
                 DataEntry entry = entries[index];

                 sb.AppendFormat("Row: {0}", index);

                 foreach (Dimension dim in entry.Dimensions)
                 {
                     sb.AppendFormat("; {0}+{1}", dim.Name, dim.Value);
                 }

                 output.Add(sb.ToString());
                 sb.Remove(0, sb.Length);
             }

 > My objective would be to have N columns (2 in this case)
 >    Column 1: Dimension 1
 >    Column 2: Dimension 2

So, where above I construct a string, you should just create an instance
of whatever your "row" object is, with each column populated by the data
in the appropriate Dimension object.

 > And, for example, in Dimension 1 column each row would be an anonymous
 > object:
 >    { Name = Dimension.Name, Value = Dimension.Value }
 > Just like matrix ...

Using anonymous types in your row object is going to severely limit the
utility of the output.  You can do it using the techniques described in
this post and in the other thread, but if you're dealing with a data
table, having anonymous data inside it is going to be awkward unless you
can do everything in a single method.

Doing the initialization in a constructor and then expecting it to be
useful later on is going to be a real headache.  C# doesn't have type
inference for generic types yet.  :)

Pete
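
As a sketch of what that means in practice: a field has to be declared with a type you can name, which is why an anonymous type built in the constructor cannot be kept around for other methods. Below, a hypothetical NameValue class takes its place; ReportData and GetColumn are likewise made-up names for illustration only.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical named replacement for the anonymous { Name, Value } object.
public class NameValue
{
    public string Name { get; set; }
    public string Value { get; set; }
}

public class ReportData
{
    // One inner list per row; each element is one column of that row.
    private readonly List<List<NameValue>> rows;

    // The constructor copies the source once so later methods can reuse it.
    public ReportData(IEnumerable<IEnumerable<NameValue>> source)
    {
        rows = source.Select(row => row.ToList()).ToList();
    }

    public int RowCount
    {
        get { return rows.Count; }
    }

    // Returns the values of one column across all rows.
    public List<string> GetColumn(int column)
    {
        return rows.Select(row => row[column].Value).ToList();
    }
}
```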

Hello Pete,

I only tried anonymous types after trying other options.

The first one I tried, and the one that seemed most logical, was an Array
and then a Dictionary ... With arrays I always run into the problem of not
being able to use an .Add method.

I was looking at this and I think I can give a better example.
Consider the following:

foreach (Row row in data.Rows) {

  foreach (DimensionColumn dimension in data.Dimensions) {
  }

  foreach (MetricColumn metric in data.Metrics) {
  }

}

Each DimensionColumn has 2 properties: Name and Type
Each MetricColumn has 2 properties: Name and Value

All rows have the same number of Dimension Columns and Metric Columns.

I would like to create some kind of data variable that would contain
all these values.

Dictionary, Array, etc?

Consider I have two Rows, One Dimension Column and Two Metrics
columns.

ROW 1:

DIMENSION           METRIC 1             METRIC 2
DIM.Name,DIM.Type   MET1.Name,MET1.Val   MET2.Name,MET2.Val

ROW 2:

DIMENSION           METRIC 1             METRIC 2
DIM.Name,DIM.Type   MET1.Name,MET1.Val   MET2.Name,MET2.Val

Basically that is it ...

Of course I could have 4 dimensions and 10 metrics.

Thank You,
Miguel
 
shapper

On 11/2/10 5:27 PM, shapper wrote:
[...]

I am trying to use Dictionary. Does it make sense?

My idea would be something like:

Dictionary<String, IDictionary<String, String[]>>

So the first String is "Metric" or "Dimension". This would separate
both types.

Consider just the code for Metric.

IDictionary<String, IDictionary<String, String[]>> data =
  new Dictionary<String, IDictionary<String, String[]>>();
data.Add("Metrics", new Dictionary<String, String[]>());

foreach (DataEntry entry in feed.Entries) {
  foreach (Metric metric in entry.Metrics) {
    data["Metrics"].Add( // ...
  }
}

Inside the Metric loop I would like to check if the metric's name
(metric.Name) exists in data["Metrics"].

If it does not exist, then add a new entry to data["Metrics"] where:

the String key is metric.Name and the String[] has one value:
metric.Value;

If data["Metrics"] already contains the key metric.Name, then:

Just add metric.Value to the existing array.

How can I do this?

Thank You,
Miguel
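
One way to handle the "add if missing, otherwise append" step is sketched below. It assumes the inner collection is a List<string> rather than a String[], because an array has a fixed length and no Add method. The ColumnStore class and its Append method are hypothetical names.

```csharp
using System.Collections.Generic;

public static class ColumnStore
{
    // Appends value to the list stored under key, creating the list on first use.
    public static void Append(IDictionary<string, List<string>> columns, string key, string value)
    {
        List<string> values;
        if (!columns.TryGetValue(key, out values))
        {
            // First time this key is seen: start a new list for it.
            values = new List<string>();
            columns.Add(key, values);
        }
        values.Add(value);
    }
}
```

Inside the metric loop this could be called as ColumnStore.Append(metrics, metric.Name, metric.Value), and ToArray() on the list yields a String[] wherever one is needed.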
 
shapper

 > [...]
 > Consider I have two Rows, One Dimension Column and Two Metrics
 > columns.
 > ROW 1:
 > DIMENSION              METRIC 1                     METRIC 2
 > DIM.Name,DIM.Type  MET1.Name,MET1.Val   MET2.Name,MET2.Val
 > ROW 2:
 > DIMENSION              METRIC 1                     METRIC 2
 > DIM.Name,DIM.Type  MET1.Name,MET1.Val   MET2.Name,MET2.Val
 > [...]
 > I am trying to use Dictionary. Does it make sense?
 > My idea would be something like:
 > Dictionary<String, IDictionary<String, String[]>>
 > So the first String is "Metric" or "Dimension". This would separate
 > both types.
 > [...]
 > How can I do this?

I guess I don't really understand the point of the dictionaries.  In
your previous post, you wrote:

 > Consider I have two Rows, One Dimension Column and Two Metrics
 > columns.
 >
 > ROW 1:
 >
 > DIMENSION              METRIC 1                     METRIC 2
 > DIM.Name,DIM.Type  MET1.Name,MET1.Val   MET2.Name,MET2.Val
 >
 > ROW 2:
 > DIMENSION              METRIC 1                     METRIC 2
 > DIM.Name,DIM.Type  MET1.Name,MET1.Val   MET2.Name,MET2.Val

Why are all the data not just stored in each row object (whatever that
might be)?  For example:

   class RowObject
   {
     public Dimension Dimension { get; set; }
     public Metric[] Metrics { get; set; }
   }

Or something like that.

If you do see a need to use dictionaries, why have a dictionary of
dictionaries?  Is there ever actually going to be more than just the
"dimension" and "metric" dictionaries?  If not, why not just have two
variables, one for each kind of dictionary?

Pete
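
A rough sketch of the two-variable alternative, using hypothetical names (AnalyticsColumns, AddDimension, AddMetric, GetDimension, GetMetric) and assuming all values are strings:

```csharp
using System.Collections.Generic;

public class AnalyticsColumns
{
    // One dictionary per kind of column, instead of a dictionary of dictionaries.
    private readonly Dictionary<string, List<string>> dimensions =
        new Dictionary<string, List<string>>();
    private readonly Dictionary<string, List<string>> metrics =
        new Dictionary<string, List<string>>();

    public void AddDimension(string name, string value) { Add(dimensions, name, value); }
    public void AddMetric(string name, string value) { Add(metrics, name, value); }

    // Both getters throw KeyNotFoundException for a name that was never added.
    public IList<string> GetDimension(string name) { return dimensions[name]; }
    public IList<string> GetMetric(string name) { return metrics[name]; }

    private static void Add(Dictionary<string, List<string>> target, string name, string value)
    {
        if (!target.ContainsKey(name))
        {
            target.Add(name, new List<string>());
        }
        target[name].Add(value);
    }
}
```

The trade-off is that the kind of column is checked by the compiler (a misspelled "Metrics" key cannot happen), at the cost of one more field for each additional kind of data.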

Let me explain. This data is coming from the Google Data API, in this case
Google Analytics. Google delivers this data in an XML format with
much more information than this.

The C# library that Google supports transforms each XML node into
an Entry (Row), and for each row I have N Metrics and M Dimensions.

This format is quite strange, since what matters is having all the row
values of one metric to display in a chart.

I have a Service that, in its constructor, authenticates to Google and
runs a query that gets the data. Then I want to transform the data as
described (by column) and hold it in a private field of the service
class. Then each method will use part of the data. For example:
GetVisits will return all the values from the Dimension "Date" and from
the Metric "Visit".

So I would access:

String[] dates = data["Dimensions"]["Dates"];
String[] visits = data["Metrics"]["Visits"];

Does this make sense?
 
shapper

On 11/5/10 11:10 AM, shapper wrote:
[...]

Hello,

Did my previous explanation help?

I created the following code, which is working:

IDictionary<String, IDictionary<String, IList<String>>> data =
  new Dictionary<String, IDictionary<String, IList<String>>>();
data.Add("Dimensions", new Dictionary<String, IList<String>>());
data.Add("Metrics", new Dictionary<String, IList<String>>());

foreach (DataEntry entry in feed.Entries) {

  foreach (Dimension dimension in entry.Dimensions) {
    if (!data["Dimensions"].ContainsKey(dimension.Name))
      data["Dimensions"].Add(dimension.Name, new List<String>());
    data["Dimensions"][dimension.Name].Add(dimension.Value);
  }

  foreach (Metric metric in entry.Metrics) {
    if (!data["Metrics"].ContainsKey(metric.Name))
      data["Metrics"].Add(metric.Name, new List<String>());
    data["Metrics"][metric.Name].Add(metric.Value);
  }

}

Then I could, for example, access a metric as follows: data["Metrics"]["Visits"]

What do you think about my code?

If you have any suggestions, just let me know.

Thank You,
Miguel
 
shapper

 > So, does that mean that even though your examples all have just one
 > dimension in a row, a given row could have multiple dimensions?

Yes, it could have 10 dimensions and 20 metrics.

I wrote at the start that it could have N dimensions ... But I admit that
my post was a little confusing, as I have been trying to figure this out
along the way.

 > If so, does that also mean that the metrics within that row are each
 > associated with one specific dimension in the row?

Yes, one row (i.e. one XML node) contains the value for each dimension and
for each metric.
[...]

 > It seems fine to me, but I still (as in my previous reply) don't
 > understand why you want to store the dimension and metric dictionaries
 > in another dictionary.  Why not just have two variables, referencing the
 > dictionaries for dimension and metric?

Yes, that seems logical. But Dimensions and Metrics are not all there
is. I think (I haven't read much about this part) that there are
aggregate stats and others.

So instead of adding more fields I just have one Dictionary.
To me that seems better ... But again, this is just a personal opinion.

 > The other thing I still don't understand is that your description seems
 > to imply that there would be some correlation between dimension and
 > metric, but the above code doesn't address that at all.  Are metrics not
 > associated with dimensions in any way?

Yes, when you send this query:
Dimensions = "ga:date",
Metrics = "ga:visits,ga:bounces",

You will get Visits and Bounces by Date.
More metrics and dimensions can be added, but in the end the
number of rows will be the same for all of them.

So for each XML node, i.e. each Data Entry, I will have one date (Dimension
ga:date) and two metrics (ga:visits and ga:bounces).
And this will repeat for each Data Entry.

What is confusing, at least for me, is that I would expect to get 3
vectors: one for ga:date, one for ga:visits and one for ga:bounces.

Wouldn't you?

But now I have figured this out, and it makes some sense since this is
coming from an XML file.
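
If the goal is one vector per column, the row-oriented entries can also be projected into columns on demand with LINQ, without building a dictionary first. The sketch below uses simplified, hypothetical stand-ins for the library's entry types (Entry, Field) and assumes every row contains each requested name exactly once:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a dimension or metric: a name and a string value.
public class Field
{
    public string Name { get; set; }
    public string Value { get; set; }
}

// Hypothetical stand-in for one data entry (one row of the feed).
public class Entry
{
    public List<Field> Dimensions { get; set; }
    public List<Field> Metrics { get; set; }
}

public static class Columns
{
    // One value per row for the named dimension, in row order.
    public static string[] Dimension(IEnumerable<Entry> entries, string name)
    {
        return entries.Select(e => e.Dimensions.Single(d => d.Name == name).Value).ToArray();
    }

    // One value per row for the named metric, in row order.
    public static string[] Metric(IEnumerable<Entry> entries, string name)
    {
        return entries.Select(e => e.Metrics.Single(m => m.Name == name).Value).ToArray();
    }
}
```

Single() throws if a row is missing the name or contains it twice, which makes a violated assumption visible instead of silently misaligning the vectors.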
 
