Come on guys!! Surely someone here knows something about custom serialization?

Marc Gravell

BinaryFormatter is largely a field (reflection) formatter - and yes: it
tends to include a lot of type information... but that duplication does mean
you might consider compression algorithms such as GZip. Rather than
write a formatter, you can do your own serialization *within* the formatter
by simply implementing ISerializable, and providing GetObjectData and the
custom ctor.
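For concreteness, here's a minimal ISerializable sketch (the DataNode type and the "value" key name are invented for illustration): GetObjectData decides what gets written, and the matching protected ctor rebuilds the object.

```csharp
using System;
using System.Runtime.Serialization;

// Minimal ISerializable sketch; the formatter calls GetObjectData instead
// of blindly reflecting over fields, and the protected ctor handles
// deserialization.
[Serializable]
public class DataNode : ISerializable
{
    public int Value;
    public DataNode Parent; // deliberately not stored - can be reattached later

    public DataNode() { }

    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        // write only what we need; skipping Parent avoids duplicating the chain
        info.AddValue("value", Value);
    }

    // the custom ctor the formatter looks for on deserialization
    protected DataNode(SerializationInfo info, StreamingContext context)
    {
        Value = info.GetInt32("value");
    }
}
```

Note you never call the protected ctor yourself - the formatter finds it by convention.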

However, another option might be the xml-based serializations; done
correctly, these can contain less type information - and you can use
TypeConverter (and in particular, Convert[To|From]InvariantString) within
custom xml serialization (IXmlSerializable). Another possibility (if space
is an issue) is to combine the xml approach with the JSON formatter - or
(perhaps a better option) simply combine regular xml (from XmlSerializer)
with GZip.
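As a sketch of that TypeConverter round-trip (the InvariantText helper name is made up; the converter calls are the real API), as you'd use it inside IXmlSerializable's WriteXml / ReadXml:

```csharp
using System;
using System.ComponentModel;

// Culture-independent string round-trip via TypeConverter.
public static class InvariantText
{
    public static string ToInvariant(object value)
    {
        // every simple type has a default converter registered
        return TypeDescriptor.GetConverter(value.GetType())
                             .ConvertToInvariantString(value);
    }

    public static object FromInvariant(Type type, string text)
    {
        return TypeDescriptor.GetConverter(type)
                             .ConvertFromInvariantString(text);
    }
}
```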

Either way, the formatter will handle *many* (but not all) collection
types - some may be unsuited to this type of use...

Marc
 
Jon Slaughter

I am trying to write a custom method to serialize the class

using System;
using System.Collections;
using System.Collections.Generic;

[Serializable]
public class RTree<T> : IEnumerable<RTree<T>>
{
    public T Value;               // Value at this node
    public RTree<T> Parent;       // Contains the parent node
    public List<RTree<T>> Nodes;  // Container for Nodes

    public RTree() { Nodes = new List<RTree<T>>(); }

    public void Add(RTree<T> node)
    {
        Nodes.Add(node);
        node.Parent = this;
    }

    #region Enumeration
    public IEnumerator<RTree<T>> GetEnumerator()
    {
        for (int i = 0; i < Nodes.Count; i++)
            yield return Nodes[i];
    }

    // Non-generic version
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
    #endregion
}


I am trying to write a custom formatter that will write data out the "old
way". That is, you knew the length of things by definition of whatever
structure you were going to read them into. In this case the formatter is
"generic" - it takes the type being serialized so one can reflect on it and
get its "structure".

The problem is that I have no way of dealing with value types or
enumerable types, AFAIK. For value types I have to handle them manually,
by testing for each one, or somehow use another formatter. I'm not sure if,
say, the BinaryFormatter actually compares the type against each value type
and deals with every one manually? (If so then I suppose I could go that
route.)

The other, more serious, problem is dealing with "collection" types. I have
no idea how to get at the types contained in the collection to parse them.
I'm thinking that I can test against the IEnumerable interface and then use
foreach, but I'm not sure if that will work for all enumerable types (like
a keyed collection).

So how does BinaryFormatter handle dealing with any and every type? Does it
actually manually handle them or is there some method to use such as
reflection? I know reflection can get me the fields that I need but I don't
know how to get anything more than that to solve my problem.

All I want to do is write the above type to disk and save space ;/ My "raw
formatter", which just writes the tree to disk directly (not using the
built-in serialization stuff), gives me a file size of 113 bytes for a
simple "test" tree. The binary formatter is over 3k bytes because it adds
all the type info (and since it's recursive it does it for each "level",
bloating the size for no reason). I imagine when I put this in practice it
will turn a 1M file into 30 to 100 megs or even more (depending on how many
"levels" it has).

In fact, I'd rather it use a lookup table for the type info so types are not
duplicated, and that was one thing I was thinking about trying. (So maybe at
the end of the file there will be a list of the types, and the actual
serialization uses an index into that list instead.)
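That lookup-table idea can be sketched roughly like this (all names here are invented; it's just the shape of it):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Interns each distinct type: the first time a type is seen it gets the
// next index; the data stream then stores only the small index, and the
// table itself is written once (e.g. at the end of the file).
public class TypeTable
{
    private readonly List<string> names = new List<string>();
    private readonly Dictionary<string, int> lookup = new Dictionary<string, int>();

    public int GetIndex(Type type)
    {
        string key = type.AssemblyQualifiedName;
        int i;
        if (!lookup.TryGetValue(key, out i))
        {
            i = names.Count;    // first sighting: assign the next slot
            names.Add(key);
            lookup.Add(key, i);
        }
        return i;
    }

    public void Write(BinaryWriter writer)
    {
        writer.Write(names.Count);
        foreach (string name in names) writer.Write(name);
    }
}
```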

Any ideas?

Thanks,
Jon
 
Marc Gravell

Example of xml vs binary serialization coupled with GZip compression. Note
that I tweaked a few things, and really a List<T> isn't going to cut it
(since the parent can be bypassed); output lengths first for comparison:

Xml 1638
Xml (GZip) 381
Bin 4846
Bin (GZip) 2518

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Runtime.Serialization.Formatters.Binary;
using System.Xml;
using System.Xml.Serialization;
using System.Collections.ObjectModel;
using System.Text;

static class Program
{
    static void Main(string[] args)
    {
        Node<int> tree = new Node<int>();
        Random rand = new Random(123456);
        for (int i = 0; i < 5; i++)
        {
            tree.Add(rand.Next(50));
            tree.Add(rand.Next(50)).Add(rand.Next(50));
            Node<int> node = tree.Add(rand.Next(50));
            node.Add(rand.Next(50));
            node.Add(rand.Next(50));
        }

        byte[] xml, bin, xmlZip, binZip;
        using (MemoryStream ms = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(ms))
            {
                XmlSerializer xs = new XmlSerializer(tree.GetType());
                xs.Serialize(writer, tree);
                writer.Close();
            }
            xml = ms.ToArray();
        }

        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream zip = new GZipStream(ms,
                CompressionMode.Compress, true))
            using (XmlWriter writer = XmlWriter.Create(zip))
            {
                XmlSerializer xs = new XmlSerializer(tree.GetType());
                xs.Serialize(writer, tree);
                writer.Close();
                zip.Close();
            }
            xmlZip = ms.ToArray();
        }

        using (MemoryStream ms = new MemoryStream())
        {
            BinaryFormatter bf = new BinaryFormatter();
            bf.Serialize(ms, tree);
            bin = ms.ToArray();
        }
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream zip = new GZipStream(ms,
                CompressionMode.Compress, true))
            {
                BinaryFormatter bf = new BinaryFormatter();
                bf.Serialize(zip, tree);
                zip.Close();
            }
            binZip = ms.ToArray();
        }
        Console.WriteLine("Xml {0}", xml.Length);
        Console.WriteLine("Xml (GZip) {0}", xmlZip.Length);
        Console.WriteLine("Bin {0}", bin.Length);
        Console.WriteLine("Bin (GZip) {0}", binZip.Length);

        // show the actual xml
        Console.WriteLine();
        Console.WriteLine(Encoding.UTF8.GetString(xml));
        Console.WriteLine();

        // prove we can decompress the Xml GZip
        using (MemoryStream ms = new MemoryStream(xmlZip))
        using (GZipStream zip = new GZipStream(ms,
            CompressionMode.Decompress))
        using (XmlReader reader = XmlReader.Create(zip))
        {
            XmlSerializer xs = new XmlSerializer(typeof(Node<int>));
            Node<int> node = (Node<int>)xs.Deserialize(reader);
            WriteNode(node, 0);
        }
    }

    static void WriteNode<T>(Node<T> node, int indent)
    {
        string prefix = new string(' ', indent * 2);
        Console.WriteLine("{0}Value: {1}", prefix, node.Value);
        foreach (Node<T> child in node.Nodes)
        {
            WriteNode(child, indent + 1);
        }
    }
}
[Serializable, XmlRoot("Nodes")]
public class NodeCollection<T> : Collection<Node<T>>
{
    private readonly Node<T> parent;
    [XmlIgnore]
    public Node<T> Parent { get { return parent; } }

    public NodeCollection(Node<T> parent)
    {
        this.parent = parent;
    }

    protected override void InsertItem(int index, Node<T> item)
    {
        base.InsertItem(index, item);
        this[index].Parent = this.Parent;
    }

    protected override void RemoveItem(int index)
    {
        this[index].Parent = null;
        base.RemoveItem(index);
    }

    protected override void SetItem(int index, Node<T> item)
    {
        this[index].Parent = null;
        base.SetItem(index, item);
        this[index].Parent = this.Parent;
    }
}

[Serializable, XmlRoot("Tree")]
public class Node<T>
{
    [XmlAttribute]
    public T Value { get; set; } // Value at this node

    private Node<T> parent; // Contains the parent node
    [XmlIgnore]
    public Node<T> Parent
    {
        get { return parent; }
        internal set { parent = value; }
    }

    private NodeCollection<T> nodes; // Container for Nodes
    public NodeCollection<T> Nodes
    {
        get
        {
            if (nodes == null) nodes = new NodeCollection<T>(this);
            return nodes;
        }
    }

    public Node<T> Add(T value)
    {
        Node<T> node = new Node<T>();
        node.Value = value;
        Nodes.Add(node);
        return node;
    }

    private static readonly List<Node<T>> Empty = new List<Node<T>>();
    // note still allows foreach ;-p
    public IEnumerator<Node<T>> GetEnumerator()
    {
        if (nodes == null) return Empty.GetEnumerator();
        return nodes.GetEnumerator();
    }
}
 
Marc Gravell

(oh, and you'd want to add parent detection to NodeCollection - i.e. decide
what to do if you add a node that is already parented. And in particular
prevent loops).
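A standalone sketch of those checks, using a deliberately minimal node type (the TreeNode/NodeList names are invented here, not the Node&lt;T&gt;/NodeCollection&lt;T&gt; from the earlier post):

```csharp
using System;
using System.Collections.ObjectModel;

public class TreeNode
{
    public TreeNode Parent;
    public NodeList Children;
    public TreeNode() { Children = new NodeList(this); }
}

public class NodeList : Collection<TreeNode>
{
    private readonly TreeNode owner;
    public NodeList(TreeNode owner) { this.owner = owner; }

    protected override void InsertItem(int index, TreeNode item)
    {
        // parent detection: decide what "already parented" means for you
        if (item.Parent != null)
            throw new InvalidOperationException("Node is already parented.");
        // refuse to insert an ancestor of the owner - that would create a loop
        for (TreeNode a = owner; a != null; a = a.Parent)
            if (ReferenceEquals(a, item))
                throw new InvalidOperationException("Insert would create a cycle.");
        base.InsertItem(index, item);
        item.Parent = owner;
    }

    protected override void RemoveItem(int index)
    {
        this[index].Parent = null;
        base.RemoveItem(index);
    }
}
```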
 
Jon Slaughter

Marc Gravell said:
> Binary formatter is largely a field (reflection) formatter - and yes: it
> tends to include a lot of type information... but the duplication does
> mean that you might consider compression algorithms such as GZip. Rather
> than write a formatter, you can do your own serialization *within* the
> formatter by simply implementing ISerializable, and providing
> GetObjectData and the custom ctor.
>
> However, another option might be the xml-based serializations; done
> correctly, these can contain less type information - and you can use
> TypeConverter (and in particular, Convert[To|From]InvariantString) within
> custom xml serialization (IXmlSerializable). Another possibility (if space
> is an issue) is to combine the xml approach with the JSON formatter - or
> (perhaps a better option) simply combine regular xml (from XmlSerializer)
> with GZip.
>
> Either way, the formatter will handle *many* (but not all) collection
> types - some may be unsuited to this type of use...

Thanks. I'm not sure what I'm going to do. I guess I have to implement all
the cases if I want to write one myself. I suppose for now I can just use
the binary formatter or the xml serializer, and maybe compress as you have
shown with your code.

I think I might try to implement all the value types and then just loop over
the ref types that are enumerable, and hope that covers most of the cases
I'll need. I really don't want to end up with a bloated file, and I'm not
sure how efficient compression will be (since I've read that GZipStream
isn't really that great).

Thanks again,
Jon
 
Marc Gravell

> Thanks. I'm not sure what I'm going to do. I guess I have to implement all
> the cases if I want to write one myself.

Yeuch. That really is the hard way... I don't recommend it.

> I think I might try to implement all the value types and then just loop
> over the ref types that are enumerable

Well, to put things back in you'll probably want IList/IList<T>, or check
(reflection) for an Add... and watch out for string (which is reference-type
and enumerable).
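That test might look something like this (CollectionProbe is an invented name; note the string special case up front):

```csharp
using System;
using System.Collections;
using System.Reflection;

public static class CollectionProbe
{
    public static bool LooksLikeCollection(Type type)
    {
        // string is a reference type AND enumerable - treat it as a leaf
        if (type == typeof(string)) return false;
        if (!typeof(IEnumerable).IsAssignableFrom(type)) return false;
        if (typeof(IList).IsAssignableFrom(type)) return true;
        // fall back to reflection: is there a one-argument Add we could
        // call when putting things back in on deserialization?
        foreach (MethodInfo method in type.GetMethods())
        {
            if (method.Name == "Add" && method.GetParameters().Length == 1)
                return true;
        }
        return false;
    }
}
```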
> I really don't want to end up with a bloated file, and I'm not sure how
> efficient compression will be

My view: first write down (on paper : not here) what is acceptable. Then
benchmark. Otherwise you have no sensible idea of how good / bad it is...

> (since I've read that GZipStream isn't really that great)

Well, it managed to get the file down to sub-25%, and in larger examples
~10% is not uncommon (for xml) in my experience. But without a target size,
it is a pointless debate...

Marc
 
Jon Slaughter

Marc Gravell said:
> Yeuch. That really is the hard way... I don't recommend it.
>
> Well, to put things back in you'll probably want IList/IList<T>, or check
> (reflection) for an Add... and watch out for string (which is
> reference-type and enumerable).
>
> My view: first write down (on paper : not here) what is acceptable. Then
> benchmark. Otherwise you have no sensible idea of how good / bad it is...

From my little tests it's pretty bad. I did not do any testing with xml,
though, and it looks like that works better. The other issue is not size but
speed. I just don't see why it's necessary that I sacrifice both and get
nothing in return except it being much easier to code. (Of course I didn't
know it was going to be so much trouble in the first place, but I kinda feel
I might as well try to write my own now and see where it goes...)
> Well, it managed to get the file down to sub-25%, and in larger examples
> ~10% is not uncommon (for xml) in my experience. But without a target
> size, it is a pointless debate...

Yes, it did seem to do a decent job in your examples, but from what I've
read one would be lucky, in general, to get it below 50%, since it doesn't
take the whole stream into account. Maybe in these cases there's enough
redundancy to make it work.
 