Jon Skeet [C# MVP]
I suppose. I did recently misinterpret an ASCII file as Unicode and
had all sorts of trouble. Thank goodness StreamReader is smarter than
I am.
Right. And then there's the issue when someone sends you a file and
you have to *guess* what encoding it's in. That's only a problem in
XML if they've actually encoded it improperly (i.e. the declaration
doesn't match reality).
Indeed you are right. The standard method for handling embedding
spaces (or even commas in CSVs) is to use quotes or some other
character. Again, Regex handles this very well.
Except that in my experience, there are subtly different rules for
different flavours of CSV. Then you've got the issue of escaping
newlines etc.
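To make the flavour problem concrete, here is a minimal sketch in Python (not the language under discussion, just a convenient one for a short demo). The sample data is made up and assumes RFC 4180-style quoting, which is only one of the flavours in the wild. A naive split breaks on the embedded comma and newline, while a quote-aware reader copes:

```python
import csv
import io

# Hypothetical sample: a header line, then one record whose fields contain
# a comma, a doubled (escaped) quote and an embedded newline.
data = 'name,comment\r\n"Smith, J","said ""hi""\nthen left"\r\n'

# Naive approach: split on line breaks, then on commas.
naive_rows = [line.split(",") for line in data.splitlines()]

# Quote-aware approach: let the csv module apply the quoting rules.
# newline="" stops StringIO from translating the embedded line break.
parsed_rows = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_rows))   # the one data record has been torn apart
print(parsed_rows[1])    # the same record, parsed as two fields
```

Even the quote-aware version only works because the reader's dialect happens to match the file. A file that escapes quotes with backslashes would need different settings.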
It is just the same as not knowing which XML format you are using.
No, it's not. XML has a standard. Where's the official definition for
CSV?
XML can represent the same data in many/unlimited? different ways. If code
is written correctly, the amount of code aware of the data source is
minimal.
I believe you're confusing syntactic format with semantic format. Yes,
you need to understand the semantics - but the syntax is well defined.
You can validate an XML document on its own (or with its DTD) with no
effort. That's not true of CSV - you have to know whether or not to
ignore the first line, how various things are being escaped etc.
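As a rough illustration of checking syntax with no knowledge of what the data means, here is a sketch using Python's standard ElementTree parser. Note the assumption: this only checks well-formedness. ElementTree does not validate against a DTD, so that would need a validating parser. The element names are invented for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, without interpreting the data."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical documents: the second has a bare '&' and a missing </order>.
good = "<orders><order id='1'><item>Tea &amp; biscuits</item></order></orders>"
bad = "<orders><order id='1'><item>Tea & biscuits</item></orders>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

There's no equivalent generic check for CSV, because there's no single grammar to check against.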
Unless you are working with an extremely developer-unfriendly
format, I say it is just as easy to pull data out no matter
what. The only benefit to XML is that someone already did most of the
parsing work for you, via XPath for example. However, writing a custom
parser can be as simple as a regular expression, so why force your input
into XML when SSV or CSV is readily available?
As I keep saying, it depends on *exactly* which flavour of CSV you're
using. Or perhaps it's actually fixed record. Or some horrible mixture
of the two. Maybe there's a header line or maybe there isn't. You have
to change the code to cope.
An XML parser doesn't try to understand the data, but it *will*
validate the syntax and let you get at the data at a higher level, for
free.
They don't "have" to be fixed-length records. There just needs to be
an agreement that there is one record per line. The additional benefit
is that you aren't forced to send a 20MB file if all your user wants
are the child records, which is a common scenario.
So if you know that the record you want is on line 300,354 how do you
get to that quickly, if the lines aren't all the same number of bytes?
You have to read every line in until you get there.
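A small sketch of the difference, assuming an invented 16-byte record width and in-memory files standing in for real ones. With fixed-length records the offset can be computed directly; with one variable-length record per line you have to scan from the start:

```python
import io

RECORD_SIZE = 16  # assumed fixed width in bytes, including the newline

def read_fixed(f, index):
    """Jump straight to record `index`: its offset is index * RECORD_SIZE."""
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b" \n").decode("ascii")

def read_variable(f, index):
    """Lines differ in length, so every earlier line must be read first."""
    f.seek(0)
    for i, line in enumerate(f):
        if i == index:
            return line.rstrip(b"\n").decode("ascii")
    raise IndexError(index)

# Hypothetical records, padded to the fixed width in one file only.
names = ["alpha", "beta", "gamma", "delta"]
fixed_file = io.BytesIO(
    b"".join(n.ljust(RECORD_SIZE - 1).encode("ascii") + b"\n" for n in names)
)
variable_file = io.BytesIO(b"".join(n.encode("ascii") + b"\n" for n in names))

print(read_fixed(fixed_file, 2))
print(read_variable(variable_file, 2))
```

Both return the same record, but the second one does work proportional to the line number.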
Now, XML doesn't try to solve that - but databases do. If you want a
database, use a database IMO.
I'm not arguing with you. I am just pointing out that it is a matter
of the environment, and so far what you have said doesn't entice me to
rely on XML. I still don't see how XML can benefit me *here*.
If you've got a format which already exists and you've already got
code for, there's probably not a lot of point in changing. But you do
seem overly resistant to XML just because it doesn't help in one
particular already-solved situation.
In my environment, we work primarily with CSV and SSV. Relationships
are typically formed on the database or are fixed. Again, XML repeats
the data definition every time. I would call this the biggest waste
generator when using XML. It's just not that practical in my
environment.
If everything you've got already uses CSV, then moving to a different
format is likely to be a pain. That's not the fault of XML at all
though.
Repeating the data definition is a mixed blessing - added space
(although as others have said, it compresses really well) but easier
validation.
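For anyone who wants to measure the space trade-off on their own data, here is a rough sketch using zlib. The rows and element names are invented, and the numbers will vary a lot with real data, so treat it as a way of measuring rather than a result:

```python
import zlib

# Hypothetical data set: 1000 two-column rows.
rows = [(i, f"user{i}") for i in range(1000)]

csv_text = "".join(f"{i},{name}\n" for i, name in rows)
xml_text = "<rows>" + "".join(
    f"<row><id>{i}</id><name>{name}</name></row>" for i, name in rows
) + "</rows>"

csv_raw, xml_raw = csv_text.encode("utf-8"), xml_text.encode("utf-8")
csv_zip, xml_zip = zlib.compress(csv_raw), zlib.compress(xml_raw)

# Compare raw sizes, then compressed sizes.
print(len(csv_raw), len(xml_raw))
print(len(csv_zip), len(xml_zip))
```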
You make it sound like your parser "knows" what to do with what it
extracts.
No, I don't. I make it sound like the parser knows how to do the
*parsing* part, so you don't need to worry about the escaping level.
That's all it *should* do. That's the syntactic part, not the semantic
part.
What does it extract, where does it go? How does it
magically make your code know what to do? At some point you the
developer must know what the data *means*. You need to know where to
find the data. You need to know where to put it. That requires knowing
the syntax *and* the semantics. Unless you have found a way to
overcome this need for developer intervention? Did ya? 'Cause that
would be something I would like to have in my possession!
The developer doesn't need to know enough to write an XML parser -
they just need to know how to deal with the higher level
representation (the DOM or whatever API you're using to read it in).
They don't need to know how to unescape character entities etc - that
will be done for you by the XML parser. There's a clear distinction
between understanding the syntactic format and understanding what the
data's meant to mean.
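A minimal sketch of that division of labour, again with Python's ElementTree and an invented document. The calling code never sees the entities, only the text they stand for:

```python
import xml.etree.ElementTree as ET

# Hypothetical document using entity and character references.
doc = "<note author='A &amp; B'>5 &lt; 7 &#38; 7 &gt; 5</note>"

note = ET.fromstring(doc)

# The parser has already unescaped everything by the time we look.
text = note.text
author = note.get("author")

print(text)
print(author)
```

What the note *means* is still up to the developer, but none of the unescaping is.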
But if you think about it, all of these applications of XML are due to
someone deciding to jump on the XML train. They didn't have to use
XML. What about the "in-between"? Anyone can send a format over a
network. Does XML provide a benefit that couldn't have been achieved
otherwise?
It means I don't need to write something to do all the escaping, and
the other party doesn't need to write a parser - because they're
already there. The standard is agreed by everyone.
or do these applications use XML because it makes their
code seem more up-to-date? or do they use XML because of the pre-built
tools available?
Well, if you include parsers built into things like Java and .NET
framework libraries as "tools" then yes - and that's a very good
reason!
I would venture that someone could create a CSV
standard document format and the tools to work with it that could
rival any XML-based platform.
Only for two dimensional data though - less flexible. You'd then also
need something equivalent to xsd/dtd for extra validation - to
describe that a particular column could only contain numbers, for
example.
By the time you'd added all of that kind of thing - and the ability to
embed one document within another, etc, I think you'd find all the
attractive simplicity of CSV had disappeared.
Create the tools and you will have
people who use your format. Especially if you can convince them that
it is "the way" to do data transfer.
Right, you go ahead. You tell everyone to drop XML and adopt your tool
chain instead. You make it an international standard, and make it
available for just about every development platform in existence. Let
us know how you get on.
When was the last time you had to create your own XML format and use
it in your own code? Are we just working with tools that make our
lives easier, and it is just coincidental that they used XML? I can't
say. I feel as though there is some reason for all the hype. I just
wish I knew how to utilize it.
Yes, we're working with tools to make life easier. XML happens to be a
data format which meets all my needs most of the time (which CSV
wouldn't, by the way - simply because it only deals with two
dimensional data, when I often want hierarchies in a single document)
and which is well supported by tools. That makes it very useful, IMO!
Not perfect for every occasion, of course, but still not something to
be dismissed as just hype.
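As a sketch of what I mean by hierarchies in a single document (the customer/order/item names are invented for illustration), here customers contain orders, which contain items. A single CSV would need either repeated parent columns or several linked files to say the same thing:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: customers contain orders, orders contain items.
doc = """
<customers>
  <customer name="Ann">
    <order id="1"><item>tea</item><item>milk</item></order>
    <order id="2"><item>bread</item></order>
  </customer>
  <customer name="Bob"/>
</customers>
"""

root = ET.fromstring(doc)

# Walk the hierarchy: every item under each customer, whatever the order.
items_by_customer = {
    customer.get("name"): [item.text for item in customer.iter("item")]
    for customer in root.findall("customer")
}

print(items_by_customer)
```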
Jon