Help with Streams

Jonathan Wood

Greetings,

I'm still pretty new to .NET streams and wondered if someone could offer
some suggestions.

Currently, I'm using StreamReader to read a text file line by line. When
I reach a particular line, I'd like to compute an MD5 hash for the entire
file from that line on, and then I need to continue reading line by line
from the same place I left off.

I see that MD5.ComputeHash() takes a stream object, which I can obtain
using StreamReader.BaseStream. However, because StreamReader uses buffered
reads, the seek position in BaseStream does not represent the current
position in the StreamReader. And since StreamReader has no Position or
Seek members, I don't see how I could return to my previous position.

Note that these files can be around 10MB so I'd prefer not to load the
entire file into memory.

Any suggestions?

Thanks.

Jonathan
 
Jonathan Wood

Hi Peter,
The usual approach, when trying to mix high-level (e.g. StreamReader) with
low-level (Stream) access, is to set the Position (or call Seek()) on the
BaseStream, and then call the StreamReader.DiscardBufferedData() method.

So, the order of operations would be something like this:

-- Use StreamReader to read to the point where you want to start
-- Call DiscardBufferedData() (*)
-- Compute the hash using the BaseStream
-- Set the Position of the BaseStream back to where you left off
-- Start reading from the StreamReader again

(*) The one caveat: I don't know for sure that calling
DiscardBufferedData() will also immediately cause a seek in the
BaseStream. If it does, then you can record your current position from
the BaseStream after calling DiscardBufferedData(). If it doesn't, you'll
need to figure out some way to identify the stream position for the
current reader position. Unfortunately, I don't have time at the moment
to double-check what StreamReader actually does, but hopefully you can
figure that out yourself.

Yeah, it looks like DiscardBufferedData() does not change the position in
the BaseStream. I'm reading a single line of about 35 characters and
BaseStream.Position is 0x400 (obviously, the size of the buffer), which
doesn't change after calling DiscardBufferedData().
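The behavior described above can be reproduced with a small sketch. It is only an illustration: the class name, the temporary file, and its contents are all made up for the demo, and the exact buffer size (and therefore the exact BaseStream.Position) depends on the runtime version.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferedPositionDemo
{
    // Reads one line through StreamReader and reports:
    //   FirstLineBytes - bytes the first line really occupies (text + '\n')
    //   BasePosition   - BaseStream.Position right after ReadLine()
    //   AfterDiscard   - BaseStream.Position after DiscardBufferedData()
    public static (long FirstLineBytes, long BasePosition, long AfterDiscard) Measure()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Hypothetical sample data: 200 short ASCII lines ending in '\n'.
            var sb = new StringBuilder();
            for (int i = 0; i < 200; i++)
                sb.Append("line number ").Append(i).Append('\n');
            File.WriteAllText(path, sb.ToString(), new UTF8Encoding(false));

            using (var reader = new StreamReader(path))
            {
                string first = reader.ReadLine();
                long firstLineBytes = Encoding.UTF8.GetByteCount(first) + 1; // +1 for '\n'

                // The reader has already pulled a whole buffer from the stream,
                // so this is well past the end of the first line.
                long basePosition = reader.BaseStream.Position;

                // Drops the reader's buffered data but does not seek the stream.
                reader.DiscardBufferedData();
                long afterDiscard = reader.BaseStream.Position;

                return (firstLineBytes, basePosition, afterDiscard);
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Calling BufferedPositionDemo.Measure() should show BasePosition far larger than FirstLineBytes, and AfterDiscard equal to BasePosition, which matches what was observed with the 35-character line.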

Since StreamReader has no Position property, I'm not sure how I'd determine
the actual position. I'd have to guess the position given the length of
lines read (files may or may not include a carriage return in the newline,
and there's also some whitespace I'm discarding).

The only other thing I can think of is abandoning StreamReader (but I need
to read line by line), or if there is a way to clone an open stream on the
same file.

Thanks

Jonathan
 
Jonathan Wood

Hi Peter,
Hmmm...well, looking at the StreamReader class in Reflector, it looks to
me as though it doesn't store any state information regarding the current
reader position versus the stream's position. DiscardBufferedData()
simply causes any data retrieved from the stream but not yet read as
characters to be lost.

Yup, that's how it appears.

Unfortunately, thinking about it more I think that you are running into a
more fundamental issue. Consider why StreamReader might not support this
particular kind of usage. In particular, it's actually extremely
inconvenient to maintain a mapping between the reader and stream
positions, and doing so would perform very poorly in any case, because you
would have to decode the bytes to characters one at a time. You could
still buffer the stream data into a byte buffer, but even the overhead of
having to call the decoder one character at a time would be very
noticeable. This is especially problematic for character encodings where
you have variable-length characters (e.g. UTF-8, which is the default
encoding for StreamReader), since for characters longer than one byte, you'd have
to try to decode the one character as many times as there are bytes in the
character (i.e. keep trying until you get a completed character).

Well, the C libraries handled this, although there were some quirks in text
mode, which did stuff like translate \r\n to \n. I assumed the reason it
wasn't supported was because the underlying stream might not know where the
data started or ended, depending on the source. At any rate, it doesn't
support it.

So, you could reimplement a special-purpose reader that supported the
functionality you wanted. It wouldn't even be that hard. But, it would
be at least a little awkward to write, and could perform very poorly as
well. It might perform so poorly that you'd find yourself deciding to
just reimplement the decoder too, so that you can combine the stream i/o
and decoding into a single operation where there is always direct access
to information about the stream position for the current character being
decoded.

That might be a little beyond my current knowledge of the .NET Framework.
I'd hate to implement my own buffering and line routines. It'd probably be
easier to just open the file twice and have my hash routine figure out where
it needs to go.

You aren't specific about how you're using the data, but depending on your
needs you might be able to approach the problem from a different angle and
achieve the same results. In particular, it's not clear from your
question whether you are actually doing something with the characters you
read, or if you are just reading them to find a particular spot in the
file. If it's the latter, then you could actually encode the search
string itself into the bytes representing that string, and then scan the
stream bytes for a matching sequence of bytes. That way, you're never
actually decoding the bytes from the stream at all.
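A minimal sketch of that byte-level search, under some assumptions: the ByteScan class and its IndexOf method are hypothetical names, the caller already knows the file's encoding, and the stream is read from its current position. It uses a simple sliding window rather than anything clever, so it favors clarity over speed.

```csharp
using System;
using System.IO;
using System.Text;

public static class ByteScan
{
    // Encodes `marker` with `encoding` and scans `stream` for that byte
    // sequence without decoding any of the stream's data.
    // Returns the offset of the first match, counted from the position the
    // stream had when the method was called, or -1 if there is no match.
    public static long IndexOf(Stream stream, string marker, Encoding encoding)
    {
        byte[] pattern = encoding.GetBytes(marker);
        if (pattern.Length == 0)
            throw new ArgumentException("marker must not be empty", nameof(marker));

        // Circular window holding the most recent pattern.Length bytes.
        byte[] window = new byte[pattern.Length];
        long read = 0;
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            window[(int)(read % pattern.Length)] = (byte)b;
            read++;
            if (read >= pattern.Length && WindowMatches(window, pattern, read - pattern.Length))
                return read - pattern.Length;
        }
        return -1;
    }

    // Compares the window, whose oldest byte sits at absolute offset `start`,
    // against the pattern.
    private static bool WindowMatches(byte[] window, byte[] pattern, long start)
    {
        for (int i = 0; i < pattern.Length; i++)
        {
            if (window[(int)((start + i) % pattern.Length)] != pattern[i])
                return false;
        }
        return true;
    }
}
```

One caveat with this kind of scan: for multi-byte encodings such as UTF-16, a match could in principle begin in the middle of a character, so a real implementation would also want to check that the match offset is aligned to a character boundary.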

The first line is a header, which contains a hash code on the rest of the
file. I need to verify the hash code. If the code is good, I need to read
the rest of the lines, which I would be doing something with.

At this point, I'm thinking the best approach might be to open the file (not
using StreamReader) and manually parse out the first line and extract the
hash code, and then run the hash on the rest of the file. Then close the
file and reopen it using StreamReader. That's not ideal but pretty
straightforward.

Thanks!

Jonathan
 
Maxwell

Jonathan said:
Hi Peter,


Yup, that's how it appears.


Well, the C libraries handled this, although there were some quirks in
text mode, which did stuff like translate \r\n to \n. I assumed the
reason it wasn't supported was because the underlying stream might not
know where the data started or ended, depending on the source. At any
rate, it doesn't support it.


That might be a little beyond my current knowledge of the .NET
frameworks. I'd hate to implement my own buffering and line routines.
It'd probably be easier to just open the file twice and have my hash
routine figure out where it needs to go.


The first line is a header, which contains a hash code on the rest of
the file. I need to verify the hash code. If the code is good, I need to
read the rest of the lines, which I would be doing something with.

In order to verify the hash code, you'll need to read the entire file
once anyway.

Then if it checks out, you use the bytes in the file after the position
of the end-of-first-line as strings.

You could either:

A)
1. Read the first line of the file, parse the header, and store the
location of the end-of-first-line (EOFL)/beginning-of-bytes (BOB) (EOFL+1)
2. Read the bytes of the file starting from BOB, and compute the
hash (also, considering other data for the hash such as salt from the
header, etc)
3. Compare expected hash value to actual hash value.
4. If it matches, then read the lines from the file,
skipping/discarding the first/header.

B)
1. Read the entire file using StreamReader
2. Discard the first line/header
3. Assuming you know the encoding that you just read, and that
encoding the string to bytes again will produce the same results,
perform a hash compute on the encoded string back to the original
encoding. (using this approach will not guard against collisions if
different byte contents of the file decode to the same string)
4. If the expected hash matches the actual hash, keep the string.
Otherwise, discard it.

At this point, I'm thinking the best approach might be to open the file
(not using StreamReader) and manually parse out the first line and
extract the hash code, and then run the hash on the rest of the file.
Then close the file and reopen it using StreamReader. That's not ideal
but pretty straightforward.

Method A will open the file twice, sure, but method B will allocate
memory for the entire string, which if it is a large file, might not be
what you're looking for. If your hash algorithm is operating on a byte
array and not a stream, however, there might not be many disadvantages
to method B, as you'll have just read all those bytes in to memory for
method A anyway.
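For comparison, method B might look roughly like the sketch below. Everything about the file layout is an assumption made for the example: the header line is taken to be just the MD5 of the body written as hex, and the class and method names are hypothetical. As noted above, it holds the whole file in memory and only works if re-encoding the decoded text reproduces the original bytes.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class MethodB
{
    // Reads the whole file as text, splits off the header line, re-encodes
    // the body, and hashes it. Returns the body text if the hash matches
    // the header, otherwise null.
    // ASSUMPTION: the header is only the hex MD5 of the body bytes.
    public static string ReadIfHashMatches(string path, Encoding encoding)
    {
        string text = File.ReadAllText(path, encoding);

        int newline = text.IndexOf('\n');
        if (newline < 0)
            return null; // no header line at all

        // Trim() also removes a trailing '\r' when lines end in \r\n.
        string expected = text.Substring(0, newline).Trim();
        string body = text.Substring(newline + 1);

        // Re-encode the decoded body; this is the lossy step the thread
        // warns about if the original bytes were not valid for `encoding`.
        byte[] bodyBytes = encoding.GetBytes(body);

        string actual;
        using (var md5 = MD5.Create())
            actual = BitConverter.ToString(md5.ComputeHash(bodyBytes)).Replace("-", "");

        return string.Equals(expected, actual, StringComparison.OrdinalIgnoreCase)
            ? body
            : null;
    }
}
```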
Thanks!

Jonathan

Cheers,

-- Maxwell
 
Jonathan Wood

Hi Peter,
I'm curious...what C library handled this? Are you sure that it did? I'm
not aware of CRT support for variable-length character encodings (the
Microsoft CRT supports Unicode, but as UTF-16 and so it's trivial to map
from character position to byte position and back).

No, I wasn't discussing Unicode issues. I was referring to the fact that
fseek() will work with files opened in text mode. Although, as I mentioned,
there are some quirks. This ability to seek a line-oriented, text stream was
related to the issue being discussed.

If you're dealing with data that is of a fixed-size character encoding,
the problem is much easier. So far, nothing about your question suggests
that's the case.

It probably is, but I don't want to write code that assumes it will always
be.

I don't see how opening the file a second time helps.

I can open it with a seekable stream and determine the position where I need
to start the hash.

Other than the "close the file and reopen" part, that's basically what I
suggested. You shouldn't have to reopen the file. Just read the data for
hashing as a FileStream, and then when you're ready to process the file as
text, reposition the Stream, create your StreamReader and start reading
the text.
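As a rough sketch of that sequence (hash through the FileStream, reposition, then layer a StreamReader on top), something like the following could work. The details are assumptions for illustration only: the header is taken to be nothing but the hex MD5 of the rest of the file, the file is ASCII-compatible (for example UTF-8) with no byte order mark, and the HeaderHashFile class and its method are hypothetical names.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class HeaderHashFile
{
    // Verifies the hash stored in the first line, then hands every
    // remaining line to `onLine`. Returns false (and reads no lines)
    // if the hash does not match.
    public static bool VerifyAndRead(string path, Action<string> onLine)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            // 1. Read the header as raw bytes up to the first line feed, so
            //    the stream position stays exact. Works for \n and \r\n.
            var header = new MemoryStream();
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n')
                header.WriteByte((byte)b);
            long bodyStart = fs.Position; // first byte after the header line

            // ASSUMPTION: header is just the hex digest; Trim() drops any '\r'.
            string expected = Encoding.ASCII.GetString(header.ToArray()).Trim();

            // 2. Hash from the current position to the end of the file.
            //    ComputeHash reads the stream in chunks, not all at once.
            string actual;
            using (var md5 = MD5.Create())
                actual = BitConverter.ToString(md5.ComputeHash(fs)).Replace("-", "");

            if (!string.Equals(expected, actual, StringComparison.OrdinalIgnoreCase))
                return false;

            // 3. Reposition the same stream and only now create the reader,
            //    so its buffer starts exactly at the body.
            fs.Position = bodyStart;
            using (var reader = new StreamReader(fs, Encoding.UTF8))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    onLine(line);
            }
            return true;
        }
    }
}
```

The point of the ordering is that the StreamReader is never asked for its position: all position bookkeeping happens on the FileStream before the reader exists, and the file is opened only once.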

Well, I wasn't even thinking about the details of the implementation yet.
But, yeah, there doesn't appear to be any reason to literally close and
reopen the file.

Jonathan
 
Jonathan Wood

Hi Maxwell,
In order to verify the hash code, you'll need to read the entire file once
anyway.

Then if it checks out, you use the bytes in the file after the position of
the end-of-first-line as strings.

Correct.

You could either:

A)
1. Read the first line of the file, parse the header, and store the
location of the end-of-first-line (EOFL)/beginning-of-bytes (BOB) (EOFL+1)
2. Read the bytes of the file starting from BOB, and compute the hash
(also, considering other data for the hash such as salt from the header,
etc)
3. Compare expected hash value to actual hash value.
4. If it matches, then read the lines from the file,
skipping/discarding the first/header.

I think that's my only choice. Note that I don't want to load the entire
file into memory.

It would be a little easier to process the first line using StreamReader but
it doesn't really seem that this is practical. No big deal, I guess.

Thanks.

Jonathan
 
Jonathan Wood

1. Read the first line of the file, parse the header, and store the
location of the end-of-first-line (EOFL)/beginning-of-bytes (BOB) (EOFL+1)

BTW, this is the step that there doesn't appear to be any direct way to
accomplish. I could look at the length of the first line, but there are
other considerations. For example, is the file Unicode or multibyte, and do
the newlines include carriage returns in addition to line feeds?

Jonathan

