[...]
I didn't investigate managed alternatives, nor did I measure
their performance. I also haven't abstracted the i/o out very well, so it
may be awkward to replace memory mapping with FileStream and measure
how it performs. I think I suffer from some kind of "premature
optimization" syndrome.
That could be. It's a common enough problem. I find myself occasionally
*paralyzed* by it, when I get stuck trying to decide on the optimal
solution and fail to make any progress toward ANY solution.
Well, you can pass hints about your usage of the file to the memory-mapping
function, so I think that the OS caches it appropriately.
It will cache as best it can. But if you are jumping around the file,
the OS simply cannot correctly predict what to buffer for you. This is
especially bad when going backwards in the file.
Note that with respect to the hints you can provide, the docs say that
performance is best when you are accessing the file sequentially and
sparsely, and you provide the sequential-access hint. That doesn't mean that
if you provide some other hint and are accessing the file differently, you
will get similar performance.
The OS doesn't know what you're doing with the file, and it has no way to
predict when you might go backwards in the file. If you are accessing the
file in a sparse manner (as it appears you may be), then you may find that
often when you go backwards, that data hasn't been read yet. Going back
just one byte might incur another disk read.
It's hard to say for sure without all the details...I'm just pointing out
that these caching issues exist whether you're using a memory-mapped file or
just reading normally.
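For what it's worth, in .NET those hints are exposed right on the FileStream
constructor (they correspond to the flags you'd pass to CreateFile in
Win32). A minimal sketch, with an obviously made-up file name:

using System.IO;

class HintSketch
{
    static void Main()
    {
        // FileOptions.SequentialScan corresponds to
        // FILE_FLAG_SEQUENTIAL_SCAN, FileOptions.RandomAccess to
        // FILE_FLAG_RANDOM_ACCESS. Either way, it is only a hint to
        // the cache manager, not a guarantee.
        using (FileStream fs = new FileStream(
            "data.bin", FileMode.Open, FileAccess.Read, FileShare.Read,
            64 * 1024, FileOptions.RandomAccess))
        {
            int b = fs.ReadByte();  // reads are cached per the hint, at best
        }
    }
}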
[...]
Yes, the one pure advantage of FileStream I see so far is that it enables
file access at any offset, so the tearing problem can be prevented. The
tearing problem arises because you have to map the file at offsets aligned
to an allocation block boundary. But that would not matter much if I knew
that I could solve the decoding problems reliably.
From this, I think that I may still not fully understand the question.
It is true that a file must be mapped to an aligned memory address. But
this should only affect the virtual address used to locate the file in the
virtual address space. That is, the first byte of the file will be on an
aligned address, but the rest of the file is contiguous from there.
Likewise, even if you are mapping sections of the file into different
virtual address locations (why? is this to allow more of the file to be
mapped in spite of virtual address space fragmentation?), resulting in those
sections of the file having to each be aligned, you can still access the
virtual address for the data in a byte-wise fashion.
All that the alignment requirement affects is where the data winds up in
virtual memory. I don't see how it affects your access of the data.
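Just to illustrate the arithmetic I mean (assuming Win32's MapViewOfFile and
its usual 64K allocation granularity; in real code you'd get the granularity
from GetSystemInfo, and the offsets here are made up):

class AlignmentSketch
{
    // MapViewOfFile requires the file offset of a view to be a
    // multiple of the system allocation granularity (typically 64K).
    const long Granularity = 64 * 1024;

    static void Main()
    {
        long desiredOffset = 1234567;       // the byte you actually want

        // The view itself has to start on an aligned boundary...
        long viewOffset = desiredOffset - (desiredOffset % Granularity);

        // ...but your byte is still in the view, at a fixed delta from
        // its base. Access remains byte-wise; only the base address of
        // the mapped section is constrained.
        long delta = desiredOffset - viewOffset;

        System.Console.WriteLine("map at {0}, read at base + {1}",
            viewOffset, delta);
    }
}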
Now, that said, there do seem to be one or two different issues related to
this. I say "one or two" because they are either the exact same problem or
not, depending on how you look at it.

That is, the inability to map the
entire file to a single contiguous section of your virtual address space at
once. This causes the secondary problem that you may have to jump from one
spot in virtual memory to another as you traverse (forward or backward) the
data in the file. It also may limit how much data you can have mapped at
once.
Judging from this:
That's what I wrote, except for the part about "looking for a means
around". Well, it depends on what you mean by this, but I'd rather not
abandon memory mapping. So I am not looking for a "means around memory
mapping" but rather: living within memory mapping's walls, how can I solve
the "tearing" problem?
I'm guessing that both of those issues are really just the same problem for
you. That is, that you have to address the file using non-contiguous
pointers.
Certainly it is.

That's how I wanted to implement the fallback buffer. Each time I detect a
"torn" char, I reposition the file pointer, probe bytes backward until I
find a valid char, and provide a replacement.
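Roughly this shape (just a skeleton of the idea: the names are made up, and
this version only substitutes U+FFFD; the backward probe would hook in where
the comment says):

using System.Text;

// Skeleton of a custom decoder fallback. Wire it up with e.g.:
//   Encoding enc = Encoding.GetEncoding("utf-8",
//       EncoderFallback.ReplacementFallback, new TornCharFallback());
class TornCharFallback : DecoderFallback
{
    public override int MaxCharCount { get { return 1; } }

    public override DecoderFallbackBuffer CreateFallbackBuffer()
    {
        return new TornCharFallbackBuffer();
    }
}

class TornCharFallbackBuffer : DecoderFallbackBuffer
{
    bool _pending;

    public override bool Fallback(byte[] bytesUnknown, int index)
    {
        // This is where I would reposition the FileStream, probe bytes
        // backward until a valid char decodes, and return that instead.
        // For now, just queue up the replacement character.
        _pending = true;
        return true;
    }

    public override char GetNextChar()
    {
        if (!_pending)
            return '\0';
        _pending = false;
        return '\uFFFD';
    }

    public override bool MovePrevious() { return false; }

    public override int Remaining { get { return _pending ? 1 : 0; } }
}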
Perhaps you could clarify under what situation you "detect a 'torn' char".
That is, it's unclear to me whether you are referring to simply jumping into
an offset that's not the start of a character, or if this somehow
specifically relates to the sectioning of the file caused by your memory
mapped i/o.
The former would be an issue even if you could map the entire file to a
single contiguous virtual address range. The latter is obviously only an
issue because of the sectioning of the file. I'm confused as to which it
is.
[...]
Yeah, nice solution. Even though the performance hit may be noticeable, if
I restrict these operations to fallback times only and extend my index
structure to cache "torn" characters, I should not need to execute that
code very often. Seems good to me. Yet ... :-( Can I be sure that the
decoder cannot mistake characters?
Well, as I mentioned...I can't help you with that question.

That
depends on the nature of the data you're decoding, and I don't know enough
to be able to answer that.
Well, yes, please. If you are able to show me how to solve that, then I
can mix memory mapping with direct file access at fallback times and be
perfectly happy.
Okay, let's see if this makes sense. First, keep in mind that my comment
was assuming a general solution to the file i/o problem. I think you should
be able to apply it as a "fallback" solution, but it may or may not be
better than just falling back to reading a few bytes at a time if that's
your approach.
Also, keep in mind that this is just a simple example of what I mean. I
don't mean to imply that this would be the best implementation...just that
it's a sample of the general idea.
Finally, keep in mind that this doesn't remove the issue of sectioning the
file. It just abstracts it out a bit. I didn't realize before that you may
be trying to get rid of the whole issue of having to jump your data reads
from one block of memory to another, so the idea I proposed may be exactly
the opposite of what you're looking for.
That said:
What I meant was that you can read from the file a few blocks at a time,
keeping that buffer centered on where you are currently accessing. You'll
need to keep track of:
-- current file offset
-- an array of blocks read from the file
-- the file offsets those blocks came from
-- the current block
The general idea is to maintain the array of blocks such that there is an
odd number of blocks, at least three, and they are centered on the current
offset within the file you're reading. Normally, you'll be reading from the
middle block. If you skip over to another block, you drop one block from
the far end of the array, and read another, adding it to the near end of the
array.
Basically, you're windowing the file in a fixed set of buffers. If you read
new data asynchronously to your use of the data in the buffers you currently
have, then when you drop a block at one end and fill it for use at the other
end, the file i/o can happen while you're still processing the data that you
do have.
Obviously, if you jump to a completely different point in the file, you'll
have to wait for the surrounding data to be read, but that's an issue
whether you're using memory-mapped files or just reading directly with a
FileStream.
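To make the shape of that concrete, here is a rough sketch (all the names
are made up, the reads are synchronous for brevity, and the eviction policy
is the crudest thing that works; the asynchronous refill I described is
left out):

using System;
using System.IO;

// A fixed set of blocks windowing the file. Reads land in whichever
// block covers the requested offset; when none does, the block
// farthest from the request is dropped and refilled -- the "drop from
// the far end, add to the near end" step.
class BlockWindow
{
    const int BlockSize = 64 * 1024;
    const int BlockCount = 3;              // odd, so one block is the middle

    readonly FileStream _file;
    readonly byte[][] _blocks = new byte[BlockCount][];
    readonly long[] _offsets = new long[BlockCount];  // file offset per block

    public BlockWindow(FileStream file)
    {
        _file = file;
        for (int i = 0; i < BlockCount; i++)
        {
            _blocks[i] = new byte[BlockSize];
            _offsets[i] = -1;              // nothing loaded yet
        }
    }

    public byte this[long fileOffset]
    {
        get
        {
            long blockStart = fileOffset - (fileOffset % BlockSize);
            int slot = FindSlot(blockStart);
            if (slot < 0)
                slot = Load(blockStart);   // synchronous here; async in practice
            return _blocks[slot][fileOffset - blockStart];
        }
    }

    int FindSlot(long blockStart)
    {
        for (int i = 0; i < BlockCount; i++)
            if (_offsets[i] == blockStart)
                return i;
        return -1;
    }

    int Load(long blockStart)
    {
        // Evict the loaded block farthest from the one we need.
        int victim = 0;
        long worst = -1;
        for (int i = 0; i < BlockCount; i++)
        {
            long distance = _offsets[i] < 0
                ? long.MaxValue            // empty slots are used first
                : Math.Abs(_offsets[i] - blockStart);
            if (distance > worst)
            {
                worst = distance;
                victim = i;
            }
        }

        _file.Seek(blockStart, SeekOrigin.Begin);
        int read = _file.Read(_blocks[victim], 0, BlockSize);
        Array.Clear(_blocks[victim], read, BlockSize - read);  // zero past EOF
        _offsets[victim] = blockStart;
        return victim;
    }
}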
[...]
After a little further thought, I've found that this is the most
significant question. But let me rephrase what you wrote: it is no problem
to find characters when reading a byte sequence forward, and every sane
encoding must adhere to this in order to be usable. But is it the same
when looking backward?
I still don't know.

I suspect that it is, because it was true with the
basic MBCS I've seen. But I also realize that there are a LOT of different
ways to encode text, and some may be context-sensitive.
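UTF-8 happens to be a concrete example where backward scanning does work:
continuation bytes always have the form 10xxxxxx, so from any offset you
can step back at most three bytes to find the start of the character.
Whether your encoding has an equivalent property is exactly the question.
A quick sketch:

class Utf8Boundary
{
    // Assuming well-formed UTF-8: continuation bytes are 10xxxxxx, so
    // anything else (an ASCII byte or a lead byte) starts a character.
    static int FindCharStart(byte[] data, int index)
    {
        int steps = 0;
        // A character is at most 4 bytes long, hence at most 3 steps back.
        while (index > 0 && (data[index] & 0xC0) == 0x80 && steps++ < 3)
            index--;
        return index;
    }

    static void Main()
    {
        byte[] bytes = { 0x41, 0xE2, 0x82, 0xAC };        // "A" then U+20AC
        System.Console.WriteLine(FindCharStart(bytes, 3)); // prints 1
    }
}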
Some of this really depends on what you mean by "encoding" and "decoding".
The word "encoding" is applied in a variety of ways. Two that could apply
here are the basic idea of text encoding, which mostly just has to do with
the character set, and the actual transformation of data, which has to do
with compressing it or translating it into a more portable format (MIME,
for example). I don't even know which of these meanings you're addressing,
which makes it even harder for me to know the answer.
[...]
So, summing up: I think the question reduces to one about the encoding's
characteristics. You showed us a very good solution using FileStream. It
can be extended to mix these two approaches, which may be faster, but I
still do not know whether it is reliable.
Indeed, that is a question you should probably figure out. Sooner rather
than later.

Sorry I can't be of more help on that front.
Pete