Decoding strategy

  • Thread starter: marcin.rzeznicki
[...]
Here's a dumb question: is there any particular reason you're NOT mapping
the entire file at once? I've mentioned the possibility in previous
messages, making assumptions that you have your reasons for not doing so.
But if you could, all of these issues just go away. Are you genuinely
concerned that you won't have enough contiguous virtual address space to
map the whole file?

Well, there are two issues involved, and I do not know which one you are
referring to. Let me explain. Mapping is actually a two-step process:
first you reserve VM for the mapping, and then you commit, which
results in bringing the contents of the file into memory.

That's not the process of memory-mapped file i/o I'm familiar with. That
is, while I know you can use MapViewOfFileEx() to provide a specific virtual
address at which to map the file, this isn't necessary, nor does it to my
knowledge require an explicit commit of the entire file.

The usual method of memory-mapping that I use is this:

* open the file (CreateFile)
* create the file mapping (CreateFileMapping)
* assign virtual address space to file mapping (MapViewOfFile)

When MapViewOfFile returns, the code now has a virtual address that
represents the beginning of the data of the file. Physical RAM is committed
only as the data is actually accessed, and can be reclaimed through the
usual page aging process (older pages get tossed as needed if something else
needs physical RAM that's not available).
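
That three-step sequence can be sketched in Python, whose mmap module is
built on CreateFileMapping/MapViewOfFile on Windows and shows the same
lazy-commit behavior; the file name and contents here are invented for
illustration:

```python
import mmap
import os
import tempfile

# A stand-in data file (name and contents invented for illustration).
path = os.path.join(tempfile.mkdtemp(), "sample.dat")
with open(path, "wb") as out:
    out.write(b"hello, mapped world")

# * open the file (CreateFile)
f = open(path, "rb")
# * create the mapping and assign address space to it
#   (CreateFileMapping + MapViewOfFile; length 0 maps the whole file)
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Physical RAM is committed only as the data is accessed; this slice
# faults in just the page containing these bytes.
print(mm[7:13])  # b'mapped'

mm.close()
f.close()
```

No explicit commit step appears anywhere; touching the mapped bytes is
what brings them into physical RAM.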
So, when it comes to the reservation step, I map the entire file at
once; the code I pasted does not show this step. What is shown is the
commitment step, and I commit only a small portion of the reserved
memory at a time.

The code you posted calls only MapViewOfFile. This doesn't reserve any
physical RAM for the data. It just reserves room in the virtual address
space for it.
This app is not going to be a server app, running on high-end machines
with many gigs of RAM. It is rather intended to be a desktop app. So, I
do not want to reserve something like 500 MB of memory for just one
file, because it could easily cause constant swapping and overall
performance degradation on the user's machine.

Negative. That's one of the nice benefits of memory mapping: you can map an
entire file, even a large one, and use only the physical RAM required to
process the parts you're looking at. In addition, because the physical RAM
being used is backed by the mapped file, it doesn't get swapped out to the
swap file...the file itself can be used for the backing store (this doesn't
necessarily help the physical RAM side of things, but it does ease the
pressure on the swap file itself).

There is no reason that I can think of that would cause mapping a large file
into virtual address space to cause any more swapping than processing that
file would cause in any case. The OS certainly does not read all 500MB of a
mapped 500MB file into physical RAM just because you've mapped the file.
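
This claim is easy to sanity-check. The following Python sketch (sizes
and paths are arbitrary) creates a 100 MB file and maps it with a single
call; the mapping itself is cheap, and only the pages actually touched
get faulted into physical RAM:

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "big.dat")
SIZE = 100 * 1024 * 1024  # 100 MB, arbitrary for the demonstration

# Create a 100 MB file without writing 100 MB of data: seek past the
# end and write a single byte (the gap reads back as zeros).
with open(path, "wb") as out:
    out.seek(SIZE - 1)
    out.write(b"\x7f")

with open(path, "rb") as f:
    # Map the entire file in one call. This consumes virtual address
    # space, not 100 MB of physical RAM.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the pages we touch are brought in.
    assert mm[0] == 0
    assert mm[SIZE - 1] == 0x7f
    mm.close()
```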
[...]
Specifically: what if you modified your code that maps the file, so that
it maps a range *around* the starting point, the way I suggested with
the buffers? At certain points (perhaps only when you got right to the
very edge and attempted to read a byte outside your mapped range), you
would remap the file, shifting the window so that the bytes you want to
deal with are within the mapped range.

It does. Well, I am sorry, because I stripped this code of the mapping
logic, but when you see something like firstBufferIndex it is, in almost
all cases, a carefully computed index of a portion which contains the
requested data but also its neighbourhood, so that near "jumps" should
not cause remapping. Actually, the user may as well use enumerated
access; in that case I know in advance that the data is going to be read
forward, and then I can, if I must, map from where the previous mapping
ends.

That's not what I mean. If you were doing what I was suggesting already,
then the only issue remaining for you would be figuring out when you need to
back up in the data. The actual backing up would be trivial...you'd just
decrement your pointer and read the byte you want to read. You would have
moments when the mapped section of the file would have to change, but that
would be a momentary diversion and you'd get right back to just reading the
bytes from the mapped address space.
[...] Then you translate
that to the actual offset within the mapped range as necessary. That
way, you can be changing the mapped range on the fly without affecting
how the higher-level code that actually processes the data works.

Well, that is not going to work for me, unfortunately. The interfaces I
have to implement imply that data access uses "string coordinates" - so
the client code specifies "I want the 5th char", not the 5th byte, and
given that encoding hell I would not be able to compute that easily, so
I decided to use only "string coordinates".

I don't think you got my meaning. I don't mean that the highest level of
your code has to use a byte offset within the file. Just that the decoder
part need not concern itself with anything other than the byte offset. As
it read bytes, it would ask the file mapping layer of your code for a byte
offset within the file, and the file mapping layer would then translate that
into an offset within the mapped view you're using.

That said, so far I haven't seen an indication that you actually need to be
mapping sections of the file. You seem to be concerned about committing too
much physical RAM at once to the mapping, but unless you're doing something
really odd that you haven't posted in code, your concern is unfounded.

There are reasons that you might not be able to map an entire file into your
virtual address space, but 500MB ought to be within the usual limitations.
It seems to me that you should look at just mapping the entire file all at
once, and if you run into problems with that, then start worrying about
windowing the file.

The reason you might not be able to map the whole file at once is that you
don't have a contiguous range of virtual address space large enough for the
file. That can happen for two reasons: insufficient virtual address space
left or fragmented virtual address space. How much virtual address space
you might have will vary, but even the theoretical 2GB maximum (and of
course, this never comes close to being available) is smaller than some
files. Fragmentation is harder to predict, and could limit your available
virtual address space to something significantly smaller than the actual
virtual address space left. But IMHO, if 500MB is a typical file size for
you, you ought to be able to map that without problems.

Pete
 
Peter said:
[...]
Here's a dumb question: is there any particular reason you're NOT mapping
the entire file at once? I've mentioned the possibility in previous
messages, making assumptions that you have your reasons for not doing so.
But if you could, all of these issues just go away. Are you genuinely
concerned that you won't have enough contiguous virtual address space to
map the whole file?

Well, there are two issues involved, and I do not know which one you are
referring to. Let me explain. Mapping is actually a two-step process:
first you reserve VM for the mapping, and then you commit, which
results in bringing the contents of the file into memory.

That's not the process of memory-mapped file i/o I'm familiar with. That
is, while I know you can use MapViewOfFileEx() to provide a specific virtual
address at which to map the file, this isn't necessary, nor does it to my
knowledge require an explicit commit of the entire file.

The usual method of memory-mapping that I use is this:

* open the file (CreateFile)
* create the file mapping (CreateFileMapping)
* assign virtual address space to file mapping (MapViewOfFile)

That's the same, but under different names. CreateFileMapping reserves
the VM range. It is not yet committed, and you pay almost no
resource-usage/performance price. MapViewOfFile commits some part of the
previously reserved VM and brings in the contents of the file (maybe
lazily, I don't know for sure).
When MapViewOfFile returns, the code now has a virtual address that
represents the beginning of the data of the file. Physical RAM is committed
only as the data is actually accessed, and can be reclaimed through the
usual page aging process (older pages get tossed as needed if something else
needs physical RAM that's not available).


The code you posted calls only MapViewOfFile. This doesn't reserve any
physical RAM for the data. It just reserves room in the virtual address
space for it.

Well, actually, if I understand the docs correctly, CreateFileMapping
reserves a virtual memory address range and establishes an association
between VM addresses and the file. MapViewOfFile brings the contents of
the file into RAM.
Negative. That's one of the nice benefits of memory mapping: you can map an
entire file, even a large one, and use only the physical RAM required to
process the parts you're looking at. In addition, because the physical RAM
being used is backed by the mapped file, it doesn't get swapped out to the
swap file...the file itself can be used for the backing store (this doesn't
necessarily help the physical RAM side of things, but it does ease the
pressure on the swap file itself).

Positive, with respect to the definition of "swapping" :-) It does not
get swapped to the swap file, true, but it may still be swapped to the
mapped file. So, though you are right that memory pressure is removed
from the page file, you still pay the price of swapping if a lot of RAM
is occupied by the file view.
There is no reason that I can think of that would cause mapping a large file
into virtual address space to cause any more swapping than processing that
file would cause in any case. The OS certainly does not read all 500MB of a
mapped 500MB file into physical RAM just because you've mapped the file.

I think that when I've established a view, then RAM gets occupied. So,
as I said, I map the whole file at once, as the docs assure me that
there is nothing wrong with that, but I restrict myself to moderately
sized views.
[...]
Specifically: what if you modified your code that maps the file, so that
it maps a range *around* the starting point, the way I suggested with
the buffers? At certain points (perhaps only when you got right to the
very edge and attempted to read a byte outside your mapped range), you
would remap the file, shifting the window so that the bytes you want to
deal with are within the mapped range.

It does. Well, I am sorry, because I stripped this code of the mapping
logic, but when you see something like firstBufferIndex it is, in almost
all cases, a carefully computed index of a portion which contains the
requested data but also its neighbourhood, so that near "jumps" should
not cause remapping. Actually, the user may as well use enumerated
access; in that case I know in advance that the data is going to be read
forward, and then I can, if I must, map from where the previous mapping
ends.

That's not what I mean. If you were doing what I was suggesting already,
then the only issue remaining for you would be figuring out when you need to
back up in the data. The actual backing up would be trivial...you'd just
decrement your pointer and read the byte you want to read. You would have
moments when the mapped section of the file would have to change, but that
would be a momentary diversion and you'd get right back to just reading the
bytes from the mapped address space.

Sorry Peter, I don't get it then. Could you explain it to me? It seems
to be an interesting idea, but now I feel that I've got lost.
[...] Then you translate
that to the actual offset within the mapped range as necessary. That
way, you can be changing the mapped range on the fly without affecting
how the higher-level code that actually processes the data works.

Well, that is not going to work for me, unfortunately. The interfaces I
have to implement imply that data access uses "string coordinates" - so
the client code specifies "I want the 5th char", not the 5th byte, and
given that encoding hell I would not be able to compute that easily, so
I decided to use only "string coordinates".

I don't think you got my meaning. I don't mean that the highest level of
your code has to use a byte offset within the file. Just that the decoder
part need not concern itself with anything other than the byte offset. As
it read bytes, it would ask the file mapping layer of your code for a byte
offset within the file, and the file mapping layer would then translate that
into an offset within the mapped view you're using.

Isn't that what ReadPage in my code does? It is asked to bring the
contents indexed by a block offset; it computes the "real" offset and
establishes a view. The decoder part does not even have to think of byte
offsets, because it operates on the current page only, and the pointer
to it is constant while the decoder operates.
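
ReadPage itself was not posted in this thread, so the following Python
sketch is only a guess at its shape: a layer that, given a block index,
computes the real byte offset, establishes a window over that block, and
hands the decoder a buffer it can treat as stable. PAGE_SIZE and all
names are invented:

```python
import mmap

# Hypothetical block size; a multiple of the allocation granularity
# so the mapping offset below is always legal.
PAGE_SIZE = 64 * 1024

class PagedFile:
    """Guess at a ReadPage-style layer: given a block index, it maps
    the corresponding window and returns its bytes. The decoder works
    only against the returned page, never against file offsets."""

    def __init__(self, f):
        self._f = f
        f.seek(0, 2)
        self._size = f.tell()

    def read_page(self, page_index):
        # Compute the "real" byte offset and establish a view there.
        offset = page_index * PAGE_SIZE
        length = min(PAGE_SIZE, self._size - offset)
        mm = mmap.mmap(self._f.fileno(), length,
                       access=mmap.ACCESS_READ, offset=offset)
        try:
            # Copy out so the view can be released immediately; the
            # real code would instead keep the view mapped while the
            # decoder runs.
            return bytes(mm)
        finally:
            mm.close()
```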
That said, so far I haven't seen an indication that you actually need to be
mapping sections of the file. You seem to be concerned about committing too
much physical RAM at once to the mapping, but unless you're doing something
really odd that you haven't posted in code, your concern is unfounded.

There are reasons that you might not be able to map an entire file into your
virtual address space, but 500MB ought to be within the usual limitations.
It seems to me that you should look at just mapping the entire file all at
once, and if you run into problems with that, then start worrying about
windowing the file.

The reason you might not be able to map the whole file at once is that you
don't have a contiguous range of virtual address space large enough for the
file. That can happen for two reasons: insufficient virtual address space
left or fragmented virtual address space. How much virtual address space
you might have will vary, but even the theoretical 2GB maximum (and of
course, this never comes close to being available) is smaller than some
files. Fragmentation is harder to predict, and could limit your available
virtual address space to something significantly smaller than the actual
virtual address space left. But IMHO, if 500MB is a typical file size for
you, you ought to be able to map that without problems.

So, if I understand correctly what you wrote, I am not concerned about
mapping the file at once: I reserve all the VM I will need for one file
(CreateFileMapping). But I am concerned when it comes to the commit
(MapViewOfFile), because that's where memory resources are really
consumed. Am I missing something?
 
That's the same, but under different names. CreateFileMapping reserves
the VM range.

That is incorrect. The virtual memory range is not reserved until you call
MapViewOfFile.
[...] It is not yet committed, and you pay almost no
resource-usage/performance price. MapViewOfFile commits some part of the
previously reserved VM and brings in the contents of the file (maybe
lazily, I don't know for sure).

That is also incorrect. MapViewOfFile reserves the virtual address space.
There may be some caching, but otherwise committing the file data to
physical RAM does not occur until a specific portion of the reserved virtual
address space is referenced.

I'm offline right now, otherwise I'd provide a link to the MSDN web site.
However, you can easily look those functions up yourself, and the
documentation explicitly describes the behavior as I do above.

From the documentation for CreateFileMapping:

Creating a file mapping object creates the potential for
mapping a view of the file, but does not map the view. The
MapViewOfFile and MapViewOfFileEx functions map a view of
a file into a process address space

If CreateFileMapping was what allocated virtual address space, it would not
make sense for MapViewOfFileEx to even exist, since the main reason for that
function is to allow the program to provide a specific virtual memory
address at which to map the file.
[...]
Well, actually, if I understand the docs correctly, CreateFileMapping
reserves a virtual memory address range and establishes an association
between VM addresses and the file. MapViewOfFile brings the contents of
the file into RAM.

What can I say? You don't understand the docs correctly.
[...]
Positive, with respect to the definition of "swapping" :-) It does not
get swapped to the swap file, true, but it may still be swapped to the
mapped file. So, though you are right that memory pressure is removed
from the page file, you still pay the price of swapping if a lot of RAM
is occupied by the file view.

My point is that the amount of data in physical RAM will be related to your
use of that data. The OS will keep the data in physical RAM based on your
access of that data, not based on how much of it there is. This is true
whether you use memory mapping or not.

With either technique, you can limit the *maximum* amount of physical RAM
potentially consumed. Using memory mapping, you do this by mapping only a
small range of the file at a time. Using conventional file i/o, you do this
by limiting your own buffers that are used to store data you've read from
the file.

In either case, the OS has the final say on how much physical RAM is
actually used. Using memory mapping, if there are other demands on
physical RAM, then only a portion of the mapped virtual address space
will actually be resident at any given time. Likewise, using
conventional file i/o, only a portion of your own program buffers will
be resident in physical RAM at any given time.

But memory mapped file i/o will not in and of itself increase memory
swapping. The only way it could do that is if you not only map the entirety
of a very large file in RAM, but you wind up *accessing* the totality of
that file more frequently than you access anything else. In that case, the
OS would be chasing you trying to keep all of the file data you're
referencing resident, at the same time that other stuff needs to be swapped
in and back out.

This is not a typical case, and doesn't seem relevant to your own situation.
In any case, the OS is pretty smart. If your use of a memory mapped file
starts pressuring other users of physical RAM, the OS is not going to
bother trying to keep all of the memory mapped file in RAM. Even better,
as long as you open the file as read-only, you're assured never to incur
the cost of writing any data back to the disk if a physical page of RAM
used by the file mapping has to get discarded and used for something
else.

Your worries about memory mapping the entire file causing some serious
problem with disk swapping are unfounded.
I think that when I've established a view, then RAM gets occupied. So,
as I said, I map the whole file at once, as the docs assure me that
there is nothing wrong with that, but I restrict myself to moderately
sized views.

But it's not true that when you establish a view then RAM gets occupied.
The "view" is an allocation of virtual address space, not physical RAM.
Sorry Peter, I don't get it then. Could you explain it to me? It seems
to be an interesting idea, but now I feel that I've got lost.

Assume you have some code that attempts to retrieve a byte from a specific
file offset. Assume also that you have some code that translates this into
access from your mapped view of the file. Finally, assume that the
higher-level code is trying to access a byte that is just before the lowest
file offset currently being mapped.

In pseudocode then:

// The desired byte offset from the file
long ibFileOffset;
// This is the mapped range, "Min" inclusive, "Mac" exclusive
long ibMappedMin, ibMappedMac;
// The resulting offset within the mapped range
long ibMappedOffset;

if (ibFileOffset < ibMappedMin || ibFileOffset >= ibMappedMac)
{
    // remap the file so that ibMappedMin <= ibFileOffset and
    // ibFileOffset < ibMappedMac. Don't forget to make sure
    // that ibMappedMin and ibMappedMac remain between 0 and
    // the total file length.
}

ibMappedOffset = ibFileOffset - ibMappedMin;
return *(pbMappedData + ibMappedOffset);

Basically, in the normal case, all that the code is doing is translating
the file offset to the mapping offset and returning the data at that
offset. When the requested data falls outside the range, you just shift
the offset enough to accommodate the new request for data.

Most likely, you'd try to center the newly-mapped range on the requested
file offset. When you get near the beginning or end of the file, you'll
necessarily wind up at least trimming the mapped range as appropriate
(making it smaller than normal), if not just pinning the range to the
relevant boundary (preserving the total size of the mapping).
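
Turned into runnable form, the pseudocode above might look like this
Python sketch (the WindowedReader name, the window size, and the
alignment handling are all invented; Python's mmap requires the mapping
offset to be a multiple of the allocation granularity, playing the role
of MapViewOfFile's alignment requirement):

```python
import mmap

class WindowedReader:
    """Runnable version of the pseudocode: translate a file offset to
    an offset within the mapped window [min, mac), remapping whenever
    a request falls outside the current window."""

    # Window size; a multiple of the allocation granularity so the
    # mapping offset stays legal after alignment.
    WINDOW = 4 * mmap.ALLOCATIONGRANULARITY

    def __init__(self, f):
        self._f = f
        f.seek(0, 2)
        self._size = f.tell()
        self._mm = None
        self._min = self._mac = 0

    def byte_at(self, ib_file_offset):
        if not (self._min <= ib_file_offset < self._mac):
            self._remap(ib_file_offset)
        # Translate the file offset into an offset within the window.
        return self._mm[ib_file_offset - self._min]

    def _remap(self, ib_file_offset):
        if self._mm is not None:
            self._mm.close()
        # Center the window on the requested offset, aligned down to
        # the allocation granularity and pinned to the start of the
        # file; near the end of the file the window is simply trimmed.
        start = max(0, ib_file_offset - self.WINDOW // 2)
        start -= start % mmap.ALLOCATIONGRANULARITY
        length = min(self.WINDOW, self._size - start)
        self._mm = mmap.mmap(self._f.fileno(), length,
                             access=mmap.ACCESS_READ, offset=start)
        self._min = start
        self._mac = start + length
```

Higher-level code keeps addressing the file by absolute byte offset;
only byte_at knows that a window exists, so backing up is just a call
with a smaller offset.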
Isn't that what ReadPage in my code does? It is asked to bring the
contents indexed by a block offset; it computes the "real" offset and
establishes a view. The decoder part does not even have to think of byte
offsets, because it operates on the current page only, and the pointer
to it is constant while the decoder operates.

IMHO, there's no reason for the decoder to have to think of pages within the
file. As near as I can tell, that's an arbitrary choice affected by the
implementation of your file i/o. In particular, if I understand correctly
(and maybe I don't), part of the issue of "tearing" that you're worried
about comes about because of the potential for data being read to cross one
of these page boundaries.

The decoder should be concerning itself only with the entire file. That's
why you have the tearing issue. If you allowed the decoder to simply use an
offset relative to the beginning of the file, then the decoder would never
have to worry about whether the data falls outside the currently mapped
range. The i/o code would take care of that instead, and always return
whatever byte it is the decoder wants to handle.

Of course, if you simply map the entire file all at once, the issue becomes
trivial. So this may or may not be a moot point. You don't seem to be
basing your architectural decisions on correct information about how file
mapping works, so maybe understanding correctly how file mapping works you
will find all of this "map a subset of the file" stuff becomes irrelevant.
[...]
So, if I understand correctly what you wrote, I am not concerned about
mapping the file at once: I reserve all the VM I will need for one file
(CreateFileMapping). But I am concerned when it comes to the commit
(MapViewOfFile), because that's where memory resources are really
consumed. Am I missing something?

Yes, I think so. See above. :)

Pete
 
Peter Duniho wrote:
That's the same, but under different names. CreateFileMapping reserves
the VM range.

That is incorrect. The virtual memory range is not reserved until you call
MapViewOfFile.
[...] It is not yet committed, and you pay almost no
resource-usage/performance price. MapViewOfFile commits some part of the
previously reserved VM and brings in the contents of the file (maybe
lazily, I don't know for sure).

That is also incorrect. MapViewOfFile reserves the virtual address space.
There may be some caching, but otherwise committing the file data to
physical RAM does not occur until a specific portion of the reserved virtual
address space is referenced.

Methinks that we are giving the same subject different names. Here is a
quotation from MSDN:

the address range is reserved with the function CreateFileMapping until
portions are requested via a call to function MapViewOfFile. This
permits applications to map a large file (it is possible to load a file
1 GB in size in Windows NT) to a specific range of addresses without
having to load the entire file into memory. Instead, portions (views)
of the file can be loaded on demand directly to the reserved address
space.


[...]

Your worries about memory mapping the entire file causing some serious
problem with disk swapping are unfounded.

Yes, it seems that I was wrong. I'll have to rethink the design once
again.

[...]
Assume you have some code that attempts to retrieve a byte from a specific
file offset. Assume also that you have some code that translates this into
access from your mapped view of the file. Finally, assume that the
higher-level code is trying to access a byte that is just before the lowest
file offset currently being mapped.

In pseudocode then:

// The desired byte offset from the file
long ibFileOffset;
// This is the mapped range, "Min" inclusive, "Mac" exclusive
long ibMappedMin, ibMappedMac;
// The resulting offset within the mapped range
long ibMappedOffset;

if (ibFileOffset < ibMappedMin || ibFileOffset >= ibMappedMac)
{
    // remap the file so that ibMappedMin <= ibFileOffset and
    // ibFileOffset < ibMappedMac. Don't forget to make sure
    // that ibMappedMin and ibMappedMac remain between 0 and
    // the total file length.
}

ibMappedOffset = ibFileOffset - ibMappedMin;
return *(pbMappedData + ibMappedOffset);

But is there any difference, if it seems that mapping the whole file at
once will do?

[...]
Of course, if you simply map the entire file all at once, the issue becomes
trivial. So this may or may not be a moot point. You don't seem to be
basing your architectural decisions on correct information about how file
mapping works, so maybe understanding correctly how file mapping works you
will find all of this "map a subset of the file" stuff becomes irrelevant.

Yes, and that's the whole point. You are perfectly right about MMF; I
shouldn't have worried about tearing, because I am able to map the file
at once and rely on the OS when it comes to swapping.
[...]
So, if I understand correctly what you wrote, I am not concerned about
mapping the file at once: I reserve all the VM I will need for one file
(CreateFileMapping). But I am concerned when it comes to the commit
(MapViewOfFile), because that's where memory resources are really
consumed. Am I missing something?

Yes, I think so. See above. :)

Thank you very much. You clarified this whole mapping issue for me :-)
Thanks once again.
 
Methinks that we are giving the same subject different names. Here is
quotation from MSDN: [...]

I'm not sure of that. You still seem to believe that CreateFileMapping
affects the use of the virtual address space, and you still seem to believe
that calling MapViewOfFile affects the use of physical memory.

As far as this specific misunderstanding goes, IMHO you should be very
careful about believing a statement found in a general article, rather than
specific comments found in the documentation for the functions you're trying
to understand. In particular, the comments found in the documentation for
CreateFileMapping, MapViewOfFile, and MapViewOfFileEx trump any other
documentation you might find, unless you have independent confirmation that
suggests otherwise.

In this case, I am aware of no other independent confirmation. It seems
most likely to me that the article is simply mentioning in passing some
behavior of the functions that may or may not be relevant to your use.

If you look at the documentation for CreateFileMapping, you'll note that
there is a way of calling it for a mapping not backed by a specific disk
file. In this use, it may be true that CreateFileMapping reserves a virtual
address range. However, that doesn't mean that that's what the function
does in all cases.

As I've pointed out, the behavior of CreateFileMapping and MapViewOfFile(Ex)
are specifically documented contrary to your understanding and contrary to
the article you've referenced. In particular, it would make no sense:

1) That CreateFileMapping could reserve any virtual address space, when
the whole point of the MapViewOfFileEx function is to specify a specific
virtual address at which to map the file. If CreateFileMapping had already
reserved virtual address space, then there would be no way to ask for a
specific virtual address later, as CreateFileMapping would have already
determined the mapped virtual address (you can't reserve a range of virtual
address space without knowing where in the virtual address space it is), or

2) That CreateFileMapping can reserve virtual address space before
knowing how much address space to reserve. When you call CreateFileMapping,
you tell it the full extent of the file you wish to map. It is perfectly
legal for this extent to be larger than 2GB. How would CreateFileMapping
reserve virtual address space in this case? Does it pick an arbitrary
length for the range? What happens when you ask to map more than the
arbitrary length it chose? No, I think it much more likely that the
documentation is correct and that virtual address space is not reserved
until you call MapViewOfFile(Ex).

By the way, you should be able to use the VirtualXXX functions or possibly
performance counters to confirm the behavior. I haven't looked closely at
what's available, but I'm sure there's some mechanism for querying the state
of the process's virtual memory. In particular, if you call
CreateFileMapping and the available virtual memory before the call and after
the call is reduced by the size of the mapping you've requested, then that
would support your interpretation that CreateFileMapping is reserving
virtual address space.

I suspect you'll find that a large change in the virtual memory available
happens only after MapViewOfFile. :)
[...]
But is there any difference, if it seems that mapping the whole file at
once will do?

No, I don't think so. If it is suitable for your needs to map the entire
file at once, then any issues related to windowing the file simply go away.
Yes, and that's the whole point. You are perfectly right about MMF; I
shouldn't have worried about tearing, because I am able to map the file
at once and rely on the OS when it comes to swapping.

Indeed. :)
Thank you very much. You clarified this whole mapping issue for me :-)
Thanks once again.

You're very welcome. I only regret that it seems as though none of this
thread has anything to do with C#. :)

Pete
 
