Concurrently streaming a file to HttpResponse and file IO


Anders Borum

Hello!

I'm implementing support for disk based caching of binary resources (blobs)
residing in a SQL database. This post is about choosing the right strategy.

Because of the web environment, there are potentially many concurrent
requests to a resource. I would like to keep the application responsive
(continue to serve requests for resources) while streaming resources to
disk.

The case scenario is a number of concurrent requests (say 10 or 20) asking
for a resource (e.g. 32 MB in size) while the resource has not yet been
cached on disk. To serve each request (keeping the application responsive),
one approach is to start streaming the resource from the DB to the client
request - and simultaneously queue a task to the thread pool that streams
the resource to disk.

I'm thinking about implementing a producer / consumer pattern here; a
producer creates tasks that the consumer picks up and starts streaming to
disk.
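A minimal sketch of that producer / consumer idea (shown in Python purely
for illustration; fetch_blob and cache_path are hypothetical placeholders
for the DB read and the cache-file naming):

```python
# Producer/consumer hand-off: request threads enqueue resource ids,
# a single consumer thread drains the queue and writes cache files.
import queue
import threading

tasks = queue.Queue()   # producer/consumer hand-off
STOP = object()         # sentinel that shuts the consumer down

def consumer(fetch_blob, cache_path):
    """Drain the task queue, writing each requested blob to its cache file."""
    while True:
        resource_id = tasks.get()
        if resource_id is STOP:
            break
        # Stream the blob to disk in chunks rather than buffering it whole.
        with open(cache_path(resource_id), "wb") as f:
            for chunk in fetch_blob(resource_id):
                f.write(chunk)
```

Request threads just call tasks.put(resource_id) and carry on streaming the
response; the single consumer serializes the disk writes.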

Another approach would be to receive the request and check for the file's
existence (using a cache for quick lookups). If not cached, then check
whether a "streaming register" contains information about the file
currently being streamed to disk. If it is not in the "streaming register",
then queue a task to the thread pool (each worker thread is responsible for
registering / unregistering its streaming process).

Locking semantics should ensure that only a single thread is able to stream
a given file to disk (it makes no sense for two threads to stream the same
file to disk in parallel).
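The "streaming register" with those locking semantics could be sketched as
a lock-guarded set (illustrative Python, not the actual implementation):

```python
# A "streaming register": records which files are currently being
# written, so only one thread streams a given file to disk at a time.
import threading

class StreamingRegister:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_progress = set()

    def try_begin(self, resource_id):
        """Claim the file; returns False if another thread already has it."""
        with self._lock:
            if resource_id in self._in_progress:
                return False
            self._in_progress.add(resource_id)
            return True

    def end(self, resource_id):
        """Release the claim (call this from a finally block)."""
        with self._lock:
            self._in_progress.discard(resource_id)
```

The thread that wins try_begin would stream to a temporary name and rename
on completion, so readers never see a half-written .cache file.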

I've already got all the major parts in place:

1. Authentication of requests to resources.
2. Managing http headers (result codes, mime types etc.).
3. Streaming large resources to disk from the DB (files are stored with a
unique identifier (guid) plus a .cache extension).
4. Streaming large resources to a client response from the DB (using the
chunk pattern).
5. Transmitting disk based resources to a client response (using IIS
infrastructure for high performance).
6. Scavenging of disk based resources not requested within a certain threshold.
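Point 6 (scavenging) can be as simple as a periodic sweep over the cache
directory; an illustrative sketch (Python, hypothetical names), assuming
the <guid>.cache layout from point 3:

```python
# Remove .cache files whose last access is older than a threshold.
import os
import time

def scavenge(cache_dir, max_idle_seconds):
    """Delete cache files not accessed within max_idle_seconds."""
    cutoff = time.time() - max_idle_seconds
    for entry in os.scandir(cache_dir):
        if entry.name.endswith(".cache") and entry.stat().st_atime < cutoff:
            os.remove(entry.path)
```

Note that access-time tracking can be disabled on some volumes, in which
case a self-maintained last-served timestamp is the safer basis for the
threshold.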

I guess what I'm asking for are guidelines (or "do's" and "don'ts"). Am I
working in the right direction? :)
 

Peter Bromberg [C# MVP]

I'm not convinced the proposed strategy makes a lot of sense -- It would take
about the same time to read a resource from the disk as it does to select it
out of a SQL server database table. So by having the extra work of storing
requested resources in disk files, you might find that you haven't
particularly bought yourself much increase in scalability - possibly even a
decrease in same.
Just my 2 cents.
-- Peter
To be a success, arm yourself with the tools you need and learn how to use
them.

Site: http://www.eggheadcafe.com
http://petesbloggerama.blogspot.com
http://ittyurl.net
 

Anders Borum

Hi Peter,
I'm not convinced the proposed strategy makes a lot of sense -- It would
take about the same time to read a resource from the disk as it does to
select it out of a SQL server database table.

Well, as far as I'm concerned there's actually a big difference between the
two strategies. First of all, the SQL server may be busy / bandwidth limited
while serving requests from other web frontend servers. In addition, IIS is
extremely efficient at transmitting binary files (it serves them through a
kernel-mode driver).

Perhaps the right question really is whether it actually makes sense to
stream the binary resource to the client in parallel while saving it to
disk. It might be a better strategy to just save it to disk first and then
continue as usual (naturally with locking semantics).
So by having the extra work of storing requested resources in disk files,
you might find that you haven't particularly bought yourself much increase
in scalability - possibly even a decrease in same.

I've already benchmarked the two strategies and there's quite a big
difference. In addition, I think it makes sense to keep the workload in the
outermost physical tiers if possible, because it allows for a much more
scalable architecture.
 

Jeroen Mostert

Anders said:
I'm implementing support for disk based caching of binary resources
(blobs) residing in a SQL database.

You mean file-based caching, or tiered caching. Calling it "disk based
caching" borders on paradoxical -- it's not as if the SQL database isn't
stored on disk.

Just for completeness:
http://blogs.msdn.com/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx

Of course, I wouldn't hop on the SQL Server 2008 bandwagon just yet, but
when it's a stable solution it might very well solve a few problems like this.

Please make sure you've eliminated possible sources of slowness in the blob
solution first, like disk fragmentation, a database server that's running
out of memory or I/O bandwidth, network congestion, poorly chosen network
settings (like the packet size for the SQL connection), DB drivers with
known performance issues and probably other things I'm forgetting. What
you're doing in a case like this is essentially second-guessing the system
that was designed to handle scalable access to your data. That's the sort of
thing that should be completely defensible.
The case scenario is a number of concurrent requests (e.g. 10 or 20)
asking for a resource (e.g. 32 MB in size) while the resource has not
yet been cached on disk.

Again, in a file. Or better yet, "as a file".

Note that, as with all caching strategies, you're facing concurrency and
data integrity issues w.r.t. updates to the blob data. The only way you're
not facing these issues is if the data is never or hardly ever updated, but
that raises the legitimate question of whether you should be storing it as a
blob to begin with if you know separate files will give better performance
(that has the advantage of simplifying backup strategies, but in many cases
that's just not important, or not as important as performance).
To serve each request (keeping the application responsive), one
approach is to start streaming the resource from DB to the client
request - and simultaneously queue a task to the thread pool that streams
the resource to disk.
But you're caching the thing to get a speedup. While it's being written
as a file, every client has to use the direct stream from the DB, so
there's no speedup yet (in fact, probably a penalty because of the
additional I/O for the caching). Subsequent requests from
other clients might go faster, but you'd still need a good deal of measuring
to find out if this is worth it. I/O generally doesn't parallelize very well
unless you plan for it.
Another approach would be to receive the request and check for the file's
existence (using a cache for quick lookups). If not cached, then check
whether a "streaming register" contains information about the file
currently being streamed to disk. If not in the "streaming register",
then queue a task to the thread pool (each worker thread is responsible
for registering / unregistering its streaming process).
The thread pool doesn't sound like a good way of doing this. It's intended
for short-lived tasks that preferably can survive indefinite queueing (if
the TP is occupied) to maximize the potential for parallelism. Trouble is,
sequentially writing a large file to disk is neither short-lived nor
parallelizable. This is true even if you split the worker tasks up in
asynchronous bits.

Assuming lots of clients want the same file, which is presumably the case
you're optimizing for (as there's little point in optimizing the case where
everyone wants a different file), it might make more sense to have a
dedicated thread for the caching, writing one file at a time. You could make
that X threads if your I/O subsystem can handle it, and you could also still
use the thread pool, but then you should probably still keep track of how
many requests you're issuing yourself.
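That bounded-writer idea might be sketched like this (illustrative Python;
CacheWriter and write_file are hypothetical names): a small dedicated pool
capped at X workers, plus an in-flight set so the same file is never
queued twice.

```python
# Cap cache writes at a fixed number of dedicated workers instead of
# fanning out on the general thread pool.
import threading
from concurrent.futures import ThreadPoolExecutor

class CacheWriter:
    def __init__(self, write_file, max_workers=1):
        self._write_file = write_file          # callable(resource_id) -> None
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._in_flight = set()                # files queued or being written

    def request(self, resource_id):
        """Ask for a file to be cached; returns False if already pending."""
        with self._lock:
            if resource_id in self._in_flight:
                return False
            self._in_flight.add(resource_id)
        self._pool.submit(self._run, resource_id)
        return True

    def _run(self, resource_id):
        try:
            self._write_file(resource_id)
        finally:
            with self._lock:
                self._in_flight.discard(resource_id)

    def wait(self):
        """Block until all queued writes have finished (for shutdown)."""
        self._pool.shutdown(wait=True)
```

With max_workers=1 this is exactly the "one dedicated thread, one file at
a time" scheme; raising it trades sequential disk access for throughput,
which is only worthwhile if the I/O subsystem can keep up.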

What you don't want is your dual quad-core server using 8 threads to pester
the hard disk with chunks from 20 different files -- if you value
responsiveness that much more than request time, forget the whole caching
thing and stream directly from the DB, saving yourself the overhead.
I guess what I'm asking for are guidelines (or "do's" and "don'ts"). Am
I working in the right direction? :)
There's an easy and definite "do" in all this, and that's profile, profile,
profile, with realistic and consistent setups, never profiling individual
bits, always the whole system. Your intuition on what ought to work better
is going to fail you sooner or later (usually sooner) and it's just too easy
to waste time on complicated mechanisms that enable new ways of failure
without offering concrete improvements.

In that vein, establish exactly what sort of performance baselines you're
looking for/expecting before you go off to fight the neverending battle for
truth, justice and improved performance, because it shouldn't be neverending.
 

Arne Vajhøj

Peter said:
I'm not convinced the proposed strategy makes a lot of sense -- It would take
about the same time to read a resource from the disk as it does to select it
out of a SQL server database table. So by having the extra work of storing
requested resources in disk files, you might find that you haven't
particularly bought yourself much increase in scalability - possibly even a
decrease in same.

For many small files it would certainly decrease performance,
since opening a file is an expensive operation.

But for 32 MB BLOBs it may still be beneficial, at least
if the files are served directly by IIS rather than by an ASP.NET page.

Arne
 

Arne Vajhøj

Anders said:
I'm implementing support for disk based caching of binary resources
(blobs) residing in a SQL database. This post is about choosing the
right strategy.

Because of the web environment, there are potentially many concurrent
requests to a resource. I would like to keep the application responsive
(continue to serve requests for resources) while streaming resources to
disk.

The case scenario is a number of concurrent requests (e.g. 10 or 20)
asking for a resource (e.g. 32 MB in size) while the resource has not
yet been cached on disk. To serve each request (keeping the application
responsive), one approach is to start streaming the resource from the
DB to the client request - and simultaneously queue a task to the
thread pool that streams the resource to disk.

I'm thinking about implementing a producer / consumer pattern here; a
producer creates tasks that the consumer picks up and starts streaming
to disk.

Another approach would be to receive the request and check for the file's
existence (using a cache for quick lookups). If not cached, then check
whether a "streaming register" contains information about the file
currently being streamed to disk. If not in the "streaming register",
then queue a task to the thread pool (each worker thread is responsible
for registering / unregistering its streaming process).

Locking semantics should ensure that only a single thread is able to
stream a given file to disk (it makes no sense for two threads to
stream the same file to disk in parallel).

I've already got all the major parts in place:

1. Authentication of requests to resources.
2. Managing http headers (result codes, mime types etc.).
3. Streaming large resources to disk from the DB (files are stored with
a unique identifier (guid) plus a .cache extension).
4. Streaming large resources to a client response from the DB (using the
chunk pattern).
5. Transmitting disk based resources to a client response (using IIS
infrastructure for high performance).
6. Scavenging of disk based resources not requested within a certain
threshold.

I am a bit skeptical about the approach.

1) Is the disk cache really giving a significant performance
improvement? (This should be measured.)

2) If yes, then why not just keep all the files on disk instead
of copying them out when actually used?

3) If you want to proceed, then I think the easiest approach would
be to have the first request read from the DB and write to both the
response and the disk, because then the blob only needs to be read
once.

4) You need to consider how this solution will work in a cluster;
local file caches and local repositories would not scale very well.

5) Are you sure that the security provided by IIS serving files
on disk is sufficient?
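Point 3 (read once, write to both the response and the disk) is essentially
a tee; an illustrative sketch (Python, hypothetical names):

```python
# Read the blob from the DB once and "tee" each chunk to both the
# client response and the cache file.
def stream_and_cache(db_chunks, response, cache_file):
    """Write every chunk to the cache file and to the client response."""
    for chunk in db_chunks:
        cache_file.write(chunk)
        response.write(chunk)
```

In practice the cache side would write to a temporary name and rename on
completion, so a client that aborts mid-stream never leaves a half-written
.cache file behind.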

Arne
 
