High volume data access architecture

Guest

Hi,

We currently have a very high-volume ASP.NET application. The web server is
processing anywhere between 500 and 750 web hits per second, spread across
web services and .aspx pages. For performance reasons, we cache almost
everything in a set of in-memory object caches. We are now looking to scale
out to multiple servers, so we need to start pushing some of the data to a
database (for reasons I can't get into, a persisted session object is not an
option; the data that needs to be shared across servers isn't session
information in any case). The model we are moving to is as follows:

1. We are going to start storing a core set of data in the database,
primarily in 3 tables.

2. These tables are going to be updated directly via our data layer's
ADO.NET calls.

3. We are going to keep an in-memory snapshot of each of these tables on
each server.

4. We are going to have a background thread update this snapshot every 10
seconds (the server app can survive without real-time data; near-real-time
data on a 10-second lag works fine).

5. Because these tables can hold upwards of 10,000 rows, the update thread
is going to do optimized pulls that query only for data changed since the
last update and merge that into the in-memory table. Then, every 60 seconds,
we are going to do a full refresh and reload the entire table. Again, this
process works fine for us; a rough sketch of the refresh loop is below.
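
Something along these lines (the table name, the LastModified column, the
connection string, and the exact timings are placeholders, not our real
schema):

// Rough sketch of the planned refresh thread. "CoreData", "LastModified" and
// the connection string are placeholders.
using System;
using System.Data;
using System.Data.SqlClient;
using System.Threading;

class SnapshotRefresher
{
    private readonly string _connectionString;
    private readonly DataTable _snapshot = new DataTable("CoreData");
    private DateTime _lastPull = DateTime.MinValue;

    public SnapshotRefresher(string connectionString)
    {
        _connectionString = connectionString;
    }

    public void Run()
    {
        int tick = 0;
        while (true)
        {
            if (tick % 6 == 0)
                FullReload();        // every 60 seconds: reload everything
            else
                IncrementalMerge();  // every 10 seconds: pull only the changes

            tick++;
            Thread.Sleep(TimeSpan.FromSeconds(10));
        }
    }

    private void IncrementalMerge()
    {
        DataTable changes = Load(
            "SELECT * FROM CoreData WHERE LastModified > @since", _lastPull);
        lock (_snapshot)              // this locking is exactly the part we're unsure about
        {
            _snapshot.Merge(changes); // needs primary-key info to update existing rows
        }
        _lastPull = DateTime.UtcNow;  // sketch only; ignores clock skew between web and DB box
    }

    private void FullReload()
    {
        DataTable fresh = Load("SELECT * FROM CoreData", DateTime.MinValue);
        lock (_snapshot)
        {
            _snapshot.Clear();
            _snapshot.Merge(fresh);
        }
        _lastPull = DateTime.UtcNow;
    }

    private DataTable Load(string sql, DateTime since)
    {
        using (SqlConnection conn = new SqlConnection(_connectionString))
        using (SqlDataAdapter adapter = new SqlDataAdapter(sql, conn))
        {
            if (sql.Contains("@since"))
                adapter.SelectCommand.Parameters.AddWithValue("@since", since);
            adapter.MissingSchemaAction = MissingSchemaAction.AddWithKey; // bring key info along
            DataTable table = new DataTable("CoreData");
            adapter.Fill(table);
            return table;
        }
    }
}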

So, here's the question and the rub: updates to the data aren't going to be
a problem, because the application will never update the in-memory table
directly. My concern is the reads. While we don't do a lot of updates, the
servers are going to hit these in-memory tables on pretty much every one of
those 750 calls per second. What I need advice on is the best practice for
retrieving information from the table: how do we manage concurrency and
ensure that rows don't get deleted or updated while the application is
reading from these in-memory tables?

Should we create an interim layer that lets the app query the in-memory
table and returns a DataRow? Should it be a copy of the DataRow or the
original? Should it return plain values (not in DataRow form) so we don't
have to worry about access at all? Which parts of the table access,
searching, and DataRow reads should we wrap in lock()? Do we have to lock
the entire table? Etc., etc.

Any suggestions would be incredibly helpful.

thanks,
Jasen
 

David Browne

Jasen said:
[original question quoted in full; snipped]

Instead of updating the existing dataset, copy it (or create a new one),
update that, and then switch it out for the one the clients are reading.
That way sessions continue to have read access to the old data until the new
data is ready, and you cut over with a simple, atomic variable assignment.
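
A minimal sketch of that idea, assuming a static holder class and a
placeholder load routine (SnapshotCache and LoadSnapshotFromDatabase are
made-up names):

// Readers only ever see a fully built DataSet; the cutover is one reference
// assignment, so no reader ever needs a lock.
using System.Data;

public static class SnapshotCache
{
    // volatile so every reader immediately sees the latest published reference
    private static volatile DataSet _current = new DataSet();

    // Called from page/web-service code. Grab it once per request and keep the
    // local reference so the whole request works against one consistent snapshot.
    public static DataSet Current
    {
        get { return _current; }
    }

    // Called only from the background refresh thread (every 10/60 seconds).
    public static void Refresh()
    {
        DataSet fresh = LoadSnapshotFromDatabase(); // build a brand-new DataSet off to the side
        _current = fresh;                           // atomic reference swap
    }

    private static DataSet LoadSnapshotFromDatabase()
    {
        DataSet ds = new DataSet();
        // placeholder: fill the three core tables here with a DataAdapter
        return ds;
    }
}

The old DataSet simply stays reachable until the last request holding it
finishes, and the garbage collector cleans it up; nobody ever blocks.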

David

 

Frans Bouma [C# MVP]

Jasen said:
[original question quoted in full; snipped]

Why are you caching at such a low level? Isn't it more efficient to cache
at a higher level?

Take, for example, this approach: say your hardware/software can render the
whole site in 3 seconds. If you then render it every 4 seconds and cache
everything, it will stay responsive no matter what. The cached results are
served to visitors, and the rendering is done at a scheduled interval. If
the site can't keep up, you enlarge the interval, say to every 5 or 10
seconds.

If you do this modularly, so that you cache individual elements inside a
page, you can also decide which elements to re-render every 10 seconds and
which to re-render every minute.
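
To make that concrete: ASP.NET's built-in fragment caching already works
this way. Give each user control its own cache duration and the runtime
serves the previously rendered HTML until it expires. A rough illustration
(the control names and durations are made up):

// Fragment caching per element: each user control is re-rendered at its own
// interval; everything else is served from already-rendered output.
using System.Web.UI;

// An element that needs to be fairly fresh: re-render at most every 10 seconds.
[PartialCaching(10)]
public class LatestPostsControl : UserControl
{
    // the expensive data lookup and rendering happen here, but only on a cache miss
}

// An element that changes rarely: re-render at most once a minute.
[PartialCaching(60)]
public class SiteNavigationControl : UserControl
{
}

The same thing can be done declaratively with an
<%@ OutputCache Duration="..." VaryByParam="..." %> directive in each .ascx
file.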

This approach is much more efficient because by caching the end result you
also save the data-processing time. This is, for example, how high-profile
sites like Slashdot work (in part).

The main issue you'll run into with caching data in a multi-server setup is
that the cache refresh intervals have to be synchronized; otherwise you can
end up with different values for the same entity in the cached sets on
different servers. When you cache data in the middle tier, you effectively
make the middle tier the home of the application state, which is cumbersome
because that state is then scattered across multiple machines.

Most websites read a lot and write rarely. If yours is such a site, caching
the processed data in the form of rendered elements is more efficient than
any data-level caching scheme: you won't run processing logic against
out-of-sync data, and you won't spend time processing the same data over and
over again only to produce the same rendered output anyway.

FB

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 
