Thoughts/advice on select optimizations

Thread starter: Thomas Nielsen [AM Production A/S]

Thomas Nielsen [AM Production A/S]

I'm looking for thoughts/advice on how to build the fastest search + (paged)
results page in a .NET + SQL Server 2000 environment. I hope this is the
proper group for this; otherwise I apologize for the spam.

I have one table with around 100,000 records containing some text fields.
What I need to do is free-text search through these records and return a
paged result to the client. The user should be able to select which field to
sort by. I'm estimating 10,000 concurrent users on the system.

My initial approach would be to do the initial search once (selecting
records using a full-text index) and throw the result into a TEMP table.
Each results page the user requests would then be taken from this table,
using DataAdapter.Fill(DataSet, startRecord, numberOfRecords) - along the
lines of the sketch below.
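
A minimal sketch of what I mean (C#; the connection, table, column, and
search names are just invented for illustration):

using System.Data;
using System.Data.SqlClient;

class PagedSearch
{
    // Ask for one page of the search result.
    static DataSet GetPage(SqlConnection conn, int startRecord, int pageSize)
    {
        SqlDataAdapter adapter = new SqlDataAdapter(
            "SELECT RecordId, Title, Body FROM Records " +
            "WHERE CONTAINS(Body, @search) ORDER BY Title", conn);
        adapter.SelectCommand.Parameters.Add(
            "@search", SqlDbType.NVarChar, 200).Value = "\"some phrase\"";

        DataSet ds = new DataSet();
        // The open question: does this ship only pageSize rows to the
        // web server, or the whole result set?
        adapter.Fill(ds, startRecord, pageSize, "Results");
        return ds;
    }
}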

I am, however, not sure if this is the right approach. Not using a TEMP
table would probably increase performance on the first results page, but I
would assume the TEMP table decreases system load a lot on subsequent page
requests. Also, I'm not completely sure whether DataAdapter.Fill with
startRecord and numberOfRecords transfers only the selected records to the
web server, or whether the entire result set is transferred to the web
server and the DataSet is then populated with the selected records.

Does anyone have any thoughts on the best way to implement a system like
this?

Cheers,

Thomas
 

Miha Markic

Hi Thomas,

I would use the following method (your Fill won't work anyway, because it
will use the original select to retrieve the data again), but with slight
modifications it could work. Here is the idea:
You do the search, but return only primary key values to a temp table,
assuming the pk is an incrementing integer.

When the user needs data, you would create a select like:

SELECT TOP NumberOfRecords ... FROM ... WHERE PrimaryKey >= FirstPrimaryKey ORDER BY PrimaryKey

where NumberOfRecords is the page size and FirstPrimaryKey is the first pk
to fetch.

And just run adapter.Fill(dataTable) whenever you need to fetch a page of
records - for example:
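
A rough sketch in C# (table and key names are invented; SQL Server 2000's
TOP doesn't accept a parameter, so the page size is concatenated in as an
integer):

using System.Data;
using System.Data.SqlClient;

class KeysetPaging
{
    // Fetch one page by seeking into the keys saved by the initial search.
    static DataTable FetchPage(SqlConnection conn, int firstPrimaryKey, int pageSize)
    {
        SqlCommand cmd = new SqlCommand(
            "SELECT TOP " + pageSize + " r.* " +
            "FROM SearchKeys k JOIN Records r ON r.RecordId = k.RecordId " +
            "WHERE k.RecordId >= @firstPk " +
            "ORDER BY k.RecordId", conn);
        cmd.Parameters.Add("@firstPk", SqlDbType.Int).Value = firstPrimaryKey;

        DataTable page = new DataTable("Page");
        new SqlDataAdapter(cmd).Fill(page); // only pageSize rows cross the wire
        return page;
    }
}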
 

John Kane

Thomas,
100K rows is well within the "sweet spot" for SQL Server 2000 Full-Text
Search (FTS). I'm not sure if this is what you're looking for, but one way
to do it is with FREETEXTTABLE (or CONTAINSTABLE for more specific results):
insert the returned rows (duplicates included) into a temp table, then use
the max RANK value to "summarize" the data down to one row per key, as
follows:

CREATE TABLE #FTSQueryResult (PK_ID int, Rank int)

INSERT #FTSQueryResult (PK_ID, Rank)
SELECT [KEY], [RANK]
FROM FREETEXTTABLE(FTSDash, VarcharCol, '"multi search words phrase"') AS fts

-- summarize down to one row per key using the max RANK value
SELECT PK_ID, MAX(Rank) AS Rank
FROM #FTSQueryResult
GROUP BY PK_ID

Thanks,
John
PS: For FTS related questions the newsgroup to post to is:
microsoft.public.sqlserver.fulltext
 

Anders Borum

Hello!

While this approach is currently the best way to do paging in SQL Server,
you might want to look out for ADO.NET 2.0. It allows you to do paging with
SQL Server without first copying all your rows to a TEMP table.

As far as I remember, it was about a new SqlDataReader object where you
could specify the page size and page position before getting the
SqlDataReader.
 

Miha Markic

Anders Borum said:
While this approach is currently the best way to do paging in SQL Server,
you might want to look out for ADO.NET 2.0. It allows you to do paging with
SQL Server without first copying all your rows to a TEMP table.

ADO.NET 2.0 is more or less a year away though. :-(
 

Thomas Nielsen [AM Production A/S]

Miha,

So if I understand you correctly:

1) Do the search, fill a TEMP table with the result set.
2) Get each results page from the TEMP table - but use SQL logic to retrieve
the page records, rather than ADO's .Fill.

The problem with paging through an incrementing integer list of records is
that it will not work with different sorts. If the user e.g. chooses to
display the search result sorted by something else, I'd have to fire the
original search again.

Thanks for the thoughts,

/Thomas
 

Thomas Nielsen [AM Production A/S]

Hi Anders,

This needs to go live mid-2004, so unfortunately I can't rely on that :(

But it's somewhat surprising to me that ADO.NET doesn't handle this more
elegantly. I mean, it's fine that it's "easy to use" (for the programmer)
on clients connecting to data backends, but if the query is always fired
again, I'd be really worried about the performance implications in a lot of
scenarios.
 

John Kane

Thomas,
I would disagree, and actually so does Microsoft... see
http://www.microsoft.com/sql/techinfo/tips/administration/resultset.asp.
You can use the stored procedures from that article - spGetNextPage and
spGetPrevPage - to handle the paging, and modify them with an extra variable
for the ORDER BY clause to alter the sort order of the results. While the
unique au_id column dictates the order of the results in that example, you
can alter it to use "name", the concatenated author name. And to quote
Microsoft: "Although it is possible to page through result sets using the
ADO Recordset object, that solution is not scalable". This solution is both
scalable and meets your requirement of using SQL FTS, and when combined with
the INSERT #FTSQueryResult code that I posted earlier, you have a viable
solution today and don't have to wait for ADO.NET 2.0...
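
For instance, paging could be driven from C# like this (a rough sketch
only; the @OrderBy parameter is my hypothetical addition - the article's
procs page by au_id alone and take no sort parameter):

using System.Data;
using System.Data.SqlClient;

class ProcPaging
{
    // Call a modified spGetNextPage that also accepts the sort column.
    static DataSet GetNextPage(SqlConnection conn, int pageSize,
                               string lastKey, string orderBy)
    {
        SqlCommand cmd = new SqlCommand("spGetNextPage", conn);
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.Add("@PageSize", SqlDbType.Int).Value = pageSize;
        cmd.Parameters.Add("@LastKey", SqlDbType.VarChar, 40).Value = lastKey;
        cmd.Parameters.Add("@OrderBy", SqlDbType.VarChar, 40).Value = orderBy;

        DataSet ds = new DataSet();
        new SqlDataAdapter(cmd).Fill(ds, "Page");
        return ds;
    }
}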

Regards,
John
 

Baisong Wei[MSFT]

Hi Thomas,

Thank you for using MSDN Newsgroup! It's my pleasure to assist you with
your issue.

As I understand your question, you have a table with over 100,000 records
containing some text fields. You have about 10,000 concurrent users who
will do free-text searches on this table, which can be handled by a
full-text index in SQL Server 2000. The result set will be returned to the
users by your application, which is written with ADO.NET, and of course the
queries are also launched through it.

From my experience, when a user connects to the database, it not only
creates a session but also a server process inside SQL Server (each with a
unique SPID). When a query is executed, the result set is held in memory,
where your application can use it to build the pages. There is no need for
a temp table to save the result set; too many temp tables, especially one
per user, will slow SQL Server down dramatically.

As for the question of DataAdapter.Fill(DataSet, startRecord,
numberOfRecords): it will return only the selected records, not all of the
records, to the web server.

Here are some aspects related to the performance of full-text indexes in
your circumstances:

1) Regarding performance considerations for full-text indexes: because your
table is not too large, it will not sustain heavy read and write activity,
and you have little performance tuning to do. If your table becomes very
large (that is, the number of records grows into the millions), you should
consider configuring your SQL Server and the directory where your catalogs
are stored to maximize disk I/O performance by load balancing across
multiple disk drives.

2) For your hardware, you should consider multiple fast processors and lots
of additional memory, since the number of concurrent users is 10,000. If
you'll be doing a lot of full-text indexing and searching, 1 GB should be
the minimum amount of memory. You'll also want multiple disk controllers with several
channels each, if possible, or, at a minimum, multiple channels on a single
controller. Consider using a very fast disk system with striping and no
fault tolerance to maximize read and write performance.

If you're installing SQL Server on a Windows NT Server, your page file will
need to be 1.5 to 2 times the amount of available physical RAM. This amount
can be reduced on Windows 2000 if you have larger amounts of RAM. The page
file should be placed on its own drive with its own controller.

3) SQL Server Configuration
After a full population of a table with more than a million rows, the
fastest, most efficient method of maintaining your indexes is to use the
change-tracking feature with the background update option. If you must do a
full or incremental population or update the changes in a single operation
rather than in the background, the population should be done during (or
scheduled for) periods of low system activity, typically during database
maintenance windows, if you have them.

When you determine whether to use one or multiple full-text catalogs, you
must take into account both maintenance and searching considerations. There
is a trade-off between performance and maintenance when considering this
design question with large SQL tables, and I suggest that you test both
options for your environment. If you choose to have multiple tables in one
catalog, you'll incur some overhead because of longer running full-text
search queries, but the main penalty will be that any population will take
longer because all tables in the catalog will need to be updated. If you
choose to have a single SQL table per catalog, you'll have the overhead of
maintaining separate catalogs. You also should be aware that there is a
limit of 256 full-text catalogs per machine. The Search service imposes
this limit, and there is only one Search service per machine. The limit is
not related to the number of SQL Server instances you might have.

Hope this answers your question. If you still have questions, please feel
free to post a new message here, and I am ready to help!


Best regards

Baisong Wei
Microsoft SQL Server Online Support
 

Miha Markic

Hi Baisong,
Baisong Wei said:
As for the question of DataAdapter.Fill(DataSet, startRecord,
numberOfRecords): it will return only the selected records, not all of the
records, to the web server.

I tend to disagree on this (I may be wrong, though), or at least I want to
point out the disadvantages.
So, what happens if I want the last ten of 1,000,000 records?
IMO the SQL Server will execute the select and ADO.NET will *fetch* records
(dropping them, i.e. not inserting them into the table) until it reaches
the 999,990th record. The next ten records will be stored in the DataTable.
So, apart from memory consumption (just 10 records), the process is the
same as fetching all 1,000,000 records.

The alternative (using TOP 10) would be much faster in this case - for
example:
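
A quick sketch in C# (table and column names invented):

using System.Data;
using System.Data.SqlClient;

class LastPage
{
    // Fetch the last ten rows directly instead of paging past 999,990.
    static DataTable LastTen(SqlConnection conn)
    {
        SqlDataAdapter adapter = new SqlDataAdapter(
            "SELECT TOP 10 * FROM Records ORDER BY RecordId DESC", conn);
        DataTable page = new DataTable();
        adapter.Fill(page); // only 10 rows are sent to the client
        return page;
    }
}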

Can you correct me if I am wrong?
 

IbrahimMalluf

Hello Miha and Baisong,

Miha, I believe, is quite correct here.

"Select * From Table"
and
"Select Top 10 * From Table where Identity > someNumber" essentially produce
the same execution plan. I say essentially because there is the Top 10
statement in Miha's version but it has a 0% cost in this instance.

So the question of performance shifts to ADO.NET. In Miha's plan, the
network transmission will be less, because only the 10 records will be
transported: SQL Server makes the decision about which rows to retain. With
the Fill method's startRecord and numberOfRecords, that decision has to be
made on the client side, unless Fill is auto-magically changing the
behavior of the SelectCommand's stored procedure or select statement. The
data is still transported, but discarded except for the designated rows. In
result sets of fewer than 10-100K rows this might not be a problem, but
when we get into the range of a million or more, performance should
noticeably deteriorate.

Maybe Miha and I have the wrong understanding of how startRecord and
numberOfRecords work. Could you elaborate?



--
Ibrahim Malluf
http://www.malluf.com
==============================================
MCS Data Services Code Generator
http://64.78.34.175/mcsnet/DSCG/Announcement.aspx
==============================================
 

Baisong Wei[MSFT]

Hi Miha, Ibrahim,

Thank you for using MSDN Newsgroup! It's my pleasure to assist you with
your issue.

As to Miha's question ('if I want the last ten records of 1,000,000
records' and 'fetch records until it reaches the 999,990th record'):
actually, that describes a table scan, not a query that uses an index. When
there is no index on a table and a query wants to get records from it, it
just scans the table to find the records that match the WHERE clause. But
when an index is established on the table, the way of getting the records
changes completely. Whether it is a clustered or a non-clustered index, all
indexes are stored in a B-tree structure, and you get a record by searching
the B-tree from root to node to leaf. At the leaf of a clustered index's
B-tree you find the data itself, while at the leaf of a non-clustered
index's B-tree you get a bookmark to the record, which is the RowID. That
means that if a table is indexed, you can get any record in the table at
the same cost, seeking from root to node and then to the leaf. The
performance of the query depends on whether the index you established is
appropriate (the selectivity of the records, etc.) and whether it is a
covering index for the query - a big topic to explain. I just want to say
that an index exists to avoid the table scan you mentioned. So selecting
the top 10 is no faster than selecting the last 10 records.

As to Ibrahim's concern, "Select * From Table" vs. "Select Top 10 * From
Table where Identity > someNumber", I ran simple tests in Query Analyzer:

1)
use pubs
go
select * from authors
go

2)
use pubs
go
select top 10 * from authors
go

They both use a clustered index scan, but the second one's estimated cost
is lower than the first's, because it only gets 10 records from the table.
Since the authors table is not large, there is no I/O difference in this
test; but if the table is pretty large, the cost of SELECT * will be much
larger than that of SELECT TOP 10. The '0% cost' of the TOP 10 that you
mentioned in the query plan is because it uses a clustered index scan, so
it takes little cost.

As to Ibrahim's other concern, I'd put it this way: when a query is
executed by SQL Server, a result set is created consisting of all the
records returned for that query. The adapter will use this result set and,
according to your programming, fill some of the records from the result set
into the DataSet, and then return part of the results to the client. As for
your concern that the result set may be too large: SQL Server will use its
assigned physical and virtual memory to hold it. It is rare for a query to
return a huge record set to the client; even in an OLAP system, the server
works on the records and returns just some statistics as a result set to
the client. If a result set is too huge, the application or the query needs
to be redesigned, since there is seldom a reason to return so many records
for a client to deal with; in a multi-tier architecture, that work should
run on the server side. On the SQL Server side, it is not advisable to
create too many temp tables or cursors unless the convenience outweighs the
performance impact, since they take resources (such as memory) from the
server. So, for so many concurrent users with many different queries, it is
really not a good idea.

Hope this answers your question. If you still have questions, please feel
free to post a new message here, and I am ready to help!


Best regards

Baisong Wei
Microsoft SQL Server Online Support
 

Miha Markic

Hi Baisong,

Baisong Wei said:
[...] I just want to say that an index exists to avoid the table scan you
mentioned. So selecting the top 10 is no faster than selecting the last 10
records.

That's on the server side, yes, no question there.
But our concern is with the ADO.NET Fill method.
Baisong Wei said:
[...] The adapter will use this result set and, according to your
programming, fill some of the records from the result set into the DataSet,
and then return part of the results to the client.

Now, this is the question: how exactly does the DataAdapter do this?
It was my understanding that it simply *reads* records and throws them away
until the desired page is found; then it stores the page-size rows into the
DataTable and returns.
If it reads records that way, it will have to *read* the 999,990 records
plus the 10 wanted ones, producing heavy network traffic for just 10
records.

Can you confirm or deny this?
 

Yan-Hong Huang[MSFT]

Hi All,

In ADO.NET programming, default paging is easy to add using the DataGrid's
built-in support for paging. With default paging, all of the data is loaded
into the DataSet, but only a page of data is displayed in the DataGrid. So
if many records need to be retrieved, it is not very efficient.

A more efficient method is to use "custom paging". In a nutshell, custom
paging is implemented using carefully constructed SQL SELECT statements
with a sort index, a corresponding sort order, and SQL 7's TOP syntax. In
custom paging, only a page of data is loaded into memory. The current page
index must be manually persisted to the ViewState (if you are using
ASP.NET).
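
Roughly like this in C# (a sketch only, with invented names; the last
sort-key value seen would be kept in ViewState between requests):

using System.Data;
using System.Data.SqlClient;

class CustomPaging
{
    // Load exactly one page, seeking past the last sort-key value shown.
    static DataTable NextPage(SqlConnection conn, string lastTitle, int pageSize)
    {
        SqlCommand cmd = new SqlCommand(
            "SELECT TOP " + pageSize + " RecordId, Title " +
            "FROM Records WHERE Title > @lastTitle ORDER BY Title", conn);
        cmd.Parameters.Add("@lastTitle", SqlDbType.NVarChar, 200).Value = lastTitle;

        DataTable page = new DataTable();
        new SqlDataAdapter(cmd).Fill(page);
        return page;
    }
}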

Does that answer your question?

Best regards,
Yanhong Huang
Microsoft Community Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
 

Miha Markic

Hi Yan-Hong

Yan-Hong Huang said:
Does that answer your question?

Actually, no.
I still don't have *confirmation* of how the DbDataAdapter.Fill method
works when passed a start record number and page size.
The question is: does it *fetch* all records, counting from 0 to
startRecordNumber + pageSize, to the client (just not storing those prior
to startRecordNumber), or does it use some better method?
The problem lies in the network traffic (whether it fetches 999,990 records
without any real need or not).
 

IbrahimMalluf

Hello Miha .....:-D


Actually, Yan-Hong did answer our question. He said:

>>> With default paging, all of the data is loaded into the DataSet, but
only a page of data is displayed in the DataGrid. So if many records need
to be retrieved, it is not very efficient. <<<

So in fact, all 999,990 rows would be transported to the DataSet, but only
the rows we wanted would be available. So yes, network traffic would be
affected, not to mention the workstation's resources.


--
Ibrahim Malluf
http://www.malluf.com
==============================================
MCS Data Services Code Generator
http://64.78.34.175/mcsnet/DSCG/Announcement.aspx
==============================================
 

Miha Markic

Hi,

IbrahimMalluf said:
Hello Miha .....:-D

Actually, Yan-Hong did answer our question. He said:
>>> With default paging, all of the data is loaded into the DataSet, but
only a page of data is displayed in the DataGrid. So if many records need
to be retrieved, it is not very efficient. <<<

This is not an entirely clear statement, though. But I figure you are right.

IbrahimMalluf said:
So in fact, all 999,990 rows would be transported to the DataSet, but only
the rows we wanted would be available. So yes, network traffic would be
affected, not to mention the workstation's resources.

Yup, as we suspected.
 
