sliced find in Linq

  • Thread starter Thread starter Andrus
  • Start date Start date
so I assume it has an appropriate primary key.

Clustered primary key on CustomerId [nchar(10) NOT NULL]

Constraint_type constraint_name delete_action update_action
status_enabled status_for_replication constraint_keys

PRIMARY KEY (clustered) PK_Customers (n/a) (n/a) (n/a) (n/a)
CustomerID
 
I agree, it should be in the provider level. In this specific case,
against a SQL Server database, where, if order is not defined, the order is,
well, undefined.

Given that, I definitely think it should be in the provider for SQL
Server.

For LINQ to objects I can see why you wouldn't want to enforce ordering,
since it might be guaranteed that an IEnumerable implementation will enforce
an order on every iteration (and in most cases, the ordering is in fact, the
same).

Did anyone try this with an order statement, to see if that was included
in the ROW_NUMBER ranking function, so that at least if you wanted to
specify ordering, it would be honored?
 
Did anyone try this with an order statement
Yes - see one of my earlier posts (orderby CompanyName):

SELECT TOP 10 {blah}
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY [t0].[CompanyName]) AS
[ROW_NUMBER], {blah}
FROM [dbo].[Customers] AS [t0]
INNER JOIN [dbo].[Orders] AS [t1] ON [t0].[CustomerID] =
[t1].[CustomerID]
WHERE [t1].[ShipCountry] = @p0
) AS [t2]
WHERE [t2].[ROW_NUMBER] > @p1
ORDER BY [t2].[CompanyName]
 
Marc said:
take the 2nd page in a query on northwind employee filtered on order
I went for customers, but same horse...
The query below is deliberately not particularly well-done, so no
jubes please! So basically - some things will work on SQL Server
2000, but you may wish to limit yourself to simple paged operations.
I didn't want to start getting into workarounds for this specific
query (perhaps a distinct field or view etc) - the more important
point is "tread carefully" - or upgrade to SQL Server 2005 if you
want to do anything non-trivial with LINQ.

C#/LINQ:
string country = "France";
var query = from c in ctx.Customers
join o in ctx.Orders on
c.CustomerID equals o.CustomerID
where o.ShipCountry == country
orderby c.CompanyName
select c;
query = query.Skip(10).Take(10);

Using SQL 2000:
[Microsoft SQL Server 2000 - 8.00.760 (Intel X86)
Standard Edition on Windows NT 5.0 (Build 2195; Service Pack 4)]

Throws NotSupportedException:
<q>This provider supports Skip() only over ordered queries returning
entities or projections that contain all identity columns, where the
query is a single-table (non-join) query, or is a Distinct, Except,
Intersect, or Union (not Concat) operation.</q>

Strange isn't it?

Because all it takes is this:
1) create temptable with the same columns as the resultset + 1 identity
column, which is the field we'll be selecting on
2) use an INSERT INTO #temptable <normal select> query where you insert
a TOP (pagesize * (pagenumber+1) into the SELECT, so you only insert
the data you need at most, not the big resultset
3) issue a SELECT over the temptable which filters on the identity
column to select the page.
4) optional: drop the temptable.

This isn't that slow as Microsoft wants you to believe, because if it
would, a lot of operations in the RDBMS would be 'terribly slow'
because a lot of operations use temptables under the hood.

One could optimize this perhaps with an in-memory table type though
then support for sqlserver 7 will be out the window.

The thing is that this is generic code, so you can create a routine
which you send a SQL query to and it can create the SQL statements for
you, all batched into 1 statement to send to the DB.
Using SQL Express 2005:
[Microsoft SQL Server 2005 - 9.00.3042.00 (Intel X86)
Express Edition on Windows NT 5.1 (Build 2600: Service Pack 2)]

SELECT TOP 10 [t2].[CustomerID], [t2].[CompanyName],
[t2].[ContactName], [t2].[ContactTitle], [t2].[Address], [t2].[City],
[t2].[Region], [t2].[PostalCode], [t2].[Country], [t2].[Phone],
[t2].[Fax] FROM ( SELECT ROW_NUMBER() OVER (ORDER BY
[t0].[CompanyName]) AS [ROW_NUMBER], [t0].[CustomerID],
[t0].[CompanyName], [t0].[ContactName], [t0].[ContactTitle],
[t0].[Address], [t0].[City], [t0].[Region], [t0].[PostalCode],
[t0].[Country], [t0].[Phone], [t0].[Fax] FROM [dbo].[Customers] AS
[t0] INNER JOIN [dbo].[Orders] AS [t1] ON [t0].[CustomerID] =
[t1].[CustomerID] WHERE [t1].[ShipCountry] = @p0 ) AS [t2]
WHERE [t2].[ROW_NUMBER] > @p1 ORDER BY [t2].[CompanyName]
------------------------------- @p0 [String]: 'France' @p1 [Int32]: 10

I find it interesting why they didn't go for WITH to create a CTE.

Btw, I find the answer from the MS employee on your post in the Linq
forum not that great, it sounds like an excuse to not have to support
SQLSERVER 2000 that much, as that would hurt sales.

FB

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 
Nicholas said:
I've been thinking about this a lot, and I have to say, I really
dislike the implementation of skip and take on the T-SQL side.

The problem I have is that the T-SQL side, the ROW_NUMBER ranking
function is being used to produce an ordered result, and that order
is used to determine what is taken, and what is skipped.

The reason I have a problem with this is that the ROW_NUMBER function
requires an order, an order that is not expressed on the .NET side of
things. It just orders by the fields as they occur in the database.
While it's great that they are going to do this consistently, it's
something that just shouldn't be allowed when there is not an order
operation on the query itself. Without an order specified, in SQL
(as it should be in LINQ, I believe) the result should be
indeterminate, which would make Take and Skip worthless.

ROW_NUMBER requires an ordering, though you can do:
WITH __actualSet AS
(
SELECT *,
ROW_NUMBER() OVER (ORDER BY CURRENT_TIMESTAMP) AS __rowcnt
FROM
(
-- the query to page here
) AS _tmpSet
)
SELECT * FROM __actualSet
WHERE [__rowcnt] > @rownumberLastRowOfPreviousPage
AND [__rowcnt] <= @rownumberLastRowOfPage
ORDER BY [__rowcnt] ASC

(ref:
http://weblogs.asp.net/fbouma/archive/2007/06/05/sqlserver-2005-paging-t
here-is-a-generic-wrapper-query-possible.aspx

why they didn't use this is beyond me.

FB

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 
Nicholas said:
I agree, it should be in the provider level. In this specific case,
against a SQL Server database, where, if order is not defined, the
order is, well, undefined.

actually, by definition, paging over an unordered set is meaningless,
as it's undefined what's in page n+1.

The lack of specifying an ordering in the main query comes from the
perception that a SELECT statement is an ordered set, but that's by
definition not true: SELECT has no defined order unless an ordering has
been specified, so if you specify SELECT without an order by, the row
order you'll receive is by definition undefined.

Using TOP or a paging construct over such a set is then thus actually
bogus, because it's undefined what rows will be returned.

In fact, it's that bad that it might be better to throw an exception
if there's no ordering specified when either skip or take has been
specified, because it's unclear what will be returned at runtime which
could lead to unclear bugs.

FB

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 
Carl said:
Yes, of course it does. But in this case, the query as stated in C#
didn't specify the ordering, so any ordering could have been used.
It appears that LINQ chose to use the most expensive ordering
possible, when it should use the least expensive when no ordering was
specified in the original expression. Hence my question to Jon - did
this table have a primary key? If it didn't (and no unique keys
either), then this is the best LINQ could do, but if any unique keys
exist, LINQ should use one of those - perhaps it does, but I can't
tell from this example.

As I explained in another post in a different branch in this thread
;), paging or using TOP over a query without an ordering is
meaningless, as SELECT has by definition no specified order, it's
undefined: it's not defined in which order the rows will be returned to
you. So using TOP or a paging construct over such a set which isn't
ordered by a specified ordering will return a rowset that's by
definition undefined as you don't know the order of the set you page
over / limit the resultset on. :)

A lot of developers have the perception that SELECT has always the
same order, but that's not true if there's no ORDER BY clause: the
RDBMS can re-order rows because of mempage constraints or other
internal housekeeping.

FB

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 
actually, by definition, paging over an unordered set is meaningless,
as it's undefined what's in page n+1.

I agree with you (see my other posts) - but /technically/ even then an
order clause (by itself) is not really enough to guarantee anything -
unless that order clause includes something unique, such as a primary
key. Of course there may-or-may-not be a primary key, and there may-or-
may-not be a clustering, and the primary key may-or-may-not be itself
clustered. Keeping it simple has advantages, but it would be tempting
to see if any sensible and simple rules (as in: more than all columns
in order) could be proposed for an implicit ordering... but in reality
in most cases it isn't going to matter much - so perhaps if it *does*
matter, the dev should remember to declare an order...

I think I've talked myself full-circle there, but that's life ;-p
 
2) use an INSERT INTO #temptable <normal select> query where you insert
a TOP (pagesize * (pagenumber+1) into the SELECT, so you only insert
the data you need at most, not the big resultset
A small note here is that this gets expensive when the user clicks on
the "last page" (page 11234 of 11234) button; I know that you know
this - I'm just saying...
4) optional: drop the temptable.
In a stored-procedure, yes - but isn't that mandatory in a flat
command text approach?
perhaps with an in-memory table type
Generally, yes; but for larger volumns sometimes temp-tables are
preferable. Depends on the scenario, so very hard to write a generic
handler...
I find it interesting why they didn't go for WITH to create a CTE
Yes; even when you can do the same thing either way, the WITH syntax
is often easier read than a nested query. Perhaps the SQL optimiser
prefers it? Dunno.
Btw, I find the answer from the MS employee on your post in the Linq
forum not that great
Well, maybe; but getting a statement of the differences is a step in
the right direction. But yes; the fact that it breaks (rather than
working sub-optimally via the #table approach) is a little surprising.
 
Marc said:
A small note here is that this gets expensive when the user clicks on
the "last page" (page 11234 of 11234) button; I know that you know
this - I'm just saying...

Yes, that's a problem.
However, a developer of software which offers thousands of pages of
result to a request should ask him/herself this: will the user traverse
tens of thousands or even millions of rows manually? No, in almost all
cases, the user won't. So offering thousands of pages is likely not
going to help much. ;)

BUt I agree, it's worth mentioning and can bring down a server
unfortunately. However, as SqlServer is the only RDBMS system without
proper keywords in the SELECT statement for skip/take/paging, I think
it's more a problem of the vendor of the db system. E.g. picking oracle
or even mysql won't give you this problem ;) (ok, also not linq to sql
;)
In a stored-procedure, yes - but isn't that mandatory in a flat
command text approach?

Temptables are dropped at the connection closing statement. This might
be right after the command, but it can also be that there will be other
commands over the same connection. This means that the temptable used
will occupy space in the tempdb catalog's data file. So dropping it
actively after the usage of the fetch could help performance :)
Generally, yes; but for larger volumns sometimes temp-tables are
preferable. Depends on the scenario, so very hard to write a generic
handler...

yes, if you want truly the best performance for every scenario, you're
not going to cut it with temptables. They're the 'best' choice for a
generic scenario though, but not ideal...

FB


--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Back
Top