LinqToSql Paging problem (bug?) with Skip and Take

D

d-42

Hi,

I encountered a problem with a query of the following form:

var q2 = from p in Persons.OfType<Cust>
where p.AccountID == accid
where !(from x in PersonDistLocationMatch
where x.LocationID == locationid
select x.PersonID).Contains(p.PersonID)
select p;

The query itself works fine. It returns the customers belonging to
account accid that do not have an association to a particular
location.

The problem is during paging when I apply skip / take: e.g.:

q2 = q2.Skip(startrow).Take(maxrows)

// for example the function call executes
q2 = q2.Skip(0).Take(8) //for page 1
q2 = q2.Skip(8).Take(8) // for page 2 etc

The resulting sql queries linq generates have an unstable sort order.
For the first 'page' where skip is 0 it generates:

SELECT TOP (8) [t0].[personid] AS [PersonID], [t0].[namefirst] AS
[NameFirst], [t0].[namelast] AS [NameLast], [t0].[distaccountid] AS
[DistAccountID], [t0].[personrole] AS [Role]
FROM [dbo].[Persons] AS [t0]
WHERE (NOT (EXISTS(
SELECT NULL AS [EMPTY]
FROM [dbo].[PersonDistLocationMatch] AS [t1]
WHERE ([t1].[personid] = [t0].[personid]) AND ([t1].
[distlocationid] = @p0)
))) AND ([t0].[distaccountid] = @p1) AND ([t0].[personrole] = @p2)

For subsequent pages where Skip is non-zero it generates:

SELECT [t2].[personrole] AS [Role], [t2].[personid] AS [PersonID],
[t2].[namefirst] AS [NameFirst], [t2].[namelast] AS [NameLast], [t2].
[distaccountid] AS [DistAccountID]
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY [t0].[personid]) AS
[ROW_NUMBER], [t0].[personrole], [t0].[personid], [t0].[namefirst],
[t0].[namelast], [t0].[distaccountid]
FROM [dbo].[Persons] AS [t0]
WHERE (NOT (EXISTS(
SELECT NULL AS [EMPTY]
FROM [dbo].[PersonDistLocationMatch] AS [t1]
WHERE ([t1].[personid] = [t0].[personid]) AND ([t1].
[distlocationid] = @p0)
))) AND ([t0].[distaccountid] = @p1) AND ([t0].[personrole] =
@p2)
) AS [t2]
WHERE [t2].[ROW_NUMBER] BETWEEN @p3 + 1 AND @p3 + @p4
ORDER BY [t2].[ROW_NUMBER]

The problem is that this query returns the rows in a different order
than the previous one. Which means that some of the records displayed
on the first page are displayed again on subsequent pages, and other
records are missing entirely.

I appear to have worked around it by defining the original query as:

var q2 = from p in Persons.OfType<Cust>
where p.AccountID == accid
where !(from x in PersonDistLocationMatch
where x.LocationID == locationid
select x.PersonID).Contains(p.PersonID)
orderby p.personid
select p;

which forces a sort on the query and stabilizes the row order.

Should it be considered a bug that skip / take do not generate stable
sort orders on their own? Or was it my error not to have enforced a
sort order?

-best regards
Dave
 
M

Marc Gravell

Should it be considered a bug that skip / take do not generate stable
sort orders on their own?

Well, without a stated order, any standard SQL query is of
indeterminate order, and thus in theory unstable. In reality, most
databases will (as an implementation detail) honour a clustered index
as a default sort [assuming it can pick one from the numerous tables
involved], but even this isn't typically guaranteed. In reality, you
should always specify an order condition if you are paging.

Marc
 
D

d-42

Should it be considered a bug that skip / take do not generate stable
sort orders on their own?

Well, without a stated order, any standard SQL query is of
indeterminate order, and thus in theory unstable. In reality, most
databases will (as an implementation detail) honour a clustered index
as a default sort [assuming it can pick one from the numerous tables
involved], but even this isn't typically guaranteed. In reality, you
should always specify an order condition if you are paging.

Marc

Thanks Marc for your reply and I don't disagree with what you are
saying, but its more subtle an issue than that.

Specifying a mere 'order condition' isn't sufficient. If I were to
sort by last name and there were multiple people with the same last
name I don't want the rows with the same last name trading places
between queries. Even if I were to sort by the primary key of a parent
in a join query, I'll have to contend with unstable row order.

Requiring that I define my queries so that row order is 100%
determinate in order to use paging is a pretty serious burden, and if
skip and take really require that effort in order satisfy being
reliable then they should really generate row-order preserving sql
themselves.

And, if you take a look at the sql its generating, they DO do this!!
Take a look at the second sql example I provided. Linq is clearly
generating row-number stable ordering on the fly without my help, and
that's a good thing.

The problem, is that, for page 1, skip is 0, and the query is
processed as if the skip wasn't applied at all, all the row numbering,
and order stabilizing stuff is optimized out.

Skip(0) should generate the same row numbering sql that skip(n) does
so that paging systems can rely on the first page of a paged view
operates on the same query, with the same row order that will be used
for the other pages.

Skip(n) imposes a consistent row order. Skip(0) should use the same
row order.

The more I think about it, the more I think optimizing skip(0) out of
the query completely is a bug.

-regards,
Dave
 
F

Frans Bouma [C# MVP]

d-42 said:
Should it be considered a bug that skip / take do not generate
stable sort orders on their own?

Well, without a stated order, any standard SQL query is of
indeterminate order, and thus in theory unstable. In reality, most
databases will (as an implementation detail) honour a clustered
index as a default sort [assuming it can pick one from the numerous
tables involved], but even this isn't typically guaranteed. In
reality, you should always specify an order condition if you are
paging.

Marc

Thanks Marc for your reply and I don't disagree with what you are
saying, but its more subtle an issue than that.

Specifying a mere 'order condition' isn't sufficient. If I were to
sort by last name and there were multiple people with the same last
name I don't want the rows with the same last name trading places
between queries. Even if I were to sort by the primary key of a parent
in a join query, I'll have to contend with unstable row order.

Requiring that I define my queries so that row order is 100%
determinate in order to use paging is a pretty serious burden, and if
skip and take really require that effort in order satisfy being
reliable then they should really generate row-order preserving sql
themselves.

By definition 'SELECT' returns an unordered set, as SQL is set based,
and sets have no order. If you want an order in the returned set, you
have to define the ordering, with ORDER BY. If that results in multiple
rows with the same value, again, the order is undetermined.

So you can jump up and down that it is a burden and all, fact is that
you didn't provide a proper ordering of the set you wanted to return,
and therefore the RDBMS has to pick an ordering ;). Mind you, ANY
ordering it chooses IS an ordering, so therefore you have to provide it
yourself.

=> And, if you take a look at the sql its generating, they DO do this!!
Take a look at the second sql example I provided. Linq is clearly
generating row-number stable ordering on the fly without my help, and
that's a good thing.

It can't page otherwise. The rownumber is required to determine which
rows are in which page to return. though, the order in which the rows
are in that set with rownumber is determined based on the ordering in
the query! The rownumbers are added to the final, ordered set. So if in
THAT set there are some rows with undeterminable ordering (same value
in the order column(s)) it can be the order between queries flips.

Normally you won't see this, but you WILL if sqlserver can optimize
something internally with memory pages. Then the order will be
different in each query because it already has some rows in memory.
The problem, is that, for page 1, skip is 0, and the query is
processed as if the skip wasn't applied at all, all the row numbering,
and order stabilizing stuff is optimized out.

Skip(0) should generate the same row numbering sql that skip(n) does
so that paging systems can rely on the first page of a paged view
operates on the same query, with the same row order that will be used
for the other pages.

Skip(n) imposes a consistent row order. Skip(0) should use the same
row order.

The more I think about it, the more I think optimizing skip(0) out of
the query completely is a bug.

No, you think that the ordering you see in the rownumber column is a
valid order, but that's applied TO the ordered set which happens to be
NOT properly ordered! :).

Take northwind. Sort customers on country, then page over customers
with 10 rows a page. You'll see that rows aren't in the proper order
sometimes. This is because there are more customers per country. So you
have to tell the RDBMS in which order the customers have to be placed
which have the same country value, e.g. customerid, or city.

FB

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 
M

Marc Gravell

One workaound I can think of... LINQ is composable... you could, for
example, create a property on the data-context such as:

IOrderedQueryable<Foo> OrderedFoos {
return from foo in foos
order by A,B,C,D
select foo;
}

and then you can use OrderedFoos in multiple queries without having to
re-specify it. The only question would be: what happens if you do
OrderedFoos.OrderBy(...); in order to be stable, I'd hope that this
essentially does "order by new order, then by old order" (and that is
how I have done it in a LINQ [to objects] implementation I've done) -
but I don't know whether it simply throws away the old order.

Marc
 
D

d-42

One workaound I can think of... LINQ is composable... you could, for
example, create a property on the data-context such as:

IOrderedQueryable<Foo> OrderedFoos {
return from foo in foos
order by A,B,C,D
select foo;

}

and then you can use OrderedFoos in multiple queries without having to
re-specify it. The only question would be: what happens if you do
OrderedFoos.OrderBy(...); in order to be stable, I'd hope that this
essentially does "order by new order, then by old order" (and that is
how I have done it in a LINQ [to objects] implementation I've done) -
but I don't know whether it simply throws away the old order.

Marc

Yes, I can see how that should work.

My issue/question isn't so much how to satisfy the ordering
requirement of paging with skip/take, but rather to suggest that skip/
take should throw an error if the ordering requirement isn't
satisfied, or better still, satisfy the ordering requirement with the
SQL queries it generates, by including for example, default orderby
clauses like the one in your workaround.

Thanks,
Dave
 
D

d-42

By definition 'SELECT' returns an unordered set, as SQL is set based,
and sets have no order.

And by definition skip()/take() operate on an ordering. Its
meaningless to skip the "first" row, if none of the rows are "first".
So somewhere between the select and the skip/take an ordering is going
to be imposed on the set.

The question is "what order?"
If you want an order in the returned set, you
have to define the ordering, with ORDER BY. If that results in multiple
rows with the same value, again, the order is undetermined.
So you can jump up and down that it is a burden and all, fact is that
you didn't provide a proper ordering of the set you wanted to return,
and therefore the RDBMS has to pick an ordering ;). Mind you, ANY
ordering it chooses IS an ordering, so therefore you have to provide it
yourself.

The clear intention of the developer using skip/take is to page. The
semantics of that are that if the developer makes repeated queries for
the same page of he expects to get the same rows. And that if he
traverses the set by paging, he will get different records on each
page. And if he has specified a full or partial ordering, that should
be preserved in the pagination.

Linq should satisfy those semantics.

Why do I have to provide an ordering myself? If linq skip/take needs a
stable deterministic ordering to satisfy the semantics and intention
of a developer using skip/take, then linq/SQL should generate sql that
provides one. I should be free to provide my own, or a partial
ordering if I care too, but linq should be responsible for living up
to the paging semantics of skip/take even without my help.
It can't page otherwise.

If it can't page then it shouldn't page. And it absolutely shouldn't
return random rows from the queried set, and tell me its 'paging'.

So does Linq provide "paging" or does it just provide some sql
constructs that might be used in concert with carefully thought out
row ordering to achieve paging semantics?

Reading msdn, and the thousand blogs on linq would give one the
distict impression that linq provides "paging"... after all even Scott
Guthrie wrote on his blog:

----
"The code below demonstrates how to implement efficient server-side
database paging as part of a LINQ query. By using the Skip() and
Take() operators below, we'll only return 10 rows from the [Northwind]
database - starting with row 200.

var products = (from p in db.Products where
p.Category.CategoryName.Startswith("C") select p).Skip(200).Take(10);"
----

Its not really 'efficient server side paging', if it really just
returns 10 random rows from the that start with 'C', which strictly
speaking, is apparently all it does. Clearly that wasn't Scott's or
any other developers intent.

That's my 'issue'. I've seen literally dozens of similar samples of
implementing paging explained like this, or databound to gridviews, or
any of a hundred other permutations. And probably not one of them goes
went out of its way to mention that if you don't provide a
deterministic ordering, that your paging is essentially just randomly
selecting rows from the query set to show you on each 'page'.

Obviously the developer intention with skip/take is to page, and an
AWFUL lot of developers out there believe Scott's code snippet above
will result in 'efficient server side paging'.

In researching this, I came upon one of your own posts in a discussion
about the order of skip and take in a query. You said:

"You should look less at specs and more at what users WANT TO DO. Time
and time again you post a reply with how the syntax is, how the specs
says how it works etc. etc. All fine, but that's completely not hte
point here: the point is that the user wants to PAGE through a
resultset. ..."

In your case you are talking about the order of skip and take in the
statement. But really, I think we're making the same underlying
point. That being: "When the developer uses skip and take the
developers intention is to page, so Linq *should* page."

Or perhaps we need a new extension method: Page(skipqty, takeqty) that
DOES provide the semantics we all are looking for.

cheers,
Dave
 
F

Frans Bouma [C# MVP]

d-42 said:
And by definition skip()/take() operate on an ordering. Its
meaningless to skip the "first" row, if none of the rows are "first".
So somewhere between the select and the skip/take an ordering is going
to be imposed on the set.

The question is "what order?"

I agree with the fact that skip/take for paging isn't the brightest
combination in the linq design. The main issue is that linq is based on
sequences, which always have an order. So you can just define a paging
operation with just skip/take, which compiles fine but it could give
problems at runtime.

However, on the other hand... :
var q = (from c in nw.Customers
select c).Skip(10).Take(10);

what's the order in which the customers are returned in q ? That's the
order in which they're stored in the sequence nw.Customers.

However, in a database that's undefined. As I said in my post: they
are in _an_ order, as they're returned one by one, but it's potentially
different every time the query is ran.

And THATS the problem. So skip/take operate on sequences and in
theory, you can see a set returned from a sql query as a sequential set
(as there's a first row, a second row etc.), so skip/take can be used
on any sql query. HOWEVER, a PAGING query executes the same root query
(i.e. the query which produces the total resultset in which you're
paging) for every requested page.

As said above: every time the root query produces a set, in sequential
order, which has a first row, second row etc. but it's undefined WHICH
row that will be as, without an order by, there's no ordering defined
which is PREDICTABLE.

Back to the linq provider: using skip/take without orderby gives an
exception. This is to prevent that your application keeps on working
and produces WRONG results.
The clear intention of the developer using skip/take is to page. The
semantics of that are that if the developer makes repeated queries for
the same page of he expects to get the same rows. And that if he
traverses the set by paging, he will get different records on each
page. And if he has specified a full or partial ordering, that should
be preserved in the pagination.

yeah, but you're using a database, so you can demand what you want but
you'll get what a database will deliver. Without specified ordering,
the returned resultsets are ordered, but not in a predictable way.
Linq should satisfy those semantics.

linq does satisfy that. You just get a different ordered set back
every time the root query is executed. Though that set is ordered.
THough not in the ordering YOU EXPECT.

So if you want a set to be ordered in an ordering you expect, you
should specify it. As I illustrated above with the example query: the
nw.Customers sequence in the linq query has an order, but not the one
you might expect. You therefore need to specify the ordering YOU
EXPECT. The code can't decide for you what ordering to expect.

And before you give the naive solution to go for ordering on PK: what
if the paged set doesn't have a PK field?
Why do I have to provide an ordering myself? If linq skip/take needs a
stable deterministic ordering to satisfy the semantics and intention
of a developer using skip/take, then linq/SQL should generate sql that
provides one. I should be free to provide my own, or a partial
ordering if I care too, but linq should be responsible for living up
to the paging semantics of skip/take even without my help.

var q = (from c in nw.Customers select new { c.Country,
c.City}).Skip(10).Take(10);

What ordering should it use? Don't assume the tuple countr-city is
unique! So duplicates arise.

The PK on customers? What if customers doesn't have a pk? Or the set
to page is a query on a projected set joined with another projected set?
If it can't page then it shouldn't page. And it absolutely shouldn't
return random rows from the queried set, and tell me its 'paging'.

it throws an exception, so it doesn't page. At least some providers do
throw an exception. Others don't and move on to page the data. You see:
you are a developer. So if you think you are good enough to do your
job, you are good enough to understand what you're dealing with :) In
my country,city query above, what's the order of the returned data? You
can't know nor do you know an ordering to pick, as there's none suiting
the resultset.
So does Linq provide "paging" or does it just provide some sql
constructs that might be used in concert with carefully thought out
row ordering to achieve paging semantics?

'linq' doesn't provide anything. The Linq provider does.
Reading msdn, and the thousand blogs on linq would give one the
distict impression that linq provides "paging"... after all even Scott
Guthrie wrote on his blog:

----
"The code below demonstrates how to implement efficient server-side
database paging as part of a LINQ query. By using the Skip() and
Take() operators below, we'll only return 10 rows from the [Northwind]
database - starting with row 200.

var products = (from p in db.Products where
p.Category.CategoryName.Startswith("C") select p).Skip(200).Take(10);"
----

Its not really 'efficient server side paging', if it really just
returns 10 random rows from the that start with 'C', which strictly
speaking, is apparently all it does. Clearly that wasn't Scott's or
any other developers intent.

Let's give another example:
SELECT TOP 10 c.Country, c.City
FROM Customers c INNER JOIN Orders o
ON c.CustomerID = o.CustomerID
WHERE o.EmployeeID > 3

Which countries and cities are returned? It coult be just 1 country
and 1 city. The join results in duplicates. Also which country and city
? Unclear.

The internet is full with people complaining about this, but it's
actually pretty hard defined in how databases work: the INTENT to get
the first 10 country-city tuples from customers is based on a
definition what the order is: otherwise what does 'first 10' mean? The
first random 10? Or the first 10 from a set with a defined ordering (so
first/last etc. is defined) ?

Also, in the above query, the join results in duplicates, so
'DISTINCT' is required. if someone starts whining WHY this is required,
we can all write thick books why this is, but that doesn't change the
fact that it is required. So is specifying the order WHICH DEFINES THE
ORDERING OF THE SET IN WHICH CONTEXT THE PAGING IS INTENDED.

You seem to refuse to accept that. But without specifying an ordering,
your paging attempt likely fails to give the rows YOU EXPECT, in the
ordering YOU WANTED. The code probably works, it pages over a set
though the pages have probably the same rows, or are otherwise not
properly working.
That's my 'issue'. I've seen literally dozens of similar samples of
implementing paging explained like this, or databound to gridviews, or
any of a hundred other permutations. And probably not one of them goes
went out of its way to mention that if you don't provide a
deterministic ordering, that your paging is essentially just randomly
selecting rows from the query set to show you on each 'page'.

You confuse hype/marketing/blogposts with proper documentation. :) You
shouldn't do that. If Microsoft forgets to explain that a proper
ordering is essential, they made a mistake. However, you as a developer
should also think about what you're doing: you want the query to do
something: return page n of m rows from a big set.

You could assume that in theory paging works like:
- fetch complete set
- move row pointer to n*m
- fetch the m rows from row pointer
- done

next page fetch:
- simply move row pointer over already fetched set
- fetch m rows

But that's not the case. Every page fetch, the complete set is fetched
again. Which results in a different first/last row perhaps. Or not,
that's undefined.
Obviously the developer intention with skip/take is to page, and an
AWFUL lot of developers out there believe Scott's code snippet above
will result in 'efficient server side paging'.

then they're stupid. Scott is blogging to show some features, the
snippets posted are ILLUSTRATIONS of how things might work. But they're
not rules of thumb or replacements of documentation.

If we look at Linq's methods defined, we see several extension methods
which have no meaning in database land, and are therefore not
implemented in many if not all linq providers targeting a database. If
you don't read any documentation, you might assume that these methods
will work and you'll probably get confused when an exception is thrown.
In researching this, I came upon one of your own posts in a discussion
about the order of skip and take in a query. You said:

"You should look less at specs and more at what users WANT TO DO. Time
and time again you post a reply with how the syntax is, how the specs
says how it works etc. etc. All fine, but that's completely not hte
point here: the point is that the user wants to PAGE through a
resultset. ..."

In your case you are talking about the order of skip and take in the
statement. But really, I think we're making the same underlying
point. That being: "When the developer uses skip and take the
developers intention is to page, so Linq should page."

Or perhaps we need a new extension method: Page(skipqty, takeqty) that
DOES provide the semantics we all are looking for.

The order in which skip/take are specified is indeed confusing (IMHO),
and that's also why we defined our own paging extension method. The
thing is though that VB.NET has 'skip' as a native keyword for linq
queries like 'from' and 'join'.

It's also the case that some people want to skip r rows and take m
rows after that where r isn't a multiply of m. (e.g. skip(3).Take(10) ).

That is a problem that Microsoft introduced and they have to deal with
that, I've said enough about that topic so digging into that any more
will make me repeat myself.

the point YOU refer to is a different one: the order of the set
skip/take work on is apparently defined IMPLICITLY by the linq query
(as linq works on sequences, which always have an order) though to be
useful, the developer has to provide a proper ordering him/herself
because only THEN skip/take will result in the expected rows.

I hope with my examples in this post, things are a bit more clear,
i.e. the difference between the order of the rows in: select * from
customers, and the order of the rows in select * from customers order
by country, city, customerid

FB

--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 
D

d-42

You wrote a great deal and there is so much I would have liked to
elaborate on but this really de-railed the whole debate. You posted
that linq DOES throw an exception if its not ordered, and that runs
contrary to my experience.

Back to the linq provider: using skip/take without orderby gives an
exception. This is to prevent that your application keeps on working
and produces WRONG results.

We wouldn't be having this conversation if it had thrown an error. It
doesn't. For as much as I'd like linq to provide its own default
consistent paging-semantics friendly row ordering when one isn't
specified, i don't mind having to provide my own. But I did NOT expect
linq to just start spitting random rows at me.

After finally determining the right place to look in the linq to sql
docuentation: it turns out in linq to sql, skip/take on an unordered
query is 'undefined', not 'throw an exception'.

Personally I still think it should throw an error rather than
returning something that's undefined, and that it would be useful for
linq to impose a default ordering that would be consistent between
page requests on its own (although I understand that would impose a
performance hit on certain types of query -- so maybe it should be an
option instead of automatic (e.g. page(int skiprows, int takerows,
bool AutoOrdering) but that's a question for the next version i guess.
At least the issue is documented if you look in the right place.

-best regards,
Dave
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top