Do varchar(max), nvarchar(max) and varbinary(max) columns affect select queries?
Consider this table:
create table Books
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100) not null,
[Text] nvarchar(max) not null
)
Let's imagine that we have over 100,000 books in this table.
Now we're given a 10,000 books data to insert into this table, some of which are duplicate. So we need to filter duplicates first, and then insert new books.
One way to check for the duplicates is this way:
select UniqueToken
from Books
where UniqueToken in
(
'first unique token',
'second unique token'
-- 10,000 items here
)
Does the existence of Text
column affect this query's performance? If so, how can we optimized it?
P.S.
I have the same structure, for some other data. And it's not performing well. A friend told me that I should break my table into two tables as follow:
create table BookUniqueTokens
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100)
)
create table Books
(
Id bigint not null primary key,
[Text] nvarchar(max)
)
And I have to do my duplicate finding algorithm on the first table only, and then insert data into both of them. This way he claimed performance gets way better, because tables are physically separate. He claimed that [Text]
column affects any select
query on the UniqueToken
column.
sql-server performance
|
show 1 more comment
Consider this table:
create table Books
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100) not null,
[Text] nvarchar(max) not null
)
Let's imagine that we have over 100,000 books in this table.
Now we're given a 10,000 books data to insert into this table, some of which are duplicate. So we need to filter duplicates first, and then insert new books.
One way to check for the duplicates is this way:
select UniqueToken
from Books
where UniqueToken in
(
'first unique token',
'second unique token'
-- 10,000 items here
)
Does the existence of Text
column affect this query's performance? If so, how can we optimized it?
P.S.
I have the same structure, for some other data. And it's not performing well. A friend told me that I should break my table into two tables as follow:
create table BookUniqueTokens
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100)
)
create table Books
(
Id bigint not null primary key,
[Text] nvarchar(max)
)
And I have to do my duplicate finding algorithm on the first table only, and then insert data into both of them. This way he claimed performance gets way better, because tables are physically separate. He claimed that [Text]
column affects any select
query on the UniqueToken
column.
sql-server performance
2
Is there a nonclustered index onUniqueToken
? Also, I would not advise anIN
with 10k items, I would store them in a temp table and filter theUniqueTokens
with this temporary table. More on that here
– Randi Vertongen
yesterday
1
1) If you are checking for duplicates, why would you include theText
column in the query? 2) can you please update the question to inlcude a few examples of values stored in theUniqueToken
column? If you don't want to share actual company data, modify it, but keep the format the same.
– Solomon Rutzky
yesterday
@RandiVertongen, yes there is a nonclustered index on UniqueToken
– Saeed Neamati
yesterday
@SolomonRutzky, I'm retrieving existing values from database, to be excluded inside the application code.
– Saeed Neamati
yesterday
@SaeedNeamati I added an edit based on the NC index existing. If the query in the question is the one that needs to be optimized, and the NC index does not have theText
column included, then I would look at theIN
for query optimization. There are better ways to find duplicate data.
– Randi Vertongen
yesterday
|
show 1 more comment
Consider this table:
create table Books
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100) not null,
[Text] nvarchar(max) not null
)
Let's imagine that we have over 100,000 books in this table.
Now we're given a 10,000 books data to insert into this table, some of which are duplicate. So we need to filter duplicates first, and then insert new books.
One way to check for the duplicates is this way:
select UniqueToken
from Books
where UniqueToken in
(
'first unique token',
'second unique token'
-- 10,000 items here
)
Does the existence of Text
column affect this query's performance? If so, how can we optimized it?
P.S.
I have the same structure, for some other data. And it's not performing well. A friend told me that I should break my table into two tables as follow:
create table BookUniqueTokens
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100)
)
create table Books
(
Id bigint not null primary key,
[Text] nvarchar(max)
)
And I have to do my duplicate finding algorithm on the first table only, and then insert data into both of them. This way he claimed performance gets way better, because tables are physically separate. He claimed that [Text]
column affects any select
query on the UniqueToken
column.
sql-server performance
Consider this table:
create table Books
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100) not null,
[Text] nvarchar(max) not null
)
Let's imagine that we have over 100,000 books in this table.
Now we're given a 10,000 books data to insert into this table, some of which are duplicate. So we need to filter duplicates first, and then insert new books.
One way to check for the duplicates is this way:
select UniqueToken
from Books
where UniqueToken in
(
'first unique token',
'second unique token'
-- 10,000 items here
)
Does the existence of Text
column affect this query's performance? If so, how can we optimized it?
P.S.
I have the same structure, for some other data. And it's not performing well. A friend told me that I should break my table into two tables as follow:
create table BookUniqueTokens
(
Id bigint not null primary key identity(1, 1),
UniqueToken varchar(100)
)
create table Books
(
Id bigint not null primary key,
[Text] nvarchar(max)
)
And I have to do my duplicate finding algorithm on the first table only, and then insert data into both of them. This way he claimed performance gets way better, because tables are physically separate. He claimed that [Text]
column affects any select
query on the UniqueToken
column.
sql-server performance
sql-server performance
asked yesterday
Saeed NeamatiSaeed Neamati
5501518
5501518
2
Is there a nonclustered index onUniqueToken
? Also, I would not advise anIN
with 10k items, I would store them in a temp table and filter theUniqueTokens
with this temporary table. More on that here
– Randi Vertongen
yesterday
1
1) If you are checking for duplicates, why would you include theText
column in the query? 2) can you please update the question to inlcude a few examples of values stored in theUniqueToken
column? If you don't want to share actual company data, modify it, but keep the format the same.
– Solomon Rutzky
yesterday
@RandiVertongen, yes there is a nonclustered index on UniqueToken
– Saeed Neamati
yesterday
@SolomonRutzky, I'm retrieving existing values from database, to be excluded inside the application code.
– Saeed Neamati
yesterday
@SaeedNeamati I added an edit based on the NC index existing. If the query in the question is the one that needs to be optimized, and the NC index does not have theText
column included, then I would look at theIN
for query optimization. There are better ways to find duplicate data.
– Randi Vertongen
yesterday
|
show 1 more comment
2
Is there a nonclustered index onUniqueToken
? Also, I would not advise anIN
with 10k items, I would store them in a temp table and filter theUniqueTokens
with this temporary table. More on that here
– Randi Vertongen
yesterday
1
1) If you are checking for duplicates, why would you include theText
column in the query? 2) can you please update the question to inlcude a few examples of values stored in theUniqueToken
column? If you don't want to share actual company data, modify it, but keep the format the same.
– Solomon Rutzky
yesterday
@RandiVertongen, yes there is a nonclustered index on UniqueToken
– Saeed Neamati
yesterday
@SolomonRutzky, I'm retrieving existing values from database, to be excluded inside the application code.
– Saeed Neamati
yesterday
@SaeedNeamati I added an edit based on the NC index existing. If the query in the question is the one that needs to be optimized, and the NC index does not have theText
column included, then I would look at theIN
for query optimization. There are better ways to find duplicate data.
– Randi Vertongen
yesterday
2
2
Is there a nonclustered index on
UniqueToken
? Also, I would not advise an IN
with 10k items, I would store them in a temp table and filter the UniqueTokens
with this temporary table. More on that here– Randi Vertongen
yesterday
Is there a nonclustered index on
UniqueToken
? Also, I would not advise an IN
with 10k items, I would store them in a temp table and filter the UniqueTokens
with this temporary table. More on that here– Randi Vertongen
yesterday
1
1
1) If you are checking for duplicates, why would you include the
Text
column in the query? 2) can you please update the question to inlcude a few examples of values stored in the UniqueToken
column? If you don't want to share actual company data, modify it, but keep the format the same.– Solomon Rutzky
yesterday
1) If you are checking for duplicates, why would you include the
Text
column in the query? 2) can you please update the question to inlcude a few examples of values stored in the UniqueToken
column? If you don't want to share actual company data, modify it, but keep the format the same.– Solomon Rutzky
yesterday
@RandiVertongen, yes there is a nonclustered index on UniqueToken
– Saeed Neamati
yesterday
@RandiVertongen, yes there is a nonclustered index on UniqueToken
– Saeed Neamati
yesterday
@SolomonRutzky, I'm retrieving existing values from database, to be excluded inside the application code.
– Saeed Neamati
yesterday
@SolomonRutzky, I'm retrieving existing values from database, to be excluded inside the application code.
– Saeed Neamati
yesterday
@SaeedNeamati I added an edit based on the NC index existing. If the query in the question is the one that needs to be optimized, and the NC index does not have the
Text
column included, then I would look at the IN
for query optimization. There are better ways to find duplicate data.– Randi Vertongen
yesterday
@SaeedNeamati I added an edit based on the NC index existing. If the query in the question is the one that needs to be optimized, and the NC index does not have the
Text
column included, then I would look at the IN
for query optimization. There are better ways to find duplicate data.– Randi Vertongen
yesterday
|
show 1 more comment
1 Answer
1
active
oldest
votes
Examples
Consider your query with 8 filter predicates in your IN
clause on a dataset of 10K records.
select UniqueToken
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
A clustered index scan is used, there are no other indexes present on this test table
With a data size of 216 Bytes.
You should also note how even with 8 records, the OR
filters are stacking up.
The reads that happened on this table:
Credits to statisticsparser.
When you include the Text
column in the select part of your query, the actual data size changes drastically:
select UniqueToken,Text
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
Again, the Clustered index Scan with a residual predicate:
But with a dataset of 32KB.
As there are almost 1000 lob logical reads:
Now, when we create the two tables in question, and fill them up with the same 10k records
Executing the same select without Text
. Remember that we had 99 Logical reads when using the Books
Table.
select UniqueToken
from BookUniqueTokens
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
)
The reads on BookUniqueTokens
are lower, 67 instead of 99.
We can track that back to the pages in the original Books
table and the pages in the new table without the Text
.
Original Books
table:
New BookUniqueTokens
table
So, all the pages + (2 overhead pages?) are read from the clustered index.
Why is there a difference, and why is the difference not bigger? After all the datasize difference is huge (Lob data <> No Lob data)
Books
Data space
BooksWithText Data space
The reason for this is ROW_OVERFLOW_DATA.
When data gets bigger than 8kb the data is stored as ROW_OVERFLOW_DATA on different pages.
Ok, if lob data is stored on different pages, why are the page sizes of these two clustered indexes not the same?
Due to the 24 byte pointer added to the Clustered index to track each of these pages.
After all, sql server needs to know where it can find the lob data.
Source
To answer your questions
He claimed that [Text] column affects any select query on the
UniqueToken column.
And
Does the existence of Text column affect this query's performance? If
so, how can we optimized it?
If the data stored is actually Lob Data, and the Query provided in the answer is used
It does bring some overhead due to the 24 byte pointers.
Depending on the executions / min not being crazy high, I would say that this is negligible, even with 100K records.
Remember that this overhead only happens if an index that includes Text
is used, such as the clustered index.
But, what if the clustered index scan is used, and the lob data does not exceed 8kb?
If the data does not exceed 8kb, and you have no index on UniqueToken
,the overhead could be bigger . even when not selecting the Text
column.
Logical reads on 10k records when Text is only 137 characters long (all records):
Table 'Books2'. Scan count 1, logical reads 419
Due to all this extra data being on the clustered index pages.
Again, an index on UniqueToken
(Without including the Text
column) will resolve this issue.
As pointed out by @David Browne - Microsoft, you could also store the data off row, as to not add this overhead on the Clustered index when not selecting this Text Column.
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases
TL;DR
Based on the query given, indexing UniqueToken
without including TEXT
should resolve your troubles.
Additionally, I would use a temporary table or table type to do the filtering instead of the IN
statement.
EDIT:
yes there is a nonclustered index on UniqueToken
Your example query is not touching the Text
column, and based on the query this should be a covering index.
If we test this on the three tables we previously used (UniqueToken
+ Lob data, Solely UniqueToken
, UniqueToken
+ 137 Char data in nvarchar(max) column)
CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);
The reads remain the same for these three tables, because the nonclustered index is used.
Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Additional details
by @David Browne - Microsoft
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases/
Remember that you have to rebuild your indexes for this to take effect on already populated data.
By @Erik Darling
On
- MAX Data Types Do WHAT?
Filtering on Lob data sucks.
- Memory Grants and Data Size
Your memory grants might go through the roof when using bigger datatypes, impacting performance.
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f232941%2fdo-varcharmax-nvarcharmax-and-varbinarymax-columns-affect-select-queries%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Examples
Consider your query with 8 filter predicates in your IN
clause on a dataset of 10K records.
select UniqueToken
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
A clustered index scan is used, there are no other indexes present on this test table
With a data size of 216 Bytes.
You should also note how even with 8 records, the OR
filters are stacking up.
The reads that happened on this table:
Credits to statisticsparser.
When you include the Text
column in the select part of your query, the actual data size changes drastically:
select UniqueToken,Text
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
Again, the Clustered index Scan with a residual predicate:
But with a dataset of 32KB.
As there are almost 1000 lob logical reads:
Now, when we create the two tables in question, and fill them up with the same 10k records
Executing the same select without Text
. Remember that we had 99 Logical reads when using the Books
Table.
select UniqueToken
from BookUniqueTokens
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
)
The reads on BookUniqueTokens
are lower, 67 instead of 99.
We can track that back to the pages in the original Books
table and the pages in the new table without the Text
.
Original Books
table:
New BookUniqueTokens
table
So, all the pages + (2 overhead pages?) are read from the clustered index.
Why is there a difference, and why is the difference not bigger? After all the datasize difference is huge (Lob data <> No Lob data)
Books
Data space
BooksWithText Data space
The reason for this is ROW_OVERFLOW_DATA.
When data gets bigger than 8kb the data is stored as ROW_OVERFLOW_DATA on different pages.
Ok, if lob data is stored on different pages, why are the page sizes of these two clustered indexes not the same?
Due to the 24 byte pointer added to the Clustered index to track each of these pages.
After all, sql server needs to know where it can find the lob data.
Source
To answer your questions
He claimed that [Text] column affects any select query on the
UniqueToken column.
And
Does the existence of Text column affect this query's performance? If
so, how can we optimized it?
If the data stored is actually Lob Data, and the Query provided in the answer is used
It does bring some overhead due to the 24 byte pointers.
Depending on the executions / min not being crazy high, I would say that this is negligible, even with 100K records.
Remember that this overhead only happens if an index that includes Text
is used, such as the clustered index.
But, what if the clustered index scan is used, and the lob data does not exceed 8kb?
If the data does not exceed 8kb, and you have no index on UniqueToken
,the overhead could be bigger . even when not selecting the Text
column.
Logical reads on 10k records when Text is only 137 characters long (all records):
Table 'Books2'. Scan count 1, logical reads 419
Due to all this extra data being on the clustered index pages.
Again, an index on UniqueToken
(Without including the Text
column) will resolve this issue.
As pointed out by @David Browne - Microsoft, you could also store the data off row, as to not add this overhead on the Clustered index when not selecting this Text Column.
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases
TL;DR
Based on the query given, indexing UniqueToken
without including TEXT
should resolve your troubles.
Additionally, I would use a temporary table or table type to do the filtering instead of the IN
statement.
EDIT:
yes there is a nonclustered index on UniqueToken
Your example query is not touching the Text
column, and based on the query this should be a covering index.
If we test this on the three tables we previously used (UniqueToken
+ Lob data, Solely UniqueToken
, UniqueToken
+ 137 Char data in nvarchar(max) column)
CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);
The reads remain the same for these three tables, because the nonclustered index is used.
Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Additional details
by @David Browne - Microsoft
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases/
Remember that you have to rebuild your indexes for this to take effect on already populated data.
By @Erik Darling
On
- MAX Data Types Do WHAT?
Filtering on Lob data sucks.
- Memory Grants and Data Size
Your memory grants might go through the roof when using bigger datatypes, impacting performance.
add a comment |
Examples
Consider your query with 8 filter predicates in your IN
clause on a dataset of 10K records.
select UniqueToken
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
A clustered index scan is used, there are no other indexes present on this test table
With a data size of 216 Bytes.
You should also note how even with 8 records, the OR
filters are stacking up.
The reads that happened on this table:
Credits to statisticsparser.
When you include the Text
column in the select part of your query, the actual data size changes drastically:
select UniqueToken,Text
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
Again, the Clustered index Scan with a residual predicate:
But with a dataset of 32KB.
As there are almost 1000 lob logical reads:
Now, when we create the two tables in question, and fill them up with the same 10k records
Executing the same select without Text
. Remember that we had 99 Logical reads when using the Books
Table.
select UniqueToken
from BookUniqueTokens
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
)
The reads on BookUniqueTokens
are lower, 67 instead of 99.
We can track that back to the pages in the original Books
table and the pages in the new table without the Text
.
Original Books
table:
New BookUniqueTokens
table
So, all the pages + (2 overhead pages?) are read from the clustered index.
Why is there a difference, and why is the difference not bigger? After all the datasize difference is huge (Lob data <> No Lob data)
Books
Data space
BooksWithText Data space
The reason for this is ROW_OVERFLOW_DATA.
When data gets bigger than 8kb the data is stored as ROW_OVERFLOW_DATA on different pages.
Ok, if lob data is stored on different pages, why are the page sizes of these two clustered indexes not the same?
Due to the 24 byte pointer added to the Clustered index to track each of these pages.
After all, sql server needs to know where it can find the lob data.
Source
To answer your questions
He claimed that [Text] column affects any select query on the
UniqueToken column.
And
Does the existence of Text column affect this query's performance? If
so, how can we optimized it?
If the data stored is actually Lob Data, and the Query provided in the answer is used
It does bring some overhead due to the 24 byte pointers.
Depending on the executions / min not being crazy high, I would say that this is negligible, even with 100K records.
Remember that this overhead only happens if an index that includes Text
is used, such as the clustered index.
But, what if the clustered index scan is used, and the lob data does not exceed 8kb?
If the data does not exceed 8kb, and you have no index on UniqueToken
,the overhead could be bigger . even when not selecting the Text
column.
Logical reads on 10k records when Text is only 137 characters long (all records):
Table 'Books2'. Scan count 1, logical reads 419
Due to all this extra data being on the clustered index pages.
Again, an index on UniqueToken
(Without including the Text
column) will resolve this issue.
As pointed out by @David Browne - Microsoft, you could also store the data off row, as to not add this overhead on the Clustered index when not selecting this Text Column.
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases
TL;DR
Based on the query given, indexing UniqueToken
without including TEXT
should resolve your troubles.
Additionally, I would use a temporary table or table type to do the filtering instead of the IN
statement.
EDIT:
yes there is a nonclustered index on UniqueToken
Your example query is not touching the Text
column, and based on the query this should be a covering index.
If we test this on the three tables we previously used (UniqueToken
+ Lob data, Solely UniqueToken
, UniqueToken
+ 137 Char data in nvarchar(max) column)
CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);
The reads remain the same for these three tables, because the nonclustered index is used.
Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Additional details
by @David Browne - Microsoft
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases/
Remember that you have to rebuild your indexes for this to take effect on already populated data.
By @Erik Darling
On
- MAX Data Types Do WHAT?
Filtering on Lob data sucks.
- Memory Grants and Data Size
Your memory grants might go through the roof when using bigger datatypes, impacting performance.
add a comment |
Examples
Consider your query with 8 filter predicates in your IN
clause on a dataset of 10K records.
select UniqueToken
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
A clustered index scan is used, there are no other indexes present on this test table
With a data size of 216 Bytes.
You should also note how even with 8 records, the OR
filters are stacking up.
The reads that happened on this table:
Credits to statisticsparser.
When you include the Text
column in the select part of your query, the actual data size changes drastically:
select UniqueToken,Text
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
Again, the Clustered index Scan with a residual predicate:
But with a dataset of 32KB.
As there are almost 1000 lob logical reads:
Now, when we create the two tables in question, and fill them up with the same 10k records
Executing the same select without Text
. Remember that we had 99 Logical reads when using the Books
Table.
select UniqueToken
from BookUniqueTokens
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
)
The reads on BookUniqueTokens
are lower, 67 instead of 99.
We can track that back to the pages in the original Books
table and the pages in the new table without the Text
.
Original Books
table:
New BookUniqueTokens
table
So, all the pages + (2 overhead pages?) are read from the clustered index.
Why is there a difference, and why is the difference not bigger? After all the datasize difference is huge (Lob data <> No Lob data)
Books
Data space
BooksWithText Data space
The reason for this is ROW_OVERFLOW_DATA.
When data gets bigger than 8kb the data is stored as ROW_OVERFLOW_DATA on different pages.
Ok, if lob data is stored on different pages, why are the page sizes of these two clustered indexes not the same?
Due to the 24 byte pointer added to the Clustered index to track each of these pages.
After all, sql server needs to know where it can find the lob data.
Source
To answer your questions
He claimed that [Text] column affects any select query on the
UniqueToken column.
And
Does the existence of Text column affect this query's performance? If
so, how can we optimized it?
If the data stored is actually Lob Data, and the Query provided in the answer is used
It does bring some overhead due to the 24 byte pointers.
Depending on the executions / min not being crazy high, I would say that this is negligible, even with 100K records.
Remember that this overhead only happens if an index that includes Text
is used, such as the clustered index.
But, what if the clustered index scan is used, and the lob data does not exceed 8kb?
If the data does not exceed 8kb, and you have no index on UniqueToken
,the overhead could be bigger . even when not selecting the Text
column.
Logical reads on 10k records when Text is only 137 characters long (all records):
Table 'Books2'. Scan count 1, logical reads 419
Due to all this extra data being on the clustered index pages.
Again, an index on UniqueToken
(Without including the Text
column) will resolve this issue.
As pointed out by @David Browne - Microsoft, you could also store the data off row, as to not add this overhead on the Clustered index when not selecting this Text Column.
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases
TL;DR
Based on the query given, indexing UniqueToken
without including TEXT
should resolve your troubles.
Additionally, I would use a temporary table or table type to do the filtering instead of the IN
statement.
EDIT:
yes there is a nonclustered index on UniqueToken
Your example query is not touching the Text
column, and based on the query this should be a covering index.
If we test this on the three tables we previously used (UniqueToken
+ Lob data, Solely UniqueToken
, UniqueToken
+ 137 Char data in nvarchar(max) column)
CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);
The reads remain the same for these three tables, because the nonclustered index is used.
Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Additional details
by @David Browne - Microsoft
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases/
Remember that you have to rebuild your indexes for this to take effect on already populated data.
By @Erik Darling
On
- MAX Data Types Do WHAT?
Filtering on Lob data sucks.
- Memory Grants and Data Size
Your memory grants might go through the roof when using bigger datatypes, impacting performance.
Examples
Consider your query with 8 filter predicates in your IN
clause on a dataset of 10K records.
select UniqueToken
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
A clustered index scan is used, there are no other indexes present on this test table
With a data size of 216 Bytes.
You should also note how even with 8 records, the OR
filters are stacking up.
The reads that happened on this table:
Credits to statisticsparser.
When you include the Text
column in the select part of your query, the actual data size changes drastically:
select UniqueToken,Text
from Books
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
);
Again, the Clustered index Scan with a residual predicate:
But with a dataset of 32KB.
As there are almost 1000 lob logical reads:
Now, when we create the two tables in question, and fill them up with the same 10k records
Executing the same select without Text
. Remember that we had 99 Logical reads when using the Books
Table.
select UniqueToken
from BookUniqueTokens
where UniqueToken in
(
'Unique token 1',
'Unique token 2',
'Unique token 3',
'Unique token 4',
'Unique token 5',
'Unique token 6',
'Unique token 9999',
'Unique token 5000'
-- 10,000 items here
)
The reads on BookUniqueTokens
are lower, 67 instead of 99.
We can track that back to the pages in the original Books
table and the pages in the new table without the Text
.
Original Books
table:
New BookUniqueTokens
table
So, all the pages + (2 overhead pages?) are read from the clustered index.
Why is there a difference, and why is the difference not bigger? After all the datasize difference is huge (Lob data <> No Lob data)
Books
Data space
BooksWithText Data space
The reason for this is ROW_OVERFLOW_DATA.
When data gets bigger than 8kb the data is stored as ROW_OVERFLOW_DATA on different pages.
Ok, if lob data is stored on different pages, why are the page sizes of these two clustered indexes not the same?
Due to the 24 byte pointer added to the Clustered index to track each of these pages.
After all, sql server needs to know where it can find the lob data.
Source
To answer your questions
He claimed that [Text] column affects any select query on the
UniqueToken column.
And
Does the existence of Text column affect this query's performance? If
so, how can we optimized it?
If the data stored is actually Lob Data, and the Query provided in the answer is used
It does bring some overhead due to the 24 byte pointers.
Depending on the executions / min not being crazy high, I would say that this is negligible, even with 100K records.
Remember that this overhead only happens if an index that includes Text
is used, such as the clustered index.
But, what if the clustered index scan is used, and the lob data does not exceed 8kb?
If the data does not exceed 8kb, and you have no index on UniqueToken
,the overhead could be bigger . even when not selecting the Text
column.
Logical reads on 10k records when Text is only 137 characters long (all records):
Table 'Books2'. Scan count 1, logical reads 419
Due to all this extra data being on the clustered index pages.
Again, an index on UniqueToken
(Without including the Text
column) will resolve this issue.
As pointed out by @David Browne - Microsoft, you could also store the data off row, as to not add this overhead on the Clustered index when not selecting this Text Column.
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases
TL;DR
Based on the query given, indexing UniqueToken
without including TEXT
should resolve your troubles.
Additionally, I would use a temporary table or table type to do the filtering instead of the IN
statement.
EDIT:
yes there is a nonclustered index on UniqueToken
Your example query is not touching the Text
column, and based on the query this should be a covering index.
If we test this on the three tables we previously used (UniqueToken
+ Lob data, Solely UniqueToken
, UniqueToken
+ 137 Char data in nvarchar(max) column)
CREATE INDEX [IX_Books_UniqueToken] ON Books(UniqueToken);
CREATE INDEX [IX_BookUniqueTokens_UniqueToken] ON BookUniqueTokens(UniqueToken);
CREATE INDEX [IX_Books2_UniqueToken] ON Books2(UniqueToken);
The reads remain the same for these three tables, because the nonclustered index is used.
Table 'Books'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'BookUniqueTokens'. Scan count 8, logical reads 16, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Books2'. Scan count 8, logical reads 16, physical reads 5, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Additional details
by @David Browne - Microsoft
Also, if you do want the text stored off-row, you can force that
without using a separate table. Just set the 'large value types out of
row' option with sp_tableoption.
docs.microsoft.com/en-us/sql/relational-databases/
Remember that you have to rebuild your indexes for this to take effect on already populated data.
By @Erik Darling
On
- MAX Data Types Do WHAT?
Filtering on Lob data sucks.
- Memory Grants and Data Size
Your memory grants might go through the roof when using bigger datatypes, impacting performance.
edited yesterday
answered yesterday
Randi VertongenRandi Vertongen
3,926824
3,926824
add a comment |
add a comment |
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f232941%2fdo-varcharmax-nvarcharmax-and-varbinarymax-columns-affect-select-queries%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
Is there a nonclustered index on
UniqueToken
? Also, I would not advise anIN
with 10k items, I would store them in a temp table and filter theUniqueTokens
with this temporary table. More on that here– Randi Vertongen
yesterday
1
1) If you are checking for duplicates, why would you include the
Text
column in the query? 2) can you please update the question to inlcude a few examples of values stored in theUniqueToken
column? If you don't want to share actual company data, modify it, but keep the format the same.– Solomon Rutzky
yesterday
@RandiVertongen, yes there is a nonclustered index on UniqueToken
– Saeed Neamati
yesterday
@SolomonRutzky, I'm retrieving existing values from database, to be excluded inside the application code.
– Saeed Neamati
yesterday
@SaeedNeamati I added an edit based on the NC index existing. If the query in the question is the one that needs to be optimized, and the NC index does not have the
Text
column included, then I would look at theIN
for query optimization. There are better ways to find duplicate data.– Randi Vertongen
yesterday