Saturday, February 21, 2009

[SQLServer] Indexing and Scalability Tips

Recently I spent a lot of time messing about with indexes. Not because I like it - but because there was a little bit of a fuckup at work with scalability issues on a DB. After messing around with SQLProfiler (real life-saver in this case) I gained a bit more of understanding about indexes as I never got the chance to do before.

What I learned about indexes and scalability can be summarized in 2 points:

1)
You don't wanna have a clustered index on the primary key on a table once it gets huge (just make it nonclustered) if you plan on running loads of inserts

This is not a problem when the table is small - and since it's good for fast selects everyone's cool with it. Once the table gets populated with LOADS of records or it gets HUGE in size this WILL KILL YOU on any insert or update operation.

Just to give a few figures:
- A clustered index on a table with 45 million records was killing my SQL server - every insert from code was resulting in nasty timeouts (it was taking more than the default 30 secs for standard inserts - Increasing your timeout tolerance won't solve your scalability problems!)
- A clustered index on a HUGE table (15 gigs) with a few million records was slowing down inserts to the point it was taking 15 times as much as it was taking on an empty db (we're talking mSecs but on big batches it's gonna kill you)

2) You wanna have a non-clustered index for frequently selected fields on any table you plan on running extensive selects on (add the fields you select often to a single index)

- On a medium size table (40k records) select time on a given set of fields was doubling in time compared to the same table with a lot less data before adding a triple-index on the set of given fields.


Following the points above I was able to solve my scalability issues (the process I was running is now taking the same time on a huge db compared to an empty one). Obviosuly these rules are not applicable in every case - but understanding the points above might help people when 'scalability' hits the fan.


kick it on DotNetKicks.com

13 comments:

cyberprunes said...

i assume you mean a clustered index other that the primary key?

Johnny Idol said...

Nope I mean a clustered index ON the primary key. As far as I know you can have only one clustered index per table.

Tracy Talbot said...

RE: Point 2. You create a separate index for each high use access path to the table, right? Not just one index that includes every column used in all high use access paths to the table. Column 3 and Column 4 are highly used as select criteria in one access path. Column 7, 8 and 9 are used as select criteria for another access path to the same table. I create two indexes, one with 3 and 4, the other with 7, 8, and 9. Right?

Johnny Idol said...

@Tracy

Yep - when coding SQL we should try to spot frequently used selects and for each of those create one index grouping the selected fields.

Your example looks good to me (it's exactly what I am talking about in point 2).

Tracy Talbot said...

I've done a lot of DB2 and Oracle SQL but their optimizers are very different than SQL Server's (which I'm new to). I can speed up my SS2005 stored procedures by pulling joins apart and building temp tables up front instead, with much better results (faster) than adding performance indexes (which makes no sense to me at all). So I just don't assume anything with SS2005, which is why I asked. BTW, you blog has answered several questions for me the past few weeks, so thank you very much for your efforts!

Johnny Idol said...

Do you use temp tables or table variables? In my experience with table variables you get much better results (depending on size).

Anyway - what you say it's probably the first step in optimization and also keep in mind that I am talking about scalability. I was perfectly happy with what I had till the DB went up to 20 Gigs of stuff.

Johnny Idol said...

Oh, and thanks for following the blog - I really do it because otherwise I would forget everything but at the same time is nice if it happens to help people!

Tracy Talbot said...

Temp tables (eg. #temp) are what I'm using and it flies in the face of RDBMS reason, even to the experienced SQL Server data warehousing staff. But it is faster (I'm not running 20g of data through it though, so I see your point). I'm hoping to eventually stumble upon the reason why this works the way it does with SQL Server. But I'm not holding my breath.

Johnny Idol said...

Just for fun have you tried to use a table variable declared like this and to compare performances?

like this:

declare @var table (id int, data varchar(20) )

instead of:

create table #temp (id int, data varchar(20) )

If you're not moving mountains you should notice a sensible improvement in the access time with table variables because they are partially stored in memory (instead of all on disk).

Maybe I should add I really hate (T)SQL.

Tracy Talbot said...

yep, I've tried @temp and #temp. #temp is faster (even without defining indexes against it) but I am moving a fair amount of data through these so that may be why?

Johnny Idol said...

I'd say that might be why - but this stuff is really far from being an exact science - It's more kinda like 'this works for me - that works for you' :-)

Johnny Idol said...

@Tracy

Anyway if you have a

select A, B from myTable

how do you speed that up when the table gets huge (make it 5 million records)?

Only an index on A, B will help you in this case (Point 2).

Tracy Talbot said...

True, it's the joins that can be improved more (sometimes?) with a temp table than an additional index. I posted because I want to learn why a SS2005 temp table makes it faster. In DB2 or Oracle you'd never think of splitting a join out into separate, even un-indexed, temp tables first to realize a performance gain, the indexing on the base tables would be all you'd ever need.