I found this article by Armando Prato from facility9’ssite; for the most part, the article explains the differences between using a uniqueidentifier and integer-based identity values to auto-generate (surrogate) primary key values. However, the conclusion to the article has a logical flaw; can you spot it?:
It’s evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches:
|
Here’s the hint: page-splitting has less to do with data-types than it does with relationship of the clustered index to the natural order of insertion for the data. the author assumes that the PRIMARY KEY is located on the clustered index, and that’s one of my pet peeves. A PRIMARY KEY constraint is a physical implementation of a logical constraint; it’s used to enforce data validation rules, and has nothing to do with data storage or performance concerns. A clustered index, on the other hand, is one of the basic tools for enhancing database performance. Although SQL Server automatically creates a unique clustered index when building a PRIMARY KEY constraint (if no clustered index exists), a good database design never assumes that the clustered index should automatically sit on the same columns as the PRIMARY KEY constraint.
Granted, if you use integer-based identity values for your surrogate keys, this argument has less power than if you are using uniqueidentifiers; a good rule of thumb to use when choosing columns for clustered indexes is that they should be relatively unique, and they should increase monotonically (i.e, the order of values is such that 1<=1<=2, etc). Identity’s are always unique, and the value of the next identity is always greater than the last.
But there are times when you cannot use an identity column (partitioned views. for example), and the choice of using a uniqueidentifier should not be dismissed for fear of page splits; rather, you should choose a different column for your clustered index. I modified the code from the original article to prove my point:
SET NOCOUNT ON GO USE MASTER GO CREATE DATABASE CodeGumbo GO USE CodeGumbo GO -- Start at 1 and increment by 1 CREATE TABLE IDENTITY_TEST1 ( ID INT IDENTITY(1,1) PRIMARY KEY, TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000) ) GO --uniqueidentifier, clustered by default CREATE TABLE NEWID_TEST ( ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY, TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000) ) GO --uniqueidentifier, clustered on a different column CREATE TABLE NEWID_TEST2 ( ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY NONCLUSTERED, TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000), UTCDATESTORED datetime DEFAULT GETUTCDATE() ) GO CREATE CLUSTERED INDEX clidx_NewID_Test_UtcDateStored ON NEWID_TEST2(UTCDateStored) GO -- INSERT 1000 ROWS INTO EACH TEST TABLE DECLARE @COUNTER INT SET @COUNTER = 1 WHILE (@COUNTER <= 1000) BEGIN INSERT INTO IDENTITY_TEST1 DEFAULT VALUES INSERT INTO NEWID_TEST DEFAULT VALUES INSERT INTO NEWID_TEST2 DEFAULT VALUES SET @COUNTER = @COUNTER + 1 END GO SELECT OBJECT_NAME([OBJECT_ID]) as tablename, avg_fragmentation_in_percent, fragment_count, page_count FROM sys.dm_db_index_physical_stats (DB_ID(), null, null, null, null) WHERE index_id = 1 ORDER BY tablename GO USE MASTER; CROP DATABASE CodeGumbo; GO |
As you can see from the results below, the clustered index (and thus the table) has minimal fragmentation when the clustered index is on a different column than the uniqueidentifier (NewID_Test2).
To be fair, the index on the primary key is pretty shredded; however, it’s been my experience that having a fragmented nonclustered index is less damaging than a fragmented clustered index. Both will require maintenance over time, but the longer you can keep the clustered index healthy, the better off your database will be. The trick is determining where the clustered index should live, and it’s not always the same column(s) as the primary key.