I found this article by Armando Prato from facility9’ssite; for the most part, the article explains the differences between using a uniqueidentifier and integer-based identity values to auto-generate (surrogate) primary key values. However, the conclusion to the article has a logical flaw; can you spot it?:
It’s evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches:
|
Here’s the hint: page-splitting has less to do with data-types than it does with relationship of the clustered index to the natural order of insertion for the data. the author assumes that the PRIMARY KEY is located on the clustered index, and that’s one of my pet peeves. A PRIMARY KEY constraint is a physical implementation of a logical constraint; it’s used to enforce data validation rules, and has nothing to do with data storage or performance concerns. A clustered index, on the other hand, is one of the basic tools for enhancing database performance. Although SQL Server automatically creates a unique clustered index when building a PRIMARY KEY constraint (if no clustered index exists), a good database design never assumes that the clustered index should automatically sit on the same columns as the PRIMARY KEY constraint.
Granted, if you use integer-based identity values for your surrogate keys, this argument has less power than if you are using uniqueidentifiers; a good rule of thumb to use when choosing columns for clustered indexes is that they should be relatively unique, and they should increase monotonically (i.e, the order of values is such that 1<=1<=2, etc). Identity’s are always unique, and the value of the next identity is always greater than the last.
But there are times when you cannot use an identity column (partitioned views. for example), and the choice of using a uniqueidentifier should not be dismissed for fear of page splits; rather, you should choose a different column for your clustered index. I modified the code from the original article to prove my point:
SET NOCOUNT ON GO USE MASTER GO CREATE DATABASE CodeGumbo GO USE CodeGumbo GO -- Start at 1 and increment by 1 CREATE TABLE IDENTITY_TEST1 ( ID INT IDENTITY(1,1) PRIMARY KEY, TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000) ) GO --uniqueidentifier, clustered by default CREATE TABLE NEWID_TEST ( ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY, TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000) ) GO --uniqueidentifier, clustered on a different column CREATE TABLE NEWID_TEST2 ( ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY NONCLUSTERED, TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000), UTCDATESTORED datetime DEFAULT GETUTCDATE() ) GO CREATE CLUSTERED INDEX clidx_NewID_Test_UtcDateStored ON NEWID_TEST2(UTCDateStored) GO -- INSERT 1000 ROWS INTO EACH TEST TABLE DECLARE @COUNTER INT SET @COUNTER = 1 WHILE (@COUNTER <= 1000) BEGIN INSERT INTO IDENTITY_TEST1 DEFAULT VALUES INSERT INTO NEWID_TEST DEFAULT VALUES INSERT INTO NEWID_TEST2 DEFAULT VALUES SET @COUNTER = @COUNTER + 1 END GO SELECT OBJECT_NAME([OBJECT_ID]) as tablename, avg_fragmentation_in_percent, fragment_count, page_count FROM sys.dm_db_index_physical_stats (DB_ID(), null, null, null, null) WHERE index_id = 1 ORDER BY tablename GO USE MASTER; CROP DATABASE CodeGumbo; GO |
As you can see from the results below, the clustered index (and thus the table) has minimal fragmentation when the clustered index is on a different column than the uniqueidentifier (NewID_Test2).
To be fair, the index on the primary key is pretty shredded; however, it’s been my experience that having a fragmented nonclustered index is less damaging than a fragmented clustered index. Both will require maintenance over time, but the longer you can keep the clustered index healthy, the better off your database will be. The trick is determining where the clustered index should live, and it’s not always the same column(s) as the primary key.
As the original author of the cited article, I guess I’m not clear on why you think my statement is flawed. You’re right, page splitting is affected by the insertion order. I think NEWID() vs IDENTITY sort of implies that, no? NEWID’s are randomly generated while IDENTITY is sequential.
“A PRIMARY KEY constraint is a phsyical (sic) implementation of a logical constraint; it’s used to enforce data validation rules, and has nothing to do with data storage or performance concerns”
This is not entirely accurate. You should’ve mentioned that the SQL Server primary key constraint by default auto creates a supporting clustered index. You’d have to explicitly declare it as non-clustered (like you’ve done). Not many people understand that if they declare a GUID as a PK, they’re also generating a clustered index on it by default. Regardless of it being clustered or not, you will increase the size of your database and pages will split quite a bit.
Hi Armando,
Thanks for the reply, as well as catching my typo (it seems no matter how hard I try, at least one typo always escapes my attention). I’ll try to explain my objection to your original post a bit more clearly this time.
Your summation to your article implies that using NEWID() to generate a new uniqueidentifier as a PRIMARY KEY will cause more page splits than using IDENTITY to generate a new integer. That’s not completely accurate; it’s not the choice of datatype as a primary key that affects page splits; it’s the choice of datatype as clustered index that increases the opportunity for page splits.
As you pointed out in your reply, “not many people understand that if they declare a GUID (sic) as a PK, they’re also generating a clustered index on it by default”. That’s exactly my point; the candidate for a PRIMARY KEY constraint is not necessarily the best candidate for a clustered index; since SQL Server tries to be helpful and create the clustered index for you if a) it doesn’t exist, and b) you don’t specify the type, I think the relationship between a constraint and an index has been muddied.
While using a uniqueidentifier as a PRIMARY KEY candidate with the NEWID() function does (practically) guarantee a fragmented underlying index, it’s technically not a page split unless that index is clustered. In my code sample I was trying to demonstrate that by clustering on a monotonically increasing value, you minimize the page splits, regardless of whether you choose an integer or uniqueidentifier as a PRIMARY KEY candidate.
I hope that clarifies my position a bit, but if not, I’d be happy to try again 🙂 .
Thanks, Stu
I’m sorry, you did mention that the PK is auto created as clustered. I missed that, my apologies.
Mea culpa for muddying the PK/index relationship. I went with the basic approach that the PK is being created clustered by default. I purposely used this scenario to examine the effect on the table itself. What I was trying to get across is that the randomness of the insertion on a “clustered” PK will dictate the heaviness of the table fragmentation. I thought I presented that accurately. And, as you noted, even moving a NEWID to a non-clustered index structure will heavily fragment that index as well. Using NEWID guarantees heavy fragmentation – clustered or not.
Thanks for the review… I will amend the article with some improvements
Thinking about it some more, I should have probably stated up front that I agree with 90% of your arguments for choosing IDENTITY over NEWID; however, in those situations where a design calls for a surrogate primary key with a uniqueidentifier datatype, you can mitigate some of the fragmentation concerns by choosing a different column for a clustered index (rather than the accepting the defaults).
Anyway, thanks for the replies; it’s spawned some ideas for some future posts, and hopefully, I’ll do a better job of explaining my thoughts the next time.
No discussion of uniqueidentifier as a primary key is complete without a discussion of NEWSEQUENTIALID(). NEWSEQUENTIALID() provides increasing guid values instead of psuedorandom values. Since a primary key by definition is indexed, the question shifts to the performance of the index. A uniqueidentifier populated by NEWSEQUENTIALID() is similar to an int populated by IDENTITY, except for size of the datatype, of course.
Hey John Paul,
As with anything, there are tradeoffs 🙂 You do get the benefit of the super-large range of the uniqueidentifier type (as opposed to bigint IDENTITY), but you lose the ability to generate surrogate primary keys at the application level (which, to me, is the biggest benefit of using a uniqueidentifier).
To be honest, I haven’t found a reasonable use case for using NEWSEQUENTIALID() yet; when I need a small centrally-managed surrogate key, I use IDENTITY() (and will probably cluster on the Primary Key). When I need a key that’s generated at the application level, I’ll use NEWID() (and cluster on some other column; usually a date inserted).
Using OUTPUT allows the application to obtain the value that the DEFAULT NEWSEQUENTIALID() generated.
When replicating globally, guids makes things much easier to manage. Guids are preferrable to a composite key based on IDENTITY and a site id. Even if a mistake is never made on the site id, a site might be closed and consolidated with another site. You couldn’t reassign the closed site id to the new site id because you’d end up with collisions on the composite key.
In my previous post, I indirectly said NEWID() is psuedorandom in deference to those who say computers never generate truly random numbers. Practically speaking when we manage the consequences of its behavior, it is random. The main point remains, NEWSEQUENTIALID() is predictable in that it provdes increasing values similar to IDENTITY.
I’m not sure I understand your use case (and that’s probably my fault), but I’ll try to be clearer in describing my own:
IDENTITY() is useful to me when designing an application where I want to have a small footprint for my surrogate Primary Key. A downside is that if I am inserting transactional data, I have to check with the database to determine the key value so that I can use it with the associated subordinate rows (ie., if I insert a new Manager and associated Employees, I have to first find the Manager key so that I can use it with the new Employees).
NEWID() allows me to build a dataset at the application level, assigning a GUID (uniqueidentifier) to the appropriate key columns before I insert the data; since my application is managing the key values, I can construct a complete data set at the application without relying on the database to generate ID values (I, of course, will have to do the existance check before inserting or handle primary key violations). The downside is that I have a very wide column, and a bad candidate for a clustered index; as a best practice, I’ll need to cluster somewhere else.
NEWSEQUENTIALID(), in my opinion, addresses the clustering issue (by generating a sequential identifier), but doesn’t address the ability of the application to generate its own primary key; like IDENTITY(), I still need to check the database for the newly generated value. It’s not bad, but I don’t see the benefit for my specific scenarios (and, like I said, I’m not sure I understand the cases you’re putting forward).
Thanks for the input, however.
Although it may not be recommended but I am trying to use both uniqueidentifier with default newid along with a column set to identity property.
However after insert statement the @@IDENTITY returns NULL.
Do you know why this is happening and how to use both of these into a single table?
My blog may not be the best place for a back and forth interchange on this, so you may want to address this question to a local usergroup or the USENET forums. However, I’d like to help if I could; I couldn’t replicate your issue, however. When I tried it with a sample script, @@IDENTITY worked fine (BTW, you need to be very careful with that function; SCOPE_IDENTITY() is usually a better fit for most applications). Your description of the problem was great, but if you could include a sample script with only a few columns that shows the problem you’re having, it would make it a lot easier to help you solve the problem.
Good luck.