SQL

Auto generated SQL Server keys – uniqueidentifier or IDENTITY – a rebuttal

I found this article by Armando Prato on facility9’s site; for the most part, the article explains the differences between using uniqueidentifier and integer-based identity values to auto-generate (surrogate) primary key values. However, the conclusion of the article has a logical flaw; can you spot it?

It’s evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches:

  • IDENTITY generated values are configurable, easy to read, and easier to work with
  • Fewer database pages are required to satisfy query requests
  • In comparison to NEWID(), page splitting (and its associated overhead) is eliminated
  • Database size is minimized
  • System functions exist to capture the last generated IDENTITY value (i.e. SCOPE_IDENTITY(), etc)
  • Some system functions – such as MIN() and MAX(), for instance – cannot be used on uniqueidentifier columns

    Here’s the hint: page-splitting has less to do with data types than it does with the relationship of the clustered index to the natural order of insertion of the data. The author assumes that the PRIMARY KEY sits on the clustered index, and that’s one of my pet peeves. A PRIMARY KEY constraint is a physical implementation of a logical constraint; it’s used to enforce data validation rules, and has nothing to do with data storage or performance concerns. A clustered index, on the other hand, is one of the basic tools for enhancing database performance. Although SQL Server automatically creates a unique clustered index when building a PRIMARY KEY constraint (if no clustered index already exists), a good database design never assumes that the clustered index should automatically sit on the same columns as the PRIMARY KEY constraint.
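
    If you want to see how those two concepts actually line up in a given database, here’s a rough sketch (the column aliases are mine, but the catalog views are standard) that lists each table’s clustered index alongside any index that enforces a PRIMARY KEY:

    SELECT OBJECT_NAME(i.[object_id]) AS tablename,
    i.name AS indexname,
    i.type_desc, -- CLUSTERED or NONCLUSTERED
    i.is_primary_key -- 1 if this index backs the PRIMARY KEY constraint
    FROM sys.indexes i
    JOIN sys.tables t ON t.[object_id] = i.[object_id]
    WHERE i.is_primary_key = 1 OR i.type_desc = 'CLUSTERED'
    ORDER BY tablename, i.type_desc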

    Granted, if you use integer-based identity values for your surrogate keys, this argument has less power than if you are using uniqueidentifiers; a good rule of thumb when choosing columns for a clustered index is that they should be relatively unique, and they should increase monotonically (i.e., each new value is greater than or equal to the one before it). Identities are always unique, and the value of the next identity is always greater than the last.

    But there are times when you cannot use an identity column (partitioned views, for example), and the choice of a uniqueidentifier should not be dismissed for fear of page splits; rather, you should choose a different column for your clustered index. I modified the code from the original article to prove my point:

    SET NOCOUNT ON
    GO
    USE MASTER
    GO
    CREATE DATABASE CodeGumbo
    GO
    USE CodeGumbo
    GO
    -- Start at 1 and increment by 1
    CREATE TABLE IDENTITY_TEST1
    (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
    )
    GO
    --uniqueidentifier, clustered by default
    CREATE TABLE NEWID_TEST
    (
    ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
    TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
    )
    GO
    --uniqueidentifier, clustered on a different column
    CREATE TABLE NEWID_TEST2
    (
    ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY NONCLUSTERED,
    TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000),
    UTCDATESTORED datetime DEFAULT GETUTCDATE()
    )
    GO
    CREATE CLUSTERED INDEX clidx_NewID_Test2_UtcDateStored ON NEWID_TEST2(UTCDateStored)
    GO
    -- INSERT 1000 ROWS INTO EACH TEST TABLE
    DECLARE @COUNTER INT
    SET @COUNTER = 1
    WHILE (@COUNTER <= 1000)
    BEGIN
    INSERT INTO IDENTITY_TEST1 DEFAULT VALUES
    INSERT INTO NEWID_TEST DEFAULT VALUES
    INSERT INTO NEWID_TEST2 DEFAULT VALUES
    SET @COUNTER = @COUNTER + 1
    END
    GO
    SELECT OBJECT_NAME([OBJECT_ID]) as tablename, avg_fragmentation_in_percent, fragment_count, page_count
    FROM sys.dm_db_index_physical_stats (DB_ID(), null, null, null, null)
    WHERE index_id = 1
    ORDER BY tablename
    GO
    USE MASTER;
    DROP DATABASE CodeGumbo;
    GO

    As you can see from the results below, the clustered index (and thus the table) has minimal fragmentation when the clustered index is on a different column than the uniqueidentifier (NewID_Test2).

    To be fair, the index on the primary key is pretty shredded; however, it’s been my experience that having a fragmented nonclustered index is less damaging than a fragmented clustered index. Both will require maintenance over time, but the longer you can keep the clustered index healthy, the better off your database will be. The trick is determining where the clustered index should live, and it’s not always the same column(s) as the primary key.
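
    When that maintenance does come due, the usual choices are a REORGANIZE for lightly fragmented indexes and a REBUILD for heavily fragmented ones; here’s a minimal sketch against the test tables from the script above (you’d run it before the final DROP DATABASE, of course):

    -- Lighter-weight option: reorganize all indexes on the table
    ALTER INDEX ALL ON NEWID_TEST2 REORGANIZE
    GO
    -- Heavier option: rebuild, which resets fragmentation completely
    ALTER INDEX ALL ON NEWID_TEST REBUILD
    GO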

    Converting IP addresses to bigints in T-SQL

    Because I really want to keep up the rhythm of posting at least one code slice a week, here’s a snippet I wrote a long time ago. In our business, we often want to compare ranges of IP addresses, so we need to convert them from dotted-quad notation (x.x.x.x) to an integer basis. This particular method uses the SQL Server system function PARSENAME(), which simply splits a period-delimited varchar into its separate components.
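
    If you haven’t used PARSENAME() before, the thing to remember is that it numbers the parts from right to left; here’s a quick sketch, with the expected output in the comments:

    -- PARSENAME() counts period-delimited parts from the right, so part 1 is the rightmost octet
    SELECT PARSENAME('190.10.10.1', 4) AS first_octet, -- returns '190'
    PARSENAME('190.10.10.1', 1) AS last_octet -- returns '1'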

    Below is how you flip from quad to integer and back; enjoy!


    DECLARE @SrcIP varchar(15),
    @SrcIPNbr bigint

    SET @SrcIP = '190.10.10.1'

    SET @SrcIPNbr = (CONVERT(bigint, PARSENAME(@SrcIP,4))*POWER(256,3))
    + (CONVERT(bigint, PARSENAME(@SrcIP,3))*POWER(256,2))
    + (CONVERT(bigint, PARSENAME(@SrcIP,2))*POWER(256,1))
    + CONVERT(bigint, PARSENAME(@SrcIP,1))

    SELECT @SrcIPNbr

    SELECT CONVERT(varchar(3),FLOOR(@SrcIPNbr/POWER(256,3))%256) + '.' +
    CONVERT(varchar(3),FLOOR(@SrcIPNbr/POWER(256,2))%256) + '.' +
    CONVERT(varchar(3),FLOOR(@SrcIPNbr/POWER(256,1))%256) + '.' +
    CONVERT(varchar(3),FLOOR(@SrcIPNbr/POWER(256,0))%256)
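
    If you run it with the sample address, the first SELECT should return 3188328961 (that’s 190*16777216 + 10*65536 + 10*256 + 1), and the second should hand you back 190.10.10.1.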

    Writing Dynamic SQL

    I’m currently working on a presentation about dynamic SQL, and I thought I’d share one of my favorite tricks. I had actually posted it to USENET earlier, so you can read the whole thread here, complete with some very useful suggestions, but here’s a synopsis:

    As you’re probably aware, writing dynamic SQL statements can be a bear, and it often involves breaking in and out of a concatenated string, like so:

    DECLARE @sql nvarchar(MAX)
    SELECT @sql = N'SELECT * FROM '+ @DatabaseName
                    + '.'+ @SchemaName
                    + '.'+ @TableName
                    + ' WHERE '+ @ColumnName + ' = ' + @Value
    EXEC sp_executesql @sql

    (Note that I am NOT endorsing the use of this example, or the cost or benefit of dynamic SQL as a whole; I’m simply showing a method of writing dynamic SQL when needed).

    I use a REPLACE statement to avoid dinking around with the extra quotation marks, and thanks to suggestions by Erland and others (in the above thread), I can also mitigate some of the risk of SQL injection, like so:

    SET @SQL = N'SELECT *
                 FROM |DatabaseName|.|SchemaName|.|TableName|
                 WHERE |ColumnName| = @param_value'

    SET @SQL = REPLACE(@SQL,'|DatabaseName|',QUOTENAME(@DatabaseName))
    SET @SQL = REPLACE(@SQL,'|SchemaName|',QUOTENAME(@SchemaName))
    SET @SQL = REPLACE(@SQL,'|TableName|',QUOTENAME(@TableName))
    SET @SQL = REPLACE(@SQL,'|ColumnName|',QUOTENAME(@ColumnName))
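
    The snippet above stops just short of execution; to round it out, here’s roughly how the remaining value gets passed to sp_executesql as a true parameter (I’m reusing the @Value variable from the first example, and the nvarchar(100) declaration is just a placeholder; match it to your actual column type):

    -- The search value is bound as a parameter rather than concatenated into the string
    EXEC sp_executesql @SQL,
    N'@param_value nvarchar(100)',
    @param_value = @Value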

    Happy coding!