Code

Fun with NEWSEQUENTIALID()

We're thinking about converting the constraint on an existing column in a table with billions of rows of data.  The column is a uniqueidentifier serving as a nonclustered primary key; we're planning on replacing the NEWID() default constraint with NEWSEQUENTIALID().  What's challenging me is my concern over primary key collisions: will the new default values eventually run into an existing primary key value?

I used the following script to test, and I was a little surprised at the results:

CREATE TABLE TestNewSeqID(ID uniqueidentifier CONSTRAINT DF1 DEFAULT newID(),
                            Marker integer Identity(1,1))
GO

DECLARE @x int
SET @x= 1
WHILE @x <=10
BEGIN
    INSERT INTO TestNewSeqID DEFAULT VALUES
    SET @x = @x +1
END


SELECT *
FROM TestNewSeqID
ORDER BY ID DESC


ALTER TABLE TestNewSeqID
DROP CONSTRAINT DF1;

ALTER TABLE TestNewSeqID
ADD CONSTRAINT DF1 DEFAULT newsequentialid() for ID;

SET @x= 1
WHILE @x <=10
BEGIN
    INSERT INTO TestNewSeqID DEFAULT VALUES
    SET @x = @x +1
END

SELECT *
FROM TestNewSeqID
ORDER BY ID DESC


My results looked something like the following:

ID    Marker
F664E06A-51BF-4FC1-960E-C9E5E854ADAD    9
0746DBAC-8B14-463C-8C4E-9CC789794963    7
24217231-1865-4E2A-B8EF-944DEC9AC6A2    1
323C15B4-59DA-4B73-94E5-54A1A0860F3D    8
B0C1E7EE-3701-4916-A3E2-465FE8462965    2
BFE8C2AB-4788-4596-867E-221F75BAB338    10
C0BC2263-F5E6-41F1-83F6-14BD9D50A6DD    3
9FA2A683-FF02-458E-8703-10D1DCDEFA97    4
75445937-6B46-4B88-9525-0DD809A7BE4B    6
E93284BE-E93B-4009-B206-04CB2B107293    5
4390AD7A-CD7D-DE11-98B8-00196665566A    20
4290AD7A-CD7D-DE11-98B8-00196665566A    19
4190AD7A-CD7D-DE11-98B8-00196665566A    18
4090AD7A-CD7D-DE11-98B8-00196665566A    17
3F90AD7A-CD7D-DE11-98B8-00196665566A    16
3E90AD7A-CD7D-DE11-98B8-00196665566A    15
3D90AD7A-CD7D-DE11-98B8-00196665566A    14
3C90AD7A-CD7D-DE11-98B8-00196665566A    13
3B90AD7A-CD7D-DE11-98B8-00196665566A    12
3A90AD7A-CD7D-DE11-98B8-00196665566A    11

If I looped through several more times, the seed changed and the values got progressively higher, but they were still sorting lower than the original data set.  I’d appreciate it if others would run this script and see if they got the same results.
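If you want a quick sanity check for collisions after the switch, a simple duplicate test against the table above will do (a minimal sketch against the test table, not the billion-row production table):

-- any ID generated more than once would show up here
SELECT ID, COUNT(*) AS Occurrences
FROM TestNewSeqID
GROUP BY ID
HAVING COUNT(*) > 1;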

You CAN teach an old dog….

This is more of a sticky note kind of post, to remind me to use a better syntax in the future.  Maybe it’ll help some of you as well.

A long time ago, when I was starting to write T-SQL, and I needed to extract only the date from a DATETIME column, I used to do something like this:

SELECT CONVERT(varchar(10), GETDATE(), 101)

 

and it worked OK.  That is, until someone pointed out that I needed to return a datetime instead of a varchar to my calling applications, so I wrapped the whole thing in another CONVERT function, like so:

SELECT CONVERT(datetime, CONVERT(varchar(10), GETDATE(), 101))

At least I now had the correct data type.  But then I read on the newsgroups that there may be a (marginal) performance benefit to using the DATEADD and DATEDIFF functions to return the date portion of a datetime, and the following method was suggested:

SELECT DATEADD(day, 0, DATEDIFF(day, 0, GETDATE()))

and that works great! I've been using that for years now.  However, it doesn't work so well when trying to find the first day of a month, year, or week.  I figured out that if I flip the parts of the DATEADD statement, then I could easily extract those values:

SELECT DATEADD(month, DATEDIFF(month, 0, GETDATE()),0) --returns 1st day of month
SELECT DATEADD(year, DATEDIFF(year, 0, GETDATE()),0) --returns 1st day of year
SELECT DATEADD(hour, DATEDIFF(hour, 0, GETDATE()),0) --returns the start of the current hour
SELECT DATEADD(quarter, DATEDIFF(quarter, 0, GETDATE()),0) --returns the 1st day of the quarter

And while that's good, I still revert to the original argument order out of habit when converting a datetime to a date (e.g., 20090724 23:59:00 to 20090724).  This post is more of a reminder to me to STOP DOING THAT, and use a consistent syntax for date conversions.

SELECT DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0)
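To make the pattern concrete, here's what those expressions return for a fixed sample date (a quick sketch; run it yourself to verify the values):

DECLARE @dt datetime
SET @dt = '20090724 23:59:00'

SELECT DATEADD(day,     DATEDIFF(day,     0, @dt), 0) AS StartOfDay      -- 2009-07-24 00:00:00.000
      ,DATEADD(hour,    DATEDIFF(hour,    0, @dt), 0) AS StartOfHour     -- 2009-07-24 23:00:00.000
      ,DATEADD(month,   DATEDIFF(month,   0, @dt), 0) AS StartOfMonth    -- 2009-07-01 00:00:00.000
      ,DATEADD(quarter, DATEDIFF(quarter, 0, @dt), 0) AS StartOfQuarter  -- 2009-07-01 00:00:00.000
      ,DATEADD(year,    DATEDIFF(year,    0, @dt), 0) AS StartOfYear     -- 2009-01-01 00:00:00.000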

Partitioning computed columns

For a project at work, I’ve been asked to help optimize our ETL process.  Part of that involves taking an existing table, and partitioning it along two existing columns.  Since the partitioning process only supports partitioning on one column, the obvious choice is to use a computed column.  I ran into some issues along the way, and thought I would share them here.

First, when partitioning a table on a computed column, you must persist that column.  Persistence is a new feature for SQL Server, introduced in SQL Server 2005 (and still available in 2008).  If you're familiar with computed columns in SQL 2000, the basic idea is the same; however, persisted computed columns are physically stored in the database (the engine updates the values whenever the source columns change), as opposed to virtual columns that are calculated at query time.

A persisted computed column is defined like so:

CREATE TABLE Sample (DateID int, TimeID int);

INSERT INTO Sample (DateID, TimeID)
VALUES (1,1);

ALTER TABLE SAMPLE
ADD DateIDTimeID AS DATEADD(minute, TimeID, DATEADD(day, DateID, '20090101'))
PERSISTED;

Of course, if you just ran that sample, you got the following error message:

(1 row(s) affected)
Msg 4936, Level 16, State 1, Line 6
Computed column ‘DateIDTimeID’ in table ‘SAMPLE’ cannot be persisted because the column is non-deterministic.

This was the first challenge; although the DATEADD function is always deterministic, the problem lies with the implicit conversion of the character representation of the date (e.g., '20090101') to datetime.  This behavior is different when running the code on a SQL 2005 server under SQL 2000 compatibility mode (80) than when running it at level 90 or above; frankly, I was surprised that the code even compiled under SQL 2000 compatibility mode, since persisted computed columns were not introduced until SQL 2005.

The way I determined that the character representation was the culprit was by checking Microsoft Connect (http://connect.microsoft.com), which listed the following issue:

Persistent Computed column using DATEADD | Microsoft Connect

From the response, let me quote:

Hi
The reason for the behavior you observe is because the conversion from string to datetime is not deterministic – it will depend on date format settings.
A work around may be to specify the date constant as a bigint (convert your date to bigint).
The behavior with the string constant is by design.
– Christian Kleinerman

Following Christian's advice, I sought to do what he suggested, only the programmer in me couldn't distinguish between CONVERT (the T-SQL function) and convert (the English verb); in other words, I tried this:

ALTER TABLE SAMPLE
ADD DateIDTimeID AS DATEADD(minute, TimeID, DATEADD(day, DateID, CONVERT(bigint, '20090101')))
PERSISTED;

Which, of course, gives a whole different error message:

Msg 8115, Level 16, State 2, Line 2
Arithmetic overflow error converting expression to data type datetime.

The statement has been terminated.

Stepping out of programming mode for a minute, I realized that Christian meant that I should replace the string constant I was using to represent the date of January 1, 2009 with its bigint equivalent (the number of days since January 1, 1900, the base date for the datetime type).  I ran the following:

SELECT CONVERT(bigint, CONVERT(datetime, '20090101'))

Which gave me the result of 39812, which I then plugged into my ALTER TABLE statement, like so:

ALTER TABLE SAMPLE
ADD DateIDTimeID AS DATEADD(minute, TimeID, DATEADD(day, DateID, 39812))
PERSISTED;

SELECT *
FROM Sample 

And the results were much closer to my liking:

DateID TimeID DateIDTimeID
1 1 2009-01-02 00:01:00.000

 

Of course, I now had to deal with the fact that when I had designed my date and time lookup tables, I had used a seed of 1 instead of 0, so now my results were off by a day and a minute, but simple math could adjust for that (future design tip: always use 0-based keys OR use a smart value for the date, like 20090101).  Anyway, partitioning now works, and I can alter the table as needed.
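For completeness, here's a minimal sketch of how the persisted column might then be partitioned; the partition function name, scheme name, and monthly boundaries are hypothetical placeholders, not our actual setup:

-- hypothetical boundaries; the real table uses different ranges
CREATE PARTITION FUNCTION pfDateIDTimeID (datetime)
AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301');

CREATE PARTITION SCHEME psDateIDTimeID
AS PARTITION pfDateIDTimeID ALL TO ([PRIMARY]);

-- rebuild the clustered index on the partition scheme, keyed on the persisted computed column
CREATE CLUSTERED INDEX cidx_Sample_DateIDTimeID
ON Sample (DateIDTimeID)
ON psDateIDTimeID (DateIDTimeID);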

SCRUM, Source Control, and the SQL Server Developer (Part 2)

So my last post focused on the modifications my shop has made to our implementation of Scrum without a lot of details about how we manage our code changes.  This post is intended to explain how we set up source control to support the scrum process.

Source control is one of those poorly-defined practices; it's something we know we should be doing as developers, but it's not always done by DBAs.  In fact, I would guess that many shops that have DBAs doing most of their reporting and data integration tasks don't practice source control methods at all; if they do, it's typically in the form of scripting out the entire database or relying on differential backups.  Separating the data structure from the data itself is often difficult to do, so I am sure that many teams don't do it at all.

We're currently using Visual Studio Team System with Team Foundation Server as our source control repository; it's been a steep learning curve for those of us who came from a Visual SourceSafe background, and applying that knowledge to the unfamiliar context of a new UI (VSTS:DB Pro) and concept (source control for databases) has been more than challenging, particularly since we've adopted a new development method (Scrum) as well.  It's taken a lot of discussion (sometimes quite heated) to get where we are today, especially since there's not a lot of "best practices" discussion out there for source controlling SQL Server.

The biggest challenge has been the concept of branching and merging; TFS recognizes that you may have multiple development lines going on, especially among multiple developers.  When do you split code out to work on it?  When do you put it back together?  How does this affect deployment of new features, vs. the release of patches to fix the currently deployed line of code?

For us, we decided to address our database source control thusly:

  1. We set up the initial databases in source control in our Main branch.  The Main branch is supposed to represent the heart of our code; it’s the final resting place of changes before they go to QA. 
  2. From Main, we branch off a line of code for each PBI (Product Backlog Item; see my first post) that we’re working on.  If a PBI spans multiple databases, we include a copy of each database under that PBI.  Our cookie trail in TFS looks something like this:
    • Main…Database1
    • Main…Database2
    • Development…PBI 1…Database1
    • Development…PBI 1…Database2
    • Development…PBI 2…Database1
  3. Once a PBI is ready to be deployed, we merge the databases for that PBI back to their original sources; if another developer merges in their changes which would cause conflict, those conflicts have to be resolved before we allow the final merge (this means that if Dev1 working on PBI 1 has some changes that would affect Dev2 working on PBI 2, those changes are discovered at the second merge).
  4. We also have a line of code that represents Production deployment; at the beginning of each Sprint, we branch from the Main line into Production and then do a schema comparison with production to ensure that the Production line actually resembles Production.  In case of a needed patch, we patch the Production line, deploy it, and then merge those changes back into the Main line.  If there are conflicts, we have to find them and resolve them.

There are some drawbacks; merging a patch into Main doesn't always go smoothly, since we have to track down who is responsible for what changes.  Documentation in the procs helps, but we need to do a better job (especially when those changes involve objects that are not easily commented, like tables).  Furthermore, when it comes time to deploy from the Main branch, if QA decides that a feature isn't ripe yet, then we have to do some legwork to back it out.  All in all, however, it works, despite the bumps along the way.

I’d be interested in hearing how others are doing source control for their database scripts; please feel free to comment below.

SQLSaturday #7: Birmingham, AL May 30, 2009

I’ve been so busy with all of my other projects at work that I’ve forgotten to really tout this.  On May 30, I’ll be presenting at the Birmingham SQLSaturday (hosted by the Steel City SQL User Group).  My talk is titled “Confessions of a Data integrator”, and will focus on dealing with other people’s database designs. 

Over the last twelve years, I've built a career on pulling data from a variety of data sources, most of which were poorly designed (according to the basic rules of normalization).  From that experience, I've pulled together a set of principles for extracting data which I plan to share.  This is a bit of a challenge for me, because I've never actually written any of these rules of mine down 🙂

Anyway, if you can make it, please do; stop by and see me.

SQL Server & XML: links of interest

I've been working on a project recently that requires using XML in SQL Server, so I've been searching the web to find references. Books Online is unusually obtuse here (the samples are sub-par for such a complex subject), so I've been trying to find good examples of how it works. Here are some of the better ones:

Passing lists to SQL Server 2005 with XML Parameters: Jon Galloway gives a short and sweet example of shredding an XML variable into rows.

XQuery Inside SQL Server 2005: A longer set of examples using CROSS APPLY to associate XML data in a column with a value in a standard column. This is probably the most important query I needed, because we have a lot of XML values (config files) stored in a table; those config files are specific to particular clients.

SQL Server 2008 Books Online: Here are the basic data elements; again, the vocabulary is dense, but you at least need to know where it is.

Now, on to FLWOR (which I’m just beginning to study): Blooming FLWOR – An Introduction to the XQuery FLWOR Expression
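For my own reference, here's a minimal sketch of the shred-and-join pattern from the links above; the table shape and XML structure are made up for illustration:

-- hypothetical table and XML shape: shred an XML column into rows,
-- pairing each setting with the standard ClientID column
DECLARE @Configs TABLE (ClientID int, ConfigXML xml)

INSERT INTO @Configs (ClientID, ConfigXML)
VALUES (999, '<config><setting name="retries" value="3"/><setting name="timeout" value="30"/></config>')

SELECT c.ClientID,
       s.value('@name',  'varchar(50)') AS SettingName,
       s.value('@value', 'varchar(50)') AS SettingValue
FROM @Configs c
CROSS APPLY c.ConfigXML.nodes('/config/setting') AS x(s)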

Death by a thousand cuts…

This has been an awful week; things have just not gone as planned from the get-go.  I'm gonna run through several issues in this post, and perhaps someone will find some value in them; lessons learned may help others avoid the pain of my ways.

 

VSTS: DB Pro

This week, I've spent way too much time just trying to figure out how to use Visual Studio; I've mentioned in previous posts that I'm a database developer, but that I've spent most of my career working with the standard SQL Server DBA tools: Query Analyzer & Management Studio.  When we migrated to TFS, my manager encouraged me to adopt VSTS:DB Pro as a development method, and I've put some real effort into learning how the system works.  I'm slowly becoming accustomed to it, but there are still some very quirky things that I've yet to figure out.

Issue 1: Logins.  Whenever I import a database into a database project, I get a long list of errors because we have Windows accounts associated with users in the database; since those users don't exist on my laptop, the project throws all kinds of ugly errors.  Googling the error reveals very little information; the only method I've found to eliminate the errors is to drop the associated users and schemas, which, of course, can lead to VERY BAD THINGS in deployment scripts.  I'm sure there is a method to resolve it; I just haven't found it yet.

Issue 2: The Silent Source Control Failure.  Most of my projects involve references to vendor databases; we have a policy of not modifying vendor databases, so when I need to write a stored procedure that accesses data from a third-party source, I create a linked database that sits on the server beside the vendor database, and include my proc in there.  I don’t touch their code, but am able to customize my queries to pull data.  When setting up projects in VSTS:DB Pro, I usually make my custom database the primary project, and include a database reference to a project based on the vendor db.  This usually works out OK, but I ran into an error yesterday where the project wizard would complete, but the project was unable to load.  Nothing I did would fix it; I finally removed the vendor db from the solution, and was able to load my project (albeit with tons of errors because of the lack of references).

The problem? One of the stored procs in the vendor db has an extremely long name (nearly 100 characters); combined with the path to the workspace, the name of the file exceeds the 255-character limit for Windows file management.  TFS apparently doesn't know how to tell me that the file name is too long, so it just refuses to open.  Try wasting 2 hours of precious development time tracking that down, and you can imagine how frustrating that is.
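If you suspect the same problem, a quick query against the vendor database can flag the likely offenders before you import it; this is just a rough sketch, and the 90-character threshold is a guess based on our workspace path length:

-- list user objects whose names are long enough to risk blowing the 255-character path limit
SELECT s.name AS SchemaName, o.name AS ObjectName, LEN(o.name) AS NameLength
FROM sys.objects o
JOIN sys.schemas s ON s.schema_id = o.schema_id
WHERE o.is_ms_shipped = 0
  AND LEN(o.name) >= 90
ORDER BY NameLength DESC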

 

Haste makes waste.

A few months ago, we made a decision to adopt a version of Scrum to standardize our development processes; we’ve made up some of our rules along the way, and for the most part, we’ve done OK with it.  However, the pressure to show progress and meet the needs of the organization means that sometimes I’ve made the wrong decision in my desire to please our Product Users.  One case was today;  one of our rules is that we a) strive to only deploy once a month (at the end of a sprint), and b) if we need to deploy an emergency patch, we only deploy on one day of the week (Wednesdays).

One of our users found an issue, and a production DBA researched it and made a recommendation for a relatively simple code change; I was out of the office Monday, so I only saw the recommendation on Tuesday.  The issue was important, and the code looked good, so I thought I could get away with rolling it out today; unfortunately, I didn’t do due diligence on the code, and neglected to check for dependencies on that particular proc.  We deployed my change today at 8:30 AM; by 3:30 PM, I was rolling it back.  Luckily, the only side effect was that we generated a few extra tickets for our clients, but still, it was a painful experience.

The one thing I want to carry away from this experience is that as our system becomes increasingly complex, it becomes less and less likely that a simple change can be made in less than a day. The code may be simple, but the context of the code in terms of surrounding processes SHOULD warrant time for thorough review.  I need to quit reacting to the perceived urgency of the issue, and think about long-term stability instead.

 

Even the best deployments can have issues.

Overall, I think our deployment process works well, but even a good process can have points of failure.  Even though we have several sets of eyes examining all of our code before it goes into production, things can still slip by.  For example, at 3:45 this afternoon, I was informed that a stored procedure in production was not returning all of the data; the procedure pulls data from a third-party database, and makes certain assumptions about the nature of that data.  In this case, we have a standard naming syntax for all of the devices we manage: ClientID, followed by a space, then the device name (e.g., 999 Joe's Server).  The stored proc had a WHERE clause that ended with LIKE '[0-9][0-9][0-9] %'; the missing row of data had a device name with the space omitted.

Now, it wasn't entirely my fault because I had written the code to spec; however, I should have anticipated that the naming convention could be slightly off (and still be parseable; if they had a 2-digit client id, it would be a different story).  In the end, our company looked bad to a client because of a typo, and that's not good.  Easy to fix the code, but hard to regain the trust of the client.
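The code fix was the easy part; a sketch of a more forgiving filter (hypothetical table and column names, and it still assumes a three-digit client id) looks something like this:

-- accept '999 Joe''s Server' as well as '999Joe''s Server' by not requiring the space,
-- while still anchoring on a three-digit client prefix
SELECT DeviceName
FROM dbo.Devices
WHERE DeviceName LIKE '[0-9][0-9][0-9]%'
  AND DeviceName NOT LIKE '[0-9][0-9][0-9][0-9]%'  -- guard against longer numeric prefixes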

 

Tomorrow WILL be a better day.

I’m not much for mumbo-jumbo, but I do believe that life is what you make of it.  If I go back to work tomorrow and assume that it’s going to be a bad day, then I’ll probably be proven right.  I’m changing my attitude as of this moment, and I choose to believe that the priority issue that I address tomorrow (while it may not be the issue I want to work on) will be the issue I was meant to work on. 

Call me a believer: Microsoft SSAS

This week, I’ve spent most of my days working with Microsoft’s Premier Support; they’re on site at our office, showing us the basics of SQL Server Reporting Services and SQL Server Analysis Services. I must admit that I was NOT looking forward to this week, because I felt like we were going to spend a lot of time defending design choices we had made.

I’ve been pleasantly surprised.

Our rep has done a great job of helping us understand the power of SSAS and SSRS; I've always appreciated the potential of OLAP cubes, but I've never really worked with them. We spent most of yesterday talking about the tools, and stepping through the various concepts, but what really helped today was when we took an existing report that we currently deliver to our clients (using standard SQL queries) and developed an analog to it using SSAS and SSRS.

I think what made the difference for me was that this was MY data; I know this stuff in and out. For all of the imperfections in my design, it’s still my design, and I was able to see how the use of OLAP cubes could very effectively address some business problems we have. That’s a very effective teaching tip that I need to remember when I go to classes (and something I remember from way back in my education classes at UGA): PEOPLE UNDERSTAND CONCEPTS BEST WHEN THEY CAN GET THEIR HANDS DIRTY.

I've tried tutorials on SSAS before, and often found myself not quite able to grasp the concepts; today, I was working to answer a common question using data I was familiar with, but a totally different language. It was very enlightening, and I think it's going to be what pushes us toward implementation. There's still a lot of ground to cover (why is MDX so frikking complicated?), but I think I'm going to enjoy the challenge.

Building ranges using a dynamically generated numbers table

We use a queueing metaphor for the majority of our dataloads, and it's been a challenge to figure out how to generate dynamic ranges to add to the queue. In other words, we want to tell our system to go and pick data from a source that has an id value >x and <=y, and then do it again in a few minutes with a new x and y value. I should say that it's been a challenge because I'm still not used to SQL 2005 features; I came across a bit of code using a CTE and recursion to generate a numbers table (thanks to Stefan Keir Gordon). Here's a slightly modified version of his code:

/*thanks to Stefan Keir Gordon for the inspiration
 http://www.keirgordon.com*/

SET STATISTICS IO ON

;WITH Numbers(n) AS
   (SELECT 1 AS n
	UNION ALL
	SELECT (n + 1) AS n
	FROM Numbers
	WHERE n <= 32767)
SELECT n
FROM Numbers
WHERE n<=32767
OPTION(MAXRECURSION 32767)

In case you're wondering, 32767 is the maximum number of recursions supported by SQL Server in a CTE; Stefan's code neatly generates a single-column table with values from 1 to 32767. Looking at that, I realized I could easily get a simple range by adding a second column to the final SELECT, like so:

;WITH Ranges(MinValue) AS
  (SELECT 1 AS MinValue
   UNION ALL
   SELECT (Minvalue + 1) AS MinValue
   FROM Ranges WHERE MinValue <= 32767)
SELECT MinValue, MinValue + 1 as MaxValue
FROM Ranges
WHERE MinValue <= 32767
OPTION(MAXRECURSION 32767)

I end up with a two-column table with a MaxValue > MinValue, and the ranges increase by 1. But I had some issues to address:

  • I needed to be able to pass in Start and End values to this code.
  • I needed chunks bigger than 1 value in size; I also had a minimum size I wanted to use.
  • In some cases 32767 ranges would be appropriate; in some cases, I needed fewer.

Here's the final code; setting the Start and End values for the range of ranges is easy to understand, but the other two values may be a bit confusing. The first is @MaxRanges, which simply says "I want no more than this number of ranges to get from the Start to the End". The maximum value for this number is 32767. @MinChunk refers to the minimum chunk size, or the space between MinValue and MaxValue for a given row.

It may be easier to just play with the values and see; you should never have more rows than @MaxRanges, and the difference between MinValue and MaxValue for a given row should always be greater than or equal to @MinChunk:

/*A dynamically generated table of ranges.
2009-01-25 Stuart R Ainsworth
	http://codegumbo.com/

Thanks to Stefan Keir Gordon for the inspiration
 http://www.keirgordon.com */

SET STATISTICS IO ON

DECLARE @MaxRanges int, @MinChunk int, @Start int, @End int
SELECT @Start=10, @End=200000

/*@MaxRanges must be <= 32767;
  if the range between Start and End is large,
  and @MaxRanges > 32767, the End point won't be reached.*/
SELECT @MaxRanges=1000, @MinChunk=205

IF (SELECT CEILING((@End-@Start)/(@MaxRanges*1.00)))>@MinChunk
	SET @MinChunk = CEILING((@End-@Start)/(@MaxRanges*1.00))

;WITH Ranges(MinValue) AS
  (SELECT @Start AS MinValue
   UNION ALL
   SELECT (Minvalue + @MinChunk) AS MinValue
   FROM Ranges WHERE MinValue < @End)
SELECT MinValue, MaxValue = CASE WHEN MinValue + @MinChunk < @End
			     THEN MinValue + @MinChunk
			     ELSE @End End
FROM Ranges
WHERE MinValue < @End
OPTION(MAXRECURSION 32767)
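To show how this feeds the dataload queue described above, here's a hedged sketch that inserts the generated ranges into a stand-in queue table; the @ImportQueue table variable is hypothetical, and (if my arithmetic is right) the sample values above should produce 976 rows, from (10, 215) through (199885, 200000):

DECLARE @MaxRanges int, @MinChunk int, @Start int, @End int
SELECT @Start=10, @End=200000
SELECT @MaxRanges=1000, @MinChunk=205

IF (SELECT CEILING((@End-@Start)/(@MaxRanges*1.00)))>@MinChunk
	SET @MinChunk = CEILING((@End-@Start)/(@MaxRanges*1.00))

DECLARE @ImportQueue TABLE (MinID int, MaxID int) --stand-in for the real queue table

;WITH Ranges(MinValue) AS
  (SELECT @Start AS MinValue
   UNION ALL
   SELECT (MinValue + @MinChunk) AS MinValue
   FROM Ranges WHERE MinValue < @End)
INSERT INTO @ImportQueue (MinID, MaxID)
SELECT MinValue, CASE WHEN MinValue + @MinChunk < @End
		  THEN MinValue + @MinChunk
		  ELSE @End END
FROM Ranges
WHERE MinValue < @End
OPTION(MAXRECURSION 32767)

SELECT COUNT(*) AS QueuedRanges FROM @ImportQueue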

Good luck; happy coding!

Auto generated SQL Server keys – uniqueidentifier or IDENTITY – a rebuttal

I found this article by Armando Prato on facility9's site; for the most part, the article explains the differences between using uniqueidentifier and integer-based identity values to auto-generate (surrogate) primary key values. However, the conclusion of the article has a logical flaw; can you spot it?

It’s evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches:

  • IDENTITY generated values are configurable, easy to read, and easier to work with
  • Fewer database pages are required to satisfy query requests
  • In comparison to NEWID(), page splitting (and its associated overhead) is eliminated
  • Database size is minimized
  • System functions exist to capture the last generated IDENTITY value (i.e. SCOPE_IDENTITY(), etc)
  • Some system functions – such as MIN() and MAX(), for instance – cannot be used on uniqueidentifier columns

Here's the hint: page splitting has less to do with data types than it does with the relationship of the clustered index to the natural insertion order of the data. The author assumes that the PRIMARY KEY is located on the clustered index, and that's one of my pet peeves. A PRIMARY KEY constraint is a physical implementation of a logical constraint; it's used to enforce data validation rules, and has nothing to do with data storage or performance concerns. A clustered index, on the other hand, is one of the basic tools for enhancing database performance. Although SQL Server automatically creates a unique clustered index when building a PRIMARY KEY constraint (if no clustered index exists), a good database design never assumes that the clustered index should automatically sit on the same columns as the PRIMARY KEY constraint.

Granted, if you use integer-based identity values for your surrogate keys, this argument has less power than if you are using uniqueidentifiers; a good rule of thumb when choosing columns for a clustered index is that they should be relatively unique, and they should increase monotonically (i.e., each new value is greater than or equal to the last). Identities are always unique, and the value of the next identity is always greater than the last.

But there are times when you cannot use an identity column (partitioned views, for example), and the choice of using a uniqueidentifier should not be dismissed for fear of page splits; rather, you should choose a different column for your clustered index. I modified the code from the original article to prove my point:

SET NOCOUNT ON
GO
USE MASTER
GO
CREATE DATABASE CodeGumbo
GO
USE CodeGumbo
GO
-- Start at 1 and increment by 1
CREATE TABLE IDENTITY_TEST1
(
ID INT IDENTITY(1,1) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO
--uniqueidentifier, clustered by default
CREATE TABLE NEWID_TEST
(
ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO
--uniqueidentifier, clustered on a different column
CREATE TABLE NEWID_TEST2
(
ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY NONCLUSTERED,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000),
UTCDATESTORED datetime DEFAULT GETUTCDATE()
)
GO
CREATE CLUSTERED INDEX clidx_NewID_Test_UtcDateStored ON NEWID_TEST2(UTCDateStored)
GO
-- INSERT 1000 ROWS INTO EACH TEST TABLE
DECLARE @COUNTER INT
SET @COUNTER = 1
WHILE (@COUNTER <= 1000)
BEGIN
INSERT INTO IDENTITY_TEST1 DEFAULT VALUES
INSERT INTO NEWID_TEST DEFAULT VALUES
INSERT INTO NEWID_TEST2 DEFAULT VALUES
SET @COUNTER = @COUNTER + 1
END
GO
-- check fragmentation of the clustered index on each table
SELECT OBJECT_NAME([OBJECT_ID]) as tablename, avg_fragmentation_in_percent, fragment_count, page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), null, null, null, null)
WHERE index_id = 1
ORDER BY tablename
GO
USE MASTER;
DROP DATABASE CodeGumbo;
GO

As you can see from the results below, the clustered index (and thus the table) has minimal fragmentation when the clustered index is on a different column than the uniqueidentifier (NewID_Test2).

To be fair, the index on the primary key is pretty shredded; however, it's been my experience that having a fragmented nonclustered index is less damaging than a fragmented clustered index. Both will require maintenance over time, but the longer you can keep the clustered index healthy, the better off your database will be. The trick is determining where the clustered index should live, and it's not always the same column(s) as the primary key.