Development

Building ranges using a dynamically generated numbers table

We use a queueing metaphor for dataloads for the majority of our imports, and it’s been a challenge to figure out how to generate dynamic ranges to add to the queue. In other words, we want to tell our system to go pick data from a source where the id value is >x and <=y, and then do it again in a few minutes with a new x and y. I should say that it’s been a challenge because I’m still not used to SQL 2005 features; I came across a bit of code using a CTE and recursion to generate a numbers table (thanks to Stefan Keir Gordon). Here’s a slightly modified version of his code:

/*thanks to Stefan Keir Gordon for the inspiration
 http://www.keirgordon.com*/

SET STATISTICS IO ON

;WITH Numbers(n) AS
   (SELECT 1 AS n
	UNION ALL
	SELECT (n + 1) AS n
	FROM Numbers
	WHERE n <= 32767)
SELECT n
FROM Numbers
WHERE n<=32767
OPTION(MAXRECURSION 32767)
In case you’re wondering, 32767 is the maximum number of recursions SQL Server supports in a CTE; Stefan’s code neatly generates a single-column table with values from 1 to 32767. Looking at that, I realized I could easily get a simple range by adding a second column to the final SELECT, like so:
;WITH Ranges(MinValue) AS
  (SELECT 1 AS MinValue
   UNION ALL
   SELECT (MinValue + 1) AS MinValue
   FROM Ranges WHERE MinValue <= 32767)
SELECT MinValue, MinValue + 1 AS MaxValue
FROM Ranges
WHERE MinValue <= 32767
OPTION(MAXRECURSION 32767)

I end up with a two-column table where MaxValue > MinValue for each row, and the ranges increase by 1. But I had some issues to address:

  • I needed to be able to pass in Start and End values to this code.
  • I needed chunks bigger than 1 value in size; I also had a minimum size I wanted to use.
  • In some cases 32767 ranges would be appropriate; in other cases, I needed fewer.

Here’s the final code; setting the Start and End values for the range of ranges is easy to understand, but the other two values may be a bit confusing. The first is @MaxRanges, which simply says “I want no more than this number of ranges to get from the Start to the End”; the maximum value for this number is 32767. @MinChunk refers to the minimum chunk size, or the space between MinValue and MaxValue for a given row.

It may be easier to just play with the values and see; you should never have more rows than @MaxRanges, and the difference between MinValue and MaxValue for a given row should always be greater than or equal to @MinChunk:

/*A dynamically generated table of ranges.
2009-01-25 Stuart R Ainsworth
	http://codegumbo.com/

Thanks to Stefan Keir Gordon for the inspiration
 http://www.keirgordon.com */

SET STATISTICS IO ON

DECLARE @MaxRanges int, @MinChunk int, @Start int, @End int
SELECT @Start=10, @End=200000

/*@MaxRanges must be <= 32767;
  if the range between Start and End is large,
  and @MaxRanges > 32767, the End point won't be reached.*/
SELECT @MaxRanges=1000, @MinChunk=205

IF (SELECT CEILING((@End-@Start)/(@MaxRanges*1.00)))>@MinChunk
	SET @MinChunk = CEILING((@End-@Start)/(@MaxRanges*1.00))

;WITH Ranges(MinValue) AS
  (SELECT @Start AS MinValue
   UNION ALL
   SELECT (MinValue + @MinChunk) AS MinValue
   FROM Ranges WHERE MinValue < @End)
SELECT MinValue, MaxValue = CASE WHEN MinValue + @MinChunk < @End
			    THEN MinValue + @MinChunk
			    ELSE @End END
FROM Ranges
WHERE MinValue < @End
OPTION(MAXRECURSION 32767)
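
To tie it back to the queueing metaphor, here’s a minimal sketch of how the output might feed a load queue; the LoadQueue and SourceData names below are hypothetical, just for illustration:

/*Hypothetical example: LoadQueue and SourceData are illustrative names only.*/
CREATE TABLE LoadQueue
	(QueueID int IDENTITY(1,1) PRIMARY KEY,
	 MinValue int NOT NULL,
	 MaxValue int NOT NULL,
	 ProcessedDate datetime NULL)

DECLARE @MaxRanges int, @MinChunk int, @Start int, @End int
SELECT @Start=10, @End=200000, @MaxRanges=1000, @MinChunk=205

IF (SELECT CEILING((@End-@Start)/(@MaxRanges*1.00)))>@MinChunk
	SET @MinChunk = CEILING((@End-@Start)/(@MaxRanges*1.00))

/*queue up the ranges using the same CTE as above*/
;WITH Ranges(MinValue) AS
  (SELECT @Start AS MinValue
   UNION ALL
   SELECT (MinValue + @MinChunk) AS MinValue
   FROM Ranges WHERE MinValue < @End)
INSERT INTO LoadQueue (MinValue, MaxValue)
SELECT MinValue, CASE WHEN MinValue + @MinChunk < @End
		 THEN MinValue + @MinChunk
		 ELSE @End END
FROM Ranges
WHERE MinValue < @End
OPTION(MAXRECURSION 32767)

/*a worker then grabs the next unprocessed range...*/
DECLARE @x int, @y int
SELECT TOP 1 @x = MinValue, @y = MaxValue
FROM LoadQueue
WHERE ProcessedDate IS NULL
ORDER BY QueueID

/*...and pulls its slice from the source:
  SELECT * FROM SourceData WHERE id > @x AND id <= @y*/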

Good luck; happy coding!

catching up…

I’m starting to fall back into a bad habit of not posting regularly; I’ve already missed a couple of weekly updates, and I haven’t shown off much coding lately.  I wish I could blame it on a lack of time, but the truth is that I’m lazy.  Even though I’ve been keeping busy, I still have plenty of free time.  I’ve just been frittering it away on television.  However, I’ve decided I’m going to take some advice from a fellow blogger: if you can’t keep a commitment to two posts a week, commit to one.  Henceforth, I’ll be posting at least one post a week, on Wednesday nights.

Now that I’ve got that out of the way, here’s the breakdown of what’s been up with me since my last post:

  • judged a speech contest at the local community college,
  • started our latest scrum sprint at work,
  • finished the sponsor letter for SQLSaturday,
  • made several appeals for volunteers and speakers for SQLSaturday,
  • finally took down my Christmas lights, and
  • cleaned my home office.

I’m pooped just making the list.  🙂

AtlantaMDF meeting 2009-01-12

Wanted to drop a quick post wrapping up tonight’s user group meeting for AtlantaMDF.  Tonight was the first chance I’ve had to pitch SQLSaturday #13 to a larger audience, and I got a lot of positive feedback.  I had several people volunteer to help, and even had two or three people volunteer to present (not including those who were interested in presenting in my two super-short sessions).  I’m hoping this is going to be killer; I’ve got a lot of legwork to do to get sponsors lined up, but we’re on our way!

There were a lot of new faces tonight; the cynical side of me wants to point to the economy.  A lot of people are either looking for work, or want to make sure that they still have work in the next year.  User groups provide both free education and free contacts, and you can’t fault someone for doing what they can to improve themselves and make sure that their opportunities are lined up.  However, I think a lot of people are finally starting to realize that SQL Server is a great platform for database development, and the best way they can learn about the technology is to participate in the user group.

Tonight, Eric Melcher of Intellinet gave a great presentation on End-To-End BI.  If you weren’t there, you may want to download it from the AtlantaMDF site when he publishes it; I think the title was a little misleading, but the content was great.  Eric really focused on the client tools available for analysis, covering everything from Excel to PerformancePoint, touching on Excel Web Services, ProClarity, and Reporting Services along the way.  If you’re like me, and you feel overwhelmed by what BI is and what options are available to display data, this was a great presentation to get your feet wet.

Happy New Year!

So, sitting up and watching Double Impact and cruising the web, I read Andy Warren’s Thoughts for 2009, and I thought I should probably contribute my own ideas as well.  I definitely want to spend more time blogging and contributing to the user groups at large.  Obviously, I’m going to be tied up with SQL Saturday for the first part of the year (speaking of which, I need to get my sponsor letter organized), but I still want to do more. 

Professionally, I need to find more leadership opportunities in my company.  This year’s a bit of a challenge because we’re understaffed with a lot of projects coming down the pipe, and there aren’t a lot of openings in management positions, BUT I need to find a way to make my name known.  I want to be the SQL guy not just for my division, but for the company as a whole.

I want to complete my MCITP certs for database administration and development this year.  One thing I learned at PASS Summit was that I need to continue to find ways to expand my knowledge.  I feel very confident in my ability to code T-SQL and design elegant database solutions, but I have 0 experience in Microsoft’s BI solution, and my DBA skills are a bit rusty.

I like Andy’s suggestion to play chess; I’m going to open that up and say I want to play more games with my kids.  They’re getting older, and soon they’ll be too old to hang out with me.  I want to find ways to encourage them to think, and I want to spend time just having fun with them.

Finally, I’m going to get healthy.  As soon as this shoulder recuperates from surgery, I’m back on the workout routine.  I am going to find 30 minutes a day to work out, and I’m going to stick to it.

How about you?

Auto generated SQL Server keys – uniqueidentifier or IDENTITY – a rebuttal

I found this article by Armando Prato on facility9’s site; for the most part, the article explains the differences between using uniqueidentifier and integer-based IDENTITY values to auto-generate (surrogate) primary key values. However, the conclusion to the article has a logical flaw; can you spot it?

It’s evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches:

  • IDENTITY generated values are configurable, easy to read, and easier to work with
  • Fewer database pages are required to satisfy query requests
  • In comparison to NEWID(), page splitting (and its associated overhead) is eliminated
  • Database size is minimized
  • System functions exist to capture the last generated IDENTITY value (i.e. SCOPE_IDENTITY(), etc)
  • Some system functions – such as MIN() and MAX(), for instance – cannot be used on uniqueidentifier columns

Here’s the hint: page splitting has less to do with data types than it does with the relationship of the clustered index to the natural insertion order of the data. The author assumes that the PRIMARY KEY is located on the clustered index, and that’s one of my pet peeves. A PRIMARY KEY constraint is a physical implementation of a logical constraint; it’s used to enforce data validation rules, and has nothing to do with data storage or performance concerns. A clustered index, on the other hand, is one of the basic tools for enhancing database performance. Although SQL Server automatically creates a unique clustered index when building a PRIMARY KEY constraint (if no clustered index already exists), a good database design never assumes that the clustered index should automatically sit on the same columns as the PRIMARY KEY constraint.

Granted, if you use integer-based identity values for your surrogate keys, this argument has less power than if you are using uniqueidentifiers; a good rule of thumb when choosing columns for a clustered index is that they should be relatively unique, and they should increase monotonically (i.e., the order of values is such that 1 <= 1 <= 2, etc.). Identity values are always unique, and the value of the next identity is always greater than the last.

But there are times when you cannot use an identity column (partitioned views, for example), and the choice of a uniqueidentifier should not be dismissed for fear of page splits; rather, you should choose a different column for your clustered index. I modified the code from the original article to prove my point:

SET NOCOUNT ON
GO
USE MASTER
GO
CREATE DATABASE CodeGumbo
GO
USE CodeGumbo
GO
-- Start at 1 and increment by 1
CREATE TABLE IDENTITY_TEST1
	(ID INT IDENTITY(1,1) PRIMARY KEY,
	 TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000))
GO
--uniqueidentifier, clustered by default
CREATE TABLE NEWID_TEST
	(ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
	 TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000))
GO
--uniqueidentifier, clustered on a different column
CREATE TABLE NEWID_TEST2
	(ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY NONCLUSTERED,
	 TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000),
	 UTCDATESTORED datetime DEFAULT GETUTCDATE())
GO
CREATE CLUSTERED INDEX clidx_NewID_Test_UtcDateStored ON NEWID_TEST2(UTCDateStored)
GO
-- INSERT 1000 ROWS INTO EACH TEST TABLE
DECLARE @COUNTER INT
SET @COUNTER = 1
WHILE (@COUNTER <= 1000)
BEGIN
	INSERT INTO IDENTITY_TEST1 DEFAULT VALUES
	INSERT INTO NEWID_TEST DEFAULT VALUES
	INSERT INTO NEWID_TEST2 DEFAULT VALUES
	SET @COUNTER = @COUNTER + 1
END
GO
SELECT OBJECT_NAME([OBJECT_ID]) AS tablename, avg_fragmentation_in_percent, fragment_count, page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL, NULL, NULL)
WHERE index_id = 1
ORDER BY tablename
GO
USE MASTER;
DROP DATABASE CodeGumbo;
GO

As you can see from the results below, the clustered index (and thus the table) has minimal fragmentation when the clustered index is on a different column than the uniqueidentifier (NewID_Test2).

To be fair, the index on the primary key is pretty shredded; however, it’s been my experience that having a fragmented nonclustered index is less damaging than a fragmented clustered index. Both will require maintenance over time, but the longer you can keep the clustered index healthy, the better off your database will be. The trick is determining where the clustered index should live, and it’s not always the same column(s) as the primary key.
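
For what it’s worth, here’s a rough sketch of the kind of maintenance I mean; the 5 and 30 percent thresholds are just commonly cited rules of thumb, not anything from the original article:

/*Rough sketch: generate REORGANIZE commands for lightly fragmented
  indexes and REBUILD commands for heavily fragmented ones.
  The 5/30 percent thresholds are rules of thumb, not hard limits.*/
SELECT 'ALTER INDEX [' + i.name + '] ON ['
	+ OBJECT_NAME(ps.[object_id]) + '] '
	+ CASE WHEN ps.avg_fragmentation_in_percent > 30
	       THEN 'REBUILD;' ELSE 'REORGANIZE;' END AS maintenance_command
FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL, NULL, NULL) ps
JOIN sys.indexes i
	ON ps.[object_id] = i.[object_id] AND ps.index_id = i.index_id
WHERE ps.avg_fragmentation_in_percent > 5
  AND i.name IS NOT NULL --skip heaps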

SQL Saturday #13 Call for Speakers NOW OPEN!

We just opened up the call for speakers for the first Atlanta SQL Saturday. This is becoming more real by the minute 🙂

If you’re in the Atlanta area, and would be willing to help out, please check out our event page for details. Nutshell: it’s being held at the Microsoft offices in Alpharetta on April 25, 2009. All details are subject to change, but that’s where we’re at for now.

SQL Quiz: Toughest Challenges

Blast those pingbacks; I lament feeling left out of Chris Shaw’s quiz, and the next thing you know, he tags me.

Here’s the question:

What are the largest challenges that you have faced in your career and how did you overcome those?

First answer, a technical one:

One of my biggest challenges was designing a database that would handle millions of rows of syslog data; it was a SQL 2000 box, and the budget for hardware was tight. We started off with a dual-core machine and only 4GB of RAM, and yet we had to analyze 100,000 rows per minute, looking for patterns. We also needed to report trends to the customer on a monthly basis, so I did some reading on data warehousing (Kimball, obviously) and started building fact tables. It quickly became evident that we were going to need to partition the data so that the server wouldn’t have to slog through 90 days’ worth of data when looking for a minute’s worth, so I started playing with partitioned views.

One problem: partitioned views wouldn’t use parameters to trigger partition elimination. I searched high and low for the answer, got into several arguments over execution plans with people on the newsgroups about why they thought it was working (they were clearly wrong), and I was just about to give up when I realized that dynamic SQL could save the day. I rewrote all of my procs to use dynamic SQL to activate the partitions, and I was off to the races. I ended up using something that most DBAs would agree could easily be misused, and it solved my problem. It really drove home the fact that a good database developer will have a toolbox full of stuff; you never know when a left-handed screwdriver will come in handy 🙂
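
To illustrate the workaround (with made-up object names; the real system was more involved): embedding the boundary value as a literal let SQL 2000 rule out member tables at compile time, where our parameterized queries scanned every member table.

/*Illustrative sketch only; the view and column names are made up.
  SyslogView UNIONs daily tables, each with a CHECK constraint on LogDate.
  With literal dates, the optimizer can exclude non-matching members
  at compile time, so we built the statement dynamically:*/
DECLARE @sql nvarchar(4000), @LogDate datetime
SET @LogDate = '20080615'

SET @sql = N'SELECT * FROM SyslogView WHERE LogDate >= '''
	+ CONVERT(varchar(8), @LogDate, 112)
	+ ''' AND LogDate < DATEADD(dd, 1, ''' + CONVERT(varchar(8), @LogDate, 112) + ''')'

EXEC sp_executesql @sql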

My second answer is ethical in nature; as most of you are aware, DBAs are guardians of some of the most precious assets in a company: their data. I had just started working for a company as their all-around DBA/report writer/developer/printer support person when they had a change in senior management. Shortly after the change, the company was audited by its parent company; the auditor came on site and, at one point, asked me to run some numbers for him from our Enterprise Resource Planning (ERP) system. I agreed, but asked for more time, since I was arm-deep in a printer at that point (I was serious about the printer support above).

Between the time that I finished repairing the printer and could get back to my desk, I was approached by the head accountant, who asked me to “run the numbers for the auditor, give them to the Chief Operating Officer, make my computer look like it was doing something and go home for the rest of the day”. In other words, hand off the data and get the heck out of Dodge; the accountant would “make sure” the auditor saw it.

I wish I could say I took the high road and told the accountant “no, I’ll deliver it myself”, but I didn’t. I went back to my desk, generated the data file, saved it to a disk, started a long-running query on my machine, handed the file to the accountant, and left. I went home and started working on my resume. The next day, the auditor was gone, the COO was trumpeting our success, and I told my manager how icky I felt. Two months later, I was working at a new job, and I got a phone call from my former manager; apparently, the company was being audited again, and he invited me to share my experience with the new auditor. I did so, and shortly after that, most of the upper management (COO, CEO, CFO, and the accountant) were gone.

Is there a lesson in this? I’m not sure; I clearly didn’t do the right thing the first time, but things still worked out OK in the end. The only apparent victim was my sense of morality, but still, I walked away wishing I had handled the situation differently.

I think I’ll tag SQL Sister and Tim Benninghoff now.

CSS, max-height

I wish I had something of substance to add today, but I’ve been overwhelmed by personal stuff.  My oldest daughter has been sick with strep for the last week, so I’ve been having to rearrange a lot of my schedule so I can hang out with her when needed.  Consequently, I’m just wiped in the evenings.

Anyway, you may have noticed that I’ve been playing around with my website, especially with my new blogroll page.  I’ve been adding some blogs to it using RSS feeds, and one thing has become clear: different RSS feeds behave differently when asked to return an item.  Some blog postings auto-truncate themselves after a few phrases; Brent Ozar’s, for example:

Brent Ozar

[feedsnap, 1]http://feeds.feedburner.com/BrentOzar-SqlServerDba[/feedsnap]

Contrast this with Andy Warren’s feed:

It Depends (Andy Warren)

[feedsnap, 1]http://feeds.feedburner.com/itdepends[/feedsnap]

Now, while I like reading Andy’s stuff, I don’t necessarily want a bunch of lengthy articles on my blogroll page; I was trying to figure out a way to limit the amount of text returned when I tripped across this nifty CSS property: max-height. From DevGuru: “The max-height property is used to set the maximum height of an element. Other properties can be used to set the height, width, maximum width, minimum height and the minimum width. This property is not inherited.” (Note that DevGuru also claims this property is not currently supported by any browser; it works in IE7. Go figure.)

Anyway, I simply add max-height to the div container wrapped around the feedsnap plugin, and suddenly the RSS feed truncates after a certain pixel height:

<div style="overflow:hidden;max-height:100px">[feedsnap, 1]http://feeds.feedburner.com/itdepends[/feedsnap]</div>

It Depends (Andy Warren)

[feedsnap, 1]http://feeds.feedburner.com/itdepends[/feedsnap]

Anyway, I hope this puts some ideas in your head.

Converting IP addresses to bigints in T-SQL

Because I really want to keep up the rhythm of posting at least one code slice a week, here’s a snippet I wrote a long time ago. In our business, we often want to compare ranges of IP addresses, so we need to convert them from a quad-based notation (x.x.x.x) to an integer basis. This particular method uses a SQL Server system function called PARSENAME(), which simply splits a period-delimited varchar into its separate components.

Below is how you flip from quad to integer and back; enjoy!


DECLARE @SrcIP varchar(15),
	@SrcIPNbr bigint

SET @SrcIP = '190.10.10.1'

SET @SrcIPNbr = (CONVERT(bigint, PARSENAME(@SrcIP,4))*POWER(256,3))
	+ (CONVERT(bigint, PARSENAME(@SrcIP,3))*POWER(256,2))
	+ (CONVERT(bigint, PARSENAME(@SrcIP,2))*POWER(256,1))
	+ CONVERT(bigint, PARSENAME(@SrcIP,1))

SELECT @SrcIPNbr

SELECT CONVERT(varchar(3),FLOOR(@SrcIPNbr/POWER(256,3))%256) + '.' +
	CONVERT(varchar(3),FLOOR(@SrcIPNbr/POWER(256,2))%256) + '.' +
	CONVERT(varchar(3),FLOOR(@SrcIPNbr/POWER(256,1))%256) + '.' +
	CONVERT(varchar(3),FLOOR(@SrcIPNbr/POWER(256,0))%256)
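
Since the whole point is comparing ranges, one way to make the conversion reusable is to wrap it in a scalar function; dbo.IPToBigint is my name for it, not anything from the original snippet:

/*Hypothetical helper: wrap the quad-to-bigint conversion in a function
  so range checks read naturally. The function name is mine.*/
CREATE FUNCTION dbo.IPToBigint (@IP varchar(15))
RETURNS bigint
AS
BEGIN
	RETURN (CONVERT(bigint, PARSENAME(@IP,4))*POWER(256,3))
		+ (CONVERT(bigint, PARSENAME(@IP,3))*POWER(256,2))
		+ (CONVERT(bigint, PARSENAME(@IP,2))*POWER(256,1))
		+ CONVERT(bigint, PARSENAME(@IP,1))
END
GO

/*string comparison gets IP ordering wrong; the bigint form doesn't*/
SELECT CASE WHEN '9.0.0.1' < '100.0.0.1'
	    THEN 'strings sort 9.0.0.1 first'
	    ELSE 'strings sort 100.0.0.1 first (wrong)' END,
       CASE WHEN dbo.IPToBigint('9.0.0.1') < dbo.IPToBigint('100.0.0.1')
	    THEN 'bigints sort 9.0.0.1 first (correct)'
	    ELSE 'bigints sort 100.0.0.1 first' END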

We’re committed now!

Tonight, during the AtlantaMDF planning session, I finally took the bait and committed to planning a SQL Saturday. I’ve been thinking about it for a while, but was delaying until I got more information about the other user group experience in the region (Code Camp). I finally got word that they’ve picked a date (mid March), so we can begin moving forward with our plans.

I’m a big proponent of the code camp model, so I’m looking forward to helping set the wheels in motion; however, I’m a little worried that I’m in over my head. I’m probably going to be blogging about this a lot over the next few weeks (months), so be prepared. Here’s my short list of stuff to do:

  • Contact Andy Warren to officially say we’re doing this.
  • Report back with some tentative dates to the rest of the AtlantaMDF leadership team.
  • Start identifying vendors, volunteers, and speakers.

Our local Microsoft rep has suggested that we could use the Microsoft office if we limited the number of attendees, which is reasonable. That’s a big chunk of time that we can save. Looks like this is really happening; exciting times ahead.