SQL

<facedesk>

So, I’ve been answering questions on StackOverflow for two days now, and I still like the experience.  However, I have noticed a disturbing trend in several of the questions being asked that involve SQL Server.  Here’s an example of one; I’ll give you the question, and my answer, and then tell you what’s bugging me.

TSQL Generate 5 character length string, all digits [0-9] that doesn’t already exist in database

What’s the best way to do this?

I need to generate a 5 digit length string where all the characters are numeric. However, I need to be able to do this ‘x’ amount of times (user variable) and store this random strings in a database. Furthermore, I can’t generate the same string twice. Old strings will be removed after 6 months.

Pseudo-code

DECLARE @intIterator INT,
@intMax

SET @intIterator = 1
SET @intMax = 5 (number of strings to generate)

WHILE @intIterator <= @intMax
BEGIN

  -- GENERATE RANDOM STRING OF 5 NUMERIC DIGITS
   ???

  -- INSERT INTO DB IF DOESN'T ALREADY EXIST
  INSERT INTO TSTRINGS
  SELECT @RANDOMSTRING

  IF @@ERROR = 0
    SET @intIterator = @intIterator + 1

END

I know this probably isn’t the best way to do it, so advice is appreciated. But really looking for ideas on how to generate the numeric 5 length strings.

 

I think this is a great question; the guy’s obviously looking for a list of randomly generated five-character strings composed entirely of digits, which means there are only 100,000 possibilities.  This question is tagged as SQL and TSQL, which means that he’s looking for a database-driven solution (I’m aware that there are several ways to skin this cat; I’m just showing you that he’s specifically looking for a SQL solution).  Note, however, that he’s using a loop to create, validate, and insert one row at a time.

Here’s where a lot of reading lately on numbers tables comes into play; I think about this a bit, and post the following solution:

 

Here’s a set-based approach, using SQL 2005 syntax (would be a little easier with SQL 2008, but you didn’t specify). Also, if you had a numbers table, a large chunk of it can be cut out.

No looping, no duplicates, and should be nearly instantaneous (assuming the target column is indexed).

DECLARE @intMax INTEGER
SET @intMax = 5
INSERT  INTO TSTRINGS
        SELECT  q.nString
        FROM    ( SELECT    ROW_NUMBER() OVER ( ORDER BY ( NEWID() ) ) AS N2 ,
                            RIGHT(REPLICATE('0', 5) + CONVERT(VARCHAR(5), N), 5) AS nString
                  FROM      --the subquery below could be replaced by a numbers table
                            ( SELECT TOP 100000
                                        ROW_NUMBER() OVER ( ORDER BY ( ac1.Object_ID ) ) - 1 AS N
                              FROM      Master.sys.columns ac1
                                        CROSS JOIN Master.sys.columns ac2
                                        CROSS JOIN Master.sys.columns ac3
                            ) numbers
                  WHERE     RIGHT(REPLICATE('0', 5) + CONVERT(VARCHAR(5), N), 5) NOT IN ( SELECT    nString
                                                                                          FROM      TSTRINGS ) --check to see if reused
                ) q
        WHERE   q.N2 <= @intMax
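
Not part of the answer I posted, but if you want to sanity-check the result after running it, a quick count confirms the table holds no duplicates (assuming nString is the column name, as in the code above):

SELECT  COUNT(*) AS TotalRows,
        COUNT(DISTINCT nString) AS DistinctStrings
FROM    TSTRINGS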

A nice, tidy, set-based solution that I’ll probably keep in mind for generating passwords in the future (with a little work).  So, what’s bugging me?  Is it the fact that I have 0 votes for this solution?  No, that’s not really it, because votes are kind of whimsical; sometimes you get them, and sometimes you don’t.

No, what bugs me is that out of the 6 answers posted, 3 of them involve CURSORS.  For a SQL question.  I know.  Shock and horror.

Seriously, though: why do people still use cursors?  I know that there are times when cursors are an appropriate tool, but you shouldn’t use them every day.  It’s kind of like buying those specialized drill bits for removing a stripped screw; for that specific application, they work great.  If you leave them in your drill and try to use them in an everyday situation, guess what?  You’ve stripped out a screw.

Let me be as clear as I can:

CURSORS SUCK, PEOPLE.  HEAR ME?  THEY SUCK.

There you go; I feel better now.
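
For the record, here’s roughly what the contrast looks like for this kind of work; the source table and column names below are hypothetical, but the row-by-row pattern is the one I keep seeing:

-- Row-by-row: a cursor fetches and inserts one value at a time
DECLARE @Value varchar(5)
DECLARE curStrings CURSOR LOCAL FAST_FORWARD FOR
    SELECT nString FROM CandidateStrings   --hypothetical source table
OPEN curStrings
FETCH NEXT FROM curStrings INTO @Value
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO TSTRINGS (nString) VALUES (@Value)
    FETCH NEXT FROM curStrings INTO @Value
END
CLOSE curStrings
DEALLOCATE curStrings

-- Set-based: the same work in a single statement, where the optimizer can actually help
INSERT INTO TSTRINGS (nString)
SELECT nString FROM CandidateStrings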

Building ranges using numbers

A while back, I posted a solution to a problem I was facing where I needed to generate a table of ranges.  While I haven’t experienced any issues with that solution, a couple of things have recently occurred that have caused me to go back, take another look at the problem, and see if I can come up with a more elegant solution.

The first was a comment by Jeff Moden on the original post; Jeff suggested that recursion may not be the best choice for a solution, given the relatively large number of logical reads compared to his solution and/or a permanent Numbers table.  Jeff posted a solution, but his solution only addressed the first part of the article (building a dynamic numbers table) and NOT the second part (coming up with a table of ranges).  The problem I was struggling with using a table of numbers was the dynamic span between rows; in other words, I couldn’t figure out how to force row 1 to start with 10 and row 2 to start with 215.

The second driving factor was the FizzBuzz solution I proffered yesterday; while I was reviewing that code, I had a glimmer of an idea about how to handle the dynamic spanning using MODULO.  I reworked the solution using Jeff’s suggestion, as well as using a permanent numbers table.  My examples are below:

Original code:

/*A dynamically generated table of ranges.
2009-01-25 Stuart R Ainsworth
    http://codegumbo.com/

Thanks to Stefan Keir Gordon for the inspiration
 http://www.keirgordon.com */


DECLARE @MaxRanges int, @MinChunk int, @Start int, @End int
SELECT @Start=10, @End=200000

/*@MaxRanges must be <= 32767;
  if the range between Start and End is large,
  and @MaxRanges > 32767, the End point won't be reached.*/


SELECT @MaxRanges=100, @MinChunk=2

IF (SELECT CEILING((@End-@Start)/(@MaxRanges*1.00)))>@MinChunk
    SET @MinChunk = CEILING((@End-@Start)/(@MaxRanges*1.00))

;WITH Ranges(MinValue) AS
  (SELECT @Start AS MinValue
   UNION ALL
   SELECT (Minvalue + @MinChunk) AS MinValue
   FROM Ranges WHERE MinValue < @End)
SELECT MinValue, MaxValue = CASE WHEN MinValue + @MinChunk < @End
                 THEN MinValue + @MinChunk
                 ELSE @End End
FROM Ranges
WHERE MinValue < @End
OPTION(MAXRECURSION 32767)

Modification based on Jeff’s suggestion:

/*A dynamically generated table of ranges. part 2
2009-09-27 Stuart R Ainsworth
    http://codegumbo.com/

Thanks to Jeff Moden for the inspiration*/

DECLARE @MaxRanges int, @MinChunk int, @Start int, @End int
SELECT @Start=10, @End=200000
SELECT @MaxRanges=100, @MinChunk=2

IF (SELECT CEILING((@End-@Start)/(@MaxRanges*1.00)))>@MinChunk
    SET @MinChunk = CEILING((@End-@Start)/(@MaxRanges*1.00))


SELECT MinValue, MaxValue = CASE WHEN MinValue + @MinChunk < @End
                 THEN MinValue + @MinChunk
                 ELSE @End End
FROM (    SELECT 
            ROW_NUMBER() OVER (ORDER BY (ac1.Object_ID)) -1 + @Start AS MinValue
        FROM Master.sys.columns ac1
        CROSS JOIN Master.sys.columns ac2) x
WHERE (MinValue = @Start
    OR (Minvalue-@Start) % @MinChunk = 0)
    AND MinValue + @MinChunk <= @End + @MinChunk

I also used a permanent table of numbers (per Jeff’s suggestion), but given the similarity to Jeff’s modification, I won’t post the full code here (a rough sketch appears at the end of this post).  Simply replace the subquery with a numbers table (you’ll need to monkey with the math to get the dynamic starting point).  What was interesting to me was the output for each of the three solutions when using SET STATISTICS IO and TIME:

Original code:

Table 'Worktable'. Scan count 2, logical reads 607, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:   CPU time = 16 ms,  elapsed time = 88 ms.

Modification (dynamic table):

Table 'syscolpars'. Scan count 2, logical reads 19, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:   CPU time = 187 ms,  elapsed time = 203 ms.

Modification (permanent table):

Table 'NUMBERS'. Scan count 1, logical reads 424, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:   CPU time = 125 ms,  elapsed time = 156 ms.

At first glance, it would appear that Jeff’s assertion is correct; the number of logical reads is greatly reduced by his suggested modification (more so by the dynamic table than the permanent table, interestingly enough).  However, the CPU time is increased.  And, while I know that execution plans are no guarantee of performance, in every run SQL Server estimated the original code at 0% of the combined cost of the three plans (with the dynamic modification at 65% and the permanent numbers table at 35%).

So now, I’m stuck.  I can see Jeff’s point, but I’m not sure that a solution based on his suggestion is that much better than my original solution; do I worry more about I/O or CPU?
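
For reference, the permanent-table variant I tested (but didn’t post above) looked roughly like this; it’s a sketch that assumes a dbo.NUMBERS table with a single integer column N starting at 0 and containing at least @End - @Start + 1 rows (if yours starts at 1, shift the offset accordingly):

SET STATISTICS IO ON
SET STATISTICS TIME ON

DECLARE @MaxRanges int, @MinChunk int, @Start int, @End int
SELECT @Start=10, @End=200000
SELECT @MaxRanges=100, @MinChunk=2

IF (SELECT CEILING((@End-@Start)/(@MaxRanges*1.00)))>@MinChunk
    SET @MinChunk = CEILING((@End-@Start)/(@MaxRanges*1.00))

SELECT MinValue, MaxValue = CASE WHEN MinValue + @MinChunk < @End
                 THEN MinValue + @MinChunk
                 ELSE @End END
FROM (    SELECT N + @Start AS MinValue
        FROM dbo.NUMBERS           --permanent numbers table; N assumed to start at 0
        WHERE N <= @End - @Start) x
WHERE (MinValue = @Start
    OR (MinValue-@Start) % @MinChunk = 0)
    AND MinValue + @MinChunk <= @End + @MinChunk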

FizzBuzz

Stumbled across Mike Hillwig’s post in response to a question posed by Brent Ozar.  Depending on how literally you want to use the word “print” (do you mean the actual command, or do you just want output?), I think I have a slightly more elegant solution than what Mike proposed.

SELECT CASE WHEN n%3=0 AND n%5=0 THEN 'FizzBuzz' 
            WHEN n % 3 = 0 THEN 'Fizz'
            WHEN n % 5 = 0 THEN 'Buzz'
            ELSE CONVERT(varchar(8), n) END as "N"
FROM (    SELECT TOP 100 ROW_NUMBER() OVER (ORDER BY (ac1.Object_ID)) AS "N"
    FROM Master.sys.All_Columns ac1) x

Basically, it avoids the WHILE loop; however, if you must use a PRINT command, then Mike’s suggestion is probably as elegant as I can imagine.  What’s interesting is that this may also provide me a better solution to my need to build a dynamic range that I posted a while back.  I’ll have to think about it a bit more, but I think it’s possible.
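
(For anyone who genuinely needs PRINT output rather than a result set, the loop version boils down to something like the sketch below; I’m not claiming it’s Mike’s exact code.)

DECLARE @n int
SET @n = 1
WHILE @n <= 100
BEGIN
    PRINT CASE WHEN @n % 15 = 0 THEN 'FizzBuzz'
               WHEN @n % 3 = 0 THEN 'Fizz'
               WHEN @n % 5 = 0 THEN 'Buzz'
               ELSE CONVERT(varchar(8), @n) END
    SET @n = @n + 1
END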

Fun with NEWSEQUENTIALID()

We’re thinking about changing the default constraint on an existing column in a table with billions of rows of data.  The column is a uniqueidentifier serving as a nonclustered primary key; we’re planning on replacing the NEWID() default with NEWSEQUENTIALID().  What is challenging me is my concern over primary key collisions; will the new default values eventually run into an existing primary key value?

I used the following script to test, and I was a little surprised at the results:

CREATE TABLE TestNewSeqID(ID uniqueidentifier CONSTRAINT DF1 DEFAULT newID(),
                            Marker integer Identity(1,1))
GO

DECLARE @x int
SET @x= 1
WHILE @x <=10
BEGIN
    INSERT INTO TestNewSeqID DEFAULT VALUES
    SET @x = @x +1
END


SELECT *
FROM TestNewSeqID
ORDER BY ID DESC


ALTER TABLE TestNewSeqID
DROP CONSTRAINT DF1;

ALTER TABLE TestNewSeqID
ADD CONSTRAINT DF1 DEFAULT newsequentialid() for ID;

SET @x= 1
WHILE @x <=10
BEGIN
    INSERT INTO TestNewSeqID DEFAULT VALUES
    SET @x = @x +1
END

SELECT *
FROM TestNewSeqID
ORDER BY ID DESC


My results looked something like the following:

ID    Marker
F664E06A-51BF-4FC1-960E-C9E5E854ADAD    9
0746DBAC-8B14-463C-8C4E-9CC789794963    7
24217231-1865-4E2A-B8EF-944DEC9AC6A2    1
323C15B4-59DA-4B73-94E5-54A1A0860F3D    8
B0C1E7EE-3701-4916-A3E2-465FE8462965    2
BFE8C2AB-4788-4596-867E-221F75BAB338    10
C0BC2263-F5E6-41F1-83F6-14BD9D50A6DD    3
9FA2A683-FF02-458E-8703-10D1DCDEFA97    4
75445937-6B46-4B88-9525-0DD809A7BE4B    6
E93284BE-E93B-4009-B206-04CB2B107293    5
4390AD7A-CD7D-DE11-98B8-00196665566A    20
4290AD7A-CD7D-DE11-98B8-00196665566A    19
4190AD7A-CD7D-DE11-98B8-00196665566A    18
4090AD7A-CD7D-DE11-98B8-00196665566A    17
3F90AD7A-CD7D-DE11-98B8-00196665566A    16
3E90AD7A-CD7D-DE11-98B8-00196665566A    15
3D90AD7A-CD7D-DE11-98B8-00196665566A    14
3C90AD7A-CD7D-DE11-98B8-00196665566A    13
3B90AD7A-CD7D-DE11-98B8-00196665566A    12
3A90AD7A-CD7D-DE11-98B8-00196665566A    11

If I looped through several more times, the seed changed and the values got progressively higher, but they were still sorting lower than the original data set.  I’d appreciate it if others would run this script and see if they got the same results.
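
If you do run it, a quick duplicate check against the test table will also tell you whether any of the sequential defaults ever collide with the earlier random values:

SELECT ID, COUNT(*) AS Occurrences
FROM TestNewSeqID
GROUP BY ID
HAVING COUNT(*) > 1   --any rows returned here would indicate a collision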

You CAN teach an old dog….

This is more of a sticky note kind of post, to remind me to use a better syntax in the future.  Maybe it’ll help some of you as well.

A long time ago, when I was starting to write T-SQL and needed to extract only the date from a DATETIME column, I used to do something like this:

SELECT CONVERT(varchar(10), GETDATE(), 101)

 

and it worked OK.  That is, until someone pointed out that I needed to return a datetime instead of a varchar to my calling applications, so I wrapped the whole thing in another CONVERT function, like so:

SELECT CONVERT(datetime, CONVERT(varchar(10), GETDATE(), 101))

At least I now had the correct data type.  But then I read on the newsgroups that there may be a (marginal) performance benefit to using the DATEADD and DATEDIFF functions to return the date portion of a datetime, and the following method was suggested:

SELECT DATEADD(day, 0, DATEDIFF(day, 0, GETDATE()))

and that works great! I’ve been using that for years now.  However, it doesn’t work so well when trying to find the first day of a month, year, or week.   I figured out that if I flip the parts of the DATEADD statement, then I could easily extract those values:

SELECT DATEADD(month, DATEDIFF(month, 0, GETDATE()),0) --returns 1st day of month
SELECT DATEADD(year, DATEDIFF(year, 0, GETDATE()),0) --returns 1st day of year
SELECT DATEADD(hour, DATEDIFF(hour, 0, GETDATE()),0) --returns the current date and hour (minutes zeroed)
SELECT DATEADD(quarter, DATEDIFF(quarter, 0, GETDATE()),0) --returns the 1st day of the quarter
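
The week case I mentioned above works the same way, with one caveat: DATEDIFF(week, …) counts Sunday boundaries regardless of your DATEFIRST setting, and day 0 (1900-01-01) was a Monday, so on a Sunday this returns the following day rather than the previous Monday:

SELECT DATEADD(week, DATEDIFF(week, 0, GETDATE()),0) --returns the Monday of the current week (see caveat above)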

And while that’s good, I still revert to the original positions out of habit when converting a datetime to a date (e.g., 20090724 23:59:00 to 20090724).  This post is more of a reminder to me to STOP DOING THAT, and use a consistent syntax for date conversions.

SELECT DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0)

Partitioning computed columns

For a project at work, I’ve been asked to help optimize our ETL process.  Part of that involves taking an existing table, and partitioning it along two existing columns.  Since the partitioning process only supports partitioning on one column, the obvious choice is to use a computed column.  I ran into some issues along the way, and thought I would share them here.

First, when partitioning a table on a computed column, you must persist that column.  Persistence is a feature introduced in SQL Server 2005 (and still available in 2008).  If you’re familiar with the concept of a computed column in SQL 2000, the basic idea is the same; however, persisted computed columns are physically stored in the database (the engine updates the values whenever the source columns change), as opposed to virtual columns, which are computed at query time.

A persisted computed column is defined like so:

CREATE TABLE Sample (DateID int, TimeID int);

INSERT INTO Sample (DateID, TimeID)
VALUES (1,1);

ALTER TABLE SAMPLE
ADD DateIDTimeID AS DATEADD(minute, TimeID, DATEADD(day, DateID, '20090101'))
PERSISTED;

Of course, if you just ran that sample, you got the following error message:

(1 row(s) affected)
Msg 4936, Level 16, State 1, Line 6
Computed column 'DateIDTimeID' in table 'SAMPLE' cannot be persisted because the column is non-deterministic.

This was the first challenge; although the DATEADD function is always deterministic, the problem lies with the character representation of the date (e.g., '20090101').  This behavior is different when running the code on a SQL 2005 server under the SQL 2000 compatibility mode (80) than when running it at level 90 or above; frankly, I was surprised that the code even compiled under SQL 2000 compatibility mode, since persistence of a computed column was not introduced until SQL 2005.

The way I determined that it was the character representation was by checking Microsoft Connect (http://connect.microsoft.com), which listed the following issue:

Persistent Computed column using DATEADD | Microsoft Connect

From the response, let me quote:

Hi
The reason for the behavior you observe is because the conversion from string to datetime is not deterministic – it will depend on date format settings.
A work around may be to specify the date constant as a bigint (convert your date to bigint).
The behavior with the string constant is by design.
– Christian Kleinerman
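
To see what Christian means, flip the session’s date format and watch the same string literal parse to two different dates (the literal below is just an illustration; the unseparated '20090101' form I used happens to be format-independent, but the engine applies the non-determinism rule to string-to-datetime conversions in general):

SET DATEFORMAT mdy
SELECT CONVERT(datetime, '01/02/2009')  --January 2, 2009

SET DATEFORMAT dmy
SELECT CONVERT(datetime, '01/02/2009')  --February 1, 2009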

Following Christian’s advice, I sought to do what he suggested, only the programmer in me couldn’t distinguish between CONVERT (the T-SQL function) and convert (the plain English verb); in other words, I tried this:

ALTER TABLE SAMPLE
ADD DateIDTimeID AS DATEADD(minute, TimeID, DATEADD(day, DateID, CONVERT(bigint, '20090101')))
PERSISTED;

Which, of course, gives a whole different error message:

Msg 8115, Level 16, State 2, Line 2
Arithmetic overflow error converting expression to data type datetime.

The statement has been terminated.

Stepping out of programming mode for a minute, I realized that Christian meant that I should replace the string constant I was using to represent the date of January 1, 2009 with its bigint equivalent (the number of days since January 1, 1900, SQL Server’s base date).  I ran the following:

SELECT CONVERT(bigint, CONVERT(datetime, '20090101'))

Which gave me the result of 39812, which I then plugged into my ALTER TABLE statement, like so:

ALTER TABLE SAMPLE
ADD DateIDTimeID AS DATEADD(minute, TimeID, DATEADD(day, DateID, 39812))
PERSISTED;

SELECT *
FROM Sample 

And the results were much closer to my liking:

DateID TimeID DateIDTimeID
1 1 2009-01-02 00:01:00.000

 

Of course, I now had to deal with the fact that when I had designed my date and time lookup tables, I had used a seed of 1 instead of 0, so now my results were off by a day and a minute, but simple math could adjust for that (future design tip: always use 0-based keys, OR use a smart value for the date, like 20090101).  Anyway, partitioning now works, and I can alter the table as needed.
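
For completeness, the adjustment for my 1-based keys is just a pair of -1 offsets; a sketch against the sample table above (dropping and re-adding the computed column):

ALTER TABLE Sample
DROP COLUMN DateIDTimeID;

ALTER TABLE Sample
ADD DateIDTimeID AS DATEADD(minute, TimeID - 1, DATEADD(day, DateID - 1, 39812))
PERSISTED;   --DateID 1 / TimeID 1 now maps to 2009-01-01 00:00:00.000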

SCRUM, Source Control, and the SQL Server Developer (Part 2)

So my last post focused on the modifications my shop has made to our implementation of Scrum without a lot of details about how we manage our code changes.  This post is intended to explain how we set up source control to support the scrum process.

Source control is one of those poorly-defined practices; it’s something we know we should be doing as developers, but it’s not always done by DBAs.  In fact, I would guess that many shops that have DBAs doing most of their reporting and data integration tasks don’t practice source control methods at all; if they do, it’s typically in the form of scripting out the entire database or relying on differential backups.  Separating the data structure from the data itself is often difficult to do, so I am sure that many teams don’t do it at all.

We’re currently using Visual Studio Team System with Team Foundation Server as our source control repository; it’s been a steep learning curve for those of us who came from a Visual SourceSafe background, and applying that knowledge to the unfamiliar context of a new UI (VSTS:DB Pro) and concept (source control for databases) has been more than challenging, particularly since we’ve adopted a new development method (Scrum) as well.  It’s taken a lot of discussion (sometimes quite heated) to get where we are today, especially since there’s not a lot of “best practices” discussion for source-controlling SQL Server out there.

The biggest challenge has been the concept of branching and merging; TFS recognizes that you may have multiple development lines going on, especially among multiple developers.  When do you split code out to work on it?  When do you put it back together?  How does this affect deployment of new features, vs. the release of patches to fix the currently deployed line of code?

For us, we decided to address our database source control thusly:

  1. We set up the initial databases in source control in our Main branch.  The Main branch is supposed to represent the heart of our code; it’s the final resting place of changes before they go to QA. 
  2. From Main, we branch off a line of code for each PBI (Product Backlog Item; see my first post) that we’re working on.  If a PBI spans multiple databases, we include a copy of each database under that PBI.  Our cookie trail in TFS looks something like this:
    • Main…Database1
    • Main…Database2
    • Development…PBI 1…Database1
    • Development…PBI 1…Database2
    • Development…PBI 2…Database1
  3. Once a PBI is ready to be deployed, we merge the databases for that PBI back to their original sources; if another developer merges in their changes which would cause conflict, those conflicts have to be resolved before we allow the final merge (this means that if Dev1 working on PBI 1 has some changes that would affect Dev2 working on PBI 2, those changes are discovered at the second merge).
  4. We also have a line of code that represents Production deployment; at the beginning of each Sprint, we branch from the Main line into Production and then do a schema comparison with production to ensure that the Production line actually resembles Production.  If a patch is needed, we patch the Production line, deploy it, and then merge those changes back into the Main line.  If there are conflicts, we have to find them and resolve them.

There are some drawbacks; merging a patch into Main doesn’t always go smoothly, since we have to track down who is responsible for what changes.  Documentation in the procs helps, but we need to do a better job (especially when those changes involve objects that are not easily commented, like tables).  Furthermore, when it comes time to deploy from the Main branch, if QA decides that a feature isn’t ripe yet, then we have to do some legwork to back it out.  All in all, however, it works, despite the bumps along the way.

I’d be interested in hearing how others are doing source control for their database scripts; please feel free to comment below.

SQLSaturday #7: Birmingham, AL May 30, 2009

I’ve been so busy with all of my other projects at work that I’ve forgotten to really tout this.  On May 30, I’ll be presenting at the Birmingham SQLSaturday (hosted by the Steel City SQL User Group).  My talk is titled “Confessions of a Data integrator”, and will focus on dealing with other people’s database designs. 

Over the last twelve years, I’ve built a career on pulling data from a variety of data sources, most of which were poorly designed (according to the basic rules of normalization).  From that experience, I’ve pulled together a set of principles for extracting data which I plan to share.  This is a bit of a challenge for me, because I’ve never actually written any of these rules of mine down 🙂

Anyway, if you can make it, please do; stop by and see me.

Death by a thousand cuts…

This has been an awful week; things have just not gone as planned from the get-go.  I’m gonna run through several issues in this post, and perhaps someone will find some value in them; lessons learned may help others avoid the pain of my ways.

 

VSTS: DB Pro

This week, I’ve spent way too much time just trying to figure out how to use Visual Studio; I’ve mentioned in previous posts that I’m a database developer, but that I’ve spent most of my career working with the standard SQL Server DBA tools: Query Analyzer and Management Studio.  When we migrated to TFS, my manager encouraged me to adopt VSTS:DB Pro as a method of developing, and I’ve put some real effort into learning how the system works.  I’m slowly becoming accustomed to it, but there are still some very quirky things that I’ve yet to figure out.

Issue 1: Logins.  Whenever I import a database into a database project, I get a long list of errors because we have Windows accounts associated with users in the database; since those users don’t exist on my laptop, the project throws all kinds of ugly errors.  Googling the error reveals very little information; the only method I’ve found to eliminate the errors is to drop the associated users and schemas, which, of course, can lead to VERY BAD THINGS in deployment scripts.  I’m sure there is a method to resolve it; I just haven’t found it yet.

Issue 2: The Silent Source Control Failure.  Most of my projects involve references to vendor databases; we have a policy of not modifying vendor databases, so when I need to write a stored procedure that accesses data from a third-party source, I create a linked database that sits on the server beside the vendor database, and include my proc in there.  I don’t touch their code, but am able to customize my queries to pull data.  When setting up projects in VSTS:DB Pro, I usually make my custom database the primary project, and include a database reference to a project based on the vendor db.  This usually works out OK, but I ran into an error yesterday where the project wizard would complete, but the project was unable to load.  Nothing I did would fix it; I finally removed the vendor db from the solution, and was able to load my project (albeit with tons of errors because of the lack of references).

The problem? One of the stored procs in the vendor db has an extremely long name (nearly 100 characters); combined with the path to the workspace, the name of the file exceeds the 255 character limit for Windows file management.  TFS apparently doesn’t know how to tell me that the file name is too long, so it just refuses to open.  Try wasting 2 hours of precious development time tracking that down, and you can imagine how frustrating that is.

 

Haste makes waste.

A few months ago, we made a decision to adopt a version of Scrum to standardize our development processes; we’ve made up some of our rules along the way, and for the most part, we’ve done OK with it.  However, the pressure to show progress and meet the needs of the organization means that sometimes I’ve made the wrong decision in my desire to please our Product Users.  One case was today;  one of our rules is that we a) strive to only deploy once a month (at the end of a sprint), and b) if we need to deploy an emergency patch, we only deploy on one day of the week (Wednesdays).

One of our users found an issue, and a production DBA researched it and made a recommendation for a relatively simple code change; I was out of the office Monday, so I only saw the recommendation on Tuesday.  The issue was important, and the code looked good, so I thought I could get away with rolling it out today; unfortunately, I didn’t do due diligence on the code, and neglected to check for dependencies on that particular proc.  We deployed my change today at 8:30 AM; by 3:30 PM, I was rolling it back.  Luckily, the only side effect was that we generated a few extra tickets for our clients, but still, it was a painful experience.
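
For what it’s worth, a two-minute dependency check would have caught it; a sketch (the proc name is hypothetical), using the SQL 2008 dependency DMF or, on 2005, the older sp_depends:

--SQL 2008 and later
SELECT referencing_schema_name, referencing_entity_name
FROM sys.dm_sql_referencing_entities('dbo.usp_MyChangedProc', 'OBJECT')

--SQL 2005
EXEC sp_depends 'dbo.usp_MyChangedProc'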

The one thing I want to carry away from this experience is that as our system becomes increasingly complex, it becomes less likely that a simple change can be made in less than a day.  The code may be simple, but the context of the code in terms of surrounding processes SHOULD warrant time for a thorough review.  I need to quit reacting to the deemed urgency of the issue, and think about long-term stability instead.

 

Even the best deployments can have issues.

Overall, I think our deployment process works well, but even a good process can have points of failure.  Even though we have several sets of eyes examining all of our code before it goes into production, things can still slip by.  For example, at 3:45 this afternoon, I was informed that a stored procedure in production was not returning all of the data; the procedure pulls data from a third-party database, and makes certain assumptions about the nature of that data.  In this case, we have a standard naming syntax for all of the devices we manage: ClientID, followed by a space, then the device name (e.g., 999 Joe’s Server).  The stored proc had a WHERE clause that ended with LIKE '[0-9][0-9][0-9] %'; the missing row of data had a device name with the space omitted.

Now, it wasn’t entirely my fault, because I had written the code to spec; however, I should have anticipated that the naming convention could be slightly off (and still be parseable; if they had a 2-digit client id, it would be a different story).  In the end, our company looked bad to a client because of a typo, and that’s not good.  Easy to fix the code, but hard to regain the trust of the client.
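
The code fix itself is trivial; a pattern that tolerates a missing space while still requiring exactly three leading digits looks something like this (the column name is hypothetical):

WHERE DeviceName LIKE '[0-9][0-9][0-9]%'            --three leading digits, space optional
  AND DeviceName NOT LIKE '[0-9][0-9][0-9][0-9]%'   --but not a four-digit prefix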

 

Tomorrow WILL be a better day.

I’m not much for mumbo-jumbo, but I do believe that life is what you make of it.  If I go back to work tomorrow and assume that it’s going to be a bad day, then I’ll probably be proven right.  I’m changing my attitude as of this moment, and I choose to believe that the priority issue that I address tomorrow (while it may not be the issue I want to work on) will be the issue I was meant to work on. 

Building ranges using a dynamically generated numbers table

We use a queueing metaphor for dataloads for the majority of our imports, and it’s been a challenge to figure out how to generate dynamic ranges to add to the queue. In other words, we want to tell our system to go and pick data from a source that has an id value >x and <=y, and then do it again in a few minutes with a new x and y value. I should say that it’s been a challenge because I’m still not used to SQL 2005 features; I came across a bit of code using a CTE and recursion to generate a numbers table (thanks to Stefan Keir Gordon). Here’s a slightly modified version of his code:

/*thanks to Stefan Keir Gordon for the inspiration
 http://www.keirgordon.com*/

SET STATISTICS IO ON

;WITH Numbers(n) AS
   (SELECT 1 AS n
	UNION ALL
	SELECT (n + 1) AS n
	FROM Numbers
	WHERE n <= 32767)
SELECT n
FROM Numbers
WHERE n<=32767
OPTION(MAXRECURSION 32767)

In case you’re wondering, 32767 is the largest value you can specify for the MAXRECURSION hint, which makes it the practical ceiling for this technique; Stefan’s code neatly generates a single-column table with values from 1 to 32767. Looking at that, I realized I could easily get a simple range by adding a second column to the final SELECT, like so:
 ;WITH Ranges(MinValue) AS
  (SELECT 1 AS MinValue
   UNION ALL
   SELECT (Minvalue + 1) AS MinValue
   FROM Ranges WHERE MinValue <= 32767)
SELECT MinValue, MinValue + 1 as MaxValue
FROM Ranges
WHERE MinValue <= 32767
OPTION(MAXRECURSION 32767)

I end up with a two-column table with a MaxValue > MinValue, and the ranges increase by 1. But I had some issues to address:

  • I needed to be able to pass in Start and End values to this code.
  • I needed chunks bigger than 1 value in size; I also had a minimum size I wanted to use.
  • In some cases 32767 ranges would be appropriate; in some cases, I needed less.

Here’s the final code; setting the Start and End values for the range of ranges is easy to understand, but the other two values may be a bit confusing. The first is @MaxRanges, which simply says “I want no more than this number of ranges to get from the Start to the End”; the maximum value for this number is 32767. @MinChunk refers to the minimum chunk size, or the space between MinValue and MaxValue for a given row.

It may be easier to just play with the values and see; you should never have more rows than @MaxRanges, and the difference between MinValue and MaxValue for a given row should always be at least @MinChunk (except for the final row, which is capped at @End):

/*A dynamically generated table of ranges.
2009-01-25 Stuart R Ainsworth
	http://codegumbo.com/

Thanks to Stefan Keir Gordon for the inspiration
 http://www.keirgordon.com */

SET STATISTICS IO ON

DECLARE @MaxRanges int, @MinChunk int, @Start int, @End int
SELECT @Start=10, @End=200000

/*@MaxRanges must be <= 32767;
  if the range between Start and End is large,
  and @MaxRanges > 32767, the End point won't be reached.*/
SELECT @MaxRanges=1000, @MinChunk=205

IF (SELECT CEILING((@End-@Start)/(@MaxRanges*1.00)))>@MinChunk
	SET @MinChunk = CEILING((@End-@Start)/(@MaxRanges*1.00))

;WITH Ranges(MinValue) AS
  (SELECT @Start AS MinValue
   UNION ALL
   SELECT (Minvalue + @MinChunk) AS MinValue
   FROM Ranges WHERE MinValue < @End)
SELECT MinValue, MaxValue = CASE WHEN MinValue + @MinChunk < @End
			     THEN MinValue + @MinChunk
			     ELSE @End End
FROM Ranges
WHERE MinValue < @End
OPTION(MAXRECURSION 32767)
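
For reference, with @Start=10, @End=200000, @MaxRanges=1000, and @MinChunk=205, the calculated chunk (CEILING(199990/1000.00) = 200) doesn’t exceed 205, so @MinChunk stays put and the first few rows come out like this:

MinValue    MaxValue
10          215
215         420
420         625
...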

Good luck; happy coding!