
The 12th day of SQL

My Christmas tree is like this post: short, easy to assemble, and a little tacky.

So, at the first FreeCon, a bunch of writers gathered together and talked about stuff. Most of the stuff we talked about was how to be a better technical writer, as well as how to blend our writing skills with our own personal and professional goals.  We left that conference eager to write, and looking for opportunities to hone our skills; this particular series of posts was born of that collaboration, and I hope that other series will follow.  While I could list out each individual post in the Twelve Days of SQL series, it’s probably more fun to start at the beginning.  You’ll eventually get back to this one, I hope (if you don’t poke out your eyes after seeing David Stein’s Christmas ornament).

Most of the other posts in this series have highlighted their authors’ favorite posts of the year.  Me?  I wanna go out with a bang, a celebration of those posts that we all rely on but rarely celebrate.  At the heart of the technical blogging community is, well, the technical blog post, and it’s these posts that rarely get attention.  We often celebrate the witty and well-crafted essays, but we rarely celebrate the “how to do this” posts.  Sometimes these posts are little more than scripts; sometimes they’re well-crafted operas describing how to do a single thing.

Why do I sing praises of these short-but-sweet articles?  I’ll answer that in the form of a metaphor…

The Ghost of SQL Past

All blogs begin with a first post, and that first post leads to another.  Many of us who are regular (or irregular, in my case) bloggers began our blogs with a few scripts and sample code.  Why?  Because it was a useful place to dump content that we wanted to remember.  Some fine examples of this are Aaron Nelson’s early posts on PowerShell and Ken Simpson’s XML to Pivot scripts.  These early works are indicators of great things to come; their blogs are littered with samples and ideas.

But good technical blogs are born not only of coding tricks; writers craft their works by expanding their repertoire beyond scripts and samples, and move on to include their observations of life.  Sometimes these observations are a bit too revealing (as in Brent Ozar’s self-professed love of amphibians); usually they are fascinating insights into the character of a person.  When Andy Leonard comments that Iteration = Maturity, he’s not just talking about software.

The Ghost of SQL Present

In recent days, newer bloggers have carried on the tradition of the technical post, but are finding ways to blend in a sense of community as well (like David Taylor’s exposition on #sqlhelp).  A quirky sense of humor works too, as in Julie Smith’s opera of concatenation (I won’t spoil it for you, but there may be magic involved).  A successful technical blog should be both fun to read and insightful about how to do something.

The Ghost of SQL Future

Not much to say here, because we’re not there yet.  What I hope to see in the future is an evolution of what we’ve seen in the Past and the Present, and I hope that you’re reading this because you want to understand how to be a better blogger.  Technical blogs need technical content, but good technical blogs need a sense of whimsy, a touch of your personal style, and a nod to the community of content out there. Others have far better posts than I on that subject, but the simplest piece of advice I can give you is:

Write.

That’s it.  Write, because when you write, you force yourself to think, and thinking is the strongest tool in the toolbox for a technical person.   Believe me, I’m pointing the finger squarely at myself on this one as well; I have been far too reticent in my writing as of late, and I hope to rectify that shortly.  But back to you; next year, I hope to celebrate your writing in a similar post.  Tell me how to do something; share your experiences, and educate your peers. 

Up Next?  Steve Jones, for the cleanup!

How many licks DOES it take…?

So, it’s been a while since I’ve posted, but I’ve finally managed to eke out some time after Summit 2010, and wanted to follow up on a conversation that Tim Mitchell, Arnie Rowland, and I had at the Friends of Red Gate dinner.  We were doing a SQL Server oriented trivia contest which asked the following question:

How many nonclustered indexes does SQL Server 2008 support on a single table?

And the answer is: 999, which is enough to cause most DBAs to grumble under their breath about the consequences of setting up that many indexes and what they would do if they ever found that many indexes on a table.  Of course, being the amateur math geek that I am, I casually asked Tim and Arnie:

What’s the smallest table (in terms of number of columns) that would support 999 indexes?

After knocking it around for a bit, we came up with an estimate of 6, which actually isn’t too far off; however, our method of getting there was mostly intuitive, and I wanted to figure out the actual formula for calculating that number.  I knew it had to do with factorials, but I wasn’t exactly sure how to get there.  After searching the internet, I finally figured out the following principles:

  • Column order matters when building indexes, so when choosing pairs from a set of columns, the pair ab is not the same as the pair ba (ab <> ba).
  • The more columns in the table, the wider an index can be, so you have to count every possible index width.  In other words, if you have 3 columns on a table, you would have 3 single-column indexes, 6 two-column indexes, and 6 three-column indexes.

The formula that represents this is SUM(n!/(n-k)!) for k = 1 through n, where n represents the number of columns in the table and k represents the number of columns in the index.  Plugging this into a spreadsheet, you get the following matrix:

                               Number of Columns in Index (k)
Number of Possible Columns (n)     1     2     3     4     5     6      SUM
             1                     1                                      1
             2                     2     2                                4
             3                     3     6     6                         15
             4                     4    12    24    24                   64
             5                     5    20    60   120   120            325
             6                     6    30   120   360   720   720     1956
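
If you’d rather skip the spreadsheet, a quick recursive CTE reproduces the matrix (a sketch using the identity P(n,k) = P(n,k-1) * (n-k+1); SQL 2008 syntax):

-- Reproduce the matrix above: for each table width n, sum the k-column
-- permutation counts P(n,k) = n!/(n-k)! for k = 1..n
WITH perms AS
(
    SELECT  n, k = 1, p = CAST(n AS BIGINT)              -- P(n,1) = n
    FROM    ( VALUES (1),(2),(3),(4),(5),(6) ) AS t ( n )
    UNION ALL
    SELECT  n, k + 1, p * ( n - k )                      -- P(n,k+1) = P(n,k) * (n-k)
    FROM    perms
    WHERE   k < n
)
SELECT  n, possible_indexes = SUM(p)
FROM    perms
GROUP BY n
ORDER BY n;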

 

At first glance, we’re done; it looks like 6 was the right answer, because with only 6 columns in a table, you have a whopping 1,956 possible indexes to choose from.  However, there’s more to the story: SQL Server 2005 introduced the INCLUDE option to indexes, which throws a kink in the above formula. 

At first, I thought it was relatively simple: you have two subsets for each n, where the elements in each subset can’t be in the other one.  It turns out to be a little trickier than that.  Here are the principles for generating it:

  • For a set (n) of possible columns, there are two mutually exclusive subsets: the key columns (k) and the included columns (l).  The number of elements in the two subsets must be less than or equal to the number of elements in the master set.
  • Column order matters for the key columns but not for the included columns, so the formula above still works for the key columns; the included columns only need unique combinations, not permutations.

And here’s the part where my brain exploded; I couldn’t figure out a way to demonstrate the two relationships mathematically, so I built a SQL script (below) that iterates through a set of 5 columns.  All in all, I ended up with a listing of 845 possible combinations; since 845 falls short of 999, five columns aren’t enough, and 6 still stands as the minimum number of columns on a table needed to support the maximum number of nonclustered indexes.

The point to this story?  None, really.  Just a silly geek exercise.  However, I think it does point out that index strategy is a complex problem, and there are multiple ways to index any given table.  Choosing the right one is more difficult than it looks.


DECLARE @c TABLE ( NAME VARCHAR(100) );

INSERT  INTO @c ( NAME )
VALUES  ( 'a' ), ( 'b' ), ( 'c' ), ( 'd' ), ( 'e' );

-- n = columns in the table, k = key columns in the index, l = INCLUDEd columns
-- key columns: order matters, so <> allows every permutation
-- INCLUDE columns: order doesn't matter, so < keeps only unique combinations
SELECT  n = 1, k = 1, l = 0, NAME, INCLUDE = NULL
INTO    #tmp
FROM    @c
UNION ALL
SELECT  n = 2, k = 2, l = 0, NAME = c1.NAME + ',' + c2.NAME, INCLUDE = NULL
FROM    @c c1
        CROSS JOIN @c c2
WHERE   c1.NAME <> c2.NAME
UNION ALL
SELECT  n = 2, k = 1, l = 1, NAME = c1.NAME, INCLUDE = c2.NAME
FROM    @c c1
        CROSS JOIN @c c2
WHERE   c1.NAME <> c2.NAME
UNION ALL
SELECT  n = 3, k = 3, l = 0, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME, INCLUDE = NULL
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c2.NAME <> c3.NAME
UNION ALL
SELECT  n = 3, k = 2, l = 1, NAME = c1.NAME + ',' + c2.NAME, INCLUDE = c3.NAME
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c2.NAME <> c3.NAME
UNION ALL
SELECT  n = 3, k = 1, l = 2, NAME = c1.NAME, INCLUDE = c2.NAME + ',' + c3.NAME
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c2.NAME < c3.NAME
UNION ALL
SELECT  n = 4, k = 4, l = 0, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME + ',' + c4.NAME, INCLUDE = NULL
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
        CROSS JOIN @c c4
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME
        AND c2.NAME <> c3.NAME
        AND c2.NAME <> c4.NAME
        AND c3.NAME <> c4.NAME
UNION ALL
SELECT  n = 4, k = 3, l = 1, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME, INCLUDE = c4.NAME
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
        CROSS JOIN @c c4
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME
        AND c2.NAME <> c3.NAME
        AND c2.NAME <> c4.NAME
        AND c3.NAME <> c4.NAME
UNION ALL
SELECT  n = 4, k = 2, l = 2, NAME = c1.NAME + ',' + c2.NAME, INCLUDE = c3.NAME + ',' + c4.NAME
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
        CROSS JOIN @c c4
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME
        AND c2.NAME <> c3.NAME
        AND c2.NAME <> c4.NAME
        AND c3.NAME < c4.NAME
UNION ALL
SELECT  n = 4, k = 1, l = 3, NAME = c1.NAME, INCLUDE = c2.NAME + ',' + c3.NAME + ',' + c4.NAME
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
        CROSS JOIN @c c4
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME
        AND c2.NAME < c3.NAME
        AND c2.NAME < c4.NAME
        AND c3.NAME < c4.NAME
UNION ALL
SELECT  n = 5, k = 5, l = 0, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME + ',' + c4.NAME + ',' + c5.NAME, INCLUDE = NULL
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
        CROSS JOIN @c c4
        CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME
        AND c2.NAME <> c3.NAME
        AND c2.NAME <> c4.NAME
        AND c2.NAME <> c5.NAME
        AND c3.NAME <> c4.NAME
        AND c3.NAME <> c5.NAME
        AND c4.NAME <> c5.NAME
UNION ALL
SELECT  n = 5, k = 4, l = 1, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME + ',' + c4.NAME, INCLUDE = c5.NAME
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
        CROSS JOIN @c c4
        CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME
        AND c2.NAME <> c3.NAME
        AND c2.NAME <> c4.NAME
        AND c2.NAME <> c5.NAME
        AND c3.NAME <> c4.NAME
        AND c3.NAME <> c5.NAME
        AND c4.NAME <> c5.NAME
UNION ALL
SELECT  n = 5, k = 3, l = 2, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME, INCLUDE = c4.NAME + ',' + c5.NAME
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
        CROSS JOIN @c c4
        CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME
        AND c2.NAME <> c3.NAME
        AND c2.NAME <> c4.NAME
        AND c2.NAME <> c5.NAME
        AND c3.NAME <> c4.NAME
        AND c3.NAME <> c5.NAME
        AND c4.NAME < c5.NAME
UNION ALL
SELECT  n = 5, k = 2, l = 3, NAME = c1.NAME + ',' + c2.NAME, INCLUDE = c3.NAME + ',' + c4.NAME + ',' + c5.NAME
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
        CROSS JOIN @c c4
        CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME
        AND c2.NAME <> c3.NAME
        AND c2.NAME <> c4.NAME
        AND c2.NAME <> c5.NAME
        AND c3.NAME < c4.NAME
        AND c3.NAME < c5.NAME
        AND c4.NAME < c5.NAME
UNION ALL
SELECT  n = 5, k = 1, l = 4, NAME = c1.NAME, INCLUDE = c2.NAME + ',' + c3.NAME + ',' + c4.NAME + ',' + c5.NAME
FROM    @c c1
        CROSS JOIN @c c2
        CROSS JOIN @c c3
        CROSS JOIN @c c4
        CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME
        AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME
        AND c2.NAME < c3.NAME
        AND c2.NAME < c4.NAME
        AND c2.NAME < c5.NAME
        AND c3.NAME < c4.NAME
        AND c3.NAME < c5.NAME
        AND c4.NAME < c5.NAME;

-- Count the possible index definitions for each table width
SELECT  n, COUNT(*) AS combinations
FROM    #tmp
GROUP BY n
ORDER BY n;

DROP TABLE #tmp;

#TSQL2sDay – My Least Favorite SQL Server Myth


 

It’s time again for another T-SQL Tuesday, hosted this month by Sankar Reddy; the topic is misconceptions in SQL Server.  It’s been a while since I wrote one of these (I usually forget about them until the following Wednesday), but this topic is a good one.  I’ve had many discussions with people about the following myth for a long time, so it’s nice to be able to put it to rest again.

The myth?  “You should always use stored procedures in your code, because SQL Server is optimized for stored procedure re-use.”  Don’t get me wrong; there are lots of arguments to use stored procedures (security, obfuscation, code isolation), but performance is not necessarily a good one.   This myth has been around for a long time (since SQL Server 7), and Binging “stored procedures SQL Server performance” yields such gems as the following:

SQL Server Performance Tuning for Stored Procedures

As you know, one of the biggest reasons to use stored procedures instead of ad-hoc queries is the performance gained by using them. The problem that is that SQL Server will only … (www.sqlserverperformance.com/tips/stored_procedures_p2.aspx)

Increase SQL Server stored procedure performance with these tips

Database developers often use stored procedures to increase performance. Here are three tips to help you get the most from your SQL Server stored … (articles.techrepublic.com.com/5100-10878_11-1045447.html)

 

The guts of this myth originate from the fact that prior to version 7 (released in 1998), SQL Server WOULD precompile stored procedures and save an execution plan for future reuse of that procedure, BUT THAT CHANGED AS OF VERSION 7.0.  Here’s a quote from Books Online (SQL Server 2000) that tries to explain what happened (emphasis added by me):

Stored Procedures and Execution Plans

In SQL Server version 6.5 and earlier, stored procedures were a way to partially precompile an execution plan. At the time the stored procedure was created, a partially compiled execution plan was stored in a system table. Executing a stored procedure was more efficient than executing an SQL statement because SQL Server did not have to compile an execution plan completely, it only had to finish optimizing the stored plan for the procedure. Also, the fully compiled execution plan for the stored procedure was retained in the SQL Server procedure cache, meaning that subsequent executions of the stored procedure could use the precompiled execution plan.

SQL Server 2000 and SQL Server version 7.0 incorporate a number of changes to statement processing that extend many of the performance benefits of stored procedures to all SQL statements. SQL Server 2000 and SQL Server 7.0 do not save a partially compiled plan for stored procedures when they are created. A stored procedure is compiled at execution time, like any other Transact-SQL statement. SQL Server 2000 and SQL Server 7.0 retain execution plans for all SQL statements in the procedure cache, not just stored procedure execution plans. The database engine uses an efficient algorithm for comparing new Transact-SQL statements with the Transact-SQL statements of existing execution plans. If the database engine determines that a new Transact-SQL statement matches the Transact-SQL statement of an existing execution plan, it reuses the plan. This reduces the relative performance benefit of precompiling stored procedures by extending execution plan reuse to all SQL statements.

SQL Server 2000 and SQL Server version 7.0 offer new alternatives for processing SQL statements. For more information, see Query Processor Architecture.

Note from the above quote that the query optimizer now uses execution plans for ALL T-SQL statements, not just stored procedures.  The perceived performance gain from stored procedures stems not from some magic use of CREATE PROC, but rather from plan re-use, which is available to ad-hoc queries as well. 

So what promotes plan re-use?  The simplest answer is parameterization; SQL statements which use parameters efficiently (and that includes many stored procedures) are more likely to reuse a plan.  Developers should focus on making the most of parameters, rather than assuming that a stored procedure will be efficient simply because it’s a stored procedure.
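
To illustrate (a minimal sketch; the table, column, and parameter names are hypothetical), an ad-hoc statement submitted through sp_executesql with an explicit parameter gets a cached plan that is reused for subsequent parameter values, and the plan cache DMVs will show the use count climbing:

-- Parameterized ad-hoc statement: one cached plan, reused for every @CustomerID value
EXEC sp_executesql
    N'SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = @CustomerID;',
    N'@CustomerID INT',
    @CustomerID = 42;

-- Check the plan cache for the statement and its use count
SELECT  cp.usecounts, cp.objtype, st.text
FROM    sys.dm_exec_cached_plans AS cp
        CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE   st.text LIKE '%dbo.Orders%';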

A final thought: For a great starting place on understanding SQL Server plan reuse, see http://msdn.microsoft.com/en-us/library/ee343986(SQL.100).aspx, including the Appendix A: When Does SQL Server Not Auto-Parameterize Queries.  Also, this post by Zeeshan Hirani explains why LINQ to SQL query plans don’t get reused.

Querying XML in SQL Server 2005+: namespaces

I recently helped a friend solve an XML problem, and thought I would post the solution here.  Although there are lots of notes out there on how to use XQuery in SQL Server 2005+, this was a real-world scenario that was trickier than I expected.  The friend works for an insurance brokerage, and in one of their applications, accident questionnaires (and their answers) are stored as XML.  This allows them to treat all questionnaires the same, regardless of origin, as long as the QuestionCodes are common across vendors.

Below is the sample data that he was asking me about; he needed to get one question and answer per row into a data set:


DECLARE @t TABLE ( RowID INT, Fragment XML );

INSERT  INTO @t ( RowID, Fragment )
VALUES  ( 1, '<Questionnaire xmlns="http://www.xxx.com/schemas/xxxXXXXXxxx.0">
<Question>
<QuestionCode>74</QuestionCode>
<Question>Why did you wreck your car?</Question>
<Answer>I was drunk</Answer>
<Explanation />
</Question>
<Question>
<QuestionCode>75</QuestionCode>
<Question>Why is the rum all gone?</Question>
<Answer>Because I drank it.</Answer>
<Explanation />
</Question>
</Questionnaire>' ),
        ( 2, '<Questionnaire xmlns="http://www.xxx.com/schemas/xxxXXXXXxxx.0">
<Question>
<QuestionCode>74</QuestionCode>
<Question>Why did you wreck your car?</Question>
<Answer>Stuart was drunk</Answer>
<Explanation />
</Question>
<Question>
<QuestionCode>75</QuestionCode>
<Question>Why is the rum all gone?</Question>
<Answer>Because I made mojitos.</Answer>
<Explanation />
</Question>
</Questionnaire>' );

I thought it was a simple query: use the .nodes() method to rip each question and its corresponding answer into its own row.  But for some reason, when I ran the following, I got interesting results:

SELECT  t.RowID
      , QuestionCode = t1.frag.value('(QuestionCode)[1]', 'int')
      , Question     = t1.frag.value('(Question)[1]', 'varchar(max)')
      , Answer       = t1.frag.value('(Answer)[1]', 'varchar(max)')
      , Explanation  = t1.frag.value('(Explanation)[1]', 'varchar(max)')
FROM    @t t
        CROSS APPLY t.Fragment.nodes('//Questionnaire/Question') AS t1 ( frag )

RowID  QuestionCode    Question    Answer  Explanation

That’s right, nothing.  Strange, considering I’ve done variations of this query for a couple of years now to parse out firewall data fragments.  I looked closer, and tried to see what was different about the XML fragment from this example compared to mine, and it was clear: a namespace reference.   Most of the data I deal with is not true XML, but rather fragments I convert to XML in order to facilitate easy transformations.  To test, I stripped the namespace line (xmlns="http://www.xxx.com/schemas/xxxXXXXXxxx.0" ) out, and voila!  Results!

RowID  QuestionCode  Question                      Answer                    Explanation
1      74            Why did you wreck your car?   I was drunk
1      75            Why is the rum all gone?      Because I drank it.
2      74            Why did you wreck your car?   Stuart was drunk
2      75            Why is the rum all gone?      Because I made mojitos.

Well, that was great because it showed me where the problem was, but how do I fix it?  I stumbled upon a solution, but to be honest, I’m not sure it’s the best one.  If I modify my query to refer to any namespace (the old wildcard: *) like so:

 

SELECT  t.RowID
      , QuestionCode = t1.frag.value('(*:QuestionCode)[1]', 'int')
      , Question     = t1.frag.value('(*:Question)[1]', 'varchar(max)')
      , Answer       = t1.frag.value('(*:Answer)[1]', 'varchar(max)')
      , Explanation  = t1.frag.value('(*:Explanation)[1]', 'varchar(max)')
FROM    @t t
        CROSS APPLY t.Fragment.nodes('//*:Questionnaire/*:Question') AS t1 ( frag )

 

I get the correct results.

RowID  QuestionCode  Question                      Answer                    Explanation
1      74            Why did you wreck your car?   I was drunk
1      75            Why is the rum all gone?      Because I drank it.
2      74            Why did you wreck your car?   Stuart was drunk
2      75            Why is the rum all gone?      Because I made mojitos.

Here’s the question for any XML guru that stumbles along the way; is there a better way to do this?
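
One alternative worth weighing (a sketch using the same placeholder namespace URI as above, and not necessarily the “better way” I’m asking about): the WITH XMLNAMESPACES clause lets you declare the document’s default namespace once, so the paths don’t need wildcards at all.

WITH XMLNAMESPACES ( DEFAULT 'http://www.xxx.com/schemas/xxxXXXXXxxx.0' )
SELECT  t.RowID
      , QuestionCode = t1.frag.value('(QuestionCode)[1]', 'int')
      , Question     = t1.frag.value('(Question)[1]', 'varchar(max)')
      , Answer       = t1.frag.value('(Answer)[1]', 'varchar(max)')
      , Explanation  = t1.frag.value('(Explanation)[1]', 'varchar(max)')
FROM    @t t
        CROSS APPLY t.Fragment.nodes('//Questionnaire/Question') AS t1 ( frag );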

T-SQL tip: Use a role for EXECUTE permissions

If you are in a high-security environment, this may not be the best tip for you, but if you’re in a situation like mine where you have a SQL user that is connecting to a database in order to EXECUTE stored procs, and you know that user needs to be able to EXECUTE every proc in that database, you can save a little time by creating a role for that:


CREATE ROLE usp_execute
GO

GRANT EXECUTE ON SCHEMA::dbo TO usp_execute
GO

This means that any user added to this role will be able to execute EVERY stored proc belonging to the dbo schema from this point forward; again, be cautious when using this.  Security models should not be taken lightly.
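
To put the role to work, add the connecting account as a member (a quick sketch; the user name here is hypothetical):

-- Add the application's login-mapped database user to the role
EXEC sp_addrolemember @rolename = 'usp_execute', @membername = 'MyAppUser';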

Speaking today: PASS AppDev Virtual Chapter

I know it’s short notice, but to be honest, I totally forgot about this until a couple of weeks ago.  I’ll be presenting today at noon eastern on a LiveMeeting for the Application Developers Virtual Chapter of PASS.  Deets below:

“You Got XML In My Database? What’s Up With That?”
May 11th 12:00 PM EDT (GMT -4)
Presenter: Stuart Ainsworth

A brief presentation exploring the marriage of XML and relational databases, including when it works and when it doesn’t. Coverage will include various use case scenarios, and some tips on how to improve performance using design techniques.

Stuart Ainsworth

Stuart R Ainsworth, MA, MEd is a Database Architect working in the realm of Financial Information Security; over the last 15 years, he’s worked as a Research Analyst, a report writer, a DBA, a programmer, and a public speaking professor. He’s one of the chapter leaders for AtlantaMDF, the Atlanta chapter of PASS. A master of air guitar, he has yet to understand the point of Rock Band (“You push buttons? What’s that all about?”).

How do I view the presentation?
Attendee URL:  Live Meeting link

#TSQL2sDay 003: Maslow and relational design

Rob Farley is hosting the third installment of TSQL Tuesday, and it’s a fun one: relationships (in honor of Valentine’s Day).   While I’m not currently in much of a mood to opine on the virtues of love and databases, I did want to post something a bit more esoteric this time.  Not many of you may know that I don’t have a formal background in Information Technology (some of my more sarcastic friends just held their tongues at that one); I actually have a Master of Arts in Communication and a Master of Education in Instructional Technology.  I tripped into IT when I failed my comprehensive exams for the doctoral program in Speech Communication at the University of Georgia.  Awful time, but ultimately one of the best things to ever happen to me.

Anyway, why is this relevant?  Because the goal of this post is to attempt to extend one of the more famous models of social psychology and communication to database design; bear with me (I’m assuming that many of you either have no background in social psych or slept through it), but I’m hoping that this extension to the metaphor will benefit you in terms of your application design.

Maslow: the crash course.

The following is a BRIEF introduction to the theory; if you want more details, Wikipedia is your friend. In a nutshell, Abraham Maslow proposed that humans, as social animals, are driven to fulfill certain basic needs in a quest for self-actualization or enlightenment.  He proposed a pyramid-shaped model of five sets (or stages) of these needs, with the four lower stages needing to be satisfied before attempting the fifth; few people ever attain the fifth level, but the quest to reach it is part of our collective experience.  I’ve defined the five stages below:

Physiological:

The basic requirements for human existence; food, water, etc.

Safety:

This often translates into security, but it’s different than the term we use in information technology careers; safety is the ability to acquire and maintain goods for ongoing existence.  The Physiological elements are immediate needs; Safety elements are the ability to fulfill those immediate needs at a future date.

Social:

Where do we belong?  How do we interact with others who need us (and we need)?  What is our role, and how does that affect our definition of the self?

Esteem:

Esteem stems from the social need; once our relationship with others has been established, we can truly begin to define ourselves and the virtue of our importance in the world.

Self-Actualization:

Self-actualization is the ultimate fulfillment of one’s potential; to be what one is, without need for constant reinforcement from other beings, yet able to exist in harmony with purpose.  Few people have ever attained this stage, but as stated before, the quest to reach the top of the pyramid drives human development.

So what does this mean to the database designer?

Why is all of this important?  This is not a perfect analogy, but if we extend Maslow’s model to the area of database design, some interesting questions arise (particularly in the third and fourth stages, which is why I felt this post would be relevant to the TSQL Tuesday challenge of relationships).  Let’s take each stage and step through it again.

Physiological:

While applications don’t have physiological needs, they DO require certain basic elements for long-term survival.  Questions to consider at this stage are things like: How much space will I need?  What are the server requirements?  Can my database live in the cloud or on a mobile device?  What sort of I/O concerns do I have?

Safety:

Recall that safety is NOT security (in terms of who has access to the data), but it is security in terms of long-term survival of the application.  Is the database you’re designing intended for a long-term project, or is it “throw-away” code?  Have you designed it in such a way so that it’s easy to replace without impacting the dependent application?

Social:

Speaking of dependent applications (and herein lies the relationship aspect of this post), is your database designed so that it is loosely related to and decoupled from the application?  Does the database fulfill the needed role within the relationship (data storage) without treading too far into business logic?  Can the database handle multiple relationships with various applications (UI/reporting/business services)?

Esteem:

Closely related to the social nature of the database within the application stack is the need for self-esteem within the database: can the database meet the needs of the dependent applications WHILE retaining enough information to establish new relationships?  A classic example of this is the lookup table; a database with low self-esteem will only store the enumerated values provided to it by some other application. 

Without the enabling application, the database lacks sufficient internal definition to validate meaning; in practical terms, this means that the database is not decoupled from the application enough to enable the development of alternate accessing applications.  For example, my day job is to reverse engineer vendor databases; few things in the world are more disturbing than a table full of numbers without any sort of category associated with that number.  The application designer decided to store that enumeration in the application; security through obfuscation IS a method of securing your database, but not a very effective one.

A high-self esteem database will store all of the appropriate lookup values (complete with constraints) in order to provide complete validity within the structure.  The database can then be reused by several different applications, without requiring a complete set of business rules to determine those relationships.    The data layer is definitional; the business layer should be procedural.
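
As a concrete sketch of that idea (the table and column names are hypothetical), both the lookup values and the constraint live in the database, so any application that connects can discover what a status code means:

-- The lookup values live in the database, complete with constraints...
CREATE TABLE dbo.ClaimStatus
(
    ClaimStatusID  TINYINT     NOT NULL PRIMARY KEY,
    StatusName     VARCHAR(50) NOT NULL UNIQUE
);

-- ...so the referencing table is meaningful without the original application
CREATE TABLE dbo.Claim
(
    ClaimID        INT     NOT NULL PRIMARY KEY,
    ClaimStatusID  TINYINT NOT NULL
        CONSTRAINT FK_Claim_ClaimStatus REFERENCES dbo.ClaimStatus ( ClaimStatusID )
);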

Self-Actualization:

I have to admit that discussing self-actualization in regards to application design makes me think of HAL.  “I’m sorry, Dave….”

To try and stay on track with this metaphor, self-actualization is the basic premise of BI; when your database can start providing you knowledge instead of just data, it has attained the highest level of potential.  Few apps make it that far without requiring substantial redesign, but the ones that do are invaluable to the enterprise they support.

So where are we?

Dunno.  I hope this little exercise made your brain hurt just a bit, and opened up a new metaphor for understanding database design issues within the application stack.   If you have more questions than answers, that’s a good place to be.

#TSQL2sDay 001: Date/Time Issues and Data Modeling

This is my first post for T-SQL Tuesday (thanks, Adam, for the idea!); I’m hoping this will be a continuing block party, and hopefully connect a lot of us SQL bloggers together.  There’s a lot of great content out there, and sometimes it’s just tough to find.

Just a few notes before I get started on this; for this particular post, I’m pulling from my experiences as a data architect for the last few years.  I spend a lot of time reverse-engineering other databases in order to centralize and analyze data from a variety of sources.  Most of my efforts pre-date SQL 2008, so I don’t have a lot of experience with the newer datetime data types introduced in that version; however, I’ll point you in that direction if I think it might address an issue that I raise.  Also, there’s no particular order to my list of notes; think of it as a random collection of date-time issues.  Some of them may be specific to T-SQL and SQL Server, and others may just be procedural notes about data integration (and thus may be applicable to other platforms).

Ready?  Let’s get started.

1.  If you are a database professional who works with SQL Server, strike the word “timestamp” from your vocabulary.

Way back when SQL Server was in its infancy, some developer decided that timestamp would be a good name for a binary column that changes whenever the row is updated.  I don’t know the origins of the timestamp datatype; all I know is that it has nothing to do with dates or times.  So why am I including it in a post about dates and times?  Because for most speakers of the English language, a timestamp is a marker that indicates when an event happens.  Many junior SQL Server developers (and even senior developers on other platforms) will often be confused by the timestamp datatype, and wonder what’s going on with that column they just added.

My recommendation?  Avoid using the word timestamp in design discussions.  When someone mentions “we need to add a timestamp to that row”, make sure that you understand what they mean.  Usually they want you to add a column that captures the datetime when the event happened, NOT an indicator of when the row last changed.  Note that I’m suggesting that you avoid the WORD timestamp; obviously, if a database requires a timestamp column, use it.
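
A quick sketch of the distinction (the table is hypothetical; rowversion is the synonym for the timestamp datatype): the rowversion column changes whenever the row changes, while a datetime column actually records when the event happened.

CREATE TABLE dbo.AuditedEvent
(
    EventID       INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    EventDateUTC  DATETIME   NOT NULL
        CONSTRAINT DF_AuditedEvent_EventDateUTC DEFAULT ( GETUTCDATE() ), -- when the event happened
    RowVer        ROWVERSION NOT NULL  -- the "timestamp" datatype: no date/time meaning at all
);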

2.  Does anybody really know what time it is?

One of the benefits of a distributed architecture is that your database server doesn’t have to live on the same box as your application server; in fact, your database server may not be in the same timezone as your application server.  And, if you’re using a webserver as an application host, your user may not be in the same timezone as either your application or your database.

I mention this because as a database designer, you may need to capture multiple points-in-time (or you may not); you cannot assume that the application is consistently providing the same time reference for all users.  If you’re using a truly distributed design, you may have one host in New York and one in Los Angeles, and your users may connect to either one to enter in data; what date do you use?

For this reason, I usually advocate two things: 1) always include a database-centric referential date of event (in other words, include a column with a default value of the database time for important data), and 2) always use UTC for saving time.  That second statement’s important enough to warrant a note of its own.

3.  When in doubt, use UTC.

Time is relative to geography; the application you build today in New York may eventually have clients in New Delhi.  If your processes could ever cross multiple time zones (for example, you want to track shipping from Atlanta to Oregon), you’d get the most benefit out of storing your data as UTC in order to understand the relationship between event A (which occurred in EST) and event B (in PST).  It’s a lot easier to convert from UTC to a local time for display than it is to convert from local time to UTC, do the math, and then convert back.
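
Here’s a minimal sketch of that approach (the table and column names are hypothetical): store the event time in UTC by default, and apply the server’s current UTC offset only when displaying it.  Note that the display conversion below uses the server’s current offset, so it doesn’t account for historical daylight saving changes.

-- Store in UTC
DECLARE @Shipment TABLE
(
    ShipmentID    INT      NOT NULL,
    ShippedAtUTC  DATETIME NOT NULL DEFAULT ( GETUTCDATE() )
);

INSERT INTO @Shipment ( ShipmentID ) VALUES ( 1 );

-- Display in the server's local time by applying the current UTC offset
SELECT  ShipmentID
      , ShippedAtUTC
      , ShippedAtLocal = DATEADD(minute, DATEDIFF(minute, GETUTCDATE(), GETDATE()), ShippedAtUTC)
FROM    @Shipment;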

One problem I’ve encountered with recording data in UTC is that if you have centralized data but distributed users who want to see their data in local time, building OLAP cubes is a pain.  If you want to analyze data for Atlanta and data for Oregon, Atlanta’s day begins 3 hours earlier than Oregon’s.  I have not found an easy solution to this yet, but I am starting to investigate the new datetimeoffset data type in SQL 2008.

4.  Use a standard notation for calculating differences between dates.

Obviously the new date and time datatypes in SQL 2008 address a desire of many in the development community for a method of storing ONLY the date or time of an event, but if you’re using an application that still stores the datetime of an event and you need to round the data to the nearest day, hour, week, whatever unit of time, then use  a standard method for doing so.

(I am assuming that you are using date-based arithmetic rather than converting to a varchar and back).

SELECT DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0)

is the preferred method; you can read why at my previous post.
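
The same trick extends to other units of time; a couple of quick variations (sketches of the idiom, not an exhaustive list):

SELECT DATEADD(hour,  DATEDIFF(hour,  0, GETDATE()), 0);  -- top of the current hour
SELECT DATEADD(month, DATEDIFF(month, 0, GETDATE()), 0);  -- first day of the current month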

5.  Use a standard notation for referring to dates.

Americans are weird about dates; we’re one of the few countries that puts the middle container (the month) before the smallest container (the day), followed by the largest container (the year).  Clueless?  When people ask you for your birthdate, how do you answer?  Most people rattle off Month-Day-Year.  An international notation would be Year, Month, Day.

Again, if you’re designing databases and queries that might cross international borders, you’ll save yourself some debugging time in the future if you can get into the habit of using an international standard notation for referring to datetimes.   A query that asks for data WHERE DateOfEvent > ‘11/1/2009’ may work fine in the US (returning data newer than November 1, 2009), but may blow up in the UK (returning data greater than January 11, 2009).

How do you know if you’re a DBA that’s been bitten by this?  You start dating everything (including personal checks) using a notation like ‘2009-11-01’.  I  get odd looks all the time (probably not just because of the way I date my checks).
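
If you want to see the ambiguity for yourself, here’s a quick sketch (the dates are arbitrary): the same literal flips meaning under a different DATEFORMAT setting, while the unseparated ISO form does not.

SET DATEFORMAT dmy;  -- simulate a UK-style session

SELECT CAST('11/1/2009' AS DATETIME);  -- January 11, 2009 under dmy (November 1 under mdy)
SELECT CAST('20091101' AS DATETIME);   -- always November 1, 2009, regardless of DATEFORMAT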

6.  Finally, get used to the fact that there’s lots of methods to store datetimes.

If you’re doing any sort of database integration, you’re probably familiar with many of the same issues I laid out.  There are others that I haven’t covered, but the important thing I want to focus on is that just because I think these are good habits to get into, others have different opinions.  Some DB pros prefer to use epoch time rather than a standard datetime; I not only have to be familiar with my own methods of doing something (and be prepared to defend those methods), but I also have to be willing to understand other methodologies so that I can do the conversion.

Good Habits To Adopt: Enforcing the natural primary key

I’ve been reading Aaron Bertrand’s great series of blog posts on bad habits to kick, and have been thinking to myself: what are some good habits that SQL Server developers should implement?    I spend most of my day griping about bad design from vendors, yet I hardly ever take the time to document what should be done instead.  This post is my first attempt to do so, and it’s based on the following assumptions:

  • Good habits are going to be a lot more controversial than bad habits, and
  • SQL Server doesn’t enforce many of these good habits for you.

The first point refers to the fact that some of the choices that I make are not necessarily the best way to do things, and they may not satisfy the need of every application.  I’m a firm believer that there is an exception to every rule, but my goal is to at least define what the rules are (and again, these rules are my own creation and someone may have better rules).  The second point refers to the fact that SQL Server enforces the rules of SQL, but leaves some of that enforcement open to interpretation.  For example, the relational model defined by SQL assumes that tables are related, but SQL Server doesn’t require that you define a FOREIGN KEY (or even a PRIMARY KEY).

So here’s my first good habit:

When defining a surrogate primary key for a table, you should enforce the natural primary key with the use of a UNIQUE constraint.

To really understand this, you have to start with defining what a surrogate primary key is versus a natural primary key.  You can search for a variety of definitions, but I’ll use the following:

  • Primary Key: a non-nullable attribute (or combination of attributes) that can be used to uniquely identify a specific instance of an entity.  When used within SQL, a primary key can be mapped to a column (or columns) in a table, and the value of the key uniquely identifies a row.
  • Natural Primary Key: a primary key that is not auto-generated by the database or application.  The key is comprised of attributes that are associated with an entity, and the value of those attributes is defined by some authority beyond the scope of the database or application.  For example, a Social Security number is an “arbitrarily” assigned number that belongs to a specific citizen of the United States; most databases that use the Social Security number do not create the number, but rather use it as a reference to a particular US citizen.
  • Surrogate Primary Key: a primary key that is auto-generated by the database or application to specifically identify the row in the table representing the collection of entities.  Surrogate keys have no meaning outside of the database and have no relationship to the other attributes in the table.  An ID of 1 simply identifies a row in a table; a row representing a person, a squid, or an automobile may each have an ID of 1, depending on which table the data lives in.

Sidebar: as I was writing this, Pinal Dave posted the following to his blog: http://blog.sqlauthority.com/2009/10/22/sql-server-difference-between-candidate-keys-and-primary-key-2/ 

Most novices recognize that every table needs a primary key, and surrogate keys offer some benefits that natural keys do not, including:

  • Immutability: the ability of a key to stay constant over time.  A natural primary key (such as a person’s name) may change, but a surrogate key does not.
  • Simplicity of relational JOINs: surrogate keys can remain a single column for each table they represent.  For example, a complete invoice may need to be represented by a ClientID, an InvoiceID, and the LineIDs for the lines on that invoice.  Joining on the natural keys may require the Client Name and Address, the Invoice Number, and the Line Number. 

However, surrogate keys have one major weakness: they do NOT enforce the unique validity of each row.  If you use the IDENTITY property in SQL Server to auto-generate your surrogate PRIMARY KEY, and you insert Stuart Ainsworth into your table of Employees, and you accidentally run your INSERT script again, you’ve just double-inserted Stuart Ainsworth.  While there are certainly multiple people with my name, I’m the only one at my company.  However, my application never noticed it.

Using a UNIQUE CONSTRAINT on the columns holding the natural key information avoids this problem; you get the benefits of a surrogate key AND the unique validation of a natural primary key.   The hard part is, of course, identifying the appropriate natural primary key to enforce.  However, this exercise should NOT be overlooked when designing a database.
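
Here’s a minimal sketch of the habit (the table and the choice of natural key columns are hypothetical; identifying the right natural key is still the hard part): the surrogate key remains the PRIMARY KEY, while a UNIQUE constraint enforces the natural key.

CREATE TABLE dbo.Employee
(
    EmployeeID  INT IDENTITY(1,1) NOT NULL,
    FirstName   VARCHAR(50)       NOT NULL,
    LastName    VARCHAR(50)       NOT NULL,
    BirthDate   DATE              NOT NULL,
    CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED ( EmployeeID ),
    -- the natural key: a second INSERT of the same person now fails
    CONSTRAINT UQ_Employee_NaturalKey UNIQUE ( FirstName, LastName, BirthDate )
);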