
#TSQL2sDay… For Whom Do You Log?

This month’s T-SQL Tuesday is being hosted by Aaron Nelson [blog | twitter]; since I’m posting this late in the day, I got a chance to sneak a peek at some of the other entries, and most of them were great technical discussions on how to log, what to log, and why you should log.  I’m not feeling technical today; today’s a conceptual day.  So, rather than write about the pros and cons of logging, I thought I would ask you to step back and consider: who is your audience?

At my company, we monitor logs for our customers; I’ve had to reverse engineer a bunch of different log formats in my day, and there are some basic principles behind good logging practices; here’s my stab at them:

1. Logging without analysis is useless storage of information.

I realize that our jobs as data professionals have gotten more exciting because of policy requirements that insist upon the storage of all kinds of information for exceedingly long periods of time.  I recently read a requirement that said a company must maintain source code for 20 years; that’s a heckuva long time to keep a log of changes around.  Unfortunately, if no one ever looks at the log, then why store it?  If you’re going to be storing information, you need to have some process that consumes that information in order to get the most value out of it.  Good logging processes assume that someone will be reviewing the log at some point, and using that information to act. 

2. Logging structures should be flexible.

If you are logging information with the mindset that someone will be reviewing the log in the future, then you need to balance precision (i.e., gathering adequate information to describe the logged event) against saturation (don’t over-log; not every event is important).  For example, if you’re building an audit log to track changes to a customer account, you want to be able to isolate “risky” behavior from normal account maintenance.  If your logs become lengthy result-sets filled with change statements, it’s easy to overlook important events such as a bad command.

Most logging structures attempt to overcome this by having some sort of categorical typing appended to the log; in other words, if we think in tabular terms, the columns of a log dataset might look like:

  • DateOfEvent – datetime
  • Category – Classification of the event (Error, Information, etc)
  • Severity – Some warning of how important the event is
  • ErrorCode – Some short (usually numeric) code that has meaning extrinsic to the system
  • Message – The payload; a string description of what happened.
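
In T-SQL terms, that structure might look something like the following sketch (the table name, data types, and lengths here are purely illustrative):

CREATE TABLE EventLog
    (
      DateOfEvent DATETIME NOT NULL
    , Category VARCHAR(20) NOT NULL       -- Error, Information, etc.
    , Severity INT NOT NULL
    , ErrorCode INT NULL
    , Message VARCHAR(MAX) NULL           -- the payload
    )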

It becomes relatively easy to isolate the error messages from informational messages; however, how do you search for non-categorical information within the message itself?  For example, if you want to determine that there was a specific error code associated with a specific logon, and that logon information is embedded in the message of your log file, how do you search it?  The easy answer is to use wildcards, but is there a better way?  In my experience, good logs use some form of intra-message tagging to isolate key elements within the message; the log file remains simple for quick searches, but can easily be adapted for more in-depth searches.  I like XML attributes for payload logging; they’re easy to implement, and can be parsed quickly.  For example:

acct="Stuart Ainsworth" msg="Access Denied" object="SuperSecretPage.aspx"

is an easy message to shred and look for all denied attempts on SuperSecretPage.aspx.  If I wanted to look for all activity by Stuart Ainsworth, I could do that as well.
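
Sticking with the hypothetical EventLog table sketched earlier, and assuming every Message is stored as well-formed attribute text (straight quotes, properly escaped), a rough sketch of that kind of search might look like this:

SELECT  l.DateOfEvent
      , Acct   = x.evt.value('(/evt/@acct)[1]', 'VARCHAR(100)')
      , Object = x.evt.value('(/evt/@object)[1]', 'VARCHAR(255)')
FROM    EventLog l
CROSS APPLY ( SELECT evt = CONVERT(XML, '<evt ' + l.Message + ' />') ) x   -- wrap the attributes in an element so they shred as XML
WHERE   x.evt.exist('/evt[@msg="Access Denied" and @object="SuperSecretPage.aspx"]') = 1

Swap the predicate to @acct="Stuart Ainsworth" and you have the second search, without resorting to wildcards against the raw message text.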

3.  Logging should have a maintenance plan.

If you log a lot, you know that logs fill up quickly.  How long do you retain your logs (and the answer shouldn’t be “until the drive fills up”)?  Too much information that is easily accessible is both a security risk and a distraction; if you’re trying to find out about a recent transaction by a user, do you really need to know that they’ve been active for the last 10 years?  Also, if your log file does fill up, is your only option to “nuke it” in order to keep it running and collecting new information?

A good logging strategy will have some sort of retention plan, and a method of maintaining older log records separately from new ones.  Look at SQL Server error logs, for example; every time the service starts, a new log file is created.   Is that the best strategy?  I don’t think so, but it does isolate older logs from newer ones.   If you’re designing a logging method, be sure to figure out a way to keep old records separate from new ones; periodically archive your logs, and find a way to focus your resources on the most recent and relevant information.
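
If your log lives in a table, the retention piece doesn’t have to be fancy; a minimal sketch (assuming the hypothetical EventLog table from earlier, a matching EventLogArchive table, and a 90-day retention window) could be a scheduled job that runs something like this:

INSERT INTO EventLogArchive ( DateOfEvent, Category, Severity, ErrorCode, Message )
SELECT  DateOfEvent, Category, Severity, ErrorCode, Message
FROM    EventLog
WHERE   DateOfEvent < DATEADD(day, -90, GETUTCDATE())

DELETE FROM EventLog
WHERE   DateOfEvent < DATEADD(day, -90, GETUTCDATE())

In a real system you’d want those two statements in a transaction (and probably batched), but the principle is the same: old records move out of the active log on a schedule, not when the drive fills up.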

#TSQL2sDay 28–Jack of All Trades, Master of None

This month’s topic was chosen by Argenis Fernandez (blog | twitter): "Jack of All Trades, Master of None?".  It’s interesting to me lately because I seem to keep encountering blog posts and articles that ask some variant of this question.  I’m not sure if it’s spurred in part by the recent release of SQL Server 2012 (which is going to change a lot of the ways we manage data), or if it’s something in the collective waters bubbling under the Internet.  Or perhaps, it’s just the fact that I’m trying to define myself in a new role at my work.  Don’t know. 

I’ve always been a data person; one of my first jobs in IT was working for a forestry agent doing data entry; my job was to take the handwritten note cards from the tree counter, translate their coding system into digits, and enter it into an old Macintosh computer (this was the late 80’s, and Macs were super sophisticated at the time).  These numbers were important; they represented an estimate of how much timber (and what kind) was on a property, so that the value of that property could be accurately assessed.  It had nothing to do with SQL; wasn’t really even any sort of programming.  Just look at a card, mentally transform the code into a number and tree type, and punch it in.

Later, in college, I got involved in doing social science research; this eventually led to a position in public health & communication research.  Still digging on numbers; I was using SPSS at the time.  I went to bed at night dreaming of statistical relationships and discovering new stuff.  When our department encountered resistance from the IT department over managing one of our studies, I started hacking into Microsoft Access to build a contact management platform.  This was a transition point for me; I went from entering numbers to evaluating numbers to finally programming.  However, it was always data.

I made a career change and dove into true IT work shortly after that; my job title said “Access Report Developer”, but I became the DBA (because there wasn’t one).  I loved SQL 2000, but when the IT department at that company shrank, I became not only a programmer, a report developer, and the DBA, but also the guy that fixed printers.  When I had an opportunity to jump ship to another company to become strictly a DBA, I took it.

I’ve been here for 9 years now.  In those 9 years, I’ve been the DBA, a database developer, a data architect, and now a manager.  Although I’ve primarily worked with SQL Server, I’ve dabbled in mySQL, XML, and Visual Basic.  I’ve looked at backup jobs and maintenance plans, DTS and SSIS (and other solutions); my T-SQL is strong.  I used to think I was a specialist in database development, but I’ve come to realize that there are people who are a lot more specialized than I am.  I’ve also come to realize that data without context is meaningless.

I guess I’m growing up a little; when I was a kid counting trees, I didn’t really care about how efficient the system was.  I was just earning spending money at a job that was better than mowing the lawn.   When I was doing research, I realized how important my work was, but even then there was little that I could do to change long-term outcomes; I was just an analyst.  In IT, I could finally get to a point where I could take the numbers that I saw and do something about them.  I’m not just collecting data for the sake of it; I’m using that data to help make decisions that make my company more profitable (which ultimately leads to more money in my pocket). I collect, I analyze, I interpret, and I react.

So what does this little trip down my history mean in terms of the post topic? For me it means that I’ve got a niche, but it’s a pretty big one.  Most of my work is done with a certain tool, but the skills I’ve learned along the way don’t restrict me to that tool.  I’ve worked with people who get the specifics of the platform a lot better than I do, but miss the big picture; I’ve also worked with visionaries who get lost when I discuss normalization and efficiencies.   I’m neither a Jack of All Trades, nor a Master of just One.  I’m just a data guy, and I like it.

#TSQL2sDay Roundup

So the launch was early, and the write-up is delayed.  Time has no meaning….  Truthfully, I’m sorry I didn’t manage to squeeze this summary post in a bit sooner; I have the typical DBA excuse: too much to do, and too little time.

One nice thing about this topic is that it seemed to resonate with several bloggers who had either strayed away from T-SQL Tuesday or had never participated; hopefully, the bug to write will stick (and I’m pointing the finger at myself for this one as well).

Beginning at the Beginning:

Pinal Dave in his gentle teaching style covers several different questions and answers about JOIN techniques. Great way to review the basics and get conversations started with beginners.

KenJ’s entry is a bit esoteric, but it’s a great reminder of the resources available to learn more about SQL Server.

 

JOINs In The Real World:

Tracy McKibben demonstrated a very simple tuning tip based on real production experience: the Nested JOIN.

 

Rich Brown reminds us that good query construction often involves understanding how the optimizer handles JOINs under the covers.

Andy Galbraith (Yet Another SQL Andy!) reminds us to Respect Your JOINs. It’s funny how often we assume that database queries are logically well-written when we start looking for performance problems.

 

Bob Pusateri points out that WHERE clauses influence JOIN behaviors. You have to look at the entire SQL statement to make sure you’re getting what you’re supposed to be getting.

Matt Nelson reminds us that unnecessary JOINs are bad, and sometimes you need to dig a little deeper when performance tuning.

 

A Little Deeper:

Rob Farley toasted my brain a little on this post about Joins without JOINs.  It’s a well-written explanation of what an Anti-Semi JOIN is.

 

Brad Schulz finished the cooking job with a take on Semi-JOINs, among other things.  Is Brad really Rob in reverse?

Muthukkumaran Kaliyamoorthy covered the internal join mechanisms, and why the optimizer chooses merge, hash, or nested loop joins.

Robert Matthew Cook has a couple of great metaphors for explaining Merge, Hash, and Nested Loop JOINs.

 

Wayne Sheffield pointed out some interesting syntactical ways of writing JOINs. I’m not sure if I’m comfortable with some of them, but they may be of some use in certain edge scenarios. Or, if you just want to mess with the guy who reviews your code.

Richard Douglas offers up a short-but-sweet explanation of the relationship of Key Lookups to JOINs.

 

The Future is a Ticking Time Bomb…

Thomas Rushton points out that Denali may finally force us to clean up our code.

 

 

My Portuguese is Not Up To Par…

 
Finally, Ricardo Leka wrote a post that I had to use Google Translate to interpret.  I think I lost something in translation, because I’m not sure how the example relates to JOINs, but I thought I would include it anyway.  http://leka.com.br/2011/10/04/t-sql-tuesday-23-joins/

#TSQL2sDay T-SQL Tuesday 23–Early edition


Time once again for another edition of T-SQL Tuesday!  What’s that, you say?  Early?  Why yes, indeed.  Due to a potential schedule conflict with PASS’s Summit 2011, which falls on the second Tuesday of October (the normal date for T-SQL Tuesday), I’ve gotten special permission to bump it up a week.

To participate in this month’s T-SQL Tuesday, your post must go live between midnight UTC on Tuesday, October 4, 2011 and midnight UTC on Wednesday, October 5.

 

Your post needs to link back to this blog, and if you use the image, anchor it to this post.  Make sure you leave a comment or a trackback to this post (so I can find it later, when I write up the summary).

Topic du jour?  JOINs (I’m in a fundamentals mood lately).  Note that I also like creative and esoteric posts, so if you can find a way to apply SQL as a metaphorical language for community activity, I’ll read it and enjoy it.  If you just want to tell me in a simple fashion the difference between a HASH and MERGE join, I’m cool with that, too.

Do me a favor, though, and please spread the word about the early date.  Looking forward to reading your submissions.

#TSQL2sday: Emulating a FIRST aggregation


Jes Borland is hosting this month’s T-SQL Tuesday, and it’s all about aggregations.  Here’s an old coding trick of mine to emulate a FIRST aggregation in T-SQL.  Say we have a table that has three columns:

  • ID, a uniqueidentifier
  • Name, a varchar that represents something, and
  • DateStored, a datetime that is set when the row is written to the table

And we populate that table like so:

CREATE TABLE TSQL2sDay_FirstAgg
    (
      ID UNIQUEIDENTIFIER
    , NAME VARCHAR(20)
    , DateStored DATETIME DEFAULT GETUTCDATE()
    )

INSERT INTO TSQL2sDay_FirstAgg ( ID, NAME )
VALUES ( NEWID(), 'Peanut' )

WAITFOR DELAY '00:00:01'

INSERT INTO TSQL2sDay_FirstAgg ( ID, NAME )
VALUES ( NEWID(), 'Peanut' )

WAITFOR DELAY '00:00:01'

INSERT INTO TSQL2sDay_FirstAgg ( ID, NAME )
VALUES ( NEWID(), 'Orange' )

 

It’s easy to figure out the number of rows associated with each name:

-- SELECT data to verify order of DateStored
SELECT  ID
      , NAME
      , DateStored
FROM    TSQL2sDay_FirstAgg

-- Basic Row Count by Name
SELECT  NAME
      , RowCnt = COUNT(*)
FROM    TSQL2sDay_FirstAgg
GROUP BY NAME

 

but how do we figure out what the first ID was for each name along with the number of rows?  You could work something out using the HAVING clause of the SELECT statement, or you could do something like the following:

--SELECT first ID and count of rows by Name
SELECT  FirstID = CONVERT(UNIQUEIDENTIFIER,
            RIGHT(MIN(CONVERT(VARCHAR(24), DateStored, 121) + CONVERT(VARCHAR(36), ID)), 36))
      , NAME
      , RowCnt = COUNT(*)
FROM    TSQL2sDay_FirstAgg
GROUP BY NAME

 

It looks complicated, but it’s not; let’s step through it.

  1. We have to know some basic information about our data; in this case, we know that the datetime value associated with each row with a common name is different.  In other words, there are no two Peanuts with the same DateStored value.  This is important, because in order for there to be a first value, there must be some method of ALWAYS determining which one WAS first.  If two Peanuts showed up at the same time, the model is broken.
  2. The first thing we do is to CONVERT the DateStored value to a varchar; this allows us to concatenate it with other values.  The format of that varchar string is important; it must be precise, and it must sort in an ascending order.  The ODBC canonical format (with milliseconds) is a good candidate for this.
  3. We then CONVERT the uniqueidentifier to a varchar, and append it to the DateStored varchar value.  This gives us a lengthy string which sorts by its leading datetime portion.
  4. We find the MIN of the string we constructed; because the datetime portion leads the string and sorts chronologically, the MIN is the row with the earliest DateStored value.
  5. We then take the RIGHT-most 36 characters (the length of a uniqueidentifier), and convert it back to a uniqueidentifier (so that we have our type back).

There are probably better solutions for this, but this is a simple trick that works under certain circumstances and is portable to several flavors of SQL.
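
For example, on SQL Server 2005 and later, ROW_NUMBER() gets you the same answer against the table above; this is just a sketch (not necessarily faster, but arguably easier to read):

-- Alternative using ROW_NUMBER() (SQL Server 2005 and later)
SELECT  NAME
      , FirstID = ID
      , RowCnt
FROM    ( SELECT  ID
                , NAME
                , RowCnt = COUNT(*) OVER ( PARTITION BY NAME )
                , rn = ROW_NUMBER() OVER ( PARTITION BY NAME ORDER BY DateStored )
          FROM    TSQL2sDay_FirstAgg
        ) x
WHERE   rn = 1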

#TSQL2sDay: Resolutions

For this month’s T-SQL Tuesday, Jen McCown asks:

So tell us: what techie resolutions have you been pondering, and why?  Are you heading for a certification? An award? Are you looking to pick up CLR because that guy at the Summit said it’s “bitchin’”? Go crazy…

I’ve already covered a lot of my techie resolutions in this post, but here’s a recap, with some expanded thoughts:

  • I vow to learn something new every month.  I’ve already started on this one, but I need to keep working on it.   For example, I’m working on XML and XQuery this month; next month, I’m thinking SSIS.
  • I vow to be more involved in the technical community.  I’ve slipped out of tweeting (mostly because it’s blocked on our corporate network); I will do more.  I also want to read more blogs, as well as do a LOT more blogging myself.  For example, I plan to participate in every T-SQL Tuesday for 2011.  I also plan to present at least 6 times this year.
  • I will earn my MCITP: Database Developer certification this year.  Been meaning to do it; just haven’t invested the time to do so.

On a personal note, I want to tackle a few more technical projects that have been hovering over my head:

  • I want to do more with pictures and videos.  I have a nice digital camera, and a nice Flip video camera, but I don’t do squat with them.  I’m horrible about leaving them behind when I travel; I will use them as needed.
  • My fiancée is an iPod user (like 90% of the world); I am not (I have an Archos).  Merging our music into iTunes is not going to be fun (especially since I’ve never used it), but in the long run, it’ll be the right thing to do for us.
  • I want to work smarter, not harder, so I can play more.  There’s lots of little services out there (like Remember the Milk, Yodlee.com, Google calendars, etc) which will help me manage my life on the move (shuttling between my apartment, my fiancée’s house, and my office).

Short, sweet, but at least it’s submitted 🙂

#TSQL2sDay – My Least Favorite SQL Server Myth


It’s time again for another T-SQL Tuesday, hosted this month by Sankar Reddy; the topic is misconceptions in SQL Server.  It’s been a while since I wrote one of these (I usually forget about them until the following Wednesday), but this topic is a good one.  I’ve had many discussions with people about the following myth for a long time, so it’s nice to be able to put it to rest again.

The myth?  “You should always use stored procedures in your code, because SQL Server is optimized for stored procedure re-use.”  Don’t get me wrong; there are lots of arguments to use stored procedures (security, obfuscation, code isolation), but performance is not necessarily a good one.   This myth has been around for a long time (since SQL Server 7), and Binging “stored procedures SQL Server performance” yields such gems as the following:

SQL Server Performance Tuning for Stored Procedures

As you know, one of the biggest reasons to use stored procedures instead of ad-hoc queries is the performance gained by using them. The problem that is that SQL Server will only … www.sqlserverperformance.com/tips/stored_procedures_p2.aspx · Cached page

Increase SQL Server stored procedure performance with these tips

Database developers often use stored procedures to increase performance. Here are three tips to help you get the most from your SQL Server stored … articles.techrepublic.com.com/5100-10878_11-1045447.html · Cached page

 

The guts of this myth originate from the fact that prior to version 7 (released in 1998), SQL Server WOULD precompile stored procedures and save an execution plan for future reuse of that procedure, BUT THAT CHANGED AS OF VERSION 7.0.  Here’s a quote from Books Online (SQL Server 2000) that tries to explain what happened (emphasis added by me):

Stored Procedures and Execution Plans

In SQL Server version 6.5 and earlier, stored procedures were a way to partially precompile an execution plan. At the time the stored procedure was created, a partially compiled execution plan was stored in a system table. Executing a stored procedure was more efficient than executing an SQL statement because SQL Server did not have to compile an execution plan completely, it only had to finish optimizing the stored plan for the procedure. Also, the fully compiled execution plan for the stored procedure was retained in the SQL Server procedure cache, meaning that subsequent executions of the stored procedure could use the precompiled execution plan.

SQL Server 2000 and SQL Server version 7.0 incorporate a number of changes to statement processing that extend many of the performance benefits of stored procedures to all SQL statements. SQL Server 2000 and SQL Server 7.0 do not save a partially compiled plan for stored procedures when they are created. A stored procedure is compiled at execution time, like any other Transact-SQL statement. SQL Server 2000 and SQL Server 7.0 retain execution plans for all SQL statements in the procedure cache, not just stored procedure execution plans. The database engine uses an efficient algorithm for comparing new Transact-SQL statements with the Transact-SQL statements of existing execution plans. If the database engine determines that a new Transact-SQL statement matches the Transact-SQL statement of an existing execution plan, it reuses the plan. This reduces the relative performance benefit of precompiling stored procedures by extending execution plan reuse to all SQL statements.

SQL Server 2000 and SQL Server version 7.0 offer new alternatives for processing SQL statements. For more information, see Query Processor Architecture.

Note from the above quote that the query optimizer now uses execution plans for ALL T-SQL statements, not just stored procedures.  The perceived performance gain from stored procedures stems not from some magic use of CREATE PROC, but rather from plan re-use, which is available to ad-hoc queries as well.

So what promotes plan re-use?  The simplest answer is parameterization; SQL statements which use parameters efficiently (which includes many stored procedures) are more likely to reuse a plan.  Developers should focus on making the most out of parameters, rather than assuming that a stored procedure will be efficient simply because of some magical aspect of said procs.
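
As a quick illustration (the table and column here are hypothetical), compare an ad-hoc statement that embeds the literal value with one that passes it as a parameter; the first produces a different statement text for every customer, while the second keeps the text constant so the cached plan can be reused:

-- Ad-hoc with an embedded literal: a new statement text (and likely a new plan) per value
SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = 42

-- Parameterized: the statement text never changes, so the plan is reusable
EXEC sp_executesql
    N'SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = @CustomerID'
  , N'@CustomerID INT'
  , @CustomerID = 42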

A final thought: For a great starting place on understanding SQL Server plan reuse, see http://msdn.microsoft.com/en-us/library/ee343986(SQL.100).aspx, including the Appendix A: When Does SQL Server Not Auto-Parameterize Queries.  Also, this post by Zeeshan Hirani explains why LINQ to SQL query plans don’t get reused.

#TSQL2sDay 003: Maslow and relational design

Rob Farley is hosting the third installment of TSQL Tuesday, and it’s a fun one: relationships (in honor of Valentine’s Day).   While I’m not currently in much of a mood to opine on the virtues of love and databases, I did think I wanted to post something a bit more esoteric this time.  Not many of you may know that I don’t have a formal background in Information Technology (some of my more sarcastic friends just held their tongues at that one); I actually have a Master of Arts in Communication, and a Master’s of Education in Instructional Technology.  I tripped into IT when I failed my comprehensive exams for the doctoral program in Speech Communication at the University of Georgia.  Awful time, but ultimately one of the best things to ever happen to me.

Anyway, why is this relevant?  Because the goal of this post is to attempt to extend one of the more famous models of social psychology and communication to database design; bear with me (I’m assuming that many of you either have no background in social psych or slept through it), but I’m hoping that this extension to the metaphor will benefit you in terms of your application design.

Maslow: the crash course.

The following is a BRIEF introduction to the theory; if you want more details, Wikipedia is your friend. In a nutshell, Abraham Maslow proposed that humans, as social animals, are driven to fulfill certain basic needs in a quest for self-actualization or enlightenment.  He proposed a pyramid-shaped model of five sets (or stages) of these needs, with the four lowest needing to be satisfied before attempting the fifth; few people ever attain the fifth level, but the quest to reach it is part of our collective experience.  I’ve defined the five stages below:

Physiological:

The basic requirements for human existence; food, water, etc.

Safety:

This often translates into security, but it’s different than the term we use in information technology careers; safety is the ability to acquire and maintain goods for ongoing existence.  The Physiological elements are immediate needs; Safety elements are the ability to fulfill those immediate needs at a future date.

Social:

Where do we belong?  How do we interact with others who need us (and we need)?  What is our role, and how does that affect our definition of the self?

Esteem:

Esteem stems from the social need; once our relationship with others has been established, we can truly begin to define ourselves and the virtue of our importance in the world.

Self-Actualization:

Self-actualization is the ultimate fulfillment of one’s potential; to be what one is, without need for constant reinforcement from other beings, yet able to exist in harmony with purpose.  Few people have ever attained this stage, but as stated before, the quest to reach the top of the pyramid drives human development.

So what does this mean to the database designer?

Why is all of this important?  This is not a perfect analogy, but if we extend Maslow’s model to the area of database design, some interesting questions arise (particularly in the third and fourth stages, which is why I felt like this point would be relevant to the TSQL Tuesday challenge of relationships).  Let’s take each stage, and step through them again.

Physiological:

While applications don’t have physiological needs, they DO require certain basic elements for long-term survival.  Questions to consider at this stage are things like: How much space will I need?  What are the server requirements?  Can my database live in the cloud or on a mobile device?   What sort of I/O concerns do I have?

Safety:

Recall that safety is NOT security (in terms of who has access to the data), but it is security in terms of long-term survival of the application.  Is the database you’re designing intended for a long-term project, or is it “throw-away” code?  Have you designed it in such a way so that it’s easy to replace without impacting the dependent application?

Social:

Speaking of dependent applications (and herein lies the relationship aspect of this post), is your database application designed so that it is loosely related to and decoupled from the application?  Does the database fulfill the needed role within the relationship (data storage), without treading too far into business logic?  Can the database handle multiple relationships with various applications (UI/reporting/business services)?

Esteem:

Closely related to the social nature of the database within the application stack is the need for self-esteem within the database; can the database meet the needs of the dependent applications WHILE retaining enough information to establish new relationships?  A classic example of this is the lookup table; a database with low self-esteem will only store the enumerated values provided to it by some other application.

Without the enabling application, the database lacks sufficient internal definition to validate meaning; in practical terms, this means that the database is not decoupled from the application enough to enable the development of alternate accessing applications.  For example, my day job is to reverse engineer vendor databases; few things in the world are more disturbing than a table full of numbers without any sort of category associated with those numbers.  The application designer decided to store that enumeration in the application; security through obfuscation IS a method of securing your database, but not a very effective one.

A high-self esteem database will store all of the appropriate lookup values (complete with constraints) in order to provide complete validity within the structure.  The database can then be reused by several different applications, without requiring a complete set of business rules to determine those relationships.    The data layer is definitional; the business layer should be procedural.
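
A quick hypothetical sketch of the difference: with the lookup values (and the constraint) living in the database, any application that connects can interpret an OrderStatusID without borrowing the original application’s business logic.

CREATE TABLE OrderStatus
    (
      OrderStatusID INT PRIMARY KEY
    , StatusDescription VARCHAR(50) NOT NULL
    )

CREATE TABLE Orders
    (
      OrderID INT PRIMARY KEY
    , OrderStatusID INT NOT NULL
        REFERENCES OrderStatus ( OrderStatusID )   -- the definition travels with the data
    )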

Self-Actualization:

I have to admit that discussing self-actualization in regards to application design makes me think of HAL.  “I’m sorry, Dave….”

To try and stay on track with this metaphor, self-actualization is the basic premise of BI; when your database can start providing you knowledge instead of just data, it has attained the highest level of potential.  Few apps make it that far without requiring substantial redesign, but the ones that do are invaluable to the enterprise they support.

So where are we?

Dunno.  I hope this little exercise made your brain hurt just a bit, and opened up a new metaphor for understanding database design issues within the application stack.   If you have more questions than answers, that’s a good place to be.

#TSQL2sDay 001: Date/Time Issues and Data Modeling

This is my first post for T-SQL Tuesday (thanks, Adam, for the idea!); I’m hoping this will be a continuing block party, and hopefully connect a lot of us SQL bloggers together.  There’s a lot of great content out there, and sometimes it’s just tough to find.

Just a few notes before I get started on this; for this particular post, I’m pulling from my experiences as a data architect for the last few years.  I spend a lot of time reverse-engineering other databases in order to centralize and analyze data from a variety of sources.  Most of my efforts pre-date SQL 2008, so I don’t have a lot of experience with the newer datetime data types introduced in that version; however, I’ll point you in that direction if I think it might address an issue that I raise.  Also, there’s no particular order to my list of notes; think of it as a random collection of date-time issues.  Some of them may be specific to T-SQL and SQL Server, and others may just be procedural notes about data integration (and thus may be applicable to other platforms).

Ready?  Let’s get started.

1.  If you are a database professional that works with SQL Server, strike the word “timestamp” from your vocabulary.

Way back when SQL Server was in its infancy, some developer decided that timestamp would be a good name for a binary column that changes whenever the row is updated.  I don’t know the origins of the timestamp datatype; all I know is that it has nothing to do with date or time.  So why am I including it in a post about dates and times?  Because for most speakers of the English language, a timestamp is a marker that indicates when an event happened.  Many junior SQL Server developers and even senior developers on other platforms will often be confused by the timestamp datatype, and wonder what’s going on with that column they just added.

My recommendation?  Avoid using the word timestamp in design discussions.   When someone mentions “we need to add a timestamp to that row”, make sure that you understand what they mean.  Usually they want you to add a column that captures the datetime at which the event happened, NOT a marker that changes whenever the row is updated.  Note that I’m suggesting that you avoid the WORD timestamp; obviously, if a database requires a timestamp column, use it.
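
A small hypothetical illustration of why the word causes trouble; the first column below uses the timestamp datatype (ROWVERSION is its newer synonym), which is just a binary row-change marker, while the second is what most people actually mean when they ask for a timestamp:

CREATE TABLE AccountChange
    (
      AccountChangeID INT IDENTITY PRIMARY KEY
    , RowVer ROWVERSION                            -- changes every time the row is updated; not a date
    , DateOfEvent DATETIME DEFAULT GETUTCDATE()    -- when the event actually happened
    )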

2.  Does anybody really know what time it is?

One of the benefits of a distributed architecture is that your database server doesn’t have to live on the same box as your application server; in fact, your database server may not be in the same timezone as your application server.  And, if you’re using a webserver as an application host, your user may not be in the same timezone as either your application or your database.

I mention this because as a database designer, you may need to capture multiple points-in-time (or you may not); you cannot assume that the application is consistently providing the same time reference for all users.  If you’re using a truly distributed design, you may have one host in New York and one in Los Angeles, and your users may connect to either one to enter data; what date do you use?

For this reason, I usually advocate two things: 1) always include a database-centric referential date of event (in other words, include a column with a default value of the database time for important data), and 2) always use UTC for saving time.  That second statement’s important enough to warrant a note of its own.

3.  When in doubt, use UTC.

Time is relative to geography; the application you build today in New York may eventually have clients in New Delhi.  If your processes could ever have the potential of crossing multiple time zones (for example, you want to track shipping from Atlanta to Oregon), you’d get the most benefit out of storing your data as UTC in order to understand the relationship between event A (which occurred in EST) and event B (in PST).  It’s a lot easier to change the display of events to a local time from UTC than it is to convert from local time to UTC, do the math, and then convert back.
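
Here’s a minimal sketch of both recommendations (the table and columns are hypothetical): capture the event time in UTC via a default, and convert to local time only when you display it.

CREATE TABLE OrderEvent
    (
      OrderEventID INT IDENTITY PRIMARY KEY
    , EventDescription VARCHAR(100)
    , DateRecorded DATETIME NOT NULL DEFAULT GETUTCDATE()   -- database-centric, stored as UTC
    )

-- Display in the server's local time; note that this applies the server's *current*
-- offset, so it ignores daylight saving changes for older rows.
SELECT  OrderEventID
      , EventDescription
      , LocalTime = DATEADD(minute, DATEDIFF(minute, GETUTCDATE(), GETDATE()), DateRecorded)
FROM    OrderEvent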

One problem I’ve encountered with recording data in UTC is that if you have centralized data but distributed users who want to see their data in local time, building OLAP cubes is a pain.  If you’re analyzing data for Atlanta and data for Oregon, Atlanta’s day begins 3 hours earlier than Oregon’s.  I have not found an easy solution to this yet, but am starting to investigate the new datetimeoffset data type in SQL 2008.
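
For what it’s worth, here’s a tiny sketch of what datetimeoffset buys you (the offsets are hard-coded and ignore daylight saving, so treat it as illustration only): the offset travels with the value, so the same instant can be presented in each office’s local time.

DECLARE @evt DATETIMEOFFSET = SYSDATETIMEOFFSET()

SELECT  AtlantaTime = SWITCHOFFSET(@evt, '-05:00')
      , OregonTime  = SWITCHOFFSET(@evt, '-08:00')
      , UtcTime     = SWITCHOFFSET(@evt, '+00:00')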

4.  Use a standard notation for calculating differences between dates.

Obviously the new date and time datatypes in SQL 2008 address a desire of many in the development community for a method of storing ONLY the date or time of an event; but if you’re using an application that still stores the datetime of an event and you need to round the data to the nearest day, hour, week, or whatever unit of time, then use a standard method for doing so.

(I am assuming that you are using date-based arithmetic rather than converting to a varchar and back).

SELECT DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0)

is the preferred method; you can read why at my previous post.
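
The same pattern extends to other units; just swap the datepart (a quick illustrative sketch):

SELECT  StartOfDay   = DATEADD(day,   DATEDIFF(day,   0, GETDATE()), 0)
      , StartOfHour  = DATEADD(hour,  DATEDIFF(hour,  0, GETDATE()), 0)
      , StartOfMonth = DATEADD(month, DATEDIFF(month, 0, GETDATE()), 0)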

5.  Use a standard notation for referring to dates.

Americans are weird about dates; we’re one of the few countries that puts the middle container before the smallest container followed by the largest container.  Clueless?  When people ask you what’s your birthdate, how do you answer?  Most people rattle off Month-Day-Year.   An international notation would be Year, Month, Day.

Again, if you’re designing databases and queries that might cross international borders, you’ll save yourself some debugging time in the future if you can get into the habit of using an international standard notation for referring to datetimes.   A query that asks for data WHERE DateOfEvent > ‘11/1/2009’ may work fine in the US (returning data newer than November 1, 2009), but may blow up in the UK (returning data greater than January 11, 2009).
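
The unseparated ISO format is one notation that SQL Server interprets the same way regardless of language or DATEFORMAT settings; for example (the table here is hypothetical), this means November 1, 2009 in Atlanta and in London alike:

SELECT  *
FROM    dbo.SomeAuditTable
WHERE   DateOfEvent > '20091101'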

How do you know if you’re a DBA that’s been bitten by this?  You start dating everything (including personal checks) using a notation like ‘2009-11-01’.  I  get odd looks all the time (probably not just because of the way I date my checks).

6.  Finally, get used to the fact that there are lots of methods to store datetimes.

If you’re doing any sort of database integration, you’re probably familiar with many of the same issues I laid out.  There are others that I haven’t covered, but the important thing I want to focus on is that just because I think these are good habits to get into, others have different opinions.  Some db pro’s prefer to use epoch time rather than a standard datetime; I not only have to be familiar with my own methods of doing something (and prepared to defend those methods), but I also have to be willing to understand other methodologies so that I can do the conversion.