
Three Myths about Agile Development

I recently attended Microsoft Tech Ed in Atlanta, and while there wasn’t much new being announced about SQL Server (I had heard about many of the features for Denali at PASS Summit 2010), I did find myself drawn to several sessions regarding Agile principles and development.  My shop has been using the Scrum method for about 2 years now, and it was nice to have a refresher.  I also participated in (and overheard) a lot of conversations about Agile methods, and it made me realize two very important things:

  1. Many people who claimed to be using Agile methods had never read the Agile Manifesto, and
  2. There are several misconceptions in play regarding Agile development.

The point of this blog post is twofold; first, I want to encourage you to read the Agile Manifesto.  If you’ve read it before, read it again.  And then, read it a third time (it’s short, so easy to read).  Done that?  Good, because here’s the crux of my argument:

If you want to do Agile development, you must adhere to the principles of the Agile Manifesto.

It’s simple, really; you shouldn’t claim to be a SQL Server developer if you’ve never written a T-SQL statement.  You can’t call yourself a cubist if you haven’t studied the works of Picasso.  You shouldn’t claim to be doing Agile development if you don’t adhere to the principles of the Agile Manifesto.

And that leads us to the second part of this post: I believe that lots of us think we’re adhering to the methods and principles of Agile development, but there are at least three basic myths about Agile development which keep development teams from being as agile as they can be.  Here’s my take on them:

Myth 1: Daily meetings with business people are an impediment to rapid development.

I actually got into a fervent discussion with a gentleman at TechEd about this subject during a Birds of a Feather session on Scrum.  He claimed to be a Scrum Master for 6 teams (including several overseas), and said that he barred business people from the daily standup in order to keep them from dragging the meeting astray.  I think that’s wrong, and here’s why (from the Agile Manifesto principles):

Business people and developers must work together daily throughout the project.

While it’s true that the daily standup in Scrum need not be the daily interaction, it makes sense for business people to LISTEN (but not INTERACT) in that meeting in order to understand which issues the development team is working on, and how those issues interplay with each other.  (Note: Scrum calls this the chicken and the pig; business people need to know what’s going on across the development team, but shouldn’t be involved at this point.  However, the daily standup can spur additional conversations.)  If your development team chooses to have a daily standup without business people, your team members MUST interact with business people at some other point in order to handle changing requirements; they must also communicate at that time what the priorities of the development organization are, and why this particular project is not progressing because some other project takes priority.

Agile development depends on the interaction between developers and business people; isolating one half of the team from the other disrupts the process.  That leads us to our second myth:

Myth 2: Your development team can be agile in a vacuum.

I call this the Agile-Waterfall mindset; your business organization is separate from your development team.  Your developers are practicing some form of Agile development, but the organization is used to handing off a set of requirements to the developers, and then having them return a product at periodic intervals.  Think of this as the complement to Myth 1; Business people aren’t deemed to be an impediment, but the organization hasn’t endorsed agile development throughout.  Daily meetings with developers aren’t deemed to be a priority by the business people; the organization has developed a culture of handing off responsibilities, and expecting them to be fulfilled without daily guidance.

By definition, you cannot have an agile team without input from both developers and business people.  If you want to respond to changing requirements (as frustrating as that can be to developers), you must have input from business people as soon as those requirements change.  Again, you need to handle prioritization, as changing requirements do not necessarily merit immediate priority.

Myth 3: Self-organizing teams self-manage efficiently.

A couple of great principles from the Agile Manifesto deal with communication:

The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.

The best architectures, requirements, and designs emerge from self-organizing teams.

While I believe in the wisdom of these two principles, I don’t want to de-emphasize the need for good, basic software design principles.  Most enterprise development consists of intertwined projects and resources; in order to minimize maintenance issues, adherence to consistent programming standards is a must.  Developers have different naming standards, procedural methodology, and architectural perspectives; a good team has a playbook that ALL members of the development team (regardless of what project team they serve) follow. If you have one database developer that makes heavy use of schemas, and another one that doesn’t, maintaining each other’s code requires some additional effort on their parts.  Furthermore, when teams are self-formed of roughly equally-experienced developers, resolution of architectural decisions can be difficult.

Development teams need an enforcer; a good manager goes a long way toward resolving interpersonal conflicts before they get started.  Just because teams communicate well (and good communication includes conflict), it does not necessarily mean that those same teams will develop quality code in an efficient manner.  Good teams need good direction.

Summing Up.

If you’ve made it this far, I hope I’ve given you some food for thought, as well as encouraged you to go back and revisit the Agile Manifesto, as well as your own organizational processes.  Let me sum up with a final thought from the Agile Manifesto:

At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.

#msteched Columnstore indexes unveiled–DBI312

Live blogging again; hope you find my notes useful (scattered though they are).  I’ve been waiting on this session because it’s a very specific area of interest.  I work a lot with VLDBs, and performance is always a concern; the claim is that Denali’s columnstore may boost the performance of certain queries a hundredfold.  Let’s see how they work; I’m hoping I can convince my boss to set up a test bed to try this out.

Presenter is Eric N Hanson from Microsoft (Twitter). 

We start off with a story; I like story-time.  Actually, it’s a very effective way to break out use cases.

Buzzphrase for Columnstore: “Enabling interaction with data”.  Supposed to be super efficient, and get large amounts of data back from SQL Server Denali.  Internal project name is Apollo; columnstore is only part of the picture.

Area of focus is BI & DW: load large amounts of data, high-read, incremental loads.  Partitioning is mandatory for this feature.

Curious as to why the examples join tables in the WHERE clause, and not the more accepted syntax of JOIN.

K, here comes the magic: example uses a Fact Table with 100 million rows in it.  Clustered on a date column, and a columnstore index.  Clustered index is still B-TREE; columnstore indexes are nonclustered.
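
For reference, the DDL for this kind of index is pretty simple; here’s a rough sketch using my own placeholder table and column names (not the presenter’s demo schema):

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactSales
ON dbo.FactSales ( DateKey, StoreKey, ProductKey, Quantity, SalesAmount )
-- the base table keeps its B-TREE clustered index (e.g., on DateKey); the columnstore
-- index is a separate nonclustered structure covering the columns you expect to scan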

Running duplicate queries; using an index hint to force the optimizer to use the clustered index in one example.  Wow; 100,000,000 rows of data aggregated in a second on a two-year-old laptop.  50x speedup on this particular hardware.  According to the presenter: “this is the biggest enhancement to SQL Server since we bought the code from Sybase.”

And here’s the meat and potatoes; how does this work?  Vertical partitioning stores each column in a separate page.  Columnstore is based on the same code as PowerPivot and the BI engine;  Vertipaq if you want to do more reading on this.  Columnstore data is highly compressed, so smaller footprint to read from disk and can be stored in main memory.

New query execution plan: batch processing.  “the edsel is the way of the future”.  Actually, the idea is that batches of vectors are stored in query plan; highly efficient data representation.  We can also scale to more cores: tests are showing linear acceleration up to 32 cores.

Instead of storing data as a page, data is stored as a column segment which represents about 1,000,000 rows.

Questions have begun; some questions are good, but this is a 300 level session, folks.  If you don’t understand basic SQL syntax (like how to create an index), this may not be the session for you.  Great question about the relevance of traditional indexes after this is unveiled, and Hanson’s response: in most Decision Support Applications, columnstore is the way to go particularly for scans.

Some index hints for choosing the columnstore or ignoring it:

WITH (index(index_name))

OPTION (ignore_nonclustered_columnstore_index) <—use for bad plan selection if necessary.

Same traditional rules for index hints: trust the optimizer first, rewrite second, and then use hints last.
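
In context, those two hints look something like the following sketch (table, index, and column names are my own placeholders, not from the session):

-- force a particular index (the columnstore, in this case)
SELECT  DateKey, SUM(SalesAmount)
FROM    dbo.FactSales WITH ( INDEX ( ncci_FactSales ) )
GROUP BY DateKey

-- tell the optimizer to skip the columnstore index if it picked a bad plan
SELECT  DateKey, SUM(SalesAmount)
FROM    dbo.FactSales
GROUP BY DateKey
OPTION  ( IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX )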

A couple of new icons for query execution plans: columnstore scan, and batch hash table processing.  Each execution operator now operates in either batch mode or row mode; batch mode is what you want for speed. 

New term of interest: dictionary.  A dictionary is storage for unique values with a lookup so that a column can store highly compressed information.

Most things just work with the rest of SQL Server: Backup and Restore, Mirroring, SSMS, etc.

Lots of datatypes don’t work with columnstore: long decimals, binary, BLOB, uniqueidentifier, long datetimes, CLR, (n)varchar(max).

Query performance restrictions: outer joins and UNIONs; stick with inner joins, star joins (need to look this one up), and aggregation.  About to show a query which doesn’t benefit from batch processing.  Essence is below:

SELECT t.ID, COUNT(t2.ID)
FROM t LEFT JOIN t2 ON t.ID=t2.ID
GROUP BY t.ID

The LEFT JOIN knocks it out of batch processing; you need to rewrite it as an INNER JOIN, but note that you then lose the NULL values, so you have to use a CTE (need to get the slides for his sample): do the INNER JOIN in the CTE, and then do an OUTER JOIN against it.

-- my reconstruction of that pattern (not the presenter's exact code):
WITH cte AS ( SELECT t.ID, Cnt = COUNT(t2.ID)
              FROM t INNER JOIN t2 ON t.ID = t2.ID
              GROUP BY t.ID )
SELECT t.ID, Cnt = ISNULL(cte.Cnt, 0)
FROM t LEFT OUTER JOIN cte ON t.ID = cte.ID

Adding data to columnstore; basic methods:

1.  Drop and re-add the index before the load.  Expensive, but works well with traditional daily builds.

2.  Partition switching.  The sweet spot needs to be tested, but an easy one is the hour.  NOLOCK queries pre-empt the ability to do partitioned queries; need to read up on this, but it may be fixed in a future version.  (A rough sketch of methods 1 and 2 follows below.)

3.  Trickle load can be done, but needs to be tested.
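
Here’s a rough sketch of what methods 1 and 2 might look like; all object names are placeholders of mine, not the presenter’s demo code:

-- Method 1: drop the columnstore index, load the data, then rebuild the index
DROP INDEX ncci_FactSales ON dbo.FactSales
INSERT INTO dbo.FactSales SELECT * FROM staging.FactSales_Load
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactSales
    ON dbo.FactSales ( DateKey, StoreKey, ProductKey, Quantity, SalesAmount )

-- Method 2: build the new rows (with a matching columnstore index) in a staging table,
-- then switch the staging table into an empty partition of the fact table
ALTER TABLE staging.FactSales_Load SWITCH TO dbo.FactSales PARTITION 42 -- 42 = the empty target partition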

Very awesome; I cannot wait until this is actually released in CTP 3, so I can play around with it.

#TSQL2sday: Emulating a FIRST aggregation


Jes Borland is hosting this month’s T-SQL Tuesday, and it’s all about aggregations.  Here’s an old coding trick of mine to emulate a FIRST aggregation in T-SQL.  Say we have a table that has three columns:

  • ID, a uniqueidentifier
  • Name, a varchar that represents something, and
  • DateStored, a datetime that is set when the row is written to the table

And we populate that table like so:

CREATE TABLE TSQL2sDay_FirstAgg
    (
      ID UNIQUEIDENTIFIER
    , NAME VARCHAR(20)
    , DateStored DATETIME DEFAULT GETUTCDATE()
    )

INSERT  INTO TSQL2sDay_FirstAgg ( ID, NAME )
VALUES  ( NEWID(), 'Peanut' )

WAITFOR DELAY '00:00:01'

INSERT  INTO TSQL2sDay_FirstAgg ( ID, NAME )
VALUES  ( NEWID(), 'Peanut' )

WAITFOR DELAY '00:00:01'

INSERT  INTO TSQL2sDay_FirstAgg ( ID, NAME )
VALUES  ( NEWID(), 'Orange' )

It’s easy to figure out the number of rows associated with each name:

-- SELECT data to verify order of DateStored
SELECT  ID
      , NAME
      , DateStored
FROM    TSQL2sDay_FirstAgg

-- Basic Row Count by Name
SELECT  NAME
      , RowCnt = COUNT(*)
FROM    TSQL2sDay_FirstAgg
GROUP BY NAME

but how do we figure out what the first ID was for each name along with the number of rows?  You could work something out using the HAVING clause of the SELECT statement, or you could do something like the following:

-- SELECT first ID and count of rows by Name
SELECT  FirstID = CONVERT(UNIQUEIDENTIFIER,
                    RIGHT(MIN(CONVERT(VARCHAR(24), DateStored, 121)
                              + CONVERT(VARCHAR(36), ID)), 36))
      , NAME
      , RowCnt = COUNT(*)
FROM    TSQL2sDay_FirstAgg
GROUP BY NAME

It looks complicated, but it’s not; let’s step through it.

  1. We have to know some basic information about our data; in this case, we know that the datetime value associated with each row with a common name is different.  In other words, there are no two Peanuts with the same DateStored value.  This is important, because in order for there to be a first value, there must be some method of ALWAYS determining which one WAS first.  If two Peanuts showed up at the same time, the model is broken.
  2. The first thing we do is to CONVERT the DateStored value to a varchar; this allows us to concatenate it with other values.  The format of that varchar string is important; it must be precise, and it must sort in an ascending order.  The ODBC canonical format (with milliseconds) is a good candidate for this.
  3. We then CONVERT the uniqueidentifer to a varchar, and append it to the DateStored varchar value.  This gives us a lengthy string which can be sorted by the first 24 characters.
  4. We find the MIN of the string we constructed; because the DateStored portion leads the string and sorts ascending, the MIN string corresponds to the earliest DateStored value.
  5. We then take the RIGHT-most 36 characters (the length of a uniqueidentifier), and convert it back to a uniqueidentifier (so that we have our type back).

There are probably better solutions for this, but this is a simple trick that works under certain circumstances and is portable to several flavors of SQL.
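
For comparison, one of those “probably better” solutions on SQL Server 2005 and later is a windowed ROW_NUMBER(); here’s a quick sketch against the same table (less portable, but easier to read):

-- first ID per name via ROW_NUMBER(); ties on DateStored would still need a tie-breaker
;WITH Ordered AS ( SELECT  ID, NAME
                         , rn = ROW_NUMBER() OVER ( PARTITION BY NAME ORDER BY DateStored )
                         , RowCnt = COUNT(*) OVER ( PARTITION BY NAME )
                   FROM    TSQL2sDay_FirstAgg )
SELECT  FirstID = ID, NAME, RowCnt
FROM    Ordered
WHERE   rn = 1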

A simple codebuilder for parsing in T-SQL

If you’ve ever tried to parse a wide character column in T-SQL, you know two things:

  1. It’s a pain to do, and
  2. It’s a pain to do.

A lot of the data I deal with comes in syslog format, which can come in one of two formats: positional (the location of the data element is related to the type of data), and named attributes (which usually only include delimiters for complex strings).  Although I haven’t had much luck automating positional parsing, I’ve recently begun using Excel to help me with the named attributes. 

Here’s an example; I have a table with a message column that is pulling over syslog data from a firewall.  In a given day, I may have millions of rows like the following:

sn=AA17D5028EAA time="2011-01-26 13:40:14 UTC" fw=10.1.100.1 pri=1 c=512 m=522 msg="Malformed or unhandled IP packet dropped" n=1 src=10.1.1.23:32795:X1: dst=10.1.1.1:514:: proto=udp/17

Note that each attribute of this particular syslog message is identified with an attribute name (e.g., sn, time, fw, etc.).  In order to break out each of the elements in T-SQL, we can split the string using a combination of SUBSTRING and CHARINDEX, like so:

SELECT TOP 1
        m = CONVERT(INT, SUBSTRING(MESSAGE, CHARINDEX(' m=', MESSAGE) + 3,
                CHARINDEX(' ', MESSAGE, CHARINDEX(' m=', MESSAGE) + 3)
                - ( CHARINDEX(' m=', MESSAGE) + 3 )))
      , time = CONVERT(DATETIME, SUBSTRING(MESSAGE, CHARINDEX(' time="', MESSAGE) + 7,
                CHARINDEX('UTC"', MESSAGE, CHARINDEX(' time="', MESSAGE) + 7)
                - ( CHARINDEX(' time="', MESSAGE) + 7 )))
      , fw = CONVERT(VARCHAR(20), SUBSTRING(MESSAGE, CHARINDEX(' fw=', MESSAGE) + 4,
                CHARINDEX(' ', MESSAGE, CHARINDEX(' fw=', MESSAGE) + 4)
                - ( CHARINDEX(' fw=', MESSAGE) + 4 )))
FROM    syslogng (NOLOCK)

Note the repetition for each column: you need to find the position of a starting delimiter, the position of an ending delimiter, and supply to the SUBSTRING function the position of the starting delimiter and the difference between the two.  You also need to account for the length of the starting identifier, and then CONVERT the result to a specific data type.  Whee!

It gets even more fun when the attributes are optional; some syslog messages may have a proto code, and some may not.   When faced with this, you need to include a CASE option, like so:

SELECT TOP 1
        proto = CONVERT(VARCHAR(20),
            CASE WHEN CHARINDEX(' proto=', MESSAGE) = 0 THEN NULL
                 ELSE SUBSTRING(MESSAGE, CHARINDEX(' proto=', MESSAGE) + 7,
                        CHARINDEX(' ', MESSAGE, CHARINDEX(' proto=', MESSAGE) + 7)
                        - ( CHARINDEX(' proto=', MESSAGE) + 7 ))
            END)
FROM    syslogng (NOLOCK)

 

One of our developers is working on a syslog parser in .NET code, but I needed a proof-of-concept, and I didn’t want to keep cutting and pasting to see if it was working.  Looking at the parsing, it’s very formulaic SQL.  When I think formulas, I think Excel, and so I whipped out the following:

[Screenshot: an Excel worksheet with input columns start, end, colname, type, and optional, plus a generated SQL column]

Note that I have several input columns:

  • start, the starting delimiter
  • end, the ending delimiter (usually a space)
  • colname, the column name I want to use; usually the same as start, but stripped of extra characters.
  • type, the SQL type I want to convert the data to, and
  • optional, a column to decide if the attribute is optional per row or not.

I also have a hidden column (column F), which generates most of the SQL code:

=CONCATENATE("SUBSTRING(message, CHARINDEX('", A2, "', message)+ ", LEN(A2), ", CHARINDEX('", B2, "', message, CHARINDEX('", A2, "', message)+", LEN(A2), ") - (CHARINDEX('", A2, "', message)+", LEN(A2), "))")

This takes the starting and ending delimiters, the length of the starting delimiter, and plugs those values into a valid SQL statement.  I then create a SQL column, using the following formula:

=CONCATENATE(", ", C2, "CONVERT(", D2, ", ", IF(E2="Y", CONCATENATE("CASE WHEN CHARINDEX('", A2, "', message) = 0 THEN NULL ELSE ", F2, " END"), F2), ")")

If I were better at Excel, I’d use named ranges, but for my purposes, this is OK.   I append a column to the beginning, specify the type, and include a CASE statement based on whether or not my optional column includes a “Y”.

It took me longer to write this blog post than it did to generate a proof-of-concept, parsing each of the named attributes out from a syslog message.

Something new for 2011: XML & XSD, part 2

I’m continuing my study of XML and XSD’s for January, and I realize that I ended my last post a bit abruptly.  I explained that I can cast an XML datatype to a SQL Server datatype, without giving a lot of background on WHY that’s important.  

Understanding Types.

Without going into too much detail about type, the basic reason for specifying a type for data transformations is validity; if you are expecting integer data, and the XML provides a string, then the basic contract is broken.  An XSD defines a type of data expected, and if some other type is provided, the XML is invalid.

For example, run the following code:

IF NOT EXISTS ( SELECT  *
                FROM    sys.xml_schema_collections xsc
                WHERE   name = 'MismatchDataType' )
    CREATE XML SCHEMA COLLECTION MismatchDataType AS
        '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
            <xsd:element name="IntValue" type="xsd:integer"/>
         </xsd:schema>'
GO

DECLARE @x XML(MismatchDataType)
SET @x = '<IntValue>100</IntValue>'
--SET @x = '<IntValue>String</IntValue>'

DROP XML SCHEMA COLLECTION MismatchDataType
GO

It runs fine, but if you uncomment the second SET statement (where a string value is specified), you get the following error:

Msg 6926, Level 16, State 1, Line 4
XML Validation: Invalid simple type value: ‘String’. Location: /*:IntValue[1]

What’s important to remember is that once you specify a type for an element, you may only cast that XML type to a matching SQL Server type (i.e., integer to integer, string to (n)varchar, etc.) when using the XQuery methods in SQL Server (.value(), etc.).  This is easily debugged by a seasoned database professional; if the XML type is string and you store a value of 100, you can easily convert that to either an integer or varchar value:

SELECT @x.value('IntValue[1]', 'integer'), @x.value('IntValue[1]', 'varchar(3)')

 

If you don’t specify a type, SQL Server can make certain assumptions regarding type conversion; however, typing your XML is one of those basic “good habits” that is foundational to application design.  Knowing what to expect from your data, regardless of whether or not it’s stored in XML or a database makes troubleshooting a lot easier in the future.

Complex vs. Simple Types

The examples I’ve used so far all rely on what is known as a simple type in XML; a simple type contains no sub-elements or attributes.  A complex element can  contain either sub-elements or attributes.  An XSD collection is especially useful when defining complex elements; the XSD allows database professionals to enforce validity in the shape of their XML, including which elements are required or not.

Most of the examples I’ve used so far have been simple elements, but a complex element enforced via an XSD would look something like  (apologies for the formatting):

CREATE XML SCHEMA COLLECTION XMLSample AS
'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="Parent">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="Child" type="xsd:string"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>'
GO

 

In essence, a complex type is the heart of a strongly-typed XML document;  one of the major benefits of XML is the ability to encapsulate hierarchical data, and a complex type enforces the relationship between the elements (and attributes) encapsulated in that hierarchy much like foreign keys do for a relational database.   The presence or absence of elements in the data when compared to the XSD validate the nature of the dataset.
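
To see that enforcement in action, here’s a small sketch using the XMLSample collection above (the test values are mine):

DECLARE @p XML(XMLSample)
SET @p = '<Parent><Child>Hello</Child></Parent>'    -- valid: matches the declared shape
--SET @p = '<Parent><Uncle>Hello</Uncle></Parent>'  -- fails validation: Uncle isn't declared in the schema
SELECT @p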

A stopping point…

Unfortunately for you, I need to stop at this point.  I promised myself to learn something new every month, and I feel like I have.  However, there’s so much more to learn about this topic, and I’ve simply run out of time.  I debated about spending a few more weeks on this, but then realized that I need to move on (I can always return to it in a few months) in order to stay energized about learning something new.  When I do return to this topic, I’ll be sure to post a few summary links to keep everything related.

#TSQL2sDay: Resolutions

For this month’s T-SQL Tuesday, Jen McCown asks:

So tell us: what techie resolutions have you been pondering, and why?  Are you heading for a certification? An award? Are you looking to pick up CLR because that guy at the Summit said it’s “bitchin’”? Go crazy…

I’ve already covered a lot of my techie resolutions in this post, but here’s a recap, with some expanded thoughts:

  • I vow to learn something new every month.  I’ve already started on this one, but I need to keep working on it.   For example, I’m working on XML and XQuery this month; next month, I’m thinking SSIS.
  • I vow to be more involved in the technical community.  I’ve slipped out of tweeting (mostly because it’s blocked on our corporate network); I will do more.  I also want to read more blogs, as well as do a LOT more blogging myself.  For example, I plan to participate in every T-SQL Tuesday for 2011.  I also plan to present at least 6 times this year.
  • I will earn my MCITP: Database Developer certification this year.  Been meaning to do it; just haven’t invested the time to do so.

On a personal note, I want to tackle a few more technical projects that have been hovering over my head:

  • I want to do more with pictures and videos.  I have a nice digital camera, and a nice Flip video camera, but I don’t do squat with them.  I’m horrible about leaving them behind when I travel; I will use them as needed.
  • My fiancée is an iPod user (like 90% of the world); I am not (I have an Archos).  Merging our music into iTunes is not going to be fun (especially since I’ve never used it), but in the long run, it’ll be the right thing to do for us.
  • I want to work smarter, not harder, so I can play more.  There’s lots of little services out there (like Remember the Milk, Yodlee.com, Google calendars, etc) which will help me manage my life on the move (shuttling between my apartment, my fiancée’s house, and my office).

Short, sweet, but at least it’s submitted 🙂

Something new for 2011: XML and XSD

As part of my New Year’s resolution for 2011, I vowed to do a deep-dive on something technical every month; for January, I’m focusing on XML.  I’ve been using XML and XQuery in SQL Server for a while now (even presenting on it), but I still don’t consider myself an expert in the area.  For example, I use a lot of untyped XML to transfer data between databases; I’ve never really tackled XSD (XML Schema Definition Language), and now’s the time.  I’m reading The Art of XSD by Jacob Sebastian to help get me started.

What’s XSD?  In a nutshell, it’s an XML document which validates the structure of another XML document.  From the perspective of a database developer, an XSD document describes how data should look in a dataset; if the data doesn’t match the description (i.e, if a table is missing a column), that dataset is invalid.  The XSD document can be very precise, or it can offer options for the dataset, but in either case, the point of an XSD is to document the expectations about the dataset.  XML without XSD is untyped; XML with an XSD is typed (although XSD’s do more than just provide information about the data types contained within the XML).

Let’s take a look at an untyped XML statement:

DECLARE @NoXSD XML
SET @NoXSD = '<Test1>Hello World!</Test1>'
SELECT @NoXSD

 

Simple and straightforward; I created an XML variable, and populated it with an XML fragment.  I then pulled the data out of that fragment.  In this example, we have an element named Test1; what happens if we have a typo when we populate the variable?

SET @NoXSD = '<Test2>Hello World!</Test2>'
SELECT @NoXSD

 

Nothing happens.  It’s a well-formed XML fragment (no root tag, but it does have starting and ending tags); the XML engine in SQL Server doesn’t know that our fragment is supposed to have an element named Test1, so it accepts the fragment as valid.  This is where an XSD comes in:

IF EXISTS( SELECT * FROM sys.xml_schema_collections WHERE name = 'TestSchema' )
DROP XML SCHEMA  COLLECTION TestSchema
GO

CREATE XML SCHEMA COLLECTION TestSchema AS
'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Test1" />
</xsd:schema>'
GO

DECLARE @XSD XML ( TestSchema ) --use the schema to validate (type) the xml
SET @XSD = '<Test1>Hello, World!</Test1>'

SELECT @XSD

 

Since the XML fragment matches the XSD,  the assignment of data works; what happens when we assign a fragment that doesn’t match?

SET @XSD = '<Test2>Hello, World!</Test2>'

We get a big fat error message:

XML Validation: Declaration not found for element ‘Test2’. Location: /*:Test2[1]

Straightforward, right?  But now what?  Well, let’s type the data in our schema:

IF EXISTS( SELECT * FROM sys.xml_schema_collections WHERE name = 'TestSchema' )
DROP XML SCHEMA  COLLECTION TestSchema
GO

CREATE XML SCHEMA COLLECTION TestSchema AS
'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Test1" type="xsd:string"/>
</xsd:schema>'
GO

DECLARE @XSD XML ( TestSchema )
SET @XSD = '<Test1>Hello, World!</Test1>'

 

So; what does this mean?  It means that we can now use the XQuery methods built into SQL Server to cast the data from the XML datatype to a SQL Server data type.

SELECT @XSD.value('(//Test1)[1]', 'varchar(50)')

 

More to come, but that’s a good stopping place for now; we’ve built a simple XSD, and validated a simple datatype.  I plan to spend some time learning about optional data elements next.
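
As a teaser for that next step, my (untested) understanding is that optional elements are declared with minOccurs inside a complex type; here’s a sketch of what that might look like:

CREATE XML SCHEMA COLLECTION OptionalSchema AS
'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="Parent">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="RequiredChild" type="xsd:string"/>
                <xsd:element name="OptionalChild" type="xsd:string" minOccurs="0"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
</xsd:schema>'
GO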

The 12th day of SQL

My Christmas tree is like this post; short, easy to assemble, and a little tacky.

So, at the first FreeCon, a bunch of writers gathered together and talked about stuff. Most of the stuff we talked about was how to be a better technical writer, as well as how to blend our writing skills with our own personal and professional goals.  We left that conference eager to write, and looking for opportunities to hone our skills; this particular series of posts was born of that collaboration, and I hope that other series will follow.  While I could list out each individual post in the Twelve Days of SQL series, it’s probably more fun to start at the beginning.  You’ll eventually get back to this one, I hope (if you don’t poke out your eyes after seeing David Stein’s Christmas ornament).

Most of the other posts in this series have described their authors’ favorite posts of the year.  Me?  I wanna go out with a bang, a celebration of those posts that we all rely on but rarely celebrate.  At the heart of the technical blogging community is, well, the technical blog post, and it’s these posts which rarely get attention.  We often celebrate those witty and well-crafted posts, but we rarely celebrate the “how to do this” posts.  Sometimes these posts are little more than scripts; sometimes they’re well-crafted operas describing how to do a single thing.

Why do I sing praises of these short-but-sweet articles?  I’ll answer that in the form of a metaphor…

The Ghost of SQL Past

All blogs begin with a first post, and that first post leads to another.  Many of us that are regular (or irregular in my case) bloggers began our blogs with a few scripts and sample code.  Why?  Because it was a useful place to dump content that we wanted to remember.   Some fine examples of this are Aaron Nelson’s early posts on PowerShell and Ken Simpson’s XML to Pivot scripts.  These early works are indicators of great things to come; their blogs are littered with samples and ideas.

But good technical blogs are born not only of coding tricks; writers craft their works by expanding their repertoire beyond scripts and samples, and move on to include their observations of life.  Sometimes these observations are a bit too revealing (as in Brent Ozar’s self-professed love of amphibians); usually they are fascinating insights into the character of a person.  When Andy Leonard comments that Iteration = Maturity, he’s not just talking about software.

The Ghost of SQL Present

In recent days, newer bloggers have carried on the tradition of the technical post, but are finding ways to blend in a sense of community as well (like David Taylor’s exposition on #sqlhelp).   A quirky sense of humor works as well, as in Julie Smith’s opera of concatenation (I won’t spoil it for you, but there may be magic involved).  Successful technical blogs should be both fun to read, as well as provide some insight into how to do something.

The Ghost of SQL Future

Not much to say here, because we’re not there yet.  Hopefully, what I’ll see in the future is an evolution of what we’ve seen so far in the Past and the Present, but I hope that you’re reading this because you want to understand how to be a better blogger.   Technical blogs need technical content, but good technical blogs need a sense of whimsy, a touch of your personal style, and a nod to the community of content out there. Others have far better posts than I on that subject, but the simplest piece of advice I can give you is:

Write.

That’s it.  Write, because when you write, you force yourself to think, and thinking is the strongest tool in the toolbox for a technical person.   Believe me, I’m pointing the finger squarely at myself on this one as well; I have been far too reticent in my writing as of late, and I hope to rectify that shortly.  But back to you; next year, I hope to celebrate your writing in a similar post.  Tell me how to do something; share your experiences, and educate your peers. 

Up Next?  Steve Jones, for the cleanup!

How many licks DOES it take…?

So, it’s been a while since I’ve posted, but I’ve finally managed to eke out some time after Summit 2010, and wanted to follow up on a conversation that Tim Mitchell, Arnie Rowland, and I had at the Friends of Red Gate dinner.  We were doing a SQL Server oriented trivia contest which asked the following question:

How many nonclustered indexes does SQL Server 2008 support on a single table?

And the answer is: 999, which is enough to cause most DBAs to grumble under their breath about the consequences of setting up that many indexes and what they would do if they ever found that many indexes on a table.  Of course, being the amateur math geek that I am, I casually asked Tim and Arnie:

What’s the smallest table (in terms of amount of columns) that would support 999 indexes?

After knocking it around for a bit, we came up with an estimate of 6, which actually isn’t too far off; however, our method of getting there was mostly intuitive, and I wanted to figure out the actual formula for calculating that number.  I knew it had to do with factorials, but I wasn’t exactly sure how to get there.  After searching the internet, I finally figured out the following principles:

  • Column order matters when building indexes, so when choosing pairs from a set of columns, the pair (a,b) is not the same as (b,a).
  • The more columns on the table, the wider the indexes could be; choosing columns from a wider set would require iteration.  In other words, if you have 3 columns on a table, you would have 3 single-column indexes, 6 double-column indexes, and 6 triple-column indexes.

The formula that represents this is SUM(n!/(n-k)!), summed over k from 1 to n, where n represents the number of columns in the table and k represents the number of columns in the index.  Plugging this into a spreadsheet, you get the following matrix:

                                  Number of Columns in Index (k)
                                    1     2     3     4     5     6     SUM
Number of Possible Columns (n)  1   1                                     1
                                2   2     2                               4
                                3   3     6     6                        15
                                4   4    12    24    24                  64
                                5   5    20    60   120   120           325
                                6   6    30   120   360   720   720    1956
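
Written out, that’s just the same formula with the sum made explicit; as a sanity check against the matrix above, the three-column case works out as:

\[
\sum_{k=1}^{n} \frac{n!}{(n-k)!}
\qquad\text{for } n = 3:\quad
\frac{3!}{2!} + \frac{3!}{1!} + \frac{3!}{0!} = 3 + 6 + 6 = 15
\]

which matches the 15 in the three-column row.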

 

At first glance, we’re done; it looks like 6 was the right answer, because with only 6 columns in a table, you have a whopping 1,956 possible indexes to choose from.  However, there’s more to the story: SQL Server 2005 introduced the INCLUDE option to indexes, which throws a kink in the above formula. 

At first, I thought it was relatively simple: you had two subsets for each n, where the elements in each subset couldn’t be in the other one, but it’s a little more deceptive.  Here are the principles for generating it:

  • For a set (n) of possible columns, there are two mutually exclusive subsets: the base (k) and the included columns (l).  The number of elements in the two subsets must be less than or equal to the number of elements in the master set.
  • Column order matters in the base columns, but not the included columns, so the formula above can work for a base set of columns, but iterating through the included columns requires only the unique set of elements.

And here’s the part where my brain exploded; I couldn’t figure out a way to mathematically demonstrate the two relationships, so I built a SQL script, iterating through a set of 5 columns; all in all I ended up with a listing of 845 possible combinations, which means that 6 still stands as the minimum number of columns on a table needed to generate the maximum number of nonclustered indexes.

The point to this story?  None, really.  Just a silly geek exercise.  However, I think it does point out that index strategy is a complex problem, and there are multiple ways to index any given table.  Choosing the right one is more difficult than it looks.


DECLARE @c TABLE ( NAME VARCHAR(100) )

INSERT  INTO @c ( NAME )
VALUES  ( 'a' ), ( 'b' ), ( 'c' ), ( 'd' ), ( 'e' )

SELECT  n = 1, k = 1, l = 0, NAME, INCLUDE = NULL
INTO    #tmp
FROM    @c
UNION ALL
SELECT  n = 2, k = 2, l = 0, NAME = c1.NAME + ',' + c2.NAME, INCLUDE = NULL
FROM    @c c1 CROSS JOIN @c c2
WHERE   c1.NAME <> c2.NAME
UNION ALL
SELECT  n = 2, k = 1, l = 1, NAME = c1.NAME, INCLUDE = c2.NAME
FROM    @c c1 CROSS JOIN @c c2
WHERE   c1.NAME <> c2.NAME
UNION ALL
SELECT  n = 3, k = 3, l = 0, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME, INCLUDE = NULL
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3
WHERE   c1.NAME <> c2.NAME AND c2.NAME <> c3.NAME AND c1.NAME <> c3.NAME
UNION ALL
SELECT  n = 3, k = 2, l = 1, NAME = c1.NAME + ',' + c2.NAME, INCLUDE = c3.NAME
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3
WHERE   c1.NAME <> c2.NAME AND c2.NAME <> c3.NAME AND c1.NAME <> c3.NAME
UNION ALL
SELECT  n = 3, k = 1, l = 2, NAME = c1.NAME, INCLUDE = c2.NAME + ',' + c3.NAME
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3
WHERE   c1.NAME <> c2.NAME AND c1.NAME <> c3.NAME AND c2.NAME < c3.NAME
UNION ALL
SELECT  n = 4, k = 4, l = 0, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME + ',' + c4.NAME, INCLUDE = NULL
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3 CROSS JOIN @c c4
WHERE   c1.NAME <> c2.NAME AND c2.NAME <> c3.NAME AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME AND c2.NAME <> c4.NAME AND c3.NAME <> c4.NAME
UNION ALL
SELECT  n = 4, k = 3, l = 1, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME, INCLUDE = c4.NAME
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3 CROSS JOIN @c c4
WHERE   c1.NAME <> c2.NAME AND c1.NAME <> c3.NAME AND c1.NAME <> c4.NAME
        AND c2.NAME <> c3.NAME AND c2.NAME <> c4.NAME AND c3.NAME <> c4.NAME
UNION ALL
SELECT  n = 4, k = 2, l = 2, NAME = c1.NAME + ',' + c2.NAME, INCLUDE = c3.NAME + ',' + c4.NAME
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3 CROSS JOIN @c c4
WHERE   c1.NAME <> c2.NAME AND c1.NAME <> c3.NAME AND c1.NAME <> c4.NAME
        AND c2.NAME <> c3.NAME AND c2.NAME <> c4.NAME AND c3.NAME < c4.NAME
UNION ALL
SELECT  n = 4, k = 1, l = 3, NAME = c1.NAME, INCLUDE = c2.NAME + ',' + c3.NAME + ',' + c4.NAME
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3 CROSS JOIN @c c4
WHERE   c1.NAME <> c2.NAME AND c1.NAME <> c3.NAME AND c1.NAME <> c4.NAME
        AND c2.NAME < c3.NAME AND c2.NAME < c4.NAME AND c3.NAME < c4.NAME
UNION ALL
SELECT  n = 5, k = 5, l = 0, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME + ',' + c4.NAME + ',' + c5.NAME, INCLUDE = NULL
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3 CROSS JOIN @c c4 CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME AND c2.NAME <> c3.NAME AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME AND c2.NAME <> c4.NAME AND c3.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME AND c2.NAME <> c5.NAME AND c3.NAME <> c5.NAME
        AND c4.NAME <> c5.NAME
UNION ALL
SELECT  n = 5, k = 4, l = 1, NAME = c1.NAME + ',' + c4.NAME + ',' + c3.NAME + ',' + c2.NAME, INCLUDE = c5.NAME
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3 CROSS JOIN @c c4 CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME AND c2.NAME <> c3.NAME AND c1.NAME <> c3.NAME
        AND c1.NAME <> c4.NAME AND c2.NAME <> c4.NAME AND c3.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME AND c2.NAME <> c5.NAME AND c3.NAME <> c5.NAME
        AND c4.NAME <> c5.NAME
UNION ALL
SELECT  n = 5, k = 3, l = 2, NAME = c1.NAME + ',' + c2.NAME + ',' + c3.NAME, INCLUDE = c4.NAME + ',' + c5.NAME
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3 CROSS JOIN @c c4 CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME AND c1.NAME <> c3.NAME AND c1.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME AND c2.NAME <> c3.NAME AND c2.NAME <> c4.NAME
        AND c2.NAME <> c5.NAME AND c3.NAME <> c4.NAME AND c3.NAME <> c5.NAME
        AND c4.NAME < c5.NAME
UNION ALL
SELECT  n = 5, k = 2, l = 3, NAME = c1.NAME + ',' + c2.NAME, INCLUDE = c3.NAME + ',' + c4.NAME + ',' + c5.NAME
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3 CROSS JOIN @c c4 CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME AND c1.NAME <> c3.NAME AND c1.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME AND c2.NAME <> c3.NAME AND c2.NAME <> c4.NAME
        AND c2.NAME <> c5.NAME AND c3.NAME < c4.NAME AND c3.NAME < c5.NAME
        AND c4.NAME < c5.NAME
UNION ALL
SELECT  n = 5, k = 1, l = 4, NAME = c1.NAME, INCLUDE = c2.NAME + ',' + c3.NAME + ',' + c4.NAME + ',' + c5.NAME
FROM    @c c1 CROSS JOIN @c c2 CROSS JOIN @c c3 CROSS JOIN @c c4 CROSS JOIN @c c5
WHERE   c1.NAME <> c2.NAME AND c1.NAME <> c3.NAME AND c1.NAME <> c4.NAME
        AND c1.NAME <> c5.NAME AND c2.NAME < c3.NAME AND c2.NAME < c4.NAME
        AND c2.NAME < c5.NAME AND c3.NAME < c4.NAME AND c3.NAME < c5.NAME
        AND c4.NAME < c5.NAME

SELECT  n, COUNT(*)
FROM    #tmp
GROUP BY n
ORDER BY n

DROP TABLE #tmp

#TSQL2sDay – My Least Favorite SQL Server Myth


It’s time again for another T-SQL Tuesday, hosted this month by Sankar Reddy; the topic is misconceptions in SQL Server.  It’s been a while since I wrote one of these (I usually forget about them until the following Wednesday), but this topic is a good one.  I’ve had many discussions with people about the following myth for a long time, so it’s nice to be able to put it to rest again.

The myth?  “You should always use stored procedures in your code, because SQL Server is optimized for stored procedure re-use.”  Don’t get me wrong; there are lots of arguments to use stored procedures (security, obfuscation, code isolation), but performance is not necessarily a good one.   This myth has been around for a long time (since SQL Server 7), and Binging “stored procedures SQL Server performance” yields such gems as the following:

SQL Server Performance Tuning for Stored Procedures

As you know, one of the biggest reasons to use stored procedures instead of ad-hoc queries is the performance gained by using them. The problem that is that SQL Server will only … www.sqlserverperformance.com/tips/stored_procedures_p2.aspx · Cached page

Increase SQL Server stored procedure performance with these tips

Database developers often use stored procedures to increase performance. Here are three tips to help you get the most from your SQL Server stored … articles.techrepublic.com.com/5100-10878_11-1045447.html · Cached page

 

The guts of this myth originate from the fact that prior to version 7 (released in 1998), SQL Server WOULD precompile stored procedures and save an execution plan for future reuse of that procedure, BUT THAT CHANGED AS OF VERSION 7.0.  Here’s a quote from Books Online (SQL Server 2000) that tries to explain what happened (emphasis added by me):

Stored Procedures and Execution Plans

In SQL Server version 6.5 and earlier, stored procedures were a way to partially precompile an execution plan. At the time the stored procedure was created, a partially compiled execution plan was stored in a system table. Executing a stored procedure was more efficient than executing an SQL statement because SQL Server did not have to compile an execution plan completely, it only had to finish optimizing the stored plan for the procedure. Also, the fully compiled execution plan for the stored procedure was retained in the SQL Server procedure cache, meaning that subsequent executions of the stored procedure could use the precompiled execution plan.

SQL Server 2000 and SQL Server version 7.0 incorporate a number of changes to statement processing that extend many of the performance benefits of stored procedures to all SQL statements. SQL Server 2000 and SQL Server 7.0 do not save a partially compiled plan for stored procedures when they are created. A stored procedure is compiled at execution time, like any other Transact-SQL statement. SQL Server 2000 and SQL Server 7.0 retain execution plans for all SQL statements in the procedure cache, not just stored procedure execution plans. The database engine uses an efficient algorithm for comparing new Transact-SQL statements with the Transact-SQL statements of existing execution plans. If the database engine determines that a new Transact-SQL statement matches the Transact-SQL statement of an existing execution plan, it reuses the plan. This reduces the relative performance benefit of precompiling stored procedures by extending execution plan reuse to all SQL statements.

SQL Server 2000 and SQL Server version 7.0 offer new alternatives for processing SQL statements. For more information, see Query Processor Architecture.

Note from the above quote that the query optimizer now uses execution plans for ALL T-SQL statements, not just stored procedures.  The perceived performance gain from stored procedures stems not from some magic use of CREATE PROC, but rather in plan re-use, which is available to ad-hoc queries as well. 

So what promotes plan re-use?  The simplest answer is parameterization; SQL statements which use parameters efficiently (which includes many stored procedures) will be more likely to reuse a plan.  Developers should focus on making the most out of parameters, rather than simply assuming that a stored procedure will be efficient simply because of some magical aspect of said procs.
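
As a quick illustration of that point, a parameterized ad-hoc statement run through sp_executesql gets a cached plan that is reused across parameter values, just like a procedure’s plan would be (the table and parameter names here are mine, purely for the sketch):

-- one plan in cache, reused for every @CustomerID value passed in
EXEC sp_executesql
    N'SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = @CustomerID',
    N'@CustomerID INT',
    @CustomerID = 42

-- the same statement with the literal inlined can compile a separate plan per distinct value
-- SELECT OrderID, OrderDate FROM dbo.Orders WHERE CustomerID = 42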

A final thought: For a great starting place on understanding SQL Server plan reuse, see http://msdn.microsoft.com/en-us/library/ee343986(SQL.100).aspx, including the Appendix A: When Does SQL Server Not Auto-Parameterize Queries.  Also, this post by Zeeshan Hirani explains why LINQ to SQL query plans don’t get reused.