XML

Something new for 2011: XML and XSD

As part of my New Year’s resolution for 2011, I vowed to do a deep-dive on something technical every month; for January, I’m focusing on XML.  I’ve been using XML and XQuery in SQL Server for a while now (even presenting on it), but I still don’t consider myself an expert in the area.  For example, I use a lot of untyped XML to transfer data between databases; I’ve never really tackled XSD (XML Schema Definition Language), and now’s the time.  I’m reading The Art of XSD by Jacob Sebastian to help get me started.

What’s XSD?  In a nutshell, it’s an XML document which validates the structure of another XML document.  From the perspective of a database developer, an XSD document describes how data should look in a dataset; if the data doesn’t match the description (i.e, if a table is missing a column), that dataset is invalid.  The XSD document can be very precise, or it can offer options for the dataset, but in either case, the point of an XSD is to document the expectations about the dataset.  XML without XSD is untyped; XML with an XSD is typed (although XSD’s do more than just provide information about the data types contained within the XML).

Let’s take a look at an untyped XML statement:

DECLARE @NoXSD XML
SET
@NoXSD = '<Test1>Hello World!</Test1>'
SELECT @NoXSD

 

Simple and straightforward; I created an XML variable, and populated it with an XML fragment.  I then pulled the data out of that fragment.  In this example, we have an element named Test1; what happens if we have a typo when we populate the variable?

SET @NoXSD = '<Test2>Hello World!</Test2>'
SELECT @NoXSD

 

Nothing happens.  It’s a well-formed XML fragment (no root tag, but it does have starting and ending tags); the XML engine in SQL Server doesn’t know that our fragment is supposed to have an element named Test1, so it accepts the fragment as valid.  This is where an XSD comes in:

IF EXISTS( SELECT * FROM sys.xml_schema_collections WHERE name = 'TestSchema' )
DROP XML SCHEMA  COLLECTION TestSchema
GO

CREATE XML SCHEMA COLLECTION TestSchema AS
'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Test1" />
</xsd:schema>'
GO

DECLARE @XSD XML ( TestSchema --use the schema to validate (type) the xml
SET @XSD = '<Test1>Hello, World!</Test1>'

SELECT @XSD

 

Since the XML fragment matches the XSD,  the assignment of data works; what happens when we assign a fragment that doesn’t match?

SET @XSD = '<Test2>Hello, World!</Test2>'

We get a big fat error message:

XML Validation: Declaration not found for element ‘Test2’. Location: /*:Test2[1]

Straightforward, right?  But now what?  Well, let’s type the data in our schema:

IF EXISTS( SELECT * FROM sys.xml_schema_collections WHERE name = 'TestSchema' )
DROP XML SCHEMA  COLLECTION TestSchema
GO

CREATE XML SCHEMA COLLECTION TestSchema AS
'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="Test1" type="xsd:string"/>
</xsd:schema>'
GO

DECLARE @XSD XML ( TestSchema )
SET @XSD = '<Test1>Hello, World!</Test1>'

 

So; what does this mean?  It means that we can now use the XQuery methods built into SQL Server to cast the data from the XML datatype to a SQL Server data type.

SELECT @XSD .value ( '(//Test1)[1]' , 'varchar(50)' )

 

More to come, but that’s a good stopping place for now; we’ve built a simple XSD, and validated a simple datatype.  I plan to spend some time learning about optional data elements next.

Querying XML in SQL Server 2005+: namespaces

I recently helped a friend solve an XML problem, and thought I would post the solution here.  Although there are lots of notes on how to use XQuery in SQL Server 2005+, this was a real world scenario that was trickier than I expected.  The friend works for an insurance company broker, and in one of their applications, accident questionnaires (and their answers) are stored in XML.  This allows them to treat all questionnaires as the same, regardless of their origin as long as the QuestionCodes are common across vendors.

Below is the sample data that he was asking me about; he needed to get one question and answer per row into a data set:


DECLARE @T TABLE ( RowID INT, Fragment XML )
INSERT  INTO @T
       
( RowID, Fragment )
VALUES  ( 1, '<Questionnaire xmlns="http://www.xxx.com/schemas/xxxXXXXXxxx.0">
<Question>
<QuestionCode>74</QuestionCode>
<Question>Why did you wreck your car?</Question>
<Answer>I was drunk</Answer>
<Explanation />
</Question>
<Question>
<QuestionCode>75</QuestionCode>
<Question>Why is the rum all gone?</Question>
<Answer>Because I drank it.</Answer>
<Explanation />
</Question>
</Questionnaire>'
)
,       (
2, '<Questionnaire xmlns="http://www.xxx.com/schemas/xxxXXXXXxxx.0">
<Question>
<QuestionCode>74</QuestionCode>
<Question>Why did you wreck your car?</Question>
<Answer>Stuart was drunk</Answer>
<Explanation />
</Question>
<Question>
<QuestionCode>75</QuestionCode>
<Question>Why is the rum all gone?</Question>
<Answer>Because I made mojitos.</Answer>
<Explanation />
</Question>
</Questionnaire>'
)

I thought it was a simple query; simply use the .nodes() method to rip each of the questions and corresponding answers into their own rows, but for some reason, when I ran the following, I got interesting results:

SELECT  t.RowID
     
, QuestionCode = t1.frag.value('(QuestionCode)[1]', 'int')
      ,
Question = t1.frag.value('(Question)[1]', 'varchar(max)')
      ,
Answer = t1.frag.value('(Answer)[1]', 'varchar(max)')
      ,
Explanation = t1.frag.value('(Explanation)[1]', 'varchar(max)')
FROM    @t t
       
CROSS
APPLY Fragment.nodes('//Questionnaire/Question') AS t1 ( frag )

RowID  QuestionCode    Question    Answer  Explanation

That’s right, nothing.  Strange, considering I’ve done variations of this query for a couple of years now to parse out firewall data fragments.  I looked closer, and tried to see what was different about the XML fragment from this example compared to mine, and it was clear: a namespace reference.   Most of the data I deal with is not true XML, but rather fragments I convert to XML in order to facilitate easy transformations.  To test, I stripped the namespace line (xmlns="http://www.xxx.com/schemas/xxxXXXXXxxx.0" ) out, and voila!  Results!

RowID QuestionCode Question Answer Explanation
1 74 Why did you wreck your car? I was drunk
1 75 Why is the rum all gone? Because I drank it.
2 74 Why did you wreck your car? Stuart was drunk
2 75 Why is the rum all gone? Because I made mojitos.

Well, that was great, because it showed me where the problem was but how do I fix it?  I stumbled upon a solution, but to be honest, I’m not sure it’s the best one.  If I modify my query to refer to any namespace (the old wildcard: *) like so:

 

SELECT  t.RowID
     
, QuestionCode = t1.frag.value('(*:QuestionCode)[1]', 'int')
      ,
Question = t1.frag.value('(*:Question)[1]', 'varchar(max)')
      ,
Answer = t1.frag.value('(*:Answer)[1]', 'varchar(max)')
      ,
Explanation = t1.frag.value('(*:Explanation)[1]', 'varchar(max)')
FROM    @t t
       
CROSS
APPLY Fragment.nodes('//*:Questionnaire/*:Question') AS t1 ( frag )

 

I get the correct results.

RowID QuestionCode Question Answer Explanation
1 74 Why did you wreck your car? I was drunk
1 75 Why is the rum all gone? Because I drank it.
2 74 Why did you wreck your car? Stuart was drunk
2 75 Why is the rum all gone? Because I made mojitos.

Here’s the question for any XML guru that stumbles along the way; is there a better way to do this?

Speaking today: PASS AppDev Virtual Chapter

I know it’s short notice, but to be honest, I totally forgot about this until a couple of weeks ago.  I’ll be presenting today at noon eastern on a LiveMeeting for the Application Developers Virtual Chapter of PASS.  Deets below:

“You Got XML In My Database? What’s Up With That?”
May 11th 12:00 PM EDT (GMT -4)
Add to Calendar
Presenter: Stuart Ainsworth

A brief presentation exploring the marriage of XML and relational databases, including when it works and when it doesn’t. Coverage will include various use case scenarios, and some tips on how to improve performance using design techniques.

Stuart Ainsworth

Stuart R Ainsworth, MA, MEd is a Database Architect working in the realm of Financial Information Security; over the last 15 years, he’s worked as a Research Analyst, a report writer, a DBA, a programmer, and a public speaking professor. He’s one of the chapter leaders for AtlantaMDF, the Atlanta chapter of PASS. A master of air guitar, he has yet to understand the point of Rock Band (“You push buttons? What’s that all about?”).

How do I view the presentation?
Attendee URL:  Live Meeting link

SQL Server & XML: links of interest

I’ve been working on a project recently that requires using XML in SQL Server, so I’ve been searching the web to find references. Books Online is unusually obtuse on this subject (the samples are sub-par for this complex subject), so I’ve been trying to find good examples of how it works. Here’s some of the better ones:

Passing lists to SQL Server 2005 with XML Parameters Jon Galloway gives a short and sweet example of shredding an XML variable into rows.

XQuery Inside SQL Server 2005 A longer set of examples using CROSS APPLY to associate XML data in a column to a value in a standard column. This is probably the most important query I needed, because we have a lot of XML values (config files) that are stored in a table; those config files are specific to particular clients.

SQL Server 2008 Books Online Here’s the basic data elements; again, the vocabulary is dense, but you at least need to know where it is.

Now, on to FLWOR (which I’m just beginning to study): Blooming FLWOR – An Introduction to the XQuery FLWOR Expression

it’s all in how you look at it…

Not sure how to classify this bit of useless knowledge, but I’m putting it out there in case it sparks some-one’s creative interest.  Most of my day job entails parsing firewall syslog data; syslog may be a standard method of streaming data from a firewall, but there’s really very little standard about what is contained inside the syslog stream.  Each vendor has their own method of recording what events occurred on the firewall; if you’re trying to compare data between firewall vendors, it gets complicated.

Here’s a stream from a Watchguard (version 8.x) firewall:

firewalld[126]: deny out eth1 293 udp 20 127 192.168.10.211 192.9.202.3 137 137 (default)

Here’s one from a Cisco Pix:

%PIX-6-710005: UDP request discarded from 10.1.1.26/137 to inside:10.1.255.255/netbios-ns

Now, there are various methods of parsing the data; if you’ve ever used SQL Server to parse a message string, you know that it can get quite ugly.  Especially if the data is irregular.  Speaking of regularity, you can use regular expressions if you’re parsing the data with a client (such as VBScript or .NET), but it’s not always the easiest thing to do.  However, some of the firewall vendors (Watchguard,version 9) have recently become my friends (as in I don’t really know them, but I like what they’re doing), and adopted a quote-delimited standard for syslog traffic, like so:

disp="Deny"   pri="1" policy="Unhandled Internal Packet-00" src_ip="192.168.16.141" dst_ip="192.168.16.255" pr="netbios-dgm/udp" src_port="138" dst_port="138" src_intf="1-Trusted" dst_intf="Firebox"      rc="101" msg="denied" pckt_len="229" ttl="128"

Why do I like this so much?  Parsing in SQL is still a bear, and the streams may still be irregular (different events may have different attributes in different orders), so using VBScript and regular expressions haven’t really improved.  Look at the sample again; what happens if we slap a little extra information to the beginning and end?

<Msg disp="Deny"   pri="1" policy="Unhandled Internal Packet-00" src_ip="192.168.16.141" dst_ip="192.168.16.255" pr="netbios-dgm/udp" src_port="138" dst_port="138" src_intf="1-Trusted" dst_intf="Firebox"      rc="101" msg="denied" pckt_len="229" ttl="128" />

Suddenly, we have an XML fragment, which is MUCH easier to parse and manipulate, even in T-SQL (2005):

DECLARE @Msg XML
SET
@Msg =
<Msg disp="Deny" pri="1" policy="Unhandled Internal Packet-00" src_ip="192.168.16.141" dst_ip="192.168.16.255" pr="netbios-dgm/udp" src_port="138" dst_port="138" src_intf="1-Trusted" dst_intf="Firebox" rc="101" msg="denied" pckt_len="229" ttl="128" />


SELECT @Msg.value (
(Msg/@disp)[1],VARCHAR(50))


I realize this is a very specific tip for a very select group of people, but I thought the exercise was worth pointing out; raw data may not always be in the format we want, but if it is standard and predictable, there may be simple methods of manipulating it.  XML is where you find it.