A simple codebuilder for parsing in T-SQL

If you’ve ever tried to parse a wide character column in T-SQL, you know two things:

  1. It’s a pain to do, and
  2. It’s a pain to do.

A lot of the data I deal with comes in syslog format, which can come in one of two formats: positional (the location of the data element is related to the type of data), and named attributes (which usually only include delimiters for complex strings).  Although I haven’t had much luck automating positional parsing, I’ve recently begun using Excel to help me with the named attributes. 

Here’s an example; I have a table with a message column that is pulling over syslog data from a firewall.  In a given day, I may have millions of rows like the following:

sn=AA17D5028EAA time="2011-01-26 13:40:14 UTC" fw=10.1.100.1 pri=1 c=512 m=522 msg="Malformed or unhandled IP packet dropped" n=1 src=10.1.1.23:32795:X1: dst=10.1.1.1:514:: proto=udp/17

Note that each attribute of this particular syslog message is identified with an attribute name (eg, sn, time, fw, etc).  In order to break out each of the elements in T-SQL, we can split the string using a combination of SUBSTRING and CHARINDEX, like so:

SELECT TOP 1
        m
= CONVERT(INT, SUBSTRING(MESSAGE, CHARINDEX(' m=', MESSAGE) + 3,
                                  
CHARINDEX(' ', MESSAGE, CHARINDEX(' m=', MESSAGE) + 3) - ( CHARINDEX(' m=', MESSAGE)
                                                                                              +
3 )))
      ,
time = CONVERT(DATETIME, SUBSTRING(MESSAGE, CHARINDEX(' time="', MESSAGE) + 7,
                                          
CHARINDEX('UTC"', MESSAGE, CHARINDEX(' time="', MESSAGE) + 7)
                                           - (
CHARINDEX(' time="', MESSAGE) + 7 )))
      ,
fw = CONVERT(VARCHAR(20), SUBSTRING(MESSAGE, CHARINDEX(' fw=', MESSAGE) + 4,
                                           
CHARINDEX(' ', MESSAGE, CHARINDEX(' fw=', MESSAGE) + 4) - ( CHARINDEX(' fw=',
                                                                                                       
MESSAGE) + 4 )))
FROM    syslogng (NOLOCK)

Note the repetition for each column; you need to find the position of a starting delimiter, the position of an ending delimiter, and supply to the SUBSTRING function the position of the starting delimiter, and the difference between the two.  You also need to determine the lingth of the starting identifier, and then I CONVERT to a specific data type.  Whee!

It gets even more fun when the attributes are optional; some syslog messages may have a proto code, and some may not.   When faced with this, you need to include a CASE option, like so:

SELECT TOP 1
        proto
= CONVERT(VARCHAR(20), CASE WHEN CHARINDEX(' proto=', MESSAGE) = 0 THEN NULL
                                         
ELSE SUBSTRING(MESSAGE, CHARINDEX(' proto=', MESSAGE) + 7,
                                                        
CHARINDEX(' ', MESSAGE, CHARINDEX(' proto=', MESSAGE) + 7)
                                                         - (
CHARINDEX(' proto=', MESSAGE) + 7 ))
                                    
END)
FROM    syslogng (NOLOCK)

 

One of our developers is working on a syslog parser in .NET code, but I needed a proof-of-concept, and I didn’t want to keep cutting and pasting to see if it was working.  Looking at the parsing, it’s very formulaic SQL.  When I think formulas, I think Excel, and so I whipped out the following:

image

Note that I have several input columns:

  • start, the starting delimiter
  • end, the ending delimiter (usually a space)
  • colname, the column name I want to use; usually the same as start, but stripped of extra characters.
  • type, the SQL type I want to convert the data to, and
  • optional, a column to decide if the attribute is optional per row or not.

I also have a hidden column (column F), which generates most of the SQL code:

=CONCATENATE("SUBSTRING(message, CHARINDEX(‘", A2, "’, message)+ ", LEN(A2), ", CHARINDEX(‘", B2, "’, message, CHARINDEX(‘", A2, "’, message)+", LEN(A2), ") – (CHARINDEX(‘", A2, "’, message)+", LEN(A2), "))")

This takes the starting and ending delimiters, the length of the starting delimiter, and plugs those values into a valid SQL statement.  I then create a SQL column, using the following formula:

=CONCATENATE(", ", C2,"CONVERT(", D2, ", ",  IF(E2="Y", CONCATENATE("CASE WHEN CHARINDEX(‘", A2, "’, message) = 0 THEN NULL ELSE ",F2, " END"), F2), ")")

If I were better at Excel, I’d use named ranges, but for my purposes, this is OK.   I append a column to the beginning, specify the type, and include a CASE statement based on whether or not my optional column includes a “Y”.

It took me longer to write this blog post than it did to generate a proof-of-concept, parsing each of the named attributes out from a syslog message.

Share