SQL Server XQuery: MORE deleting nodes using .modify()

So after my last post, my developer friend came back to me and noted that I hadn’t really demonstrated the situation we had discussed; our work was a little more challenging than the sample script I had provided.  In contrast to what I previously posted, the challenge was to delete nodes where a sub-node contained an attribute of interest.  Let me repost the same sample code as an illustration:

DECLARE @X XML = 
'<root>
  <class teacher="Smith" grade="5">
    <student name="Ainsworth" />
    <student name="Miller" />
  </class>
  <class teacher="Jones" grade="5">
    <student name="Davis" />
    <student name="Mark" />
  </class>
  <class teacher="Smith" grade="4">
    <student name="Baker" />
    <student name="Smith" />
  </class>
</root>'

SELECT  @x

If I wanted to delete the class nodes which contain a student node with a name of “Miller”, there are a couple of ways to do it; the first method involves two passes:

SET @X.modify('delete /root/class//.[@name = "Miller"]/../*')
SET @X.modify('delete /root/class[not (node())]')
SELECT @x

In this case, we walk the axis and find a node test of class (/root/class); we then apply a predicate to look for an attribute of name with a value of Miller ([@name = "Miller"]) in any node below the node of class (//.).  We then walk back up a node (/..), and delete all subnodes (/*).

That leaves us with an XML document that has three nodes for class, one of which is empty (the first one).  We then have to do a second pass through the XML document to delete any class node that does not have nodes below it (/root/class[not (node())]).

The second method accomplishes the same thing in a single pass:

SET @x.modify('delete /root/class[student/@name="Miller"]')
SELECT @x

In this case, we walk the axis to class (/root/class), and then apply a predicate that looks for a node of student with an attribute of name with a value of Miller ([student/@name="Miller"]); the difference in this syntax is that the context of the delete statement stays at the specific class node, as opposed to stepping down a node and then back up.
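Because the single-pass predicate selects the class nodes directly, the same path can be used to preview what a delete would remove before running it.  The following is only a sketch against a trimmed-down copy of the sample document; the column aliases (classes_to_delete, nodes_to_delete, after_delete) are names I made up for illustration.

-- Sketch only: preview the nodes that the predicate selects, then delete them.
DECLARE @X XML = 
'<root>
  <class teacher="Smith" grade="5">
    <student name="Ainsworth" />
    <student name="Miller" />
  </class>
  <class teacher="Jones" grade="5">
    <student name="Davis" />
    <student name="Mark" />
  </class>
</root>'

-- How many class nodes match the predicate?
SELECT @X.value('count(/root/class[student/@name="Miller"])', 'int') AS classes_to_delete

-- Which ones?  .query() returns the matching nodes as XML.
SELECT @X.query('/root/class[student/@name="Miller"]') AS nodes_to_delete

-- The same path, handed to delete.
SET @X.modify('delete /root/class[student/@name="Miller"]')
SELECT @X AS after_delete

Running the path through .value() and .query() first is a cheap way to check a predicate before letting .modify() change the data.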

SQL Server XQuery: deleting nodes using .modify()

Quick blog post; got asked today by a developer friend of mine about how to delete nodes in an XML fragment using the .modify() method.  After some head-scratching and some fumbling around (it’s been a few months since I’ve done any work with XML), we came up with a version of the following script:

DECLARE @X XML = 
'<root>
  <class teacher="Smith" grade="5">
    <student name="Ainsworth" />
    <student name="Miller" />
  </class>
  <class teacher="Jones" grade="5">
    <student name="Davis" />
    <student name="Mark" />
  </class>
  <class teacher="Smith" grade="4">
    <student name="Baker" />
    <student name="Smith" />
  </class>
</root>'

SELECT  @x

--delete the classes that belong to teacher Smith
SET @X.modify('delete /root/class/.[@teacher="Smith"]')
SELECT @X 

Now, let me try to explain it:

  1. Given a simple document that has a root with classes, and students in each class, we want to delete all classes that are being taught by a teacher named “Smith”.
  2. To do that, we delete the class nodes (and everything under them) that belong to Smith:
    1. Using XPath, we walk the axis and use a node test to restrict to /root/class/. (the current class node).
    2. We then apply a predicate looking for a teacher attribute with a value of “Smith”.
    3. The .modify() method applies the delete command to the @X variable, and updates the XML.
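The steps above can also be written with the predicate attached directly to class, since the extra "/." step only re-selects the current node.  This is a minimal sketch rather than the script we actually used; the smith_classes_remain and result aliases are hypothetical names added for illustration.

-- Sketch only: same delete, with the predicate applied directly to class.
DECLARE @X XML = 
'<root>
  <class teacher="Smith" grade="5">
    <student name="Ainsworth" />
    <student name="Miller" />
  </class>
  <class teacher="Jones" grade="5">
    <student name="Davis" />
    <student name="Mark" />
  </class>
  <class teacher="Smith" grade="4">
    <student name="Baker" />
    <student name="Smith" />
  </class>
</root>'

SET @X.modify('delete /root/class[@teacher="Smith"]')

-- .exist() returns 1 if any class taught by Smith is still in the document.
SELECT @X.exist('/root/class[@teacher="Smith"]') AS smith_classes_remain,
       @X AS result

After the delete, only the class taught by Jones should be left under root; note that the student named Smith is removed only because that student sits inside one of Smith's classes, not because of the name.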

First few bites of the elephant: working with Hortonworks Hadoop

So a few weeks ago, I mentioned that I was starting to diversify my data interests in hopes of steering my career path a bit; I’ve built a home brewed server, and downloaded a copy of the Hortonworks Sandbox for Hadoop.  I’ve started working through a few tutorials, and thought I would share my experiences so far.

My setup….

I don’t have a lot of free cash to set up a super-duper learning environment, but I wanted to do something on-premises.  I know that Microsoft has HDInsight, the cloud-based version of Hortonworks, but I’m trying to understand the administrative side of Hadoop as well as the general interface.  I opted to upgrade my old fileserver to a newer rig; costs ran about $600 for the following:

ASUS|M5A97 R2.0 970 AM3+ Motherboard   
AMD|8-CORE FX-8350 4.0G 8M CPU   
8Gx4|GSKILL F3-1600C9Q-32GSR Memory   
DVD BURN SAMSUNG | SH-224DB/BEBE  DVD Burner

I already had a case, power supply, and a couple of SATA drives (sadly, my IDE’s no longer work; also the reason for purchasing a DVD burner).  I also had a licensed copy of Windows 7 64-bit, as well as a few development copies of Microsoft applications from a few years ago (oh, how I wish I was an MVP….).

As a sidebar, I will NEVER purchase computer equipment from Newegg again; their customer service was horrible.  A few pins were bent on the CPU, and it took nearly 30 days to get a replacement, and most of that time was spent with little or no notification.

I downloaded and installed the Hortonworks Sandbox using the VirtualBox version.  Of course, I had to reinstall after a few tutorials because I had skipped a few steps; after going back and following the instructions, everything is just peachy.  One of the nice benefits of the Virtualbox setup is that once I fire up the Hortonworks VM on my server, I can use a web browser on my laptop pointed to the server’s IP address with the appropriate port added (e.g., xxx.xxx.xxx.xxx:8888), and bam, I’m up and running.

Working my way through a few tutorials

First, I have to say, I really like the way the Sandbox is organized; it’s basically two frames: the tutorials on the left, and the actual interface into a working version of Hadoop on the right.  It makes it very easy to go through the steps of the tutorial.


The Sandbox has lots of links and video clips to help augment the experience, but it’s pretty easy to get up and running on Hadoop; after only a half-hour or so of clicking through the first couple of tutorials, I got some of the basics down for understanding what Hadoop is (and is not); below is a summary of my initial thoughts (WARNING: these may change as I learn more).

Summary:

  • Hadoop is made up of several different data access components, all of which have their own history.  Unlike a tool like SQL Server Management Studio, the experience may vary depending on what tool you are using at a given time.  The tools include (but are not limited to):
    • Beeswax (Hive UI): Hive is a SQL-like language, and so the UI is probably the most familiar to those of us with RDBMS experience.  It’s a query editor.
    • Pig is a procedural language that abstracts the data manipulation away from MapReduce (the underlying engine of Hadoop).  Pig and Hive have some overlapping capabilities, but there are differences (many of which I’m still learning).
    • HCatalog is a relational abstraction of data across HDFS (Hadoop Distributed File System); think of it like the DDL of SQL.  It defines databases and tables from the files where your actual data is stored; Hive and Pig are like DML, interacting with the defined tables.
  • A single-node Hadoop cluster isn’t particularly interesting; the fun part will come later when I set up additional nodes.

The Evolution of the DBA

Recently, there have been a couple of great posts about the Death of the Database Administrator, including a response by Steve Jones and several reactions by the staff of SQL Server Pro; the central premise behind the supposed demise revolves around this one major thought:

 

The evil cloud has reduced the need for internal systems infrastructure, including database administration.  It’s a storm of needs for faster development (agility) and the rise of hosted services; who needs a database server, when you can rent space on Azure?   Please note that I’m not specifically anti-cloud, but I’m casting it as the villain when careers are on the line.

Furthermore, in shops where the cloud is banned (e.g., financial services),  developers are using tools like Entity Framework to write SQL for them. Tuning SQL thus becomes an application change as opposed to a stored procedure change; DBA’s who do performance tuning have to focus on index maintenance and hardware acquisition.  Code tuning is now part of the development domain, and the career of the pure SQL developer is gasping in comparison.   

Like all great controversial statements, there’s an element of truth; the cloud, agile approaches, and new technologies are reducing the need for traditional database administrators, but I think we’re a long way away from pulling the plug.  However, I will say that over the next decade, these trends will probably continue to grow, eating away at the availability of jobs that do strict database administration (and the SQL developer will probably expire altogether).  But not yet.

What this does mean is that if you are intending to be employed 10 years from now, and you’re a database administrator, you’ve got two choices to make today:

  1. Master a specialty.  If you’re planning on consulting for a living,  this is a great choice.  Get so intimate with the database product of your choice that you become the go-to person for problem-solving.  Companies that have large installations of SQL Server will need secondary support as the product becomes easier to maintain (and big problems get obfuscated by GUI’s).
  2. Expand your horizon.  Instead of focusing on super in-depth mastery of your database platform, broaden your perspective; if you’re a SQL Server guy like me, start learning a little bit about SSRS, SSAS, and SSIS (if you don’t already know it).  Spread out into Hadoop, and NoSQL; dabble in MySQL and SQLite.  Understand what the cloud can do, and where it makes sense to use it.

So go deep or go broad, but go.  I wouldn’t start quaking in my boots just yet about the demise of your career, but change is coming; those who adapt, survive.

For me? I’m going broad.  I’ve built a home-brewed server, and downloaded a copy of the HortonWorks Hadoop Sandbox.  Stay tuned for my adventures with Hadoop.

MaTT: The MARS framework for DBA’s

So, as usual, I’m struggling to sit down and write; I know I should (it’s good for the soul), but frankly, I’m struggling to put words to web page.  As a method of jumpstarting my brain, I thought I would write about something simple and relevant to what I’m doing these days in my primary capacity: management.

When I took over the reins of a newly formed department of DBA’s, I knew that I needed to do something quick to demonstrate the value of our department to our recently re-organized company.  We’re a small division in our company, but we manage some relatively large databases (17 TB of data, with a relatively high daily change rate; approximately 10 TB of data change daily).  My team was made up of senior DBA’s who were inundated with support requests (from “I need to know this information” to “I need help cleaning up this client’s data”); while they had monitoring structures in place, it wasn’t uncommon for things to go unnoticed (unplanned database growth, poorly performing queries, etc.).   One of my first acts as a manager was to put in a system of classification I called MARS; all of our work as DBA’s needed to be categorized into one of four broad groups.

Maintenance, Architecture, Research, and Support

The premise is simple; by categorizing efforts, we could measure where the focus of our department was, and begin to allocate resources into the proper arenas.  I defined each of the four areas of work as such:

  • Maintenance – the efforts needed to keep the system performing well; backups, security, pro-active query tuning, and general monitoring are examples.
  • Architecture – work associated with the deployment of new features, functionality, or hardware; data sizing estimates, upgrades to SQL 2012, installation of Analysis Services are examples.  To be honest, Infrastructure may have been a better term, but MIRS sounded stupid.
  • Research – the efforts to understand and improve employee skills; I’m a former teacher, and I put a pretty high value on lifelong learning.  I want my team to be recognized as experts, and the only way that can happen is if the expectation is there for them to learn.  
  • Support – the 800 lb gorilla in our shop; support efforts focus on incident management (to use ITIL terms) and problem resolution.  Support is usually instigated by some other group; for example, sales may request a contact list from our CRM that they can’t get through the interface, or we may get asked to explain why a ticket didn’t get generated for a customer.

After about a year of data gathering, I went back and thought about the categories a bit more, and realized that I could associate some descriptive adjectives with the workload to demonstrate where the heart of our efforts lies.  I took my cues from the Johari window, and came up with the two axes: Actions: Proactive – Reactive, and Results: Delayed – Immediate.   I then arranged my four categories along those lines, like so:

[Image: the four MARS categories arranged along the Proactive – Reactive and Delayed – Immediate axes]

In other words, Maintenance was Proactive, but had Delayed results (you need to monitor your system for a while before you grasp the full impact of changes).  Research was more Reactive, because we tend to research issues that are spawned by some stimulus (“what’s the best way to implement Analysis Services in a clustered environment?” came up as a Research topic because we have a pending BI project). 

Immediate results came from Architectural changes; adding more spindles to our SAN changed our performance quickly, but there was Proactive planning involved before we made the change.  Support is Reactive, but has Immediate results; the expectation is that support issues get prioritized, so we try to resolve those quickly as part of our Operational Level Agreements with other departments. 

After a couple of months looking at our workload (using Kanban), I see that we still spend a lot of time in Support, but that effort is trending downward;  I continue to push Maintenance, Architecture, and Research over Support, and we’re becoming much more proactive in our approaches.  I’m not sure if this quadrant approach is the best way to represent workload, but it does give me a general rule of thumb in helping guide our efforts.

SQLSaturday 285 (#sqlsatatl) pre-cons are now live!

SQL Saturday #285 is offering 3 preconference sessions on Friday, May 2 at the GSU campus in Alpharetta, site of SQL Saturday #285 on Saturday, May 3:
Kalen Delaney: What the Hekaton!? A Whole New Way to Think About Data Management

SQL Server Hekaton, Microsoft’s new In-Memory table technology being shipped as part of SQL Server 2014, will completely change the way you think about data management. As a DBA, you’ll need to analyze your memory and storage needs completely differently. All Hekaton data is always stored in memory, and the data stored on disk is basically just a REDO log used to regenerate the contents of your memory-optimized tables. In this full-day seminar, Kalen Delaney (a SQL Server MVP for over 20 years) will show you the in-memory architecture for your Hekaton data and indexes, and discuss what gets written to disk during checkpoints, as well as what gets logged. She will explain how the recovery process recreates your Hekaton tables. Finally, she’ll go into detail on just what it is that makes Hekaton so much FASTER!

 

Denny Cherry: SQL Performance Tuning & Optimization

In this session you will learn about SQL Server 2008 R2 and SQL Server 2012 performance tuning and optimization. Industry Expert Denny Cherry will guide you through tools and best practices for tuning queries and improving performance within Microsoft SQL Server.  This session will guide you through real life performance problems which have been gathered and tuned using industry standard best practices and real world skills.

 

 

Teo Lachev: Deep Dive into the Microsoft BI Semantic Model (BISM)

The chances are that your organization has a centralized data repository, such as ODS or a data warehouse, but you might not use it to the fullest. Join this insightful full-day event to understand the importance of having a semantic layer that bridges users and data. In the Microsoft BI world, BISM consists of Power Pivot, Tabular, and Multidimensional. 

All 3 presenters are published authors and Microsoft MVPs many times over. These sessions are a great value: a full day with an acknowledged SQL Server thought leader.

Early registration is only $129 until March 1, when the price of any remaining seats will go up to $149. Follow @AtlantaMDF on Twitter and get a promo code for $10 off the early registration price (for Kalen Delaney or Denny Cherry)! We’ll tweet the promo code at 9am Thursday (Jan 23) – it’s only good for 10 uses (for each session), so be sure to check your Twitter feed tomorrow morning!

Back on the trail…. #sqlsatnash

I realize that I should probably be blogging about my New Year’s resolutions, but meh… I’ve been super busy surviving the holidays.  So busy in fact that I’ve failed to mention that I’ll be presenting at the SQLSaturday in Nashville on January 18, 2014.  I actually got selected to present TWO topics, which is HUGE for me.  Hoping that I can refine a presentation, and get ready for our own SQLSaturday in Atlanta.

Working with “Biggish Data”

Most database professionals know (from firsthand experience) that there continues to be a “data explosion”, and there’s been a lot of focus lately on “big data”. But what do you do when your data’s just kind of “biggish”? You’re managing Terabytes, not Petabytes, and you’re trying to squeeze out as much performance out of your aging servers as possible. The focus of this session is to identify some key guidelines for the design, management, and ongoing optimization of “larger-than-average” databases. Special attention will be paid to the following areas:

  • query design
  • logical and physical data structures
  • maintenance & backup strategies

Managing a Technical Team: Lessons Learned

I got promoted to management a year ago, and despite what I previously believed, there were no fluffy pillows and bottles of champagne awaiting me. My team liked me, but they didn’t exactly stoop and bow when I entered the room. I’ve spent the last year relearning everything I thought I knew about management, and what it means to be a manager of a technical team. This session is intended for new managers, especially if you’ve come from a database (or other technical) background; topics we’ll cover will include:

  • How to let go of your own solutions.
  • Why you aren’t the model you think you are.
  • Why Venn diagrams are an effective tool for management.

PASS 2013 Summit Evals are out!

And I didn’t do too bad; wish I had done better.  I said that when I was done, I felt like it was a “B” level presentation, and it was; I got a 4 out of 5 on my evals.  If I had been a less experienced speaker, I would be thrilled with that; as it stands, I’m a little bummed.  I know that it’s tough to get accepted to speak at Summit, and I feel bad that I didn’t hit this one out of the park.

However, it was a great experience; 73 people attended my session, which is a big audience for me.  I struggled with my demos throughout (I don’t even want to listen to the audio because I’m worried about how bad it was), and I should have worked on finding ways to better connect with my audience.  The feedback I got was really constructive:

Was a good intro, just would have liked to have seen some broader examples. For example converting XML into relational tables, not in detail but just at a high level.

Lots of demos geared towards people who have already written a lot of XQuery. This should have been a 201 session. A discussion on why you’d even use the XML datatype would have been useful. What problem does the XML datatype even solve for people?

I think I would have benefitted from a hard copy (gasp) of the XML data.  I would have been able to see the data and compared it to your on screen results

Way too fast, too ambitious for a 101 session

Well put together and paced. Very clear and coherent

Scale back expectations if it really is a 101 level session

So it sounds like I didn’t do the best job of making my abstract clear; people had different expectations than what I had for what a 100 level course was supposed to be.  I do agree that it was too much content, and if I present on the topic again, I’ll be sure to go back to splitting this up to focus on the basics of XPath, and save a discussion of FLWOR for later.  Also, I really should have used demos much more judiciously; I kept running code and trying to work the magnifier, when I should have just used slides for the basics, and then done a much more thorough demo.

So what did I learn?  Connect with the audience first and foremost.  If I could have kept them engaged and entertained, I may have covered less material, but may have inspired them to do more research on their own (which in the end, is the point of this whole exercise).

quick blog from #sqlpass #summit13

Been a busy couple of days; hell, the last few weeks have been just nuts. I’m pausing for a few seconds to type a quick post from sunny Charlotte, and just fill in a few thoughts.

First, I think my XQuery session went reasonably well; I got bit by the demo gods, and shouldn’t have tried to use the magnifier without more practice, but I had a lot of questions and a few nods of approval. Overall, I think it was a B effort, and am hopeful that I can improve it.

Second, it’s always exciting to be back at Summit; kind of like New Year’s Day. I make lots of resolutions about how I want to get involved with an active and dynamic community. Let’s see how many of them stick. Mostly, I like being around smart people, and it’s been quite exciting to talk to some of the smartest people I know. I’ve had some great technical conversations with lots of people, and it’s given me a lot of things to mull over in terms of where I want to go and grow.

Third, I also got sucked into a lot of conversations about the whole PASS election/membership issue. My post about Allen Kinsel’s campaign seems to have kicked off more of a firestorm than I realized. I’ve had lots of people ask me what my thoughts were on the issue, and really, it’s kind of simple: We’re database people, and we need a plan to fix a data problem. I don’t have that plan, but there are lots of people who do (see my second point above about smart people).

Fourth, keynotes/sessions are awesome. I’m learning a lot, and I hope others are as well.

More to come soon.

SQL Server XQuery: Functions (sql:variable() & sql:column())

 

Like most query languages, XQuery has several functions that can be used to manipulate and query data.  SQL Server’s implementation supports a limited subset of the XQuery specification, but there’s a lot of power in the functions provided.  I hope to cover some of those functions in more detail at a later date, but for now I’d like to focus on a couple of very specific functions (sql:variable() & sql:column()).  These are proprietary extensions to XQuery, and they (like the xml methods I previously discussed) provide a bridge between the SQL engine and the XQuery engine.

For example, if you wanted to find the value of the third node of a simple XML document in SQL Server, you could do the following:

DECLARE @x XML ='<alpha>a</alpha><alpha>b</alpha><alpha>c</alpha>'
SELECT @x.value('(//alpha)[3]', 'varchar(1)')

The .value() method would return the letter “c” in the form of a varchar to the SQL engine.  However, if you wanted to do this dynamically, and specify which node to return based on a parameter, you would use the sql:variable() function, like so:

DECLARE @x XML ='<alpha>a</alpha><alpha>b</alpha><alpha>c</alpha>'

DECLARE @node INT = 3
SELECT @x.value('(//alpha)[sql:variable("@node")][1]', 'varchar(1)')

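The sql:variable() function takes a string literal (a value surrounded by double quotes) that names a SQL variable (in this case, @node) and exposes that variable’s value to the XQuery expression.  Nothing is literally concatenated into the XQuery string, but with @node set to 3, the above query is effectively seen as: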

(//alpha)[3][1]

by the XQuery engine.  In English, we are looking for the 3rd node named alpha.  You may wonder about the extra positional reference (“[1]”); the .value() method requires an expression that is guaranteed to return at most one node, and the XQuery engine can’t always verify that when the position is supplied at run time.  In this situation, we are telling the XQuery engine to return the first instance of the third alpha node.  Seems a bit clunky, but it works.  Looking at the execution plan, we can see that this is a relatively complex process, with multiple calls between the two sides of the query processor:

[Image: execution plan for the sql:variable() query]
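To show why a run-time position is useful, here is a rough sketch that walks the alpha nodes one at a time by incrementing the variable.  The @total variable and the node_position and val aliases are hypothetical names for illustration, and a loop like this is only meant to demonstrate sql:variable(); it isn’t a recommendation for production code.

-- Sketch only: step through each alpha node by position.
DECLARE @x XML ='<alpha>a</alpha><alpha>b</alpha><alpha>c</alpha>'
DECLARE @node INT = 1
DECLARE @total INT

-- count() tells us how many alpha nodes there are to visit
SET @total = @x.value('count(//alpha)', 'int')

WHILE @node <= @total
BEGIN
    -- the trailing [1] is still needed so .value() gets a single node
    SELECT @node AS node_position,
           @x.value('(//alpha)[sql:variable("@node")][1]', 'varchar(1)') AS val

    SET @node = @node + 1
END

Each pass through the loop returns one row, so the XQuery string itself never changes; only the value of @node does.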

The sql:column() function is similar, but is used to refer to a column instead of a parameter; this allows for the dynamic querying of an XML column on a row by row basis.  For example:

DECLARE @T TABLE ( ID INT, x XML )

INSERT  INTO @T
        ( ID, x )
VALUES  ( 1, '<alpha>a</alpha><alpha>b</alpha><alpha>c</alpha>' ),
        ( 2, '<alpha>a</alpha><alpha>b</alpha><alpha>c</alpha>' ),
        ( 3, '<alpha>a</alpha><alpha>b</alpha><alpha>c</alpha>' )

SELECT  ID, v=x.value('(//alpha)[sql:column("ID")][1]', 'varchar(1)')
FROM    @T

The above query will return a dataset like so:

ID  v
1   a
2   b
3   c
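sql:column() isn’t limited to positions; it can also supply a value to compare against.  As a small sketch (the letter column and the letter_found alias are made up for this example), the query below uses the .exist() method to check, row by row, whether the XML contains an alpha node whose text matches a value stored in another column of the same row:

-- Sketch only: compare node text to a column value on each row.
DECLARE @T TABLE ( ID INT, letter VARCHAR(1), x XML )

INSERT  INTO @T
        ( ID, letter, x )
VALUES  ( 1, 'b', '<alpha>a</alpha><alpha>b</alpha><alpha>c</alpha>' ),
        ( 2, 'z', '<alpha>a</alpha><alpha>b</alpha><alpha>c</alpha>' )

-- .exist() returns a bit: 1 when a matching alpha node is found, 0 when not
SELECT  ID, letter,
        x.exist('//alpha[. = sql:column("letter")]') AS letter_found
FROM    @T

Because .exist() only reports whether a match was found, no extra positional reference is needed here.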

Summary

SQL Server provides two functions for sharing information from the SQL engine to the XQuery engine: sql:variable() & sql:column().  The nature of these functions is pretty straightforward; you pass the name of either a variable or a column to an XML method, and the XQuery expression is evaluated using its value.