Big Data

5 #DevOps Books I plan to finish this year

New Year.  Resolutions, etc. 🙂

I’m notoriously bad about starting a book and never finishing it, particularly when it’s a technical book.  My goal this year is to finish the following 5 books:

The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations

Gene Kim is perhaps best known for his novel “The Phoenix Project”, which lays out the fundamental precepts for DevOps.  The Handbook (by Kim, Patrick Debois, John Willis, and Jez Humble) gets great reviews, and I think it does a good job of translating theory into practice.  I’ve only finished about a third of it, so I’ve still got a lot of reading left to do, but I hope to finish it soon.

Site Reliability Engineering: How Google Runs Production Systems

This one makes it a little easier to cheat on my goal; I’ve already read most of it.  It’s a collection of papers written by various SREs within Google, and gives some great insights into their vision of applying development principles to operations problems.  While it could be argued that the SRE model is distinct from DevOps, there’s enough overlap that it makes sense to apply these techniques to my DevOps study.

Level Up Your Life: How to Unlock Adventure and Happiness by Becoming the Hero of Your Own Story

This one’s a bit of a stretch for most DevOps folks, but if you think of it as an approach to continual personal improvement, then it makes sense why this book belongs in a DevOps collection.  I started reading this one last year, and quickly fell off the wagon.  My goal is to finish it by the middle of the year, and hopefully begin to apply some of the principles to my personal and professional challenges.

The Art of Capacity Planning: Scaling Web Resources in the Cloud

I heard John Willis at DevOpsDays Nashville this year, and he recommended following and reading John Allspaw (among other people); the second edition of this book is coming out this year, so I’ll probably wait till it arrives.  While I don’t do much with either web or cloud development, the principles of scaling are relevant to all kinds of applications.

Team of Teams: New Rules of Engagement for a Complex World

Damon Edwards actually recommended this book during a webcast I saw a couple of months ago, and while it’s not a technical book, it speaks to the art of transforming a large, complex organization with entrenched policies into a nimble, responsive team.  Brownfield to greenfield (with military references).

#BigData is coming; what should SQL Server people do about it?

I’ve been presenting a lot on Big Data (specifically Hadoop) from the perspective of a SQL Server DBA, and I’ve made a couple of recent observations.  I think most people are aware of the fact that data generation is growing at a staggering rate, with some estimates as high as 44 zettabytes by the year 2020; what I think is lacking in the SQL Server community is a rapid movement among database professionals to expand their skills to highly scalable Big Data platforms (like Hadoop) or streaming technologies.  Don’t get me wrong; I think there are people out there who have made the transition (like Michelle Ufford; SQLFool, now Hadoopsie), and are willing to share their knowledge, but by and large, I think most SQL Server professionals are accustomed to working with our precious relational system.

Why is that?  I think it boils down to three reasons:

  1.  The SQL Server platform is a complex product, with ever increasing opportunities to learn something new.  SQL 2016 is about to drop, and it’s a BIG release; I expect most SQL Server people to wrap themselves up in new features and learn something new soon.  There’s always going to be a need for deep expertise, and as the product continues to mature and grow, it requires deeper knowledge.
  2. Big Data tools are vast, untamed, and very organic.  Those of us accustomed to the Microsoft development cycle are used to having a single official product drop every couple of years; Big Data tools (like Hadoop) are open-source, prone to various forks, and very rapidly developed.  It’s like drinking from a firehose.
  3. It’s not quite clear how it all fits together.  We know that Microsoft has presented some interesting data technologies as of late, but it’s not quite clear how the pieces all work together; should SQL Server pros learn Azure, HDInsight, Hadoop?  What’s this about U-SQL?  StreamInsight, Spark, Cortana Analytics?

The first two reasons aren’t easily solved; they require a willingness to learn and a commitment to study (both of which are difficult resources to commit).  The third issue, however, can be easily addressed by the following graphic.

This is Microsoft’s generic vision of a complete end-to-end analytics platform; for the data professional, it’s a roadmap of skills to learn.  Note that relational engines (and their BI cousins) remain a part of the vision, but they’re only small pieces in an ever-increasing ecosystem of database tools.

So here’s the question for you; what should SQL Server people do about it?  Do we continue to focus on a very specific tool set, or do we push ourselves (and each other) to learn more about the broader opportunities?  Either choice is equally valid, but even if you choose to become an expert on a single platform in lieu of transitioning to something new, you should understand how other tools interact with the relational system.

What are you going to learn today?

Hadoop for the SQL Server DBA – Initial Challenges

I’ve been intrigued by the whole concept of Big Data lately, and have started actually presenting a couple of different sessions on it (one of which was accepted for PASS Summit 2014).  Seems only right that I should actually *gasp* blog about some of the concepts in order to firm up some of my explanations.  Getting started with Hadoop can be quite daunting, particularly if you’re used to relational databases (especially the commercial ones); I hope that this series of posts can help clear up some of the mystery for the administrative side of the house.  Before we dive in, I think it’s only fair to lay out some of the initial challenges with discussing Big Data in general, and Hadoop specifically.  Depending on your background, some of these may be more challenging than others.

Rapid Evolution

Welcome to the wild, wild west.  If you come from a commercial database background (like SQL Server), you’re probably accustomed to a mature product.  For Microsoft SQL Server, a new version gets released on what appears to be a 2-4 year schedule (SQL 2005 -> 2008 -> 2012 -> 2014); of course, there’s always the debate as to what constitutes a major release (2008 R2?), but in general, the core product gets shipped with new functionality, and there’s some time before additional new functionality is released.

Hadoop’s approach to the release cycle is much looser; in 2014 alone, there have been two “major” releases with new features and functionality included.  Development for the Hadoop engine is distributed, so the release and packaging of new functions may vary within the ecosystem (more on that in a bit).  For developers, this is exciting; for admins, this is scary.   Depending on how acceptable change is within your operational department, the concept of rolling out an upgraded database engine every 3-4 months may be daunting.
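To put rough numbers on that difference in cadence, here is a small back-of-the-envelope sketch.  The release years are the ones quoted above; the 3 to 4 month Hadoop interval is simply the figure from the previous paragraph, not an official schedule, and the function name is made up for illustration.

```python
# SQL Server release years as quoted in the post.
sql_server_releases = [2005, 2008, 2012, 2014]


def release_gaps(years):
    """Return the number of years between each consecutive pair of releases."""
    return [later - earlier for earlier, later in zip(years, years[1:])]


gaps = release_gaps(sql_server_releases)

# Average time between SQL Server releases, expressed in months.
average_gap_months = 12 * sum(gaps) / len(gaps)

# Assumption for illustration only: an upgrade every 3 to 4 months,
# taken from the "every 3-4 months" figure mentioned in the post.
hadoop_gap_months = (3, 4)

print("Years between SQL Server releases:", gaps)
print("Average gap in months:", average_gap_months)
for months in hadoop_gap_months:
    ratio = average_gap_months / months
    print(f"Upgrades per SQL Server cycle at one every {months} months: {ratio}")
```

In other words, under these assumptions an admin could see somewhere around nine to twelve engine upgrades in the time it takes SQL Server to ship one new version.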

Ecosystems, not products

Hadoop is an open-source product, so if you’re experienced with other open-source products like Linux, you probably already understand what that means; open-source licensing means that vendors can package the core product into their own offerings, subject to the terms of the license (Hadoop itself is released under the permissive Apache License 2.0).  This usually means that commercial providers will either bundle an open-source product with their own proprietary side-by-side software (“we interface with MySQL” or “we run on Linux”), or they release their modified version of the software in a completely open fashion and earn revenue from a support contract (e.g., Red Hat).  In either case, it’s an ecosystem, not a canned product.

Hadoop technically consists of four modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

However, take a look at the following framework from Hortonworks (the Microsoft partner for Hadoop):

[image: hortonworks]

There’s lots of stuff in there that’s being developed, but isn’t officially Hadoop.  It could become part of the official stack at some point, or it may not.  Other vendors may adopt it, or they may not.  Each of these components has its own update schedule (again, change!).  There is some flexibility in this approach (you can upgrade only the individual components), but it does make the road map complex compared to traditional database platforms.

Big Data doesn’t always mean Big Data.

Perhaps the hardest thing to embrace about Big Data in general (not just Hadoop) is that the nomenclature doesn’t necessarily line up with the driving factors; a Big Data approach may be the best approach for smaller data sets as well.   In essence, data can be described in terms of the 4 V’s:

  1. Volume – The amount of data held
  2. Velocity – The speed at which the data should be processed
  3. Variety – The variable sources, processing mechanisms and destinations required
  4. Value – The amount of data that is viewed as non-redundant, unique, and actionable

A distributed approach (like Hadoop) is usually appropriate when tackling more than one of these four V’s; if your data’s just large, but low in velocity, variety, and value, a single installation of SQL Server (with a lot of disk space) may be appropriate.  However, if your data has a lot of variety and a lot of velocity, even if it’s small, a Big Data approach may yield considerable efficiency.  The point is that big data alone is not necessarily the impetus for using Hadoop at all.
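The rule of thumb above can be sketched as a tiny decision helper.  This is only an illustration of the "more than one V" heuristic; the class, the function, and the yes/no ratings are hypothetical, and in practice each rating is a judgment call rather than a measurement.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    """Rough, subjective ratings of a data set against the four V's."""
    high_volume: bool
    high_velocity: bool
    high_variety: bool
    high_value: bool


def suggest_platform(profile: DataProfile) -> str:
    """Apply the heuristic: more than one demanding V suggests a distributed approach."""
    demanding = sum([
        profile.high_volume,
        profile.high_velocity,
        profile.high_variety,
        profile.high_value,
    ])
    if demanding > 1:
        return "consider a distributed approach (e.g. Hadoop)"
    return "a single relational server may be enough"


# Large but otherwise undemanding data: volume alone does not tip the scale.
just_big = DataProfile(high_volume=True, high_velocity=False,
                       high_variety=False, high_value=False)
print(suggest_platform(just_big))

# Small data that arrives fast and in many shapes.
small_but_messy = DataProfile(high_volume=False, high_velocity=True,
                              high_variety=True, high_value=False)
print(suggest_platform(small_but_messy))
```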

Summary

Big Data & Hadoop are complex topics, and they’re difficult to understand if you approach them from a traditional RDBMS mentality.  However, understanding the fundamentals (Big Data approaches are rapidly evolving, spread across an ecosystem of disparate tools, and applicable to more than just volume) can lay a foundation for tackling the platforms.

First few bites of the elephant: working with Hortonworks Hadoop

So a few weeks ago, I mentioned that I was starting to diversify my data interests in hopes of steering my career path a bit; I’ve built a home-brewed server, and downloaded a copy of the Hortonworks Sandbox for Hadoop.  I’ve started working through a few tutorials, and thought I would share my experiences so far.

My setup….

I don’t have a lot of free cash to set up a super-duper learning environment, but I wanted to do something on-premises.  I know that Microsoft has HDInsight, the cloud-based version of Hortonworks, but I’m trying to understand the administrative side of Hadoop as well as the general interface.  I opted to upgrade my old fileserver to a newer rig; costs ran about $600 for the following:

ASUS|M5A97 R2.0 970 AM3+ Motherboard   
AMD|8-CORE FX-8350 4.0G 8M CPU   
8Gx4|GSKILL F3-1600C9Q-32GSR Memory   
DVD BURN SAMSUNG | SH-224DB/BEBE  DVD Burner

I already had a case, power supply, and a couple of SATA drives (sadly, my IDE drives no longer work; also the reason for purchasing a DVD burner).  I also had a licensed copy of Windows 7 64-bit, as well as a few development copies of Microsoft applications from a few years ago (oh, how I wish I was an MVP….).

As a sidebar, I will NEVER purchase computer equipment from Newegg again; their customer service was horrible.  A few pins were bent on the CPU, and it took nearly 30 days to get a replacement, and most of that time was spent with little or no notification.

I downloaded and installed the Hortonworks Sandbox using the VirtualBox version.  Of course, I had to reinstall after a few tutorials because I had skipped a few steps; after going back and following the instructions, everything is just peachy.  One of the nice benefits of the VirtualBox setup is that once I fire up the Hortonworks VM on my server, I can use a web browser on my laptop pointed to the server’s IP address with the appropriate port added (e.g., xxx.xxx.xxx.xxx:8888), and bam, I’m up and running.
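If the browser doesn’t connect, a quick check from the laptop can tell you whether the port is reachable at all.  Here’s a minimal sketch using only the Python standard library; the helper names are made up, the `http://` scheme is an assumption, and the address in the example is a placeholder from the documentation range, so substitute your own server’s IP.

```python
import socket


def sandbox_url(host: str, port: int = 8888) -> str:
    """Build the address to paste into the browser (http scheme is an assumption)."""
    return f"http://{host}:{port}"


def is_port_open(host: str, port: int = 8888, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Placeholder address for illustration only; use the IP of the VM host.
print(sandbox_url("192.0.2.10"))
```

Calling `is_port_open("your.server.ip")` before opening the browser separates "the VM isn’t listening" from "the browser is misbehaving"; a False result usually points at the VM, its network settings, or a firewall.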

Working my way through a few tutorials

First, I have to say, I really like the way the Sandbox is organized; it’s basically two frames: the tutorials on the left, and the actual interface into a working version of Hadoop on the right.  It makes it very easy to go through the steps of the tutorial.

image

The Sandbox has lots of links and video clips to help augment the experience, but it’s pretty easy to get up and running on Hadoop; after only a half-hour or so of clicking through the first couple of tutorials, I got some of the basics down for understanding what Hadoop is (and is not); below is a summary of my initial thoughts (WARNING: these may change as I learn more).

Summary:

  • Hadoop is made up of several different data access components, all of which have their own history.  Unlike a tool like SQL Server Management Studio, the experience may vary depending on what tool you are using at a given time.  The tools include (but are not limited to):
    • Beeswax (Hive UI): Hive is a SQL-like language, and so the UI is probably the most familiar to those of us with RDBMS experience.  It’s a query editor.
    • Pig is a procedural language that abstracts the data manipulation away from MapReduce (the underlying engine of Hadoop).  Pig and Hive have some overlapping capabilities, but there are differences (many of which I’m still learning).
    • HCatalog is a relational abstraction of data across HDFS (Hadoop Distributed File System); think of it like the DDL of SQL.  It defines databases and tables from the files where your actual data is stored; Hive and Pig are like DML, interacting with the defined tables.
  • A single-node Hadoop cluster isn’t particularly interesting; the fun part will come later when I set up additional nodes.
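Since Pig and Hive both hide MapReduce from you, it helped me to picture what they’re abstracting away.  The sketch below is plain Python running in a single process, not Hadoop code; it only mimics the map, shuffle, and reduce steps for the classic word count, and all of the function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample = ["pig and hive", "hive and hcatalog"]
print(word_count(sample))
```

On a real cluster, the map and reduce steps would run in parallel on the nodes that hold the data, which is also why a single-node setup doesn’t show off much.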

Back on the trail…. #sqlsatnash

I realize that I should probably be blogging about my New Year’s resolutions, but meh… I’ve been super busy surviving the holidays.  So busy in fact that I’ve failed to mention that I’ll be presenting at the SQLSaturday in Nashville on January 18, 2014.  I actually got selected to present TWO topics, which is HUGE for me.  Hoping that I can refine a presentation, and get ready for our own SQLSaturday in Atlanta.

Working with “Biggish Data”

Most database professionals know (from firsthand experience) that there continues to be a “data explosion”, and there’s been a lot of focus lately on “big data”. But what do you do when your data’s just kind of “biggish”? You’re managing Terabytes, not Petabytes, and you’re trying to squeeze as much performance out of your aging servers as possible. The focus of this session is to identify some key guidelines for the design, management, and ongoing optimization of “larger-than-average” databases. Special attention will be paid to the following areas:

  • query design
  • logical and physical data structures
  • maintenance & backup strategies

Managing a Technical Team: Lessons Learned

I got promoted to management a year ago, and despite what I previously believed, there were no fluffy pillows and bottles of champagne awaiting me. My team liked me, but they didn’t exactly stoop and bow when I entered the room. I’ve spent the last year relearning everything I thought I knew about management, and what it means to be a manager of a technical team. This session is intended for new managers, especially if you’ve come from a database (or other technical) background; topics we’ll cover will include:

  • How to let go of your own solutions.
  • Why you aren’t the model you think you are.
  • Why Venn diagrams are an effective tool for management.