Blog 03: Part 03 – Disaster Recovery

Going to depart from my original plan for entry three this week to talk about disaster recovery, since I experienced a hard drive failure on my work laptop on Wednesday. DR is an important aspect of data architecture, so it still fits thematically with the other entries.  First though, some context.  I would say that my organization has an unstructured data problem.  We have tons and tons of unclassified data sitting on end user hard drives, network file shares, and cloud storage environments.  Tons of it.  Duplicate data.  Incorrect data.  Corrupt data.  End users who bother to back up their personal data do it poorly: it’s not uncommon for someone’s personal file share to be filled with manual, redundant copies (Jan, Jan-Feb, Jan-Mar, Jan-Apr, and so on).  Some of them even encrypted their data, which sounds good on the face of it, but they didn’t use an appropriate managed encryption system, so not if but WHEN they lose or forget their password, nobody is able to unlock it for them.  The rest of the user base doesn’t back up at all.

Nobody cares about backups until they need them…then it’s always IT’s fault that they don’t exist, and everyone is scrambling (and paying a lot of money) to recover the data.  I think my favorite story was a sales guy who spilled wine on his laptop.  The hard drive was toast, but we have an agreement with data recovery vendors who can actually recover data from fairly destroyed drives, as long as we pay through the nose.  The sales guy insisted he had important data that needed recovery, so away the drive was sent.  When the ~$8,000 bill arrived, it prompted some questions: he had to justify the expense.  Turns out, the data he was after was for his fantasy football team.  Whoops.

For the last several years, end user PCs have been backed up to the cloud (much like our servers).  It happens automatically, many times a day, and incrementally: only files that have changed are backed up.   That’s great for the end user, but all of these backups of (questionably useful) data take up bandwidth, and bandwidth isn’t free.  In fact, due to the cost of that bandwidth plus the cost of the backup service itself increasing from $4 per user per month to $9.50 per user per month, the company made the decision to end the cloud backup offering.  At the same time, for data loss prevention reasons, all writing to externally mounted volumes is now blocked as well.  The only means for end users to back up their data is by manually copying it to internal cloud storage services designed for collaboration, not archival purposes.   This has generated a great deal of animosity towards IT; however, it has saved quite literally millions of dollars, since we have ~300K employees globally.
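To make the “incremental” part concrete, here’s a minimal sketch of how a backup agent typically decides what to upload: keep a manifest of file hashes from the last run and only send files whose hash changed. The real product we used is commercial and closed, so the manifest file name and the `upload_to_cloud` call here are purely hypothetical.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("backup_manifest.json")  # hypothetical local state from the last run


def file_hash(path: Path) -> str:
    """Return a SHA-256 digest of the file contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(root: Path) -> list[Path]:
    """Compare current hashes to the previous manifest; return only new or changed files."""
    previous = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    current, to_upload = {}, []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        digest = file_hash(path)
        current[str(path)] = digest
        if previous.get(str(path)) != digest:
            to_upload.append(path)
    MANIFEST.write_text(json.dumps(current))
    return to_upload


# Example: each run only touches the delta, which is what keeps bandwidth use down
# for path in changed_files(Path.home() / "Documents"):
#     upload_to_cloud(path)  # hypothetical upload call
```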

Blog 03: Part 02 – Mastering the Data

After last post’s discussion of some of the pitfalls of leaping into a big data initiative unprepared, I’d like to focus now on some of the less technical reasons why an organization may struggle with data management.  It’s the annoying little brother of Big Data and the subject that makes everyone’s eyes glaze over when it’s brought up: Governance!  In this case, Master Data Management (MDM).  At its core, MDM is a single file, a single point of reference, which links all enterprise data together.  This is critical, especially in larger organizations with lots and lots of (dare I say BIG) data, where the data is shared across different business functions and discrepancies could be problematic, especially with non-integrated or poorly integrated applications.
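To make the “single point of reference” idea less abstract, here’s a toy sketch of a master (“golden”) record that cross-references the identifiers each application uses for the same customer. The system names and fields are invented for illustration; they aren’t from the Gartner reading or my organization.

```python
from dataclasses import dataclass, field


@dataclass
class MasterCustomer:
    """Golden record: one canonical entry linking each system's local identifier."""
    master_id: str
    legal_name: str
    # Map of source system -> that system's identifier for this same customer
    source_ids: dict[str, str] = field(default_factory=dict)


# Hypothetical example: CRM, ERP, and billing all know this customer by different keys
acme = MasterCustomer(
    master_id="CUST-000123",
    legal_name="Acme Industrial Ltd.",
    source_ids={"crm": "0015000000XyZ", "erp": "480021", "billing": "ACME-IND"},
)


def resolve(master: MasterCustomer, system: str) -> str:
    """Translate the enterprise master ID into a given application's local ID."""
    return master.source_ids[system]


print(resolve(acme, "erp"))  # -> "480021"
```

The point isn’t the data structure itself; it’s that every integrated application agrees to key back to the same reference, which is exactly what breaks down when governance is an afterthought.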

The Gartner reading (G00214129) does a great job of highlighting the pitfalls of paying lip service to MDM.  The most common way I’ve seen this happen is putting the onus of governance on the data entry folks.  Gartner states the issues with that are:

  • Loss of morale as some users leave the team. IT shared services is not a desirable career move for a lot of “power users” that could have seen line management as their future career path.
  • Realization that the movement of “governance” from line of business to shared services creates a vacuum in business as those users are removed from any responsibility for data governance. The shared services team “loses touch” with business and so “governance” starts to erode and break down.
  • The end state results in the accidental separation of “data entry” (which works well in shared services) and data stewardship that breaks down, since the expectation is that shared services can execute this, when in fact they cannot, so it does not happen.

For bullet two, this happens even faster than most people think, because oftentimes the shared service that data entry is farmed out to is considered IT or is contracted out; it’s severed from the business from the outset.

The key takeaway here is that a successful BIG DATA program isn’t simply about storing large amounts of data on silicon. Data should be considered an enterprise resource and it must be maintained so it remains an asset, not a liability, to the business.

Blog 03: Part 01 – BIG DATA

“Big Data” is one of those terms I rank up there with “Cloud” and “Internet of Things” as IT buzzwords people like to throw around to sound like they’re actually saying something of substance. It’s true that one of the great use cases of computers is the organization (and analysis) of data.  But now data is BIG.  We keep it in a lake.  Or a warehouse. And we have entire teams of analysts whose sole job is generating business value from interpreting the data…but as it turns out, that is much easier said than done.  Because it is much, much easier to collect and store data than it is to draw meaningful conclusions from it.

IBM has a nice graphic here on the so-called “Four Vs” of data, and I agree it makes for a nice alliterative categorization.  I’ll talk about each in turn.

  • Volume – As I mentioned in the intro, in terms of technical expertise and physical technology, it is easier and cheaper to store data today than ever, and organizations seem to choose the digital pack rat methodology: when in doubt, keep it!  Even at the individual level, people hoard data…in this case unstructured data, but data nonetheless.  Despite my organization having a fairly draconian document retention policy, it is rarely enforced, and on average each employee has about 3 GB of data per year of service.  But just like no one can ever pull out that file they KNOW they have SOMEWHERE in a timely manner, enterprise data is often at the mercy of poorly written database queries or business analysts who must have been sleeping during STAT 200.
  • Variety – All data is equal, but some is more equal than others.  To manage this inequality, it is important to have a data classification model that helps prioritize how data should be accessed and secured.  A common model, from least to most sensitive, is: External, Internal, Confidential, and Restricted.  Equifax should take note.
  • Velocity – This is one of the chief contributors to the Big Data problem: the speed with which data is created.  BAs work at a pretty high level, generating dashboards that aggregate and distill everything down to a few charts or bullet points, but very often the devil is in the details.  Lots of shop floor equipment now comes standard with logging capability and there is a ton of operating data available, but much of it is fluff.  Who cares if your machine can log operating temperature to within a tenth of a degree when temperature doesn’t impact any production metric?
  • Veracity – This is the killer.  The trouble with a better-“save”-than-sorry mentality towards data retention is that very often you have conflicting sources for the same data.  This is especially problematic when integrating multiple systems: which system shall be the source of record?  And how do we reconcile the differences?  (A toy sketch of one reconciliation approach follows this list.)
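In practice, reconciliation usually boils down to declaring which system “wins” for each attribute and flagging the disagreements for a human to review. The records, field names, and precedence rules below are assumptions made up for illustration, not anything my organization actually runs.

```python
# Toy reconciliation: two systems disagree about the same customer record.
# The precedence map (which system wins per field) is an illustrative assumption.
PRECEDENCE = {"address": "erp", "email": "crm", "credit_limit": "erp"}

crm_record = {"address": "12 Main St", "email": "buyer@acme.example", "credit_limit": 50000}
erp_record = {"address": "12 Main Street", "email": "old@acme.example", "credit_limit": 75000}
sources = {"crm": crm_record, "erp": erp_record}


def reconcile(sources: dict[str, dict]) -> tuple[dict, list[str]]:
    """Build a merged record using per-field precedence; collect conflicts for review."""
    merged, conflicts = {}, []
    for field_name, winner in PRECEDENCE.items():
        values = {name: rec.get(field_name) for name, rec in sources.items()}
        merged[field_name] = values[winner]
        if len(set(values.values())) > 1:
            conflicts.append(f"{field_name}: {values} -> kept {winner}")
    return merged, conflicts


merged, conflicts = reconcile(sources)
print(merged)
for c in conflicts:
    print("CONFLICT", c)
```

Even a crude rule like this forces the uncomfortable conversation that most Big Data programs skip: someone has to decide, per attribute, which source is authoritative.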