You’ve read a few articles about it. Maybe you’re already on the Big Data journey and have adopted some of the wide array of technologies. However, for many people, there is still a lot of myth and rumour around Big Data. Why is that?
One of the reasons is this: I've read more articles about Big Data than I care to remember, and the overriding theme for the vast majority is theory and concept. I like to deal in facts, experience and empirical data and, having worked on six Big Data projects, I have my share of battle scars. In this blog series, I will pass on some of that insight and provide actual recommendations, not just hypotheses.
Questions, questions, and more questions
The sensible place to start for any Big Data implementation is by asking questions, and primarily questions about your business, not just the technology. What are the business outcomes that you want to achieve? What will differentiate you from your competitors? What is the one question about your business that you would like to answer? Having this enquiring mindset is a good place to start.
Do you opt for Cloud or On-premise? Or hybrid? If you go Cloud, is it Amazon Web Services or Microsoft Azure? The mainstay of all things Big Data is Hadoop, so do you opt for a commercial distribution? That's one of the easier questions to answer, and the answer is 'Yes'. Nobody installs Hadoop directly from the Apache binaries any more, as several aghast engineers told me after I'd tried (unsuccessfully) to do just that.
OK, you’ve decided it’s Hadoop, so is it Hortonworks, Cloudera or MapR? Do you run Spark on top of it? And what about NoSQL? Is it Cassandra, Couchbase, MongoDB or HBase? OK, you get the picture.
Be outcome-led, not technology-led
The point is that there is no off-the-shelf architecture to be had for a Big Data implementation. Primarily, it depends on the specific business outcomes that you want to achieve, and which Use Cases then fit within those desired outcomes. Be outcome-led, not technology-led.
So, once you’ve defined your target architecture, what then? Again, there’s no silver bullet in all this. Whilst it’s easy to spin up these environments to run a Proof of Concept, the real challenges come later. There are at least two commonly recognised pain points; they are data ingestion and… well, I’ll tell you about the other one next time.
One of the main reasons for the difficulties encountered with data ingestion is data governance, or more accurately, the lack of it. To some, the concept of a Data Lake is synonymous with a Data Dump: let’s just chuck everything in there and worry about the other stuff later. The sooner you can embed things like data classification, taxonomy, security and audit trail into your ingestion processes, the better. I for one will be watching the Apache Atlas project, which seeks to address some of these issues, with interest.
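To make that concrete, here is a minimal sketch of what "embedding governance at ingestion" might look like. Everything in it (the `IngestionCatalog` class, its field names, the example dataset) is an illustrative assumption of mine, not a real library or the Atlas API; the point is simply that classification, taxonomy, ownership and an audit trail are captured at the moment data lands, not bolted on later.

```python
# Illustrative sketch only: a toy catalogue that records governance
# metadata at ingestion time. Names are hypothetical, not a real API.
import hashlib
from datetime import datetime, timezone


class IngestionCatalog:
    """Records classification, taxonomy and an audit trail for each
    dataset as it lands in the lake, rather than leaving governance
    as an afterthought."""

    def __init__(self):
        self.entries = {}

    def ingest(self, dataset_name, payload, classification,
               taxonomy_path, owner):
        # A checksum gives a verifiable record of exactly what landed.
        checksum = hashlib.sha256(payload).hexdigest()
        self.entries[dataset_name] = {
            "classification": classification,  # e.g. "PII" or "public"
            "taxonomy": taxonomy_path,         # e.g. "sales/emea/orders"
            "owner": owner,
            "checksum": checksum,
            "audit": [{"action": "ingested",
                       "at": datetime.now(timezone.utc).isoformat()}],
        }
        return checksum


catalog = IngestionCatalog()
catalog.ingest("orders_q1", b"raw,csv,rows", classification="PII",
               taxonomy_path="sales/emea/orders", owner="data-eng")
entry = catalog.entries["orders_q1"]
```

In a real implementation this metadata would live in a shared metastore (which is precisely the gap Apache Atlas aims to fill), but even a simple discipline like this stops the lake degrading into a dump.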
In summary, don’t make the mistake of thinking that Big Data/a Data Lake will magically solve all your data quality issues. However, done correctly, it will provide new and exciting insight for your organisation, and can serve as a catalyst to improve overall data quality at source.
Next time I will be talking about that second Big Data project pitfall…