Choosing which Big Data software path to go down

In the article How to start on an information management strategy, I discussed the steps that you need to take to develop a technical strategy.

Assuming that, as a result of that exercise, you’ve identified use cases where you need Big Data as part of your information management solution, there are 4 paths to choose from in deciding how to engage with Big Data.

1. Full Open Source path

Apache Hadoop and the associated ecosystem are developed by communities of developers and made freely available for use.

The benefits of following this path include:-

  • You can add new tools and versions as soon as they become available
  • You have access to the full set of open source tools out there
  • You have software licensing cost savings

The drawbacks of this path include:-

  • You bear the cost of doing your own support and of proving that the integration between each tool actually works
  • Your developers will need to be very highly skilled, and thus be comparatively expensive
  • New Big Data Apache open source projects are leapfrogging established open source projects. The need to maintain legacy code soon makes keeping up to date with the latest innovations very challenging.
  • Big Data will likely form only part of your information management solution, so integration with existing non-Big Data systems and business intelligence tools may prove challenging.

2. Hadoop-affiliated Packaged providers

There are 3 companies which provide packaged and supported versions of the core parts of the Hadoop ecosystem – Cloudera, MapR & Hortonworks. Hortonworks stays as close to the open source versions of the code as possible, whereas MapR keeps the external interfaces to Hadoop but recodes some of the underlying functions for improved performance. Cloudera sits somewhere in between, offering value-added services and backporting new functionality to the Hadoop project.

The benefits of buying a distribution from a packaged provider include:-

  • They deal with the testing of components within the package and ensure that they work together
  • They react less quickly to innovative solutions, which means that the distributions tend to be more stable
  • They provide support

The drawbacks include:-

  • Some parts of the distribution may be proprietary and, therefore, have limited available support
  • Higher licensing costs compared to pure open source
  • Big Data will likely form only part of your information management solution, so integration with non-Big Data systems and tools may prove challenging

3. Cloud computing providers

One of the better-known cloud computing providers is Amazon Web Services (AWS), which offers a Hadoop package – Amazon Elastic MapReduce. I’ll cover this offering in more detail in a forthcoming article.

The benefits of this path include:-

  • You can choose pre-configured Hadoop clusters, which takes away a lot of admin headache
  • There are no upfront costs
  • Your solution can be scaled up or down depending on your needs
  • Your data is accessible from almost anywhere

Potential drawbacks include:-

  • You don’t have the same level of security control as you would have in your own organisation.
  • Your cloud will likely form only part of your overall system architecture. How do you easily connect your software on the cloud with in-house systems such as your user authentication server or billing system?
  • Cloud-based applications can be slower to interact with and have fewer features than desktop-installed applications.
  • You’re reliant on your cloud supplier for a substantial chunk of your business. If relationships sour, what is the migration path?

4. Established vendors

The information management “big beasts” have not sat idly by whilst the new kids steal their lunch. Oracle have partnered with Cloudera and produced their own integration tools to push and pull data from parts of the Hadoop ecosystem and NoSQL databases as well as their own proprietary Big Data solutions.

IBM have similarly produced their own integration tools to Hadoop and NoSQL and their own proprietary Big Data solutions.

The benefits of working with an established vendor include:-

  • They’re likely to be around for some time
  • Most business intelligence and data integration tools already support them
  • They can provide a solution for the complete information management architecture
  • They provide extensive support

The drawbacks include:-

  • Software licensing and support costs are high, which was one of the main reasons why open source projects developed in the first place
  • They don’t provide integration with all of the Hadoop ecosystem and NoSQL databases.
  • They are the slowest to adopt new Big Data innovations, e.g. Apache Spark and Apache Drill, meaning that you could choose a solution that doesn’t meet your low-latency non-functional requirements and be stuck with it until the established vendor adds support.
