In How to start on an information management strategy, I discussed the steps that you need to take to develop a technical strategy.
Assuming that, as a result of that exercise, you’ve identified use cases where you need Big Data as part of your information management solution, there are 4 paths to choose from when deciding how to engage with Big Data.
1. Full Open Source path
Apache Hadoop and the associated ecosystem are developed by communities of developers and made freely available for use.
The benefits of following this path include:-
- You can add new tools and versions as soon as they become available
- You have access to the full set of open source tools out there
- You have software licensing cost savings
The drawbacks of this path include:-
- You bear the costs of doing your own support and of proving that the integration between the tools actually works
- Your developers will need to be very highly skilled, and thus be comparatively expensive
- New Apache Big Data open source projects are leapfrogging established open source projects. The need to maintain legacy code makes it very challenging to keep up to date with the latest innovations.
- Big Data will likely form only part of your information management solution, so integration with existing non-Big Data systems and business intelligence tools may prove challenging.
2. Hadoop-affiliated Packaged providers
There are 3 companies which provide packaged and supported versions of the core parts of the Hadoop ecosystem: Cloudera, MapR and Hortonworks. Hortonworks stays as close to the open source versions of the code as possible, whereas MapR keeps the external interfaces to Hadoop but recodes some of the underlying functions for improved performance. Cloudera sits somewhere in between, offering value-added services and backporting new functionality to the Hadoop project.
The benefits of buying a distribution from a packaged provider include:-
- They deal with the testing of components within the package and ensure that they work together
- They react less quickly to innovative solutions, which means that the distributions tend to be more stable
- They provide support
The drawbacks include:-
- Some parts of the distribution may be proprietary and, therefore, have limited available support
- Higher licensing costs compared to pure open source
- Big Data will likely form only part of your information management solution, so integration with non-Big Data systems and tools may prove challenging
3. Cloud computing providers
One of the better known cloud computing providers is Amazon Web Services (AWS), which offers a Hadoop package: Amazon Elastic MapReduce. I’ll cover this offering in more detail in a forthcoming article.
The benefits of this path include:-
- You can choose pre-configured Hadoop clusters which takes away a lot of admin headache
- There are no upfront costs
- Your solution can be scaled up or down dependent on your needs
- Your data is accessible from almost anywhere
Potential drawbacks include:-
- You don’t have the same level of security control as you would have in your own organisation.
- Your cloud will likely form only part of your overall system architecture. How do you easily connect your software on the cloud with in-house systems such as your user authentication server or billing system?
- Cloud-based applications can be slower to interact with and have fewer features than desktop-installed applications.
- You’re reliant on your cloud supplier for a substantial chunk of your business. If relationships sour, what is the migration path?
4. Established vendors
The information management “big beasts” have not sat idly by whilst the new kids steal their lunch. Oracle have partnered with Cloudera and produced their own integration tools to push and pull data from parts of the Hadoop ecosystem and NoSQL databases as well as their own proprietary Big Data solutions.
The benefits of working with an established vendor include:-
- They’re likely to be around for some time
- Most business intelligence and data integration tools already support them
- They can provide a solution for the complete information management architecture
- They provide extensive support
The drawbacks include:-
- Software licensing and support costs are high, which was one of the main reasons why open source projects developed in the first place
- They don’t provide integration with all of the Hadoop ecosystem and NoSQL databases.
- They are the slowest to adopt new Big Data innovations (e.g. Apache Spark, Apache Drill), meaning that you could choose a solution that doesn’t meet your low-latency non-functional requirements and be stuck with it until the established vendor adds support.
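One way to compare the 4 paths against your own priorities is a simple weighted scoring matrix. The sketch below is only an illustration: the criteria names, the 1 to 3 scores and the weights are all placeholder assumptions loosely based on the benefits and drawbacks listed above, not measurements, and you should replace them with your own assessment.

```python
# Placeholder scores (1 = weak, 3 = strong), loosely following the benefits
# and drawbacks listed in this article. These are assumptions, not data.
PATH_SCORES = {
    "Full open source": {"licence_cost": 3, "support": 1, "innovation_speed": 3, "integration": 1},
    "Packaged provider": {"licence_cost": 2, "support": 3, "innovation_speed": 2, "integration": 1},
    "Cloud provider": {"licence_cost": 2, "support": 2, "innovation_speed": 2, "integration": 1},
    "Established vendor": {"licence_cost": 1, "support": 3, "innovation_speed": 1, "integration": 3},
}


def rank_paths(weights, scores=PATH_SCORES):
    """Return (path, weighted total) pairs, highest total first."""
    totals = {
        path: sum(weights.get(criterion, 0) * value
                  for criterion, value in criteria.items())
        for path, criteria in scores.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical weights for a team that cares most about integration
    # with existing systems, then about vendor support.
    example_weights = {"licence_cost": 1, "support": 2,
                       "innovation_speed": 1, "integration": 3}
    for path, total in rank_paths(example_weights):
        print(f"{path}: {total}")
```

Changing the weights changes the ranking, which is the point: a team that values fast access to new tools above everything else will see the full open source path rise to the top, whereas one that values integration and support will not.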