In previous articles, I presented the components that make up a modern information management system architecture, as well as examples of where companies use Big Data technology and poor reasons for adopting “Big Data”.
Tying these articles together, where would we best place a “Big Data” solution component in the overall architecture?
The obvious choices at the moment, in my opinion, are:-
1. The Enterprise Data Hub – Here, using the Hadoop Distributed File System (HDFS) on cheap commodity hardware makes a lot of sense for storing all source data in unadulterated form. Alternatives such as a traditional relational database or an undistributed filesystem fail on either cost or retrieval performance. Security concerns can be addressed by ensuring that end users have no direct access to the enterprise data hub, and that the administrators who do need access are security-checked and work only in a secured environment that prevents data loss.
2. The Data Archive – Moving warm data (data that is rarely used but still benefits from being online) from your enterprise data warehouse to HDFS shifts it from expensive primary storage to cheap commodity storage, and improves the performance of the warehouse itself.
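To make the data hub idea concrete, here is a minimal sketch of a landing-zone ingest routine. It copies source files unaltered into a date-partitioned directory layout, a convention commonly used for HDFS landing zones; a local directory stands in for HDFS here, and the function and path names are hypothetical.

```python
import shutil
from datetime import date
from pathlib import Path

def land_raw_file(source_path, hub_root, source_system):
    """Copy a source file into the data hub unaltered, under a
    date-partitioned path (year=/month=/day=), mimicking a common
    HDFS landing-zone layout. A local directory stands in for HDFS."""
    today = date.today()
    target_dir = (Path(hub_root) / source_system /
                  f"year={today.year}" / f"month={today.month:02d}" /
                  f"day={today.day:02d}")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_path).name
    shutil.copy2(source_path, target)  # copy byte-for-byte, never transform
    return target
```

In a real deployment the copy step would be an `hdfs dfs -put` or a tool such as Flume or Sqoop, but the principle is the same: the hub holds the raw data exactly as it arrived.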
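The warm-data archive in point 2 boils down to an age-based tiering policy. The sketch below illustrates the idea with local directories standing in for primary storage and HDFS; the function name and threshold are assumptions for illustration only.

```python
import shutil
import time
from pathlib import Path

def archive_warm_files(primary_dir, archive_dir, max_age_days):
    """Move files not modified within max_age_days from primary
    (expensive) storage to an archive directory. In a real deployment
    the archive target would be HDFS; local directories stand in here."""
    cutoff = time.time() - max_age_days * 86400
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = []
    for f in sorted(Path(primary_dir).iterdir()):
        if f.is_file() and f.stat().st_mtime < cutoff:
            shutil.move(str(f), archive_dir / f.name)
            archived.append(f.name)
    return archived
```

The data stays online and queryable, but no longer occupies primary storage or competes for warehouse I/O.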
The main problems with replacing your Enterprise Data Warehouse with a “Big Data” or NoSQL solution are:-
- A lot of the consuming systems are not yet geared up to fully integrate with “Big Data” filesystems, and especially not with NoSQL databases.
- “Big Data” filesystems such as HDFS support sequential file access (a consuming system has to read a file from start to finish) rather than random access (which would allow it to jump straight to a particular entry in the file). The consequence is relatively poor query performance for selective queries.
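The sequential-versus-random-access distinction above can be sketched in a few lines. Without an index, finding one record means scanning the whole file, which is what raw HDFS files force on a consuming system; with a byte-offset index, a reader can seek directly to the record. The file format and function names here are hypothetical, and a local file stands in for an HDFS file.

```python
def build_offset_index(path):
    """Scan a key,value file once, recording the byte offset of each
    line. This one full sequential pass is the price of building the
    index; raw HDFS files would pay it on *every* lookup."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            key = line.split(b",", 1)[0].decode()
            index[key] = offset
            offset += len(line)
    return index

def random_read(path, index, key):
    """Jump straight to one record via the index -- the random access
    that raw sequential files do not offer."""
    with open(path, "rb") as f:
        f.seek(index[key])
        return f.readline().decode().rstrip("\n")
```

This is essentially what the SQL-on-Hadoop engines and columnar formats layer on top of HDFS to avoid full-file scans for selective queries.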
If you have use cases where low-latency query results are not a concern, and you are dealing with query result sets that fit within available memory, then I’d recommend running a proof of concept to select between the following options, listed in order of maturity:-
- Apache Hive – This was the original way to query data held in HDFS using SQL-like syntax for high-latency batch processing, and will shortly be improved to use a new underlying technology, Apache Spark, which will allow low-latency, near real-time querying. It is worth looking at if you need to incorporate programs that were written against Big Data sources before the other solutions became available.
- Cloudera Impala – This is ANSI SQL-92 compliant, has wide support from non-Big Data vendors, and has been available for a comparatively long time.
- Apache Spark – This will also allow low-latency SQL access, but due to its relative immaturity it offers less integration with non-Big Data vendor tools at this point in time.
- Apache Drill – This is another Apache project to keep a close eye on: it is capable of providing SQL:2003-compatible access to Big Data and NoSQL databases, is sponsored by MapR, and has gained certification from MicroStrategy (an established business intelligence provider).
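What all four options have in common is that they expose standard SQL over data held in HDFS or NoSQL stores. As a rough illustration of the kind of analytic query a proof of concept might benchmark across them, here is a sketch using Python's built-in sqlite3 purely as a stand-in engine; the table and column names are hypothetical, and none of this is specific to Hive, Impala, Spark SQL, or Drill.

```python
import sqlite3

# sqlite3 stands in for a SQL-on-Hadoop engine here, purely to show the
# shape of an aggregate query a PoC would run and time against each
# candidate. Table and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 100.0), ("EMEA", 50.0), ("APAC", 75.0)])
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
# rows -> [("EMEA", 150.0), ("APAC", 75.0)]
```

A sensible PoC would run the same standard-SQL workload against each engine over the same data in HDFS and compare latency, correctness, and how well your existing BI tools connect.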