Extracting your Big Data from your enterprise data hub in real time

Originally, Big Data was all about loading data onto the Hadoop Distributed File System (HDFS) from a data source and then running hand-coded MapReduce jobs over it. Writing the code was tricky, and a MapReduce job would often fail because it encountered unexpected data on a data node that the code didn't handle properly. MapReduce jobs were designed for aggregating data from log files in batch processing mode, which requires only sequential read access.

If you wanted to query the data using random access – where records in a file are indexed and so can be read without scanning the whole file – then you needed to move the data into either a traditional database, e.g. MySQL or Oracle, or a NoSQL database, e.g. MongoDB or Cassandra.
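The difference between the two access patterns can be sketched in plain Python (the record layout and field names below are invented purely for illustration): a sequential scan touches every record, the way a batch job over an HDFS file must, while a database builds an index once and then jumps straight to the record.

```python
# Hypothetical event records, as they might sit flattened in an HDFS file.
records = [
    {"user_id": 1, "event": "login"},
    {"user_id": 2, "event": "purchase"},
    {"user_id": 3, "event": "logout"},
]

def sequential_lookup(records, user_id):
    """O(n): scan every record, as a MapReduce-style batch job would."""
    return [r for r in records if r["user_id"] == user_id]

# A database builds an index once, then answers point lookups directly.
index = {r["user_id"]: r for r in records}

def indexed_lookup(index, user_id):
    """O(1): jump straight to the record via the index."""
    return index.get(user_id)

print(sequential_lookup(records, 2))
print(indexed_lookup(index, 2))
```

The index costs one pass up front and extra memory, which is exactly the trade a traditional or NoSQL database makes to buy fast random reads.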

If you needed to query the data using standard SQL then you needed to move the data into a traditional database (although NoSQL vendors have since incorporated SQL variants which partially resolve this issue).

The Apache open source community resolved the SQL access issue to an extent by introducing Hive, which provides a SQL variant for querying files held in HDFS. The problem was that Hive still translated queries into MapReduce jobs, so it still worked in batch processing mode. However, because it offered SQL access rather than hand-coded MapReduce jobs, it was readily adopted, which means any replacement has to carry this technical debt.

Cloudera Impala was introduced to provide real-time SQL access, reading directly from HDFS and HBase rather than relying on underlying MapReduce jobs. Impala still needs the Hive metastore for compatibility reasons. Impala was a big step forward for real-time analytics, but it struggled with datasets that don't fit into memory on each data node, and its SQL was still a variant that caused issues with BI tools.

Apache Spark has been introduced as a new engine with direct access to HDFS and HBase. Spark allows full SQL access and also has a streaming entry point, as well as interfaces for machine learning and graph processing. Hive is being adapted to use Spark as an execution engine alongside MapReduce, which will allow legacy scripts written for Hive to keep working.

Apache Drill is the latest advance. It breaks free from the need for a Hive metastore, which means you don't have to create a Hive external table before querying self-describing files such as JSON (a common format used by web developers to store data). This means your website could persist user data to HDFS as JSON files (via a message queue to improve performance) and have it queryable by a no-code data discovery tool such as Tableau as soon as you've periodically flushed the message queue to a single file spread across HDFS.
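What "self-describing" buys you can be illustrated in plain Python (Drill does the equivalent natively with SQL over the raw files; the file contents and field names below are made up for illustration): because every newline-delimited JSON record carries its own field names, nothing like a Hive external table has to be declared before querying.

```python
import io
import json

# A newline-delimited JSON file as a web app might flush it to HDFS.
# Each line describes its own fields, so no schema is declared up front.
raw_file = io.StringIO(
    '{"user": "alice", "page": "/home", "ms": 120}\n'
    '{"user": "bob", "page": "/cart", "ms": 340}\n'
    '{"user": "alice", "page": "/cart", "ms": 95}\n'
)

records = [json.loads(line) for line in raw_file]

# Rough equivalent of: SELECT page, COUNT(*) FROM clicks GROUP BY page
counts = {}
for r in records:
    counts[r["page"]] = counts.get(r["page"], 0) + 1

print(counts)  # {'/home': 1, '/cart': 2}
```

In Drill the same aggregation is a one-line SQL query pointed straight at the file path, with the schema discovered from the records themselves at read time.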

Apache Drill currently supports the JSON and Parquet file formats, along with data sources including Hive, HBase, MongoDB, Amazon S3, Google Cloud Storage, Azure Blob Storage and Swift. Apache Drill is written in Java, so it will run on any platform with a Java runtime.

Apache Drill still doesn't overcome the sequential read limitation, however, so if you want fast, distributed, indexed random access to your Big Data, you will still need to pump the data into a NoSQL database. XML is also still not supported.

