If you’re a dot-com startup or a larger organisation that wants to experiment with Hadoop using public data, your first port of call is likely to be a cloud computing provider, the most popular of which is Amazon Web Services (AWS). If you’re a Microsoft shop, you may well choose Azure.
Cloud computing refers to the ability to obtain infrastructure (IaaS), platform (PaaS) or software (SaaS) as a service. In all three service models, the benefits are that there are no upfront costs, you’re charged only for what you use, and you can scale your usage, and hence your cost, up or down as your business needs change.
The most basic offering is infrastructure, which supplies you with servers and a network on which you have to install your own software. The more viable option is one of Amazon’s Big Data PaaS offerings: Elastic MapReduce (for Hadoop) or Redshift (a more traditional, if slightly limited, data warehousing option). Azure HDInsight is the Microsoft equivalent of Elastic MapReduce.
On AWS, Elastic MapReduce, being a Hadoop software stack, wins out if you need to store video, pictures, XML etc. (as Redshift doesn’t allow this), if you anticipate needing to handle petabytes of data, if you need high availability or scalability, or if you simply want to get used to Hadoop. It also allows extra software such as Apache Spark or Python to be added to the cluster, so it is a flexible platform.
Redshift is your traditional federated data warehouse, so it performs well at sub-petabyte volumes and offers standard SQL access. It doesn’t offer the high availability that Hadoop offers (by automatically replicating copies of partitioned data to additional server nodes). You’re also restricted to loading data from Amazon DynamoDB or the Amazon S3 storage service, and data loading is single-threaded and hence comparatively slow. It also doesn’t handle images, video or non-fixed-format text files. Because of these drawbacks, Elastic MapReduce is the more logical starting point for the Big Data journey.
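The selection criteria in the two paragraphs above can be summarised as a small decision function. This is only a sketch of the reasoning in this article, not an official AWS sizing rule; the function name, its parameters and the one-petabyte threshold are assumptions chosen for illustration.

```python
def recommend_service(needs_unstructured_data=False,
                      expected_petabytes=0.0,
                      needs_high_availability=False,
                      wants_hadoop_experience=False):
    """Return the service this article would suggest for the given needs.

    All parameter names are hypothetical; they simply mirror the criteria
    listed in the text (video/pictures/XML, petabyte volumes, high
    availability, wanting to learn Hadoop).
    """
    prefers_hadoop = (
        needs_unstructured_data          # Redshift does not store these
        or expected_petabytes >= 1.0     # assumed cut-off for "petabytes of data"
        or needs_high_availability
        or wants_hadoop_experience
    )
    return "Elastic MapReduce" if prefers_hadoop else "Redshift"


if __name__ == "__main__":
    # A team storing video files alongside structured data.
    print(recommend_service(needs_unstructured_data=True))
    # A team with a few terabytes of fixed-format data and SQL skills.
    print(recommend_service(expected_petabytes=0.005))
```

In practice the choice is rarely this binary, but writing the criteria down this way makes it easy to see that any single Hadoop-specific requirement tips the decision towards Elastic MapReduce.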
In order to start moving files to the Big Data cloud, you’re going to need to purchase Elastic MapReduce as well as the cluster that you wish to host it on, which runs on Amazon Elastic Compute Cloud (EC2).
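If you would rather script the cluster purchase than click through the console, the following is a minimal sketch using the boto3 library's EMR client and its run_job_flow call. The cluster name, instance type, node count, release label and region are all assumptions for illustration, and the two role names are the AWS default EMR roles, which must already exist in your account. Running launch_cluster requires boto3 and valid AWS credentials and will incur charges.

```python
def build_cluster_request(name, instance_type="m5.xlarge",
                          instance_count=3, release_label="emr-6.15.0"):
    """Build the keyword arguments for boto3's EMR run_job_flow call.

    The defaults are illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running after any submitted steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default roles; assumed to have been created in the account already.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request to AWS and return the new cluster's id."""
    import boto3  # imported here so the request can be built without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]


if __name__ == "__main__":
    request = build_cluster_request("hadoop-experiment")  # hypothetical name
    print(request)
    # Uncomment to create a real (billable) cluster:
    # print(launch_cluster(request))
```

Separating the request from the call lets you review exactly what will be provisioned before anything is billed.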
Amazon provide a range of cluster configurations with varying memory, I/O and CPU capabilities. Information on these, together with suitable use cases for each, can be found at http://aws.amazon.com/ec2/instance-types/. Cluster configuration i2 is the one recommended by Amazon for production Hadoop configurations, but it would be well over-specified initially. Its key feature is high-volume, high-speed, locally mounted solid-state disks.
Amazon provide a helpful calculator that lets you choose your options and then tells you your monthly costs.
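For a rough feel of the arithmetic behind such a calculator, here is a minimal sketch. It assumes each node is billed an hourly EC2 rate plus an hourly Elastic MapReduce rate, and it ignores storage and data transfer charges. The rates in the example are made-up placeholders, not real AWS prices; use Amazon's calculator for actual figures.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_cost(node_count, ec2_hourly_rate, emr_hourly_rate,
                 hours=HOURS_PER_MONTH):
    """Estimate the monthly compute cost of a cluster.

    Each node is charged the EC2 rate plus the EMR rate for every hour
    the cluster is running. Storage and data transfer are not included.
    """
    if node_count < 0 or hours < 0:
        raise ValueError("node_count and hours must be non-negative")
    return round(node_count * (ec2_hourly_rate + emr_hourly_rate) * hours, 2)


if __name__ == "__main__":
    # Placeholder rates for illustration only.
    always_on = monthly_cost(node_count=3, ec2_hourly_rate=0.20,
                             emr_hourly_rate=0.05)
    # The same cluster used for only 40 hours of experiments in the month.
    part_time = monthly_cost(node_count=3, ec2_hourly_rate=0.20,
                             emr_hourly_rate=0.05, hours=40)
    print(f"Always on: {always_on}, 40 hours only: {part_time}")
```

Comparing the two figures shows why shutting an experimental cluster down between sessions matters: the cost scales directly with the hours it runs.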
To get started, you simply have to log in with an Amazon username and password at http://aws.amazon.com/.