Recently I had the opportunity to explore microsoft’s offering for managing Big Data. It was amazing to see how easy it was to setup a Hadoop & Spark cluster using the Azure HDinsights framework. There are numerous setups available and pre-made distributions are also available from vendors such as Cloudera and Hortonworks.
It’s great to see Microsoft not only embracing but also actively contributing to some of the BigData solutions in the Open Source community. Apache Spark is an open-source framework for cluster computing and one that I actively follow. I have seen regular contribution from Microsoft staff members towards this project.
The key advantage of using Azure HDInsights is that a Cluster of computers can easily be configured and made available within 20 to 30 mins. The default Spark cluster also comes pre-configured with lots of applications such as Ambari, PIG, HIVE, Flume, Kafka as well as a Jupyter notebook that will work with Python and Scala.
It’s worth noting that rather than the filesystem residing on commodity hardware, Microsoft would utilise Azure Blob Storage. The following are some common patterns on how we might use Microsoft Power BI (Data Visualisation Tool for Big Data which is similar to Tableau):
Option 1: (Hive to Data Visualisation tool)
Use a Hive Table and query this via an OBDC driver, note that with this approach the entire table is downloaded to PowerBI. In most cases the hive table will be derived by querying data in the Azure Blob Storage using a cluster of computers
Option 2: (Process Data-> Save to Azure Blob -> Analyse Flat File with Visualisation tool)
Another way to import data from HDInsight into Power BI is by connecting to flat files in either Blob or the Data Lake Store
In this situation use HDInsight to process your data and write the resulting curated or aggregated data into text files. Generally this will give better refresh performance as we bypass the ODBC driver.
Option 3: (This is in Beta) Direct Query with Spark Cluster
This option allows you to keep the data in the Azure Blob storage and utilise technology like Spark SQL to query the data, the summarised results are sent back to Power BI. This approach can allow for huge data sets to be analysed using a cluster of machines
Option 4: Direct Query with Azure SQL DB
DirectQuery using Azure SQL Database (DB) you would process your data in your cluster, but write the resulting data to tables in Azure SQL DB (or Azure SQL Data Warehouse). Power BI would take care of the data refresh as well as getting only the data that is required from the Azure SQL database.
In my view the most common implementation scenario is going to be utilising a cluster to mine the inital data and get aggregated results. This data would then reside in flat files on Azure Blob storage or Azure SQL Database. This approach will mean that the cost of keeping a cluster turned on will be at a minimum. Further tools like Jupyter notebooks can have prpre-builtcripts that can be easily modified for Adhoc data processing.
Let me know if there is anything you would like me to conver furter with Azure HDInsights, I also have access to some rolling microsoft credits to prototype sample solutions.