Windows Azure Solution Cookbook

by on July 2, 2012

I’ve spent quite a bit of time discussing cloud computing with our customers lately, and the common theme among them all has been “lean”. Windows Azure can provide tremendous benefits to nearly every organization we speak with: operating cost reduction, faster time to market, and scale on demand. It literally can’t get any leaner than this.

The challenge I’m running into is that it’s hard for architects and developers to get a big picture of the Azure platform and how all the features can be used together to build solutions. Microsoft is shipping new services on a quarterly basis and each new service is designed to solve a particular need our customers are asking for. We need a way to see these services holistically as a set of building blocks or ingredients to use in our solutions.

[Diagram: Windows Azure Reference Architecture]

In this series I’m going to offer up some architectural recipes to help visualize solutions to the common scenarios we’ve identified since Azure’s launch. These are by no means the only solutions you can build with Azure, or the only way to address these scenarios, but hopefully they will provide you with a high-level way to visualize your solutions on the Azure platform.

The diagram you see to the right is designed to provide a layered architectural overview of the developer and infrastructure services currently available in Windows Azure*. This diagram will be the master template for the entire series.

Recipes


* Special thanks to my buddy Holger Sirtl for his outstanding architecture overview diagram. This series would not be possible without him.



Windows Azure Recipe: Big Data

by on July 2, 2012

As the name implies, what we’re talking about here is the explosion of electronic data that comes from huge volumes of transactions, devices, and sensors being captured by businesses today. This data often arrives in unstructured formats and/or too fast for us to process effectively in real time. Collectively, we call these the four V’s of big data: Volume, Velocity, Variety, and Variability. These qualities make this type of data better managed by NoSQL systems like Hadoop than by a conventional relational database management system (RDBMS).

We know that there are patterns hidden inside this data that might provide competitive insight into market trends. The key is knowing when and how to leverage these NoSQL tools in combination with traditional business tools such as SQL-based relational databases, data warehouses, and other business intelligence tools.

Drivers

  • Petabyte scale data collection and storage
  • Business intelligence and insight

Solution

The sketch below shows one of many possible big data solutions. It uses Hadoop’s highly scalable storage and parallel processing capabilities, combined with Microsoft Office’s business intelligence components, to access the data in the cluster.

[Image: big data solution sketch]

Ingredients

  • Hadoop – this big data industry heavyweight provides both large scale data storage infrastructure and a highly parallelized map-reduce processing engine to crunch through the data efficiently. Here are the key pieces of the environment:
    • Pig - a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
    • Mahout - a machine learning library with algorithms for clustering, classification and batch based collaborative filtering that are implemented on top of Apache Hadoop using the map/reduce paradigm.
    • Hive - data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in distributed storage. Directly accessible to Microsoft Office and other consumers via add-ins and the Hive ODBC data driver.
    • Pegasus - a peta-scale graph mining system that runs in a parallel, distributed manner on top of Hadoop and provides algorithms for important graph mining tasks such as Degree, PageRank, Random Walk with Restart (RWR), Radius, and Connected Components.
    • Sqoop - a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
    • Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data to HDFS.
  • Database – directly accessible to Hadoop via the Sqoop-based Microsoft SQL Server Connector for Apache Hadoop. Data can be transferred efficiently to traditional relational data stores for replication, reporting, or other needs.
  • Reporting – provides easily consumable reporting when combined with a database being fed from the Hadoop environment.
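
To make the map-reduce idea behind these ingredients a little more concrete, here is a minimal, single-machine sketch of the pattern in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical and exist only for illustration. On a real cluster the map and reduce steps would run in parallel across many nodes, and the framework would handle the shuffle.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big insight", "Data everywhere"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across as many machines as the data requires, which is where the scalability comes from.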

Training

These links point to online Windows Azure training labs where you can learn more about the individual ingredients described above.

Hadoop Learning Resources (20+ tutorials and labs)

A large collection of resources for learning about all aspects of Apache Hadoop-based development on Windows Azure and the Hadoop and Windows Azure ecosystems.

SQL Azure (7 labs)

Microsoft SQL Azure delivers on the Microsoft Data Platform vision of extending the SQL Server capabilities to the cloud as web-based services, enabling you to store structured, semi-structured, and unstructured data.

See my Windows Azure Resource Guide for more guidance on how to get started, including links to web portals, training kits, samples, and blogs related to Windows Azure.



Windows Azure Recipe: High Performance Computing

by on June 8, 2012

One of the most attractive ways to use a cloud platform is for parallel processing. Commonly known as high-performance computing (HPC), this approach relies on executing code on many machines at the same time. On Windows Azure, this means running many role instances simultaneously, all working in parallel to solve some problem. Doing this requires some way to schedule applications, which means distributing their work across these instances. To allow this, Windows Azure provides the HPC Scheduler.

This service can work with HPC applications built to use the industry-standard Message Passing Interface (MPI). Software that does finite element analysis, such as car crash simulations, is one example of this type of application, and there are many others. The HPC Scheduler can also be used with so-called embarrassingly parallel applications, such as Monte Carlo simulations. Whatever problem is addressed, the value this component provides is the same: It handles the complex problem of scheduling parallel computing work across many Windows Azure worker role instances.
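
As a rough illustration of what “embarrassingly parallel” means, here is a small Monte Carlo sketch in Python that estimates pi. It runs on one machine with a thread pool and has nothing to do with the HPC Scheduler API; the names count_hits and estimate_pi, the task count, and the sample sizes are all assumptions chosen for the example. The point is that each task has its own seed and shares no state, so the tasks could just as easily be handed to separate worker role instances.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(task):
    """Run one independent task: count random points inside the unit quarter circle."""
    seed, samples = task
    rng = random.Random(seed)  # private generator, so tasks share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_tasks=8, samples_per_task=20_000, max_workers=4):
    """Fan the tasks out to a pool of workers and combine their partial counts."""
    tasks = [(seed, samples_per_task) for seed in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        total_hits = sum(pool.map(count_hits, tasks))
    return 4.0 * total_hits / (num_tasks * samples_per_task)


if __name__ == "__main__":
    print(estimate_pi())
```

In CPython, threads will not speed up this CPU-bound loop; swapping in ProcessPoolExecutor (which offers the same map method) is the usual way to use multiple cores. Either way, the structure is the same: split the work into independent tasks, run them anywhere, and merge the results.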

Drivers

  • Elastic compute and storage resources
  • Cost avoidance

Solution

Here’s a sketch of a solution using our Windows Azure HPC SDK:

[Image: HPC solution sketch]

Ingredients

  • Web Role – this hosts an HPC scheduler web portal to allow web-based job submission and management. It also exposes an HTTP web service API to allow other tools (including Visual Studio) to post jobs as well.
  • Worker Role – typically multiple worker roles are enlisted, including at least one head node that schedules jobs to be run among the remaining compute nodes.
  • Database – stores state information about the job queue and resource configuration for the solution.
  • Blobs, Tables, Queues, Caching (optional) – many parallel algorithms persist intermediate and/or permanent data as a result of their processing. These fast, highly reliable, parallelizable storage options are all available to all the jobs being processed.
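
To show the head node idea in miniature, here is a toy Python sketch that takes jobs from a queue and hands each one to whichever compute node currently has the least work. This is only one simple policy that I picked for illustration; it is not how the Windows Azure HPC Scheduler is implemented, and the schedule function, the job costs, and the node names are all hypothetical.

```python
import heapq
from collections import deque


def schedule(jobs, node_names):
    """Assign each (job_name, cost) pair to the node with the lowest load so far."""
    queue = deque(jobs)  # jobs are dispatched in submission order
    heap = [(0, name) for name in node_names]  # (current load, node name)
    heapq.heapify(heap)
    assignments = {name: [] for name in node_names}

    while queue:
        job_name, cost = queue.popleft()
        load, name = heapq.heappop(heap)  # least-loaded node comes out first
        assignments[name].append(job_name)
        heapq.heappush(heap, (load + cost, name))
    return assignments


if __name__ == "__main__":
    jobs = [("render", 3), ("sweep-1", 1), ("sweep-2", 1), ("report", 2)]
    print(schedule(jobs, ["node-1", "node-2"]))
```

In the recipe above, the queue and the per-node state would live in the database rather than in memory, so the head node can recover its view of the cluster after a restart.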

Training

Here is a link to online Windows Azure training labs where you can learn more about the individual ingredients described above. (Note: The entire Windows Azure Training Kit can also be downloaded for offline use.)

Windows Azure HPC Scheduler (3 labs) 

The Windows Azure HPC Scheduler includes modules and features that enable you to launch and manage high-performance computing (HPC) applications and other parallel workloads within a Windows Azure service. The scheduler supports parallel computational tasks such as parametric sweeps, Message Passing Interface (MPI) processes, and service-oriented architecture (SOA) requests across your computing resources in Windows Azure. With the Windows Azure HPC Scheduler SDK, developers can create Windows Azure deployments that support scalable, compute-intensive, parallel applications.

See my Windows Azure Resource Guide for more guidance on how to get started, including links to web portals, training kits, samples, and blogs related to Windows Azure.