Designing and Implementing a Data Warehouse in the Cloud – DEVELOPPARADISE
29/04/2018

Designing and Implementing a Data Warehouse in the Cloud


Introduction

In computing parlance, a data warehouse refers to a system that is used for data analysis, real-time reporting and decision making. It is often considered a fundamental part of efficient business intelligence. Data warehouses are used as centralized storage repositories for integrated data originating from various sources. Current as well as historical data is stored in a single place and is used to create reports for business workers as needed.

The raw data obtained from various departments in a business undergoes data cleansing before being used for additional processing. This data cleansing exercise is critical to ensure standard data quality before being used for reporting and other similar purposes.

A broader definition of data warehousing also includes multiple tools used for business intelligence, data extraction, and transformation, to load the data into a repository and to manage as well as retrieve metadata.

Designing a Data Warehouse

Data warehouses help you make business decisions based on the historical data collected over a period of time. However, the business data is usually distributed across multiple databases and applications located in various geographical areas. Now imagine if you compile this data and then use Big Data Analytics to improve customer satisfaction, simplify costs and find new avenues for growth. A well-designed data warehouse can help you do that.

Most businesses that claim to have a data warehouse fail to realize that data warehousing is not about building a dumping ground for tables. Instead, a good warehouse turns the raw data into something that’s cleansed, organized, summarized and supplemented. There are many standard designs that you can try to build. For instance, a standard Extract, Transform and Load (ETL) based data warehouse makes use of staging, integration, and data access layers to maintain its basic functions:

  1. The staging database saves the raw data that has been extracted from multiple data system sources.
  2. The integration layer then integrates the different sets of data and transforms them from the earlier layer and stores this transformed data within an Operational Data Store (ODS) database. Following the integration process, the data is once again moved to a Data Warehouse database where it is sorted by hierarchy and stored as facts as well as aggregated facts.
  3. The third layer, the access layer, is where users are able to retrieve the data as needed.

Data Warehouse Architecture

There are a number of different methods to construct or organize a Data Warehouse. The final structure of a data warehouse architecture is generally based on the organization’s need. The fundamental factors that define the architecture of a DW include the hardware components used, the software created as well as the specific data resources required to enable the accurate functioning of the data warehouse.

Designing and Implementing a Data Warehouse in the Cloud

In most cases, a Data Warehouse uses a three-tiered architecture. Each of these is described below:

  1. Bottom Tier – The architecture of the bottom tier consists of a Data Warehouse database server. This is a relational database system. Users can make use of backend tools as well as utilities to enter data into this tier. The back end tools are responsible for carrying out the Extract, Clean, Load as well as refresh functions.
  2. Middle Tier – The middle tier contains the On-Line Analytical Processing (OLAP) Server which is implementable by Relational OLAP (ROLAP) and Multi-Dimensional OLAP (MOLAP).
  3. Top Tier – The top tier is the client layer located on the front end. The top layer contains the tools used for querying, reporting, analysis, and mining of data.

There are many implementations of the three-tier architecture. The traditional implementation used databases to implement warehouses. The modern implementation uses the cloud.

On-Premise Data Warehousing

Before the evolution of real-time or cloud-based data warehouses, there existed multiple traditional data warehouse methods being used. Traditional data warehouse techniques were designed to lend support to programmed functionalities like:

  • Data Roll Up – Data is generalized by summary
  • Pivot – Data is cross-tabulated or rotated as needed
  • Slice and Dice – Based on the defined dimensions, perform projection operations
  • Drill Down – Reveals sub-details of different datasets
  • Selection – The ability to select data based on value and range
  • Sorting – The ability to sort the data on the basis of ordinal value

The drawback with the traditional method of data warehousing it that it requires more resources, more development time and access to business intelligence tools. ETL codes needed to be handwritten which takes an inordinately long period of time to build the data warehouse. Because of the time delay involved, the data warehouse is, more often than not, out of sync with the requirements needed during the time of deployment.

Unfortunately, the actual value, as well as the capabilities of the data, was rarely understood until a data warehouse is built and, by which time, it was already out of sync with the current requirements. If the data warehouse wasn’t already abandoned midway to completion, once completed, it resulted in an expensive and inflexible way to analyze data.

The process of building a traditional warehouse included the following steps:

  1. Identify and collect requirements
  2. Design the dimensional model
  3. Execute T-SQL queries to create and populate your dimension and fact tables

This topic has already been covered in the tutorial How to Create Your First Data Warehouse by Mubin M. Shaikh.

Data Warehousing in the Cloud

Cloud data warehousing is essentially a Data Warehouse as a Service (DWaaS) approach aimed at simplifying costly and time-consuming management, maintenance and administration, and the fine-tuning necessary while working with on-premise data warehouses. Implementing the DW architecture in the cloud is easier because the cloud warehousing tools are built that way.

Implementing the middle and top tier functionalities of the architecture requires very little effort compared to that of on-premise DW infrastructure. The cloud data warehousing tools use machine learning algorithms too that are otherwise hard to build for an enterprise.

Having your DW on the cloud comes with many benefits. To migrate a traditional Data Warehouse to the cloud can be a complex process of relocating data, schema and ETL. In case of restructuring the database schema or if you plan to rebuild the various data pipelines, the complexity of the migration grows exponentially.

Once the business decides on cloud warehousing, the next obvious question becomes, should you use the public cloud or a private cloud?

The challenges faced while using the public cloud viz. workload performance, data security and privacy laws, can be easily taken care of by utilizing a private cloud. The workload performance is much better on a private cloud vs. the public cloud as the hardware can be customized for these heavy workloads. Also, the organization or business using the private cloud manages the cloud environment.

Data security and privacy are also addressed as most of the private clouds are hosted within the network of a data center. Private clouds are also able to provide performance, cost, scalability, flexibility as well as elasticity benefits.

Comparison Table – Traditional vs. Public Cloud vs. Private Cloud Data Warehouse

  Traditional Data Warehouse Public Cloud Data Warehouse Private Cloud Data Warehouse
Initial Costs High Low Medium to high
Incremental Costs High Low Low, until hardware limit reached
Environment Lead Time High Low Low, provided all hardware procured and VMs configured
Scalability Prior planning needed Immediate Prior planning needed
Workload Performance Heavy workloads Low workloads Heavy workloads
Data Integration Easy Hard Easy
Privacy Laws Comply May or may not comply Comply
Data Security High Low High

Summary

Data warehousing is a technique for storing your business data for data analysis and reporting. Although it adds a lot of business value, a full-fledged data warehouse is complex and hard to build. Cloud data warehouses, on the other hand, have unlimited resources and very powerful analytical tools. If you have any experience to share with designing a data warehouse, let us know that in the comments.