Share

What is Data Warehousing?

The data warehouse concept started in 1988 when Barry Devlin and Paul Murphy published their groundbreaking paper in the IBM Systems Journal. Their vision sparked a need for more specific definitions of database implementations, which Bill Inmon and Ralph Kimball provided in the early 1990s – and Gartner further clarified definitions in 2005. Now any discussion on the topic also includes how or where one is implemented, such as within the cloud, or spanning on-premises and cloud in a hybrid manner.
 
A data warehouse isn’t a collection of tables or measured in terabytes. It’s a design pattern, a data architecture with many characteristics:
Subject oriented: reflects business entities and processes that the organization works with daily. The level of detail the subject area is what is important: if detailed data is there, it is a data warehouse. If summary or only limited data is there, it is a data mart.
 
Integrated and consistent: data formats and values are standardized across all tables to ensure complete, accurate data that users can understand. The data must also have integrity, e.g., the warehouse cannot have purchasing transactions without a corresponding customer record. 
 
Time variant and nonvolatile: a warehouse captures data changes and tracks data changes over time. All data is kept and does not change with transactional updates. Whether traditional, hybrid, or cloud, a data warehouse is effectively the “corporate memory” of its most meaningful data.
 
Metadata, Schema, and the Data Dictionary
Typically, an organization captures all there is to know about customers, for example, as well as sales transactions. Those transactions are then linked to customers, which leads to connections with products descriptions, inventories, bills of material, support records, and marketing campaigns. Subject areas are made up of relational tables, like spreadsheets with column headers and rows of data. The column headers, called metadata because they describe the context for values in that column, contains the name of the fields and their data type such as date, currency, time, integers, and text. Every record stored has at least one key field that allows for random direct access. All the table definitions – the metadata – together are called a schema and are stored in the data dictionary. An entire, fully built data warehouse is 4,000 to 7,000 relational tables organized by topic areas.
 
How Does Data Get into the Data Warehouse?
It’s not uncommon to have 200 or even 500 different applications sending data to the warehouse, which consolidates and integrates all such data into the subject areas. The warehouse gets input from applications such as enterprise resource planning (ERP), customer relationship management (CRM), and supply chain management (SCM). Clickstream data from mouse clicks on web pages are another source, as is sensor data from machinery vehicles, and so on. There is also unstructured data such as JSON (JavaScript Object Notation) which does not conform to rows and columns but is still captured in the data warehouse.
 
Data Warehousing and the Data Integration Step
Before the data goes into the data warehouse database, it passes through the data integration step, a complex process that rationalizes data from multiple sources into a single result. Originally this was called extract, transform, and load (ETL) because the data had to be pulled from the source, refined, then loaded into data warehouse relational tables.
 
Data Cleansing
Modern integration processes include data cleansing, which involves detecting and correcting corrupt or inaccurate records. Sometimes bad data gets into a record, or hardware corruption occurs – and sometimes simple human error is at fault. The data integration task combines the best, most accurate and most complete data from multiple applications into a clean, reliable “golden record” that’s stored in the warehouse.  
 
Data Warehousing and Data Loading
Then the data is loaded into the data warehouse in a continuous process -- all day long for most implementations. Data loading is a heavy consumer of relational database compute time primarily because of all the recovery processing that is needed in the event load jobs fails. 
Data loading segues to the business purpose of delivering data to the business users: getting insight and finding answers to business problems. Data scientists apply advanced mathematics to large amounts data to find patterns and anomalies. Multi-dimensional analysis is where all the data is highly summarized, enabling fast review of rollups by region, city, sales person, and product sold, for example. Executives and business analysts (or “citizen data scientists”) use reports and dashboards with visualization, all pulling from the source of governed data: the data warehouse.