Ever Heard of Data Auditing?

December 22, 2011

I am sure most IT professionals, especially the ones dealing with data, have heard of data quality.  The idea of monitoring data to ensure the data fit into the intended use with a high level of accuracy.  When creating a Business Intelligence (BI) solution how can data quality be continually monitored through the entire solution?  One method is called Data Auditing.  The concept might not be new, but the formalization of the process is. 

Data Auditing is the process of ensuring data quality from the beginning of the BI process to the final destination in a repeatable and measured way.   This includes validation of data that arrive from source to staging to star to cube (if exists).  Where ever business logic can be implemented, a data audit can be used to make sure the quality of the data is consistent. 

One example of data auditing I have used was for a ‘technical glitch’ with the use of SQL Server Analysis Services (SSAS).  The decision was made to not change source data that was of bad quality and load it into the Kimball star schema as it was.  (Now this decision can be debated, but one major reason this was done was to easily expose data quality issues to the users as they believed there would be an inconceivably small amount of data quality issues.) What was discovered with using SSAS is it would not handle dates with a year prior to around 1500.  Some of the dates in the source system had the year 200 instead of 2000 and so on.  A Data Audit routine was designed to look for these dates into the stage tables and change them to a pre-determined default date. This allowed the SSAS job to complete and the cube to process.

This process seems much like any other data quality processes.  The true auditing came from how these instances (and the others) were reported.  The code developed to catch the data quality issues also entered data into  “performance” stars schemas designed to provide data on the nightly process.  This data was then shown by a dashboard used by the internal IT, BI staff.  Every morning we could see how many rows of data were caught by each audit.  This allowed us to make quick decisions on how to handle the data to change in the source system and even change in the data warehouse (including staging tables to make sure there was not a type 2 slowly changing dimension row added by the change to the original source data).

There were other data audits that summed specific counts of rows from the source system and made sure that count was the same in the star and cube due to how the star was loaded.  Another data audit allowed us to show the measures in the fact tables and the cube were the same based on logic used.  These types of data audits are not used to catch data quality, but to explicitly show that on a day-to-day basis, the results are the same.  This audit was extremely necessary to build confidence in the data.  Confidence in “the numbers” was extremely important to our clients and this was the simplest way for us to convince our end users they were getting what they were supposed to get. 

The best way to implement a data auditing solution is to use the existing BI tools to build the report, dashboard or any other means to expose the audit.  Even the simple use of Excel against the cube and star can be used.  Any way that can be easily maintained is prefered.  Of course the data needs to be understood in order to make sure this is done correctly.  If data is distributed in the star to form a lower level of granularity, then it needs to be summed back to the original level and compared to what is in the source.  This could mean there are rounding errors present, but that should only provide around a penny difference. 

A problem with Data Auditing is the trade-off of time developing the data auditing process takes away from time to develop the business needed parts of the BI solution.  One way to incorporate this is to build it into each project.  The first project to utilize the concept of data auditing will take longer due to the need to build the underlying data structure and processes.  Once this is started and built, the next project to use the structure will take less time to ‘plug in’ just like the idea of re-using conformed dimensions.

Data Auditing’s benefits provide a repeatable way to show data throughout the entire BI process is correct.  This concept is part of a good data quality/data governance solution.  The ability to ‘watch’ the data as it goes through the entire BI process to make sure it means what it supposed to mean will provide a security blanket for the end users.  The end users can say “This data is from the data warehouse and I am positively sure it is correct.”  How much value to the business and reassurance to IT does that grant?

– Doug

How Necessary is a Data Warehouse?

November 28, 2011

The largest and most complex aspect of Business Intelligence (BI) is the data warehouse.  In this context, the data warehouse is the repository of data generally fed from many sources to keep historical perspectives of an entity’s data.   It is a behemoth that is generally expensive, slow to build, complicated in structure and difficult to maintain.  How necessary is it?  Does a company need the actual, physical data warehouse to have a successful and sustainable business intelligence (BI) program?

There are many design methodologies that take these issues into consideration.  There are advantages and disadvantages to both traditional (and non-traditional) methodologies which I do not cover in this post.  My goal is to bring up points of view of why and when a data warehouse may or may  not be used.  What I would like to cover is:

  • The Corporate Information Factory (CIF), based on the Inmon approach
  • The Kimball Style of data warehousing
  • BI using no data warehouse at all

Corporate Information Factory

The Corporate Information Factory methodology, in a nutshell, says there is no way of getting around this inevitable fact of the need for a data warehouse.  In order to have a successful and sustainable BI program, a data warehouse is needed.  Not only is it needed, it needs to be completely designed, built and populated prior to any further analysis or BI work can be done.  This is due to the nature of how business concepts are intertwined within each other necessitating the big picture view.  This style also views the architecture process more from the IT/data perspective compared to the business need point of view.

Kimball Methodology

The Kimball methodology of data warehouse design is not as structured and regimented as the Corporate Information Factory.  The Kimball data warehouse is the sum of its parts; meaning one area of the business could be designed, developed and deployed providing BI insight while other aspects of the business have not been discussed.  This concept will speed the development of the data warehouse compared to the CIF, but the underlying data warehouse can become much more complex as more and more is added to it along with the possibility of rework.  This style views the architecture process from the business needs point of view compared to the IT/data perspective.

No Data Warehouse?

What about not using a data warehouse?  In the new age of Data as a Service (DaaS), Master Data Management along with Service Oriented Architecture (SOA), why re-store data from disparate systems?  Why not store the metadata of where the data is found and attach the business logic to the SOA call?  This can be a very powerful way to gain insight into data.  The idea that the development of a data warehouse can be done without the data warehouse.  There are already tools that will do this.  One of them is Qlikview from Qliktech.  The basic premise behind this tool is to allow the user to develop the Transform and Load aspects of ETL (Extract Transform and Load) in memory to delivery very quick analytics in a solid visual manner.  This tool is not a methodology, but SOA could be used in a larger context with the same principles. This style views the architecture process as something the business could do, but IT does not have to do.

The idea that a data warehouse is necessary for a successful BI implementation is not necessarily true.  A data warehouse is not necessary to have analytics or provide a picture of the data you have.  I believe it is very questionable to say this process is sustainable to leverage every benefit for BI.  The very important aspect of BI that cannot be overcome by SOA, or in-memory analytic tools like Qlikview, is the entire reason the data warehouse first came about. 

The decision for building or not building a datawarehouse is all about the history of the data.  Not the history that is required by law to be kept like financial data or what in many cases is considered ‘facts’ in the Kimball style.  If this were the only history needed, a data warehouse would be less necessary.  The type of history that is important is the history that cannot be reproduced within the source systems.  This is the history of changes made that are not kept by the source system.  In many cases a customer’s address may not be historically important in a transactional/source system so only  the most current record is kept.  If that history is not kept somewhere (like a data warehouse), analytics of historical purchases of products will not show a true picture of what actually happened.  It will only show the picture of what is in the source system at the current point in time.  This situation is the quinticential lynchpinn for why a data warehouse should be necessary.  The ability to track and keep history that is not kept in the source system is something SOA, or in-memory BI is not capable of reproducing.

If the desired BI capability for the business is operational in nature, a data warehouse will not offer any significant benefit over SOA.  This is a short sighted tactical means of looking at data and cannot provide strategic insight, but it certainly could be the best way to answer that need for data given the circumstances.  This would not be the end-all-be-all for BI, but it certainly can provide means to start a program.

So does this completely answer the question “Is a data warehouse necessary for BI?”  The data warehouse is necessary for a complete and sustainable BI program, but it does not have to be the start of the program.  So… of course the answer to that is still…. “It depends…”

… Doug

The idea of an Information Democracy…

November 21, 2011

During the Business Intelligence Symposium presented by Lucrum in conjunction with the University of Cincinnati, College of Business, Filippo Passerini, Group President of Global Business Services and CIO of P&G, promoted the idea of an Information Democracy. He is not the first person to use this phrase, only the latest to try and specifically define what is meant by the term. The power of providing an Information Democracy to the data consumers enables similar freedoms to the citizens of the U.S. democracy.
LIFE: the growth of an organization using data driven decisions (a company that is not growing is dying)
LIBERTY: the ability to quickly make the appropriate decisions based on data (a company is less suppressed by competent data driven decision making)
THE PURSUIT OF HAPPINESS: the ability to improve profits (what company is not happier with more profit??)

Since there are many types of democracies, the term Information Democracy is not easily refined. Mr. Passerini discussed his idea of Information Democracy as providing the same information at the same time to all that should view that data. This Lateral information exchange has enabled P&G unprecedented access to data propelling their decision making to be quicker and based on current data.
The purpose of their Information Democracy is to provide not only one version of the truth, but the same version of the truth to everyone. This might sound like the same concept, but there is a subtle difference and it deals with the latency of the data and ability to massage results. It tries to eliminate the “My data shows…” statements made by many because the data is owned and seen by all people at the same time. There is no delay to anyone in receiving data, no standardized reports to be re-issued, no side data to be pulled into Excel to get a different look, just the data received in a dashboard/cockpit environment.
The delivery of the data in Mr. Passerini’s Information Democracy is prolific. The same pieces of information are delivered via mobile devices, traditional PCs or P&G’s Business Sphere environment (a conference room of walls with electronic displays filled with information). The same data provided at the same time to all parties involved using multiple delivery devices allows the entire P&G managerial structure to evaluate data wherever they may be. This pervasive data culture is another example of P&Gs increased ability to adapt their business more quickly in a team environment.
The Information Democracy has not come easily at P&G as they have had to overcome obstacles. It has taken a huge effort to change the culture to embrace data for data driven solutions. Security issues make the delivering of data to all the necessary people difficult. The technology to do this is available, but the governance was generally lacking. These issues must be addressed, as P&G has, prior to successfully implementing the idea of an Information Democracy.
Transparency of the data (showing the same data to all necessary parties), timeliness of the data (getting the data to all parties as early as possible), and transportation of the data (delivering the data in multiple formats for easy consumption) make the three branches of the Information Democracy much like the executive, legislative and judicial branches make up our democracy. With these branches and the appropriate data governing processes, there truly can be an Information Democracy allowing data “…of the people, by the people and for the people.”

…Doug

A Data Architect’s Initial View of Data Vault

November 18, 2011

As a data architect that has never heard of the data vault concept, I was skeptical of the value and validity of the data vault.  The timing of the training was not the best considering it was held after a long day of work, but what made it interesting was the passion and expertise shown by Jon Shirey, a Principle Consultant with Lucrum.  The training was an introduction of Dan Lindstedt’s concept of the data vault as a way of showing the validity to the method.

Having designed and implemented Kimball style data marts and data warehouses, I was very skeptical at first.  The Kimball methodology has been proven over and over time and again.  How could something, like the data vault, that has been around a decade receive such little push from the mainstream Business Intelligence groups like TDWI?  How could this seemingly over simplified methodology really do what it claims?

As the presentation continued, Jon started asking if there have ever been issues with the Kimball style with changing the data model.    Of course there has been, hardly ever does anyone ever get the first cut at a star right.  There always seems to be things missed, either by looking in the data or by the subject matter experts forgetting to explain something.  It is one of the largest hurdles to overcome in a star design, but everyone has to deal with that… right?

One huge advantage to the data vault is a way to easily get around this issue.  If something is missed in the design, the data vault is adjusted; the star design is adjusted and reloaded.  No data is ever lost in the staging areas due to the misunderstanding of requirements because the data vault methodology works around this.  The history is kept in the vault so the ‘loading’ into the star could be re-done every time the requirements change without anything ever being lost.

Of course no data architect worth his salt would ever make a design that would lose data like that… Really? The staging areas have always been designed by the team correctly and every single star design has been flawless or the business had not changed how business is done?  The crux of this comes down to not having to ‘cook’ business logic into how the star is loaded from the source.  All the data is loaded into the data vault first and then business logic can be used to load a specific data mart/star schema.  This idea in the methodology is a genius way to never having to say ‘sorry’.

When this benefit of the data vault was covered, my initial thought was, well this is why you have a staging area.  All the data will be kept there so you can always come back to it later.  The more I thought about that idea, the more I recalled how complex the staging area becomes in trying to do this.  There is no way to model all the twists and turns needed to code this way as well as the space needed to keep a traditional ETL environment up and running.  The more I thought about it, the more I began thinking a traditional staging area and its complexities are a huge headache!  The simpler design using the data vault methodology as the persistent staging area offers huge benefits over the traditional Kimball style data warehouse staging area.  This includes repeatable code use in building and populating the data vault as well as the ability to easily account and validate the data.

Validating the data in the star can be daunting due to the business rules involved.  In order to validate the numbers found in the star, the logic used to build the star has to be applied to the source system in order to compare apples to apples.  This is not necessary in the data vault.  The simplicity of the data vault makes auditing and validation easier.  Since the data gets entered into the system as it is in the source, auditing becomes easy.

As this topic was being covered, my Kimball instincts took over and said, “Well, we have to do this anyway and can use the staging area…”  This can be true, but the simplicity of the data vault methodology makes this process so much easier and includes the ability to take commonly used data with different meanings in different departments (think customer to sales vs. marketing…) and easily link that data together.  This is no small task in the Kimball data warehouse world.

After one session and after thinking it over, I have come to the conclusion that using the data vault as the persistent staging area for a Kimball style data warehouse or the non-user accessed data warehouse for the Inmon style is the best way to allow for quick design that truly can be iterative.  The ease to get this design started and the flexibility to easily change what is presented to the user through data marts makes the data vault concept truly unique.

Business Intelligence Symposium VI Focuses on Innovative Analytics

October 17, 2011

LÛCRUM Inc., in conjunction with the University of Cincinnati’s Carl H. Lindner College of Business, has announced that their 6th Business Intelligence Symposium is scheduled for December 13, 2011, at the Carl H. Lindner College of Business focusing on Innovative Analytics. LÛCRUM Inc. and the Carl H. Lindner College of Business have been organizing these events for business executives and IT professionals that use and need analytical data to provide a forum for sharing ideas, stories, experiences and business cards.

Confirmed speakers for the event are:


Drew Boyd

Drew Boyd is a recognized authority, thought leader, educator, and practitioner in the fields of innovation, persuasion, and social media. He is the Executive Director of the Master of Science in Marketing Program and Assistant Professor of Marketing and Innovation at the University of Cincinnati.

Ginny Walker

Virginia (Ginny) Walker is currently GE Aviation’s Chief Enterprise Architect, reporting to the CIO. She grew up in Michigan, obtained her undergraduate degree in Systems Analysis at Miami (OH) University, an MBA from Xavier (OH), and has worked 25 years for GE in Aviation, Corporate and Energy. She has held a diverse span of Information Technology roles…from application development, to data warehouse and data management, to strategic planning, outsourcing, data center management, and ERP implementations.

Jim Goetz

Jim is the Chief Information Officer at Convergys.  He is responsible for the planning, development, and delivery of the technology products and services for the Company globally, including external and internal clients. He reports to Jeff Fox, president and chief executive officer of Convergys.

Mike Cholak

Mike Cholak is the Vice President of Customer Intelligence Services at Convergys.  He leads a team dedicated to delivering a full suite of consulting services to the Company’s clients that help leverage customer intelligence and feedback to optimize long- term customer loyalty and profitability.

For more information:

 

 

 

 

 

 

 

 

 

 

 

 

Data Vault: The Preferred “flavor” for DW Architecture in BI – Part II

October 7, 2011

In Part-I, I explained the place of Data Vault (DV) in Enterprise Data Warehouse Architecture. Now let’s look at different DV entities, rules for each entity and why Dan Lindstedt calls DV a “hybrid” approach. This minimal understanding is necessary before diving into the differences between the various modeling techniques.

The main entities of Data Vault are Hub, Link and Satellite.

HUB Entity (HUB_): This is a defining entity. It contains a unique list of business keys. These are the keys that businesses utilize in everyday operations. For example, employee number, SSN number, Product Code. So the attributes of HUB are:

  • Surrogate Key – This is a Primary Key of hub and holds 1-to-1 relationship with the Business Keys.
  • Business Key – This is a Primary Key of the source system. This can be a composite key. ETL checks this key’s existence in the hub table and inserts one if it doesn’t exist.
  • Load Date Time – The datetime of the key / record when it was first loaded into the table.
  • Record Source – The name of the source the record originated from. This is useful for data traceability.
  • Record Begin Date Time – The datetime when the record became active in the source (if available) or the datetime when ETL has been run.
  • Record End Date Time – The datetime when the record is closed. This can only be detected if the logical deletes are supplied or derived in some manner.

LINK Entity (LINK_): LINKS are constructed once all the HUBS are identified. Links are relationship entities.  These are the physical representation of m-to-m 3NF relationship. It represents the relationship or transaction between hubs. The link table contains the unique list of relationships between hub keys. When a relationship arrives, it simply gets loaded into the table if doesn’t exist. Typically, the link tables translate into fact tables in the datamart access layer. For example, the link between employee number and the project number. The other attributes of LINK are:

  • Surrogate Key – This is a Primary Key of the table and is useful when a link contains more than two hub keys as composite key might cause performance problems. This is also
    useful when the granularity of the link changes (a hub key is added) or history needs to be maintained on the relationships.
  • Hub Key 1 to Hub Key N – The surrogate keys from the hub tables that are involved in the relationship.
  • Load Date Time- The datetime when the record was loaded into the table.
  • Record Source – The source system name from where the record or relationship was loaded from.

SAT Entity (SAT_): SATS holds descriptive information about the hub keys or the relationships. The satellite is most closely resembles Type 2 Dimension. When the data changes, a delta record is inserted into the table and if the certain columns changes faster than others then these can be split into two different tables to avoid data replication. For example, employee details such as employee name, address, phone number, email address in the satellite off of hub  or time spent by an employee on a certain project in satellite off of LINK that stores the relationship between an employees and projects. The other attributes of SAT are:

  • Hub or Link Surrogate Key from HUB or LINK table. This is part of the primary key.
  • Load Date Time – The datetime when the record was inserted into the table. This is part of the primary key.
  • Surrogate Key – This is optional. It is useful when satellites have multiple values such as multiple home addresses.
  • Record Source – The name of the source.
  • Record Begin Date Time – The datetime when the record became active in the source (if known) or the datetime when ETL has been run.
  • Record End Date Time – The datetime when the record is closed.

And stand-alone tables such as calendars, time, code and description tables may be used.

 

Below is a snippet of a Data Vault Model housing borrowers who have taken out Student Loans:

 

 

Modeling Rules for Each Part of the Entity:

FOR HUBS:

  • Hubs keys cannot migrate into other hubs (no parent/child like HUBS).
  • Hubs must be connected through links.
  • More than two hubs can be connected through links.
  • Surrogate keys may be used.
  • Business keys are 1 to 1 relationship with surrogate keys.
  • Hubs primary keys always migrate outward.
  • Hub business keys and primary keys never change.
  • If a hub has two or more satellites, then a point-in-time table can be built for ease of joins.
  • An ‘UNKNOWN’ business key record can be inserted into Hub that can be used to tie other data in links and sats that has no business keys in source. This kind of data is usually a bad/incomplete source data.

FOR LINKS:

  • Links can be connected to other links.
  • Links must have atleast two hubs associated with them in order to instantiated.
  • Surrogate keys may be used.
  • The combination of surrogate business keys made a unique key.
  • Does not contain descriptive data.
  • Does not contain begin and end dates.

FOR SATS:

  • Satellites may be connected to hubs or links.
  • Have 1 and only 1 parent table.
  • Satellites always contain either a load date-time stamp, or a numeric reference to a stand-alone load date-time sequence table.
  • Primary key is a combination of ‘surrogate key’ from either hub or link and the load datetime stamp.
  • Surrogate keys may not be used.
  • Must have a Load End Date to indicate when the CHANGE to the data set has occurred.
  • Satellites are always delta driven. Duplicate rows should not appear.
  • Data is separated into satellite structures based on 1) type of information 2) rate of change.

DV model utilize bits of both 3rd Normal Form and Dimension Modeling concepts.  This approach has made the model simple, flexible, expandable, adaptable and consistent.

  • Adapted many-to-many physical relationship structure from 3NF that became a LINK table.
  • The LINK table is also similar to factless fact in Start Schema.
  • Adapted the notion of 1 to 1 (business key to surrogate key) tracking from dimensional modeling (type 1 dimension).
  • Adapted the notion of “data over time in a separate table/structure” from dimensional modeling (type 2 dimension). This resulted in a SAT table however it is fundamentally
    different, in that it is a child dependent table, whereas the dimension is a parent table to the facts.

This is it for now. In next post(s) we will look into some examples which will show how Data Vault technique overcomes the limitations of 3NF and Dimensional Model structures when applied as an Enterprise Data Warehouse.

- Jyothi

 

Source: tdan.com, danlinstedt.com

Pervasive BI Enabled – Data Vault

October 2, 2011

There were a few interesting posts this past week as it relates to the Data Vault Methodology and emerging trends in Business Intelligence (BI). These are must reads for BI practitioners who wish to stay current on where the industry is heading as well as how and why things should be done in a certain way or method. Again, as Dan Linstedt, the inventor of the methodology is known to say, the data vault was created to solve specifically the Enterprise Data Warehouse (EDW) problem.

The first entry is the slide deck from the recent Advanced Architecture Conference in Denver: http://danlinstedt.com/datavaultcat/datavault-advanced-architecture-conference-slides/

There is a wealth of information contained herein that outlines at a high level the benefits and reasons why the data vault method is enabling pervasive BI. It goes into details about comparing 3rd normal form and the star schema form to the value the data vault brings to the table. It outlines the components of a data vault, including the hubs, links, satellites, PIT tables, and bridge tables.

It was nice to see the connection with the methodology to emerging tools. There are now more than a handful of tools that are undertaking the task of automating the ETL and loading processes of an EDW. The method of the data vault allows for a lot of the technical innovation. Dan presents the case well that businesses today want answers faster and cheaper today as well as having the system flexible enough to evolve with the business over time. Most EDW systems today fall short of being pervasive, and often that is a methodology problem that appears to have an answer.

In addition to these topics, Dan pushed forth the proposition of an “Operational Data Vault” and began to lay out how this connects to Business Intelligence. This section was very interesting and helped to turn on a few light bulbs and progressive thoughts. I love the idea of pushing the intelligence as far as possible while still keeping the style and benefits. I agree with Dan that this is still so new that there are not yet any vendor applications or tools that can help with this…yet.

The presentation does leave a little bit of gap with the more advanced concepts, such as change management and how to leverage a data vault using agile principles. It is also a bit light on how to drive a project plan as well as what to watch out for while attempting to build a data vault style EDW. All in all, he presents a very compelling case as to why the data vault should be strongly considered when delivering Business Intelligence.

The second entry is a public rebuttal to a challenge about the validity of the data vault methodology:
http://danlinstedt.com/datavaultcat/datavault-new-response-to-frank-habers-part-1/

As Dan states, he felt the need to lay out the facts and correct misconceptions from the original author. After you get past the first few sections of disclaimers and background, the article does a great job of comparing and contrasting methodologies and technical styles for building the EDW.

I often see HDSA (persistent historical data storage area) implemented at client sites and they sometimes go by different names (ODS, staging, etc.). I have yet to see a better discussion of the merits of a data vault compared to the HDSA as he presents here. If you are in an environment that has an ODS or persistent history area, please take a moment to read this section, because the warnings and issues are real and one should not make decisions here lightly because of the long term ramifications.

Dan also goes into great detail to itemize the benefits and value proposition of why you take the steps to have a data vault at the core of the EDW.

I would say that the overall tone of the article still makes it a bit hard to read, but if you can overlook the defensive stance, you will find many gems that will help solidify and justify the purpose and value of the data vault compared to any other technique.

Data Vault: The Preferred “Flavor” for DW Architecture in BI – Part I

September 21, 2011

Business Intelligence (BI) is todays ‘MANTRA’ chanted by almost every business. Companies want to outsmart the competition. Companies are ready to invest big bucks and human power to build a sophisticated BI system so that they can have the knowledge that others don’t and seize on the opportunities in the market before others do. BI shows the Future Value of Your Business.

BI systems need DATA and every business has terabytes of real data which can provide them with the information and knowledge they need to make the right decisions on time. But the key is to turn that data into information in a timely, efficient and effective manner once the WHAT AND WHY questions are answered i.e., what information is needed, what matters and why that is required.  In today’s market, every business is in a RACE. The race to conquer others. The race to generate more gains/profits. The race to foresee the risks early on so that they can be avoided.  So time is of the essence here.

An optimized BI system integrates large volume of external and internal near real time data to allow management to create opportunities by making intelligent decisions after performing predictive analysis of their approach on the business. A good BI System is like a GPS. An effective GPS is one that not only shows you a route to your destination but also guides you when you hit roadblock, gives up-to-date external conditions (constructions / traffic) information, provides multiple routes to choose from, suggests you with alternatives for shorter and fastest routes, predict the total time based on your driving behavior, tells you what to expect next etc. Just knowing the path to your destination is not sufficient. You need to know many other factors during the whole ride to reach destination on time and without any hurdles.

For a good integrated BI system, a good Data warehouse architecture needs to be in place.  Data warehouse architecture is “an integrated set of products that enable the extraction and transformation of operational data to be loaded into a database for end-user analysis and reporting”. Below are the pictorial representations of different “flavors” of DW architectures.

Methodologies used by different architecture:

Kimball’s DW Architecture – Is based on ‘Bottom-UP’ methodology.

Inmon’s DW Architecture – Is based on ‘Top-Down’ methodology.

Dan Lindstedt’s Data Vault DW Architecture – Is based on ‘HYBRID DESIGN’

The first two design methods have some limitations for Data Warehouse layer such as inflexibility and unresponsiveness to the changing departmental needs during the implementation phase, insufficient auditability of data back to its source system, inability to integrate unstructured data, inability to rapidly respond to changes (organizational changes, new ERP implementations) or difficult to load type 2 dimensions in real time. This is where DATA VAULT came in to rescue. Data Vault follows a ‘HYBRID DESIGN’ methodology which follows ‘TOP-DOWN ARCHITECTURE WITH A BOTTOM-UP DESIGN’.

The model is a mix of normalized modeling components with type 2 dimensional properties. In this model, the DW serves as a backend system that houses historical data which is integrated by the business keys. All data ‘good, bad, incomplete’ gets loaded into the data vault and all the cleansing and application of business rules takes place downstream i.e., out of DW. This means that Data Vault model is geared to be strictly a data warehouse layer, not as a data delivery layer which still requires physical or Virtual star schemas or cubes for Business Users or BI tools to access.

Bill Inmon in 2008 stated that the “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.”

In Part 2 and 3, I am going to explain different components of Data Vault and it’s power with the help of some examples.  That will clearly explains why the Data Vault should be a preferred “flavor” for different businesses.

- Jyothi Kaparthi

15 Steps to Better Business Intelligence Architecture

June 21, 2011

There are lots of tips and tricks out there that help technicians develop successful Business Intelligence (BI) implementations. Some of these ideas come from real world experience, some from industry experts, and some from personal preference. I thought it would be a good idea to put together a list of some of the things to consider when building out the architecture.

Therefore, if you want to jump start your BI implementation, I suggest considering the following steps to help you minimize rework and avoid common pitfalls that lead to higher costs and longer delivery time:

1. Change the name of the dimension TIME_DMN to DATE_DMN because there should be a date and a separate time dimension available in the EDW. It is not unusual to see this combined together, but there is much more flexibility with a separate physical date dimension and time dimension as well as much fewer rows.

2. Use a tool to load all the date value possibilities desired instead of coding this by hand. Perform a one time “seeding” of the dimension and be finished with it. (See http://www.kimballgroup.com/html/MDWT08Content/Ch07_Date_Dim_2000-2020.xlsx). Why reinvent the wheel when the experts have spelled it out for all to use?

3. Develop as much of the ETL code as close to the database as possible. The database will give you the best performance if you can get away with it. I generally use ETL tools to call a SQL statement (views) in the database and write into another object in the same database. The ETL tool is more of a job control mechanism rather than a comprehensive logic flowing mechanism. This does require a methodology that keeps the ETL pattern based and simple like a data-vault.

4. Use a “data vault” instead of the ODS as the core of the data warehouse to store all history. Loading should be from files to a staging area. Load from the staging to data-vault and from data-vault to data-mart. The ODS could be used as an online repository for the files or loads and should closely resemble the file layouts and no business logic applied. The ODS may serve other purposes as well and may need to be treated as another source system.

5. Use a tool or combination of tools to generate ETL code and database structures as well as document. These tools exist on the market today and obviously code generators will consistently outperform even the best coders (see http://thefuturevalueofbusiness.com/data-vault-applying-the-story-of-john-henry.htm).

6. Fact tables should all be numeric columns. In a perfect world and for performance reasons as well as logic, stick to numbers. Try to move all non-numeric columns to dimensions. The databases will thank you when the table has lots of rows but the width of the rows is as narrow as possible.

7. Constraints in a star schema should be “RELY” – the ETL process should guarantee the validity. This one relates to Oracle, but is useful to make a point that the downstream systems will still see the constraints but the database does not need to incur the overhead of maintaining the relationship. The ETL process will do a fine job of making sure this is correct.

8. Check constraints are unnecessary as well (check for NOT NULL) on tables. This is more overhead that can be avoided. This is more important in transaction systems and again the ETL process should be dealing with this if it is important.

9. Dimensions should have indexes on columns that are often used in GROUP BY clauses. Stars will perform better if there are indexes against the columns that are mostly used to group and filter on.

10. The attributes about the ETL job should be stored in a separate table with a FK in all the tables that point back to the run. This helps keep fact tables lean and aids in consistency of the ETL process itself. It may also help in the cleanup effort of errant ETL runs.

11. Data governance should be happening now to review, cleanse, establish business rules, and sign off on the master dimensional data. It’s never too early to start on this process. The sooner lists and value sets are presented to the business around key business elements the earlier the discussion can begin to resolve issues and assign ownership.

12. Information that is used for reporting should never be destroyed (Preliminary vs. Final) unless signed off by the business. Once the information is loaded into the BI environment, it is a wise idea to keep it around. At some time the question will be asked “Why did we make THAT decision?” Keeping this information off-line or on-line is a different discussion.

13. All character columns should be defined as VARCHAR(255) instead of variable smaller widths in order to handle most column size situations. I see this often and I am puzzled as to why it is still being done this way. Modern databases are pretty good at storing variable length fields and not wasting space in the process. You will have a problem once you come across a new data source that has a value that is wider than what you sized the existing column. This sometimes comes along after a new acquisition gives you a new data source for your business.

14. Development should not be locked down. It stifles productivity and creates a bottleneck. There still needs to be peer and code reviews but that should happen prior to promotion to the Test environment while things are being designed and developed. Source control helps here too. Prototyping requires quick response and the faster things can change while the design is being “vetted” the better.

15. Use database sequences or identity columns instead of using what the ETL tool provides. It is generally a bad idea to rely on the ETL tool to set the “surrogate” keys. Moving code between environments becomes a real challenge. Databases have built functionality for this purpose, so it’s a good idea to use it.

Even implementing one of these steps will be beneficial, you should see marked improvement if you can do the whole list.

I am sure there are many more steps. I would love to hear your feedback and experience. Have you tried the steps and had a different outcome? What steps would you add or remove?

Data Vault: Applying the Story of John Henry

May 9, 2011

John Henry is an American folk hero, notable for having raced against a steam powered hammer and won, only to die in victory with his hammer in his hand. He has been the subject of numerous songs, stories, plays, and novels. He had a 20 pound (9 kg) hammer that he thought was light.

Like other “Big Men” such as Paul Bunyan and Pecos Bill, John Henry also served as a mythical representation of a group within the melting pot of the 19th-century working class. He grows to become the greatest “steel-driver” in the mid-century push to erect the railroads across the mountains to the West. When the owner of the railroad buys a steam-powered hammer to do the work of his mostly black driving crew, to save his job and the jobs of his men, John Henry challenges the owner to a contest: himself alone versus the steam hammer. John Henry beats the machine, but exhausted, collapses and dies.

His story is usually seen as an archetypal illustration of the futility of fighting the technological progress that was evident in the 19th century upset of traditional physical labor roles. Some labor advocates interpret the legend as illustrating that even the most skilled workers of time-honored practices are marginalized when companies are more interested in efficiency and production. Although John Henry proved himself more powerful than the steam-drill, he worked himself to death and was replaced by the machine anyway.

Note: the above excerpt was taken from http://en.wikipedia.org/wiki/John_Henry_(folklore).

So how does this story apply to the Information Technology industry in the 21st century? I find in my travels a similar resistance to efficiency and production, although it would be due to different motives. In this day and age, the most opportune ways to drive down costs over the long term is to automate as much as possible. I still see and talk to technicians that are still trying to build and manage the entire Business Intelligence environment by hand, basically wearing themselves out trying to keep up with maintenance and the rate of change demanded by business requirements.

So what is the root cause of the resistance? I would suggest that there is a comfort level in the technologist in us that entire works hard to reach a level of competence and skills and then, once attaining that level, resists enabling technologies. This is akin to John Henry feeling that he was able or better than a machine in doing a particular job. We, myself included, are comfortable with what we know how to do. Fear holds us back as well, because if we are able to generate the ETL job process, we will be replacing ETL developers with machines. Change requires us to leave our comfort zone and see beyond our own fears, in order to deliver real value to our customers.

In the information technology industry, as it has evolved over time, components move more to a modular or commodity approach as efficiency and refinement processes are applied over time. You see this with software products on the market today that have “wizards” that do routine or repetitive tasks. The more repetitive tasks can be commoditized, the more efficient the design and development process becomes. This saves real time and real dollars on development projects.

So what is one component, among many, to target for this approach? So in the realm of Business Intelligence, there is a pattern based approach to data modeling called the Data Vault ( www.learndatavault.com ). This pattern is manifested in three basic types of structures: hubs, links, and satellites. The nature of these structures allow for a repetitive process to be involved in the loading of the Enterprise Data Warehouse (EDW). For the most part, information is inserted, not updated or deleted. It tracks all changes over time, is integrated, and is auditable back to the source system. From the EDW, the standard data marts can be generated (and in some cases, this can be done with views). Because the pattern is restricted to three basic types, it becomes simpler to generate the loading ETL process using a machine. It is still not a perfect science, and depends on the value of the source model – but the point is the more that can be generated will save time and money over the long term.

So how does the technologist address the “John Henry” in each of us in order to provide the best service to our customers? First, one has to admit that it is good and beneficial to have a machine generate code and processes that were once done by hand. Secondly, one has to be open-minded to realize the benefit. Thirdly, one has to have the courage to take on change, by realizing that inside the change component, there is opportunity for future growth. This is manifested in customers being pleased that high quality projects are produced within a predictable cost and timeframe. By products of this method also reduce support costs over time. It’s a win for the technologist as well as success for the customer.

It’s time to embrace the “machine”, and allow it to do what it does best. The technician needs to move on ahead and add the thought leadership to the projects that will harness the power of smart design with efficient development patterns. The knowledge and knowhow is here today. I know. I have seen it done. I will continue to lay out the design patterns over the next few entries of what it takes to move in this direction and leverage the power of the data vault to bring world class solutions to our clients.

Next Page »