15 Steps to Better Business Intelligence Architecture
June 21, 2011
There are lots of tips and tricks out there that help technicians develop successful Business Intelligence (BI) implementations. Some of these ideas come from real world experience, some from industry experts, and some from personal preference. I thought it would be a good idea to put together a list of some of the things to consider when building out the architecture.
Therefore, if you want to jump start your BI implementation, I suggest considering the following steps to help you minimize rework and avoid common pitfalls that lead to higher costs and longer delivery time:
1. Change the name of the dimension TIME_DMN to DATE_DMN because there should be a date and a separate time dimension available in the EDW. It is not unusual to see this combined together, but there is much more flexibility with a separate physical date dimension and time dimension as well as much fewer rows.
2. Use a tool to load all the date value possibilities desired instead of coding this by hand. Perform a one time “seeding” of the dimension and be finished with it. (See http://www.kimballgroup.com/html/MDWT08Content/Ch07_Date_Dim_2000-2020.xlsx). Why reinvent the wheel when the experts have spelled it out for all to use?
3. Develop as much of the ETL code as close to the database as possible. The database will give you the best performance if you can get away with it. I generally use ETL tools to call a SQL statement (views) in the database and write into another object in the same database. The ETL tool is more of a job control mechanism rather than a comprehensive logic flowing mechanism. This does require a methodology that keeps the ETL pattern based and simple like a data-vault.
4. Use a “data vault” instead of the ODS as the core of the data warehouse to store all history. Loading should be from files to a staging area. Load from the staging to data-vault and from data-vault to data-mart. The ODS could be used as an online repository for the files or loads and should closely resemble the file layouts and no business logic applied. The ODS may serve other purposes as well and may need to be treated as another source system.
5. Use a tool or combination of tools to generate ETL code and database structures as well as document. These tools exist on the market today and obviously code generators will consistently outperform even the best coders (see http://thefuturevalueofbusiness.com/data-vault-applying-the-story-of-john-henry.htm).
6. Fact tables should all be numeric columns. In a perfect world and for performance reasons as well as logic, stick to numbers. Try to move all non-numeric columns to dimensions. The databases will thank you when the table has lots of rows but the width of the rows is as narrow as possible.
7. Constraints in a star schema should be “RELY” – the ETL process should guarantee the validity. This one relates to Oracle, but is useful to make a point that the downstream systems will still see the constraints but the database does not need to incur the overhead of maintaining the relationship. The ETL process will do a fine job of making sure this is correct.
8. Check constraints are unnecessary as well (check for NOT NULL) on tables. This is more overhead that can be avoided. This is more important in transaction systems and again the ETL process should be dealing with this if it is important.
9. Dimensions should have indexes on columns that are often used in GROUP BY clauses. Stars will perform better if there are indexes against the columns that are mostly used to group and filter on.
10. The attributes about the ETL job should be stored in a separate table with a FK in all the tables that point back to the run. This helps keep fact tables lean and aids in consistency of the ETL process itself. It may also help in the cleanup effort of errant ETL runs.
11. Data governance should be happening now to review, cleanse, establish business rules, and sign off on the master dimensional data. It’s never too early to start on this process. The sooner lists and value sets are presented to the business around key business elements the earlier the discussion can begin to resolve issues and assign ownership.
12. Information that is used for reporting should never be destroyed (Preliminary vs. Final) unless signed off by the business. Once the information is loaded into the BI environment, it is a wise idea to keep it around. At some time the question will be asked “Why did we make THAT decision?” Keeping this information off-line or on-line is a different discussion.
13. All character columns should be defined as VARCHAR(255) instead of variable smaller widths in order to handle most column size situations. I see this often and I am puzzled as to why it is still being done this way. Modern databases are pretty good at storing variable length fields and not wasting space in the process. You will have a problem once you come across a new data source that has a value that is wider than what you sized the existing column. This sometimes comes along after a new acquisition gives you a new data source for your business.
14. Development should not be locked down. It stifles productivity and creates a bottleneck. There still needs to be peer and code reviews but that should happen prior to promotion to the Test environment while things are being designed and developed. Source control helps here too. Prototyping requires quick response and the faster things can change while the design is being “vetted” the better.
15. Use database sequences or identity columns instead of using what the ETL tool provides. It is generally a bad idea to rely on the ETL tool to set the “surrogate” keys. Moving code between environments becomes a real challenge. Databases have built functionality for this purpose, so it’s a good idea to use it.
Even implementing one of these steps will be beneficial, you should see marked improvement if you can do the whole list.
I am sure there are many more steps. I would love to hear your feedback and experience. Have you tried the steps and had a different outcome? What steps would you add or remove?
Oracle supports Microsoft
May 16, 2008
I can’t tell you how many times I’ve been in conversations around the topic of “Oracle vs. Microsoft”. I’ve heard both sides of the story ranging from “SQL Server for mission critical operations…are you crazy!” to “Oracle costs me my first born child…year after year!”. While these discussions are often entertaining, the line delineating the two database giants is blurring by each subsequent release.
In my years consulting for LÛCRUM, I have worked for numerous clients that have had installations of both Oracle and Microsoft running in their environments. With recent statistics estimating that Oracle controls >50% of the database market and Microsoft controlling >50% of the server operating system market, are you surprised? SQL Server only runs on Microsoft. Oracle offers more operating system versatility. While you’ll see UNIX and Linux installations, Oracle’s ability to run on Microsoft remains strong and they are improving their functionality with respect to Microsoft development. Where might an Oracle database deployed on a Microsoft server make most sense? In the small and mid-sized business market (SMB). In the SMB market, Oracle has competitively priced versions such as Oracle Database Standard Edition and Standard Edition One.
So what advantages does running Oracle on Microsoft have to offer? First, Oracle has tight integration with Active Directory and Windows Security Framework. Items such as single sign-on and security via database role and Active Directory group fall into this category. Next, Oracle offers 32-bit and 64-bit versions. In the 32-bit version, Oracle is able to utilize up to 3GB (out of a 4GB O.S. maximum) of system memory for database use. Finally, Oracle has also been working on enhancing its ability to integrate with the Windows development suite, specifically Visual Studio 2008. Oracle supports .NET in 3 ways. The Oracle Data Provider for .NET leverages ADO.NET API and allows .NET applications to access Oracle data. These APIs should be familiar to most Microsoft developers. In addition, through an add-in (free for that matter), developers can work with Oracle services via Visual Studio 2005 (and 2008 as previously mentioned). Through the development suite, developers have access to various wizards to perform various database tasks (i.e. DDL), a procedure editor (for PL/SQL procedures, packages, and functions), a Debugger for runtime error interaction, and integrated help for items such as Oracle error reference, SQL, and PL/SQL user manuals. Lastly, Oracle has integrated .NET extensions directly inside the database. This allows developers to created stored procedures and functions using C# or VB.NET within Visual Studio. This code can then be deployed to the database and referenced wherever a stored procedure or function is permitted.
Oracle has shown it is advantageous to offer solutions that fit neatly into an operating system that controls the majority of the server market, even if that vendor also happens to be a major competitor in the database market. Offer a product that is extensible and easy to use with development GUIs is sure to give you a seat at the table when it comes to choosing a solution for your organization. That is precisely why Oracle supports Microsoft (most of the time <grin>).
Dave


