Ever Heard of Data Auditing?
December 22, 2011
I am sure most IT professionals, especially the ones dealing with data, have heard of data quality. The idea of monitoring data to ensure the data fit into the intended use with a high level of accuracy. When creating a Business Intelligence (BI) solution how can data quality be continually monitored through the entire solution? One method is called Data Auditing. The concept might not be new, but the formalization of the process is.
Data Auditing is the process of ensuring data quality from the beginning of the BI process to the final destination in a repeatable and measured way. This includes validation of data that arrive from source to staging to star to cube (if exists). Where ever business logic can be implemented, a data audit can be used to make sure the quality of the data is consistent.
One example of data auditing I have used was for a ‘technical glitch’ with the use of SQL Server Analysis Services (SSAS). The decision was made to not change source data that was of bad quality and load it into the Kimball star schema as it was. (Now this decision can be debated, but one major reason this was done was to easily expose data quality issues to the users as they believed there would be an inconceivably small amount of data quality issues.) What was discovered with using SSAS is it would not handle dates with a year prior to around 1500. Some of the dates in the source system had the year 200 instead of 2000 and so on. A Data Audit routine was designed to look for these dates into the stage tables and change them to a pre-determined default date. This allowed the SSAS job to complete and the cube to process.
This process seems much like any other data quality processes. The true auditing came from how these instances (and the others) were reported. The code developed to catch the data quality issues also entered data into “performance” stars schemas designed to provide data on the nightly process. This data was then shown by a dashboard used by the internal IT, BI staff. Every morning we could see how many rows of data were caught by each audit. This allowed us to make quick decisions on how to handle the data to change in the source system and even change in the data warehouse (including staging tables to make sure there was not a type 2 slowly changing dimension row added by the change to the original source data).
There were other data audits that summed specific counts of rows from the source system and made sure that count was the same in the star and cube due to how the star was loaded. Another data audit allowed us to show the measures in the fact tables and the cube were the same based on logic used. These types of data audits are not used to catch data quality, but to explicitly show that on a day-to-day basis, the results are the same. This audit was extremely necessary to build confidence in the data. Confidence in “the numbers” was extremely important to our clients and this was the simplest way for us to convince our end users they were getting what they were supposed to get.
The best way to implement a data auditing solution is to use the existing BI tools to build the report, dashboard or any other means to expose the audit. Even the simple use of Excel against the cube and star can be used. Any way that can be easily maintained is prefered. Of course the data needs to be understood in order to make sure this is done correctly. If data is distributed in the star to form a lower level of granularity, then it needs to be summed back to the original level and compared to what is in the source. This could mean there are rounding errors present, but that should only provide around a penny difference.
A problem with Data Auditing is the trade-off of time developing the data auditing process takes away from time to develop the business needed parts of the BI solution. One way to incorporate this is to build it into each project. The first project to utilize the concept of data auditing will take longer due to the need to build the underlying data structure and processes. Once this is started and built, the next project to use the structure will take less time to ‘plug in’ just like the idea of re-using conformed dimensions.
Data Auditing’s benefits provide a repeatable way to show data throughout the entire BI process is correct. This concept is part of a good data quality/data governance solution. The ability to ‘watch’ the data as it goes through the entire BI process to make sure it means what it supposed to mean will provide a security blanket for the end users. The end users can say “This data is from the data warehouse and I am positively sure it is correct.” How much value to the business and reassurance to IT does that grant?
– Doug
How Necessary is a Data Warehouse?
November 28, 2011
The largest and most complex aspect of Business Intelligence (BI) is the data warehouse. In this context, the data warehouse is the repository of data generally fed from many sources to keep historical perspectives of an entity’s data. It is a behemoth that is generally expensive, slow to build, complicated in structure and difficult to maintain. How necessary is it? Does a company need the actual, physical data warehouse to have a successful and sustainable business intelligence (BI) program?
There are many design methodologies that take these issues into consideration. There are advantages and disadvantages to both traditional (and non-traditional) methodologies which I do not cover in this post. My goal is to bring up points of view of why and when a data warehouse may or may not be used. What I would like to cover is:
- The Corporate Information Factory (CIF), based on the Inmon approach
- The Kimball Style of data warehousing
- BI using no data warehouse at all
Corporate Information Factory
The Corporate Information Factory methodology, in a nutshell, says there is no way of getting around this inevitable fact of the need for a data warehouse. In order to have a successful and sustainable BI program, a data warehouse is needed. Not only is it needed, it needs to be completely designed, built and populated prior to any further analysis or BI work can be done. This is due to the nature of how business concepts are intertwined within each other necessitating the big picture view. This style also views the architecture process more from the IT/data perspective compared to the business need point of view.
Kimball Methodology
The Kimball methodology of data warehouse design is not as structured and regimented as the Corporate Information Factory. The Kimball data warehouse is the sum of its parts; meaning one area of the business could be designed, developed and deployed providing BI insight while other aspects of the business have not been discussed. This concept will speed the development of the data warehouse compared to the CIF, but the underlying data warehouse can become much more complex as more and more is added to it along with the possibility of rework. This style views the architecture process from the business needs point of view compared to the IT/data perspective.
No Data Warehouse?
What about not using a data warehouse? In the new age of Data as a Service (DaaS), Master Data Management along with Service Oriented Architecture (SOA), why re-store data from disparate systems? Why not store the metadata of where the data is found and attach the business logic to the SOA call? This can be a very powerful way to gain insight into data. The idea that the development of a data warehouse can be done without the data warehouse. There are already tools that will do this. One of them is Qlikview from Qliktech. The basic premise behind this tool is to allow the user to develop the Transform and Load aspects of ETL (Extract Transform and Load) in memory to delivery very quick analytics in a solid visual manner. This tool is not a methodology, but SOA could be used in a larger context with the same principles. This style views the architecture process as something the business could do, but IT does not have to do.
The idea that a data warehouse is necessary for a successful BI implementation is not necessarily true. A data warehouse is not necessary to have analytics or provide a picture of the data you have. I believe it is very questionable to say this process is sustainable to leverage every benefit for BI. The very important aspect of BI that cannot be overcome by SOA, or in-memory analytic tools like Qlikview, is the entire reason the data warehouse first came about.
The decision for building or not building a datawarehouse is all about the history of the data. Not the history that is required by law to be kept like financial data or what in many cases is considered ‘facts’ in the Kimball style. If this were the only history needed, a data warehouse would be less necessary. The type of history that is important is the history that cannot be reproduced within the source systems. This is the history of changes made that are not kept by the source system. In many cases a customer’s address may not be historically important in a transactional/source system so only the most current record is kept. If that history is not kept somewhere (like a data warehouse), analytics of historical purchases of products will not show a true picture of what actually happened. It will only show the picture of what is in the source system at the current point in time. This situation is the quinticential lynchpinn for why a data warehouse should be necessary. The ability to track and keep history that is not kept in the source system is something SOA, or in-memory BI is not capable of reproducing.
If the desired BI capability for the business is operational in nature, a data warehouse will not offer any significant benefit over SOA. This is a short sighted tactical means of looking at data and cannot provide strategic insight, but it certainly could be the best way to answer that need for data given the circumstances. This would not be the end-all-be-all for BI, but it certainly can provide means to start a program.
So does this completely answer the question “Is a data warehouse necessary for BI?” The data warehouse is necessary for a complete and sustainable BI program, but it does not have to be the start of the program. So… of course the answer to that is still…. “It depends…”
… Doug
Data Vault: The Preferred “flavor” for DW Architecture in BI – Part II
October 7, 2011
In Part-I, I explained the place of Data Vault (DV) in Enterprise Data Warehouse Architecture. Now let’s look at different DV entities, rules for each entity and why Dan Lindstedt calls DV a “hybrid” approach. This minimal understanding is necessary before diving into the differences between the various modeling techniques.
The main entities of Data Vault are Hub, Link and Satellite.
HUB Entity (HUB_): This is a defining entity. It contains a unique list of business keys. These are the keys that businesses utilize in everyday operations. For example, employee number, SSN number, Product Code. So the attributes of HUB are:
- Surrogate Key – This is a Primary Key of hub and holds 1-to-1 relationship with the Business Keys.
- Business Key – This is a Primary Key of the source system. This can be a composite key. ETL checks this key’s existence in the hub table and inserts one if it doesn’t exist.
- Load Date Time – The datetime of the key / record when it was first loaded into the table.
- Record Source – The name of the source the record originated from. This is useful for data traceability.
- Record Begin Date Time – The datetime when the record became active in the source (if available) or the datetime when ETL has been run.
- Record End Date Time – The datetime when the record is closed. This can only be detected if the logical deletes are supplied or derived in some manner.
LINK Entity (LINK_): LINKS are constructed once all the HUBS are identified. Links are relationship entities. These are the physical representation of m-to-m 3NF relationship. It represents the relationship or transaction between hubs. The link table contains the unique list of relationships between hub keys. When a relationship arrives, it simply gets loaded into the table if doesn’t exist. Typically, the link tables translate into fact tables in the datamart access layer. For example, the link between employee number and the project number. The other attributes of LINK are:
- Surrogate Key – This is a Primary Key of the table and is useful when a link contains more than two hub keys as composite key might cause performance problems. This is also
useful when the granularity of the link changes (a hub key is added) or history needs to be maintained on the relationships. - Hub Key 1 to Hub Key N – The surrogate keys from the hub tables that are involved in the relationship.
- Load Date Time- The datetime when the record was loaded into the table.
- Record Source – The source system name from where the record or relationship was loaded from.
SAT Entity (SAT_): SATS holds descriptive information about the hub keys or the relationships. The satellite is most closely resembles Type 2 Dimension. When the data changes, a delta record is inserted into the table and if the certain columns changes faster than others then these can be split into two different tables to avoid data replication. For example, employee details such as employee name, address, phone number, email address in the satellite off of hub or time spent by an employee on a certain project in satellite off of LINK that stores the relationship between an employees and projects. The other attributes of SAT are:
- Hub or Link Surrogate Key from HUB or LINK table. This is part of the primary key.
- Load Date Time – The datetime when the record was inserted into the table. This is part of the primary key.
- Surrogate Key – This is optional. It is useful when satellites have multiple values such as multiple home addresses.
- Record Source – The name of the source.
- Record Begin Date Time – The datetime when the record became active in the source (if known) or the datetime when ETL has been run.
- Record End Date Time – The datetime when the record is closed.
And stand-alone tables such as calendars, time, code and description tables may be used.
Below is a snippet of a Data Vault Model housing borrowers who have taken out Student Loans:
Modeling Rules for Each Part of the Entity:
FOR HUBS:
- Hubs keys cannot migrate into other hubs (no parent/child like HUBS).
- Hubs must be connected through links.
- More than two hubs can be connected through links.
- Surrogate keys may be used.
- Business keys are 1 to 1 relationship with surrogate keys.
- Hubs primary keys always migrate outward.
- Hub business keys and primary keys never change.
- If a hub has two or more satellites, then a point-in-time table can be built for ease of joins.
- An ‘UNKNOWN’ business key record can be inserted into Hub that can be used to tie other data in links and sats that has no business keys in source. This kind of data is usually a bad/incomplete source data.
FOR LINKS:
- Links can be connected to other links.
- Links must have atleast two hubs associated with them in order to instantiated.
- Surrogate keys may be used.
- The combination of surrogate business keys made a unique key.
- Does not contain descriptive data.
- Does not contain begin and end dates.
FOR SATS:
- Satellites may be connected to hubs or links.
- Have 1 and only 1 parent table.
- Satellites always contain either a load date-time stamp, or a numeric reference to a stand-alone load date-time sequence table.
- Primary key is a combination of ‘surrogate key’ from either hub or link and the load datetime stamp.
- Surrogate keys may not be used.
- Must have a Load End Date to indicate when the CHANGE to the data set has occurred.
- Satellites are always delta driven. Duplicate rows should not appear.
- Data is separated into satellite structures based on 1) type of information 2) rate of change.
DV model utilize bits of both 3rd Normal Form and Dimension Modeling concepts. This approach has made the model simple, flexible, expandable, adaptable and consistent.
- Adapted many-to-many physical relationship structure from 3NF that became a LINK table.
- The LINK table is also similar to factless fact in Start Schema.
- Adapted the notion of 1 to 1 (business key to surrogate key) tracking from dimensional modeling (type 1 dimension).
- Adapted the notion of “data over time in a separate table/structure” from dimensional modeling (type 2 dimension). This resulted in a SAT table however it is fundamentally
different, in that it is a child dependent table, whereas the dimension is a parent table to the facts.
This is it for now. In next post(s) we will look into some examples which will show how Data Vault technique overcomes the limitations of 3NF and Dimensional Model structures when applied as an Enterprise Data Warehouse.
- Jyothi
Source: tdan.com, danlinstedt.com
Data Vault: The Preferred “Flavor” for DW Architecture in BI – Part I
September 21, 2011
Business Intelligence (BI) is todays ‘MANTRA’ chanted by almost every business. Companies want to outsmart the competition. Companies are ready to invest big bucks and human power to build a sophisticated BI system so that they can have the knowledge that others don’t and seize on the opportunities in the market before others do. BI shows the Future Value of Your Business.
BI systems need DATA and every business has terabytes of real data which can provide them with the information and knowledge they need to make the right decisions on time. But the key is to turn that data into information in a timely, efficient and effective manner once the WHAT AND WHY questions are answered i.e., what information is needed, what matters and why that is required. In today’s market, every business is in a RACE. The race to conquer others. The race to generate more gains/profits. The race to foresee the risks early on so that they can be avoided. So time is of the essence here.
An optimized BI system integrates large volume of external and internal near real time data to allow management to create opportunities by making intelligent decisions after performing predictive analysis of their approach on the business. A good BI System is like a GPS. An effective GPS is one that not only shows you a route to your destination but also guides you when you hit roadblock, gives up-to-date external conditions (constructions / traffic) information, provides multiple routes to choose from, suggests you with alternatives for shorter and fastest routes, predict the total time based on your driving behavior, tells you what to expect next etc. Just knowing the path to your destination is not sufficient. You need to know many other factors during the whole ride to reach destination on time and without any hurdles.
For a good integrated BI system, a good Data warehouse architecture needs to be in place. Data warehouse architecture is “an integrated set of products that enable the extraction and transformation of operational data to be loaded into a database for end-user analysis and reporting”. Below are the pictorial representations of different “flavors” of DW architectures.
Methodologies used by different architecture:
Kimball’s DW Architecture – Is based on ‘Bottom-UP’ methodology.
Inmon’s DW Architecture – Is based on ‘Top-Down’ methodology.
Dan Lindstedt’s Data Vault DW Architecture – Is based on ‘HYBRID DESIGN’
The first two design methods have some limitations for Data Warehouse layer such as inflexibility and unresponsiveness to the changing departmental needs during the implementation phase, insufficient auditability of data back to its source system, inability to integrate unstructured data, inability to rapidly respond to changes (organizational changes, new ERP implementations) or difficult to load type 2 dimensions in real time. This is where DATA VAULT came in to rescue. Data Vault follows a ‘HYBRID DESIGN’ methodology which follows ‘TOP-DOWN ARCHITECTURE WITH A BOTTOM-UP DESIGN’.
The model is a mix of normalized modeling components with type 2 dimensional properties. In this model, the DW serves as a backend system that houses historical data which is integrated by the business keys. All data ‘good, bad, incomplete’ gets loaded into the data vault and all the cleansing and application of business rules takes place downstream i.e., out of DW. This means that Data Vault model is geared to be strictly a data warehouse layer, not as a data delivery layer which still requires physical or Virtual star schemas or cubes for Business Users or BI tools to access.
Bill Inmon in 2008 stated that the “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.”
In Part 2 and 3, I am going to explain different components of Data Vault and it’s power with the help of some examples. That will clearly explains why the Data Vault should be a preferred “flavor” for different businesses.
- Jyothi Kaparthi
15 Steps to Better Business Intelligence Architecture
June 21, 2011
There are lots of tips and tricks out there that help technicians develop successful Business Intelligence (BI) implementations. Some of these ideas come from real world experience, some from industry experts, and some from personal preference. I thought it would be a good idea to put together a list of some of the things to consider when building out the architecture.
Therefore, if you want to jump start your BI implementation, I suggest considering the following steps to help you minimize rework and avoid common pitfalls that lead to higher costs and longer delivery time:
1. Change the name of the dimension TIME_DMN to DATE_DMN because there should be a date and a separate time dimension available in the EDW. It is not unusual to see this combined together, but there is much more flexibility with a separate physical date dimension and time dimension as well as much fewer rows.
2. Use a tool to load all the date value possibilities desired instead of coding this by hand. Perform a one time “seeding” of the dimension and be finished with it. (See http://www.kimballgroup.com/html/MDWT08Content/Ch07_Date_Dim_2000-2020.xlsx). Why reinvent the wheel when the experts have spelled it out for all to use?
3. Develop as much of the ETL code as close to the database as possible. The database will give you the best performance if you can get away with it. I generally use ETL tools to call a SQL statement (views) in the database and write into another object in the same database. The ETL tool is more of a job control mechanism rather than a comprehensive logic flowing mechanism. This does require a methodology that keeps the ETL pattern based and simple like a data-vault.
4. Use a “data vault” instead of the ODS as the core of the data warehouse to store all history. Loading should be from files to a staging area. Load from the staging to data-vault and from data-vault to data-mart. The ODS could be used as an online repository for the files or loads and should closely resemble the file layouts and no business logic applied. The ODS may serve other purposes as well and may need to be treated as another source system.
5. Use a tool or combination of tools to generate ETL code and database structures as well as document. These tools exist on the market today and obviously code generators will consistently outperform even the best coders (see http://thefuturevalueofbusiness.com/data-vault-applying-the-story-of-john-henry.htm).
6. Fact tables should all be numeric columns. In a perfect world and for performance reasons as well as logic, stick to numbers. Try to move all non-numeric columns to dimensions. The databases will thank you when the table has lots of rows but the width of the rows is as narrow as possible.
7. Constraints in a star schema should be “RELY” – the ETL process should guarantee the validity. This one relates to Oracle, but is useful to make a point that the downstream systems will still see the constraints but the database does not need to incur the overhead of maintaining the relationship. The ETL process will do a fine job of making sure this is correct.
8. Check constraints are unnecessary as well (check for NOT NULL) on tables. This is more overhead that can be avoided. This is more important in transaction systems and again the ETL process should be dealing with this if it is important.
9. Dimensions should have indexes on columns that are often used in GROUP BY clauses. Stars will perform better if there are indexes against the columns that are mostly used to group and filter on.
10. The attributes about the ETL job should be stored in a separate table with a FK in all the tables that point back to the run. This helps keep fact tables lean and aids in consistency of the ETL process itself. It may also help in the cleanup effort of errant ETL runs.
11. Data governance should be happening now to review, cleanse, establish business rules, and sign off on the master dimensional data. It’s never too early to start on this process. The sooner lists and value sets are presented to the business around key business elements the earlier the discussion can begin to resolve issues and assign ownership.
12. Information that is used for reporting should never be destroyed (Preliminary vs. Final) unless signed off by the business. Once the information is loaded into the BI environment, it is a wise idea to keep it around. At some time the question will be asked “Why did we make THAT decision?” Keeping this information off-line or on-line is a different discussion.
13. All character columns should be defined as VARCHAR(255) instead of variable smaller widths in order to handle most column size situations. I see this often and I am puzzled as to why it is still being done this way. Modern databases are pretty good at storing variable length fields and not wasting space in the process. You will have a problem once you come across a new data source that has a value that is wider than what you sized the existing column. This sometimes comes along after a new acquisition gives you a new data source for your business.
14. Development should not be locked down. It stifles productivity and creates a bottleneck. There still needs to be peer and code reviews but that should happen prior to promotion to the Test environment while things are being designed and developed. Source control helps here too. Prototyping requires quick response and the faster things can change while the design is being “vetted” the better.
15. Use database sequences or identity columns instead of using what the ETL tool provides. It is generally a bad idea to rely on the ETL tool to set the “surrogate” keys. Moving code between environments becomes a real challenge. Databases have built functionality for this purpose, so it’s a good idea to use it.
Even implementing one of these steps will be beneficial, you should see marked improvement if you can do the whole list.
I am sure there are many more steps. I would love to hear your feedback and experience. Have you tried the steps and had a different outcome? What steps would you add or remove?
Plug into the Power of the Data Vault
April 1, 2011
In this day and age, it seems to be trendy to gravitate to the flash and splash of the latest and greatest user facing tools to address Business Intelligence issues. Some believe that if they just get a dashboard and a few nifty graphs, all of a sudden they will have “answers” flowing through their systems and into the reports. …Almost like it was magic.
The true Business Intelligence practitioners know better. Most modern systems still suffer under the same age old issues because they are still doing things the same old way. Some of the issues that are still prevalent are ability to change over time as the business changes and the integration of the information problem.
So when you build architecture on the quick and easy solution in a “silo”, you will eventually hit the wall when it comes to adaptability and scalability. So where does one turn when there is a need for speed as well as the ability to support mission critical reporting and analysis needs that must be able to pass audits?
There is a methodology that attempts to bridge the gaps between the typical issues in the current Business Intelligence offerings. The inventor of the data vault is Dan Linstedt (www.danlinstedt.com) where the concepts and rules are specified for successful engagements.
The data vault is not a product. It is not a magic pill that makes all your IT ills go away. It is a comprehensive approach to addressing real world issues with existing implementations. It brings real flexibility and adaptability to the implementation and brings reliability and dependability to the business. And with a team that understands the power of the data vault, you are now able to take your Business Intelligence environment upon which the tools that do the flash and splash can be sourced from.
According to Dan Linstedt, the inventor of the data vault methodology, the challenges around data integration include some or all of the following:
- Definition, or understanding of the data
- Functions or transformations applied to translate the data
- Interpretation of the data
- “MASTER” determination of the data
- Best storage and architecture of the data
- Visualization of the data
- Accountability and auditability of the data
- Traceability of the data
- Overloading (multi-use of single columns and record types) of data
- Historical data with lost definitions
- Data too big
- No change data capture/no audit trail
- Bad indexes
- No control over source feeds, source timing
- Multi-system valuation dependencies
- Missing data
- Changed Passwords
- Mis-aligned access rights
- Overflowing data
- Out-of-range data
- Bad domain data (a date field contains a string…)
Unless there is a comprehensive plan in place to deal with data integration, then it will only be a matter of time before your implementation will begin to suffer under the weight of the problem. Short sighted solutions only mask this issue for a short time, where our customers need a quick solution that will also stand the test of time and change.
And because of the nature of a data vault, this can be done in rapid releases that bring value within a few weeks of embarking on the project. Because of the style, there are now tools on the market that can generate the table and transformation logic. Once you reach this level, then change is transparent and accessible to your user community as well as the IT staff can finally keep close to the change as it is happening in the business.
According to Dan Linstedt, one would expect the following results from pursuing an implementation that included a data vault:
- Manage and enforce compliance to Sarbanes-Oxley, HIPPA, and BASIL II in your enterprise data warehouse (EDW)
- Spot business problems that were never visible previously
- Rapidly reduce business cycle time for implementing changes
- Merge new business units into the organization rapidly
- Rapid ROI and delivery of information to new star schemas
- Consolidate disparate data stores (i.e., master data management)
- Implement and deploy SOA, fast
- Scale to hundreds of terabytes or petabytes
- SEI CMM Level 5 compliant (repeatable, consistent, redundant architecture)
- Trace all data back to the source systems
With the data vault at the core of your Business Intelligence implementation, you are enabling your enterprise to be as nimble as possible without ignoring the core critical issues around data integration and change over time. Your user community will have the chance to grow at the pace dictated by business opportunity unconstrained by the “normal” issues around the traditional approaches.
Over the next few months, I will be going deeper into the components of a data vault, where it fits into enterprise architecture, and the ways to take advantage of the “Power of the Data Vault”. Stay tuned…
Database in the sky?
December 13, 2010
For some time now we at LUCRUM have been talking about cloud-based Business Intelligence solutions as a viable market alternative. We partnered with www.1010-data.com, offering big data – data warehouses in the cloud, and with Ocucue, offering “object” (as opposed to KPI) based data visualizations in the cloud. Executives from both of these organizations presented at our BI Symposium back in September.
Well, BI in the cloud just made a big leap forward. The poster child of cloud companies, Salesforce.com, has now entered the mix. Salesforce announced a few days ago the availability of www.database.com, a service they (salesforce) will be offering, based upon their existing architecture. WOW… The game should start moving even faster now, what with Salesforce now directly competing with the major Database vendors (Oracle and Microsoft specifically). I personally find it interesting as well that of all the areas Salesforce could have brought to the cloud (vertical/horizontal solutions, ERP/MRP systems, yet another email system, etc) they chose Database. Oh, don’t get me wrong, it makes sense relative to leveraging their existing infrastructure, but I think it’s an interesting vote on where BI, Data and the cloud are headed.
Data Mining – We want it Now!
September 1, 2010
“We are drowning in information but starving for knowledge.”
What do you really understand about data mining? Dr. Candace Gunnarsson was formerly a professor of statistics at Xavier University. Her view of data mining is all about getting information out of the systems you have today and predicting future results. Her view is also about needing a DW modeled in a way that makes the mining experience easier and more meaningful.
Her view is that there is prescriptive (automated) data mining and descriptive data mining. I think most people think of data mining to be an automated process. Truly, the manual exploration must happen first. What do I think the drivers should be? If it’s predicting a purchase, I need to understand you as a customer. What magazines do you read? How old are you? Male/female? How often do you come to my store? All of these become data points in an algorithm that will lead to prescriptive data mining. If I can predict that you will make a purchase, I then need to be able to test my theory and provide results back to my model. Data Mining is Avery iterative process.
To have truly effective data mining, you need to have a multi-disciplinary team. Be sure to bring in your IT, Marketing, Finance, and operations-focused team members. They will all have different views of your customers and will understand their transactional data better than anyone else. Use all of these views to create a better view.
BIG Question – can you start data mining before you have your data organized in a DW?
I’d be interested in hearing your data mining stories.
- Jodie
The Science of Visual Analytics
September 1, 2010
Today LUCRUM hosted our second BI Symposium. Once again, it was well attended and we had some great speakers! I’m hoping that this becomes a regular event. If you have yet to attend, I encourage you to come to our next event (to be scheduled).
Our first speaker was Mr. Stuart Woodward, President OD OcuCue. (http://ocucue.com/) OcuCue is an interesting start-up that’s all focused on data visualization. I always love listening to visual experts. There is such a science to visual design. It’s about understanding the psychology of how users think and perceive what they see. If you are creating a dashboard, you have to design it in the way people think – we read from left to right, heavy color should be at the lower left hand side, etc.
“Good design has two key elements. Graphical elegance is often found in the simplicity of design and complexity of data.” – Edward Tufte
Mr. Woodward’s company creates meaningful dashboards that are icon based. They go beyond speedometers and graphs and actually create a customized dashboards with icons that are meaningful to the company using them. One example that he showed was for a hospital. There are some rooms that can only take female patients or only male patients. To show bed availability, their dashboard has a pink pillow or a blue pillow to represent which rooms are available. Hmmm…never thought of that!
How are you presenting data to your users? Are you simplifying the message? Setting up the information from left to right? Are your colors meaningful? (ie Red should mean bad, green is good)
OK…gotta run and listen to the next speaker!
- Jodie
Making Information Available
August 9, 2010
I’m not sure if you’ve noticed, but I’ve not been blogging with the same gusto as of late. Ah the life of a Consultant.
I have been working with a local financial institution creating financial models this summer. (It leaves me with little time for blogging.) I did happen to stop by our 7755 Montgomery Road office today and checked my mailbox. In it was this month’s Information Management mag. I was immediately drawn to this month’s Snapshot: Making Information Available. Here’s some stats for you to consider:
61% of respondents are less than satisfied with their current process of creating information applications and are only lukewarm about their current information application technology. Here are their complaints:
- It takes too long to assemble and deploy applications.
- It is too difficult to assemble and view information into a simple view.
- There are not enough capabilities to integreate and normalize information from disparate applications.
WOW! I ask all of you fellow BI folks out there…what are you doing to solve this problem??? Why is it with all of the tools available today, our users are finding it too difficult to use them!! What are WE doing wrong?
As I mentioned, I am working with a customer on Financial Models this summer. I am fortunate to work with some SUPER SMART people in this group. They have come up with the most ingenious ways of getting their data out of old clunky systems. They can create some of the most INSANE Excel formulas to manipulate data! Their Excel sheets are visually appealing and get data to their management in a timely manner. I’ve had some spreadsheets that have taken me days to figure out the Excel formulas (and I’m a guru!). They are awaiting IT to “build them a DW” to make their lives easier. Here’s to hoping that it can deliver on their expectations! Here’s what I would do to ensure that it does:
1. Use an iterative methodology to build the DW. Recreate existing Excel reports from the DW as you go.
2. Implement a user-friendly reporting tool that allows them to create their own reporting. Give ‘em lots of drag and drop functionality and make sure it can Export to Excel.
3. Create a request process that allows the DW to change with the Business. Creating a process that queues up the work for months and months does not help the business user to create the financial package that’s needed at the end of the month.
4. Keep the model flexible. Doing this will ensure that you can always add a new organziation, hierarchy or measurement.
5. Build cubes! These users are smart cookies and they aren’t afraid of a Pivot Table. Give them the flexibility and performance of a cube and let them start to uncover their data.
Hmmm…what’s missing from my list? What would you add?
Happy building!
- Jodie







