Guest blog post by David Mould, initially published in the IBM Data Magazine.
Data warehouse managers and data scientists should jointly evaluate these potential sources of useful predictors.
Most companies collect data from several sources, such as past transactions, clicks, shipments, and more. The data scientist’s job is to transform this data into business intelligence and then into actionable results. Most of this data falls into a single category: internal and easy-to-collect. The other three categories are internal and difficult-to-collect, external and easy-to-collect, and external and difficult-to-collect; they are usually not in the data warehouse (see Figure 1). Yet some of the most important predictors are in these three categories.
Figure 1: Data that can provide some of the best predictions is often missing from the data warehouse.
Let’s look at the three missing categories of data and the reasons why they are sometimes missing from your warehouse.
Internal and difficult-to-collect data
This category of data is usually available within your company but is not in the data warehouse because it is generated infrequently or is sensitive in nature.
- Brutally frank competitor assessment: Everyone has competitors, and most companies have a key performance indicator (KPI) grid that rates or compares each competitor. Usually this information is made available only to the C-suite and the board of directors. But as a data scientist, your predictions and forecasts are affected by competitor actions, so access to this information would be valuable.
- Surveys of former customers: Most internally published customer surveys are biased toward the loyal customers who like you. The best surveys are those that survey the customers who dumped you. Only those surveys will reveal what really needs fixing. If you don’t know what is broken from the customer viewpoint, it is more difficult to reconcile your actual results with the predicted results.
- Unbiased focus group findings: Most focus groups draw from current, loyal customers. To get an accurate assessment of the product or service to be evaluated, insist on including three other types of customers: (1) customers who switched from a competitor to you; (2) customers who switched from you to a competitor; and (3) customers who have always been with a competitor. Listening to their interactions will provide a less-biased assessment.
External and easy-to-collect data
This category of data is external to your company and is usually not in the data warehouse because no one has requested it yet.
- Government data on business conditions, statistics, and trends: External factors can have a major impact upon your business, so they should be tracked over time. The government collects and posts unemployment rates, business cycles, census demographics, and other data that can be downloaded and added to your data warehouse.
- Consumer reporting from TransUnion, Equifax, or Experian: Some organizations are finding that their predictive models can be enhanced with consumer credit scores. Since there is a per-score charge, an ROI analysis should be completed to determine if the benefits (score uplift) outweigh the cost.
- Consumer and business data from Acxiom, Dun & Bradstreet, Harte-Hanks, and others: Marketing firms can offer a wide range of valuable data on your customers. An ROI analysis should be completed to determine if the benefits of using this data (score uplift) outweigh the cost.
- Consumer and business survey data from Gallup, Forrester Research, and others: Surveys are a great source for forecasting, especially when you compare new survey data with past survey data. This information usually isn’t in the warehouse.
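The ROI analysis suggested for purchased scores and marketing data can be sketched in a few lines. This is only a minimal illustration: the function name, the record count, the per-score price, and the profit figures are all hypothetical, and in practice the uplifted profit would come from testing your model with and without the purchased field.

```python
def purchase_roi(n_records, cost_per_record, baseline_profit, uplifted_profit):
    """Return ROI of buying external data as a fraction (1.0 means 100%).

    All inputs are assumptions supplied by the analyst; nothing here
    reflects actual vendor pricing.
    """
    total_cost = n_records * cost_per_record
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    # Benefit attributable to the purchased data (the "score uplift").
    incremental_profit = uplifted_profit - baseline_profit
    return (incremental_profit - total_cost) / total_cost


# Hypothetical example: 10,000 credit scores at $0.50 each.
roi = purchase_roi(
    n_records=10_000,
    cost_per_record=0.50,
    baseline_profit=120_000.0,   # profit using the current model
    uplifted_profit=131_000.0,   # profit using the model with the new field
)
print(f"ROI: {roi:.0%}")
```

A positive result means the uplift more than pays for the data; a negative one means the field costs more than it returns at the assumed volumes.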
External and difficult-to-collect data
This category of data is also external to your company and is usually not in the data warehouse because it is generated infrequently.
- Expert opinions: Sometimes the best way to make a prediction is to use an expert’s opinion. If an expert’s opinion has been somewhat accurate in the past, go ahead and use that person’s opinion as a dummy variable.
- Published survey or trend data that needs to be scanned or typed into the database: Some data is only available in hard copy. If it is difficult to enter the data into the warehouse, then it probably won’t be there. You may have to scan or type it in manually. But it could be the missing independent variable that you have been looking for.
- Recent technology changes: Recent technology changes could have a profound impact on your business in the future and need to be tracked and modeled accordingly.
- Executive interviews: Trade journals and magazines sometimes include executive interviews. Your internal experts can pick up on key words and phrases to divine direction and trends.
- Industry expert and supplier feedback: Industry experts and suppliers can provide key information or a viewpoint that you overlooked. Take advantage of their years of experience.
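To make the dummy-variable idea from the expert-opinion bullet concrete, here is a minimal sketch. It assumes the expert's calls have been recorded as "up" or "down" for each period; the data, the labels, and the function names are invented for illustration. It uses a one-predictor least-squares fit, which is one simple way to use a dummy variable rather than a prescribed method.

```python
from statistics import mean


def encode_dummy(opinions, positive="up"):
    """Map each opinion to 1 if it matches `positive`, else 0."""
    return [1 if o == positive else 0 for o in opinions]


def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor is constant; slope is undefined")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical history: the expert's call and the outcome for six periods.
expert_calls = ["up", "down", "up", "up", "down", "down"]
sales_growth = [4.0, 1.0, 5.0, 3.0, 2.0, 0.0]  # percent, made-up values

expert_up = encode_dummy(expert_calls)
intercept, slope = fit_simple_ols(expert_up, sales_growth)
print(f"growth when expert says down: {intercept:.1f}")
print(f"extra growth when expert says up: {slope:.1f}")
```

With a single 0/1 predictor, the intercept is the average outcome when the expert said "down" and the slope is the difference between the two group averages. In a real model the dummy would sit alongside your other predictors.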
Tapping into valuable data beyond your existing warehouse
Having millions of data records and hundreds of fields is great only if the data is useful. Some of the most useful data—which can provide the best sources of insight—is difficult to collect, external to the organization, or both. Collecting and incorporating this data into your data warehouse will be worth the effort since it can provide new predictors that can boost your accuracy to a new level.
What do you think? Let me know in the comments.