The term “Data Mining” is usually taken to mean finding previously unknown but useful patterns and outliers in structured data (e.g., see our previous blog post, “Text Mining” – A Misnomer, and http://en.wikipedia.org/wiki/Data_mining). Among statisticians, the term can carry a negative connotation: it is taken to mean dredging up any and all patterns, whether or not they are warranted by the statistical distribution of the data. Indeed, there are a number of dirty secrets in Data Mining, often known only to experienced analytics practitioners, that are rarely openly admitted or discussed. Here is something of an exposé:
First, before any general-purpose, out-of-the-box Data Mining software can be applied, a great deal of work goes into building a “data warehouse,” “cleansing” the data (i.e., removing or otherwise carefully handling incorrectly entered or missing values), and transforming the data into a form the software can consume. Because messy data is so prevalent, and its correction is highly context-dependent, there are no fully automated means of doing this (notwithstanding what software vendors often claim). All of this falls under data pre-processing.

Second, such general-purpose software often fails to produce meaningful results at all, because the previously unknown but useful patterns tend to be highly domain-specific and context-sensitive. Considerable, expensive effort is required to configure any general-purpose package before it yields anything meaningful – so much so that it may be better to build from scratch rather than use an off-the-shelf product.

Third, if for no other reason than happenstance, one usually obtains a huge number of patterns from any dataset – in fact, the space of candidate patterns can easily be exponentially larger than the dataset itself. Unless there is some powerful means of quickly identifying the important patterns, examining the results becomes a harder task than examining the original data.
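The exponential blow-up in the third point is easy to see with frequent-itemset mining, a standard Data Mining task: over n distinct items there are 2ⁿ − 1 possible non-empty itemsets a miner could in principle report. A minimal sketch, using a hypothetical toy transaction set:

```python
# Toy illustration: with n distinct items, the number of candidate
# itemset patterns is 2**n - 1, which quickly dwarfs the dataset itself.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
]

items = sorted(set().union(*transactions))

# Enumerate every non-empty subset of the item universe.
candidates = [
    set(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
]

print(len(transactions))  # 3 records in the dataset
print(len(candidates))    # 7 candidate patterns (2**3 - 1)
```

With only 30 distinct items the candidate space already exceeds a billion patterns, which is why unfiltered mining output can be harder to sift through than the raw data.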
However, the situation is not so bad if attention is restricted to a specific domain and the Data Mining software is configured specifically for that context. This is similar to producing special-purpose software for a narrow area, which can deliver fairly valuable results if used appropriately. One useful approach is good visualization, which lets human eyes assess the results. For example, simple charts and similar techniques (to be discussed in our future blog posts) can provide a rapid means of focusing attention on the important issues.
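Even a rudimentary chart can do this triage work. A minimal sketch, using made-up web-server error codes, of a text histogram that makes the dominant category obvious at a glance:

```python
# Hypothetical data: error codes from a web log. A quick frequency
# chart lets a human spot the dominant issue without reading records.
from collections import Counter

error_codes = ["404", "500", "404", "404", "200", "500", "404"]

counts = Counter(error_codes)
for code, n in counts.most_common():
    print(f"{code:>4} | {'#' * n} ({n})")
#  404 | #### (4)
#  500 | ## (2)
#  200 | # (1)
```

The same idea scales up: plot pattern frequencies before examining patterns individually, and let the outliers in the chart direct where the analyst looks first.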
One other aspect to consider is that Data Mining software usually has certain threshold values built in: when the incidence of a pattern exceeds a pre-defined threshold, the software reports the associated result. Now, mining computations often take a long time to execute, so user-interactive sessions are not the best use of anyone's time. Instead, Alerts can and should be made available to users: data mining is re-run whenever new data enters the system, and if the mining thresholds are exceeded, Alerts trigger and are reported to users. Again, we will discuss specific techniques in future blog posts.
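The alert pattern described above can be sketched in a few lines. Everything here (the `AlertMiner` name, the threshold value, the event labels) is illustrative rather than drawn from any particular product; the point is simply that mining runs on arrival of new data and an alert fires only once a pre-defined threshold is crossed:

```python
# Hedged sketch of threshold-driven alerting: re-mine whenever new
# records arrive, and report a pattern once its count crosses a
# pre-defined threshold (each pattern alerts at most once).
from collections import Counter

class AlertMiner:
    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()
        self.alerts = []

    def ingest(self, records):
        """Update counts with new data and trigger any due alerts."""
        self.counts.update(records)
        for pattern, n in self.counts.items():
            if n >= self.threshold and pattern not in self.alerts:
                self.alerts.append(pattern)

miner = AlertMiner(threshold=3)
miner.ingest(["login_fail", "login_ok"])    # below threshold, no alert
miner.ingest(["login_fail", "login_fail"])  # count hits 3, alert fires
print(miner.alerts)  # ['login_fail']
```

In a production setting the `ingest` step would be wired to whatever feeds new data into the system, so users see alerts asynchronously instead of waiting on long interactive mining runs.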