Minding Data Mining

September 17th, 2009 by Ubiquiti Categories: News 4 Responses

The term “Data Mining” is usually taken to mean finding previously unknown but useful patterns and outliers in structured data (e.g., see our previous blog post, “Text Mining” – A Misnomer, and http://en.wikipedia.org/wiki/Data_mining). Among statisticians, the term can carry a negative connotation: it is taken to mean dredging up any and all types of patterns, whether or not they are warranted by the statistical distribution of the data. Indeed, Data Mining has a number of dirty secrets, often known only to experienced analytics practitioners, that are rarely admitted or discussed openly. Here is something of an exposé:

First, before any general-purpose, out-of-the-box Data Mining software can be applied, a lot of work goes into creating a “data warehouse,” data “cleansing” (i.e., removing or otherwise carefully handling incorrectly entered or missing data), and data transformations that allow the software to work. All of this falls under data pre-processing. Since messy data is very prevalent, and correcting it is highly context-dependent, there really are no fully automated means of doing so (notwithstanding what software vendors often claim).

Second, such general-purpose software often will not even provide meaningful results, since the previously unknown but useful patterns tend to be very domain-specific and context-sensitive. Considerable, expensive effort is required to set up any general-purpose software before it yields anything meaningful, so much so that it may be better to start from scratch than to use off-the-shelf software.

Third, if for no other reason than happenstance, one usually obtains a huge number of resulting patterns from any dataset. In fact, it is easy to show that the number of candidate patterns can grow exponentially with the number of attributes, and so can far exceed the size of the original dataset. Unless there are powerful means to quickly identify the important patterns, examining the results becomes a more difficult task than examining the original dataset.
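To make the third point concrete, here is a minimal sketch in Python. It assumes that a “pattern” means an itemset, i.e., a combination of values that occur together in a record; the post itself does not fix a definition, and the function name and toy data are purely illustrative. With n distinct items there are 2^n - 1 possible non-empty itemsets, so even a four-record dataset can yield more patterns than records.

```python
from itertools import combinations


def count_itemsets(transactions, min_support=1):
    """Brute-force count of itemsets found in at least min_support records.

    Returns (number of distinct items, number of qualifying itemsets).
    Only practical for tiny inputs: the candidate space is 2**n - 1.
    """
    items = sorted(set().union(*transactions))
    qualifying = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of records containing every item in the candidate.
            support = sum(1 for row in transactions if set(candidate) <= row)
            if support >= min_support:
                qualifying += 1
    return len(items), qualifying


# Hypothetical toy dataset: four records over four items.
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "c", "d"},
    {"b", "d"},
]

n_items, n_patterns = count_itemsets(transactions, min_support=1)
candidate_space = 2 ** n_items - 1
print(f"{len(transactions)} records, {n_items} items, "
      f"{n_patterns} of {candidate_space} candidate itemsets occur at least once")
```

Raising `min_support` is one simple way to prune the output, but choosing a sensible value is itself domain-dependent, which is the point made above.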

However, the situation is not that bad if you restrict attention to a specific domain and configure the Data Mining software specifically for that context. This is similar to producing special-purpose software for a narrow area, which can provide fairly valuable results if used appropriately. One useful mechanism is good visualization, which lets human eyes assess the results. As an example, simple charts and similar techniques (to be discussed in our future blog posts) can provide a rapid means to focus attention on important issues.
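As a rough illustration of how even a very simple chart can direct attention, the following Python sketch prints a text bar chart of pattern counts, largest first. The function name, the issue labels, and the counts are all made up for the example; this is not a technique taken from any particular product.

```python
from collections import Counter


def text_bar_chart(counts, width=40):
    """Return the lines of a horizontal bar chart, largest count first."""
    if not counts:
        return []
    peak = max(counts.values())
    label_width = max(len(str(label)) for label in counts)
    lines = []
    for label, value in sorted(counts.items(), key=lambda kv: -kv[1]):
        # Scale each bar relative to the largest count; keep tiny counts visible.
        bar = "#" * max(1, round(width * value / peak)) if value > 0 else ""
        lines.append(f"{str(label):<{label_width}} | {bar} {value}")
    return lines


# Hypothetical log of issue categories extracted from some dataset.
issue_log = ["late", "damaged", "late", "lost", "late", "damaged", "late"]
chart = text_bar_chart(Counter(issue_log), width=8)
print("\n".join(chart))
```

Sorting by count puts the dominant category at the top, which is often all a reader needs in order to decide where to look first.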

One other aspect to consider is that Data Mining software usually has certain threshold values built in. When the incidence of certain patterns exceeds these pre-defined thresholds, the software reports the associated results. Now, mining computations often take a long while to execute, so user-interactive sessions are not the best use of time. Instead, Alerts can and should be made available to users: data mining is carried out whenever newer data enters the system, and if the thresholds are exceeded, Alerts are triggered and reported to users. Again, we will discuss some techniques in future blog posts.
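The Alert idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a “pattern” is reduced to a label attached to each incoming record, the “mining” step is a running count, and the class name, labels, and thresholds are hypothetical rather than taken from any actual product.

```python
from collections import Counter


class ThresholdAlerter:
    """Re-evaluates pattern counts on every new batch and raises each alert once."""

    def __init__(self, thresholds):
        self.thresholds = dict(thresholds)  # pattern label -> pre-defined limit
        self.counts = Counter()             # running incidence per pattern
        self.alerted = set()                # patterns already reported

    def ingest(self, batch):
        """Add new records; return alerts for patterns that newly exceed their limit."""
        self.counts.update(batch)
        alerts = []
        for pattern, limit in self.thresholds.items():
            if self.counts[pattern] > limit and pattern not in self.alerted:
                self.alerted.add(pattern)
                alerts.append((pattern, self.counts[pattern], limit))
        return alerts


# Hypothetical thresholds and incoming batches.
alerter = ThresholdAlerter({"overheating": 2, "leak": 1})
first = alerter.ingest(["overheating", "leak"])
second = alerter.ingest(["overheating", "overheating", "leak"])
third = alerter.ingest(["leak"])
print(first, second, third)
```

Because the check runs when data arrives rather than when a user asks, nobody has to wait on a long computation in an interactive session; users only hear about a pattern once it crosses its threshold.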

  1. Stephan Jegl says:

    Now that text mining technologies allow structuring and statistical evaluation of unstructured data, will this become “state-of-the-art”? Or rather a way of being different, meaning simply more efficient and better performing than the competitors?

    • Ubiquiti Inc. says:

      “Text mining” is currently far from being “state-of-the-art”, although Ubiquiti certainly feels that it should and will become so, given that its benefits far outweigh its costs. Text technologies in and of themselves are not as useful as when they become an integral part of business activities (e.g., quality control, audit, diagnostics, decision support, etc.). Organizations will tend to use text in conjunction with structured data to benefit their various business processes.

  2. Andrew Pawlowski says:

    I am interested in some examples of how Data Mining can be used in a proactive way, to find emerging issues. Could you put up some techniques/examples in future blogs?

    • Ubiquiti Inc. says:

      Yes, we should and will do so. However, we have to be careful in terms of using synthetic datasets for obvious reasons. Ubiquiti will soon make our software available to some groups (Contact Us for details on how to obtain it), together with synthetic data with which data mining techniques and results will be illustrated.
