“Text Mining” – A Misnomer

September 9th, 2009 by Ubiquiti Categories: News 2 Responses

Welcome to our blog: we expect to write on topics of interest about once a week. We hope you will participate by providing your own insights and views.

Given that Ubiquiti offers Decision Support with Text, it makes sense to consider what this means in general, and in our context in particular. “Decision Support” usually refers to various types of analysis, reporting, alerting and “data mining” (more on this below, and in a subsequent Blog Post) – typically on data collected from diverse sources into a data repository, or “data warehouse” (e.g., see http://en.wikipedia.org/wiki/Data_warehouse). Including Text in this context usually means dealing with free-flowing text narratives, entered by humans, that may be present in individual records of data (here the term “records” is being used very generally to also include documents, blogs, email and so on). Unlike “structured” or “fielded” data, such as numeric (i.e., numbers) or categorical (e.g., names of cities, or fault-codes etc.) data, free-flowing verbatim text narratives cannot be analyzed directly using simple mathematical or statistical techniques. Instead, recourse is taken to using approaches loosely referred to as “Text Mining”.

The term “Text Mining” means different things depending on the context and the audience, and is a source of considerable confusion and misuse. Although it derives from the term “Data Mining” (usually taken to mean finding previously unknown but useful patterns and outliers in structured data – e.g., see http://en.wikipedia.org/wiki/Data_mining), Text Mining is not quite the same. Free-flowing verbatim text narratives must first have their information extracted (an activity that goes by various names, e.g., entity extraction). This, by itself, is not easy, since different people may express the same concept differently and, depending on the context, the same words may convey different meanings. Often, the mere extraction or recognition of words or phrases (without regard to the meaning conveyed, although allowing for simple spelling variations) is referred to as Text Mining; while somewhat useful for Search, this is of little use for analysis (other than simple word or phrase counts). A much deeper analysis is required to extract meaningful concepts (the “semantics”) from the text narratives (often referred to as Natural Language Processing, or NLP; see http://en.wikipedia.org/wiki/Natural_language_processing), and only thereafter does it become possible to do worthwhile analysis, reporting – and “mining”, as described next.

Information must first be extracted from text narratives, whether just the words and phrases or the elements of meaning conveyed by them (in the form of “concepts” which have “semantics” associated with them). This converts the unstructured narrative text into structured elements (i.e., just words and phrases at the simplest level, or semantic concepts at a deeper level, with a wide variety of possibilities in between). Such structured data can then be mined for patterns and outliers just like any other structured dataset, and hence the term “Text Mining”. Unfortunately, mining that uses any structured data extracted from narratives is considered to be text mining, even when the significant and often very difficult step of semantic information extraction has been skipped and only words and phrases have been extracted. The difference in results is huge, since the words in a free-flowing narrative are very different from the meaning they encode. It is therefore very important to ascertain what exactly is meant when “Text Mining” is suggested, and to be aware of the limitations of simplistic techniques.
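
To make the simplest level of extraction concrete, here is a minimal Python sketch of turning narratives into structured word counts and then doing a very basic form of “mining” on them. The sample narratives and the function names (extract_words, to_structured_record) are invented for illustration; this is not a description of any particular product, and it deliberately captures words only, not meaning.

```python
import re
from collections import Counter


def extract_words(narrative):
    """Lowercase the narrative and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", narrative.lower())


def to_structured_record(narrative):
    """Turn one free-text narrative into a structured word -> count mapping."""
    return Counter(extract_words(narrative))


# Hypothetical repair narratives, made up for this example only.
narratives = [
    "Replaced wiring harness",
    "Wiring harness shorted, replaced fuse",
    "Checked compressor OK",
]

# One structured record per narrative.
records = [to_structured_record(n) for n in narratives]

# A very simple "mining" step: in how many narratives does each word occur?
document_frequency = Counter()
for record in records:
    document_frequency.update(record.keys())

print(document_frequency.most_common(3))
```

Once the text is in this form, ordinary counting and statistical tools apply, but note that nothing here knows that “Checked compressor OK” means the compressor is fine; only the words have been captured.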

For example, consider the following sentences:

– “Checked AC Compressor OK and Shorted Wiring Harness” should convey that the problem is in the wiring harness, and not the compressor (which may be mistakenly flagged if only words are extracted from the data). In technical datasets, given the distinct lack of proper syntax or grammar, statistical techniques work better (e.g., see book – Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing, MIT Press (1999), ISBN 978-0262133609).

– “The patient is hardly likely to have contracted pneumonia although he has fever” should convey that the patient is diagnosed to not have pneumonia (and yet, simply using words may mislead any analysis).
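
The two sentences above can be used to sketch, in Python, how word-only extraction goes wrong. The first function flags every keyword it sees; the second adds a crude, hand-written cue rule that skips a keyword followed by “OK” or preceded (within a few words) by a negating word. The cue lists, the window size and the function names are assumptions chosen to fit these two sentences; this is a toy illustration of why context matters, not real semantic extraction and not a substitute for NLP.

```python
import re

# Hypothetical cue words, chosen only to fit the two example sentences.
CLEARING_CUES_AFTER = {"ok"}                    # e.g. "compressor OK"
NEGATING_CUES_BEFORE = {"hardly", "not", "no"}  # e.g. "hardly likely to have ..."


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def flag_by_keyword(text, keywords):
    """Word-only extraction: flag every keyword that appears in the text."""
    tokens = set(tokenize(text))
    return sorted(k for k in keywords if k in tokens)


def flag_with_cues(text, keywords, window=6):
    """Flag a keyword unless a nearby cue word clears or negates it."""
    tokens = tokenize(text)
    flagged = set()
    for i, token in enumerate(tokens):
        if token not in keywords:
            continue
        before = set(tokens[max(0, i - window):i])  # words preceding the keyword
        after = set(tokens[i + 1:i + 2])            # the word right after it
        if before & NEGATING_CUES_BEFORE or after & CLEARING_CUES_AFTER:
            continue
        flagged.add(token)
    return sorted(flagged)


repair = "Checked AC Compressor OK and Shorted Wiring Harness"
clinical = ("The patient is hardly likely to have contracted pneumonia "
            "although he has fever")

print(flag_by_keyword(repair, {"compressor", "harness"}))
print(flag_with_cues(repair, {"compressor", "harness"}))
print(flag_by_keyword(clinical, {"pneumonia", "fever"}))
print(flag_with_cues(clinical, {"pneumonia", "fever"}))
```

Word-only flagging reports both the compressor and the harness, and both pneumonia and fever; the cue rule reports only the harness and only the fever. A rule this brittle breaks as soon as the wording changes, which is exactly why deeper extraction of meaning is needed.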

We will discuss other issues in Text Mining in future blog posts as well, in particular because our experience shows that the value obtained by utilizing such text data far exceeds the value obtained solely from the structured information.

  1. Ted Davis says:

    My problem is I can’t get the data and everyone writes things different. Before I can text mine I have to get the data which takes up a lot of time.

    • Ubiquiti Inc. says:

      Actually, if data is to be obtained from the same sources and with similar parameters, inexpensive “macro” tools (i.e., under $50) work well enough to automate the data download process. These work by “recording” the keyboard and mouse activity, and “replaying” it at scheduled times. Several of our customers use this quite effectively. As for the different ways of saying the same thing, text analytics handles that well.
