This page last changed 04 July 2007
Sitges, Spain
Tutorial: Wednesday
24 October 2007 (afternoon)
Text Mining: Placebo
or Scopolamine for Patent Analysis,
Pharma Research and Business Intelligence?
Stephen E. Arnold
President
Arnold Information Technology
A recent report pegged the text mining market at a healthy
$1.8 billion in 2007. That is an astounding figure when you calculate
that the five publicly traded "leaders" in text mining
generate less than $300 million per year from licences directly
related to unstructured text.
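The arithmetic behind that comparison can be sketched in a few lines. This is only an illustration using the two figures quoted above; the variable and function names are hypothetical, and because the leaders' revenue is given as "less than $300 million", the results are bounds rather than exact values.

```python
# Figures quoted in the announcement (US dollars); neither is independently verified.
MARKET_ESTIMATE_USD = 1_800_000_000    # reported 2007 text mining market
LEADERS_REVENUE_CAP_USD = 300_000_000  # upper bound for the five public "leaders"


def leaders_share(market: float, leaders: float) -> float:
    """Return the leaders' revenue as a fraction of the market estimate."""
    return leaders / market


share = leaders_share(MARKET_ESTIMATE_USD, LEADERS_REVENUE_CAP_USD)
unaccounted = MARKET_ESTIMATE_USD - LEADERS_REVENUE_CAP_USD

print(f"Leaders' share of the estimate: under {share:.1%}")
print(f"Revenue that must come from elsewhere: over ${unaccounted / 1e9:.1f} billion")
```

On these numbers the five leaders account for under one sixth of the reported market, which leaves more than $1.5 billion to be explained by other vendors or by a broader definition of text mining revenue.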
The interest in the use of algorithmic processes to extract actionable
information and facts from terabytes of unstructured content continues
to rise. Can algorithms alone identify the names of people, places
and products without error? Can these systems handle content in
a single language or across multiple languages? Can text mining
systems process double-byte unstructured content such as Chinese
and Korean laboratory reports? Are the breakthroughs in CPUs up
to the tasks imposed by iterative algorithms cycling through email,
Word documents, PDF files, PowerPoints and semi-structured Web
content?
This tutorial explores the reality of today's leading text
mining systems. The systems that deliver useful results are often
quite different from the Alice-in-Wonderland descriptions in vendors'
marketing collateral.
What you will learn
- The principal categories into which most text mining vendors
cluster. The characteristic functions associated with the companies
in each cluster. When to choose a high-profile vendor versus
a little-known specialist.
- The key features of text mining systems offered as part of
a larger system from IBM, Microsoft, Oracle or SAP. The key features
included with embedded systems from vendors focused on business
intelligence and from vendors who have deep roots in search and
retrieval.
- The specific strengths and weaknesses of 20 text mining vendors'
systems with special emphasis placed on Microsoft SharePoint,
Google's clustering technique, Oracle's semantic tools, and
breakthrough technology from Attensity, Megaputer and Siderean,
amongst others.
- The pitfalls of text mining, including snapshots of financial
issues, computational bottlenecks, customisation issues, and
the frequently ignored "human element".
- Specific examples of text mining payoff in pharma and competitive
intelligence applications, illustrated with screenshots of outputs
and user interfaces.
What the tutorial covers
The tutorial is divided into four 45-minute segments:
Part 1: Background and Benefits of Text Mining
- The roots of text mining
- The four core functions of text mining
- Case example 1: pharma research
- Case example 2: competitive intelligence
- A checklist of key cost elements (hard copy handout)
Part 2: Profiles of 20 Vendors. For each vendor, you will
learn:
- Basic function of the product
- Architecture
- Typical application
- Strength
- Weakness
- First-year licence and start-up cost, including hardware
required for a reference implementation
Part 3: The Pitfalls and How to Avoid Them
- Planning and budgeting issues: The 2.5X guideline
- System set up, tuning and training
- Troubleshooting bottlenecks and accuracy
- Upgrading, customising and integrating more advanced functions
with other enterprise applications
- Solutions to content processing challenges
Part 4: Getting a Fast Start
- Managing expectations
- Practical requirements
- An implementation checklist and roadmap
- The six essentials of a successful system.
Tutorial Format
The tutorial will consist of 30-minute lectures by
the presenter. Each lecture will be followed by a discussion period
between Mr Arnold and those in attendance.
Tutorial Materials
The materials used and referenced in the session will be made
available to registered attendees on an FTP site for a period of 10 days
following the tutorial. Registered attendees may use these
materials within their organisations, but any other use of the
data or the information requires the written permission of the
presenter.
Who should attend
This tutorial will provide actionable information to:
- Managers and technical professionals interested in access
to digitised content on the Web and within their organisations
- Information professionals requiring a deep understanding
of the forces reshaping analysis of unstructured content
- Consultants and advisors who need information to determine
if text mining can make an immediate and direct contribution
to their clients' business activities
- Legal personnel interested in understanding how text mining
can and cannot assist with discovery and patent analysis related
to litigation and innovation
- Bench chemists curious about the technologies available for
exploiting the knowledge-value of structured and unstructured
data available to them.
Companies Profiled
Companies in blue are treated
in more detail in this tutorial.
- Attensity
- Autonomy
- Clarabridge
- ClearForest
- Content Analyst
- Data Harmony
- Endeca
- FAST Search & Transfer
- Groxis
- IBM
- Inxight
- Megaputer
- Microsoft
- Nstein
- Oracle
- SAS
- SPSS
- SRA International
- SchemaLogic
- Siderean
- Stratify
- TEMIS
- Teragram
Tutorial Logistics
The tutorial will take place at the ICIC meeting hotel, the
Hotel Meliá Sitges. The tutorial starts at 14:00
on Wednesday 24 October and will end around 17:30. Prior registration
is required. Note that the tutorial is not included in
the registration fee for the ICIC meeting. A separate workshop
registration fee of 215 per person will be payable.