This page last changed 09 November 2006

Boston, Massachusetts, April 23-24, 2007

Tutorial: Sunday April 22, 2007 (afternoon)

Text, Lies and Videotape: An Introduction to Text and Web Mining
through Public Data Sets

David D. Lewis, PhD
President, David D. Lewis Consulting

The volume and scope of textual and semistructured data, and the power of the technological tools that extract value from it, have exploded in recent years. This tutorial will provide a tour of these technologies through examples based on some notorious public data sets, including the Enron emails, tobacco company documents released under the Master Settlement Agreement, and the Netflix $1 Million Prize movie recommendation data.

Tutorial attendees will learn about technologies for finding and organizing documents, extracting information from natural language, and transducing information from media such as speech, images and web pages. The inevitable errors, biases and ambiguities introduced by all these technologies pose particular challenges for data mining. We will cover approaches for exploration, prediction and decision making that are robust to the complexities of natural language.

Themes of the tutorial include:

Attendees will be provided with a copy of the presentation slides and extensive pointers to additional resources.

TUTORIAL LOGISTICS

The tutorial will take place at the Search Engine Meeting hotel, the Fairmont Copley Plaza in Boston. It starts at 1:30 pm on Sunday, April 22, and will end around 5:30 pm. Prior registration is required. The tutorial is not included in the registration fee for the Search Engine Meeting that follows; a separate registration fee of $325 per person is payable. Note that the two tutorials at this conference (Arnold and Lewis) run concurrently.

