This page last changed 08 October 2005

Boston, Massachusetts, April 11-12, 2005

Tutorial: Sunday April 10, 2005 (afternoon)

Machine Learning for Text Classification Applications

David D Lewis

TUTORIAL OBJECTIVES & TARGET AUDIENCE

The target audience is practitioners interested in implementing or purchasing software for text classification, or for applications in which text classification is a component. Participants will learn about a wide range of real-world problems that can be framed as text classification, the advantages and disadvantages of doing so, and how to use machine learning effectively to reduce the human effort of fielding text classification systems.

OVERVIEW

Text classification will first be presented as an abstract problem in information retrieval, and it will be shown how a variety of information access and data mining tasks can be framed as classification problems. Classifying documents into categories is the simplest, and often the most practical, form of content enhancement for many information access tasks.

The bulk of the tutorial will then discuss the problems that come up at each stage in fielding a text classification system: defining categories, representing texts, producing training data, and training, applying and evaluating classifiers. The focus will be on machine learning methods rather than manual rule construction; however, there will be considerable emphasis on how knowledge of the particular classification task can be brought to bear to improve the effectiveness of learning systems.

Examples will be drawn from applications such as knowledge management, process improvement, customer service automation and analysis of customer data, web directories, vertical portals, alerting and customized information feeds, spam and porn filtering, bioinformatics, content analysis and survey research.

TUTORIAL OUTLINE

I. An Overview of Text Classification

I.A. Applications
1. Classification vs. other operations on text
2. Six classes of applications
3. Advantages of viewing problems as classification
4. Non-classificatory views of IR and text mining
5. Structuring sets of classes

I.B. Architecture of text classification software

I.C. Evaluating classifier effectiveness

II. Learning for Text Classification

II.A. Overview
1. Building classifiers by hand
2. Machine learning of classifiers and its advantages
3. Overfitting: the central issue in machine learning

II.B. Learning rule-based text classifiers
1. Types of rule-based classifier
2. (Dis)advantages of learning rule-based classifiers
3. Algorithms for learning rules
4. Restricting learned rules to control overfitting
5. Special issues in rule learning for text classification

II.C. Learning numeric text classifiers
1. Types of numeric classifier
2. (Dis)advantages of learning numeric classifiers
3. Algorithms for learning numeric classifiers
4. Restricting classifiers to control overfitting
5. Special issues in learning numeric text classifiers

III. Tuning and Tricks

III.A. Data and effectiveness
1. Creating training sets
2. Creating test sets
3. Maintaining effectiveness over time

III.B. Text representation
1. Text representation basics
2. Units of text
3. Statistical term weighting
4. Attempts to improve text representation
5. Natural language processing
6. Unsupervised learning
7. Non-textual attributes

III.C. Inserting knowledge
1. Prior knowledge vs. posterior editing
2. Suggesting attributes and parameter values
3. Constructing good attributes
4. Problems to watch out for
5. Relationships among classes
6. Knowledge bases

III.D. Systems issues
1. Efficiency of learning
2. Persistent storage
3. Flexibility of software

IV. Research Issues

TUTORIAL LOGISTICS

The tutorial will take place at the Search Engine Meeting hotel, the Fairmont Copley Plaza in Boston. It starts at 1.30 pm on Sunday April 10 and will end around 5.30 pm. Prior registration is required. Note that the tutorial is not included in the registration fee for the Search Engine Meeting that follows; a separate tutorial registration fee of $325 per person is payable.

