Laboratory for Machine Learning and Data Mining

Contact: Bing Liu, liub@cs.uic.edu.

The Lab conducts research in various areas of machine learning and data mining. Our earlier research focused on data mining, especially interestingness in data mining and classification based on association rules. We have also implemented a classification system based on association rules, called CBA, which is regarded as a benchmark system for association-rule-based classification. You can download the system from here. The system is based on our KDD-98 paper and on later papers in KDD (1999), PKDD (2000), and Applied Intelligence (2001). Our current research is mainly in the areas of machine learning and Web mining.

Current Research Projects

  1. Learning from Positive and Unlabeled Examples
    Classification learning is commonly stated as follows: given a set of labeled training examples (data records or text documents) of n classes, the system uses this training set to build a classifier, which is then used to classify new documents into the n classes. Although this classic model is important, in practice one also encounters another problem: one has a set of data records or documents on a particular topic or class P (the positive class), and is given a large set U of mixed (unlabeled) documents that contains documents from class P as well as other types of documents (negative documents). One wants to classify the documents in U into those from P and those not from P. The key feature of this problem is that there is no labeled negative training data, which makes traditional text classification techniques inapplicable. Our ICML-02 paper showed theoretically that P and U provide sufficient information for learning. A number of techniques have since been proposed and have appeared in ICML-03, IJCAI-03, and ICDM-03. However, since research in this direction started only recently, many important issues still need to be addressed in order to gain a better understanding of the problem. This project aims to address some of these issues. Our current system (called LPU) can be downloaded from here.

  2. Web Information Extraction
    A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products and services. It is useful to mine such data records in order to extract information from them and provide value-added services. We have proposed an effective technique for mining such data records. Our paper, "Mining Data Records in Web Pages", appeared in KDD-03. The system (called MDR) that comes with the paper can be downloaded here. Our next target is to perform information extraction from such data records.

  3. Web Page Cleaning
    Unlike conventional data or texts, Web pages typically contain a large amount of information that is not part of the main content of the pages, e.g., banner ads, navigation bars, and copyright notices. Such noise on Web pages can seriously harm Web mining tasks such as clustering and classification. In this work, we proposed two novel techniques that deal with Web page noise in order to enhance Web mining. The proposed techniques have been evaluated on two Web mining tasks, Web page clustering and Web page classification. Experimental results show that they are able to dramatically improve the mining results. Two papers have appeared in IJCAI-03 and KDD-03. Dataset.

  4. Mining Topic-Specific Concepts and Definitions on the Web
    Traditionally, when one wants to learn about a particular topic, one reads a book or a survey paper. With the rapid expansion of the Web, learning in-depth knowledge about a topic from the Web is becoming increasingly important and popular, thanks to the Web's convenience and its richness of information. In many cases, learning from the Web may even be essential because, in a fast-changing world, emerging topics appear constantly and rapidly, and there is often not enough time for someone to write a book on such topics. In this work, we attempt the challenging task of mining topic-specific knowledge on the Web. Our goal is to help people learn in-depth knowledge of a topic systematically on the Web. The proposed techniques first identify the sub-topics or salient concepts of the topic, and then find and organize the informative pages that contain definitions and descriptions of the topic and its sub-topics, just like those in a traditional book. Our paper appeared in WWW-2003.
telephone (312) 996-0305
e-mail staff@teraflowtestbed.net
address 700 SEO MC 249, 851 S. Morgan St. Chicago, IL. 60607