Tutorial 1 : Algorithmic Excursions in Data Streams

 

Sudipto Guha

 

 

 

Abstract

 

For many recent applications, the concept of a data stream is more appropriate than that of a data set. By nature, a stored data set is an appropriate model when significant portions of the data are queried again and again, and updates are small and/or relatively infrequent. In contrast, a data stream is an appropriate model when a large volume of data is arriving continuously and it is either unnecessary or impractical to store the data in some form of memory. Many applications naturally generate data streams as opposed to simple data sets. Astronomers, telecommunications companies, banks, stock-market analysts, and news organizations, for example, have vast amounts of data arriving continuously. Mining data streams is thus a necessary ingredient for many successful applications. The stream view challenges basic assumptions in data mining, such as random access to data. It also raises several fundamental questions, such as whether there are effective techniques for mining streams.

 

In this tutorial we will present a survey of algorithms and applications related to data streams. We begin by presenting the basic data stream model of computation. We will then cover techniques for preprocessing a stream, such as sampling from a stream, dimension reduction of a stream, and summarizing a stream using structures like histograms. These preprocessing steps are commonly employed prior to data mining. We will then cover various techniques for mining streams, such as computing frequent itemsets, clusters, and decision trees. Query processing is a basic tool that is needed to support data mining in databases. We assume that the audience has elementary knowledge of algorithms and a basic understanding of data mining. By and large, the tutorial will be self-contained.
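
As an illustration of what sampling from a stream can look like, here is a minimal sketch of reservoir sampling, a standard one-pass technique that keeps a uniform random sample of fixed size without storing the stream. It is not taken from the tutorial material; the function name `reservoir_sample` and its parameters are assumptions chosen for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Hypothetical helper for illustration: the stream is read once, and
    only k items are held in memory at any time.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept only if the
            # slot falls inside the reservoir, i.e. with probability
            # k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 readings from a generator standing in for a stream.
    readings = (x * x for x in range(100_000))
    print(reservoir_sample(readings, 5))
```

The sample is only as large as `k`, regardless of the stream length, which matches the stream setting described above where storing all of the data is impractical.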

 

A Preliminary Outline of the Tutorial

 

1. The Data Stream Model

(a) Formal Definition of a Data Stream

(b) Different Models of Stream Computation

(c) Difference between Stream Models and existing models

 

2. Algorithmic Techniques for Preprocessing a Stream for Mining Purposes

(a) Sampling

(b) Dimension Reduction

(c) Histogram construction

(d) Wavelet representation

(e) Signatures

(f) Quantiles and Order Statistics

 

3. Mining a Data Stream

(a) Clustering: MinMax, MinSum

(b) Learning: Halfspaces, Decision Trees

(c) Frequent Itemsets

 

 

Biography

 

Sudipto Guha has been an assistant professor in the Department of Computer and Information Sciences at the University of Pennsylvania since Fall 2001. He completed his PhD in 2000 at Stanford University, working on approximation algorithms, and spent a year as a senior member of technical staff in the Network Optimizations and Analysis Research department at AT&T Shannon Labs Research.

 

 

Contact Information:

 

Sudipto Guha

University of Pennsylvania

Email: sudipto@cis.upenn.edu


Tutorial 2 : Query Processing in XML Databases

 

Hongjun Lu, Jeffrey Xu Yu

 

 

 

Abstract

 

XML has become a de facto standard for information dissemination and exchange over the Internet. During the past few years, a large amount of work has been devoted to XML data management, and dozens of XML data management systems have been developed. More recently, work on query processing and optimization in XML database systems has been reported. The objective of this tutorial is to review the issues in XML query processing and optimization and to summarize the state-of-the-art techniques. We will first briefly discuss the special features of XML data and XML query languages from the viewpoint of query processing, followed by an introduction to different approaches to physical data organization and indexing. Query processing techniques in both relational-based systems and native XML engines will be discussed in detail. Finally, we will discuss the issues related to XML query optimization.

 

Biography

 

Hongjun Lu, currently a Professor at the Hong Kong University of Science and Technology (HKUST), graduated from Tsinghua University, China, and received his MSc and PhD degrees from the University of Wisconsin-Madison. Prior to joining HKUST, he was on the faculty of the School of Computing of the National University of Singapore. His main research interests are in data/knowledge base management systems, with emphasis on query processing and optimization. His recent research work includes data warehousing and OLAP, XML data management, knowledge discovery and data mining. He is also interested in database application development and applied performance evaluation. He has published more than 150 papers in those areas. Dr Lu is currently a Trustee of the VLDB Endowment, and served as a member of the ACM SIGMOD Advisory Board from 1998 to 2002. He is an associate editor for IEEE Transactions on Knowledge and Data Engineering, and a member of the review board of the Journal of Database Management. He chairs the steering committee of the International Conference on Web-Age Data Management, and is a member of the steering committee of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). He has served on program committees for most international and regional database-related conferences, including VLDB, ACM SIGMOD, ICDE, and EDBT.

 

Jeffrey Xu Yu received his B.E., M.E. and Ph.D. in computer science from the University of Tsukuba, Japan, in 1985, 1987 and 1990, respectively. He was a research fellow (Apr. 1990 -- Mar. 1991) and a faculty member (Apr. 1991 -- July 1992) in the Institute of Information Sciences and Electronics, University of Tsukuba, and a Lecturer in the Department of Computer Science, Australian National University (July 1992 -- June 2000). Currently he is an Associate Professor in the Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. His current major research interests include XML query processing, data mining, and data stream processing. He has jointly published research papers in these areas in major international conferences including ICDE'02, ICDE'03, SIGMOD'03, VLDB'03 and ICDE'04.

 

Contact Information:

Hongjun Lu

Hong Kong University of Science & Technology

Tel: 852 2366773   Fax: 852 23581477 

Email:  luhj@cs.ust.hk

http://www.cs.ust.hk/~luhj

 

Jeffrey Xu Yu

Chinese University of Hong Kong

Email: yu@se.cuhk.edu.hk


Tutorial 3 : Design and Implementation of an E-Catalog Management System

 

Sang-goo Lee

 

 

Abstract

 

Electronic catalogs providing information on products and services form one of the most important components of e-business systems. To support various business solutions such as e-procurement, supply chain management, and enterprise resource planning, a Catalog Management System (CMS) needs to provide a flexible data schema and holistic control over management activities including definition, creation, storage, retrieval, revision, reuse, and maintenance of data and meta-data of products and services.

The design of a CMS is complicated by the diversity in product types, applications, and vocabulary. Relational schema design for tens of thousands of different product types is an issue that has been well noted and addressed in the literature. Identifying products and product classes across organizations throughout the product lifecycle, often known as data synchronization, has been one of the toughest challenges in the supply chain management world.

 

In this tutorial, we will define the problems and challenges of e-catalog management. We will then introduce our experience and present solutions to some of these problems including product database design, classification scheme management, and product search. We will also introduce an ontology-based approach to these problems and its applications. Demos of a Catalog Management System are available.

 

Contents

 

- Scenarios of e-catalog use
- Product information
- Content modeling
- Classification hierarchy
- Standardization issues
- Relational Schema
- Ontology-oriented approach
- Demo

 

Biography

 

Sang-goo Lee is a professor at the School of Computer Science and Engineering, Seoul National University (SNU), Seoul, Korea. He has been the founding Director of the Center for E-Business Technology (CEBT) since 2001. His primary research interest is in solving data management problems in business systems. He has developed a catalog management system that has been commercialized by a local software vendor. Currently, he is developing an auto-classification engine for e-catalogs, which can be used to streamline the painstaking content-building process for e-marketplaces. Dr. Lee is a technical advisor to the Public Procurement Services of Korea and the Korea CALS/EC Association that oversees a number of G2B and B2B initiatives led by the Korean government. He has been actively involved in e-commerce standardization efforts and has written two local standards on e-catalogs. He chairs a number of joint industry-university working groups, including the E-Catalog Technology Working Group and the Global B2B Interoperability Working Group. Dr. Lee received his B.S. from SNU, and both his M.S. and Ph.D. from Northwestern University, Illinois, USA. Prior to joining SNU in 1992, he was a research engineer at EDS R&D, Michigan, USA.

 

Contact Information:

Sang-goo Lee

Seoul National University

Email: sglee@europa.snu.ac.kr

 

 


DASFAA2004 Homepage : http://aitrc.kaist.ac.kr/~dasfaa04