|
Tutorial 1 :
Algorithmic Excursions in Data Streams Sudipto Guha Abstract For many recent applications, the concept of a data
stream is more appropriate than a data set. By nature, a stored
data set is an appropriate model when significant portions of the
data are queried again and again, and updates are small and/or
relatively infrequent. In contrast, a data stream is an
appropriate model when a large volume of data is arriving continuously and
it is either unnecessary or impractical to store the data in
some form of memory. Many applications naturally generate data
streams as opposed to simple data sets. Astronomers,
telecommunications companies, banks, stock-market analysts, and news organizations, for
example, have vast amounts of data arriving continuously. Data Mining
of streams is thus a necessary ingredient for many successful
applications. The stream view challenges basic assumptions in data mining
like random access to data. It also raises several fundamental questions
like are there effective techniques for mining streams? In this tutorial we will present a survey of
algorithms and applications related to data streams. We begin by
presenting the basic data stream model of computation. We will then cover
techiques for preprocessing a stream like sampling from a stream,
dimension reduction of a stream, and summarizing a stream
using structures like histograms. These preprocessing steps are commonly
employed prior to data mining. We will then cover various techniques
for mining streams like computing frequent itemsets, clusters, and
decision trees. Since query processing is a basic tool that is needed to
support data mining in databases. We assume that the audience has
elementary knowledge of algorithms and a basic understanding of data
mining. By and large, the tutorial will be self-contained. A Preliminary Outline of the Tutorial 1. The
Data Stream Model (a)
Formal Definition of a Data Stream (b)
Different Models of Stream Computation (c)
Difference between Stream Models and existing models 2.
Algorithmic Techniques for Preprocessing a Stream for Mining Purposes (a)
Sampling (b)
Dimension Reduction (c)
Histogram construction (d)
Wavelet representation (e)
Signatures (f)
Quantiles and Order Statistics 3.
Mining a Data Stream (a)
Clustering: MinMax, MinSum (b)
Learning: Halfspaces, Decision Trees (c)
Frequent Itemsets Biography
Contact Information: Sudipto Guha University of Pennsylvannia, Email: sudipto@cis.upenn.edu |