Accepted Papers
The review process is over and we accepted 14 research papers (6 full papers and 8 short papers) and 2 challenge papers. The accepted full papers, short papers, and challenge papers are listed below.
FULL PAPERS:
-
Title: Handling Outliers and Concept Drift in Online Mass Flow Prediction in CFB Boilers
Authors: J. Bakker, M. Pechenizkiy, I. Zliobaite, A. Ivannikov, and T. Karkkainen
Abstract: In this paper we consider an application of data mining technology
to the analysis of time series data from a pilot circulating
fluidized bed (CFB) reactor. We focus on the problem
of the online mass prediction in CFB boilers. We present a
framework based on switching regression models depending
on perceived changes in the data. We analyze three alternatives
for change detection. Additionally, noise canceling
and state determination and windowing mechanisms are
used to improve the robustness of online prediction. We
validate our ideas on real data collected from the pilot CFB
boiler.
-
Title: An Exploration of Climate Data Using Complex Networks
Authors: Karsten Steinhaeuser, Nitesh V. Chawla, and Auroop R. Ganguly
Abstract: To discover patterns in historical data, climate scientists
have applied various clustering methods with the goal of
identifying regions that share some common climatological
behavior. However, past approaches are limited by the fact
that they either consider only a single time period (snapshot)
of multivariate data, or they consider only a single variable
by using the time series data as multi-dimensional feature
vector. In both cases, potentially useful information may be
lost. Moreover, clusters in high-dimensional data space can
be difficult to interpret, prompting the need for a more effective
data representation. We address both of these issues by
employing a complex network (graph) to represent climate
data, a more intuitive model that can be used for analysis
while also having a direct mapping to the physical world
for interpretation. A cross correlation function is used to
weight network edges, thus respecting the temporal nature
of the data, and a community detection algorithm identifies
multivariate clusters. Examining networks for consecutive
periods allows us to study structural changes over time. We
show that communities have a climatological interpretation
and that disturbances in structure can be an indicator of climate
events (or lack thereof). Finally, we discuss how this
model can be applied for the discovery of more complex concepts
such as unknown teleconnections or the development
of multivariate climate indices and predictive insights.
-
Title: A Comparison of SNOTEL and AMSR-E Snow Water Equivalent Datasets in Western U.S. Watersheds
Authors: Cody L. Moser, Oubeidillah Aziz, Glenn A. Tootle, Venkat Lakshmi, and Greg Kerr
Abstract: It is a consensus among earth scientists that climate change will
result in an increased frequency of extreme events (e.g.,
precipitation, snow). Streamflow forecasts and flood/drought
analyses, given this high variability in the climatic driver
(snowpack), are vital in the western United States. However, the
ability to produce accurate forecasts and analyses is dependent
upon the quality (accuracy) of these predictors (snowpack).
Current snowpack datasets are based upon in-situ telemetry.
Recent satellite deployments offer an alternative remote sensing
data source of snowpack. The proposed research will investigate
(compare) remote sensing datasets in western U.S. watersheds in
which snowpack is the primary driver of streamflow. A
comparison is made between snow water equivalent (SWE) data
from in-situ snowpack telemetry (SNOTEL) sites and the
advanced microwave scanning radiometer – earth observing
system (AMSR-E) aboard NASA’s Aqua satellite. Principal
component techniques and Singular Value Decomposition are
applied to determine similarities and differences between the
datasets and investigate regional snowpack behaviors. Given the
challenges (including costs, operation and maintenance) of
deploying SNOTEL stations, the objective of the research is to
determine if satellite based remote sensed SWE data provide a
comparable option to in-situ datasets. Watersheds investigated
include the North Platte River, the Upper Green River, and the
Upper Colorado River. The time period analyzed is 2003-2008,
due to the recent deployment of the NASA Aqua satellite. Two
distinct snow regions were found to behave similarly between
both datasets using principal component analysis. Singular Value
Decomposition linked both data products with streamflow in the
region and found similar behaviors among datasets. However,
only 11 of the 84 SNOTEL sites investigated correlated at a
significance of 90% or greater with its corresponding AMSR-E
cell. Also, when comparing SNOTEL data with the corresponding
satellite cell, there was a consistent difference in the magnitude
(Snow Water Equivalent) of the datasets. Finally, both datasets
were utilized and compared in a statistically based streamflow
forecast of several gages.
-
Title: EDISKCO: Energy Efficient Distributed in-Sensor-Network K-center Clustering with Outliers
Authors: Marwan Hassani, Emmanuel Muller, and Thomas Seidl
Abstract: Clustering is an established data mining technique for grouping
objects based on similarity. For sensor networks one
aims at grouping sensor measurements in groups of similar
measurements. As sensor networks have limited resources
in terms of available memory and energy, a major task in sensor
clustering is efficient computation on sensor nodes. As
a dominating energy consuming task, communication has to
be reduced for a better energy efficiency. Considering memory,
one has to reduce the amount of stored information on
each sensor node.
For in-network clustering, k-center based approaches provide
k representatives out of the collected sensor measurements.
We propose EDISKCO, an outlier aware incremental
method for efficient detection of k-center clusters. Our
novel approach is energy aware and reduces the amount of required
transmissions while producing high quality clustering
results. In thorough experiments on synthetic and real
world data sets, we show that our approach outperforms a
competing technique in both clustering quality and energy
efficiency. Thus, we achieve overall significantly better life
times of our sensor networks.
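As background for the k-center objective mentioned in the abstract above, the following is a minimal Python sketch of the classic greedy farthest-first heuristic for choosing k representatives. It is not EDISKCO: it is centralized, ignores outliers, and does not model energy or communication. The function names and the use of Euclidean distance are assumptions made for illustration.

```python
import math


def farthest_first_centers(points, k):
    """Greedy k-center heuristic: repeatedly pick the point that is
    farthest from the centers chosen so far."""
    if not points or k <= 0:
        return []
    centers = [points[0]]  # arbitrary first center (assumption)
    # Distance from every point to its nearest chosen center.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [
            min(d, math.dist(p, points[idx]))
            for p, d in zip(points, nearest)
        ]
    return centers


def clustering_radius(points, centers):
    """Largest distance from any point to its nearest center
    (the quantity the k-center objective tries to minimize)."""
    return max(min(math.dist(p, c) for c in centers) for p in points)
```

For example, `farthest_first_centers([(0, 0), (1, 0), (10, 0), (11, 0)], 2)` selects one representative from each of the two well-separated groups.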
-
Title: Phenological Event Detection from Multitemporal Image Data
Authors: Ranga Raju Vatsavai
Abstract: Monitoring biomass over large geographic regions for seasonal changes in vegetation and crop phenology is important
for many applications. In this paper we present a novel
clustering based change detection method using MODIS NDVI
time series data. We used the well-known EM technique to find
GMM parameters and the Bayesian Information Criterion (BIC)
for determining the number of clusters. The KL divergence measure is then used to establish the cluster correspondence
across two years (2001 and 2006) to determine changes between these two years. The changes identified were further
analyzed for understanding phenological events. This preliminary study shows interesting relationships between key
phenological events such as onset, length, end of growing
seasons.
-
Title: Mining in a Mobile Environment
Authors: Sean McRoskey, James Notwell, Nitesh V. Chawla, and Christian Poellabauer
Abstract: Distributed PRocessing in Mobile Environments (DPRiME)
is a framework for processing large data sets across an ad-hoc
network. Developed to address the shortcomings of Google’s
MapReduce outside of a fully-connected network, DPRiME
separates nodes on the network into a master and workers;
the master distributes sections of the data to available one-hop
workers to process in parallel. Upon returning results to
its master, a worker is assigned an unfinished task. Five data
mining classifiers were implemented to process the data: decision
trees, k-means, k-nearest neighbor, Naïve Bayes, and
artificial neural networks. Ensembles were used so the classification
tasks could be performed in parallel. This framework
is well-suited for many tasks because it handles communications,
node movement, node failure, packet loss, data
partitioning, and result collection automatically. Therefore,
DPRiME allows users with little knowledge of networking
or distributed systems to harness the processing power of
an entire network of single- and multi-hop nodes.
SHORT PAPERS:
-
Title: On the Identification of Intra-seasonal Changes in the Indian Summer Monsoon
Authors: Shivam Tripathi and Rao S. Govindaraju
Abstract: Intra-seasonal changes in the Indian summer monsoon are
generally characterized by its active and break (A&B) states.
Existing methods for identifying the A&B states using rainfall data rely on subjective thresholds, ignore temporal dependence in the data, and disregard inherent uncertainty in
their identification. This paper develops a method to identify intra-seasonal changes in the monsoon using a hidden
Markov model (HMM) that allows objective classification
of the monsoon states. The method facilitates probabilistic
interpretation which is especially useful during the transition period between the two monsoon states. The developed
method can also be used to (i) identify monsoon states
in real time, (ii) forecast rainfall values, and (iii) generate
synthetic data. Comparisons of the results from the proposed model with those from existing methods suggest that
the new method is promising for detecting intra-seasonal
changes in the Indian summer monsoon.
-
Title: Reduction of Ground-Based Sensor Sites for Spatio-Temporal Analysis of Aerosols
Authors: Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic
Abstract: In many remote sensing applications it is important to use
multiple sensors to be able to understand the major spatiotemporal
distribution patterns of an observed phenomenon. A
particular remote sensing application addressed in this study is
estimation of an important property of atmosphere, called Aerosol
Optical Depth (AOD). Remote sensing data for AOD estimation
are collected from ground and satellite-based sensors. Satellite based
measurements can be used as attributes for estimation of
AOD and in this way could lead to better understanding of spatiotemporal
aerosol patterns on a global scale. Ground-based AOD
estimation is more accurate and is traditionally used as ground-truth
information in validation of satellite-based AOD
estimations. In contrast to this traditional role of ground-based
sensors, a data mining approach allows more active use of
ground-based measurements as labels in supervised learning of a
regression model for AOD estimation from satellite
measurements. Considering the high operational costs of ground-based
sensors, we are studying a budget-cut scenario that requires
a reduction in the number of ground-based sensors. To minimize
loss of information, the objective is to retain sensors that are the
most useful as a source of labeled data. The proposed goodness
criterion for the selection is how close the accuracy of a
regression model built on data from a reduced sensor set is to the
accuracy of a model built on the entire set of sensors. We
developed an iterative method that removes sensors one by one
from locations where AOD can be predicted most accurately using
training data from the remaining sites. Extensive experiments on
two years of globally distributed AERONET ground-based sensor
data provide strong evidence that sensors selected using the
proposed algorithm are more informative than the competing
approaches that select sensors at random or that select sensors
based on spatial diversity.
-
Title: OcVFDT: One-class Very Fast Decision Tree for One-class Classification of Data Streams
Authors: Chen Li, Yang Zhang, and Xue Li
Abstract: Current research on data stream classification mainly focuses on supervised learning, in which a fully labeled data
stream is needed for training. However, fully labeled data
streams are expensive to obtain, which makes the supervised
learning approach difficult to apply to real-life applications. In this paper, we model applications, such as credit
fraud detection and intrusion detection, as a one-class data
stream classification problem. The cost of fully labeling the
data stream is reduced as users only need to provide some
positive samples together with the unlabeled samples to the
learner. Based on VFDT and POSC4.5, we propose our
OcVFDT (One-class Very Fast Decision Tree) algorithm.
Experimental study on both synthetic and real-life datasets
shows that OcVFDT has excellent classification performance. Even when 80% of the samples in the data stream are unlabeled, the classification performance of OcVFDT is still
very close to that of VFDT, which is trained on a fully labeled stream.
-
Title: A Frequent Pattern Based Framework for Event Detection in Sensor Network Stream Data
Authors: Li Wan, Jianxin Liao, and Xiaomin Zhu
Abstract: In this paper, we present a frequent pattern based framework for
event detection in stream data. It consists of three phases: frequent pattern
discovery, frequent pattern selection, and modeling.
In the first phase, a MNOE (Mining Non-Overlapping Episode)
algorithm is proposed to find the non-overlapping frequent pattern
in time series. In the frequent pattern selection phase, we proposed
an EGMAMC (Episode Generated Memory Aggregation Markov
Chain) model to help us select episodes which can describe
stream data significantly. Then we defined feature flows to
represent the instances of discovered frequent patterns and
categorized the distribution of frequent pattern instances into three
categories according to the spectrum of their feature flows. At last,
we proposed a clustering algorithm EDPA (Event Detection by
Pattern Aggregation) to aggregate strongly correlated frequent
patterns together. We argue that strongly correlated frequent
patterns form events and frequent patterns in different categories
can be aggregated to form different kinds of events. Experiments
on real-world sensor network datasets demonstrate that the
proposed MNOE algorithm is more efficient than the existing
non-overlapping episode mining algorithm and EDPA performs
better when the input frequent patterns are maximal, significant
and non-overlapping.
-
Title: Supervised Clustering via Principal Component Analysis in a Retrieval Application
Authors: Esteban Garcia-Cuesta, Ines M. Galvan, and Antonio J. de Castro
Abstract: In regression problems where the number of predictors exceeds the number of observations and the correlation between the predictors is high, a dimensionality reduction or
a variable selection approach is demanded. In this paper we
deal with a real application where we want to retrieve the
physical characteristics of a combustion process from the
measurements obtained with a spectroscopic sensor. This
application exhibits a multicollinearity problem and, furthermore, is considered an ill-posed problem.
Guided by this application scenario, we propose a clustering
approach to find homogeneous subsets of data that are
embedded in arbitrarily oriented linear manifolds. This model
is developed under certain assumptions guided by a priori
problem knowledge. The resulting division preserves both
the a priori assumptions and the homogeneity in the models.
Thereby we break the whole problem into n subproblems, improving the individual prediction accuracy versus a global
solution. We show the obtained improvements in a real application scenario related to estimating the temperature
from spectroscopic data in a remote sensing framework.
-
Title: A Novel Measure for Validating Clustering Results Applied to Road Traffic
Authors: Yosr Naija and Kaouther Blibech Sinaoui
Abstract: The clustering validation and clustering interpretation are
the two last steps of clustering process. The validation step
permits to evaluate the goodness of clustering results using
some measures. Valid results are then generally interpreted
and used in cluster analysis. The validity measures are classified into three categories: unsupervised measures, supervised measures and relative measures. Several supervised
measures have been proposed to perform supervised evaluation such as entropy, purity, F-measure, Jaccard coefficient
and Rand statistic. Generally, these measures evaluate results according to class labels. However, they are not always
able to distinguish interpretable clusters because most of
them depend on the number of labels. This paper proposes
a new supervised evaluation measure, called "homogeneity
degree", that makes it possible to merge the steps of validation and
interpretation. Our measure is applied to a real traffic data
set and is used to interpret some traffic situations. Comparison with other evaluation measures shows the performance
of our proposal.
-
Title: SkyTree: Scalable Skyline Computation for Sensor Data
Authors: Jongwuk Lee and Seung-won Hwang
Abstract: Skyline queries have gained attention for supporting multicriteria analysis of large-scale datasets. While a lot of skyline algorithms have been proposed, most of the algorithms
build upon pre-computed index structures that cannot generally be supported over sensor data of dynamically changing attribute values. We aim to design a scalable non-index
skyline computation algorithm for sensor data. More specifically, we propose Algorithm SkyTree constructing a dynamic
lattice that divides a specific region into several subregions
based on a pivot point maximizing the dominance region. Such
a structure enables region-wise dominance tests,
which eliminate unnecessary point-wise dominance tests.
In addition, we ensure the progressiveness that has not been
supported by any non-index algorithm, where we can identify k points maximizing the sum of dominance regions as
the greedy approximation method. The k points are used
to reduce communication cost between sensors in computing
global skyline. Our evaluation results validate the efficiency
of Algorithm SkyTree, both in terms of response time and
communication overhead, over existing algorithms.
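As background for the abstract above, the point-wise dominance test that skyline computation rests on can be sketched in Python. This is a naive quadratic baseline, not Algorithm SkyTree: it builds no lattice and performs no region-wise pruning. The function names and the convention that smaller values are preferred in every dimension are assumptions made for illustration.

```python
def dominates(a, b):
    """Return True if point a dominates point b: a is no worse in every
    dimension and strictly better in at least one (smaller is better)."""
    return (
        all(x <= y for x, y in zip(a, b))
        and any(x < y for x, y in zip(a, b))
    )


def skyline(points):
    """Naive skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

With the points `(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)`, the last two are dominated by `(2, 2)`, so only the first three remain in the skyline.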
-
Title: Clustering of Power Quality Event Data Collected via Monitoring Systems Installed on the Electricity Network
Authors: Mennan Guder, Nihan Kesim Cicekli, Ozgul Salor, and Isik Cadirci
Abstract: In this paper, a k-means-based clustering method applied to
power quality event data is described. The data are collected by
the power quality (PQ) monitors, which are developed through
the National PQ Project and installed on the electricity
network. The PQ monitors detect the PQ events defined as
voltage sags, swells, and interruptions by the IEC Standard
61000-4-30, and collect the raw data of the event. The proposed
method aims to cope with the huge event data size and cluster
the event types so that PQ events are ultimately classified. The
method helps to manage the event data to come up with PQ
assessments for the specific measurement points and to make
comparisons of various measurement points in terms of PQ of
the electricity network.
CHALLENGE PAPERS:
- Title: Change Detection in Rainfall and Temperature Patterns over India
Authors: Shivam Tripathi and Rao S. Govindaraju
Abstract: The changes in rainfall and temperature patterns over India were detected using Mann-Kendall trend test, Bayesian
change point analysis, and a hidden Markov model. A regionalization method was developed to identify homogeneous
regions that experience similar weather states. The regionalization helped in finding contiguous regions with strong
change signals. The data were investigated at different temporal and spatial resolution to explore the nature of changes.
The study found that all India summer monsoon is stable,
but the winter or the north-east monsoon is gradually intensifying. It also detected an abrupt drop in the winter
and spring temperature over north-central India and a gradual increase in the summer temperature over the peninsular
India. Robustness of the detected changes was evaluated
using a recent reanalysis dataset.
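The Mann-Kendall trend test named in the abstract above can be illustrated with a short Python sketch. It computes the S statistic and a continuity-corrected normal score, assuming a series with no tied values (the tie correction to the variance is omitted). The function name is an assumption for illustration, and this is not the authors' implementation.

```python
import math


def mann_kendall(series):
    """Return (S, Z) for the Mann-Kendall trend test.

    S counts concordant minus discordant pairs; Z is the
    continuity-corrected normal score. Assumes no tied values,
    so the variance carries no tie correction.
    """
    n = len(series)
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z
```

A positive Z suggests an increasing trend and a negative Z a decreasing one; the score is compared against standard normal quantiles at the chosen significance level.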
- Title: Anomaly Detection and Spatio-Temporal Analysis of Global Climate System
Authors: Mahashweta Das and Srinivasan Parthasarathy
Abstract: Knowledge discovery from temporal, spatial and spatio-temporal data is pivotal for understanding and predicting
the behavior of Earth’s ecosystem model. An important influence
leaving its impact on the ecosystem is the global climate
system. In this paper, the Earth Science data that we
have analyzed consists of daily global air temperature and
precipitation measurements, aggregated from heterogeneous
sensors for fifty years (1950-1999). The enormous amount
of data that is available for analysis requires employment of
data mining techniques for discovering interesting patterns,
detecting significant changes and extracting meaningful insights
from the data. Our work considers the problem of
detecting anomalous (abnormal or unexpected) behavior in
the global climate system, discovering teleconnection patterns
and providing consequential insights to the analysts.