3rd International Workshop on
Knowledge Discovery from Sensor Data
(SensorKDD-2009)
Held in conjunction with KDD 2009

 


Accepted Papers

The review process is complete. We accepted 14 research papers (6 full papers and 8 short papers) and 2 challenge papers. The accepted full papers, short papers, and challenge papers are listed below.

FULL PAPERS:

  1. Title: Handling Outliers and Concept Drift in Online Mass Flow Prediction in CFB Boilers
    Authors: J. Bakker, M. Pechenizkiy, I. Zliobaite, A. Ivannikov, and T. Karkkainen
    Abstract:
    In this paper we consider an application of data mining technology to the analysis of time series data from a pilot circulating fluidized bed (CFB) reactor. We focus on the problem of online mass flow prediction in CFB boilers. We present a framework based on switching regression models depending on perceived changes in the data. We analyze three alternatives for change detection. Additionally, noise-canceling, state-determination, and windowing mechanisms are used to improve the robustness of online prediction. We validate our ideas on real data collected from the pilot CFB boiler.
  2. Title: An Exploration of Climate Data Using Complex Networks
    Authors: Karsten Steinhaeuser, Nitesh V. Chawla, and Auroop R. Ganguly
    Abstract:
    To discover patterns in historical data, climate scientists have applied various clustering methods with the goal of identifying regions that share some common climatological behavior. However, past approaches are limited by the fact that they either consider only a single time period (snapshot) of multivariate data, or they consider only a single variable by using the time series data as a multi-dimensional feature vector. In both cases, potentially useful information may be lost. Moreover, clusters in high-dimensional data space can be difficult to interpret, prompting the need for a more effective data representation. We address both of these issues by employing a complex network (graph) to represent climate data, a more intuitive model that can be used for analysis while also having a direct mapping to the physical world for interpretation. A cross correlation function is used to weight network edges, thus respecting the temporal nature of the data, and a community detection algorithm identifies multivariate clusters. Examining networks for consecutive periods allows us to study structural changes over time. We show that communities have a climatological interpretation and that disturbances in structure can be an indicator of climate events (or lack thereof). Finally, we discuss how this model can be applied for the discovery of more complex concepts such as unknown teleconnections or the development of multivariate climate indices and predictive insights.
  3. Title: A Comparison of SNOTEL and AMSR-E Snow Water Equivalent Datasets in Western U.S. Watersheds
    Authors: Cody L. Moser, Oubeidillah Aziz, Glenn A. Tootle, Venkat Lakshmi, and Greg Kerr
    Abstract:
    It is a consensus among earth scientists that climate change will result in an increased frequency of extreme events (e.g., precipitation, snow). Streamflow forecasts and flood/drought analyses, given this high variability in the climatic driver (snowpack), are vital in the western United States. However, the ability to produce accurate forecasts and analyses is dependent upon the quality (accuracy) of these predictors (snowpack). Current snowpack datasets are based upon in-situ telemetry. Recent satellite deployments offer an alternative remote sensing data source of snowpack. The proposed research will investigate (compare) remote sensing datasets in western U.S. watersheds in which snowpack is the primary driver of streamflow. A comparison is made between snow water equivalent (SWE) data from in-situ snowpack telemetry (SNOTEL) sites and the advanced microwave scanning radiometer – earth observing system (AMSR-E) aboard NASA’s Aqua satellite. Principal component techniques and Singular Value Decomposition are applied to determine similarities and differences between the datasets and investigate regional snowpack behaviors. Given the challenges (including costs, operation and maintenance) of deploying SNOTEL stations, the objective of the research is to determine if satellite-based remotely sensed SWE data provide a comparable option to in-situ datasets. Watersheds investigated include the North Platte River, the Upper Green River, and the Upper Colorado River. The time period analyzed is 2003-2008, due to the recent deployment of the NASA Aqua satellite. Two distinct snow regions were found to behave similarly between both datasets using principal component analysis. Singular Value Decomposition linked both data products with streamflow in the region and found similar behaviors among datasets. However, only 11 of the 84 SNOTEL sites investigated correlated at a significance of 90% or greater with their corresponding AMSR-E cells. Also, when comparing SNOTEL data with the corresponding satellite cell, there was a consistent difference in the magnitude (Snow Water Equivalent) of the datasets. Finally, both datasets were utilized and compared in a statistically based streamflow forecast of several gages.
  4. Title: EDISKCO: Energy Efficient Distributed in-Sensor-Network K-center Clustering with Outliers
    Authors: Marwan Hassani, Emmanuel Muller, and Thomas Seidl
    Abstract:
    Clustering is an established data mining technique for grouping objects based on similarity. For sensor networks one aims at grouping sensor measurements in groups of similar measurements. As sensor networks have limited resources in terms of available memory and energy, a major task in sensor clustering is efficient computation on sensor nodes. As a dominating energy consuming task, communication has to be reduced for a better energy efficiency. Considering memory, one has to reduce the amount of stored information on each sensor node. For in-network clustering, k-center based approaches provide k representatives out of the collected sensor measurements. We propose EDISKCO, an outlier aware incremental method for efficient detection of k-center clusters. Our novel approach is energy aware and reduces the amount of required transmissions while producing high quality clustering results. In thorough experiments on synthetic and real world data sets, we show that our approach outperforms a competing technique in both clustering quality and energy efficiency. Thus, we achieve overall significantly better life times of our sensor networks.
  5. Title: Phenological Event Detection from Multitemporal Image Data
    Authors: Ranga Raju Vatsavai
    Abstract:
    Monitoring biomass over large geographic regions for seasonal changes in vegetation and crop phenology is important for many applications. In this paper we present a novel clustering based change detection method using MODIS NDVI time series data. We used the well-known EM technique to find GMM parameters and the Bayesian Information Criterion (BIC) for determining the number of clusters. The KL divergence measure is then used to establish the cluster correspondence across two years (2001 and 2006) to determine changes between these two years. The changes identified were further analyzed for understanding phenological events. This preliminary study shows interesting relationships between key phenological events such as the onset, length, and end of growing seasons.
  6. Title: Mining in a Mobile Environment
    Authors: Sean McRoskey, James Notwell, Nitesh V. Chawla, and Christian Poellabauer
    Abstract:
    Distributed PRocessing in Mobile Environments (DPRiME) is a framework for processing large data sets across an ad-hoc network. Developed to address the shortcomings of Google’s MapReduce outside of a fully-connected network, DPRiME separates nodes on the network into a master and workers; the master distributes sections of the data to available one-hop workers to process in parallel. Upon returning results to its master, a worker is assigned an unfinished task. Five data mining classifiers were implemented to process the data: decision trees, k-means, k-nearest neighbor, Naïve Bayes, and artificial neural networks. Ensembles were used so the classification tasks could be performed in parallel. This framework is well-suited for many tasks because it handles communications, node movement, node failure, packet loss, data partitioning, and result collection automatically. Therefore, DPRiME allows users with little knowledge of networking or distributed systems to harness the processing power of an entire network of single- and multi-hop nodes.

 

SHORT PAPERS:

  1. Title: On the Identification of Intra-seasonal Changes in the Indian Summer Monsoon
    Authors: Shivam Tripathi and Rao S. Govindaraju
    Abstract:
    Intra-seasonal changes in the Indian summer monsoon are generally characterized by its active and break (A&B) states. Existing methods for identifying the A&B states using rainfall data rely on subjective thresholds, ignore temporal dependence in the data, and disregard inherent uncertainty in their identification. This paper develops a method to identify intra-seasonal changes in the monsoon using a hidden Markov model (HMM) that allows objective classification of the monsoon states. The method facilitates probabilistic interpretation, which is especially useful during the transition period between the two monsoon states. The developed method can also be used to (i) identify monsoon states in real time, (ii) forecast rainfall values, and (iii) generate synthetic data. Comparisons of the results from the proposed model with those from existing methods suggest that the new method is promising for detecting intra-seasonal changes in the Indian summer monsoon.
  2. Title: Reduction of Ground-Based Sensor Sites for Spatio-Temporal Analysis of Aerosols
    Authors: Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic
    Abstract:
    In many remote sensing applications it is important to use multiple sensors to be able to understand the major spatiotemporal distribution patterns of an observed phenomenon. A particular remote sensing application addressed in this study is estimation of an important property of the atmosphere, called Aerosol Optical Depth (AOD). Remote sensing data for AOD estimation are collected from ground-based and satellite-based sensors. Satellite-based measurements can be used as attributes for estimation of AOD and in this way could lead to better understanding of spatiotemporal aerosol patterns on a global scale. Ground-based AOD estimation is more accurate and is traditionally used as ground-truth information in validation of satellite-based AOD estimations. In contrast to this traditional role of ground-based sensors, a data mining approach allows more active use of ground-based measurements as labels in supervised learning of a regression model for AOD estimation from satellite measurements. Considering the high operational costs of ground-based sensors, we are studying a budget-cut scenario that requires a reduction in the number of ground-based sensors. To minimize loss of information, the objective is to retain sensors that are the most useful as a source of labeled data. The proposed goodness criterion for the selection is how close the accuracy of a regression model built on data from a reduced sensor set is to the accuracy of a model built on the entire set of sensors. We developed an iterative method that removes sensors one by one from locations where AOD can be predicted most accurately using training data from the remaining sites. Extensive experiments on two years of globally distributed AERONET ground-based sensor data provide strong evidence that sensors selected using the proposed algorithm are more informative than the competing approaches that select sensors at random or that select sensors based on spatial diversity.
  3. Title: OcVFDT: One-class Very Fast Decision Tree for One-class Classification of Data Streams
    Authors: Chen Li, Yang Zhang, and Xue Li
    Abstract:
    Current research on data stream classification mainly focuses on supervised learning, in which a fully labeled data stream is needed for training. However, fully labeled data streams are expensive to obtain, which makes the supervised learning approach difficult to apply to real-life applications. In this paper, we model applications, such as credit fraud detection and intrusion detection, as a one-class data stream classification problem. The cost of fully labeling the data stream is reduced as users only need to provide some positive samples together with the unlabeled samples to the learner. Based on VFDT and POSC4.5, we propose our OcVFDT (One-class Very Fast Decision Tree) algorithm. Experimental study on both synthetic and real-life datasets shows that the OcVFDT has excellent classification performance. Even when 80% of the samples in the data stream are unlabeled, the classification performance of OcVFDT is still very close to that of VFDT, which is trained on a fully labeled stream.
  4. Title: A Frequent Pattern Based Framework for Event Detection in Sensor Network Stream Data
    Authors: Li Wan, Jianxin Liao, and Xiaomin Zhu
    Abstract:
    In this paper, we present a frequent-pattern-based framework for event detection in stream data. It consists of three phases: frequent pattern discovery, frequent pattern selection, and modeling. In the first phase, a MNOE (Mining Non-Overlapping Episode) algorithm is proposed to find the non-overlapping frequent patterns in time series. In the frequent pattern selection phase, we propose an EGMAMC (Episode Generated Memory Aggregation Markov Chain) model to help us select episodes that describe the stream data significantly. We then define feature flows to represent the instances of discovered frequent patterns and categorize the distribution of frequent pattern instances into three categories according to the spectrum of their feature flows. Finally, we propose a clustering algorithm, EDPA (Event Detection by Pattern Aggregation), to aggregate strongly correlated frequent patterns together. We argue that strongly correlated frequent patterns form events and that frequent patterns in different categories can be aggregated to form different kinds of events. Experiments on real-world sensor network datasets demonstrate that the proposed MNOE algorithm is more efficient than the existing non-overlapping episode mining algorithm and that EDPA performs better when the input frequent patterns are maximal, significant and non-overlapping.
  5. Title: Supervised Clustering via Principal Component Analysis in a Retrieval Application
    Authors: Esteban Garcia-Cuesta, Ines M. Galvan, and Antonio J. de Castro
    Abstract:
    In regression problems where the number of predictors exceeds the number of observations and the correlation between the predictors is high, a dimensionality reduction or a variable selection approach is required. In this paper we deal with a real application where we want to retrieve the physical characteristics of a combustion process from the measurements obtained with a spectroscopic sensor. This application exhibits a multicollinearity problem and is furthermore considered an ill-posed problem. Guided by this application scenario, we propose a clustering approach to find homogeneous subsets of data that are embedded in arbitrarily oriented linear manifolds. This model is developed under certain assumptions guided by a priori problem knowledge. The resulting division preserves both the a priori assumptions and the homogeneity in the models. We thereby break the whole problem into n subproblems, improving their individual prediction accuracy relative to a global solution. We show the improvements obtained in a real application scenario related to estimating the temperature from spectroscopic data in a remote sensing framework.
  6. Title: A Novel Measure for Validating Clustering Results Applied to Road Traffic
    Authors: Yosr Naija and Kaouther Blibech Sinaoui
    Abstract:
    Clustering validation and clustering interpretation are the last two steps of the clustering process. The validation step makes it possible to evaluate the goodness of clustering results using some measures. Valid results are then generally interpreted and used in cluster analysis. The validity measures are classified into three categories: unsupervised measures, supervised measures and relative measures. Several supervised measures have been proposed to perform supervised evaluation, such as entropy, purity, F-measure, Jaccard coefficient and Rand statistic. Generally, these measures evaluate results according to class labels. However, they are not always able to distinguish interpretable clusters because most of them depend on the number of labels. This paper proposes a new supervised evaluation measure, called "homogeneity degree," that makes it possible to merge the steps of validation and interpretation. Our measure is applied to a real traffic data set and is used to interpret some traffic situations. Comparison with other evaluation measures shows the performance of our proposal.
  7. Title: SkyTree: Scalable Skyline Computation for Sensor Data
    Authors: Jongwuk Lee and Seung-won Hwang
    Abstract:
    Skyline queries have gained attention for supporting multicriteria analysis of large-scale datasets. While many skyline algorithms have been proposed, most of them build upon pre-computed index structures that cannot generally be supported over sensor data with dynamically changing attribute values. We aim to design a scalable non-index skyline computation algorithm for sensor data. More specifically, we propose the SkyTree algorithm, which constructs a dynamic lattice that divides a specific region into several subregions based on a pivot point that maximizes the dominance region. Such a structure enables region-wise dominance tests, which eliminate unnecessary point-wise dominance tests. In addition, we ensure progressiveness, which has not been supported by any non-index algorithm: we can identify k points maximizing the sum of dominance regions using a greedy approximation method. The k points are used to reduce communication cost between sensors in computing the global skyline. Our evaluation results validate the efficiency of the SkyTree algorithm, both in terms of response time and communication overhead, over existing algorithms.
  8. Title: Clustering of Power Quality Event Data Collected via Monitoring Systems Installed on the Electricity Network
    Authors: Mennan Guder, Nihan Kesim Cicekli, Ozgul Salor, and Isik Cadirci
    Abstract:
    In this paper, a k-means-based clustering method applied to power quality event data is described. The data are collected by the power quality (PQ) monitors, which are developed through the National PQ Project and installed on the electricity network. The PQ monitors detect the PQ events defined as voltage sags, swells, and interruptions by the IEC Standard 61000-4-30, and collect the raw data of the event. The proposed method aims to cope with the huge event data size and cluster the event types so that PQ events are ultimately classified. The method helps to manage the event data to come up with PQ assessments for the specific measurement points and to make comparisons of various measurement points in terms of PQ of the electricity network.

 

CHALLENGE PAPERS:

  1. Title: Change Detection in Rainfall and Temperature Patterns over India
    Authors: Shivam Tripathi and Rao S. Govindaraju
    Abstract:
    The changes in rainfall and temperature patterns over India were detected using the Mann-Kendall trend test, Bayesian change point analysis, and a hidden Markov model. A regionalization method was developed to identify homogeneous regions that experience similar weather states. The regionalization helped in finding contiguous regions with strong change signals. The data were investigated at different temporal and spatial resolutions to explore the nature of changes. The study found that the all-India summer monsoon is stable, but the winter or north-east monsoon is gradually intensifying. It also detected an abrupt drop in the winter and spring temperature over north-central India and a gradual increase in the summer temperature over peninsular India. Robustness of the detected changes was evaluated using a recent reanalysis dataset.
  2. Title: Anomaly Detection and Spatio-Temporal Analysis of Global Climate System
    Authors: Mahashweta Das and Srinivasan Parthasarathy
    Abstract:
    Knowledge discovery from temporal, spatial and spatio-temporal data is pivotal for understanding and predicting the behavior of Earth’s ecosystem model. An important influence leaving its impact on the ecosystem is the global climate system. In this paper, the Earth Science data that we have analyzed consists of daily global air temperature and precipitation measurements, aggregated from heterogeneous sensors for fifty years (1950-1999). The enormous amount of data that is available for analysis requires employment of data mining techniques for discovering interesting patterns, detecting significant changes and extracting meaningful insights from the data. Our work considers the problem of detecting anomalous (abnormal or unexpected) behavior in the global climate system, discovering teleconnection patterns and providing consequential insights to the analysts.


Thanks to Our Sponsors: ORNL and CONET


COPYRIGHT February 2009 | Contact: Olufemi A. Omitaomu