Tracking frequent items over distributed probabilistic data


Tracking frequent items (also called heavy hitters) is one of the most fundamental queries in real-time data due to its wide applications, such as logistics monitoring, association rule based analysis, etc. Recently, with the growing popularity of Internet of Things (IoT) and pervasive computing, a large amount of real-time data is usually collected from multiple sources in a distributed environment. Unfortunately, data collected from each source is often uncertain due to various factors: imprecise reading, data integration from multiple sources (or versions), transmission errors, etc. In addition, due to network delay and limited by the economic budget associated with large-scale data communication over a distributed network, an essential problem is to track the global frequent items from all distributed uncertain data sites with the minimum communication cost. In this paper, we focus on the problem of tracking distributed probabilistic frequent items (TDPF). Specifically, given k distributed sites S = {S1, … , Sk}, each of which is associated with an uncertain database Di of size ni, a centralized server (or called a coordinator) H, a minimum support ratio r, and a probabilistic threshold t, we are required to find a set of items with minimum communication cost, each item X of which satisfies Pr(sup(X) ≥ r × N) > t, where sup(X) is a random variable to describe the support of X and N= ∑i= 1 kni. In order to reduce the communication cost, we propose a local threshold-based deterministic algorithm and a sketch-based sampling approximate algorithm, respectively. The effectiveness and efficiency of the proposed algorithms are verified with extensive experiments on both real and synthetic uncertain datasets.

Publication Title

World Wide Web