Final Summary Report

Executive summary

As the scale of today’s networked techno-social systems continues to increase, the analysis of their global phenomena becomes increasingly difficult, due to the continuous production of streams of data scattered among distributed, possibly resource-constrained nodes, and requiring reliable resolution in (near) real-time.

The FP7 ICT FET Open LIFT project explores a novel approach for realising sophisticated, large-scale distributed data-stream analysis systems, relying on processing local data in situ. Our key insight is that, for a wide range of distributed data analysis tasks, we can employ novel geometric techniques for intelligently decomposing the monitoring of complex holistic conditions and functions into safe, local constraints that can be tracked independently at each node (without communication), while guaranteeing correctness for the global-monitoring operation. While some solutions exist for the limited case of linear functions of the data, it is hard to deal with general, non-linear functions: in this case, a node’s local function value essentially tells us absolutely nothing about the global function value. Our fundamental idea is to design novel algorithmic tools that monitor the input domain of the global function rather than its range. Each node can then be assigned a safe zone (SZ) for its local values that can offer guarantees for the value of the global function over the entire collection of nodes.

This represents a dramatic shift in conventional thinking and the state of the art. We aim to reduce the amount of communication and data collection across nodes to a minimum, requiring nodes to communicate only when their local constraints are violated. In our view, privacy protection, in cases where the transmitted data contain sensitive information, is also revolutionised by this approach. We have investigated real-life scenarios from cloud health monitoring, large-scale analysis of human mobility and traffic phenomena, internet-scale distributed querying, and the monitoring of sensor networks.

Our dissemination results include more than 30 scientific publications, including papers in high-level peer-reviewed journals and conferences. The fact that journals such as PVLDB and TKDE and conferences such as ICDE and VLDB in data management, KDD and ECML in data mining, and ACM GIS and AGILE in geographic information science could be addressed guarantees a strong potential impact and high visibility for the LIFT project. In addition, we have presented LIFT results at more than 50 scientific events, such as invited keynote speeches, panel discussions, and summer schools.

 

Summary description of project context and objectives

As the scale of today’s networked techno-social systems continues to increase, the analysis of their global phenomena becomes increasingly difficult, due to the continuous production of streams of data scattered among distributed, possibly resource-constrained nodes, and requiring reliable resolution in (near) real-time. As both the scale of today’s networked systems and the volumes and rates of the associated data streams continue to increase, algorithms and tools for effectively analysing them are becoming an important research mandate. Powerful tools for sophisticated data processing and analysis – such as relational and OLAP query engines, decision trees, association rule mining, feature selection and classification, ranking, statistical analysis, phase-change detection, anomaly and outlier detection, etc. – are indispensable in extracting useful correlations, patterns, and knowledge from data. These tools are used for understanding hidden mechanisms, decision making, optimising the operation of complex and autonomous systems, improving user interaction with automated services, and many other critical tasks. For example, in a keynote on “Extreme Data Mining” at SIGMOD 2008, Sridhar Ramaswamy from Google said, referring to the log files generated at the Google data centre: “they are our only source of truth”. At the same time, they are “orders of magnitude larger than any other data in our system”.

The data-management and data-mining research communities pride themselves on the ability of their technologies to scale to large amounts of data. Unfortunately, traditional data processing and analysis systems are built on the assumption of centralised and static data sets, and this assumption is painfully unrealistic in today’s connected, data-driven world: The largest data collections will inevitably be generated by distributed, dynamic sources of evolving information streams, and reside on the nodes of large-scale networked systems. However, current technologies for data analysis fail to excel in two crucial scalability dimensions: (1) the degree of distribution and (2) the dynamicity of the data. The degree of distribution is the key scalability metric for massive networked systems, such as the Internet (estimated at hundreds of millions of nodes) or even modern P2P networks (easily comprising dynamic collections of thousands of peer nodes). By contrast, the largest database systems in the world scale to at most a few hundred nodes. In addition, the information residing in such systems is invariably dynamic in nature, with continuous streams of data flowing from a variety of sources, including packet traces, system logs, communication protocols, multiple sensor modalities, and mobile pervasive-computing devices and applications.

Such massively-distributed data streams need to be processed in a variety of ways, with a number of critical tasks, e.g., network health monitoring and traffic engineering, mobile-user services, decision making in large-scale sensor nets and robot swarms, often 

(1) involving dynamically evolving streams of data, scattered among several, possibly resource-constrained, distributed nodes, and 

(2) requiring reliable and quick resolution in (near) real-time. 

The stringent requirements for such distributed-stream processing engines raise a host of novel research challenges which, for the most part, remain open.

The local inference paradigm. In the LIFT project, we explore a novel approach for realising sophisticated, large-scale distributed data-stream analysis systems. Our approach represents a radical departure from the current paradigm of centralising all the data. It relies on processing local streams of data in situ (i.e., locally at the nodes where the data is observed). At the core of our approach, we will develop general, principled algorithmic paradigms for solving a broad spectrum of complex, distributed data-stream analysis problems in a communication-efficient manner, making a strong case for local data processing and inference as a first choice in large-scale distributed systems. Our key technical insight is that, for several global data-analysis tasks, including a wide range of continuous queries and complex data mining tasks, we can employ novel geometric techniques for intelligently decomposing the monitoring of complex holistic conditions and functions into safe, local constraints that can be tracked independently at each node (without communication), while guaranteeing correctness for the global-monitoring operation. Thus, our approach aims to reduce the amount of communication and data collection across nodes to a minimum, essentially requiring nodes to communicate only when their local constraints are violated, implying that a global “phase change” or a significant shift in the value of the global target function is highly probable.

From linear to non-linear functions. The idea of processing a global function through local computations represents a dramatic shift in conventional thinking and the current state of the art as exemplified by the centralisation paradigm; and, at first glance, decomposing the tracking of complex functions – as needed for data mining – over distributed data into local constraints seems like an impossible task. While some solutions do exist for the simple, and severely limited, case of linear functions of the data, it is very hard to extend the approach to general, non-linear functions. In a nutshell, the key difficulty is that, for non-linear functions, a node’s local function value essentially tells us absolutely nothing about the global function value we intend to monitor. Our fundamental idea is to design novel algorithmic tools, inspired by insights from computational geometry, that monitor the input domain of the global function rather than its range. Each node can then be assigned a safe zone (SZ) for its local data-stream values that can offer guarantees for the value of the global function over the entire collection of nodes.
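To make this concrete, the following minimal Python sketch illustrates the principle (the function, threshold, and safe-zone shape are illustrative choices, not project code). Because the chosen zone is convex and contains only points where the function stays below the threshold, all nodes remaining inside it certifies the bound on the function of the global average, with zero messages exchanged.

```python
import numpy as np

# Toy non-linear global function, evaluated at the AVERAGE of the local
# stream vectors held by the nodes:
def f(x):
    return x[0] ** 2 + x[1] ** 2

THRESHOLD = 1.0

# A safe zone is a CONVEX set Z on which f stays below the threshold.
# Convexity is the key: if every local vector lies in Z, so does their
# average, hence f(average) <= THRESHOLD holds with no communication.
# For this particular f, the unit ball around the origin works:
def in_safe_zone(v):
    return float(np.dot(v, v)) <= THRESHOLD

class Node:
    def __init__(self, v):
        self.v = np.asarray(v, dtype=float)

    def update(self, delta):
        """Apply a local stream update; a False return value means the
        local constraint is breached and the coordinator must be told."""
        self.v += delta
        return in_safe_zone(self.v)

nodes = [Node([0.1, 0.2]), Node([-0.3, 0.4]), Node([0.0, -0.5])]
silent = all(n.update(np.array([0.05, 0.0])) for n in nodes)
global_avg = np.mean([n.v for n in nodes], axis=0)
assert silent and f(global_avg) <= THRESHOLD  # guaranteed, zero messages
```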

Departure points: Theory and practice of safe zoning, summarisation, modeling, and privacy. An initial version of the geometric decomposition idea was recently introduced by project partners, with preliminary results from specific applications. Taking this as a starting point, we will rigorously develop the fundamental geometric-decomposition idea, and combine it with novel, powerful geometric insights in order to extend its applicability to several new application domains and develop principled tools for optimal geometric monitoring. Furthermore, we will extend the basic geometric approach with powerful new features that allow nodes to handle massive local data streams with limited space and CPU-time resources and further reduce communication exchanges – these features include local data summarisation (for effectively capturing and monitoring streaming data at the nodes) and local data modeling (for capturing local data-stream dynamics). Preliminary versions of the summarisation and modeling ideas have been explored by project partners in specific applications, but will be significantly extended, optimised, and coupled with the powerful geometric-decomposition paradigm as part of our research effort.

Through the minimisation of communication exchanges, our novel approach to global function monitoring also provides a new principled framework for naturally privacy-preserving distributed computations, a topic that becomes ever more important in today’s techno-social networks and is a show-stopper for many centralised approaches. Intuitively, by continuously tracking local SZs, our techniques require communication across nodes only when significant shifts in the global function are highly probable – thus, our monitoring schemes naturally try to restrict information sharing across nodes to the minimum level required for the global monitoring task. This can obviously help alleviate important ownership, administrative, and, perhaps most importantly, privacy concerns on data sharing, that can be critical when dealing with large-scale distributed systems. And, of course, the use of local summaries and models can only help enhance privacy since it further restricts the disclosure of information across nodes to concise summaries and/or models of the local data (that are specifically tuned for the overall global monitoring task) in addition to further reducing communication. These ideas, once again, provide a novel framework for guaranteeing privacy in efficient distributed computations that is explored in the LIFT project.

A new research program. It is only with the ability to track general, non-linear functions that an essentially decentralised approach to data-stream analysis becomes a possibility for the full range of data processing and analysis tasks that have to be supported in the modern world of knowledge processing. What constitutes the envisaged breakthrough is that we aim not for a mere collection of tools and tricks for local inference. Based on a conceptually simple, yet highly powerful and theoretically well-founded idea, the local approach of LIFT holds the potential for a new unified research program for scalable distributed data mining under real-time conditions and in dynamic domains. It holds practically unlimited potential for extensions, refinements, and applications. Not only can we capitalize on the best learning algorithms invented so far and transform many of them to new magnitudes of scalability in a decentralised setting (by “localizing” and “safe-zoning” them); we also expect completely new algorithms which can take optimal advantage of the local approach.

Substantiating this claim by (1) developing the theoretical core; (2) its extensions and variants; and (3) practical demonstrators, is the mission of the LIFT project. 

Proof-of-Concept Application Domains and Scenarios

As part of our research effort, we have explored the use of our techniques in the context of four important, diverse, and forward-looking application domains: 

(1) Real-Time Analysis of User-Mobility Data. Mining of mobility data has recently become a hot topic. We will extend mobility data mining to a highly-distributed real-time scenario. The purpose is to provide powerful analytical and prediction methods for the management of mobility networks. At the local level, personalised local patterns and models will be continuously mined and maintained, supporting on-line personalised decision making on mobility management for the driver. At the infrastructure level, adequate global patterns, aggregations and models about the overall population will be reconstructed with minimal communication overhead and within a privacy-preserving framework.

(2) Internet-Scale Query Processing (ISQP) Systems, where the goal is to track massive widely-distributed streams for global phenomena in large-scale networks (e.g., DDoS attack detection) or P2P systems (e.g., semantic community formation). Our effort here is closely related to the recent visionary proposal of realising a peer-based, public-health monitoring service for the Internet.

(3) Monitoring Large-Scale Sensor Networks, with the goal of real-time collective decision-making for natural-disaster alerts. We will apply the LIFT approach to the specific requirements of sensor networks, which mainly consist of a high penalty for communication overhead (a result of its negative impact on battery lifetime). An actual hardware implementation of sensor networks is not a primary objective of this project. However, we plan to rigorously implement and test the developed algorithms on real data.

(4) Real-time Monitoring of Massive Distributed Logs in Clouds and Data Centres, in order to optimise overall system management and operations. The logs are streaming, distributed and enormous. We have identified candidate real-life massive streaming log datasets, for example from Microsoft. We will investigate the use of these datasets and apply our methods to show that the required data mining analyses can be carried out distributively, with low overhead and affordable central resources.

In the project we first investigate the theoretical foundations plus important extensions (shape of safe zones, local violation handling and local summaries, probabilistic guarantees, privacy), which capture the various trade-offs present in an application domain. We then demonstrate the theoretical and practical viability of the approach in a diverse set of domains, with demonstrators in the chosen domains evaluated against the measurable success criteria.

Evaluation & Success Criteria

The project’s success will be measured by the following technical/scientific validation criteria:

Communication reduction with respect to global/state-of-the-art solutions. While still dependent on the application, savings can reach 90% in the proposed applications and in some of them may even come close to 100% of the communication volume, as the initial experiments indicate. Other quantitative validation criteria are processing time and the number of false alarms.

Number of domains to which the approach can be deployed. We will enable several futuristic applications that are well beyond the capabilities of current methods. 

The Degree of Genericity we can achieve (a theoretical success criterion). The SZ approach is highly general; still, its application to different domains demands specific choices in terms of the shape of the zones (e.g., convex polyhedra, boxes), the density modeling, the optimisation method, and possibly the allowance of false negatives. This may lead to different trade-offs between response time, communication overhead and correctness across the domains.

Description of the main S&T results

The work of the LIFT project has been structured into 5 technical work packages (WPs) and has been carried out in three phases:

• WP1 developed the basic theory of safe zones, local summarisation and modeling, on which the project is based. It provides the project with algorithms for the monitoring of distributed data, including optimality guarantees. In addition, a set of “safe-zone-enabled” standard data mining algorithms has been developed, as well as an integration of the safe-zone approach with distributed online learning.

• WP2 complements WP1 with a set of approaches for privacy-preserving data mining. In contrast to WP1, which seeks to minimise communication per se, the privacy-preserving operations developed here minimise and control the communication of sensitive information. 

• WP3 bridges the gap between the theoretical results in WPs 1 and 2 and their practical implementation in the scenarios in WP4. Based on the theoretical requirements, it defines a reference architecture, on which the implementation in each scenario is built.

• WP4 demonstrates the feasibility and impact of the LIFT methodology in four selected challenging scenarios. For each scenario, it defines the relevant monitoring tasks, implements a demonstrator based on the technology defined in WP3, and collaborates with WP5 in the scenario-specific evaluation of the LIFT approach.

• WP5 has carried out the evaluation of the LIFT approach. It coordinates the individual evaluation activities in each scenario, and evaluates the feasibility and impact in other application areas.

The phases of the LIFT project have been:

Phase 1 (M1 – M12) developed the theoretical framework underlying this project. Starting from existing methods, new scientific requirements were identified and corresponding methods were developed. In addition, the privacy guidelines and regulations that have to be followed were identified. To complement these efforts, bottom-up analyses of the scenarios, including their requirements and available data, were made.

Phase 2 (M13 – M24) dealt with scenario modeling and theory refinement. This phase was characterised by frequent interactions between the theoretical work packages and the scenarios. Building on the outcome of phase 1, the theory and methods were refined and adapted to each of the scenarios, while new scenario requirements stimulated the further development of the algorithmic methods. Design decisions about the reference software and data architecture were made. In addition, the reference architecture was finalised based on both the theoretical requirements and a generalisation of the software requirements in each of the scenarios.

Phase 3 (M25 – M34) demonstrated and evaluated the impact of the methods developed in this project. In this phase, novel approaches for monitoring distributed data were finalised and applied to each of the scenarios. The safe-zone approach was applied to a set of standard data mining tools to ease the transition to new applications. In the demonstrators, the system was deployed and evaluated in the context of high-profile use cases with high potential impact. This phase also included dissemination activities; in particular, a scientific workshop presenting the main LIFT results and related work was organized at the VLDB 2013 conference in Trento: the First International Workshop on Big Dynamic Distributed Data (http://www.softnet.tuc.gr/bd3/).

 

Main results of the work on theory

A major goal of the LIFT project is to reduce the global testing and monitoring of massively distributed systems to the checking of local constraints, thus minimizing communication overhead. Major results have been:

• generalizing previous work on geometric monitoring;

• applying the mathematical and algorithmic techniques developed so far to solve a problem in static distributed databases;

• developing the general theory of the safe-zone approach, and studying its complexity;

• utilizing local prediction models as mechanisms for tracking the temporal dynamics of local data streams and further reducing communication.

The work on extending and applying the concepts of safe zones (geometric monitoring) and sketches has resulted in numerous publications in high-level database conferences, journals, and workshops. We give a brief overview of the main results:

G. Sagy, D. Keren, I. Sharfman, A. Schuster. Distributed Threshold Querying of General Functions by a Difference of Monotonic Representation. Proc. of the Intl. Conf. on Very Large Data Bases, 2011.

This research deals with threshold queries of general functions over a distributed database. The goal of a threshold query is to detect all objects whose score exceeds a given threshold. This type of query is used in many problems in data mining, event triggering, and top-k selection. Often, threshold queries are performed over distributed data, where an object’s score is computed by aggregating the value of each attribute, applying a scoring function over the aggregation, and thresholding the function’s value. However, joining all the distributed relations at a central database might incur prohibitive overheads in bandwidth, CPU, and storage accesses, and efficient algorithms for reducing these costs exist only for monotonic aggregation threshold queries and certain specific scoring functions. The problem is thus very closely related to the LIFT paradigm of minimizing communication overhead in massively distributed systems. We presented a novel approach for efficiently performing general distributed threshold queries: first, a solution for monotonic functions, which applies and develops the geometric ideas of the bounding-volumes approach; then, a technique that handles other functions by representing them as a difference of monotonic functions. In this way, the bounding-volume approach is applied to a general threshold problem over static, distributed databases, where the thresholded function does not have to satisfy a condition such as linearity, convexity, or monotonicity. Using bounding volumes, local filtering can prune out a high percentage of the objects. Experiments with real-world data demonstrated the method’s effectiveness in achieving low communication and access costs.
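The pruning principle can be sketched as follows (a simplified illustration under stated assumptions, not the paper’s algorithm): if each object’s aggregated attribute vector is known to lie in a box, and the scoring function is written as f = g − h with g and h coordinate-wise non-decreasing, then a cheap local upper bound on f filters out objects that cannot possibly exceed the threshold.

```python
# Sketch of threshold-query pruning via a difference-of-monotonic
# representation f = g - h (g, h non-decreasing). Names are illustrative.

def upper_bound(g, h, lo, hi):
    """Upper bound of f = g - h over the box [lo, hi] of possible
    aggregated attribute values of an object (by monotonicity)."""
    return g(hi) - h(lo)

def prune(objects, g, h, threshold):
    """Keep only objects whose score COULD exceed the threshold;
    the rest are filtered locally, with no data shipped centrally."""
    return [o for (o, lo, hi) in objects
            if upper_bound(g, h, lo, hi) >= threshold]

# Example: f(x) = x^2 - 3x, with g(x) = x^2 monotone on x >= 0 and
# h(x) = 3x monotone everywhere; objects given as (id, lower, upper).
objs = [("a", 0.0, 1.0), ("b", 2.0, 5.0)]
print(prune(objs, lambda x: x * x, lambda x: 3 * x, threshold=5.0))
# -> ['b']: object "a" is pruned since its bound 1 - 0 = 1 < 5
```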

D. Keren, I. Sharfman, A. Schuster, A. Livne. Shape-Sensitive Geometric Monitoring. IEEE Transactions on Knowledge and Data Engineering, 2012.

We extend the basic geometric paradigm towards safe zones with different shapes. In this paper we prove that every safe zone defined in this manner is convex, regardless of the properties of the monitored function f(). On the other hand, since every convex set is closed under averaging, every convex subset of the admissible region S can serve as a safe zone. Therefore, it makes sense to seek an optimal convex subset of S and use it as a safe zone. There are a few possible definitions of “optimal”; generally, a good safe zone is one in which the local vectors will remain for a long time (important since every breach of the safe zone results in communication), and also one which is relatively simple to define. The latter requirement follows from the need of every node to continuously test whether its dynamic data vector is in the safe zone; especially in the case of thin nodes (i.e. battery-operated sensors), it is desirable that this testing process consumes as few resources as possible.

We have experimented with safe zones consisting of convex polyhedra, and compared them with the basic geometric method. Data was restricted to lie in two-dimensional Euclidean space, obtained from wind measurements of the El Niño system in two directions. Using the optimised convex subsets as safe zones improved performance over previous work.

As compared to the previous bounding-volume approach, the new method has the advantage that it searches for an optimal safe zone. The drawback is that a difficult (offline) optimisation has to be solved: finding an optimal convex subset of the potentially high-dimensional, complicated set S. We have made some progress in this direction for the problem of monitoring the median function, which is very important in many applications (robust analysis, sketching, ranking, finding percentiles and heavy hitters over distributed streams).
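When the chosen safe zone is a convex polyhedron {x : Ax ≤ b}, the per-node membership test reduces to one matrix–vector product, which is why the approach suits thin nodes. A minimal sketch (A and b here are made up; in the paper they result from the offline optimisation):

```python
import numpy as np

# A convex polyhedral safe zone {x : A x <= b}. Membership testing is a
# single matrix-vector product plus comparisons, cheap enough for
# battery-operated nodes. Illustrative halfspaces: the box [-1,1]x[-0.5,2].
A = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0],
              [ 0.0, -1.0]])
b = np.array([1.0, 1.0, 2.0, 0.5])

def in_polyhedral_safe_zone(x):
    return bool(np.all(A @ x <= b))

print(in_polyhedral_safe_zone(np.array([0.3, 1.0])))  # True: stay silent
print(in_polyhedral_safe_zone(np.array([1.2, 0.0])))  # False: report
```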

N. Giatrakos, A. Deligiannakis, M. Garofalakis, I. Sharfman, and A. Schuster. “Prediction-Based Geometric Monitoring over Distributed Data Streams”. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2012.

In this work, we employ dynamic local prediction models that aim to describe local stream evolution at each site. Such models have already proven beneficial in terms of bandwidth consumption in distributed settings. The idea is that these local models become part of the information exchanged between the remote sites and the coordinator, so that both parties are “in sync” with respect to predicted local stream evolution, thus reducing the amount of communication significantly. To guarantee communication savings, we describe two sufficient conditions. The first condition is of mostly theoretical interest and essentially requires the containment of the convex hull defined by the predicted local vectors within the convex hull of the static method, which is hard to guarantee in practice. The second condition relies on the idea of monitoring the intersection of the two convex hulls (predictor and static), which is easily shown to be non-empty and is obviously contained within the convex hull of the static method. Based on this key insight, we propose a novel monitoring framework that imposes local bounding regions covering this convex-hull intersection while requiring only minimal information sharing across nodes. We also explore relaxations of the containment and intersection conditions that expose a number of novel alternative tracking mechanisms. In particular, instead of requiring continuous containment or intersection monitoring, we propose simple mechanisms that attempt to render such conditions highly likely while also minimising information sharing across sites (and, thus, reducing communication).
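The mechanics can be illustrated with a minimal sketch (a linear “velocity” predictor and a norm-based re-sync rule, both illustrative simplifications of the paper’s framework): node and coordinator share the prediction model, so silence itself is informative.

```python
import numpy as np

class PredictiveNode:
    """Sketch: node and coordinator share a linear prediction model for
    the local stream; the node stays silent while reality tracks the
    prediction within a slack, keeping both sides 'in sync' for free."""

    def __init__(self, v0, velocity, slack):
        self.v = np.asarray(v0, dtype=float)      # true local vector
        self.anchor = self.v.copy()               # value at last sync
        self.velocity = np.asarray(velocity, dtype=float)  # shared model
        self.t = 0
        self.slack = slack

    def predicted(self):
        # The coordinator computes exactly this value, with no messages.
        return self.anchor + self.t * self.velocity

    def step(self, delta):
        self.t += 1
        self.v += delta
        if np.linalg.norm(self.v - self.predicted()) > self.slack:
            # Re-sync: ship the current value and restart the prediction.
            self.anchor, self.t = self.v.copy(), 0
            return "sync"
        return "silent"

node = PredictiveNode([0.0, 0.0], velocity=[0.1, 0.0], slack=0.5)
print([node.step(np.array([0.1, 0.01])) for _ in range(10)])  # all silent
```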

D. Keren, G. Sagy, A. Abboud, D. Ben-David, A. Schuster, I. Sharfman and A. Deligiannakis. “Geometric Monitoring of Heterogeneous Streams”. IEEE Transactions on Knowledge and Data Engineering, 2013.

Interest in stream monitoring is shifting toward the distributed case. In many important applications (e.g. sensor networks and network operations centers) the data which has to be monitored is of very high volume, dynamic, and distributed, making it infeasible to collect the distinct data streams at a central node and process them there. Often, the monitoring problem consists of determining whether the value of a global function, which depends on the union of all streams, has crossed a certain threshold. A great deal of effort is directed at reducing communication overhead by transforming the monitoring of the global function into the testing of local constraints, checked independently at the nodes. Recently, the geometric method proved to be very useful for constructing such local constraints for general (non-linear, non-monotonic) functions. Alas, in all existing variants of geometric monitoring, the constraints at all nodes are identical, and thus unsuitable for heterogeneous streams. To remedy this, we introduce a more general approach for geometric monitoring of heterogeneous streams (HGM), which defines constraints tailored to fit the distinct data distributions at the nodes. We formulate the recovery of such constraints as the solution of an optimization problem, and analyze its complexity. Next, an algorithm to efficiently solve this problem is presented, which determines constraints that reduce communication and are also simple to check, making the approach appropriate for thin nodes, such as sensors with limited processing power. The applicability of the method to various real problems is demonstrated in experiments, which yield substantial improvement over previous work. Both experiments and theoretical analysis show that this advantage increases with the dimension of the data. In addition, we consider a novel method for recovery from constraint violation.

M. Boley, A. Schuster, I. Sharfman, D. Keren, and M. Kamp. Communication-efficient distributed online prediction by monitoring distances of local models. In First International Workshop on Big Dynamic Distributed Data (at VLDB 2013).

We applied the LIFT primitives to online prediction problems where data points are observed at local nodes in a distributed environment and there is a trade-off between maximizing prediction accuracy and minimizing network communication. This situation abounds in a wide range of machine learning applications in which communication induces a severe cost.

We give the first distributed prediction protocol for linear models that, at the same time, aims to provide high online in-place prediction performance and explicitly tries to minimize communication. In terms of predictive power, the protocol retains the asymptotically optimal regret of the existing distributed mini-batch algorithm. In addition, it allows the communication among the local nodes to be reduced substantially. This is achieved by a dynamic, data-dependent communication schedule which, in contrast to previous algorithms, remains applicable when the data is non-stationary and shows rapid concept drifts. We demonstrate this empirically with controlled synthetic data and real-world datasets from stock markets and the short-message service Twitter.

The main idea is to synchronize the local models to their mean model in order to reduce their variance, but to do so only in system states that show a high divergence among the models. This divergence, measured by the average model distance to the mean model, indicates the synchronizations that are most important in terms of their correcting effect on the predictions. In stable phases this idea allows communicative quiescence, while in hard phases, where variance reduction is crucial, the protocol will trigger many model synchronizations. In order to efficiently implement this strategy, one has to monitor the non-linear divergence function without communication overhead. We solved this problem by using the LIFT primitive of local safe zones in the function domain.
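A minimal sketch of this protocol (squared-loss SGD learners and an illustrative threshold, not the published implementation): each node checks a local distance condition against a shared reference model, which is a valid safe zone because the variance around the mean never exceeds the mean squared distance to any fixed point.

```python
import numpy as np

DELTA = 0.25  # divergence threshold (illustrative)

class OnlineLearner:
    """Each node checks the LOCAL condition ||w_i - r||^2 <= DELTA against
    the reference model r (the mean at the last sync). All nodes being
    silent implies the true model divergence is at most DELTA."""

    def __init__(self, r):
        self.w = r.copy()   # local linear model
        self.r = r.copy()   # shared reference model

    def sgd_step(self, x, y, lr=0.1):
        self.w -= lr * (self.w @ x - y) * x  # squared-loss gradient step
        return float(np.sum((self.w - self.r) ** 2)) <= DELTA  # safe?

def synchronise(learners):
    mean = np.mean([l.w for l in learners], axis=0)  # variance reduction
    for l in learners:
        l.w, l.r = mean.copy(), mean.copy()

rng = np.random.default_rng(0)
learners = [OnlineLearner(np.zeros(3)) for _ in range(4)]
syncs = 0
for _ in range(100):
    safe = [l.sgd_step(rng.normal(size=3), rng.normal()) for l in learners]
    if not all(safe):    # a local violation triggers synchronization
        synchronise(learners)
        syncs += 1
print(f"{syncs} synchronizations in 100 rounds")
```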

M. Nanni, A. Monreale, R. Trasarti, V. Grossi and D. Pedreschi. Distributed monitoring of cluster quality for car insurance customer segmentation. Technical Report TR-13-11, Department of Computer Science, University of Pisa, 2013. Under preparation for submission.

We introduce the problem of monitoring the quality of a clustering in a distributed and dynamic environment, where each element to be clustered represents a node of the system, with the ability to communicate with a central site. By applying the LIFT primitives, and in particular safe zones enhanced by predictive models, we modeled the problem as non-linear function monitoring in order to provide each node with a safe zone – and therefore a local condition to check – that allows communications towards the central site to be largely reduced. Local violations are then treated at two different levels: first with a node-balancing procedure at the level of a single cluster and, if that fails, with a cross-cluster balancing procedure. The general cluster-monitoring problem has been instantiated in a specific application (see Scenario 1 in the applications section), where it proved to significantly reduce communications; in addition, the work in “A. Monreale et al., Privacy in Distributed Monitoring” studied privacy protection measures for this kind of application (see the next section on privacy).

 

Main results of the work on privacy

The work extends the LIFT theoretical work on safe zones and sketches towards the integration of privacy measures and techniques, as well as applying privacy-preserving techniques in the context of the LIFT application scenarios. Methods based on differential privacy, and on randomization and sketching, have been developed and applied in the LIFT application scenarios. Here, scenario 1, “Real-Time Analysis of User-Mobility Data”, is of main interest, because it addresses the monitoring of the personal mobility of humans, whereas the other scenarios deal only with non-personal data in large-scale infrastructures (internet scale, sensor-network scale, cloud and cluster scale). The first paper elaborates the fundamental interaction between the LIFT safe-zone approach and the concept of differential privacy.

A. Friedman, I. Sharfman, D. Keren and A. Schuster. Privacy Preserving Distributed Stream Monitoring. NDSS 2014.

Applications such as sensor network monitoring, distributed intrusion detection, and real-time analysis of financial data necessitate the processing of distributed data streams on the fly. Monitoring queries constitute a significant portion of the tasks carried out over distributed streams. While efficient data processing algorithms enable such applications, they require access to large amounts of often personal information, and could consequently create privacy risks. To mitigate such risks, we study the application of differential privacy to distributed stream monitoring. Under differential privacy constraints, each exchange of information incurs only a small and controlled loss of privacy. However, in continuous settings, where information exchange is ongoing, small losses of privacy accumulate. To limit this loss, the data exchange must be stopped at some point. We address this problem by adapting efficient communication techniques to privacy-preserving stream monitoring over a distributed system. We study the relationship between efficient communication and privacy loss, and demonstrate that for given privacy constraints, our approach allows the system to be monitored over periods that are three orders of magnitude longer than would be possible with a naive approach.

In the setup we consider, we assume that data arrives at fixed time intervals, referred to as rounds, where at each round a new data item is received at each node. The goal of the distributed monitoring algorithm is to verify at each round a given property over the union of the stream prefixes. For privacy protection we rely on the notion of differential privacy, which requires that the probability distribution of the results of the computation be only marginally affected by each input record. In differential privacy, each information exchange that is derived from data on individuals incurs a cost in privacy. With any new information exchange, the cost accumulates. To restrict privacy leaks, when the accumulated cost grows beyond a pre-determined threshold (a privacy budget), information exchange should be stopped. Theoretical infeasibility results suggest that these constraints cannot be circumvented. However, the lifetime of a stream monitoring system can be greatly extended without violating the privacy constraints. We address this important challenge by combining two powerful ideas. The first idea is to transform the global monitored condition into local constraints that can be monitored independently by each of the nodes in the system. The local constraints are constructed such that as long as none of them has been breached, the global condition is maintained. This allows node synchronization to be reduced, resulting in many silent (communication-free) rounds. The second idea is to leverage the reduction in communication costs towards fewer privacy leaks by applying privacy protection to a series of silent rounds simultaneously, thereby improving the privacy-accuracy trade-off provided by the system.
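The arithmetic behind the lifetime gain can be sketched as follows (the numbers are illustrative; the actual budget accounting in the paper is more refined): with a fixed total budget, lifetime scales inversely with how often information is actually exchanged.

```python
# Back-of-the-envelope sketch of why silent rounds extend system lifetime
# under a fixed differential-privacy budget. Numbers are illustrative only.

TOTAL_BUDGET = 1.0        # overall epsilon the system may spend
EPS_PER_EXCHANGE = 1e-3   # privacy cost of one noisy information exchange

def lifetime_rounds(violation_rate):
    """Rounds until the budget is exhausted when only violation rounds
    (safe-zone breaches) trigger an exchange; a run of silent rounds is
    certified in bulk rather than paying per round."""
    return TOTAL_BUDGET / (EPS_PER_EXCHANGE * violation_rate)

print(lifetime_rounds(1.0))     # naive: exchange every round  -> 1,000
print(lifetime_rounds(0.001))   # 1 breach per 1,000 rounds -> 1,000,000
```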

A. Monreale, M. Nanni, V. Grossi, R. Trasarti, D. Pedreschi. Privacy in Distributed Monitoring. Technical Report TR-13-16, Department of Computer Science, University of Pisa, 2013. Under preparation for submission.

In this work we tackle the privacy issues that can emerge from applications dealing with cluster monitoring in dynamic, distributed environments (see “Nanni et al., Distributed monitoring of cluster quality for car insurance customer segmentation”, already introduced in the previous section on contributions to theory). In particular, we propose a solution based on additive randomization that exploits results in the literature to bound the possible reconstruction of perturbed data by an adversary. We test the proposed privacy-preserving framework in a real-world application for the monitoring of customer mobility behaviors in the context of car insurance (see Scenario 1 in the applications section). In our experiments on real-world data coming from GPS devices of private cars, we show that our privacy-preserving framework provides acceptable results in terms of amount of communication, privacy protection, and quality of the global function to be monitored.
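A minimal sketch of the additive-randomization principle (the noise distribution and scale are illustrative choices): each node perturbs its value before transmission, while the coordinator’s aggregate stays accurate because the zero-mean noise averages out.

```python
import numpy as np

rng = np.random.default_rng(42)
SIGMA = 5.0  # noise scale: larger -> stronger masking, noisier aggregate

def perturb(value):
    """Additive randomization: mask each transmitted value with zero-mean
    noise; known reconstruction bounds limit what an adversary recovers
    from any single perturbed transmission."""
    return value + rng.normal(0.0, SIGMA)

true_values = rng.uniform(0, 100, size=10_000)   # one value per node
received = np.array([perturb(v) for v in true_values])

# Individual values are masked, but the global aggregate remains accurate:
print(abs(received.mean() - true_values.mean()))  # small, ~SIGMA/sqrt(n)
```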

F. Pratesi, A. Monreale, W. H. Wang, S. Rinzivillo, D. Pedreschi, G. Andrienko, and N. Andrienko. Privacy-Aware Distributed Mobility Data Analytics. 21st Italian Symposium on Advanced Database Systems, SEBD 2013.

This paper presents a framework that integrates LIFT sketching techniques with differential privacy in analytical processing within a distributed setting, and tackles the problem of obtaining aggregate information about vehicle traffic in a city from movement data collected by individual vehicles and shipped to a central server. Movement data are sensitive because they may describe typical movement behaviors and therefore be used for the re-identification of individuals in a database. We provide a privacy-preserving framework for movement data aggregation based on trajectory generalization in a distributed environment. The proposed solution, based on the differential privacy model and on sketching techniques for efficient data compression, provides a formal data protection safeguard. Using real-life data, we demonstrate the effectiveness of our approach also in terms of the data utility preserved by the data transformation.

We consider a LIFT system sending frequency information about individuals’ mobility to a central coordinator. In this case a local frequency vector may in some contexts contain sensitive information describing an individual’s activity. For example, in a scenario where the coordinator is responsible for computing the aggregation of movement data over a territory by combining the information received from each node, the local frequency vector could describe the mobility behavior of each node: an attacker could learn a driver’s most frequent move, which can be very sensitive because such moves usually correspond to the user’s trips between home and workplace. In our setting, we assume that each node in our system is secure; in other words, we do not consider attacks at the node level. We also assume that the coordinator is untrusted. Therefore, we focus on designing privacy-preserving techniques to defend against an untrusted coordinator. In particular, our goal is to compute a distributed aggregation of frequency distribution vectors while preserving privacy. Here, to guarantee privacy we need to hide the real count of each position in the local frequency vector.

We developed different privacy-preserving solutions based on differential privacy, which is a strong privacy model independent of the background knowledge of an adversary. Each of our solutions is characterized by a different trade-off between privacy and data utility.

The design of a solution requires defining the computation performed by each node and by the coordinator, respectively. The node computation mainly involves transforming the data to achieve the desired privacy guarantee. We present three privacy-preserving data transformation approaches. The first one, named UniversalNoise, is based on classical differential privacy. It can provide a strong privacy guarantee but a high loss of data utility, due to the generation of negative flows and noise of very high magnitude. These two issues are managed in the second solution, named BoundedNoise, by relaxing the privacy guarantee to (eps, delta)-differential privacy, where delta bounds the additional privacy loss. We showed that: (1) the BoundedNoise approach can improve data utility significantly, and (2) in some cases, the BoundedNoise approach may provide a low level of guaranteed privacy in practice; indeed, we can show that sometimes the privacy loss can be high. As a consequence, we propose a third solution named BalancedNoise that tries to keep the balance between privacy and utility under control by setting appropriate values of eps and delta. The mechanism allows the nodes to specify the level of privacy and the maximum privacy loss, and finds the best solution capable of minimizing the noise magnitude and the possible negative flows, so that it can achieve good utility. Besides the design of the privacy-preserving data transformation methods, we also design sketch approaches to reduce the communication between the nodes and the coordinator.
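The contrast between the first two variants can be sketched as follows (a simplified stand-in for the paper’s mechanisms; the sensitivity, scales and truncation rule are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def universal_noise(freq, eps):
    """UniversalNoise-style: classical eps-DP Laplace mechanism with noise
    scale 1/eps per count (sensitivity 1). Strong guarantee, but possibly
    very large noise and negative flows."""
    return freq + rng.laplace(0.0, 1.0 / eps, size=freq.shape)

def bounded_noise(freq, eps, bound):
    """BoundedNoise-style: truncate the noise to [-bound, bound], trading
    the pure eps-guarantee for (eps, delta)-DP, with delta related to the
    truncated tail mass; far better utility, weaker formal protection."""
    noise = np.clip(rng.laplace(0.0, 1.0 / eps, size=freq.shape),
                    -bound, bound)
    return freq + noise

freq = np.array([12.0, 0.0, 3.0, 41.0])       # local frequency vector
print(universal_noise(freq, eps=0.1))         # heavy noise, may go negative
print(bounded_noise(freq, eps=0.1, bound=5))  # distortion at most 5
```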

A. Monreale, W. H. Wang, F. Pratesi, S. Rinzivillo, D. Pedreschi, G. Andrienko, and N. Andrienko. Privacy-preserving Distributed Movement Data Aggregation. AGILE 2013.

This paper develops a novel approach to privacy-preserving analytical processing within a distributed setting, and tackles the problem of obtaining aggregated information about vehicle traffic in a city from movement data collected by individual vehicles and shipped to a central server. Movement data are sensitive because people’s whereabouts have the potential to reveal intimate personal traits, such as religious or sexual preferences, and may allow the re-identification of individuals in a database. We provide a privacy-preserving framework for movement data aggregation based on trajectory generalization in a distributed environment. The proposed solution, based on the differential privacy model and on sketching techniques for efficient data compression, provides a formal data protection safeguard. Using real-life data, we demonstrate the effectiveness of our approach also in terms of the data utility preserved by the data transformation.

Understanding human mobility behavior in a city is important for improving the use of city space and the accessibility of various places and utilities, managing the traffic network, and reducing traffic jams. Generalization and aggregation of individual movement data can provide an overall description of traffic flows in a given time interval and their variation over time, without revealing the identity of individuals. However, known methods for the generalization and aggregation of movement data require having all individual data at a central station. This centralized setting entails two important problems: (a) the amount of information to be collected and processed may exceed the capacity of the storage and computational resources, and (b) the raw data describe the mobility behavior of the individuals in such detail that very sensitive information related to the personal private sphere could be inferred.

The LIFT approach, however, is to pre-process relevant information locally on the sensor devices and to send only aggregated, reduced information to a central coordinator, resulting in a privacy-preserving distributed computation framework for the aggregation of movement data. We assume that on-board location devices in vehicles continuously trace the positions of the vehicles and can periodically send derived information about their movements to a central station, which stores it. The vehicles provide a statistical sample of the whole population, so that the information can be used to compute a summary of the traffic conditions over the whole territory. To protect individual privacy, we apply a data transformation method based on the well-known differential privacy model. To reduce the amount of information that each vehicle transmits to the central station, we apply sketch techniques to the differentially private data to obtain a compressed representation. The central station, which we call the coordinator, is able to reconstruct the movement data represented by the sketches; although transformed to guarantee privacy, these data preserve important properties of the original data that make them useful for mobility analysis.
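As an illustration of the compression step, the sketch below uses a Count-Min sketch – one standard linear sketch, chosen here for concreteness with made-up sizes – so that each vehicle ships a small fixed-size table instead of one count per spatial cell.

```python
import numpy as np

class CountMinSketch:
    """Illustrative sketch for compressing a frequency vector before
    transmission: d independent hash rows of width w; estimates can only
    overcount, never undercount."""

    def __init__(self, w=64, d=4, seed=0):
        self.w, self.d = w, d
        self.table = np.zeros((d, w))
        self.seeds = np.random.default_rng(seed).integers(1 << 30, size=d)

    def _h(self, i, key):
        return hash((int(self.seeds[i]), key)) % self.w

    def add(self, key, count=1.0):
        for i in range(self.d):
            self.table[i, self._h(i, key)] += count

    def estimate(self, key):
        return min(self.table[i, self._h(i, key)] for i in range(self.d))

# Node side: add the (already privacy-transformed) cell counts, then ship
# the d*w table to the coordinator instead of the full frequency vector.
sk = CountMinSketch()
sk.add("cell_17", 12)
sk.add("cell_3", 5)
print(sk.estimate("cell_17"))  # ~12 (exact unless all rows collide)
```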

C. Kopp, M. Mock, and M. May. Privacy-preserving distributed monitoring of visit quantities. Proc. of the 20th International Conference on Advances in Geographic Information Systems (ACM GIS 2012). ACM, 2012.

The organization and planning of services (e.g. shopping facilities, infrastructure) requires up-to-date knowledge about the usage behavior of customers. Especially quantitative information about the number of customers and their frequency of visiting is important. In this paper we present a framework which enables the collection of quantitative visit information for arbitrary sets of locations in a distributed and privacy-preserving way. While trajectory analysis is typically performed on a central database, requiring the transmission of sensitive personal movement information, the main principle of our LIFT approach is the local processing of movement data. Only aggregated statistics are transmitted anonymously to a central coordinator, which generates the global statistics. In this paper we present our approach, including the methodical background that enables distributed data processing as well as the architecture of the framework. We further discuss our approach with respect to potential privacy attacks as well as its application in practice. We have implemented the local processing mechanism on an Android mobile phone in order to ensure the feasibility of our approach.

The users’ privacy is protected using a series of privacy mechanisms, which are applied in a hierarchical order. The first key element of our approach is the local evaluation of movement trajectories, so that the original mobility information does not leave the mobile device. This approach is described in more detail in the section on the mobility application scenario. It locally evaluates pattern occurrences and aggregates their number over time. Due to the temporal aggregation (which may be complemented by a spatial aggregation based on location typification), the current position of a user is not revealed. The resulting pattern count statistics are communicated to the coordinator. They do not contain personal identifiers, and web anonymization and encryption techniques are used to prevent different messages originating from the same sender from being related to each other. Thus an adversary cannot reconstruct mobility profiles over several messages.

Although messages do not contain any person-related information, an attacker might try to use the pattern information as a quasi-identifier. For example, a location set containing the “White House” would reveal some of the mobility behavior of a very restricted set of persons. In addition, identification may be possible if a user has very extreme visit counts which only very few people may reach, such as an animal keeper in a zoo. Therefore, to guarantee privacy protection and avoid inferences about the user’s mobility behavior, we employ k-anonymity as the second key element of our approach. In order to establish k-anonymity, an additional, semi-trusted coordinator is added to the system. This coordinator works as an anonymizer. The basic idea is to separate the semantic information of a pattern from its pattern count distribution during the (necessarily centralized) anonymization step. This means that the nodes submit only pattern ids to the anonymizer, without any attached semantics. Thus the anonymizer does not know which pattern an id represents and cannot draw inferences from possibly infringing patterns and pattern counts. The anonymizer applies an algorithm for assuring k-anonymity to the pattern count distribution and sends the privatized result to the global coordinator. The global coordinator finally links the privatized pattern count distributions with the semantic information of the patterns. As a result, all pattern count distributions adhere to k-anonymity without revealing any private information during the anonymization step. Note, however, that privacy can only be protected if the coordinator and the anonymizer do not cooperate.
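The anonymizer’s role can be sketched as follows (a simplified, suppression-based stand-in for the paper’s k-anonymity algorithm; the threshold and data layout are illustrative): it sees only opaque pattern ids with per-user counts and releases only patterns backed by at least k users.

```python
from collections import defaultdict

K = 5  # anonymity level (illustrative)

def anonymize(reports):
    """Sketch of the anonymizer: it receives (pattern_id, count) pairs,
    one per user, with NO pattern semantics attached, and releases only
    count distributions backed by at least K users; rarer patterns are
    suppressed so they cannot act as quasi-identifiers."""
    per_pattern = defaultdict(list)
    for pattern_id, count in reports:
        per_pattern[pattern_id].append(count)
    return {pid: counts for pid, counts in per_pattern.items()
            if len(counts) >= K}

reports = [("p1", 3)] * 7 + [("p2", 40)] * 2  # p2 has only 2 users
print(anonymize(reports))  # only p1 survives; p2 is suppressed
```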

M. Kamp, C. Kopp, M. Mock, M. Boley, and M. May. Privacy-Preserving Mobility Monitoring using Sketches of Stationary Sensor Nodes. ECML/PKDD, 2013.

Two fundamental tasks of mobility modeling are (1) to track the number of distinct persons that are present at a location of interest and (2) to reconstruct flows of persons between two or more different locations. Stationary sensors, such as Bluetooth scanners, have been applied to both tasks with remarkable success. However, this approach has privacy problems. For instance, Bluetooth scanners store the MAC address of a device, which can in principle be linked to a single person. Unique hashing of the address only partially solves the problem, because such a pseudonym is still vulnerable to various linking attacks. In this paper we propose a solution to both tasks using an extension of linear counting sketches. The idea is to map several individuals to the same position in a sketch, while at the same time the inaccuracies introduced by this overloading are compensated by using several independent sketches. This idea provides, for the first time, a general set of primitives for privacy-preserving mobility modeling from Bluetooth and similar address-based devices.

Today’s sensor technologies, such as GPS, RFID, GSM, and Bluetooth, have revolutionized data collection in this area, although significant problems remain to be solved. One of these problems is privacy: while the count of groups of people may be inferred, inference about an individual person must remain infeasible. Directly tracing IDs through the sensors by a central coordinator violates this privacy constraint, because the amount of information stored allows for linking attacks. In such an attack, sensor information is linked to additional knowledge in order to identify a person and infer upon her movement behavior. Hence, application designers have to design and use new, privacy-preserving methods.

The contribution of this paper is to provide a general set of primitives for privacy-preserving mobility monitoring and modeling using stationary sensor devices. Following the LIFT approach, information is processed locally on the sensor device, using sketch-based techniques that store just enough information to perform the desired inference task and discard the rest. Thereby, privacy constraints are much easier to enforce. We show how the coordinator can reconstruct, based on the sketches of individual sensors, unions of sensor ranges, thus providing the ability to monitor crowds with overlapping stationary sensors, and intersections of sensor readings, thus providing the ability to monitor flow from one sensor to another, without the need to store or compare any individual ID.

Technically, the method we propose is based on Linear Counting sketches, a data structure that allows the number of distinct items to be counted probabilistically in the presence of duplicates. Linear Counting not only obfuscates the individual entities by hashing, but furthermore provides a probabilistic form of k-anonymity. This form of anonymity guarantees that, even with access to all stored information, an attacker cannot be certain about a single individual but can at most infer upon k individuals as a group. Furthermore, Linear Counting is an efficient and easy-to-implement method that outperforms other approaches in terms of accuracy and privacy, at the cost of higher memory usage.
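A compact sketch of the Linear Counting primitive and the union/intersection reconstruction (the bitmap size is illustrative; the estimator is the standard −M·ln(V) formula, with V the fraction of zero bits):

```python
import numpy as np

M = 1024  # bitmap size; the load factor t = n/M controls expected anonymity

def lc_sketch(addresses):
    """Linear Counting: hash each observed device address to one bit.
    Several addresses share a bit (collisions) - exactly the overloading
    that yields the 'expected k-anonymity' discussed in the paper."""
    bits = np.zeros(M, dtype=bool)
    for a in addresses:
        bits[hash(a) % M] = True
    return bits

def lc_estimate(bits):
    """Distinct-count estimate n ~ -M * ln(V), V = fraction of zero bits."""
    v = 1.0 - bits.mean()
    return -M * np.log(v)

a = lc_sketch(f"device{i}" for i in range(300))       # sensor A
b = lc_sketch(f"device{i}" for i in range(200, 500))  # sensor B (overlap)
print(lc_estimate(a))                      # ~300 distinct devices at A
print(lc_estimate(np.logical_or(a, b)))    # union of ranges: ~500
# Flow between sensors via inclusion-exclusion: |A ^ B| = |A|+|B|-|A u B|
print(lc_estimate(a) + lc_estimate(b) - lc_estimate(np.logical_or(a, b)))
```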

The main threat to privacy in the presented application scenarios is the so-called linking attack, i.e., an attacker infiltrates or takes over the monitoring system and links this knowledge to background information in order to draw novel conclusions. For example, in a standard monitoring system that distributes the sensor readings, i.e., the device addresses, an attacker who knows the device address of a certain person as background knowledge, and furthermore infiltrates the monitoring system, is able to track this person throughout the monitored area.

Sketching prevents these linking attacks in two ways: obfuscation and k-anonymity. Obfuscation is accomplished by hashing the device address to sketch positions. Hence, before an attacker is able to re-identify a device, she has to infer the employed hash function. However, this very basic obfuscation technique can be defeated using statistical analysis of sensor readings. The second anonymization technique is accomplished by the natural property of sketches to compress the address space, implying collisions of addresses when mapped to sketch positions. Whereas these collisions entail a loss in accuracy, they create a form of anonymity, because an attacker can only infer upon the set of devices whose addresses are all mapped to that very same sketch position. Formally, a monitoring system guarantees k-anonymity if every monitored entity is indistinguishable from at least k other entities. Using linear counting sketches with a load factor t results in t collisions per bucket in expectation. Hence, the expected level of anonymity is t. However, because the monitoring system is a probabilistic setup, the number of collisions cannot be guaranteed. We denote this form of anonymity expected k-anonymity.

To show the practical applicability of the LIFT methodology, the complete approach has been fully implemented, strictly according to the LIFT reference architecture, on embedded sensors with low-overhead, low-performance hardware (BeagleBoard); see “A. Burmeister. Implementierung und Evaluierung eines Sketch-basierten, verteilten Monitoring Systems. Bachelor Thesis, University of Magdeburg, 2013.”

 

Main results of the work on architecture and applications

One objective of LIFT in year two was the design of a reference architecture, which would provide a general concept and design patterns for the implementation of the application scenarios. To this end, we performed a detailed architectural requirements analysis of each application scenario. Subsequently, the basic components and communication patterns were extracted and the interfaces for integrating LIFT technology were defined. Implementations according to the reference architecture have been made in several application scenarios, showing the practicability of the defined structures.

A major LIFT objective is exploring and exploiting the local data synopses elaborated in the more theoretical work as tools for effectively tracking massively-distributed streaming data using the geometric and safe-zone methods, and dealing with concrete examples of monitored functions (e.g., the median) arising in these settings. The application scenarios have exemplified:

• practical considerations for solving the safe zone problem (hierarchical clustering and geometric considerations);

• application of the geometric-monitoring approach to the important problem of outlier detection in sensor networks and the monitoring of internet queries;

• utilizing local prediction models as mechanisms for tracking the temporal dynamics of local data streams and further reducing communication.

In detail, results have been achieved in the four application scenarios considered:

Scenario 1: Real-Time Analysis of User-Mobility Data

As privacy-related issues play an important role in this application scenario, most of the results have already been addressed in the discussion of LIFT results on privacy above. Here we add only those results which are primarily concerned with large-scale scalability and communication efficiency.

M. Nanni, R. Trasarti, G. Rossetti, and D. Pedreschi. Efficient distributed computation of human mobility aggregates through user mobility profiles. In Proc. of the ACM SIGKDD International Workshop on Urban Computing (UrbComp 2012), pages 87-94. ACM, 2012.

A basic task of urban mobility management is the real-time monitoring of traffic within key areas of the territory, such as main entrances to the city, important attractors and possible bottlenecks. Some of them are well-known areas, while others can appear, disappear or simply change during the year, or even during the week, due for instance to road works, accidents and special events (strikes, demonstrations, concerts, new toll road fares). Especially in the latter cases, it would be useful to have a traffic monitoring system able to dynamically adapt to reference areas specified by the user.

In this work we propose and study a solution exploiting on-board location devices in private cars that continuously trace the position of the vehicle and periodically communicate it to a central station. Such vehicles provide a statistical sample of the whole population, and can therefore be used to compute a summary of the traffic conditions for the mobility manager. However, the large mass of information to be transmitted and processed might be too much for a real-time monitoring system, the main problem being the systematic communication from each vehicle to a unique, centralized coordinator station. We tackle the problem by adopting the general LIFT view of distributed systems for the computation of a global function, which consists in minimizing the amount of information communicated through a careful coordination of the single nodes (vehicles) of the system. Our approach involves the use of predictive models that allow the central station to guess (in most cases, and within some given error threshold) the location of the monitored vehicles, and then to estimate the density of key areas without communication with the nodes.

This solution strictly follows the ideas behind LIFT Safe Zones, and therefore assumes that most objects are static, or spend most of their time around some specific points in space, such as the home or work location. The basic idea, then, is to define a default location for each object v; as long as no update arrives at the central coordinator, it assumes that v is at its default location.

More concretely, through analysis of historical data each node can be assigned an optimal location that is used as its default position; the controller then computes densities assuming that each node lies in its default position. Each node is assigned a geographical area such that, as long as it moves within that area, the value computed by the controller remains a good approximation. When the node moves outside its given area, it communicates the location update to the controller, which will use it to compute a correct estimation.

However, if the context of mobility is characterized by massive and rapid changes in the data, with highly dynamic locations, this approach is inadequate. For this reason, we turn to the local prediction method for safe zones, which (in principle) better fits this scenario. The basic assumption behind this approach is that the objects are not necessarily static, yet their movements are relatively slow. As a consequence, when an object visits a given location, its associated region (see the description of static Safe Zones above) will most likely contain several of the next locations of the object, yet no single location is able to capture a large part of the object's mobility. The protocol works as for static Safe Zones, but when an update must be communicated, the node is assigned a new default location and a corresponding geographical area, computed around its most recent measured location. Recomputing a new region (essentially, a new Safe Zone) is made possible and easy by the linearity of the global function to monitor (a sum of contributions), which makes it possible to modify the Safe Zone of one node without compromising those of the other objects.
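
The following minimal sketch (our own illustration; the class, parameter names and radius are hypothetical) captures the resulting protocol: a node stays silent while inside its geographical area, and on a violation it reports its position and, following the local-prediction variant, recentres its region there:

```python
import math

# Illustrative sketch of the safe-zone protocol for vehicles: silent while
# inside a circular region around the default location; on leaving it,
# report the position and recentre the region (local-prediction variant).

class VehicleNode:
    def __init__(self, default_location, radius):
        self.default = default_location   # coordinator assumes this position
        self.radius = radius              # the node's "safe" geographical area

    def on_gps_update(self, pos):
        """Return an update for the coordinator, or None (no communication)."""
        if math.dist(pos, self.default) <= self.radius:
            return None                   # inside the safe zone: stay silent
        self.default = pos                # recentre the region on the new fix
        return pos                        # violation: report new location

node = VehicleNode(default_location=(0.0, 0.0), radius=1.0)
for p in [(0.2, 0.1), (0.5, 0.8), (2.0, 2.0), (2.1, 2.2)]:
    msg = node.on_gps_update(p)
    print(p, "->", "silent" if msg is None else f"report {msg}")
```

Because the coordinator's global density is a sum of per-node contributions, recentring one node's region leaves all other nodes' Safe Zones untouched.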

M. Nanni, A. Monreale, R. Trasarti, V. Grossi and D. Pedreschi. Distributed monitoring of cluster quality for car insurance customer segmentation. Technical Report: TR-13-11, Department of Computer Science, University of Pisa, 2013. Under preparation for submission.

In this work, already mentioned among the contributions to theory for the introduction of a general cluster monitoring method, we developed a full application of cluster monitoring for the extraction and quality monitoring of driving profiles of mobile users. This application is meant to support car insurance companies in the management and profiling of their customers, a task that requires performing a customer segmentation based on features that characterize how a customer drives, and therefore on his/her mobility.

We describe a methodology to extract several indicators characterizing the driving profile of customers, and provide a clustering-oriented instantiation of the segmentation problem based on such indicators. Then, we consider the availability of a continuous flow of fresh mobility data sent by the circulating vehicles, aiming at keeping our segments constantly up-to-date. We tackle a major scalability issue that emerges in this context when the number of customers is large, namely the communication bottleneck, by applying a LIFT-based solution, which reduces the communication between vehicles and company servers to the essential minimum. Finally, we validate the framework on a large database of real mobility data coming from GPS devices of private cars.

C. Florescu, M. Mock, C. Körner, and M. May. Efficient Mobility Pattern Stream Matching on Mobile Devices. Proc. of the Workshop on Ubiquitous Data Mining (UDM 2012), pages 23-27, 2012.

The increasing number of mobile phones that are equipped with localization technology offers a great opportunity for the collection of mobility data, which can be used for detecting mobility patterns. Matching mobility patterns in streams of spatio-temporal events implies a trade-off between efficiency and pattern complexity. Existing work deals either with patterns of low expressiveness, which can be evaluated efficiently, or with very complex patterns on powerful machines. We propose an approach which resolves this trade-off and is able to match flexible and sufficiently complex patterns while delivering good performance on a resource-constrained mobile device. The supported patterns include full regular expressions as well as relative and absolute time constraints. We present the definition of our pattern language, and the implementation and performance evaluation of the pattern matching on a mobile device, using the LIFT architecture's hierarchy of filters that continuously processes the GPS input stream.

Our mobility model is based on counts of occurrences of events, where an event represents the occurrence of a specific predefined spatio-temporal behavior in the observed GPS track. The local mobility model represents the behavior of a specific user and is computed locally on the device itself, whereas the global model is built by aggregating all local models on a single node (the global coordinator). LIFT technology is used to reduce the amount of communication needed for keeping the global model correct over time. The basic approach is to define a so-called Safe Zone, in which the local model can safely vary without notifying the global coordinator. In this paper, we focus on the question whether the input for generating the local model can be computed efficiently on a mobile device. Being able to compute a model locally is a prerequisite for applying LIFT technology for communication reduction. The local mobility model is computed by processing the stream of GPS updates, as provided by an Android Location Provider, through a hierarchy of filters as defined in the LIFT architecture.
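
A toy sketch of this filter-hierarchy idea (with a hypothetical event alphabet and pattern; the real system supports far richer patterns and time constraints) might look as follows:

```python
import re

# Toy sketch of the filter hierarchy (event names, zones and the pattern
# are hypothetical): raw GPS fixes are first mapped to discrete
# spatio-temporal events, and a regular expression is then matched over
# the resulting event stream.

ZONES = {"H": ((0, 0), 1.0), "W": ((5, 5), 1.0)}   # home / work, toy geometry

def gps_to_event(pos):
    """First filter stage: map a GPS fix to a zone symbol ('o' elsewhere)."""
    for sym, (center, r) in ZONES.items():
        if (pos[0] - center[0])**2 + (pos[1] - center[1])**2 <= r * r:
            return sym
    return "o"   # outside any zone of interest

# Second stage: match a mobility pattern, e.g. "home, anything, then work".
PATTERN = re.compile(r"Ho*W")

track = [(0.1, 0.2), (1.5, 2.0), (3.0, 4.0), (5.1, 4.9)]
events = "".join(gps_to_event(p) for p in track)
print(events, "->", "match" if PATTERN.search(events) else "no match")
```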

In our experiments we have shown that the detection of state-of-the-art complex mobility patterns can be implemented in a resource-constrained environment such as a mobile device. Our experiments show that the pattern matching can process 800,000 locations and up to 10,000 complex patterns in much less than one second. For handling more locations or more patterns, measures can be taken to reduce the number of GPS position updates, by configuring the Android Location Provider appropriately or by adding intermediate GPS smoothing filters. For example, for a frequency of one position update request per 5 seconds, our application can efficiently scale up to at least 800,000 locations and over 300,000 complex patterns.

A combination of the local pattern detection model integrated with ECM-Sketches (see O. Papapetrou, M. Garofalakis, and A. Deligiannakis, "Sketch-based Querying of Distributed Sliding-Window Data Streams", in application scenario 2) has been demonstrated at the LIFT 2nd year review.

C. Kopp, M. Mock, O. Papapetrou and M. May. Large-scale Online Mobility Monitoring with Exponential Histograms. First International Workshop on Big Dynamic Distributed Data (BD3), held at the 39th International Conference on Very Large Data Bases, VLDB 2013.

The spread of digital signage and the instantaneous adaptability of its content challenge out-of-home advertising to conduct performance evaluations in an online fashion. This implies a tremendous increase in the granularity of evaluations, as well as a completely new way of data collection, storage and analysis. In this paper we propose a distributed system for the large-scale online monitoring of poster performance indicators based on the evaluation of mobility data collected by smartphones. In order to enable scalability in the order of millions of users and locations, we use a local data processing paradigm and apply exponential histograms for an efficient storage of visit statistics over sliding windows. In addition to an immediate event centralization, we also explore a hierarchical variant of the LIFT architecture based on a merging technique for exponential histograms. We provide an evaluation on the basis of a real-world data set containing more than 300 million GPS points corresponding to the movement activity of nearly 3,000 persons. The experiments show the accuracy and efficiency of our system.

In this paper we propose a distributed system for the large-scale online monitoring of poster performance indicators based on the evaluation of mobility data collected by smartphones. We want to be able to perform online queries which obtain performance measures for the recent past in a sliding-window style, allowing the online monitoring of poster performance and thus the targeted placement of advertisement spots. The key component of our approach to handling massive streams of data is the use of exponential histograms for data compression. This data structure has the advantage that it offers sliding-window query capabilities with a guaranteed maximum relative error. In addition, exponential histograms can be applied in a distributed setting (see O. Papapetrou, M. Garofalakis, and A. Deligiannakis. "Sketch-based Querying of Distributed Sliding-Window Data Streams". In Proceedings of the 38th International Conference on Very Large Databases, 2012), thus allowing for scalability as the number of users increases.
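
For illustration, here is a compact toy version of the exponential-histogram idea of Datar et al. (our own sketch under simplified assumptions, not the system's implementation), counting the 1s among the most recent stream items with bounded relative error:

```python
# Toy exponential histogram: count the 1s among the last `window` stream
# items with relative error about `eps`, using O(log(window)/eps) buckets.

class ExpHistogram:
    def __init__(self, window, eps=0.1):
        self.window = window
        self.k = max(1, int(1 / eps))     # allow ~k/2 + 2 buckets per size
        self.buckets = []                 # (timestamp, size), newest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        if bit:
            self.buckets.insert(0, (self.t, 1))
            self._merge()
        # drop buckets whose newest element fell out of the window
        while self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()

    def _merge(self):
        limit, size = self.k // 2 + 2, 1
        while True:
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) <= limit:
                break
            i, j = same[-1], same[-2]     # the two oldest buckets of this size
            merged = (self.buckets[j][0], 2 * size)   # keep newer timestamp
            del self.buckets[i]
            self.buckets[j] = merged
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2   # half of the oldest bucket

eh = ExpHistogram(window=100, eps=0.1)
for _ in range(1000):
    eh.add(1)                             # all-ones stream
print("true count in window: 100, estimate:", eh.estimate())
```

Merging the two oldest buckets of each size keeps the number of buckets logarithmic, while subtracting half of the oldest (only partially in-window) bucket bounds the relative error.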

Our online system relies on an Android implementation that we have used in the work described above to detect visit patterns on mobile phones in a privacy-preserving manner. We collect and process the mobility data locally on the smartphones, and use exponential histograms on the coordinator and intermediate nodes for the efficient storage and querying of visit statistics in a sliding-window fashion. Our experiments on a multiplied real-world data set with nearly 3,000 persons show that the usage of exponential histograms results in an average error of less than 1/10 of the maximum acceptable error, while reducing the storage space to as little as 9.7% of the baseline storage space.

Scenario 2: Internet-Scale Query Processing (ISQP) Systems

The work presented here provides both fundamental theoretical progress applicable to all kinds of applications, as well as applications in large-scale Internet query processing.

O. Papapetrou, M. Garofalakis, and A. Deligiannakis. “Sketch-based Querying of Distributed Sliding-Window Data Streams”. In Proceedings of the 38th International Conference on Very Large Databases, 2012.

The sliding window model for continuous data streams allows users and applications to concentrate solely on recent streaming data by automatically "aging out" old arrivals in the stream. The ability to focus the streaming-data analysis only on recent information (e.g., events arriving in the last 30 minutes, or the last 10^9 events) is a critical requirement for many real-time applications, where "stale" data has little or no value. We show how to extend LIFT techniques to effectively handle sliding-window queries over massively-distributed data streams. To further reduce communication, as well as the computational and space requirements at each participating node, we propose novel sketching techniques that allow the summarization of streaming data over sliding windows, with probabilistic accuracy guarantees. We describe and analyze three variants of our sliding-window sketch structure, two based on deterministic algorithms for maintaining the sliding-window statistics and one relying on randomization, and demonstrate how each variant enables answering point queries as well as inner-product queries, with strong probabilistic guarantees. Our sketch structures find applicability to a variety of problems, such as maintaining frequency statistics over sliding windows, finding heavy hitters, and computing quantiles.

To enable the effective aggregation of distributed information in LIFT scenarios, we show how sketches of individual local streams can be composed to generate the summary of the order-preserving addition of all streams. To the best of our knowledge, our solution is the first to allow such aggregation over deterministic sliding-window algorithms, thereby drastically reducing the network and memory requirements compared to previous techniques. Our ongoing work in the area aims to apply the proposed sliding-window sketches to cover the entire LIFT workflow, including the sketch-based geometric monitoring of local constraints at each node and the effective resolution of constraint violations.
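
The composability this relies on can be illustrated with any linear sketch: the sketch of the union of two streams is simply the element-wise sum of the individual sketches. The toy example below shows this for a plain Count-Min sketch; the ECM-sketches of the paper additionally equip each counter with sliding-window state, which this sketch omits:

```python
import numpy as np

# Linear sketches compose by addition: the sketch of the combined stream
# of two nodes is the element-wise sum of their local sketches.
# Sizes and seeds below are arbitrary example values.

WIDTH, DEPTH, SEEDS = 2048, 5, [17, 23, 31, 47, 59]

def cm_insert(sketch, item, count=1):
    for row, seed in enumerate(SEEDS):
        sketch[row, hash((seed, item)) % WIDTH] += count

def cm_query(sketch, item):
    return min(sketch[row, hash((seed, item)) % WIDTH]
               for row, seed in enumerate(SEEDS))

node_a = np.zeros((DEPTH, WIDTH), dtype=np.int64)
node_b = np.zeros((DEPTH, WIDTH), dtype=np.int64)
for _ in range(300): cm_insert(node_a, "item42")
for _ in range(500): cm_insert(node_b, "item42")

merged = node_a + node_b          # sketch of the combined stream
print("combined frequency estimate:", cm_query(merged, "item42"))  # 800 here
```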

M. Garofalakis, D. Keren and V. Samoladas. Sketch-based Geometric Monitoring of Distributed Stream Queries. VLDB 2013.

Here the goal is to study three types of queries which are of crucial importance both on their own and as building blocks for other queries: full join (represented as the inner product between two distinct streams), self join (represented as the squared norm of a stream), and range query (represented as the inner product of a stream with a fixed vector). Our initial work on this problem, applying the previous geometric method based on "bounding spheres", did not perform that well, for the following reason: whenever a local violation of the condition occurs at a node, the node has to send its local data to the coordinator in order to attempt to balance this violation with the other local vectors. The coordinator then must request data from other nodes to achieve the desired balancing. However, the sketch vectors which constitute the local data are typically large (even a small sketch is built as a 5 x 800 matrix), and sending them eventually results in a large communication overhead. The competing "CG method" (based on the paper "Approximate Continuous Querying of Distributed Streams", by G. Cormode and M. Garofalakis, ACM TODS, 33(2), 2008) keeps "flushing" local changes from the nodes to the coordinator whenever a local condition is violated; while it yields more flushes, they are typically smaller, resulting in lower communication overhead. To solve this problem, we have introduced a new algorithm that monitors the sketch matrices in a reduced "norm space", which consists of the vector whose components are the norms of the sketch rows (hence its size is, e.g., 5, as opposed to 5 x 800). It can be proved that as long as there is no threshold crossing in this low-dimensional space, there is none in the sketch space either. This approach enabled us to achieve better results than the CG method. Further, to check how much room for improvement remains, we have compared the algorithm to an "oracle-based" paradigm, which makes the (unrealistic) assumption that the coordinator knows exactly when a global threshold crossing takes place, and only then polls the nodes in order to resolve the resulting violation. The new method performed very well, falling just barely short of the "oracle" in terms of communication overhead.
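
The following minimal sketch (our own illustration, with a randomly filled matrix standing in for a real stream sketch, and an arbitrary threshold) shows the reduction for a self-join query: for an AMS-style sketch, the self-join estimate is a function of the sketch row norms alone, so the threshold check can be carried out in the low-dimensional norm space:

```python
import numpy as np

# "Norm space" reduction, illustrated: instead of monitoring the full
# 5 x 800 sketch matrix, track only the 5-vector of its row norms. For a
# self-join (squared L2 norm) query, the estimate is the median of the
# squared row norms, so a threshold check on the row norms suffices.

rng = np.random.default_rng(0)
sketch = rng.normal(size=(5, 800))          # stand-in for a local sketch

norm_vec = np.linalg.norm(sketch, axis=1)   # reduced representation, size 5
self_join_estimate = np.median(norm_vec ** 2)

THRESHOLD = 900.0                            # arbitrary example threshold
print("estimate:", round(self_join_estimate, 1),
      "violation:", self_join_estimate > THRESHOLD)
```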

We use real-life data sets for our experiments. The first data set, WCup, was drawn from the Internet Traffic Archive and contains HTTP requests sent to the servers hosting the World Cup 1998 web site (totaling approximately 1.35 billion requests over a three-month period). The second data set, Cdad, comprises SNMP network usage data obtained from CRAWDAD (the Community Resource for Archiving Wireless Data at Dartmouth). It consists of measurements of total network traffic every five minutes over a four-month period at a large number of access points (approximately 200) inside a corporate research center (IBM Watson). We tracked the distribution of the size attribute from WCup and the shortRet attribute from Cdad, since both these attributes take a very large number of values, thus making streaming estimation challenging. The results presented measure the communication cost incurred by our methods. Compared to the baseline techniques of our earlier work (G. Cormode and M. Garofalakis. "Approximate Continuous Querying of Distributed Streams". ACM TODS, 33(2), 2008), we achieved a communication reduction of up to 35%.

O. Papapetrou and M. Garofalakis. Continuous Fragmented Skylines. IEEE ICDE 2014.

Algorithms for efficiently constructing and maintaining skylines (i.e., the set of Pareto-optimal multi-dimensional objects) in distributed systems have been widely studied in recent years. All proposed algorithms, however, require that there exists a single point of reference for each object - a single node - where the vector corresponding to the object is maintained. For a number of real-world applications, such as maintaining the skyline of cities with the most extreme weather situations (temperature, humidity, wind, etc.), this is an unrealistic constraint, since the values of each object, i.e., the weather statistics for each city, are typically determined by aggregating values over a large number of sensors/stations dispersed throughout the city. We term this new, challenging type of distributed skyline queries fragmented skylines. We give the first algorithmic results on the important problem of monitoring continuous fragmented skyline queries. Our proposed monitoring algorithms rely on intelligently decomposing the fragmented skyline problem into a set of threshold-crossing queries, which are subsequently monitored at each of the participating nodes by employing the geometric/SZ method. Note that the power of the geometric method allows us to support skylines on generated, functional dimensions (e.g., the variance of temperature values throughout the city) defined over data fragmented across (possibly) several distributed nodes. Still, a key concern here lies in the number of threshold queries to monitor, since a naive scheme could end up with as many as O(N^2) queries (where N is the number of objects in the data set). Thus, we propose several new ideas for drastically reducing the number of monitored threshold queries while guaranteeing the correct skyline result. Through an extensive experimental evaluation, using both real and synthetic data, we examine the applicability and efficiency of our proposed algorithms, and compare them to the naive alternative of sending all updates to a coordinator that maintains the current skyline. Our experimental results demonstrate significant performance improvements over the baseline algorithm, easily reaching two orders of magnitude for most data sets.
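
A toy example of the setting (hypothetical city names and fragment values) shows why the global vectors are sums of per-node fragments, and why a naive formulation leads to one dominance (threshold) check per object pair:

```python
# Fragmented skylines, illustrated: an object's global vector is the sum
# of fragments held at different nodes; the skyline is the set of
# non-dominated global vectors. A naive monitor would track one
# threshold-crossing query per ordered object pair, i.e. O(N^2) queries.

def dominates(a, b):
    """a dominates b: >= in every dimension and > in at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

# Fragments of each object's vector, dispersed over three nodes.
fragments = {
    "city_A": [(3, 2), (2, 3), (1, 1)],
    "city_B": [(1, 1), (1, 1), (1, 1)],
    "city_C": [(4, 0), (2, 1), (2, 2)],
}
totals = {k: tuple(map(sum, zip(*v))) for k, v in fragments.items()}

skyline = [k for k in totals
           if not any(dominates(totals[j], totals[k])
                      for j in totals if j != k)]
print(totals, "-> skyline:", skyline)   # city_A and city_C survive
```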

We have conducted experiments with two real-world data sets, WEATHER and MOVIES. WEATHER was downloaded from the website of the National Oceanic and Atmospheric Administration (NOAA). The data set includes weather statistics collected from a network of sensors distributed around the globe. For our experiments, we used a subset of the data set for the years 2010 and 2011, excluding the sensors with incomplete location meta-data or infrequent readings. The resulting data set contained 93.6 million readings of 5423 sensors distributed in 257 countries. For all considered data, we achieved a reduction of communication cost of more than 90% compared to a setting that centralizes all data.

Scenario 3: Monitoring Large Scale Sensor Networks

S. Burdakis and A. Deligiannakis. “Detecting Outliers in Sensor Networks using the Geometric Approach”. In Proceedings of the IEEE International Conference on Data Engineering, 2012.

This work tackles the problem of detecting nodes with "unusual" measurement vectors in sensor network environments. A node is not deemed an outlier if its measurement vector is similar to those of at least a user-defined number (often termed the minimum support) of other sensor nodes. This detection process is often referred to as outlier detection.

A significant amount of effort has recently been devoted to detecting outliers. However, none of the existing techniques has tackled the general problem of being able to detect with 100% accuracy the similarity amongst any desired pair of sensor nodes, while at the same time allowing, in some cases, sensor nodes to refrain from transmitting any information regarding their measurements. In a nutshell, existing outlier detection techniques suffer from one or more of the following drawbacks: (i) they cannot identify the similarity of two nodes with 100% accuracy (assuming no message losses); or (ii) they require the transmission of information that is comparable in size to a centralized query evaluation; or (iii) they require nodes to perform transmissions at each epoch; or (iv) they are tailored to the evaluation of specific similarity functions and/or cannot handle similarity functions over the measurements collected at different nodes (such as the correlation coefficient, or the L1 norm).

As we demonstrate in our work, several common similarity functions, such as the L∞, L1, L2, and Lk norms, cosine similarity, the correlation coefficient, and the Extended Jaccard Coefficient, can be transformed in a way that allows us to utilize the geometric approach for monitoring the corresponding similarity of any pair of sensor nodes. We show that the correlation coefficient can be computed as a function over the average of derived vectors maintained at each node. This allows us to use the geometric approach for monitoring the similarity of any pair of nodes. Similar findings are shown for other similarity functions as well. We then demonstrate how the monitoring task needs to be adapted in the case of sensor networks, as the pair-wise computation allows us to avoid using a global coordinator, and instead have both nodes in each pair act as a coordinator. Extensions to monitoring whether the minimum support of each node exceeds/falls below a threshold are shown to be easy to handle in our framework. Different optimizations, based on the characteristics of the monitored similarity function, are applied. Experiments with real data demonstrate that efficient monitoring of outliers can be performed utilizing a fraction of the bandwidth that a centralized method would require.
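
The flavour of the geometric local test this builds on can be conveyed with a minimal sketch (our own illustration, for the simple case of monitoring the L2 norm of the average vector against a threshold T; all values are made up): each node checks that the ball whose diameter runs from the common reference point e to its current local vector lies entirely inside the admissible region, and as long as every node's ball does, the monitored function of the average provably cannot have crossed the threshold, so no messages are sent:

```python
import numpy as np

# Geometric local test for f(x) = ||x|| against threshold T: the ball with
# diameter [e, v] lies inside {x : ||x|| <= T} iff ||center|| + radius <= T.

def local_check(e, v, T):
    c = (e + v) / 2.0                  # ball center
    r = np.linalg.norm(v - e) / 2.0    # ball radius
    return np.linalg.norm(c) + r <= T  # ball inside the admissible region?

e = np.array([1.0, 1.0])               # last globally agreed estimate vector
T = 3.0
for v in [np.array([1.2, 0.9]), np.array([2.5, 2.5])]:
    print(v, "-> safe" if local_check(e, v, T) else "-> violation, report")
```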

In our experiments we utilized two real-world data sets. The first data set, termed Intel Lab Data, includes temperature, humidity and light measurements collected by motes in the Intel Research, Berkeley Lab. We selected the measurements of the following nodes (in the specified order) that lie in nearby lab locations: 38, 39, 40, 41, 43, 37, 35, 36. In experiments where we vary the number of used nodes, any experiment containing K nodes contains measurements from the first K nodes in the above list, for 30,000 epochs. The second data set, termed Weather Data, includes air temperature, relative humidity and solar irradiance measurements from the station at the University of Washington for the year 2002. We used these measurements to generate readings for up to 9 motes for a period of 2,000 epochs. We measure communication reduction against a baseline method, termed NAIVE, in which the sensor nodes transmit their measurement vectors to the base station at each epoch. Depending on the parameterization of our method, communication was reduced to 10%-30% of the baseline.

The sensor network application has been fully implemented on sensor hardware and has been demonstrated at the LIFT 2nd and 3rd year reviews.

Scenario 4: Real-time Monitoring of Massive Distributed Logs in Clouds and Data Centres

M. Gabel, A. Schuster, R.-G. Bachrach, and N. Bjørner. Latent fault detection in large scale services. In Proc. of the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2012.

Unexpected machine failures, with their resulting service outages and data loss, pose challenges to datacenter management. Existing failure detection techniques rely on domain knowledge, precious (often unavailable) training data, textual console logs, or intrusive service modifications.

We hypothesize that many machine failures are not a result of abrupt changes but rather a result of a long period of degraded performance. This is confirmed in our experiments, in which over 20% of machine failures were preceded by such latent faults.

We propose a proactive approach for failure prevention. We present a novel framework for statistical latent fault detection using only ordinary machine counters collected as standard practice. We demonstrate three detection methods within this framework. The derived tests are domain-independent and unsupervised, require neither background information nor tuning, and scale to very large services. We prove strong guarantees on the false positive rates of our tests. This work introduces the basic mechanisms for detecting outlier machines based on statistical tests. However, it still assumes that all data for the tests is centralized. This restriction is alleviated in the next paper.

M. Gabel, D. Keren, A. Schuster. Communication-efficient Outlier Detection for Scale-out Systems. First International Workshop on Big Dynamic Distributed Data (BD3), held at the 39th International Conference on Very Large Data Bases, VLDB 2013.

Modern scale-out services are built on top of large datacenters composed of thousands of individual machines. These must be continuously monitored, because unexpected failures can overload fail-over mechanisms and cause large-scale outages. Such monitoring can be accomplished by periodically measuring hundreds of performance metrics and looking for outliers, often caused by misconfigurations, hardware failures or even software bugs. Previous work has shown that many failures are indeed preceded by such performance outliers, known as performance problems or latent faults.

In this work we adapt an existing unsupervised statistical framework for latent fault detection to provide an online, communication- and computation-reduced version based on sketching and safe zones. The existing framework is effective in predicting machine failures days before they happen, but requires each monitored machine to send all its periodic metric measurements, which is prohibitive in some settings and requires that the datacenter provide parallel storage and processing. Our adapted framework is able to reduce the amount of data sent and the processing cost at the central coordinator by processing the data in situ, making it usable in wider settings.

We utilize techniques from the domain of stream processing, specifically sketching and safe zones, to trade off accuracy for communication and computation, without compromising the framework's advantages. Like the original framework, our adapted framework is unsupervised, does not require domain knowledge, and provides statistical guarantees on the rate of false positives. Initial experiments show that scores yielded by the adapted framework match the original scores very well, while reducing communication by over 90%.

Rather than centralizing all local statistic vectors x(m, t) of each monitored machine m at a central coordinator for executing a statistical test on outlier machines, each machine m first applies a sketching function f to its vectors, and sends only the sketch x̂ = f(x(m, t)). A modified test is then applied to the sketches rather than the original vectors. One well-suited sketch is the AMS sketch, which involves a random linear projection to k dimensions. In our setting, each machine projects its counter vectors to k dimensions using a specially constructed projection matrix R. The sign test function depends only on the normalized directions from x(m, t) to the other vectors. Since the sign test uses normalized directions, and the projection matrix R preserves their symmetry around x(m, t), we can apply the sign test directly to the sketched vectors x̂(m, t). Moreover, the sign test p-value does not depend on the dimensionality of the vectors, and so we can use it as is.
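
A minimal sketch of this projection step (our own illustration; the ±1 matrix construction shown is one standard AMS-style choice, and all dimensions are made up) shows how the sketched vectors approximately preserve the distances and directions that the sign test consumes:

```python
import numpy as np

# Random linear projection to k dimensions: a +-1/sqrt(k) matrix R maps
# each machine's counter vector to a short sketch while approximately
# preserving pairwise distances (and hence normalized directions).

rng = np.random.default_rng(1)
d, k = 5000, 32                           # original counters -> sketch size
R = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)

x = rng.normal(size=d)                    # counter vector of machine m
y = rng.normal(size=d)                    # counter vector of another machine

x_hat, y_hat = R @ x, R @ y               # sketches sent to the coordinator
print("true distance   :", round(np.linalg.norm(x - y), 2))
print("sketch distance :", round(np.linalg.norm(x_hat - y_hat), 2))
```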

To further reduce the need for communication, we use the LIFT safe zones approach to monitor both the global mean and the global variance of each counter maintained at each machine m. Given the last known global mean and variance of the last T samples, we define lower and upper thresholds, for example 0.5 and 2 times the last known values. If there is any violation, the coordinator polls each node for the current mean and variance, and distributes the new global mean and variance to all nodes. We can trade off accuracy and communication by adjusting the high and low thresholds: violations are less likely if the global values are allowed to drift further from their last known values, reducing communication but also decreasing accuracy. We monitor each counter independently, so it is enough to show how we monitor a single counter X.
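
A toy per-counter monitor following this scheme (our own sketch; the class name is hypothetical and the multipliers 0.5 and 2 are the example values mentioned above) looks as follows:

```python
# Per-counter safe-zone monitor: thresholds at 0.5x and 2x the last known
# global mean and variance; a local violation triggers a poll and a
# redistribution of fresh global values.

class CounterMonitor:
    LOW, HIGH = 0.5, 2.0          # safe-zone multipliers (example values)

    def __init__(self, ref_mean, ref_var):
        self.set_reference(ref_mean, ref_var)

    def set_reference(self, mean, var):
        self.mean_zone = (self.LOW * mean, self.HIGH * mean)
        self.var_zone = (self.LOW * var, self.HIGH * var)

    def violated(self, local_mean, local_var):
        ok_mean = self.mean_zone[0] <= local_mean <= self.mean_zone[1]
        ok_var = self.var_zone[0] <= local_var <= self.var_zone[1]
        return not (ok_mean and ok_var)

mon = CounterMonitor(ref_mean=10.0, ref_var=4.0)
for m, v in [(11.0, 5.0), (25.0, 4.5)]:   # second update drifts too far
    if mon.violated(m, v):
        print(f"violation at mean={m}, var={v}: poll nodes, refresh reference")
    else:
        print(f"mean={m}, var={v}: inside safe zone, no communication")
```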

We began with a general framework and derived tests designed to detect latent faults. This framework, though easy to implement on data-parallel systems, does not directly address the large amount of communication and processing it requires. We applied the principles and techniques of local inference to reduce the amount of transmitted data: sketching is used to reduce the amount of data that must be sent, and safe zones are used to monitor the data range and scale it to approximately uniform variance. Following the principle of local inference, our nodes make use of already available information to avoid communication by predicting future data behavior. Beyond that, the coordinator balances the available slack across the nodes, so that each node is allowed more deviation based on its local data.

We set out to reduce communication by about an order of magnitude, and we conclude that we have achieved this goal. Our experiments on data from a real-world production system show that we are able to achieve a 15% reduction in communication without any new false positives or false negatives. If a small amount of incorrect classifications (1%) is allowed, we can reduce communication to 10% of the original centralized communication. We conclude that the LIFT techniques and principles are well-suited to this setting.

The adapted latent fault detector is robust. In practice, it turns out to be very resilient to scaling errors. Even when the true variance is allowed to drift quite far from the last reference point, latent fault detection performance seldom drops. Similarly, even at small sketch sizes (216 counters reduced to 5 dimensions), the adapted latent fault detector performs close to the centralized version. We plan to further explore the reasons for and limits of this robustness, and how it depends on the data.

Potential impact and dissemination activities

In an official Communication from the EC to the European Parliament, Brussels, 20 April 2009, it is stated that the “strategy for research on future and emerging technologies in Europe is in line with the objectives of the Commission’s European Economic Recovery Plan”. This Communication proposes “bolstering Europe’s competitiveness and the innovation ecosystem in the long term by means of greater investment in higher risk research in the strategically important area of information and communication technologies (ICT)”. The Communication further states “The FET scheme fosters excellence by means of collaboration combining the best in science and engineering”. 

We expect LIFT to have a profound impact on the three main technological challenges set out in the 2009-2010 ICT work programme.

Pervasive and trustworthy network and services infrastructure: The proposed research extends the capabilities of large-scale distributed data systems. It does so by attacking fundamental technological barriers in distributed systems, distributed data processing and distributed data streams. Such barriers are strong inhibitors to the development of unified and coherent systems that manipulate geographically distributed data sets. It has impact not only on one class of distributed infrastructures, but on several, including P2P, grids, and sensor networks.

More robust, context-aware and easy-to-use ICT systems that self-improve and self-adapt within their respective environments: Monitoring the state of the system is a key requirement for self-improvement and adaptation. The impact of LIFT here is to offer fundamental algorithms that can dramatically improve the quality and efficiency of monitoring, facilitating it in cases where this was impossible before.

Increasingly smaller, cheaper, more reliable and low-consumption electronic components and systems that constitute the basis for innovation in all major products and services: Computers consume a large amount of electrical power, and communication is a key element in the energy budget. LIFT contributes by cutting down the communication overhead by a large amount.

Thus LIFT helps to remove technological roadblocks related to all three main technological challenges, and thereby helps to improve the competitiveness of European industry and to enable Europe to master and shape future developments in ICT so that the demands of its society and economy are met. It will also advance European research and science. In the longer term, LIFT may reveal a powerful approach for addressing the challenges for ICT research driven by socio-economic goals, in particular towards novel ICTs for mobility, environmental sustainability and energy efficiency (related to the scenario of mobility management). Such an ambitious goal can only be reached by an international consortium; no individual partner would have sufficiently broad expertise in both theory and applications.

To sum up, the overall aim of LIFT is clearly ambitious, as it targets one of the most challenging questions open today in our increasingly ICT-based society: how to turn ubiquitous, individual knowledge into global knowledge of complex techno-social (or techno-environmental) systems? A new theory would deeply influence the emerging paradigm of large techno-social networks that evolve and interact locally without strict central supervision. LIFT will elevate European research in a domain that is currently dominated by US-based companies and research. Such strong dominance may eventually result in all European data being shipped to, stored at, and processed in North America. There is a current trend of accumulating activity logs at central hubs that serve as universal knowledge providers – the Google paradigm. A successful deployment of large-scale ubiquitous data systems according to the LIFT vision would create a viable alternative. The LIFT paradigm not only improves scalability and energy-efficiency, but also addresses the societal need for more responsible, privacy-aware technologies.

A major source of scientific dissemination are, of course, the published papers. The fact that journals such as PVLDB and TKDE, and conferences such as ICDE and VLDB in data management, KDD and ECML in data mining, and ACM GIS and AGILE in geographic information science could be addressed guarantees a strong potential impact and high visibility for the LIFT project.

Major peer-reviewed publications are:

G. Sagy, D. Keren, I. Sharfman, A. Schuster. Distributed Threshold Querying of General Functions by a Difference of Monotonic Representation. Proc. of the Intl. Conf. on Very Large Databases, 2011.

N. Giatrakos, Y. Kotidis, A. Deligiannakis, V. Vassalos, and Y. Theodoridis. In-Network Approximate Computation of Outliers with Quality Guarantees. Information Systems, 2011.

N. Giatrakos, A. Deligiannakis, M. Garofalakis, I. Sharfman, and A. Schuster. "Prediction-Based Geometric Monitoring over Distributed Data Streams". In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2012.

D. Keren, I. Sharfman, A. Schuster, A. Livne. Shape Sensitive Geometric Monitoring. IEEE Transactions on Knowledge and Data Engineering, 2012.

O. Papapetrou, M. Garofalakis, and A. Deligiannakis. “Sketch-based Querying of Distributed Sliding-Window Data Streams”. In Proceedings of the 38th International Conference on Very Large Databases, 2012.

G. Cormode, M. Garofalakis, P. J. Haas, and C. M. Jermaine. “Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches”. Foundations and Trends in Databases, 4(1-3), 2012.

S. Mascetti, A. Monreale, A. Ricci, A. Gerino. Anonymity: a Comparison between the Legal and Computer Science Perspectives. In The 5th International Conference on Computers, Privacy, and Data Protection: "European Data Protection: Coming of Age", 2012.

D. Pedreschi, A. Monreale, and F. Giannotti. Privacy by design in data mining. In Awareness Magazine, 2012. DOI:  10.2417/3201202.004005.

S.-C. Florescu, M. Mock, C. Körner, and M. May. Efficient Mobility Pattern Stream Matching on Mobile Devices. Proc. of the Workshop on Ubiquitous Data Mining (UDM 2012), pages 23-27, 2012.

S. Burdakis and A. Deligiannakis. Detecting outliers in sensor networks using the geometric approach. In Proc. of the 28th Int. Conf. on Data Engineering (ICDE’12), pages 1108-1119, 2012.

M. Gabel, A. Schuster, R.-G. Bachrach, and N. Bjørner. Latent fault detection in large scale services. In Proc. of the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2012.

C. Kopp, M. Mock, and M. May. Privacy-preserving distributed monitoring of visit quantities. Proc. of the 20th International Conference on Advances in Geographic Information Systems (ACM GIS 2012). ACM, 2012.

M. Baglioni, J. A. F. de Macedo, C. Renso, R. Trasarti, and M. Wachowicz: How you move reveals who you are: understanding human behavior by analyzing trajectory data. Knowledge and Information Systems (KAIS), 2012.

M. Nanni, R. Trasarti, G. Rossetti, and D. Pedreschi. Efficient distributed computation of human mobility aggregates through user mobility profiles. In Proc. of the ACM SIGKDD International Workshop on Urban Computing (UrbComp 2012), pages 87-94. ACM, 2012.

M. Kamp, C. Kopp, M. Mock, M. Boley, and M. May. Privacy-Preserving Mobility Monitoring using Sketches of Stationary Sensor Nodes. ECML/PKDD, 2013.

A. Monreale, W. H. Wang, F. Pratesi, S. Rinzivillo, D. Pedreschi, G. Andrienko, and N. Andrienko. Privacy-preserving Distributed Movement Data Aggregation. AGILE 2013

F. Pratesi, A. Monreale, W. H. Wang, S. Rinzivillo, D. Pedreschi, G. Andrienko, and N. Andrienko. Privacy-Aware Distributed Mobility Data Analytics. 21st Italian Symposium on Advanced Database Systems (SEBD) 2013.

M. Gabel, D. Keren, A. Schuster. Communication-efficient Outlier Detection for Scale-out Systems. First International Workshop on Big Dynamic Distributed Data (BD3), held at the 39th International Conference on Very Large Data Bases, VLDB 2013.

M. Kamp, M. Boley, T. Gärtner. Beating Human Analysts in Nowcasting Corporate Earnings by using Publicly Available Stock Price and Correlation Features. Workshop on Domain Driven Data Mining (DDDM 2013) at ICDM 2013.

M. Garofalakis, D. Keren and V. Samoladas. Sketch-based Geometric Monitoring of Distributed Stream Queries. 39th International Conference on Very Large Data Bases, VLDB 2013.

D. Keren, G. Sagy, A. Abboud, D. Ben-David, I. Sharfman, and A. Schuster. Safe-Zones for Monitoring Distributed Streams. First International Workshop on Big Dynamic Distributed Data (BD3), held at the 39th International Conference on Very Large Data Bases, VLDB 2013.

C. Kopp, M. Mock, O. Papapetrou and M. May. Large-scale Online Mobility Monitoring with Exponential Histograms. First International Workshop on Big Dynamic Distributed Data (BD3), held at the 39th International Conference on Very Large Data Bases, VLDB 2013.

M. Boley, M. Kamp, D. Keren, A. Schuster and I. Sharfman. Communication-Efficient Distributed Online Prediction using Dynamic Model Synchronizations. First International Workshop on Big Dynamic Distributed Data (BD3), held at the 39th International Conference on Very Large Data Bases, VLDB 2013.

D. Keren, G. Sagy, A. Abboud, D. Ben-David, A. Schuster, I. Sharfman and A. Deligiannakis. "Geometric Monitoring of Heterogeneous Streams". IEEE Transactions on Knowledge and Data Engineering, 2013.

M. Hoffmann, M. Mock, M. May. Road-quality classification and bump detection with bicycle-mounted smartphones. Workshop on Ubiquitous Data Mining at the Int. Joint Conf. on Artificial Intelligence (IJCAI 2013), Beijing, pp. 39-43.

M. Nanni, R. Trasarti, P. Cintia, B. Furletti, C. Renso, L. Gabrielli, S. Rinzivillo, F. Giannotti. "Mobility Profiling". To appear in "Data Science and Simulation in Transportation Research", edited by Davy Janssens, Ansar-Ul-Haque Yasar and Luk Knapen, December 2013. DOI: 10.4018/978-1-4666-4920-0 URL: www.igi-global.com/book/data-science-simulation-transportation-research/78944

O. Papapetrou and M. Garofalakis. Continuous Fragmented Skylines. IEEE International Conference on Data Engineering (ICDE) 2014.

A. Friedman, I. Sharfman, D. Keren and A. Schuster. Privacy Preserving Distributed Stream Monitoring. NDSS 2014.

We expect the overview on sketches and related methods ("Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches") by Minos Garofalakis and others, partially funded by LIFT, to become an influential source of information on this topic. It is encouraging that this important line of research, in which researchers coming from Europe (Graham Cormode, Minos Garofalakis, and others) but working in the US have always had a strong role, now comes partially back to Europe in terms of affiliations.

 

Major further dissemination activities

Michael May gave an invited presentation at the Next Generation Data Mining Summit '11, organized by Prof. Hillol Kargupta, a leading researcher in distributed data mining. The NGDM summit is a series of workshops held every two years that has high-profile attendance and has been influential in shaping the area of distributed data mining, and data mining in general, in the past.

Minos Garofalakis gave an invited colloquium talk on query processing over distributed data streams (covering the LIFT approach) at the Athens University of Economics and Business. He is also currently editing a book (with Johannes Gehrke (Cornell) and Rajeev Rastogi (Yahoo!)) on Data Stream Management, to appear in Springer-Verlag's "Data-Centric Systems and Applications" series; the book will include a chapter on distributed data streams authored by LIFT researchers.

Minos Garofalakis, Assaf Schuster, Izchak Sharfman, and Daniel Keren were invited participants at the NII Shonan meeting on "Large Scale Distributed Computation" held in Japan in January 2012 (see below). We mention in passing that the call text (http://www.nii.ac.jp/shonan/blog/2011/04/11/large-scale-distributed-computation/) demonstrates very well the attention and impact that LIFT technology has already made on the community, and the timeliness of the LIFT project.

Assaf Schuster gave two talks at Microsoft Research about monitoring in peer to peer networks, and geometric distributed monitoring.

Daniel Keren held a keynote lecture at the SYSTOR 2011 conference.

Fosca Giannotti gave invited talks at the following events related to the management and analysis of large mobility datasets: IEEE MDM 2011 (12th IEEE International Conference on Mobile Data Management), June 9, 2011, Sweden; GEOINFO 2010 (XII Brazilian Symposium on Geoinformatics), Campos do Jordao, SP, Brazil, December 1, 2010; EGC 2011 (Symposium on Extraction et Gestion des Connaissances), Brest, France, January 2011. In these talks, the LIFT project and the LIFT approach were briefly presented and pointed out as the future way to go in order to handle the huge streaming flow of distributed mobility data that future mobility applications will involve.

Michael May gave an invited talk at the Ubiquitous Data Mining Workshop (UDM 2012), Montpellier, France, August 2012, which he organized together with Prof. João Gama. He further gave an invited talk about big data in technical processes, presenting the LIFT cloud scenario, and was an invited speaker in a podium discussion at the SAS Forum, Mainz, Germany, November 2012. In addition, Michael May gave an invited talk about the LIFT mobility application scenarios at the MODAP Workshop Mobile Analytics Meets Social Media, Sankt Augustin, Germany, June 2012.

Michael May and Christine Kopp visited the Bavarian State Office for Data Protection (Ansbach, Germany, September 2012) and discussed the usage of sketching techniques for the privacy preserving analysis of Bluetooth data in the LIFT mobility scenario.

Christine Kopp gave a presentation about the privacy-preserving distributed monitoring in the LIFT mobility scenario at the Dagstuhl Seminar “Mobility Data Mining and Privacy”, Dagstuhl, Germany, August 2012.

Minos Garofalakis was an invited tutorial speaker on “Data-Management Challenges for Big Data Analytics” at the 2nd MODAP-MOVE Summer School, Leysin, Switzerland, July 2012. He was also an invited participant and speaker on "Querying Distributed Data Streams" at the Workshop on Innovative Querying of Streams (INQUEST'2012), University of Oxford, UK, September 2012. Finally, he delivered an invited keynote talk on scalable database algorithms for managing uncertainty at the 6th International Conference on Scalable Uncertainty Management (SUM'2012), Marburg, Germany, September 2012.

Minos Garofalakis, Assaf Schuster, Izchak Sharfman, and Daniel Keren were invited participants and they all delivered talks on LIFT at the 2012 NII Shonan meeting on “Large Scale Distributed Computation”, Shonan Village, Japan, January 2012. The NII Shonan Meetings follow the same style of the well-known Dagstuhl Seminars, aiming to promote cutting-edge informatics research by providing another premier international venue for both world-class scientists and promising young researchers and practitioners to exchange knowledge and discuss important research findings. We expect the cross-fertilization of systems and theoretical research during this meeting to have a significant impact on shaping the research agenda of the field of distributed stream monitoring.

Antonios Deligiannakis gave a talk at the University of Maryland on the use of the geometric method and safe zones for distributed monitoring, focusing on outlier detection in sensor networks and prediction modelling techniques.

Assaf Schuster gave an invited talk on "On-the-Fly Processing of Big, Distributed, Streaming Data" at the International Conference on Software Science, Technology and Engineering (SWSTE), June 2012, Herzliya, Israel. He further gave an invited talk about "Scalable Data Stream Processing" at the International Conference on Parallel, Distributed and Grid Computing (PDGC), December 2012, JUIT Campus, Solan, Himachal Pradesh, India.

Arik Friedman gave a talk on applications of safe zones to privacy at the DIMACS Workshop on Recent Work on Differential Privacy across Computer Science, DIMACS Center, Rutgers University, New Jersey, USA, October 2012.

Fosca Giannotti and Dino Pedreschi gave an invited talk at the University of Amsterdam on social data mining and privacy, Amsterdam, The Netherlands, April 2012.

Dino Pedreschi gave an invited talk on "Social mining and the New Deal on Data" at the workshop "Global Scientific Data Infrastructures", Taormina, Italy, May 2012.

Anna Monreale gave a lecture on “Privacy-preserving Mobilytics” at the 2nd MODAP-MOVE Summer School, Leysin, Switzerland, July 2012. Anna Monreale also gave a talk on “Privacy Issues in LIFT Systems” at the ENFORCE Project meeting, Bologna, Italy, July 2012.

Mirco Nanni gave an invited talk on "Mobility Data Mining Meets Network Analysis" at the MODAP Workshop Mobile Analytics Meets Social Media, Sankt Augustin, Germany, June 2012.

Roberto Trasarti gave an invited talk on “Mobility Data Analysis and Networks” at GDR'12 French Symposium, Porquerolles, France, May 2012.

Moshe Gabel gave a presentation at the Tokyo Institute of Technology on latent fault detection in cloud data centers in February, 2013 in Tokyo, Japan.

Dino Pedreschi and Fosca Giannotti gave an invited talk at the Italian Privacy Authority, Rome in May, 2013. 

Fosca Giannotti held a lecture at the IDEA Summer School, Delft, The Netherlands in July 2013.

Francesca Pratesi gave an invited talk at the ENFORCE Meeting in Pisa, Italy in July, 2013.

Fosca Giannotti gave an invited talk at the PST EIT ICT LABS WORKSHOP in Trento, Italy in April 2013, as well as an invited talk at the Giornata di Studio "Big data: una sfida ricca di opportunità" (Study Day "Big Data: A Challenge Rich in Opportunities") in Udine, Italy in July 2013. She was furthermore a panellist at the ICDM Panel "Big -- The Value of Data", on the topic Transparency & Privacy, ICDM 2012.

Dino Pedreschi and Anna Monreale organized the IEEE Workshop on Privacy in Social Data (PinSoDa'12) at ICDM 2012 in December 2012 in Brussels, Belgium.

Dino Pedreschi and Fosca Giannotti organized the workshop “Towards a European Laboratory on Big Data Analytics and Social Mining: Bootstrap Workshop” in Pisa, Italy in July 2013.

Daniel Keren gave an invited talk at the Yahoo Research Lab Haifa in August, 2013.

Mario Boley gave a presentation at the Math and CS Department of Eindhoven University of Technology in September 2013.

Christine Kopp held a lecture at the DATASIM Summer School in Hasselt, Belgium in July 2013.

Assaf Schuster gave an invited talk at SRDC 2013, the TRANSFORM Summer School on Research Directions in Distributed Computing, in June 2013 in Heraklion, Greece. He will furthermore hold two keynotes, at the 14th International Conference on Runtime Verification in September 2014 in Toronto, Canada, and at the International Conference on Applied Algorithms in January 2014 in Kolkata, India.

Minos Garofalakis gave a keynote speech at the 21st Italian Symposium on Advanced Database Systems (SEBD 2013), June 2013 in Roccella Jonica, Italy, as well as two invited talks, at Big Data Analytics 2013, May 2013, at Microsoft Research, Cambridge, USA, and at the 2013 Information Theory and Applications (ITA) Workshop, February 2013 in San Diego, USA.

Minos Garofalakis furthermore held two invited seminars in February 2013, one at the University of California, San Diego, USA, and one at the University of Texas, Austin, USA.

Michael May organized, together with Prof. Joao Gama, the Workshop on Ubiquitous Data Mining (UDM 2013) in conjunction with IJCAI 2013, August 2013 in Beijing, China.

Minos Garofalakis was an invited panellist at the 39th International Conference on Very Large Databases (VLDB 2013), August 2013 in Trento, Italy. Moreover, he organized the First International Workshop on Big, Dynamic, Distributed Data (BD3) (http://www.softnet.tuc.gr/bd3/) in conjunction with VLDB 2013, August 2013 in Trento, Italy. This workshop was the LIFT project's open, peer-reviewed scientific workshop; it received high-level international attention and contributions, and presented LIFT results in four presentations.

 

Concertation with other EC funded projects and activities

A new FET-Open project started in September 2011 with four LIFT partners participating (HU, Technion, Fraunhofer, CNR): DataSim (Data Science in the Era of Electric Vehicles). It partially overlaps with LIFT in the originally envisioned application to electric vehicle data. This project is an excellent opportunity to test LIFT technology in a large-scale practical application, and we expect a strong interaction between the two projects. It also demonstrates the impact that LIFT can have on future research.

The LIFT partners Fraunhofer, CNR and TUC strongly collaborated with the FET-OPEN CA MODAP (Mobility, Data Mining, and Privacy), bringing LIFT aspects (especially those related to WP2 and the mobility scenario in WP4) into the discussion. The NGDM 2011 summit had a MODAP-organized session to which LIFT partners contributed. In particular, two tutorials were given at the 2nd MODAP-MOVE Summer School and two invited talks at the MODAP industry workshop Mobile Analytics Meets Social Media.

Daniel Keren, Assaf Schuster and Izchak Sharfman presented the LIFT approach at the kick-off meeting of the FET-Open project DATASIM (Data Science in the Era of Electric Vehicles) in Hasselt, Belgium, October 2011, which led to a lively discussion of LIFT technology and its potential use in the DATASIM project. Due to its privacy-preserving nature, the LIFT technology developed within the Bluetooth scenario in WP4 is especially applicable to the collection and evaluation of origin-destination (OD) matrices, which is one main task in WP4 of DATASIM.

A further European project making use of LIFT technology started on September 1st, 2012: INSIGHT. It utilizes LIFT technology for the purpose of disaster monitoring in a real-world setting, and thus helps to bring the LIFT paradigm into real-world applications and to end users (in this case Dublin City Council, Ireland, and the Bundesamt für Bevölkerungsschutz und Katastrophenhilfe, Germany). From LIFT, both Technion and Fraunhofer participate in this project.

The LIFT approach will be used as a key technique in the European project FERARI, which is planned to start on February 1st, 2014. From LIFT, Fraunhofer, Technion and TUC participate in this project. FERARI (Flexible Event pRocessing for big dAta aRchItectures) aims at combining LIFT technology with Complex Event Processing in Big Data architectures, in order to make the online analysis of massively distributed data streams applicable in the context of business and technological applications.

 

Exploitation and business potential

A key task in modern customer relationship management (CRM) is to understand the needs of each customer and to devise policies to harmonize them as well as possible with the company objectives. In most cases that translates into identifying a portfolio of customer profiles, each representing the needs and requirements of a reasonable number of customers. Then, each profile can be treated separately in order to devise the market strategies and business models that best fit its customers' peculiarities. In the business intelligence field, this process is best known as customer segmentation, and is traditionally implemented by applying predefined customer classification rules (for instance based on RFM indices, such as recency of last contact with the customer, frequency of transactions, and monetary volume involved in the relation with the customer) or, in more recent times, by using clustering algorithms. This kind of process can be applied to any kind of business focused on services towards several customers, including the classical domain of retail selling as well as e-commerce and many others. As a business-oriented exploitation of the LIFT methodology, we have developed an application of customer segmentation in the very timely and rather uncommon field of car insurance, where the main features that characterize a customer are related to how he/she drives, and therefore to his/her mobility. We adopt a data mining-oriented approach, working from a realistic big data perspective, where the application can benefit from a continuous flow of fresh information. In particular, we developed a framework where customer segments are built at the initialization step and then continuously monitored to ensure that the available segmentation still fits the customer behaviors well. We explored this real-world problem in contact with the Italian company OctoTelematics, which collects mobility data for insurance purposes and sees a high direct potential for business exploitation of LIFT technology.

Project Contact Details

Website: www.lift-eu.org

Coordinator: Dr. Michael Mock

Fraunhofer IAIS

D-53754 St. Augustin

michael.mock@iais.fraunhofer.de
