Servicing “mice”-streaming queries as a service

MSR-TR-2013-124 |

Processing streaming queries at scale has received a lot of attention recently. Systems, such as S4, Storm, Photon, MillWheel, and Kinesis, can efficiently execute (“elephant”) queries that process streams with high rates, and typically scale-out the execution of each query. We envision a platform in the opposite side of the spectrum. Our queries process relatively slower streams, but the challenge is to handle very high density of queries per server. Such a system democratizes the access to a streaming platform, and enable “mice” queries, on behalf of users, to consume stream data, e.g. news, weather, travel related information, and so on. A challenging problem for the design of such a system is how to place queries among available servers. On one hand, we expect balanced workload among servers. On the other hand, we expect queries are packed into servers such that the network traffic is minimized by reducing the chance of sending identical data streams to multiple servers. This trade-off makes it hard to effectively place queries into servers. When we consider queries subscribing to more than one sources and query/server dynamics, this problem becomes even harder.

In this paper, we formalize the problem of placing “mice” queries into the servers of a streaming platform. We propose approximation algorithms and derive approximation bounds for the cases, including (1) the offline case where queries are stable and known ahead of time, akin to an “oracle”, and (2) the online case without departures and known query popularities.For the general online problem, we propose effective heuristic algorithms. An extensive set of experiments demonstrate that the proposed algorithms provide good performance in a wide-range of scenarios.