Lasso Detection using Partial-State Caching

MSR-TR-2017-37 |

We study the problem of finding liveness violations in real-world
asynchronous distributed systems. Unlike a safety property, which
asserts that certain bad states should never occur during execution, a
liveness property states that a program should not remain in a bad
state for an infinitely long period of time. Checking for liveness
violations is an essential testing activity to ensure that a system
will always make progress in production.

The violation of a liveness property can be demonstrated by a finite
execution where the same system state repeats twice (known as
lasso). However, this requires the ability to capture the state
precisely, which is arguably impossible in real-world systems. For
this reason, previous approaches have instead relied on demonstrating
a long execution where the system remains in a bad state. However,
this hampers debugging because the produced trace can be very long,
making it hard to understand.

Our work aims to find liveness violations in real-world systems while
still producing lassos as a bug witness. Our technique relies only on
partially caching the system state, which is feasible to achieve
efficiently in practice. To make up for imprecision in caching, we use
retries: a potential lasso, where the same partial state repeats
twice, is replayed multiple times to gain certainty that the execution
is indeed stuck in a bad state.

We have implemented our technique in the P# programming language and
evaluated it on real production systems and several challenging
academic benchmarks.