As you read the papers, consider the following questions:
Question 1: What types of failures can occur in a distributed stream processing system, and how can these failures affect the output seen by users?
Question 2: In the paper by Hwang et al., the authors examine several techniques for recovering the state of a failed processing node. How do these techniques compare in terms of recovery speed and runtime overhead?
Question 3: What is the main challenge in handling network partitions? What approaches are possible for dealing with this type of failure?