Dear Akka Community,
We’ve encountered a JournalFailureException during entity recoveries in our production environment, and we suspect it is linked to an outage at our Cassandra provider, DataStax. We’re uncertain whether such behavior is possible in Akka, so we’re seeking clarification.
During a recent outage on March 1, 2024, we observed an Akka persistent entity write an event with a sequence number that already existed in the journal, producing a duplicate. It appears that during the outage the entity recovered with only partial or outdated data and then continued persisting new events from the wrong position. This differs from a network partition scenario, where two instances of the same entity might persist the same events concurrently. Here, a single entity attempted to recover from its past events, failed to replay them fully, and then proceeded to persist new events, causing a sequence number collision.
We don’t believe this is a network partition issue: the old data was originally persisted two years ago, and the entity was only recently reactivated to persist new events. During recovery it failed to replay the old events and went on to persist a new event with sequence number 1. To illustrate, here is a snapshot of such a collision from the event journal’s “messages” table (certain fields are masked for confidentiality):
As seen in the snapshot, the first event was persisted on Sep 15, 2021, and the new event on Mar 1, 2024, both with the same sequence number.
Our question is: under what circumstances can Akka produce a corrupted event journal during a database outage? We would expect Akka to stop persisting new events when the data store cannot return the full set of existing events.
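To make the suspected failure mode concrete, here is a toy simulation of what we think happened. This is not Akka’s actual implementation; `highestSeqNr`, `nextSeqNr`, and the event map are invented purely for illustration, under the assumption that a failed or partial journal read was treated like an empty journal:

```scala
// Hypothetical sketch of the suspected failure mode (not Akka's real code).

// The journal already holds events up to sequence number 3 (persisted in 2021).
val storedEvents = Map(1L -> "created", 2L -> "updated", 3L -> "closed")

// During a healthy recovery, the highest stored sequence number is found;
// during the outage, we suspect a partial/failed read was misreported as
// "no events", yielding 0.
def highestSeqNr(journal: Map[Long, String], outage: Boolean): Long =
  if (outage || journal.isEmpty) 0L
  else journal.keys.max

// The next persist uses highest + 1.
def nextSeqNr(journal: Map[Long, String], outage: Boolean): Long =
  highestSeqNr(journal, outage) + 1

val healthy      = nextSeqNr(storedEvents, outage = false) // 4: no collision
val duringOutage = nextSeqNr(storedEvents, outage = true)  // 1: collides with the 2021 event
```

If something like this is possible, we would expect the outage case to fail the recovery loudly rather than restart the sequence at 1.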
We appreciate any insights or guidance on this matter.