Context: the database we’re using only handles insert batch sizes of up to 100. We’re getting into scenarios where our tag_writes buffer grows at a crazy rate and we’re unable to recover unless we restart our services.
We’ve set akka.persistence.cassandra.events-by-tag.max-message-batch-size to 100, and our flush interval is 50ms.
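For reference, the relevant slice of our configuration looks roughly like this (the max-message-batch-size path is as above; the flush-interval key is my reading of the plugin’s reference.conf, so correct me if that isn’t the right knob for our 50ms flush interval):

```
akka.persistence.cassandra {
  events-by-tag {
    # our DB only accepts insert batches of 100, so cap the tag write batches at that
    max-message-batch-size = 100
    # assuming this is the key behind our "50ms flush interval"
    flush-interval = 50ms
  }
}
```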
We are using akka-persistence-cassandra version 1.0.1. Would upgrading help? I see this PR that seems to touch some of this code: https://github.com/akka/akka-persistence-cassandra/pull/841/files
I think what’s happening is the following, though I could be completely wrong.
- We occasionally get rate limited by our DB and see some OverloadedExceptions.
- After enough of these exceptions, we start getting:
  Writing tags has failed. This means that any eventsByTag query will be out of date. The write will be retried. Reason com.datastax.oss.driver.api.core.servererrors.WriteFailureException: Cassandra failure during write query at consistency QUORUM (1 responses were required but only 0 replica responded, 1 failed)
- And then not long after, we start getting:
  Buffer for tagged events is getting too large (401)
  and it keeps building up (to over 10k in some instances!).
- In the debug logs leading up to that, I see lots of:
  Sequence nr > than write progress. Sending to TagWriter
- Since the buffer keeps increasing (and never seems to decrease), we’re unable to continue writing tags to our DB and have to manually restart things over and over to start fresh. Is there some sort of back pressure for when the buffer gets too large?
- After this, we keep getting timeouts, which prevents our actors from making forward progress.
- We are also getting a boatload of errors around recovery timeouts. Do we need to bump up the event-recovery-timeout config value (see the sketch after this list)?
  Supervisor RestartSupervisor saw failure: Exception during recovery. Last known sequence number [0]. PersistenceId [Aggregate|12345678-1234-5678-1234c630376bbcc0], due to: Replay timed out, didn't get event within [30000 milliseconds], highest sequence number seen [0]
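On the recovery-timeout question above, this is the kind of bump I had in mind. I’m not sure where event-recovery-timeout actually lives in the plugin’s reference.conf, so the nesting below is only my guess:

```
akka.persistence.cassandra {
  journal {
    # ASSUMPTION: guessing event-recovery-timeout sits under the journal section of the plugin config.
    # The idea is simply to raise it above the 30s we see in the
    # "Replay timed out, didn't get event within [30000 milliseconds]" errors.
    event-recovery-timeout = 60s
  }
}
```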
Any help here is highly appreciated. I can provide our configuration for different params if needed.