Hi folks, I maintain an Akka Cluster Sharding service that reads events from an SQS queue and makes external requests to about 4 other services for validation, aggregation, etc. It keeps all of that data in memory and publishes updates to the aggregated data downstream, so the service's throughput is directly tied to these external calls.
The cluster recently built up an unusually large event backlog. Along with high GC counts and times, we also saw the error below in the logs (usually on a single pod). It seems to indicate that an Akka (HTTP?) actor terminated while trying to call an external service and was never able to make calls again for the rest of the service's uptime, so the entire cluster had to be restarted.
akka.stream.StreamTcpException: The connection actor has terminated. Stopping now.
Killing the pod just seems to move the problem to another pod after some time, so it is likely tied to a specific shard entity.
What I want to know is:
- Is this log message somehow correlated with garbage collection of shard entities?
- Is there a way to restart these terminated actors without having to restart the cluster?
Thanks in advance, and let me know if I need to provide further context.