Hi,
We are designing a flow using Akka Streams which does the following:
- Periodically look for files in a directory (an NFS mount) - using a ticker source
- List the files
- Read lines from each file - using flatMapConcat
- Delete each file once it has been read - using mapMaterializedValue
- Store each line to Kafka (wrapped inside some object)
Sample code looks like this:
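For completeness, these are the imports and the implicit ActorSystem that the snippets below rely on (a sketch assuming the Akka 2.6 package layout; the system name is a placeholder):

import java.io.File
import java.nio.file.{Files, Paths}

import scala.concurrent.duration._
import scala.util.{Failure, Success}

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{FlowShape, OverflowStrategy}
import akka.stream.scaladsl.{Balance, FileIO, Flow, Framing, GraphDSL, Merge, Sink, Source}
import akka.util.ByteString

implicit val system: ActorSystem = ActorSystem("file-to-kafka") // placeholder name
import system.dispatcher // ExecutionContext for onComplete and mapAsync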
Ticker source to list and submit files periodically
====================================================
val ticker = Source
  .tick(1.second, 300.millis, ())
  .buffer(1, OverflowStrategy.backpressure)
  .mapConcat { _ =>
    // listFiles() returns an Array; convert it to an immutable List for mapConcat
    Paths.get("C:\\Users\\Test_Dir").toFile.listFiles().toList
  }
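One thing to note: File.listFiles() returns null when the directory is missing or unreadable, which would fail the stream inside mapConcat. A minimal guard could look like this (the helper name listDir is hypothetical):

// Hypothetical helper: fall back to an empty list when listFiles() returns null
def listDir(dir: String): List[File] =
  Option(Paths.get(dir).toFile.listFiles()).map(_.toList).getOrElse(Nil)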
Logic for processing files one by one and deleting each once done
==================================================================
def processFile(): Flow[File, String, NotUsed] = {
  Flow[File]
    .flatMapConcat { file =>
      FileIO
        .fromPath(Paths.get(file.getAbsolutePath))
        .mapMaterializedValue { f =>
          // f is the Future[IOResult] of the read; onComplete needs an implicit ExecutionContext
          f.onComplete {
            case Success(r) =>
              if (r.count > 0 && r.status.isSuccess) {
                Files.delete(file.toPath)
              }
            case Failure(e) => println(s"Something went wrong when reading: $e")
          }
          NotUsed
        }
        .recover {
          case _: Exception => ByteString("")
        }
        .via(Framing.delimiter(ByteString("\n"), Int.MaxValue, allowTruncation = true))
        .map(_.utf8String)
        .mapAsync(1)(FileSerdes.deserialize(file.getAbsolutePath)) // custom per-line deserialization
    }
}
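As an aside, processFile() can be exercised in isolation like this (the file path is just a placeholder):

// Hypothetical smoke test: read a single file through processFile() and print each line
Source
  .single(new File("/mnt/nfs/readings/sample.txt"))
  .via(processFile())
  .runForeach(println)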
Graph builder
===========
val parallelism = 2 // configurable; must match the Balance/Merge port counts below
val graphBuilder = GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val balance = builder.add(Balance[File](parallelism))
  val merge   = builder.add(Merge[String](parallelism))
  (1 to parallelism).foreach { _ =>
    balance ~> processFile() ~> merge
  }
  FlowShape(balance.in, merge.out)
}
Sink definition
==================
val storeReadings = Sink
  .foreach[String](println) // println as a stand-in; it will be Kafka eventually
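Since the sink will eventually write to Kafka, here is a minimal sketch of what that could look like with Alpakka Kafka (the bootstrap servers and topic name are placeholders, not part of the original code):

import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

// Hypothetical Kafka sink: each line becomes a String record on a placeholder topic
val producerSettings =
  ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")

val kafkaSink = Flow[String]
  .map(line => new ProducerRecord[String, String]("readings-topic", line))
  .to(Producer.plainSink(producerSettings))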
Materializing the graph
=======================
val result2 = ticker
  .via(graphBuilder)
  .runWith(storeReadings)
Now, the following are the challenges I'm running into
=======================================================
- The application has to be deployed on multiple nodes, each watching the same NFS mount for files.
- Once files arrive, all the nodes list the same files, which results in duplicates at the sink (this is avoided to some extent because we delete each file as soon as it is read, but not completely).
- Assume we have a large number of files to process (say 3k). The ticker source re-lists files that are still left in the directory, both within the same node and across nodes, which again results in duplicates.
- Exception handling: assume a file is processed and gets deleted. The other nodes will also try to process the same file and hit a NoSuchFileException. I'm handling these scenarios with the recover block; will that result in message loss? (A narrowed version of that recover block is sketched after this list.)
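For reference, the recover block in processFile() currently maps every exception to an empty ByteString. Narrowing it to the "another node already deleted the file" case would look roughly like this; note this is only a sketch, and depending on the Akka version the failure may arrive wrapped (e.g. in an IOOperationIncompleteException), so the match may need adjusting:

import java.nio.file.NoSuchFileException

// Sketch: treat "file already deleted by another node" as an empty file,
// while letting any other read failure propagate instead of being swallowed
def fileBytes(path: String): Source[ByteString, _] =
  FileIO
    .fromPath(Paths.get(path))
    .recover {
      case _: NoSuchFileException => ByteString.empty
    }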
Your help will be much appreciated.