Hi,
We are designing a flow using Akka Streams which does the following:
- Periodically look for files in a directory (an NFS mount) - using a ticker source
- List the files
- Read lines from each file - using flatMapConcat
- Delete each file once it has been read - using mapMaterializedValue
- Store each line to Kafka (wrapped inside some object)
Sample code looks like this:
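For completeness, these are the imports and the implicit ActorSystem that the snippets below rely on (a sketch assuming the Akka 2.6 package layout; the system name is a placeholder):

import java.io.File
import java.nio.file.{Files, Paths}

import scala.concurrent.duration._
import scala.util.{Failure, Success}

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{FlowShape, OverflowStrategy}
import akka.stream.scaladsl.{Balance, FileIO, Flow, Framing, GraphDSL, Merge, Sink, Source}
import akka.util.ByteString

implicit val system: ActorSystem = ActorSystem("file-to-kafka") // placeholder name
import system.dispatcher // ExecutionContext for onComplete and mapAsync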
Ticker source to list and submit files periodically
====================================================
val ticker = Source
  .tick(1.second, 300.millis, ())
  .buffer(1, OverflowStrategy.backpressure)
  .mapConcat { _ =>
    // listFiles() returns an Array; convert it to an immutable List for mapConcat
    Paths.get("C:\\Users\\Test_Dir").toFile.listFiles().toList
  }
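One thing to note: File.listFiles() returns null when the directory is missing or unreadable, which would fail the stream inside mapConcat. A minimal guard could look like this (the helper name listDir is hypothetical):

// Hypothetical helper: fall back to an empty list when listFiles() returns null
def listDir(dir: String): List[File] =
  Option(Paths.get(dir).toFile.listFiles()).map(_.toList).getOrElse(Nil)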
Logic for processing files one by one and deleting each once done
==================================================================
def processFile(): Flow[File, String, NotUsed] = {
  Flow[File]
    .flatMapConcat { file =>
      FileIO
        .fromPath(Paths.get(file.getAbsolutePath))
        .mapMaterializedValue { f =>
          // f is the Future[IOResult] of the read; onComplete needs an implicit ExecutionContext
          f.onComplete {
            case Success(r) =>
              if (r.count > 0 && r.status.isSuccess) {
                Files.delete(file.toPath)
              }
            case Failure(e) => println(s"Something went wrong when reading: $e")
          }
          NotUsed
        }
        .recover {
          case _: Exception => ByteString("")
        }
        .via(Framing.delimiter(ByteString("\n"), Int.MaxValue, allowTruncation = true))
        .map(_.utf8String)
        .mapAsync(1)(FileSerdes.deserialize(file.getAbsolutePath)) // custom per-line deserialization
    }
}
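As an aside, processFile() can be exercised in isolation like this (the file path is just a placeholder):

// Hypothetical smoke test: read a single file through processFile() and print each line
Source
  .single(new File("/mnt/nfs/readings/sample.txt"))
  .via(processFile())
  .runForeach(println)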
Graph builder
===========
val parallelism = 2 // configurable; must match the Balance/Merge port counts below
val graphBuilder = GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val balance = builder.add(Balance[File](parallelism))
  val merge   = builder.add(Merge[String](parallelism))
  (1 to parallelism).foreach { _ =>
    balance ~> processFile() ~> merge
  }
  FlowShape(balance.in, merge.out)
}
Sink definition
==================
val storeReadings = Sink
  .foreach[String](println) // println as a stand-in; it will be Kafka eventually
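Since the sink will eventually write to Kafka, here is a minimal sketch of what that could look like with Alpakka Kafka (the bootstrap servers and topic name are placeholders, not part of the original code):

import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

// Hypothetical Kafka sink: each line becomes a String record on a placeholder topic
val producerSettings =
  ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")

val kafkaSink = Flow[String]
  .map(line => new ProducerRecord[String, String]("readings-topic", line))
  .to(Producer.plainSink(producerSettings))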
Materializing the graph
=======================
val result2 = ticker
  .via(graphBuilder)
  .runWith(storeReadings)
Now, the following are the challenges I'm running into
=======================================================
- The application has to be deployed on multiple nodes, each watching the same NFS mount for files.
- Once files arrive, all the nodes list the same files, which results in duplicates at the sink (this is avoided to some extent because we delete each file as soon as it is read, but not completely).
- Assume we have a large number of files to process (say 3k). The ticker source re-lists files that are still left in the directory, both within the same node and across nodes, which again results in duplicates.
- Exception handling: assume a file is processed and gets deleted. The other nodes will also try to process the same file and hit a NoSuchFileException. I'm handling these scenarios with the recover block; will that result in message loss? (A narrowed version of that recover block is sketched after this list.)
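For reference, the recover block in processFile() currently maps every exception to an empty ByteString. Narrowing it to the "another node already deleted the file" case would look roughly like this; note this is only a sketch, and depending on the Akka version the failure may arrive wrapped (e.g. in an IOOperationIncompleteException), so the match may need adjusting:

import java.nio.file.NoSuchFileException

// Sketch: treat "file already deleted by another node" as an empty file,
// while letting any other read failure propagate instead of being swallowed
def fileBytes(path: String): Source[ByteString, _] =
  FileIO
    .fromPath(Paths.get(path))
    .recover {
      case _: NoSuchFileException => ByteString.empty
    }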
Your help will be much appreciated.