Initial state for a dataflow job -


i'm trying figure out how "seed" window state of our streaming dataflow jobs. scenario have stream of forum messages, want emit running count of messages each topic time, have streaming dataflow job global window , triggers emit each time record topic comes in. far. prior stream source, have large file we'd process our historical counts, also, because topics live forever, need historical count inform outputs stream source, kind've need same logic run on file, start running on stream source when file exhausted, while keeping window state.

current ideas:

  • write custom unbounded source that. reads on file until it's exhausted , starts reading stream. not fun because writing custom sources not fun.
  • run logic in batch mode on file, , last step emit state stream sink somehow, have streaming version of logic start reads both state stream , data stream, , somehow combines two. seems make sense, not sure how make sure streaming job reads state source, initialise, before reading data stream.
  • pipe historical data stream, write job reads both streams. same problems second solution, not sure how make sure 1 stream "consumed" first.

edit: latest option, , we're going with, write calculation job such doesn't matter @ order events arrive in, we'll push archive pub/sub topic , work. works in case, affects downstream consumer (need either support updates or retractions) i'd interested know other solutions people have seeding window states.

you can suggested in bullet point 2 --- run 2 pipelines (in same main), first populates pubsub topic large file. similar streamingwordextract example does.


Comments

Popular posts from this blog

java - pagination of xlsx file to XSSFworkbook using apache POI -

Unlimited choices in BASH case statement -

apache - How do I stop my index.php being run twice for every user -