Initial state for a dataflow job -

August 15, 2011

i'm trying figure out how "seed" window state of our streaming dataflow jobs. scenario have stream of forum messages, want emit running count of messages each topic time, have streaming dataflow job global window , triggers emit each time record topic comes in. far. prior stream source, have large file we'd process our historical counts, also, because topics live forever, need historical count inform outputs stream source, kind've need same logic run on file, start running on stream source when file exhausted, while keeping window state.

current ideas:

write custom unbounded source that. reads on file until it's exhausted , starts reading stream. not fun because writing custom sources not fun.
run logic in batch mode on file, , last step emit state stream sink somehow, have streaming version of logic start reads both state stream , data stream, , somehow combines two. seems make sense, not sure how make sure streaming job reads state source, initialise, before reading data stream.
pipe historical data stream, write job reads both streams. same problems second solution, not sure how make sure 1 stream "consumed" first.

edit: latest option, , we're going with, write calculation job such doesn't matter @ order events arrive in, we'll push archive pub/sub topic , work. works in case, affects downstream consumer (need either support updates or retractions) i'd interested know other solutions people have seeding window states.

you can suggested in bullet point 2 --- run 2 pipelines (in same main), first populates pubsub topic large file. similar streamingwordextract example does.

Search This Blog

Color

Initial state for a dataflow job -

Comments

Post a Comment

Popular posts from this blog

android - net_scheduler holding wakelock -

sql - MySQL : Getting Entries from a many-to-many table -

java - Retrieving data from database using jsp (Hibernate + Spring + Maven) -