hadoop - Oozie hourly coordinator timing out on future actions


At the 5-minute mark of every hour, I have the past hour's data loaded into HDFS. I thought I could set up a coordinator job to run at the 10-minute mark of every hour to process this data, while checking that the directory for that hour exists. What ends up happening is that the coordinator performs normally on the past hour's data at the time of submission and continues working fine for the next 2 hours, but then the future actions go from 'WAITING' to 'TIMEDOUT'. I guess there is a default maximum limit on how long an action can stay in 'WAITING'. It seems a bit counterintuitive for a timeout limit to apply to actions scheduled at an absolute future time. Anyway, here's a sample of my coordinator.xml. I'm looking for suggestions on either how to design this in a way that makes more sense, or on how to raise the default timeout.

<datasets>
    <dataset name="hourly_cl" frequency="${coord:hours(1)}" initial-instance="2016-02-08T11:10Z" timezone="PST">
        <uri-template>hdfs://user/tzl/warehouse/incoming/logmessages.log.${YEAR}${MONTH}${DAY}/${HOUR}/</uri-template>
        <done-flag></done-flag>
    </dataset>
    <dataset name="hourly_cl_out" frequency="${coord:hours(1)}" initial-instance="2016-02-05T11:10Z" timezone="PST">
        <uri-template>hdfs://user/tzl/warehouse/output/logmessages.log.${YEAR}${MONTH}${DAY}/${HOUR}/</uri-template>
        <done-flag></done-flag>
    </dataset>
</datasets>

<input-events>
    <data-in name="coordinput1" dataset="hourly_cl">
        <instance>${coord:current(-1)}</instance>
    </data-in>
</input-events>
<output-events>
    <data-out name="clout" dataset="hourly_cl_out">
        <instance>${coord:current(-1)}</instance>
    </data-out>
</output-events>

<action>
    <workflow>
        <app-path>${apppath}</app-path>
        <configuration>
            <property>
                <name>inputpath</name>
                <value>${coord:dataIn('coordinput1')}</value>
            </property>
            <property>
                <name>outputpath</name>
                <value>${coord:dataOut('clout')}</value>
            </property>
        </configuration>
    </workflow>
</action>

I also noticed, while looking at the logs, that Oozie checks each data directory every minute. In other words, at 18:01 it'll check whether these exist:

logmessages.log.20160208/18

logmessages.log.20160208/19

logmessages.log.20160208/20

logmessages.log.20160208/21

...

and at 18:02 it'll again check:

logmessages.log.20160208/18

logmessages.log.20160208/19

logmessages.log.20160208/20

logmessages.log.20160208/21

...

This is taking unnecessary CPU cycles. I assumed that by setting the frequency to one hour, Oozie would be smart enough not to waste time checking future datasets when I've defined the instance as the past hour's data: current(-1).

I solved this issue with a simple property adjustment, by introducing <controls> under coordinator-app:

<coordinator-app name="cl_test" frequency="${coord:hours(1)}" start="..." end="..." timezone="PST" xmlns="uri:oozie:coordinator:0.2">
    <controls>
        <timeout>1440</timeout>
        <concurrency>2</concurrency>
        <throttle>1</throttle>
    </controls>
    ...
    ...
</coordinator-app>

Specifically, the <throttle> property limits how many actions can be put into 'WAITING' status. By setting it to 1, the timeout only applies to the next action in 'WAITING' status. <timeout> changes the timeout limit (in minutes) for 'WAITING' actions, while I believe <concurrency> limits how many actions can be running at once.
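
For reference, here is the same <controls> block with each setting annotated. This is only a sketch: the values are the ones used above, and the comments reflect one reading of the Oozie coordinator specification (in particular, that <timeout> is expressed in minutes), so they are worth checking against the documentation for your Oozie version.

```xml
<controls>
    <!-- Minutes a materialized action may stay in WAITING before it is
         marked TIMEDOUT; 1440 minutes = 24 hours -->
    <timeout>1440</timeout>
    <!-- Maximum number of actions of this coordinator running at once -->
    <concurrency>2</concurrency>
    <!-- Maximum number of actions allowed in WAITING at the same time;
         with 1, only the next action is materialized ahead of time -->
    <throttle>1</throttle>
</controls>
```

With a throttle of 1, a larger timeout mainly matters when the input directory for the next action arrives late, since no further future actions are sitting in 'WAITING' alongside it.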

