How do we know when Heritrix completes a crawl job?

March 15, 2015

In our application, Heritrix is used as the crawl engine. Once a crawl job has finished, we manually kick off an endpoint that downloads PDFs from the website. We would like to automate the PDF download so that it starts as soon as the crawl job is complete. Does Heritrix provide a URI or web service method that returns the status of a job, or do we need to create a polling app that continuously monitors the job status?

I don't know whether there is an option that avoids continuous monitoring, but you can use the Heritrix API to get the status of a job, something like:

curl -v -d "action=" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

This returns XML from which you can read the job status.
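
If you do end up polling, the loop could look roughly like this in Python. This is only a sketch under several assumptions: that the server uses HTTP digest auth with a self-signed certificate (which is what --anyauth and -k in the curl call suggest), that the job XML contains a crawlControllerState element, and that its value is FINISHED once the crawl is done. The function names are made up for illustration, so check the XML your Heritrix version actually returns before relying on it.

```python
import ssl
import time
import urllib.request
import xml.etree.ElementTree as ET

JOB_URL = "https://localhost:8443/engine/job/myjob"  # same URL as the curl call


def fetch_job_xml(url=JOB_URL, user="admin", password="admin"):
    """GET the job page as XML (assumes digest auth and a self-signed cert)."""
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # like curl -k: skip certificate checks
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=ctx),
        urllib.request.HTTPDigestAuthHandler(mgr),
    )
    req = urllib.request.Request(url, headers={"Accept": "application/xml"})
    with opener.open(req, timeout=30) as resp:
        return resp.read().decode("utf-8")


def extract_state(xml_text, tag="crawlControllerState"):
    """Return the text of the first <tag> element, or None if absent/empty."""
    node = ET.fromstring(xml_text).find(f".//{tag}")
    if node is None or not node.text:
        return None
    return node.text.strip()


def wait_until_finished(fetch=fetch_job_xml, interval=30.0, max_polls=None,
                        sleep=time.sleep):
    """Poll until the state is FINISHED; give up after max_polls if set."""
    polls = 0
    while max_polls is None or polls < max_polls:
        if extract_state(fetch()) == "FINISHED":
            return True
        polls += 1
        sleep(interval)
    return False
```

Once wait_until_finished() returns True you can call whatever endpoint starts the PDF download. The fetch and sleep parameters exist only so the loop can be tried out without a running Heritrix instance.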

Another, maybe easier (though less 'professional') option is to check whether the job's warcs directory contains a file with the .open extension. If it does not, the job has finished.
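
A rough Python sketch of that directory check follows. The function name is hypothetical and the path to the warcs directory is something you have to supply for your own job layout. It also adds one guard that goes beyond the suggestion above: a missing or empty directory is treated as "not finished", since a crawl that has not written anything yet has no .open file either.

```python
from pathlib import Path


def crawl_looks_finished(warcs_dir):
    """Heuristic: True if the directory has files and none end in .open."""
    d = Path(warcs_dir)
    if not d.is_dir():
        return False  # directory not created yet
    files = [p for p in d.iterdir() if p.is_file()]
    if not files:
        return False  # nothing written yet, so we cannot tell
    # A file still ending in .open is assumed to be one Heritrix is writing.
    return not any(p.name.endswith(".open") for p in files)
```

Keep in mind this is only a heuristic: a paused job, or the moment between closing one file and opening the next, may also leave the directory without any .open file, so combining it with the API status is safer.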
