How do we know when Heritrix completes a crawl job?


In our application, Heritrix is used as the crawl engine. Once a crawl job has finished, we manually kick off an endpoint that downloads PDFs from the website. We would like to automate this PDF download so that it runs as soon as the crawl job completes. Does Heritrix provide a URI or web service method that returns the status of a job, or do we need to create a polling application that continuously monitors the job status?

I don't know of an option that avoids continuous monitoring, but you can use the Heritrix API to get the status of a job, something like:

curl -v -d "action=" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob

This returns XML from which you can read the job status.
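To automate this, the same request can be polled from a small script. The following is only a sketch under several assumptions: that the XML contains a crawlControllerState element whose value becomes FINISHED when the crawl ends, that a plain GET with the Accept header returns the same XML as the curl call, and that the engine negotiates HTTP digest authentication. The function names and the polling interval are hypothetical; the URL and credentials are copied from the curl example.

```python
import ssl
import time
import urllib.request
import xml.etree.ElementTree as ET

JOB_URL = "https://localhost:8443/engine/job/myjob"  # same URL as the curl example


def fetch_job_xml(url=JOB_URL, user="admin", password="admin"):
    """Fetch the job page as XML (assumes digest auth, as curl --anyauth would negotiate)."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, user, password)
    # Equivalent of curl -k: do not verify the self-signed certificate.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context),
        urllib.request.HTTPDigestAuthHandler(password_mgr),
    )
    request = urllib.request.Request(url, headers={"Accept": "application/xml"})
    with opener.open(request, timeout=30) as response:
        return response.read().decode("utf-8")


def job_state(xml_text):
    """Return the text of <crawlControllerState> (assumed element name), or '' if absent."""
    root = ET.fromstring(xml_text)
    return (root.findtext(".//crawlControllerState") or "").strip()


def wait_until_finished(fetch=fetch_job_xml, interval=60, sleep=time.sleep):
    """Poll until the reported state is FINISHED, sleeping `interval` seconds between checks."""
    while job_state(fetch()) != "FINISHED":
        sleep(interval)
```

Calling wait_until_finished() and then triggering the PDF download endpoint would replace the manual step. Check the element name and state values against the XML your Heritrix version actually returns before relying on this.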

Another, perhaps easier (though less 'professional') option is to check whether the job's warcs directory contains a file with the .open extension. If it does not, the job has finished.
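The directory check could look roughly like this. It is a minimal sketch: the function name is hypothetical, and the path to the warcs directory depends on where your jobs are stored, so it is passed in as an argument.

```python
from pathlib import Path


def has_open_warcs(warcs_dir):
    """Return True if any file ending in .open exists under warcs_dir."""
    return any(Path(warcs_dir).rglob("*.open"))
```

Note that an empty directory also reports no .open files, so this check only tells you the job has finished once you know the job has started writing WARCs.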

