How do we know when Heritrix completes a crawl job?
In our application, Heritrix is used as the crawl engine. Once a crawl job has finished, we manually kick off an endpoint that downloads the PDFs from the website. We would like to automate this PDF download so that it starts as soon as the crawl job is complete. Does Heritrix provide a URI or web-service method that returns the status of a job, or do we need to create a polling application that continuously monitors the job status?
I don't know whether there is an option that avoids continuous monitoring, but you can use the Heritrix API to get the status of a job, something like:
curl -v -d "action=" -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/myjob
This returns XML from which you can read the job status.
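If you do end up polling, the curl call above could be wrapped in a small script. What follows is only a rough sketch, not a tested Heritrix client. It assumes that a plain GET with the same Accept header returns the same XML report, that the report contains a crawlControllerState element, and that its value becomes FINISHED when the crawl is done. Check these against the XML your own instance returns. The function names are made up for illustration.

```python
import ssl
import time
import urllib.request
import xml.etree.ElementTree as ET

JOB_URL = "https://localhost:8443/engine/job/myjob"  # same URL as the curl command


def fetch_job_xml(url=JOB_URL, user="admin", password="admin"):
    """Fetch the job report as XML, using digest auth (what --anyauth negotiates)."""
    pwd_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    pwd_mgr.add_password(None, url, user, password)
    # Equivalent of curl -k: do not verify the (usually self-signed) certificate.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=ctx),
        urllib.request.HTTPDigestAuthHandler(pwd_mgr),
    )
    req = urllib.request.Request(url, headers={"Accept": "application/xml"})
    with opener.open(req, timeout=30) as resp:
        return resp.read().decode("utf-8")


def job_state(xml_text, tag="crawlControllerState"):
    """Return the text of the first <tag> element, or None if it is missing.

    The element name is an assumption; adjust it to match your job XML.
    """
    node = ET.fromstring(xml_text).find(f".//{tag}")
    return node.text.strip() if node is not None and node.text else None


def wait_until_finished(fetch=fetch_job_xml, poll_seconds=60, sleep=time.sleep):
    """Poll until the reported state is FINISHED (assumed terminal value)."""
    while job_state(fetch()) != "FINISHED":
        sleep(poll_seconds)
```

After wait_until_finished() returns, the script can call the PDF download endpoint. The fetch and sleep parameters exist only so the loop can be exercised without a running Heritrix instance.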
Another, maybe easier (though less 'professional') option is to check whether the job's warcs directory contains a file with the .open extension. If it does not, the job has finished.
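The directory check could look roughly like this. It is a minimal sketch: the function names and the directory path you pass in are placeholders, and the extra requirement that at least one file has already been written is my own addition, so that a job which has not started writing yet is not mistaken for a finished one.

```python
from pathlib import Path


def has_open_warcs(warcs_dir):
    """True if any file ending in .open exists under warcs_dir."""
    return any(Path(warcs_dir).rglob("*.open"))


def looks_finished(warcs_dir):
    """Heuristic: some files were written and none is still marked .open."""
    files = [p for p in Path(warcs_dir).rglob("*") if p.is_file()]
    return bool(files) and not any(p.suffix == ".open" for p in files)
```

Call looks_finished() with the path to the job's warcs directory from a scheduled task or a loop, and trigger the PDF download once it returns True.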