Web scraping in Java/Scala


I need to extract the keywords, title, and description of a long list of URLs (initially ~250,000 URLs per day, eventually ~15,000,000 URLs per day).

How would you recommend executing this? Preferably a solution that can be extended to 15,000,000 events per day, and preferably in Scala or Java.
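Whatever framework ends up doing the fetching, the per-URL extraction itself is small. As a point of reference, here is a minimal sketch using Jsoup (my assumption; it is not one of the options below, but most scraping libraries wrap a similar HTML parser). The user agent and URL are placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageMetaExtractor {

    // Value holder for the three fields needed per URL.
    record PageMeta(String title, String description, String keywords) {}

    static PageMeta extract(String url) throws Exception {
        // Fetch and parse the page; the timeout keeps slow hosts from stalling a worker.
        Document doc = Jsoup.connect(url)
                .userAgent("my-crawler/1.0") // placeholder user agent
                .timeout(10_000)
                .get();

        // select() returns an empty Elements collection when the tag is missing,
        // and attr() on an empty collection returns "", so no null checks are needed.
        return new PageMeta(
                doc.title(),
                doc.select("meta[name=description]").attr("content"),
                doc.select("meta[name=keywords]").attr("content"));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("https://example.com/")); // placeholder URL
    }
}
```

At 15M URLs/day the hard part is not this parsing step but scheduling, politeness, and fault tolerance, which is where the options below differ.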

So far I've looked at:

  • Spray - I'm not familiar with Spray yet, so I can't quite evaluate it. Is it a useful framework for this task?
  • Vert.x - I've worked with Vert.x before. If this is the right framework, can you explain the best way to implement this with Vert.x?
  • scala-scraper - I'm not familiar with it at all. Is it the right framework for this use case and the loads I need?
  • Nutch - I'm not sure how it would work if I want to use it inside my code, and I'm not sure I need Solr for my use case. Has anyone had experience with it?

I'll be happy to hear of other options if you think they're better.

I know I can dig into each of these solutions and decide whether it fits or not, but there seem to be so many options that some direction would be appreciated.

Thanks in advance.

We use StormCrawler for our search engine, stolencamerafinder. It's in Java, and I've clocked it fetching over 4 million URLs per day with a politeness setting of 1 URL per second per host. The bottleneck wasn't StormCrawler, it was URL diversity. The per-host part is important: it never fetches more than 1 URL per second from each host (technically it leaves 1 second of rest between fetches). For example, if you had 60 URLs from yahoo.com/* and 100 million from flickr.com/*, it would still never exceed 120 fetches/min.
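That per-host rule is easy to make concrete. StormCrawler enforces it internally, so the following is only a hypothetical sketch of the arithmetic: remember the last fetch time per host and sleep until a full second has elapsed.

```java
import java.net.URI;
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a 1-URL-per-second-per-host politeness gate.
public class PerHostThrottle {

    private static final Duration MIN_GAP = Duration.ofSeconds(1);
    private final Map<String, Instant> lastFetch = new HashMap<>();

    // Blocks until at least MIN_GAP has passed since the last fetch to this URL's host.
    public void acquire(String url) throws InterruptedException {
        String host = URI.create(url).getHost();
        while (true) {
            long waitMs;
            synchronized (this) {
                Instant now = Instant.now();
                Instant next = lastFetch.getOrDefault(host, Instant.MIN).plus(MIN_GAP);
                if (!now.isBefore(next)) {
                    lastFetch.put(host, now); // claim this host's slot
                    return;
                }
                waitMs = Math.max(1, Duration.between(now, next).toMillis());
            }
            Thread.sleep(waitMs); // sleep outside the lock, then retry
        }
    }
}
```

Under this gate, 60 yahoo.com/* URLs and 100 million flickr.com/* URLs together still cap at 2 fetches/sec (120/min), which is why URL diversity, not the crawler, ends up being the bottleneck.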

You can index the data straight into Elasticsearch, and that works well. StormCrawler has hooks for it right out of the box, so you should be able to get up and running quite easily.
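To make the indexing step concrete without depending on StormCrawler's Elasticsearch integration, here is a hypothetical sketch that posts one extracted document to Elasticsearch over its plain REST API using the JDK HTTP client. The index name `pages` and the `localhost:9200` endpoint are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsIndexer {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Indexes one page's extracted fields into Elasticsearch via its REST API.
    static void index(String url, String title, String description, String keywords)
            throws Exception {
        String json = String.format(
                "{\"url\":\"%s\",\"title\":\"%s\",\"description\":\"%s\",\"keywords\":\"%s\"}",
                esc(url), esc(title), esc(description), esc(keywords));

        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/pages/_doc")) // placeholder endpoint/index
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();

        HttpResponse<String> resp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() >= 300) {
            throw new RuntimeException("ES indexing failed: " + resp.body());
        }
    }

    // Minimal JSON escaping for the sketch; use a real JSON library in practice.
    private static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

At millions of documents per day you would batch writes through Elasticsearch's `_bulk` endpoint rather than issuing one POST per document.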

