Web scraping in Java/Scala
I need to extract the keywords, title, and description from a long list of URLs (initially ~250,000 URLs per day, eventually ~15,000,000 URLs per day).
How would you recommend executing this? Preferably with a solution that can be extended to 15,000,000 events per day, and preferably in Scala or Java.
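For the per-URL extraction itself, whatever framework ends up doing the fetching and scheduling, I'm picturing something like the minimal jsoup sketch below (jsoup isn't one of the options I list further down; the user agent string and timeout value are placeholder assumptions):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageMetaExtractor {

    // Simple holder for the three fields wanted per URL.
    public record PageMeta(String title, String description, String keywords) {}

    public static PageMeta extract(String url) throws java.io.IOException {
        // Fetch and parse the page; the timeout keeps slow hosts from stalling a worker.
        Document doc = Jsoup.connect(url)
                .userAgent("my-crawler/1.0") // placeholder UA, not a real crawler name
                .timeout(10_000)
                .get();

        String title = doc.title();
        // attr() returns "" when the meta tag is absent, so no null checks needed.
        String description = doc.select("meta[name=description]").attr("content");
        String keywords = doc.select("meta[name=keywords]").attr("content");
        return new PageMeta(title, description, keywords);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("https://example.com/"));
    }
}
```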
So far I've looked at:
- Spray: I'm not familiar with Spray yet, so I can't quite evaluate it. Is it a useful framework for this task?
- Vert.x: I've worked with Vert.x before. If this framework fits, can you explain the best way to implement this with Vert.x?
- scala-scraper: I'm not familiar with it at all. Is this framework suited to the use case and the load I need?
- Nutch: I'm not sure how to use it if I want to drive it from inside my code, and I'm not sure I need Solr for my use case. Has anyone had experience with it?
I'll be happy to hear of other options if you think they're better.
I know I could dig into each of these solutions and decide whether it fits or not, but there seem to be so many options that some direction would be appreciated.
Thanks in advance.
We use StormCrawler for our search engine, StolenCameraFinder. It's in Java, and I've clocked it fetching over 4 million URLs per day with a politeness setting of 1 URL per second per host. The bottleneck wasn't StormCrawler but URL diversity. The per-host part is important: it never fetches more than 1 URL per second from any single host (technically it leaves at least 1 second of rest between fetches). For example, if you had 60 URLs from yahoo.com/* and 100 million from flickr.com/*, you would still never exceed 120 fetches per minute.
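To make the politeness setting concrete, here is a minimal sketch of how that policy maps onto StormCrawler's fetcher configuration (these keys normally live in crawler-conf.yaml; the thread count below is an illustrative assumption, not what we run in production):

```java
import org.apache.storm.Config;

public class PolitenessConf {
    public static Config build() {
        Config conf = new Config();
        // Minimum delay, in seconds, between successive fetches to the same host.
        conf.put("fetcher.server.delay", 1.0);
        // Plenty of fetch threads: with per-host politeness in place, overall
        // throughput is governed by how many distinct hosts are in the queue.
        conf.put("fetcher.threads.number", 200);
        return conf;
    }
}
```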
You can index the data straight into Elasticsearch, and that works well. StormCrawler has hooks for that out of the box, so you should be able to get up and running quite easily.
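As a rough idea of the wiring, here is a sketch of a topology that fetches, parses, and indexes into Elasticsearch (the seed URL and the in-memory spout are toy assumptions for illustration; a real crawl would feed URLs from a persistent status store):

```java
import org.apache.storm.topology.TopologyBuilder;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

public class CrawlTopologySketch {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        // Toy spout holding a single seed URL.
        builder.setSpout("urls", new MemorySpout("https://example.com/"));
        // FetcherBolt enforces the per-host politeness discussed above.
        builder.setBolt("fetch", new FetcherBolt()).shuffleGrouping("urls");
        // JSoupParserBolt extracts title, metadata, and text from fetched pages.
        builder.setBolt("parse", new JSoupParserBolt()).localOrShuffleGrouping("fetch");
        // IndexerBolt from the Elasticsearch module writes parsed docs to ES.
        builder.setBolt("index", new IndexerBolt()).localOrShuffleGrouping("parse");
        return builder;
    }
}
```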