Web scraping in Java/Scala


I need to extract the keywords, title, and description of a long list of URLs (initially ~250,000 URLs per day, eventually ~15,000,000 URLs per day).

How would you recommend executing this? Preferably a solution that can be extended to 15,000,000 events per day, and preferably in Scala or Java.
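Whatever framework ends up doing the fetching, the per-URL extraction itself is small. As a point of reference, here is a minimal sketch using Jsoup (my assumption; it is not one of the options below, but most scraping libraries wrap a similar HTML parser). The user agent and URL are placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageMetaExtractor {

    // Value holder for the three fields needed per URL.
    record PageMeta(String title, String description, String keywords) {}

    static PageMeta extract(String url) throws Exception {
        // Fetch and parse the page; the timeout keeps slow hosts from stalling a worker.
        Document doc = Jsoup.connect(url)
                .userAgent("my-crawler/1.0") // placeholder user agent
                .timeout(10_000)
                .get();

        // select() returns an empty Elements collection when the tag is missing,
        // and attr() on an empty collection returns "", so no null checks are needed.
        return new PageMeta(
                doc.title(),
                doc.select("meta[name=description]").attr("content"),
                doc.select("meta[name=keywords]").attr("content"));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("https://example.com/")); // placeholder URL
    }
}
```

At 15M URLs/day the hard part is not this parsing step but scheduling, politeness, and fault tolerance, which is where the options below differ.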

So far I've looked at:

  • Spray - I'm not familiar with Spray yet, so I can't quite evaluate it. Is it a useful framework for this task?
  • Vert.x - I've worked with Vert.x before. If this is the right framework, can you explain the best way to implement this with Vert.x?
  • scala-scraper - I'm not familiar with it at all. Is it the right framework for this use case and the loads I need?
  • Nutch - I'm not sure how it would work if I want to use it inside my code, and I'm not sure I need Solr for my use case. Has anyone had experience with it?

I'll be happy to hear of other options if you think they're better.

I know I can dig into each of these solutions and decide whether it fits or not, but there seem to be so many options that some direction would be appreciated.

Thanks in advance.

We use StormCrawler for our search engine, stolencamerafinder. It's in Java, and I've clocked it fetching over 4 million URLs per day with a politeness setting of 1 URL per second per host. The bottleneck wasn't StormCrawler, it was URL diversity. The per-host part is important: it never fetches more than 1 URL per second from each host (technically it leaves 1 second of rest between fetches). For example, if you had 60 URLs from yahoo.com/* and 100 million from flickr.com/*, it would still never exceed 120 fetches/min.
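That per-host rule is easy to make concrete. StormCrawler enforces it internally, so the following is only a hypothetical sketch of the arithmetic: remember the last fetch time per host and sleep until a full second has elapsed.

```java
import java.net.URI;
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a 1-URL-per-second-per-host politeness gate.
public class PerHostThrottle {

    private static final Duration MIN_GAP = Duration.ofSeconds(1);
    private final Map<String, Instant> lastFetch = new HashMap<>();

    // Blocks until at least MIN_GAP has passed since the last fetch to this URL's host.
    public void acquire(String url) throws InterruptedException {
        String host = URI.create(url).getHost();
        while (true) {
            long waitMs;
            synchronized (this) {
                Instant now = Instant.now();
                Instant next = lastFetch.getOrDefault(host, Instant.MIN).plus(MIN_GAP);
                if (!now.isBefore(next)) {
                    lastFetch.put(host, now); // claim this host's slot
                    return;
                }
                waitMs = Math.max(1, Duration.between(now, next).toMillis());
            }
            Thread.sleep(waitMs); // sleep outside the lock, then retry
        }
    }
}
```

Under this gate, 60 yahoo.com/* URLs and 100 million flickr.com/* URLs together still cap at 2 fetches/sec (120/min), which is why URL diversity, not the crawler, ends up being the bottleneck.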

You can index the data straight into Elasticsearch, and that works well. StormCrawler has hooks for it right out of the box, so you should be able to get up and running quite easily.
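To make the indexing step concrete without depending on StormCrawler's Elasticsearch integration, here is a hypothetical sketch that posts one extracted document to Elasticsearch over its plain REST API using the JDK HTTP client. The index name `pages` and the `localhost:9200` endpoint are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsIndexer {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Indexes one page's extracted fields into Elasticsearch via its REST API.
    static void index(String url, String title, String description, String keywords)
            throws Exception {
        String json = String.format(
                "{\"url\":\"%s\",\"title\":\"%s\",\"description\":\"%s\",\"keywords\":\"%s\"}",
                esc(url), esc(title), esc(description), esc(keywords));

        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/pages/_doc")) // placeholder endpoint/index
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();

        HttpResponse<String> resp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() >= 300) {
            throw new RuntimeException("ES indexing failed: " + resp.body());
        }
    }

    // Minimal JSON escaping for the sketch; use a real JSON library in practice.
    private static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

At millions of documents per day you would batch writes through Elasticsearch's `_bulk` endpoint rather than issuing one POST per document.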

