Python - Incomplete Data (Web Scraping)


I have this code:

from bs4 import BeautifulSoup
import urllib2
import re
import sys

main_url = "http://sukhansara.com/سخن-سرا-پر-خوش-آمدید/newposts/parveenshakir/psghazals/"
test_url = urllib2.urlopen(main_url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml, "html.parser")
# collect the plain <div> wrappers inside the index page's entry content
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
count = 1
fobj = open('d:\scrapping\parveen_again2.xml', 'w')

for getting in url:
    url = getting.find('a')
    if url.has_attr('href'):
        urls = url['href']
        test_url = urllib2.urlopen(urls, timeout=36)
        readHtml = test_url.read()
        test_url.close()

        soup1 = BeautifulSoup(readHtml, "html.parser")
        title = soup1.find('title')
        title = title.get_text('+')
        title = title.split("|")

        author = soup1.find('div', attrs={"class": "entry-meta"}).find('span', attrs={"class": "categories-links"})
        author = author.findAll('a')

        fobj.write("<add><doc>\n")
        fobj.write("<field name=\"id\">sukhansara.com_pg1author" + author[0].string.encode('utf8') + "count" + str(count) + "</field>\n")
        fobj.write("<field name=\"title\">" + title[0].encode('utf8') + "</field>\n")
        fobj.write("<field name=\"content\">")
        count += 1

        # walk the verse <div>s until we hit the author line again
        poetry = soup1.find('div', attrs={"class": "entry-content"}).findAll('div')
        x = 1
        check = True
        while check:
            if poetry[x + 1].string.encode('utf8') != author[0].string.encode('utf8'):
                fobj.write(poetry[x].string.encode('utf8') + "|")
                x += 1
            else:
                check = False
        fobj.write(poetry[x].string.encode('utf8'))

        fobj.write("</field>\n")
        fobj.write("<field name=\"group\">ur_poetry</field>\n")
        fobj.write("<field name=\"author\">" + author[0].string.encode('utf8') + "</field>\n")
        fobj.write("<field name=\"url\">" + urls + "</field>\n")
        fobj.write("<add><doc>\n\n")

fobj.close()
print "done printing"

Sometimes it scrapes 24 poems from 24 URLs, sometimes 81, but there are 100 URLs in total. Every time it reaches the 81st URL, this error occurs:

AttributeError: 'NoneType' object has no attribute 'encode'

Or sometimes I get a timeout error. What am I doing wrong?
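For context: in BeautifulSoup, tag.string is None whenever a tag contains more than one child, so a bare <div> wrapping several elements will trigger exactly this AttributeError when .encode() is called on it. A minimal guard might look like the sketch below (the safe_text helper is hypothetical, not part of the original code):

def safe_text(tag):
    # tag.string is None for tags with multiple children;
    # fall back to get_text() so .encode() always has a string.
    if tag is None:
        return u""
    text = tag.string if tag.string is not None else tag.get_text()
    return text.encode('utf8')

# e.g. fobj.write(safe_text(poetry[x]) + "|") instead of
# fobj.write(poetry[x].string.encode('utf8') + "|")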

Switching to requests and keeping an open session should make it work:

import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    main_url = "http://sukhansara.com/سخن-سرا-پر-خوش-آمدید/newposts/parveenshakir/psghazals/"

    readHtml = session.get(main_url).content
    soup = BeautifulSoup(readHtml, "html.parser")

    # ...
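If timeouts still occur, the session can also be configured to retry transient failures automatically. This is a sketch rather than the answer's code: the retry count, backoff factor, and status list below are assumptions to tune, and the Retry helper comes from urllib3, which requests depends on:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures up to 3 times with backoff
# (these numbers are assumptions, adjust as needed).
retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

# Per-request timeout, reusing the 36-second value from the question.
response = session.get(main_url, timeout=36)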
