Python - Incomplete Data (Web Scraping)
I have this code:
from bs4 import BeautifulSoup
import urllib2
import re
import sys

main_url = "http://sukhansara.com/سخن-سرا-پر-خوش-آمدید/newposts/parveenshakir/psghazals/"
test_url = urllib2.urlopen(main_url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml, "html.parser")
# each ghazal link sits in an un-classed div inside the entry content
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
count = 1
fobj = open('D:\scrapping\parveen_again2.xml', 'w')

for getting in url:
    url = getting.find('a')
    if url.has_attr('href'):
        urls = url['href']
        test_url = urllib2.urlopen(urls, timeout=36)
        readHtml = test_url.read()
        test_url.close()
        soup1 = BeautifulSoup(readHtml, "html.parser")
        title = soup1.find('title')
        title = title.get_text('+')
        title = title.split("|")
        author = soup1.find('div', attrs={"class": "entry-meta"}).find('span', attrs={"class": "categories-links"})
        author = author.findAll('a')
        fobj.write("<add><doc>\n")
        fobj.write("<field name=\"id\">sukhansara.com_pg1author" + author[0].string.encode('utf8') + "count" + str(count) + "</field>\n")
        fobj.write("<field name=\"title\">" + title[0].encode('utf8') + "</field>\n")
        fobj.write("<field name=\"content\">")
        count += 1
        poetry = soup1.find('div', attrs={"class": "entry-content"}).findAll('div')
        x = 1
        check = True
        # copy lines of the poem until the author name is reached
        while check:
            if poetry[x + 1].string.encode('utf8') != author[0].string.encode('utf8'):
                fobj.write(poetry[x].string.encode('utf8') + "|")
                x += 1
            else:
                check = False
        fobj.write(poetry[x].string.encode('utf8'))
        fobj.write("</field>\n")
        fobj.write("<field name=\"group\">ur_poetry</field>\n")
        fobj.write("<field name=\"author\">" + author[0].string.encode('utf8') + "</field>\n")
        fobj.write("<field name=\"url\">" + urls + "</field>\n")
        fobj.write("</doc></add>\n\n")

fobj.close()
print "done printing"
Sometimes I get the poetry from only 24 URLs, sometimes 81, but there are 100 URLs in total. And every time it reaches 81, this error occurs:
AttributeError: 'NoneType' object has no attribute 'encode'
Or sometimes I get a timeout error. What am I doing wrong?
Switching to requests and keeping one open session for all the requests should make it work:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    main_url = "http://sukhansara.com/سخن-سرا-پر-خوش-آمدید/newposts/parveenshakir/psghazals/"
    readHtml = session.get(main_url).content
    soup = BeautifulSoup(readHtml, "html.parser")
    # ...
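The Session reuses the underlying TCP connection across requests (connection pooling), which is what cuts down on the timeouts. The AttributeError is a separate issue: BeautifulSoup's .string is None whenever a tag has anything other than a single string child, so it needs a guard before calling .encode. Below is a minimal sketch of how both fit into your loop; the 30-second timeout is an illustrative choice, not something your page requires:

import requests
from bs4 import BeautifulSoup

main_url = "http://sukhansara.com/سخن-سرا-پر-خوش-آمدید/newposts/parveenshakir/psghazals/"

with requests.Session() as session:
    # one Session for every request, so the TCP connection is reused
    soup = BeautifulSoup(session.get(main_url, timeout=30).content, "html.parser")
    links = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
    for div in links:
        link = div.find('a')
        if link is None or not link.has_attr('href'):
            continue
        page = BeautifulSoup(session.get(link['href'], timeout=30).content, "html.parser")
        for line in page.find('div', attrs={"class": "entry-content"}).findAll('div'):
            # .string is None for divs with nested children -- the likely
            # source of your AttributeError -- so check it before encoding
            if line.string is not None:
                text = line.string.encode('utf8')
                # ... write `text` into the XML file as before

The title, author, and XML-writing logic stay exactly as in your script; only the fetching and the None-check change.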