Empty CSV in web scraping - Python
I am trying to create a CSV file for each of the tables that appear in each link of this page (http://www.admision.unmsm.edu.pe/admisionsabado/a.html, the same URL the code opens). The page contains 36 links, so 36 CSV files should be generated. When I run the code, the 36 CSV files are created, but they are all empty. The code is below:
    import csv
    import urllib2
    from bs4 import BeautifulSoup

    first = urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/a.html").read()
    soup = BeautifulSoup(first)

    w = []
    for q in soup.find_all('tr'):
        for link in q.find_all('a'):
            w.append(link["href"])

    l = []
    for t in w:
        l.append(t.replace(".", "", 1))

    def record(part):
        url = "http://www.admision.unmsm.edu.pe/admisionsabado{}".format(part)
        u = urllib2.urlopen(url)
        try:
            html = u.read()
        finally:
            u.close()
        soup = BeautifulSoup(html)
        c = []
        for n in soup.find_all('center'):
            for b in n.find_all('a')[2:]:
                c.append(b.text)
        t = (len(c)) / 2
        part = part[:-6]
        name = part.replace("/", "")
        with open('{}.csv'.format(name), 'wb') as f:
            writer = csv.writer(f)
            for i in range(t):
                url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
                u = urllib2.urlopen(url)
                try:
                    html = u.read()
                finally:
                    u.close()
                soup = BeautifulSoup(html)
                for tr in soup.find_all('tr')[1:]:
                    tds = tr.find_all('td')
                    row = [elem.text.encode('utf-8') for elem in tds[:6]]
                    writer.writerow(row)
With a for loop, I run the function just created to generate one CSV per link:

    for n in l:
        record(n)
EDIT: Following alecxe's advice, I changed the code, and it works OK for the first 2 links; after that, I get the message HTTP Error 404: Not Found. I checked the directory, and there are 2 CSV files created correctly. Here's the code:
    import csv
    import urllib2
    from bs4 import BeautifulSoup

    def record(part):
        soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado{}".format(part)))
        c = []
        for n in soup.find_all('center'):
            for b in n.find_all('a')[1:]:
                c.append(b.text)
        t = (len(links)) / 2
        part = part[:-6]
        name = part.replace("/", "")
        with open('{}.csv'.format(name), 'wb') as f:
            writer = csv.writer(f)
            for i in range(t):
                url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
                soup = BeautifulSoup(urllib2.urlopen(url))
                for tr in soup.find_all('tr')[1:]:
                    tds = tr.find_all('td')
                    row = [elem.text.encode('utf-8') for elem in tds[:6]]
                    writer.writerow(row)

    soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/a.html"))
    links = [tr.a["href"].replace(".", "", 1) for tr in soup.find_all('tr')]

    for link in links:
        record(link)
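Regarding the HTTP Error 404: Not Found mentioned in the edit, urllib2.urlopen raises urllib2.HTTPError when a page does not exist, which aborts the whole run. A minimal sketch (my addition, not part of the original post; open_or_skip is a hypothetical helper) that skips missing pages instead of crashing:

    import urllib2

    def open_or_skip(url):
        # urlopen raises HTTPError for 404s and other HTTP-level failures;
        # return None so the caller can just move on to the next page.
        try:
            return urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            print "skipping {}: {}".format(url, e)
            return None

Inside record(), each urllib2.urlopen(url) call could then go through this helper, with a check for None before handing the response to BeautifulSoup.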
soup.find_all('center') finds nothing.
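To see why that produces empty files: find_all returns an empty list when the tag is absent, so c stays empty, t ends up 0, and the writer loop never runs, even though the CSV file itself is still created. A tiny self-contained illustration (the HTML string is made up for the demo):

    from bs4 import BeautifulSoup

    # A page with a table but no <center> tag, like the ones being scraped.
    html = "<table><tr><td><a href='x.html'>x</a></td></tr></table>"
    soup = BeautifulSoup(html)

    print soup.find_all('center')           # [] -- nothing to iterate over
    print len(soup.find_all('center')) / 2  # 0  -- so range(t) is empty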
Replace:

    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[2:]:
            c.append(b.text)
with:

    c = [link.text for link in soup.find('table').find_all('a')[2:]]
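A quick illustration of what that selector picks up (made-up HTML for the demo):

    from bs4 import BeautifulSoup

    # Four links inside the first table; the [2:] slice drops the first two.
    html = ("<table><tr><td><a>skip1</a><a>skip2</a>"
            "<a>keep1</a><a>keep2</a></td></tr></table>")
    soup = BeautifulSoup(html)

    c = [link.text for link in soup.find('table').find_all('a')[2:]]
    print c  # [u'keep1', u'keep2']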
Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor:

    soup = BeautifulSoup(urllib2.urlopen(url))
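For example, the fetch inside record() collapses to a single line (a sketch using the same index URL as the question):

    import urllib2
    from bs4 import BeautifulSoup

    url = "http://www.admision.unmsm.edu.pe/admisionsabado/a.html"

    # No need for u = urlopen(url); html = u.read(); u.close():
    # BeautifulSoup reads from the file-like response object itself.
    soup = BeautifulSoup(urllib2.urlopen(url))
    print soup.title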
Also, since there is one link per row, you can simplify the way you get the list of links. Instead of:

    w = []
    for q in soup.find_all('tr'):
        for link in q.find_all('a'):
            w.append(link["href"])

do this:

    links = [tr.a["href"] for tr in soup.find_all('tr')]
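One caveat (my addition, not from the original answer): tr.a is None for any row that has no link, and None["href"] raises a TypeError, so a guard may be needed:

    from bs4 import BeautifulSoup

    html = ("<table><tr><td>no link here</td></tr>"
            "<tr><td><a href='p1.html'>p1</a></td></tr></table>")
    soup = BeautifulSoup(html)

    # The 'if tr.a' guard keeps only rows that actually contain an <a> tag.
    links = [tr.a["href"] for tr in soup.find_all('tr') if tr.a]
    print links  # only the row with a link survives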
Also, pay attention to how you name your variables and to your code formatting; for Python, the usual reference is the PEP 8 style guide.