Empty CSV in web scraping - Python
I am trying to create a CSV file for each of the tables that appear in each link of this page (http://www.admision.unmsm.edu.pe/admisionsabado/a.html, the same URL the code opens). The page contains 36 links, so 36 CSV files should be generated. When I run the code, the 36 CSV files are created, but they are all empty. The code is below:
    import csv
    import urllib2
    from bs4 import BeautifulSoup

    first = urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/a.html").read()
    soup = BeautifulSoup(first)

    w = []
    for q in soup.find_all('tr'):
        for link in q.find_all('a'):
            w.append(link["href"])

    l = []
    for t in w:
        l.append(t.replace(".", "", 1))

    def record(part):
        url = "http://www.admision.unmsm.edu.pe/admisionsabado{}".format(part)
        u = urllib2.urlopen(url)
        try:
            html = u.read()
        finally:
            u.close()
        soup = BeautifulSoup(html)
        c = []
        for n in soup.find_all('center'):
            for b in n.find_all('a')[2:]:
                c.append(b.text)
        t = (len(c)) / 2
        part = part[:-6]
        name = part.replace("/", "")
        with open('{}.csv'.format(name), 'wb') as f:
            writer = csv.writer(f)
            for i in range(t):
                url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
                u = urllib2.urlopen(url)
                try:
                    html = u.read()
                finally:
                    u.close()
                soup = BeautifulSoup(html)
                for tr in soup.find_all('tr')[1:]:
                    tds = tr.find_all('td')
                    row = [elem.text.encode('utf-8') for elem in tds[:6]]
                    writer.writerow(row)
With a for loop, I run the function just created to generate one CSV per link:

    for n in l:
        record(n)
EDIT: Following alecxe's advice, I changed the code, and it works OK for the first 2 links; after that, I get the message HTTP Error 404: Not Found. I checked the directory, and there are 2 CSV files created correctly. Here's the code:
    import csv
    import urllib2
    from bs4 import BeautifulSoup

    def record(part):
        soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado{}".format(part)))
        c = []
        for n in soup.find_all('center'):
            for b in n.find_all('a')[1:]:
                c.append(b.text)
        t = (len(links)) / 2
        part = part[:-6]
        name = part.replace("/", "")
        with open('{}.csv'.format(name), 'wb') as f:
            writer = csv.writer(f)
            for i in range(t):
                url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
                soup = BeautifulSoup(urllib2.urlopen(url))
                for tr in soup.find_all('tr')[1:]:
                    tds = tr.find_all('td')
                    row = [elem.text.encode('utf-8') for elem in tds[:6]]
                    writer.writerow(row)

    soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/a.html"))
    links = [tr.a["href"].replace(".", "", 1) for tr in soup.find_all('tr')]

    for link in links:
        record(link)
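Regarding the HTTP Error 404: Not Found mentioned in the edit, urllib2.urlopen raises urllib2.HTTPError when a page does not exist, which aborts the whole run. A minimal sketch (my addition, not part of the original post; open_or_skip is a hypothetical helper) that skips missing pages instead of crashing:

    import urllib2

    def open_or_skip(url):
        # urlopen raises HTTPError for 404s and other HTTP-level failures;
        # return None so the caller can just move on to the next page.
        try:
            return urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            print "skipping {}: {}".format(url, e)
            return None

Inside record(), each urllib2.urlopen(url) call could then go through this helper, with a check for None before handing the response to BeautifulSoup.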
soup.find_all('center') finds nothing.
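To see why that produces empty files: find_all returns an empty list when the tag is absent, so c stays empty, t ends up 0, and the writer loop never runs, even though the CSV file itself is still created. A tiny self-contained illustration (the HTML string is made up for the demo):

    from bs4 import BeautifulSoup

    # A page with a table but no <center> tag, like the ones being scraped.
    html = "<table><tr><td><a href='x.html'>x</a></td></tr></table>"
    soup = BeautifulSoup(html)

    print soup.find_all('center')           # [] -- nothing to iterate over
    print len(soup.find_all('center')) / 2  # 0  -- so range(t) is empty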
Replace:

    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[2:]:
            c.append(b.text)
with:

    c = [link.text for link in soup.find('table').find_all('a')[2:]]
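A quick illustration of what that selector picks up (made-up HTML for the demo):

    from bs4 import BeautifulSoup

    # Four links inside the first table; the [2:] slice drops the first two.
    html = ("<table><tr><td><a>skip1</a><a>skip2</a>"
            "<a>keep1</a><a>keep2</a></td></tr></table>")
    soup = BeautifulSoup(html)

    c = [link.text for link in soup.find('table').find_all('a')[2:]]
    print c  # [u'keep1', u'keep2']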
Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor:

    soup = BeautifulSoup(urllib2.urlopen(url))
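For example, the fetch inside record() collapses to a single line (a sketch using the same index URL as the question):

    import urllib2
    from bs4 import BeautifulSoup

    url = "http://www.admision.unmsm.edu.pe/admisionsabado/a.html"

    # No need for u = urlopen(url); html = u.read(); u.close():
    # BeautifulSoup reads from the file-like response object itself.
    soup = BeautifulSoup(urllib2.urlopen(url))
    print soup.title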
Also, since there is one link per row, you can simplify the way you get the list of links. Instead of:

    w = []
    for q in soup.find_all('tr'):
        for link in q.find_all('a'):
            w.append(link["href"])

do this:

    links = [tr.a["href"] for tr in soup.find_all('tr')]
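One caveat (my addition, not from the original answer): tr.a is None for any row that has no link, and None["href"] raises a TypeError, so a guard may be needed:

    from bs4 import BeautifulSoup

    html = ("<table><tr><td>no link here</td></tr>"
            "<tr><td><a href='p1.html'>p1</a></td></tr></table>")
    soup = BeautifulSoup(html)

    # The 'if tr.a' guard keeps only rows that actually contain an <a> tag.
    links = [tr.a["href"] for tr in soup.find_all('tr') if tr.a]
    print links  # only the row with a link survives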
Also, pay attention to how you name your variables and to your code formatting; for Python, the usual reference is the PEP 8 style guide.