Recursive Scraping with Python and Scrapy: Information Not Retrieved

May 15, 2010

I am trying to use Scrapy to pull contact information from the Pratt website, but the information is not being retrieved. My code is as follows:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector, Selector
    from scrapy.http import Request

    class ESpider(CrawlSpider):
        name = "pratt"
        allowed_domains = ["pratt.edu"]
        start_urls = ["https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302"]

        rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/div[3]/div/div[2]/div/div/p/a',)),
                      callback="parse_items", follow=True),
        )

        def parse_items(self, response):
            contacts = Selector(response)
            print contacts.xpath('/html/body/div[3]/div/div[2]/table/tbody/tr[2]/td[2]/h3').extract()
            print contacts.xpath('/html/body/div[3]/div/div[2]/table/tbody/tr[2]/td[2]/a').extract()

Beginning on the start URL, I want to go through each person's link and grab the name and email address from the next page. When I run the scraper, I receive the following output:

    2014-03-10 16:46:37-0400 [scrapy] INFO: Scrapy 0.22.2 started (bot: emailspider)
    2014-03-10 16:46:37-0400 [scrapy] INFO: Optional features available: ssl, http11
    2014-03-10 16:46:37-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'emailspider.spiders', 'SPIDER_MODULES': ['emailspider.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'emailspider'}
    2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled item pipelines:
    2014-03-10 16:46:37-0400 [pratt] INFO: Spider opened
    2014-03-10 16:46:37-0400 [pratt] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2014-03-10 16:46:37-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2014-03-10 16:46:37-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2014-03-10 16:46:41-0400 [pratt] DEBUG: Crawled (200) <GET https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302> (referer: None)
    2014-03-10 16:46:44-0400 [pratt] DEBUG: Crawled (200) <GET https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/bio/?id=eabruzzo> (referer: https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302)
    []
    []
    2014-03-10 16:46:47-0400 [pratt] DEBUG: Crawled (200) <GET https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/bio/?id=ehinrich> (referer: https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302)
    []
    []

(I halted the program after a few iterations.) It looks like the pages are being scraped, but empty lists are being returned. Any idea why this is going on? Thanks very much.

Your XPaths weren't correct. They are also fragile: try to avoid using absolute XPaths in a crawler. It is better to find a container with a unique id or class name and rely on it.
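To illustrate why a position-based absolute path tends to return an empty list while an id-anchored path does not, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy's `Selector`, so it runs without Scrapy installed, and the `PAGE` markup is a made-up stand-in, not the real Pratt page. One frequent cause of empty results (an assumption here, not something confirmed for this site) is a `tbody` step copied from a browser inspector: browsers insert `tbody` when rendering, but it is often absent from the raw HTML the crawler downloads.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a faculty bio page (not the real markup).
# The raw source has no <tbody>; browsers typically add one when rendering.
PAGE = """
<html>
  <body>
    <div id="header">Site header</div>
    <div id="content">
      <table>
        <tr><td>Photo</td><td><h2>Jane Doe</h2></td></tr>
        <tr><td>Email</td><td><a href="mailto:jdoe@example.edu">jdoe@example.edu</a></td></tr>
      </table>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Position-based path in the style of a browser inspector: it depends on the
# exact nesting of divs and on a <tbody> that is not in the source.
absolute = root.findall("./body/div[3]/div/div[2]/table/tbody/tr[2]/td[2]/h3")

# Paths anchored on the container's id: they survive layout changes elsewhere.
name = root.find(".//div[@id='content']/table//h2").text
email = root.find(".//div[@id='content']/table/tr[2]/td[2]/a").get("href")

print(absolute)  # empty list, nothing matches
print(name)
print(email)
```

ElementTree supports only a subset of XPath, so the expressions differ slightly from what Scrapy accepts (for example, they start with `.` instead of `/`), but the idea of anchoring on `div id="content"` carries over.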

Here's a fixed version of the spider (relying on div id="content"):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.item import Item, Field
    from scrapy.selector import Selector


    class PrattItem(Item):
        name = Field()
        email = Field()


    class ESpider(CrawlSpider):
        name = "pratt"
        allowed_domains = ["pratt.edu"]
        start_urls = ["https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302"]

        # Follow only the links found inside the content container.
        rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="content"]',)), callback="parse_items", follow=True), )

        def parse_items(self, response):
            contacts = Selector(response)

            # Both XPaths are anchored on the container's id, with no tbody step.
            item = PrattItem()
            item['name'] = contacts.xpath('//div[@id="content"]/table//h2/text()').extract()[0]
            item['email'] = contacts.xpath('//div[@id="content"]/table/tr[2]/td[2]/a/@href').extract()[0]
            return item

Hope that helps.
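Two small caveats about a spider written this way: indexing an extracted list with `[0]` raises `IndexError` on any page where the XPath matches nothing, and an `@href` taken from an email link usually carries a `mailto:` prefix (an assumption about the target pages, not something shown in the output above). The helpers below are a hypothetical sketch in plain Python for handling both; the names `first_or_default` and `strip_mailto` are illustrative and not part of Scrapy.

```python
def first_or_default(values, default=None):
    """Return the first extracted string, or `default` if the list is empty."""
    return values[0] if values else default


def strip_mailto(href):
    """Turn 'mailto:user@host?subject=x' into 'user@host'.

    Strings without a mailto: prefix, and None, are returned unchanged.
    """
    if href is None:
        return None
    if href.lower().startswith("mailto:"):
        address = href[len("mailto:"):]
        # Drop any query part such as ?subject=...
        return address.split("?", 1)[0]
    return href


# Usage with lists shaped like the output of .extract()
print(first_or_default([]))
print(strip_mailto(first_or_default(["mailto:jdoe@example.edu?subject=Hi"])))
```

In a callback, the result of `.extract()` could be passed through `first_or_default` before being assigned to the item, so that a page without a match yields `None` instead of stopping the callback with an exception.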
