Recursive Scraping with Python and Scrapy: Information Not Retrieved
I am trying to use Scrapy to pull contact information from the Pratt website, but the information is not being retrieved. My code is as follows:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector, Selector
    from scrapy.http import Request

    class ESpider(CrawlSpider):
        name = "pratt"
        allowed_domains = ["pratt.edu"]
        start_urls = ["https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302"]

        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/div[3]/div/div[2]/div/div/p/a',)),
                 callback="parse_items", follow=True),
        )

        def parse_items(self, response):
            contacts = Selector(response)
            print contacts.xpath('/html/body/div[3]/div/div[2]/table/tbody/tr[2]/td[2]/h3').extract()
            print contacts.xpath('/html/body/div[3]/div/div[2]/table/tbody/tr[2]/td[2]/a').extract()

Beginning on the start URL, I want to go through each person's link and grab the name and email address from the next page. When I run the scraper, I receive the following output:
    2014-03-10 16:46:37-0400 [scrapy] INFO: Scrapy 0.22.2 started (bot: emailspider)
    2014-03-10 16:46:37-0400 [scrapy] INFO: Optional features available: ssl, http11
    2014-03-10 16:46:37-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'emailspider.spiders', 'SPIDER_MODULES': ['emailspider.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'emailspider'}
    2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled item pipelines:
    2014-03-10 16:46:37-0400 [pratt] INFO: Spider opened
    2014-03-10 16:46:37-0400 [pratt] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2014-03-10 16:46:37-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2014-03-10 16:46:37-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2014-03-10 16:46:41-0400 [pratt] DEBUG: Crawled (200) <GET https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302> (referer: None)
    2014-03-10 16:46:44-0400 [pratt] DEBUG: Crawled (200) <GET https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/bio/?id=eabruzzo> (referer: https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302)
    []
    []
    2014-03-10 16:46:47-0400 [pratt] DEBUG: Crawled (200) <GET https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/bio/?id=ehinrich> (referer: https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302)
    []
    []

(I halted the program after a few iterations.) It looks like the pages are being scraped, but empty lists are being returned. Any idea why this is going on? Thanks very much.
Your XPaths weren't correct. More generally, absolute XPaths are fragile, so try to avoid using them in a crawler. It is better to find a container with a unique id or class name and rely on it.
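To illustrate why a position-based path can come back empty while an id-anchored one works, here is a minimal sketch. It makes several assumptions: the markup is a made-up, simplified stand-in for a bio page (not the real Pratt HTML), and it uses the standard library's `xml.etree.ElementTree` with its limited XPath subset rather than Scrapy's `Selector`, purely so it runs without extra dependencies. One common cause of empty results, which may or may not apply to this exact page, is a `tbody` step copied from a browser inspector: browsers add `<tbody>` to the DOM even when the raw HTML the crawler downloads does not contain it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page markup (an assumption, not the real site).
# The raw markup has no <tbody>; a browser would add one when building the DOM.
PAGE = """
<html>
  <body>
    <div id="header">Site header</div>
    <div id="content">
      <table>
        <tr><td>Photo</td><td><h2>Jane Doe</h2></td></tr>
        <tr><td>Email</td><td><a href="mailto:jdoe@example.edu">jdoe@example.edu</a></td></tr>
      </table>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Position-based path in the style of a browser inspector: it depends on the
# div order and includes a <tbody> step that is absent from the raw markup.
brittle = root.findall("./body/div[2]/table/tbody/tr[2]/td[2]/a")

# Paths anchored on the container's id, with no <tbody> step.
names = root.findall(".//div[@id='content']//h2")
links = root.findall(".//div[@id='content']/table/tr[2]/td[2]/a")

print(brittle)                # empty list: the tbody step matches nothing
print(names[0].text)          # name taken from the <h2>
print(links[0].get("href"))   # href of the email link
```

When debugging a real page, running the candidate XPath against the downloaded response (for example in `scrapy shell <url>`) rather than in the browser's inspector shows what the crawler actually sees.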
Here's a fixed version of the spider (relying on div id="content"):
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.item import Item, Field
    from scrapy.selector import Selector

    class PrattItem(Item):
        name = Field()
        email = Field()

    class ESpider(CrawlSpider):
        name = "pratt"
        allowed_domains = ["pratt.edu"]
        start_urls = ["https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302"]

        # Follow only links found inside the main content container.
        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="content"]',)),
                 callback="parse_items", follow=True),
        )

        def parse_items(self, response):
            contacts = Selector(response)
            item = PrattItem()
            # XPaths are anchored on the container id and contain no tbody step.
            item['name'] = contacts.xpath('//div[@id="content"]/table//h2/text()').extract()[0]
            item['email'] = contacts.xpath('//div[@id="content"]/table/tr[2]/td[2]/a/@href').extract()[0]
            return item

Hope that helps.
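One possible refinement, offered as a sketch rather than as part of the answer above: indexing `.extract()[0]` raises `IndexError` on any followed page where the XPath matches nothing, and the `@href` of an email link usually carries a `mailto:` prefix (an assumption about this site's markup). The helper names `first_or_none` and `clean_email` below are hypothetical; they are plain Python functions that operate on the list of strings `.extract()` returns.

```python
def first_or_none(values):
    """Return the first extracted value, stripped, or None if nothing matched."""
    return values[0].strip() if values else None


def clean_email(href):
    """Drop a leading 'mailto:' scheme from an href, if present."""
    if href is None:
        return None
    prefix = "mailto:"
    return href[len(prefix):] if href.lower().startswith(prefix) else href


# Lists shaped like what .extract() returns: zero or more strings.
print(first_or_none(["  Jane Doe "]))
print(first_or_none([]))
print(clean_email(first_or_none(["mailto:jdoe@example.edu"])))
```

Inside `parse_items`, this would look like `item['name'] = first_or_none(contacts.xpath(...).extract())`, so a page without a matching table yields `None` instead of aborting the callback.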