html - Python: Parse all elements under a div
I am trying to parse all the elements under a div using BeautifulSoup. The issue is that I don't know which elements sit underneath the div prior to parsing. For example, a div can have text data in paragraph mode and in bullet format, along with some href elements. Each URL that I open can have different elements underneath the specific div class that I am looking at:
Example:
URL A can have the following:
    <div class='content'>
    <p> Hello I have a link </p>
    <li> I have a bullet point
    <a href="foo.com">foo</a>
    </div>
but URL B can have:
    <div class='content'>
    <p> I have a paragraph </p>
    </div>
I started by doing this:
    content = souping_page.body.find('div', attrs={'class': 'content'})
but I am a little confused about how to go beyond this. I was hoping to create one string from all the parsed data as the end result.
At the end I want the following strings to be obtained from each example:
Example 1, final output:
    parse_data = Hello I have a link I have a bullet point
    parse_links = foo.com
Example 2, final output:
    parse_data = I have a paragraph
You can get just the text of an element with element.get_text():
    >>> from bs4 import BeautifulSoup
    >>> sample1 = BeautifulSoup('''\
    ... <div class='content'>
    ... <p> Hello I have a link </p>
    ...
    ... <li> I have a bullet point
    ...
    ... <a href="foo.com">foo</a>
    ... </div>
    ... ''').find('div')
    >>> sample2 = BeautifulSoup('''\
    ... <div class='content'>
    ... <p> I have a paragraph </p>
    ...
    ... </div>
    ... ''').find('div')
    >>> sample1.get_text()
    u'\n Hello I have a link \n\n I have a bullet point\n\nfoo\n'
    >>> sample2.get_text()
    u'\n I have a paragraph \n\n'
Or you can strip it down a little using element.stripped_strings:
    >>> ' '.join(sample1.stripped_strings)
    u'Hello I have a link I have a bullet point foo'
    >>> ' '.join(sample2.stripped_strings)
    u'I have a paragraph'
To get all the links, look for all a elements with href attributes and gather these in a list:
    >>> [a['href'] for a in sample1.find_all('a', href=True)]
    ['foo.com']
    >>> [a['href'] for a in sample2.find_all('a', href=True)]
    []
The href=True argument limits the search to <a> tags that have an href attribute defined.
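The two steps above can be combined into one helper that returns both the text and the links for a page. This is only a minimal sketch: the names parse_content, css_class, page_a and page_b are made up for illustration, and passing "html.parser" explicitly is an assumption (any parser Beautiful Soup supports should behave the same for this markup).

```python
from bs4 import BeautifulSoup


def parse_content(html, css_class="content"):
    """Return (text, links) for the first <div> carrying css_class.

    Hypothetical helper: it does not care which child elements the
    div contains, it simply walks everything underneath it.
    """
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None:
        # No matching div on this page, so there is nothing to extract.
        return "", []
    # Join every non-empty text fragment under the div with single spaces.
    text = " ".join(div.stripped_strings)
    # Collect the href of every <a> tag that actually has one.
    links = [a["href"] for a in div.find_all("a", href=True)]
    return text, links


# Sample markup modelled on the two pages in the question.
page_a = """
<div class='content'>
<p> Hello I have a link </p>
<li> I have a bullet point
<a href="foo.com">foo</a>
</div>
"""
page_b = "<div class='content'><p> I have a paragraph </p></div>"

for page in (page_a, page_b):
    parse_data, parse_links = parse_content(page)
    print(parse_data, parse_links)
```

Note that the link text ("foo") ends up in the text string as well, just as it does with stripped_strings in the session above.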