html - Python: Parse all elements under a div

June 15, 2014

I am trying to parse all elements under a div using BeautifulSoup. The issue is that I don't know all the elements underneath the div prior to parsing. For example, a div can have text data in paragraph mode and in bullet format, along with some href elements. Each URL that I open can have different elements underneath the specific div class that I am looking at:

Example:

URL A can have the following:

    <div class='content'>
    <p> hello have link </p>

    <li> have bullet point
    <a href="foo.com">foo</a>
    </div>

But URL B can have:

    <div class='content'>
    <p> have paragraph </p>
    </div>

I started by doing this:

    content = souping_page.body.find('div', attrs={'class': 'content'})

But I am a little confused about how to go beyond this. I am hoping to create one string from all the parsed data as the end result.

At the end, I want to obtain the following strings for each example:

Example 1: final output

    parse_data = hello have link have bullet point
    parse_links = foo.com

Example 2: final output

    parse_data = have paragraph

You can get just the text of an element with element.get_text():

    >>> from bs4 import BeautifulSoup
    >>> sample1 = BeautifulSoup('''\
    ... <div class='content'>
    ... <p> hello have link </p>
    ...
    ... <li> have bullet point
    ...
    ... <a href="foo.com">foo</a>
    ... </div>
    ... ''').find('div')
    >>> sample2 = BeautifulSoup('''\
    ... <div class='content'>
    ... <p> have paragraph </p>
    ...
    ... </div>
    ... ''').find('div')
    >>> sample1.get_text()
    u'\n hello have link \n have bullet point\n\nfoo\n'
    >>> sample2.get_text()
    u'\n have paragraph \n'

Or you can strip it down a little using element.stripped_strings:

    >>> ' '.join(sample1.stripped_strings)
    u'hello have link have bullet point foo'
    >>> ' '.join(sample2.stripped_strings)
    u'have paragraph'
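The same joined string can also be produced in a single call, because get_text() accepts a separator and a strip flag. The following is a minimal sketch rather than part of the original answer: it assumes bs4 is installed, the "html.parser" backend is an assumed choice, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """\
<div class='content'>
<p> hello have link </p>
<li> have bullet point
<a href="foo.com">foo</a>
</div>
"""

# "html.parser" is an assumed parser choice; the original answer did not name one.
div = BeautifulSoup(html, "html.parser").find("div", class_="content")

# Join every non-empty, whitespace-stripped text fragment with a single space.
joined = " ".join(div.stripped_strings)

# get_text() with a separator and strip=True collects the same stripped fragments.
one_call = div.get_text(" ", strip=True)

print(joined)
print(one_call)
```

Both approaches walk the same text nodes, so choosing between them is mostly a matter of taste; stripped_strings is handy when the individual fragments are needed before joining.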

To get all links, look for all a elements with href attributes, and gather these in a list:

    >>> [a['href'] for a in sample1.find_all('a', href=True)]
    ['foo.com']
    >>> [a['href'] for a in sample2.find_all('a', href=True)]
    []

The href=True argument limits the search to <a> tags that have an href attribute defined.
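Putting the two pieces together, one possible way to get the parse_data and parse_links values from the question is sketched below. The function name parse_content, its css_class parameter, and the "html.parser" backend are assumptions made for this illustration, not something the original answer specifies.

```python
from bs4 import BeautifulSoup


def parse_content(html, css_class="content"):
    """Return (parse_data, parse_links) for the first div with the given class.

    parse_content and css_class are hypothetical names chosen for this sketch.
    """
    # "html.parser" is an assumed parser choice; any installed parser should work.
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None:
        # The page has no matching div, so there is nothing to collect.
        return "", []
    # All text under the div, whatever the child elements turn out to be.
    parse_data = " ".join(div.stripped_strings)
    # Only <a> tags that actually carry an href attribute.
    parse_links = [a["href"] for a in div.find_all("a", href=True)]
    return parse_data, parse_links


page_a = """<div class='content'> <p> hello have link </p>
<li> have bullet point <a href="foo.com">foo</a> </div>"""
page_b = "<div class='content'> <p> have paragraph </p> </div>"

for page in (page_a, page_b):
    data, links = parse_content(page)
    print("parse_data =", data)
    print("parse_links =", links)
```

Note that the link text ("foo") ends up in parse_data, because stripped_strings visits every descendant of the div. The output requested in the question leaves it out, so an extra filtering step would be needed if that matters.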
