python - BeautifulSoup help: how to extract text content from an HTML file with improper tags?
<tr> <td nowrap> good1 </td> <td class = "td_left" nowrap=""> 1 </td> </tr>
<tr0> <td nowrap> good2 </td> <td class = "td_left" nowrap=""> </td> </tr0>
How can I parse this using Python? Please help. I want the result to be the list ['good1', 1, 'good2', None].
Find all the tr tags, then get the td tags inside each of them:
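    from bs4 import BeautifulSoup

    page = """<tr> <td nowrap> good1 </td> <td nowrap class = "td_left"> 1 </td> </tr>
    <tr> <td nowrap> good2 </td> <td nowrap class = "td_left"> 2 </td> </tr>"""

    # Name the parser explicitly. html.parser does not wrap the fragment
    # in <html>/<body>, so search from the soup itself, not soup.body.
    soup = BeautifulSoup(page, "html.parser")
    rows = soup.find_all('tr')

    # Collect the stripped text of every td in every row (Python 3 print).
    print([td.text.strip() for row in rows for td in row.find_all('td')])

prints:

    ['good1', '1', 'good2', '2']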
Note that strip() removes leading and trailing whitespace from each cell's text.
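The example above uses well-formed tr rows and returns every cell as a string. For the markup in the question, where the second row is wrapped in a malformed tr0 tag and one cell is empty, a search for tr would miss that row. Below is a minimal sketch of one way around this, assuming the built-in html.parser keeps the td tags inside the unknown tr0 element: it collects the td cells directly instead of going through the rows. The helper names to_value and extract_cells are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# The markup from the question, including the malformed <tr0> row.
page = """<tr> <td nowrap> good1 </td> <td class = "td_left" nowrap=""> 1 </td> </tr>
<tr0> <td nowrap> good2 </td> <td class = "td_left" nowrap=""> </td> </tr0>"""


def to_value(text):
    """Hypothetical helper: empty cell -> None, integer-like -> int, else str."""
    text = text.strip()
    if not text:
        return None
    try:
        return int(text)
    except ValueError:
        return text


def extract_cells(html):
    """Hypothetical helper: convert every td in the fragment, whatever its parent tag."""
    soup = BeautifulSoup(html, "html.parser")
    # Searching for td directly sidesteps the <tr>/<tr0> mismatch.
    return [to_value(td.get_text()) for td in soup.find_all("td")]


print(extract_cells(page))
```

With html.parser this should give the list asked for in the question, ['good1', 1, 'good2', None]. Other parsers such as lxml or html5lib may restructure a bare table fragment differently, so the result is worth checking if you switch parsers.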
Hope that helps.