python - BeautifulSoup help: how to extract content from improperly tagged text in an HTML file?
    <tr> <td nowrap> good1 </td> <td class = "td_left" nowrap=""> 1 </td> </tr>
    <tr0> <td nowrap> good2 </td> <td class = "td_left" nowrap=""> </td> </tr0>

How can I parse this using Python? Please help. I want the result to be the list ['good1', 1, 'good2', None].
Find the tr tags, then the td tags inside each of them:
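    from bs4 import BeautifulSoup

    page = """<tr> <td nowrap> good1 </td> <td nowrap class = "td_left"> 1 </td> </tr>
    <tr> <td nowrap> good2 </td> <td nowrap class = "td_left"> 2 </td> </tr>"""

    # Name the parser explicitly. html.parser adds no <html>/<body> wrapper,
    # so search from the soup itself rather than from soup.body.
    soup = BeautifulSoup(page, 'html.parser')
    rows = soup.find_all('tr')

    # Collect the stripped text of every td in every row.
    print([td.text.strip() for row in rows for td in row.find_all('td')])

Prints (under Python 2; Python 3 shows the same list without the u prefixes):

    [u'good1', u'1', u'good2', u'2']

Note that strip() removes the leading and trailing whitespace.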
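The sample above uses <tr> for both rows and fills in the second value, whereas the markup in the question has a <tr0> row and an empty cell. The following is a minimal sketch for that original markup, not a definitive solution. It assumes the html.parser backend, which keeps the unknown <tr0> tag as an ordinary element, and it introduces a hypothetical helper, cell_value, that maps empty cells to None and digit-only cells to int.

```python
from bs4 import BeautifulSoup

# Markup copied from the question, including the malformed <tr0> row.
page = """<tr> <td nowrap> good1 </td> <td class = "td_left" nowrap=""> 1 </td> </tr>
<tr0> <td nowrap> good2 </td> <td class = "td_left" nowrap=""> </td> </tr0>"""


def cell_value(td):
    """Hypothetical helper: None for empty cells, int for digit-only cells, else text."""
    text = td.get_text(strip=True)
    if not text:
        return None
    return int(text) if text.isdigit() else text


soup = BeautifulSoup(page, "html.parser")

# Search for td directly so the name of the enclosing row tag does not matter.
result = [cell_value(td) for td in soup.find_all("td")]
print(result)  # ['good1', 1, 'good2', None] under Python 3
```

Searching for td rather than tr sidesteps the malformed row tag. If the rows must stay grouped, one option is to iterate over each td's parent element instead.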
Hope that helps.