Problems faced while scraping webpages in Python
So, I wrote a minimal function to scrape the text of a webpage:
    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.brainpickings.org'
    request = requests.get(url)
    soup_data = BeautifulSoup(request.content)
    texts = soup_data.findAll(text=True)

    def visible(element):
        # skip text that lives inside non-visible elements
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return False
        return True

    print filter(visible, texts)

But it doesn't work smoothly; there are still unnecessary tags left in the output. Also, if I try a regex to remove various characters I don't want, I get this error:
    elif re.match('<!--.*-->', str(element)):
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 209: ordinal not in range(128)

So, how can I improve this a bit and make it better?
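One thing that seems to work around the encode error, as a sketch: the UnicodeEncodeError comes from calling str() on a unicode string containing u'\u2019' under Python 2, so keep the text as unicode instead, or test against bs4's Comment class directly. The Comment import and the regex line here are my own guess at how the comment filtering was meant to go, not the original code:

    import re
    from bs4 import Comment

    def visible(element):
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return False
        # Avoid str(): it re-encodes the unicode text with the ASCII codec,
        # which is what raises the UnicodeEncodeError on u'\u2019'.
        if isinstance(element, Comment) or re.match(u'<!--.*-->', unicode(element)):
            return False
        return True

With that change, filter(visible, texts) runs without tripping the ASCII codec.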
With lxml it's pretty easy:
    from lxml import html

    doc = html.fromstring(content)
    print doc.text_content()

Edit: filtering out the head can be done as follows:

    print doc.body.text_content()
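Putting it together with the URL from the question, a rough sketch (assumes requests and lxml are installed; stripping script/style elements is my own extra step, not something lxml does for you):

    import requests
    from lxml import html

    url = 'http://www.brainpickings.org'
    content = requests.get(url).content
    doc = html.fromstring(content)

    # Drop script and style elements so their contents don't end up in the text.
    for bad in doc.xpath('//script | //style'):
        bad.getparent().remove(bad)

    # body.text_content() skips everything in <head>, including the <title>.
    print doc.body.text_content()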