Problems faced while scraping webpages in Python


So, I wrote a minimal function to scrape the text from a webpage:

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.brainpickings.org'
    request = requests.get(url)
    soup_data = BeautifulSoup(request.content)
    texts = soup_data.findAll(text=True)

    def visible(element):
        # Skip text nodes that live inside elements the browser does not render.
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            return False
        return True

    print(list(filter(visible, texts)))

But it doesn't work smoothly. There are still unnecessary tags in the output. Also, if I try to use a regex to remove the various characters I don't want, I get an error:

        elif re.match('<!--.*-->', str(element)):
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 209: ordinal not in range(128)

So, how can I improve this a bit more to make it better?
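
The UnicodeEncodeError most likely comes from calling str(element) on a Unicode text node under Python 2, which tries to encode it as ASCII. Here is a minimal sketch that avoids the conversion by checking the node type instead of matching a regex. It assumes the bs4 package; SAMPLE_HTML, SKIPPED_PARENTS and visible_text are hypothetical names used only for illustration, and the sample markup stands in for the downloaded page:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample page standing in for the downloaded HTML.
SAMPLE_HTML = u"""<html><head><title>Title</title><style>p {color: red}</style></head>
<body><!-- a comment --><p>It\u2019s visible</p><script>var x = 1;</script></body></html>"""

# Parents whose text is not shown on the rendered page (same list as in the question).
SKIPPED_PARENTS = {'style', 'script', '[document]', 'head', 'title'}


def visible(element):
    """Return True for text nodes that should be kept."""
    if element.parent.name in SKIPPED_PARENTS:
        return False
    # Comments are a NavigableString subclass, so no str() or regex is needed.
    if isinstance(element, Comment):
        return False
    return True


def visible_text(html_source):
    """Collect the stripped, non-empty visible text nodes of a page."""
    soup = BeautifulSoup(html_source, 'html.parser')
    texts = soup.find_all(string=True)
    return [t.strip() for t in texts if visible(t) and t.strip()]


print(visible_text(SAMPLE_HTML))
```

Because the text nodes are never passed through str(), the non-ASCII apostrophe in the sample stays a Unicode character throughout.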

With lxml this is pretty easy:

    from lxml import html

    # content is the raw HTML of the page, e.g. request.content from above
    doc = html.fromstring(content)
    print(doc.text_content())

Edit: filtering out the head can be done as follows:

    print(doc.body.text_content())
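
Note that text_content() also returns the text of any script or style elements inside the body. As a rough sketch, those elements could be removed before the text is extracted. It assumes lxml's etree.strip_elements; SAMPLE_HTML and body_text are hypothetical names, and the sample markup again stands in for the real page:

```python
from lxml import etree, html

# Hypothetical sample page standing in for the downloaded HTML.
SAMPLE_HTML = u"""<html><head><title>Page title</title></head>
<body><!-- note --><p>It\u2019s <b>visible</b> text.</p>
<script>var x = 1;</script><style>p {color: red}</style></body></html>"""


def body_text(html_source):
    """Return the body text without script/style content, whitespace collapsed."""
    doc = html.fromstring(html_source)
    # Remove <script> and <style> elements together with their content;
    # with_tail=False keeps any text that follows the removed elements.
    etree.strip_elements(doc, 'script', 'style', with_tail=False)
    # split()/join() collapses runs of whitespace and newlines into single spaces.
    return ' '.join(doc.body.text_content().split())


print(body_text(SAMPLE_HTML))
```

Collapsing whitespace is optional. It is included here only because the raw text usually contains many stray newlines from the markup.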
