python - Issue with scraping site with foreign characters

April 15, 2012

I need help with a scraper I'm writing. I'm trying to scrape a table of university rankings, and some of the schools are European universities with foreign characters in their names (e.g. ä, ü). I'm already scraping another table on a different site with foreign universities in the exact same way, and that works fine. For some reason, the current scraper won't work with foreign characters (and as far as parsing foreign characters goes, the two scrapers are the same).

Here's what I'm doing to try to make things work:

Declaring the encoding on the first line of the file:
```
# -*- coding: utf-8 -*- 
```

Importing and using smart_unicode from the Django framework:

from django.utils.encoding import smart_unicode

school_name = smart_unicode(html_elements[2].text_content(), encoding='utf-8', strings_only=False, errors='strict').encode('utf-8')

Using the encode function, as seen above, chained onto the smart_unicode call. I can't think of what else I could be doing wrong. Before dealing with these scrapers, I didn't really understand much about different encodings, so it's been a bit of an eye-opening experience. I've tried reading the following, but still can't overcome this problem:
- http://farmdev.com/talks/unicode/
- http://www.joelonsoftware.com/articles/unicode.html

I understand that in an encoding, every character is assigned a number, which can be expressed in hex, binary, etc. Different encodings have different capacities for how many languages they support (e.g. ASCII only supports English, while UTF-8 supports everything, it seems). However, I feel like I'm doing everything necessary to ensure the characters are printed correctly. I don't know where my mistake is, and it's driving me crazy. Please help!!

When extracting information from a web page, you need to determine its character encoding, much as browsers do (analyzing HTTP headers, parsing the HTML to find meta tags, and possibly guesswork based on the actual data, e.g. the presence of something that looks like a BOM in some encoding). Hopefully you can find a library routine that does this for you.
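The checks listed above can be sketched with the standard library alone. This is only an illustration, written for Python 3 rather than the Python 2 of the question: the function name `guess_encoding`, the order of the checks, the 2048-byte search window, and the ISO-8859-1 fallback are all assumptions made for the example, and a real detection library will handle many more cases.

```python
import codecs
import re

# Byte order marks to look for at the very start of the data.
# UTF-32 is left out to keep the sketch short.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# charset=... inside an HTTP Content-Type header value (text).
_HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)

# charset=... inside a <meta> tag in the raw, still undecoded HTML (bytes).
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE
)


def guess_encoding(raw, content_type=None, default="iso-8859-1"):
    """Return a codec name for the raw bytes of a page (hypothetical helper)."""
    # 1. A BOM at the start of the data.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. The charset parameter of the HTTP Content-Type header, if given.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 3. A <meta> declaration near the top of the document.
    match = _META_CHARSET.search(raw[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Nothing declared: fall back to the caller's default.
    return default


# Usage: decode the bytes once, using whatever the page declares.
page = "<td>Universität Zürich</td>".encode("iso-8859-1")
text = page.decode(guess_encoding(page, "text/html; charset=ISO-8859-1"))
```

The point of the design is that the raw bytes are decoded exactly once, with the encoding the page itself declares, instead of assuming UTF-8 up front.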

In any case, you should not expect all web sites to be UTF-8 encoded. ISO-8859-1 is still in widespread use, and in general reading ISO-8859-1 data as if it were UTF-8 results in a big mess (for any non-ASCII characters).
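To see the mismatch concretely, here is a small Python 3 sketch. The sample name and the helper `decodes_as_utf8` are made up for the illustration; it shows what happens to ä and ü when the bytes are read with the wrong encoding in either direction.

```python
name = "Universität Zürich"

latin1_bytes = name.encode("iso-8859-1")  # ä and ü take one byte each
utf8_bytes = name.encode("utf-8")         # ä and ü take two bytes each


def decodes_as_utf8(data):
    """Return True if the bytes are valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ISO-8859-1 bytes are not valid UTF-8 here, so a strict decode fails.
latin1_is_valid_utf8 = decodes_as_utf8(latin1_bytes)

# With errors="replace" the non-ASCII characters are lost instead.
lossy = latin1_bytes.decode("utf-8", errors="replace")

# The opposite mistake never raises, it silently produces mojibake:
# each two-byte UTF-8 sequence turns into two ISO-8859-1 characters.
mojibake = utf8_bytes.decode("iso-8859-1")

# Decoding with the encoding the bytes were written in round-trips cleanly.
round_trip = latin1_bytes.decode("iso-8859-1")
```

Here `mojibake` comes out as `UniversitÃ¤t ZÃ¼rich`, which is the typical symptom of UTF-8 data being read as ISO-8859-1.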
