python - Issue with scraping a site with foreign characters
I need help with a scraper I'm writing. I'm trying to scrape a table of university rankings, and some of the schools are European universities with foreign characters in their names (e.g. ä, ü). I'm already scraping another table on a site with foreign universities in the exact same way, and it works fine. For some reason, though, the current scraper won't work with foreign characters (and as far as parsing foreign characters goes, the two scrapers are the same).
Here's what I'm doing to try to make things work:
Declaring the encoding on the first line of the file:
    # -*- coding: utf-8 -*-
Importing and using smart_unicode from the Django framework:
    from django.utils.encoding import smart_unicode
    school_name = smart_unicode(html_elements[2].text_content(), encoding='utf-8', strings_only=False, errors='strict').encode('utf-8')
Using the encode function, as seen above where it is chained with the smart_unicode function.
I can't think of what else I could be doing wrong. Before dealing with these scrapers, I really didn't understand much about different encodings, so it's been a bit of an eye-opening experience. I've also tried reading up on encodings, but I still can't overcome the problem.
I understand that in an encoding, every character is assigned a number, which can be expressed in hex, binary, etc. Different encodings have different capacities for how many languages they support (e.g. ASCII only supports English, while UTF-8 supports everything, it seems). However, I feel like I'm doing everything necessary to ensure the characters are printed correctly. I don't know where my mistake is, and it's driving me crazy. Please help!!
When extracting information from a web page, you need to determine its character encoding, much as browsers do (analyzing the HTTP headers, parsing the HTML to find meta tags, and possibly guesswork based on the actual data, e.g. the presence of something that looks like a BOM in some encoding). Hopefully you can find a library routine that does this for you.
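To make those steps concrete, here is a rough sketch of such a detection routine, using only the Python 3 standard library. The names guess_encoding and decode_page are hypothetical helpers made up for this illustration, the order of the checks is just one reasonable choice, and the final fallback to ISO-8859-1 is an assumption rather than a rule. A real detection library will handle many more cases.

```python
import codecs
import re

# Byte-order marks; UTF-32 is listed before UTF-16 because the UTF-32 LE BOM
# starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches charset=... in a Content-Type header or in an HTML meta tag.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def guess_encoding(raw, content_type=None):
    """Hypothetical helper: guess the encoding of the raw bytes of a page."""
    # 1. Something that looks like a BOM at the very start of the data.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. The charset parameter of the HTTP Content-Type header, if we have it.
    if content_type:
        match = _CHARSET_RE.search(content_type.encode("ascii", "ignore"))
        if match:
            return match.group(1).decode("ascii").lower()
    # 3. A charset declaration in a meta tag near the top of the HTML.
    match = _CHARSET_RE.search(raw[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Guesswork: valid UTF-8 is very likely UTF-8; otherwise assume
    #    ISO-8859-1 (an assumption made for this sketch).
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"


def decode_page(raw, content_type=None):
    """Hypothetical helper: decode page bytes using the guessed encoding."""
    encoding = guess_encoding(raw, content_type)
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declared charset: ISO-8859-1 can decode any bytes.
        return raw.decode("iso-8859-1")
```

You would call decode_page on the raw bytes of the response (plus the Content-Type header value, if you have it) before handing the text to your HTML parser, instead of assuming UTF-8 up front.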
In any case, you should not expect all web sites to be UTF-8 encoded. ISO-8859-1 is still in widespread use, and in general reading ISO-8859-1 as if it were UTF-8 results in a big mess (for any non-ASCII characters).
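A small illustration of that mess, written for Python 3 (where bytes and text are separate types, unlike the Python 2 code in the question). The school name is just sample data chosen for this sketch.

```python
name = "Universität Zürich"

latin1_bytes = name.encode("iso-8859-1")  # one byte per character
utf8_bytes = name.encode("utf-8")         # ä and ü take two bytes each

# Reading ISO-8859-1 bytes as if they were UTF-8: a strict decode fails.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace", each bad byte becomes U+FFFD (the replacement character).
lossy = latin1_bytes.decode("utf-8", errors="replace")

# The opposite mistake, reading UTF-8 bytes as ISO-8859-1, never raises,
# but silently produces garbage such as "UniversitÃ¤t ZÃ¼rich".
mojibake = utf8_bytes.decode("iso-8859-1")

# Decoding with the encoding the bytes were actually written in works fine.
correct = latin1_bytes.decode("iso-8859-1")
```

The ASCII letters survive in every case, which is why a scraper can appear to work until it hits the first name with an ä or ü in it.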