Python urllib2: reading only part of a document
OK, this is driving me nuts.
I'm trying to read the Crunchbase API using Python's urllib2 library. Relevant code:
api_url = "http://api.crunchbase.com/v/1/financial-organization/venrock.js"
len(urllib2.urlopen(api_url).read())

The result is either 73493 or 69397. The actual document is much longer. When I try it on a different computer, the length is either 44821 or 40725. I've tried changing the user-agent, using urllib instead, increasing the timeout to a large number, and reading small chunks at a time. Always the same result.
I assumed it was a server problem, but my browser reads the whole thing.
Python 2.7.2 on OS X 10.6.8 gives the ~40k lengths; Python 2.7.1 running IPython on OS X 10.7.3 gives the ~70k lengths. Any thoughts?
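For reference, the "small chunks at a time" attempt looks roughly like this (a sketch; the helper name and the in-memory stream standing in for the HTTP response are mine, not the exact code I ran):

```python
import io

def read_all(resp, chunk_size=1024):
    # Read the response in small chunks until EOF, then join them.
    chunks = []
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b''.join(chunks)

# Demo with an in-memory stream standing in for urllib2's response object.
fake_resp = io.BytesIO(b'x' * 5000)
print(len(read_all(fake_resp)))
```

Against the real API this loop still stops short, at the same truncated lengths as the one-shot read.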
There's a kooky server out there. It might work if you, like the browser, request the file with gzip encoding. Here is code that should do the trick:
import urllib2, gzip

api_url = 'http://api.crunchbase.com/v/1/financial-organization/venrock.js'
# Ask the server for a gzip-compressed response, as browsers do.
req = urllib2.Request(api_url)
req.add_header('Accept-Encoding', 'gzip')
resp = urllib2.urlopen(req)
data = resp.read()

>>> print len(data)
26610

The problem then is to decompress the data.
from StringIO import StringIO

# Only decompress if the server actually sent gzip-encoded data.
if resp.info().get('Content-Encoding') == 'gzip':
    g = gzip.GzipFile(fileobj=StringIO(data))
    data = g.read()

>>> print len(data)
183159
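To see the decompression step in isolation, here's a self-contained round trip (the sample payload is made up; io.BytesIO is the bytes-friendly stand-in for StringIO and works on both Python 2.7 and 3):

```python
import gzip
import io

# Made-up payload standing in for the Crunchbase response body.
original = b'{"name": "Venrock"}' * 100

# Compress it, as the server does when it honors Accept-Encoding: gzip.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as g:
    g.write(original)
compressed = buf.getvalue()

# Decompress it, the same way the answer does.
data = gzip.GzipFile(fileobj=io.BytesIO(compressed)).read()
print(len(compressed), len(data))  # compressed is much shorter than data
```

The same GzipFile(fileobj=...) pattern handles the API response once you wrap the raw bytes in a file-like object.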