Python - Counting words from a tokenized URL
I'm very new to Python, and I'm hoping you guys can give me some help.
I have a book about the Great War, and I want to count the number of times each country appears in the book. So far I have this:
    >>> from __future__ import division
    >>> import nltk, re, pprint
    >>> from urllib import urlopen
    >>> url = "http://www.gutenberg.org/files/29270/29270.txt"
    >>> raw = urlopen(url).read()
    >>> type(raw)
    <type 'str'>
    >>> len(raw)
    1067008
    >>> raw[:75]
    'the project gutenberg ebook of the story of the great war, volume ii (of\r\nv'

Tokenization: break the string up into words and punctuation.

    >>> tokens = nltk.word_tokenize(raw)
    >>> type(tokens)
    <type 'list'>
    >>> len(tokens)
    189743
    >>> tokens[:10]  # find the first 10 tokens
    ['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'story', 'of', 'the', 'great']

Correcting the beginning and ending of the book:

    >>> raw.find("part i")
    2629
    >>> raw.rfind("end of project gutenberg")
    1047663
    >>> raw = raw[2629:1047663]
    >>> raw.find("part i")
    0

Unfortunately, I have no idea how to implement a word count for the book. The ideal outcome would be something like this:
    germany 2000
    united kingdom 1500
    usa 1000
    holland 50
    belgium 150
    etc.
Please help!
Python has a built-in string method, count, that counts the occurrences of a substring in a string.
    from urllib import urlopen

    url = "http://www.gutenberg.org/files/29270/29270.txt"
    raw = urlopen(url).read()
    # Keep only the body of the book, between the two markers.
    raw = raw[raw.find("part i"):raw.rfind("end of project gutenberg")]
    countries = ['germany', 'united kingdom', 'usa', 'holland', 'belgium']
    for c in countries:
        print c, raw.count(c)

This produces:
    germany 117
    united kingdom 0
    usa 0
    holland 10
    belgium 63

Edit: eumiro is right, this doesn't work if you want to count the exact word. Use this if you want to search for the exact word:
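    import re
    from urllib import urlopen

    url = "http://www.gutenberg.org/files/29270/29270.txt"
    raw = urlopen(url).read()
    raw = raw[raw.find("part i"):raw.rfind("end of project gutenberg")]
    countries = ['germany', 'united kingdom', 'usa', 'holland', 'belgium']
    # Count only matches that are followed by a non-letter character.
    for key, value in {c: len(re.findall(c + '[^A-Za-z]', raw)) for c in countries}.items():
        print key, value

Edit: if you want the line numbers: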
    from urllib import urlopen
    import re
    from collections import defaultdict

    url = "http://www.gutenberg.org/files/29270/29270.txt"
    raw = urlopen(url).readlines()
    count = defaultdict(list)
    countries = ['germany', 'united kingdom', 'usa', 'holland', 'belgium']
    for c in countries:
        for nr, line in enumerate(raw):
            if re.search(c + r'[^A-Za-z]', line):
                count[c].append(nr + 1)  # nr + 1 so the first line is 1 instead of 0
        print c, len(count[c]), 'lines:', count[c]
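One thing to watch with the pattern above: it only checks the character after the name, so a name that is preceded by a letter, or that sits at the very end of the text, can be miscounted. Here is a rough Python 3 sketch of the same idea using \b word boundaries instead. The function name count_whole_words and the sample text are made up for illustration, and it runs on a small in-memory string rather than the downloaded book, so treat it as a starting point and not a tested solution for the full text.

```python
import re
from collections import defaultdict

def count_whole_words(text, names):
    """Return {name: [line number of each whole-word occurrence]}."""
    hits = defaultdict(list)
    lines = text.splitlines()
    for name in names:
        # \b matches at a word boundary (including the start and end of a line),
        # so 'usa' is not matched inside 'thousand'.
        # re.escape guards against regex metacharacters in a name.
        pattern = re.compile(r'\b' + re.escape(name) + r'\b', re.IGNORECASE)
        for nr, line in enumerate(lines, start=1):
            # One entry per occurrence, so len(hits[name]) is the total count.
            hits[name].extend([nr] * len(pattern.findall(line)))
    return hits

# Hypothetical sample text, standing in for the book.
sample = "Germany invaded Belgium.\nA thousand men crossed into Belgium\nfrom germany"
names = ['germany', 'belgium', 'usa']
result = count_whole_words(sample, names)
for name in names:
    print(name, len(result[name]), 'lines:', result[name])
```

Because this works line by line, a multi-word name such as 'united kingdom' that is split across a line break would not be found. If that matters, you could run the same pattern over the whole text at once and give up the line numbers.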