python - Counting words from tokenized url -

July 15, 2010

I'm very new to Python and am hoping you guys can give me some help.

I have a book about the Great War, and I want to count the number of times each country appears in the book. So far I have this:

    >>> from __future__ import division
    >>> import nltk, re, pprint
    >>> from urllib import urlopen
    >>> url = "http://www.gutenberg.org/files/29270/29270.txt"
    >>> raw = urlopen(url).read()
    >>> type(raw)
    <type 'str'>
    >>> len(raw)
    1067008
    >>> raw[:75]
    'The Project Gutenberg EBook of The Story of the Great War, Volume II (of\r\nV'
    >>>

Tokenization. Break up the string into words and punctuation.

    >>> tokens = nltk.word_tokenize(raw)
    >>> type(tokens)
    <type 'list'>
    >>> len(tokens)
    189743
    >>> tokens[:10]  # find the first 10 tokens
    ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Story', 'of', 'the', 'Great']
    >>>

Correcting the beginning and ending of the book:

    >>> raw.find("PART I")
    2629
    >>> raw.rfind("End of Project Gutenberg")
    1047663
    >>> raw = raw[2629:1047663]
    >>> raw.find("PART I")
    0

Unfortunately, I have no idea how to implement the word count for the book. My ideal outcome would be something like this:

    Germany 2000
    United Kingdom 1500
    USA 1000
    Holland 50
    Belgium 150

etc.

Please help!

Python has a built-in method to count the occurrences of a substring in a string.

    from urllib import urlopen

    url = "http://www.gutenberg.org/files/29270/29270.txt"
    raw = urlopen(url).read()
    # keep only the text between the start of Part I and the Gutenberg footer
    raw = raw[raw.find("PART I"):raw.rfind("End of Project Gutenberg")]

    countries = ['Germany', 'United Kingdom', 'USA', 'Holland', 'Belgium']
    for c in countries:
        print c, raw.count(c)

This produces:

    Germany 117
    United Kingdom 0
    USA 0
    Holland 10
    Belgium 63
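As a small illustration of what `str.count` actually counts, here is a sketch in Python 3 on a made-up sample string (the `sample` text is an assumption for illustration, not taken from the book). It counts non-overlapping substring occurrences and is case-sensitive, so longer words that contain the search term are counted too.

```python
# Made-up sample text, not from the book (assumption for illustration).
sample = "Holland fought. The Hollanders left Holland. Germany and Germany's allies."

# str.count counts non-overlapping substring occurrences, case-sensitively.
holland_hits = sample.count("Holland")   # also matches inside "Hollanders"
germany_hits = sample.count("Germany")   # also matches inside "Germany's"
lower_hits = sample.count("holland")     # no match: the search is case-sensitive

print(holland_hits, germany_hits, lower_hits)
```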

Edit: eumiro is right, this doesn't work if you want to count the exact word. Use this if you want to search for the exact word:

    import re
    from urllib import urlopen

    url = "http://www.gutenberg.org/files/29270/29270.txt"
    raw = urlopen(url).read()
    raw = raw[raw.find("PART I"):raw.rfind("End of Project Gutenberg")]

    countries = ['Germany', 'United Kingdom', 'USA', 'Holland', 'Belgium']
    # a match must be followed by a non-letter character
    for key, value in {c: len(re.findall(c + '[^a-zA-Z]', raw)) for c in countries}.items():
        print key, value
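Note that a pattern requiring one trailing non-letter misses a name at the very end of the text and still matches a name embedded at the end of a longer word. One possible alternative, sketched here in Python 3, uses `\b` word boundaries together with `re.escape`. The helper name `count_exact` and the sample text are hypothetical, not part of the original answer.

```python
import re

def count_exact(text, names):
    """Return {name: count} for whole-word matches of each name in text."""
    counts = {}
    for name in names:
        # \b matches at word boundaries; re.escape guards against regex metacharacters
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, text))
    return counts

# Made-up sample text (assumption for illustration).
sample = "Holland and the Hollanders. The United Kingdom joined; Holland"
result = count_exact(sample, ["Holland", "United Kingdom", "USA"])
print(result)
```

With this approach "Hollanders" is not counted as "Holland", while a possessive such as "Germany's" still would be, since the apostrophe is a non-word character.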

Edit: if you want the line numbers:

    from urllib import urlopen
    import re
    from collections import defaultdict

    url = "http://www.gutenberg.org/files/29270/29270.txt"
    raw = urlopen(url).readlines()

    count = defaultdict(list)
    countries = ['Germany', 'United Kingdom', 'USA', 'Holland', 'Belgium']
    for c in countries:
        for nr, line in enumerate(raw):
            if re.search(c + r'[^a-zA-Z]', line):
                count[c].append(nr + 1)  # nr + 1 so the first line is 1 instead of 0
        print c, len(count[c]), 'lines:', count[c]
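For readers on Python 3, where `urlopen` lives in `urllib.request` and `print` is a function, the same idea could be sketched roughly as follows. The function name `lines_mentioning` is hypothetical, the sketch uses `\b` word boundaries instead of the trailing `[^a-zA-Z]`, and the demo runs on a few made-up lines rather than downloading the book.

```python
import re
from collections import defaultdict

def lines_mentioning(lines, names):
    """Map each name to the 1-based numbers of the lines that contain it."""
    found = defaultdict(list)
    for name in names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        # start=1 makes the first line number 1 instead of 0
        for nr, line in enumerate(lines, start=1):
            if pattern.search(line):
                found[name].append(nr)
    return found

# Made-up sample lines (assumption for illustration).
sample_lines = [
    "Germany declared war.",
    "Belgium was invaded.",
    "Germany and Belgium signed nothing.",
]
hits = lines_mentioning(sample_lines, ["Germany", "Belgium", "Holland"])
for name in ["Germany", "Belgium", "Holland"]:
    print(name, len(hits[name]), "lines:", hits[name])
```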
