Write a Python script that goes through the links on a page recursively
I'm doing a school project comparing scam e-mails. I found the website http://www.419scam.org/emails/ and I want to save every scam in a separate document so I can analyse them later on. Here's my code so far:
```python
import urllib2

address = 'http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()

# Save the raw HTML of the index page to a file
f = open('test.txt', 'wb')
f.write(html)
f.close()
```

This saves me the whole HTML file as text. Now I need to strip the file and save the contents of the HTML links to the scams:
```html
<a href="2011-12/01/index.htm">01</a>
<a href="2011-12/02/index.htm">02</a>
<a href="2011-12/03/index.htm">03</a>
```

etc.
If I manage that, I'd still need to go one step further and open and save each of those hrefs. Any idea how to do it all in one Python script?

Thank you!
You picked the right tool in BeautifulSoup. Technically you could do it all in one script, but you might want to segment it, because it looks like you'll be dealing with tens of thousands of e-mails, each of which is a separate request - and that will take a while.
The BeautifulSoup documentation is going to help you a lot, but here's a little code snippet to get you started. It gets all of the HTML tags for the index pages of the e-mails, extracts their href links, and appends a bit to the front of each URL so they can be accessed directly.
```python
from bs4 import BeautifulSoup
import re
import urllib2
soup = BeautifulSoup(urllib2.urlopen("http://www.419scam.org/emails/"))
tags = soup.find_all(href=re.compile(r"20......../index\.htm"))
links = []
for t in tags:
    links.append("http://www.419scam.org/emails/" + t['href'])
```

're' is Python's regular expressions module. On the fifth line, I told BeautifulSoup to find all tags in the soup whose href attribute matches that regular expression. I chose this regular expression to get only the e-mail index pages rather than all of the href links on the page, since I noticed that the index page links all share that pattern in their URLs.
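To make the filtering concrete, here's a quick check of what that regular expression accepts and rejects; the first href is taken from the question, the second is a hypothetical non-index link:

```python
import re

# The same pattern used in find_all above:
# "20", then eight characters, then "/index.htm"
pattern = re.compile(r"20......../index\.htm")

print(pattern.search("2011-12/01/index.htm") is not None)  # index page link: True
print(pattern.search("faq.htm") is not None)               # other link: False
```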
Having all the proper 'a' tags, I then looped through them, extracting the string of the href attribute by doing t['href'] and appending the rest of the URL to the front of that string, which gives a list of raw string URLs.
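As a side note, that loop-and-append pattern can be written more compactly as a list comprehension. Here's a sketch using the href values shown in the question as stand-ins for the real t['href'] results:

```python
# Sample hrefs as they appear in the question's HTML snippet
hrefs = ["2011-12/01/index.htm", "2011-12/02/index.htm", "2011-12/03/index.htm"]

# Prepend the base URL to every href in one expression
links = ["http://www.419scam.org/emails/" + h for h in hrefs]
```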
Reading through the documentation should give you an idea of how to expand these techniques to grab the individual e-mails.
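To sketch that next step: for every URL in links you would fetch the index page, pull out the hrefs of the individual e-mails, and write each one to its own file. The HTML snippet and href names below are hypothetical stand-ins (I haven't checked how the real index pages name their e-mail links); in the actual script the html string would come from urllib2.urlopen(link).read():

```python
import re

# Hypothetical index-page HTML; really this would be
# urllib2.urlopen(link).read() for one of the links collected above
html = '<a href="scam-001.htm">scam 1</a> <a href="scam-002.htm">scam 2</a>'
base = "http://www.419scam.org/emails/2011-12/01/"

# Extract every href value and build absolute URLs
hrefs = re.findall(r'href="([^"]+)"', html)
urls = [base + h for h in hrefs]

# Turn each URL path into a flat, filesystem-safe filename,
# e.g. "2011-12_01_scam-001.htm"
prefix = "http://www.419scam.org/emails/"
filenames = [u[len(prefix):].replace('/', '_') for u in urls]

# Each e-mail could then be fetched and saved with something like:
# for u, name in zip(urls, filenames):
#     open(name, 'wb').write(urllib2.urlopen(u).read())
```

BeautifulSoup's find_all('a', href=True) would do the same extraction; re.findall is used here only to keep the sketch self-contained.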