Write a Python script that goes through the links on a page recursively
I'm doing a school project comparing scam e-mails. I found the website http://www.419scam.org/emails/ and I want to save every scam in a separate document so I can analyse them later on. Here's my code so far:
```python
import urllib2

address = 'http://www.419scam.org/emails/'
html = urllib2.urlopen(address).read()

# Save the raw HTML of the index page to a file
f = open('test.txt', 'wb')
f.write(html)
f.close()
```

This saves me the whole HTML file as text. Now I need to strip the file and save the contents of the HTML links to the scams:
```html
<a href="2011-12/01/index.htm">01</a>
<a href="2011-12/02/index.htm">02</a>
<a href="2011-12/03/index.htm">03</a>
```

etc.
If I manage that, I'd still need to go one step further and open and save each of those hrefs. Any idea how to do it all in one Python script?

Thank you!
You picked the right tool in BeautifulSoup. Technically you could do it all in one script, but you might want to segment it, because it looks like you'll be dealing with tens of thousands of e-mails, each of which is a separate request - and that will take a while.
The BeautifulSoup documentation is going to help you a lot, but here's a little code snippet to get you started. It gets all of the HTML tags for the index pages of the e-mails, extracts their href links, and appends a bit to the front of each URL so they can be accessed directly.
```python
from bs4 import BeautifulSoup
import re
import urllib2
soup = BeautifulSoup(urllib2.urlopen("http://www.419scam.org/emails/"))
tags = soup.find_all(href=re.compile(r"20......../index\.htm"))
links = []
for t in tags:
    links.append("http://www.419scam.org/emails/" + t['href'])
```

're' is Python's regular expressions module. On the fifth line, I told BeautifulSoup to find all tags in the soup whose href attribute matches that regular expression. I chose this regular expression to get only the e-mail index pages rather than all of the href links on the page, since I noticed that the index page links all share that pattern in their URLs.
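To make the filtering concrete, here's a quick check of what that regular expression accepts and rejects; the first href is taken from the question, the second is a hypothetical non-index link:

```python
import re

# The same pattern used in find_all above:
# "20", then eight characters, then "/index.htm"
pattern = re.compile(r"20......../index\.htm")

print(pattern.search("2011-12/01/index.htm") is not None)  # index page link: True
print(pattern.search("faq.htm") is not None)               # other link: False
```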
Having all the proper 'a' tags, I then looped through them, extracting the string of the href attribute by doing t['href'] and appending the rest of the URL to the front of that string, which gives a list of raw string URLs.
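As a side note, that loop-and-append pattern can be written more compactly as a list comprehension. Here's a sketch using the href values shown in the question as stand-ins for the real t['href'] results:

```python
# Sample hrefs as they appear in the question's HTML snippet
hrefs = ["2011-12/01/index.htm", "2011-12/02/index.htm", "2011-12/03/index.htm"]

# Prepend the base URL to every href in one expression
links = ["http://www.419scam.org/emails/" + h for h in hrefs]
```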
Reading through the documentation should give you an idea of how to expand these techniques to grab the individual e-mails.
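To sketch that next step: for every URL in links you would fetch the index page, pull out the hrefs of the individual e-mails, and write each one to its own file. The HTML snippet and href names below are hypothetical stand-ins (I haven't checked how the real index pages name their e-mail links); in the actual script the html string would come from urllib2.urlopen(link).read():

```python
import re

# Hypothetical index-page HTML; really this would be
# urllib2.urlopen(link).read() for one of the links collected above
html = '<a href="scam-001.htm">scam 1</a> <a href="scam-002.htm">scam 2</a>'
base = "http://www.419scam.org/emails/2011-12/01/"

# Extract every href value and build absolute URLs
hrefs = re.findall(r'href="([^"]+)"', html)
urls = [base + h for h in hrefs]

# Turn each URL path into a flat, filesystem-safe filename,
# e.g. "2011-12_01_scam-001.htm"
prefix = "http://www.419scam.org/emails/"
filenames = [u[len(prefix):].replace('/', '_') for u in urls]

# Each e-mail could then be fetched and saved with something like:
# for u, name in zip(urls, filenames):
#     open(name, 'wb').write(urllib2.urlopen(u).read())
```

BeautifulSoup's find_all('a', href=True) would do the same extraction; re.findall is used here only to keep the sketch self-contained.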