perl - How do I extract data from a quoted-printable encoded HTML table?


I know there are many other posts related to the HTML::TableExtract module, but all of them have been at a higher level than I understand at the moment. I have a small table (3 rows, 5 columns) in an email, and I want to scrape the data in the second row. However, with my limited knowledge of Perl, I have been having a lot of trouble following the documentation online.

The table looks like this:

    time      notspam    probablespam    likelyspam    spam
    2012-05   10252205   62192           55995         3797710
    total     ""         ""              ""            ""

Here is a snippet of the code I am trying to parse. It is the second of the 3 rows:

    <tr class=3dmailviewunreadodd>
     <td  class=3dreportviewheader align=3d"left"> =09      2012-05 </td> =20=20
    =20=20=20=20      <td align=3d'right' class=3d'mailviewrowreadeven'> 10252205 =20=20=20=20 </td> =20=20
    =20=20=20=20      <td align=3d'right' class=3d'mailviewrowreadeven'> 62192 =20=20=20=20 </td> =20=20
    =20=20=20=20      <td align=3d'right' class=3d'mailviewrowreadeven'> 55995 =20=20=20=20 </td> =20=20
    =20=20=20=20      <td align=3d'right' class=3d'mailviewrowreadeven'> 3797710 =20=20=20=20 </td> =20=20
    </tr>

Here is what I have tried so far. I used the example on the HTML::TableExtract page and modified it to fit my needs. It's not returning anything:

    use HTML::TableExtract;

    $te = HTML::TableExtract->new(
        headers => [qw(notspam  probablespam  likelyspam  spam)]);
    $html = 'test.html';
    $te->parse($html);

    # examine matching tables
    foreach $ts ($te->tables) {
        print "table (", join(',', $ts->coords), "):\n";
        foreach $row ($ts->rows) {
            print join(',', @$row), "\n";
        }
    }

I want to pull out the date (2012-05) and the numbers (10252205, 62192, 55995, 3797710) and store them in variables. Should I be extracting the data using the depth and count arguments?
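A likely reason the attempt prints nothing: `parse` expects HTML text, but it is handed the literal string 'test.html', and the table itself is still quoted-printable encoded. The header list also omits the time column, and only the listed header columns are returned. The following is a minimal sketch of one way to address this, not a definitive fix. The sample input, its header row, and the `extract_rows` helper are hypothetical, since the question does not show the header markup. `depth` and `count` pick a table by nesting level and position (both counted from 0), so matching on headers is usually less fragile when the layout of the full email is unknown.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);
use HTML::TableExtract;

# Hypothetical sample: a cut-down, quoted-printable encoded table.
# The header row is an assumption; the question only shows the data row.
my $encoded = <<'END_HTML';
<table>
<tr><th>time</th><th>notspam</th><th>probablespam</th><th>likelyspam</th><th>spam</th></tr>
<tr class=3Dmailviewunreadodd>
<td class=3Dreportviewheader align=3D"left"> =09 2012-05 </td>
<td align=3D'right'> 10252205 =20=20 </td>
<td align=3D'right'> 62192 =20=20 </td>
<td align=3D'right'> 55995 =20=20 </td>
<td align=3D'right'> 3797710 =20=20 </td>
</tr>
</table>
END_HTML

# Hypothetical helper: decode the text, then collect trimmed rows
# from every table whose headers match.
sub extract_rows {
    my ($qp_html) = @_;

    # "time" is listed so that the date column is kept as well.
    # Alternative (values are guesses): new(depth => 0, count => 0)
    my $te = HTML::TableExtract->new(
        headers => [qw(time notspam probablespam likelyspam spam)],
    );
    $te->parse(decode_qp($qp_html));    # parse() takes HTML text, not a file name
    $te->eof;

    my @rows;
    for my $ts ($te->tables) {
        for my $row ($ts->rows) {
            my @cells = map {
                my $cell = defined $_ ? $_ : '';
                $cell =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
                $cell;
            } @$row;
            push @rows, \@cells;
        }
    }
    return @rows;
}

my ($first) = extract_rows($encoded);
die "no matching table found\n" unless $first;

# Store the date and the four counts in separate variables.
my ($time, $notspam, $probablespam, $likelyspam, $spam) = @$first;
print "$time: $notspam $probablespam $likelyspam $spam\n";
```

To read from a file instead, slurp its contents into a string first and pass that string to `extract_rows`.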

This works for the example data. (When run against the full email, it captures too much, but that's all I can do with partial HTML.)

    use strictures;
    use File::Slurp qw(read_file);
    use MIME::QuotedPrint qw(decode_qp);
    use Web::Query qw();

    my $w = Web::Query->new_from_html(decode_qp read_file 'so10883053.html');
    my @data = $w->find('.mailviewunreadodd > *')->text;
    # (
    #     " 2012-05 ",
    #       10252205 ,
    #          62192 ,
    #          55995 ,
    #        3797710
    # )
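The `decode_qp` call is the step that turns the quoted-printable escapes back into plain characters before the HTML is parsed. As a small illustration using only the core MIME::QuotedPrint module, here is that step on its own. The input is one cell adapted from the question, written with the uppercase `=3D` form that encoders normally emit.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# One table cell, still quoted-printable encoded (adapted from the question).
my $encoded = "<td align=3D'right' class=3D'mailviewrowreadeven'> 10252205 =20=20=20=20 </td>";

# "=3D" decodes to "=" and each "=20" decodes to a single space.
my $decoded = decode_qp($encoded);

print "$decoded\n";
```

After decoding, the attributes read `align='right'` again, which is the form an HTML parser or a CSS selector expects.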

Instead of messing around with manual email decoding as I showed in the code, you should use a high-level parser such as Courriel.
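A rough sketch of what that could look like, assuming Courriel's `parse`, `html_body_part` and `content` methods behave as its documentation describes. The message below is a hypothetical minimal email; a real one would be read from a file or mailbox.

```perl
use strict;
use warnings;
use Courriel;

# Hypothetical single-part HTML email with a quoted-printable body.
my $raw_email = join "\r\n",
    'From: reports@example.com',
    'To: me@example.com',
    'Subject: Spam report',
    'MIME-Version: 1.0',
    'Content-Type: text/html; charset=us-ascii',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    q{<table><tr class=3Dmailviewunreadodd><td> 2012-05 </td></tr></table>},
    '';

# Courriel handles the MIME structure and the transfer encoding.
my $email = Courriel->parse( text => $raw_email );
my $part  = $email->html_body_part
    or die "No HTML part found\n";

# content() should return the body with quoted-printable already decoded.
my $html = $part->content;
print $html;
```

The decoded `$html` string can then be handed to Web::Query or HTML::TableExtract without a separate `decode_qp` call.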

