perl - How do I extract data from a quoted-printable encoded HTML table?
I know there are many other posts related to the HTML::TableExtract module, but most of them have been at a higher level than I can understand at the moment. I have a small table (3 rows, 5 columns) from an email, and I want to scrape the data in the second row. However, with my limited knowledge of Perl, I have been having a lot of trouble following the documentation online.
The table looks like this:

    time     notspam   probablespam  likelyspam  spam
    2012-05  10252205  62192         55995       3797710
    total    ""        ""            ""          ""

Here is a snippet of the HTML I am trying to parse. It is the second of the 3 rows:
    <tr class=3Dmailviewunreadodd>
    <td class=3Dreportviewheader align=3D"left">
    =09 2012-05
    </td>
    =20=20
    =20=20=20=20
    <td align=3D'right' class=3D'mailviewrowreadeven'>
    10252205
    =20=20=20=20
    </td>
    =20=20
    =20=20=20=20
    <td align=3D'right' class=3D'mailviewrowreadeven'>
    62192
    =20=20=20=20
    </td>
    =20=20
    =20=20=20=20
    <td align=3D'right' class=3D'mailviewrowreadeven'>
    55995
    =20=20=20=20
    </td>
    =20=20
    =20=20=20=20
    <td align=3D'right' class=3D'mailviewrowreadeven'>
    3797710
    =20=20=20=20
    </td>
    =20=20
    </tr>

Here is what I have tried so far. I used the example on the HTML::TableExtract page and modified it to fit my needs. It's not returning anything:
    use HTML::TableExtract;

    $te = HTML::TableExtract->new( headers => [qw(notspam probablespam likelyspam spam)] );
    $html = 'test.html';
    $te->parse($html);

    # Examine all matching tables
    foreach $ts ($te->tables) {
        print "Table (", join(',', $ts->coords), "):\n";
        foreach $row ($ts->rows) {
            print join(',', @$row), "\n";
        }
    }

I want to pull out the date (2012-05) and the numbers (10252205, 62192, 55995, 3797710) and store them in variables. Should I be extracting the data using the depth and count arguments?
This works for the example data. (When run against the full email, it may capture too much, but that is all I can do with partial HTML.)
    use strictures;
    use File::Slurp qw(read_file);
    use MIME::QuotedPrint qw(decode_qp);
    use Web::Query qw();

    # Decode the quoted-printable body first, then query the resulting HTML.
    my $w = Web::Query->new_from_html(decode_qp read_file 'so10883053.html');

    # Text of every direct child of the row with class "mailviewunreadodd".
    my @data = $w->find('.mailviewunreadodd > *')->text;
    # (
    #     " 2012-05 ",
    #     10252205 ,
    #     62192 ,
    #     55995 ,
    #     3797710
    # )

Instead of messing around with manual email decoding as I showed in the code, you should use a high-level parser such as Courriel.
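If you would rather stay with HTML::TableExtract as in the question, two things probably explain the empty result. First, `$te->parse($html)` parses the literal string 'test.html' as markup; for a file name you would use `parse_file`, or read the file into a string first. Second, the body is still quoted-printable, so attributes read `class=3D...` until they are decoded. Below is a minimal sketch under some assumptions: only the middle row comes from the question, while the `<table>` wrapper, the header row and the "total" row are markup I made up so the sample runs on its own. It also assumes the table is the first top-level table (depth 0, count 0), which may not hold for the real email.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);
use HTML::TableExtract;

# Quoted-printable HTML. Only the middle <tr> is taken from the question;
# the surrounding table, header row and "total" row are assumed markup.
my $qp = <<'END_QP';
<table>
<tr><td>time</td><td>notspam</td><td>probablespam</td>
<td>likelyspam</td><td>spam</td></tr>
<tr class=3Dmailviewunreadodd>
<td class=3Dreportviewheader align=3D"left">=09 2012-05 </td>
<td align=3D'right' class=3D'mailviewrowreadeven'> 10252205 =20=20=20=20</td>
<td align=3D'right' class=3D'mailviewrowreadeven'> 62192 =20=20=20=20</td>
<td align=3D'right' class=3D'mailviewrowreadeven'> 55995 =20=20=20=20</td>
<td align=3D'right' class=3D'mailviewrowreadeven'> 3797710 =20=20=20=20</td>
</tr>
<tr><td>total</td><td></td><td></td><td></td><td></td></tr>
</table>
END_QP

# Turn "=3D" back into "=", "=20" into a space, "=09" into a tab.
my $html = decode_qp($qp);

# Select the table by position rather than by headers.
my $te = HTML::TableExtract->new( depth => 0, count => 0 );
$te->parse($html);    # parse() takes the HTML text itself, not a file name

my ($table) = $te->tables or die "no table found\n";
my @rows = $table->rows;

# Second row (index 1): trim the padding around each cell and keep the values.
my ( $date, $notspam, $probablespam, $likelyspam, $spam ) = map {
    my $cell = defined $_ ? $_ : '';
    $cell =~ s/^\s+|\s+$//g;
    $cell;
} @{ $rows[1] };

print "$date: $notspam $probablespam $likelyspam $spam\n";
```

Selecting by depth and count is the simplest thing to demonstrate, but it breaks as soon as the email layout shifts. Matching on `headers` is usually sturdier, with the caveat that header strings are matched as case-insensitive substrings, so 'spam' also occurs inside 'notspam'; check the column order you get back before relying on it.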