Perl WWW::Mechanize: Slow down requests to avoid HTTP Code 429
I've written a Perl script to fetch and parse a webpage, fill in forms and collect information, but after a while I was denied by the server with HTTP error 429 Too Many Requests. I sent too many requests in a short amount of time, so the server IP has been blacklisted.
How can I "slow down" my requests/script to avoid this again and not hurt anyone? Is there a way to do this with the Perl module WWW::Mechanize?
sub getlinksofall {
    for my $i ( 1 .. $maxpages ) {
        $mech->follow_link( url_regex => qr/page$i/i );
        push @links, $mech->find_all_links(
            url_regex => qr/http:\/\/www\.example\.com\/somestuffs\//i );
    }

    foreach my $link (@links) {
        push @links2, $link->url();
    }

    @new_stuffs = uniq @links2;
}

sub getnumberofpages {
    # Find the highest page number linked from the current page.
    push @numberofpages, $mech->content =~ m/\/page(\d+)"/gi;
    $maxpages = ( sort { $b <=> $a } @numberofpages )[0];
}

sub getdataabout {
    foreach my $stuff (@new_stuffs) {
        $mech->get($stuff);
        $g = $mech->content;
        $t = $mech->content;
        $s = $mech->content;
        # ... regex matching and DBI stuff ...
    }
}
These loops go through thousands of links, and I want to slow them down. Would a "sleep" command inside these loops be enough for this?
You need to check whether the site you are scraping has a terms-of-service agreement that allows this kind of use. Because bandwidth costs money, most sites prefer to restrict access to real human operators or legitimate index engines like Google.
You should also take a look at the robots.txt file for the site you're leeching, which will have details on what automated access is permitted. Take a look at www.robotstxt.org for more information.
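As a rough sketch of how those rules can be honoured programmatically, the core WWW::RobotRules module (used by LWP) can parse a site's robots.txt and answer whether a given URL may be fetched. The host name and agent name below are placeholders, not values from the question:

use strict;
use warnings;
use WWW::RobotRules;
use LWP::Simple qw(get);

# Identify the robot by the same agent name it will use for its requests.
my $rules = WWW::RobotRules->new('MyScraper/1.0');

# Fetch and parse the site's robots.txt (www.example.com is a placeholder).
my $robots_url = 'http://www.example.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# Ask before fetching any page.
my $page = 'http://www.example.com/somestuffs/page1';
print "Fetching $page is disallowed by robots.txt\n"
    unless $rules->allowed($page);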
A simple sleep 30 between requests will very probably be okay to get you past most rules, but don't reduce the period below 30.
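For example, a minimal sketch of one of the question's loops with the pause added (reusing the $mech object and @new_stuffs list from the question):

# Pause 30 seconds after each page fetch so the server is not hammered.
sub getdataabout {
    foreach my $stuff (@new_stuffs) {
        $mech->get($stuff);
        my $content = $mech->content;
        # ... regex matching and DBI work here ...
        sleep 30;    # wait before the next request
    }
}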
There is a subclass of LWP::UserAgent called LWP::RobotUA that is intended for situations like this. It may well be straightforward to get WWW::Mechanize to use this instead of its base class.
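A hedged sketch of LWP::RobotUA on its own, to show the behaviour that suggestion relies on; the agent name and e-mail address are placeholders, and wiring this into WWW::Mechanize would still need a small subclass of your own:

use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA requires a robot name and a contact address.
# Both values below are placeholders.
my $ua = LWP::RobotUA->new(
    agent => 'MyScraper/1.0',
    from  => 'you@example.com',
);

# Wait at least 0.5 minutes (30 seconds) between requests to the same host;
# use_sleep(1) makes the agent sleep instead of returning an error response.
$ua->delay(0.5);
$ua->use_sleep(1);

# Requests are automatically checked against robots.txt and rate-limited.
my $response = $ua->get('http://www.example.com/somestuffs/page1');
print $response->status_line, "\n";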