Perl WWW::Mechanize: Slow down requests to avoid HTTP Code 429


I've written a Perl script to fetch and parse a webpage, fill in some forms, and collect some information, but after a while the server denied me with HTTP error 429 Too Many Requests. I sent too many requests in a short amount of time, so the server has blacklisted my IP.

How can I "slow down" my requests/script to avoid this happening again and not hurt anyone? Is there a way to do this with the Perl module WWW::Mechanize?

    sub getlinksofall {

        # Walk every result page and collect the links of interest
        for my $i ( 1 .. $maxpages ) {

            $mech->follow_link( url_regex => qr/page$i/i );
            push @links, $mech->find_all_links(
                url_regex => qr/http:\/\/www\.example\.com\/somestuffs\//i
            );
        }

        foreach my $links (@links) {
            push @links2, $links->url();
        }

        # Drop duplicate URLs
        @new_stuffs = uniq @links2;
    }

    sub getnumberofpages {
        push @numberofpages, $mech->content =~ m/\/page(\d+)"/gi;
        $maxpages = ( sort { $b <=> $a } @numberofpages )[0];
    }

    sub getdataabout {

        # One request per collected link
        foreach my $stuff (@new_stuffs) {

            $mech->get($stuff);

            $g = $mech->content;
            $t = $mech->content;
            $s = $mech->content;

            # ... regex matching and DBI stuff ...
        }
    }

With these loops there are thousands of links, and I just want to slow things down. Is a "sleep" command in these loops enough for this?

You need to check whether the site you are scraping has a service agreement that allows you to use it in this way. Because bandwidth costs money, most sites prefer to restrict access to real human operators or to legitimate index engines like Google.

You should also take a look at the robots.txt file for the site you're leeching, which will have details on exactly what automated access is permitted. Take a look at www.robotstxt.org for more information.
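As a rough illustration of what checking robots.txt can look like in code, here is a minimal sketch using the WWW::RobotRules module. The agent name and the robots.txt content are made up for the example; in a real script you would download the site's actual robots.txt first and pass that text to parse.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Assumption: 'my-scraper/0.1' is a placeholder agent name.
my $rules = WWW::RobotRules->new('my-scraper/0.1');

# Assumption: sample robots.txt content, NOT the real file of any site.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

# Register the rules under the URL the file would have been fetched from.
$rules->parse( 'http://www.example.com/robots.txt', $robots_txt );

for my $url (
    'http://www.example.com/somestuffs/page1',
    'http://www.example.com/private/data',
  )
{
    # allowed() is true when the rules permit this agent to fetch the URL
    print $rules->allowed($url) ? "allowed: $url\n" : "blocked: $url\n";
}
```

Skipping every URL for which allowed returns false keeps the script within what the site publishes as acceptable for robots.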

A simple sleep 30 between requests will probably be okay to get you past most rules, but don't reduce the period below 30 seconds.
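A plain sleep 30 inside each loop is enough. If you would rather not wait the full 30 seconds on top of the time spent parsing each page, one option is a small throttle that only sleeps for the remainder of the interval. This is just a sketch: make_throttle is a hypothetical helper name, not part of WWW::Mechanize, and it uses only the core Time::HiRes module.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper: returns a closure that blocks until at least
# $min_delay seconds have passed since the closure was last called.
sub make_throttle {
    my ($min_delay) = @_;
    my $last_call;    # undef until the first call, so the first call never waits

    return sub {
        if ( defined $last_call ) {
            my $wait = $min_delay - ( time() - $last_call );
            sleep($wait) if $wait > 0;
        }
        $last_call = time();
        return;
    };
}

# 30 seconds between requests, as suggested above
my $throttle = make_throttle(30);

# Intended use inside the loops from the question (not executed here):
#
#     foreach my $stuff (@new_stuffs) {
#         $throttle->();
#         $mech->get($stuff);
#         ...
#     }
```

The same $throttle->() call would go before $mech->follow_link in getlinksofall, so that every request to the server passes through the same delay.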

There is also a subclass of LWP::UserAgent called LWP::RobotUA that is intended for situations like this. It may be straightforward to get WWW::Mechanize to use this instead of the base class.
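For reference, this is roughly what LWP::RobotUA looks like when used on its own. It is a sketch that does not cover wiring it into WWW::Mechanize; the agent name, the contact address, and the fetch_page helper are placeholders for the example. Note that delay is given in minutes, not seconds.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Assumption: placeholder agent name and contact address; use your own.
# LWP::RobotUA requires both, and the address must contain an "@".
my $ua = LWP::RobotUA->new(
    agent => 'my-scraper/0.1',
    from  => 'me@example.com',
);

$ua->delay(0.5);      # minimum gap between requests to one server, in MINUTES
$ua->use_sleep(1);    # sleep until the gap has passed (this is the default)

# Hypothetical helper: fetch one page and return its decoded content,
# or undef if the request failed or was refused because of robots.txt.
sub fetch_page {
    my ($url) = @_;
    my $res = $ua->get($url);
    return $res->is_success ? $res->decoded_content : undef;
}

# Example call (performs real network requests, so it is commented out):
# my $html = fetch_page('http://www.example.com/somestuffs/page1');
```

The user agent fetches and honours each server's robots.txt by itself and spaces out requests to the same server, so no explicit sleep calls are needed in the loops.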

