python - Follow links with regular expressions

I know how to find the links on a specific page with regular expressions:

    import urllib2
    import re

    url = ""

    page = urllib2.urlopen(url)
    html = page.read()

    links = re.findall(r'"((http|ftp)s?://.*?)"', html)

However, I can't figure out how to follow those links and extract their <p> tags. I tried this:

    for link in links:
        page += urllib2.urlopen(links)
        html += page.read()

    paragraphs = re.findall(r'(<p(.*?)</p>)', html)

    for paragraph in paragraphs:
        print paragraph[0], "\n"

How is this supposed to be done?

(Sidenote: this is a regex question, not a BeautifulSoup question.)

It looks like you have a few small errors in your code snippet. When you use re.findall, it "captures" the expressions in parentheses as groups and returns them as part of each hit. Thus, your links list (get it?) is not an array of strings, but an array of tuples, e.g.,

('', 'http'), ('', 'http') 
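To see the tuple behaviour without fetching anything, here is a small sketch run against a made-up HTML string (the URLs and variable names are placeholders, not taken from the question). It also shows one possible alternative: turning the inner group into a non-capturing group, `(?:...)`, so that re.findall returns plain strings. It is written for Python 3, but these re calls behave the same way in Python 2.

```python
import re

# Hypothetical sample markup; the URLs are placeholders for illustration.
html = '<a href="http://example.com/a">A</a> <a href="https://example.org/b">B</a>'

# Two capturing groups: each hit is a tuple of (whole URL, scheme).
with_groups = re.findall(r'"((http|ftp)s?://.*?)"', html)
print(with_groups)

# Inner group made non-capturing with (?:...): only one group is left,
# so each hit is a plain string.
plain = re.findall(r'"((?:http|ftp)s?://.*?)"', html)
print(plain)
```

With the second pattern, a plain `for link in links:` loop would work without any tuple unpacking.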

So you can update your loop to ignore the second part of each tuple:

    for link, _ in links:
        page = urllib2.urlopen(link)  # plain assignment; += fails on a response object
        html += page.read()
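If you are on Python 3, where urlopen lives in urllib.request instead of urllib2, the loop above might be wrapped roughly like this. This is only a sketch: the names fetch and collect_html are made up for illustration, decoding as UTF-8 is an assumption, and skipping pages that fail to load is a design choice rather than something the question asked for.

```python
from urllib.request import urlopen  # Python 3 home of urlopen


def fetch(url):
    """Download one page and return its body as text (assumes UTF-8)."""
    with urlopen(url) as page:
        return page.read().decode("utf-8", errors="replace")


def collect_html(links, fetch_page=fetch):
    """Concatenate the HTML of every linked page, skipping pages that fail."""
    html = ""
    for link, _scheme in links:  # unpack the (url, scheme) tuples from re.findall
        try:
            html += fetch_page(link)
        except (OSError, ValueError):
            # URLError is a subclass of OSError; ValueError covers malformed URLs.
            continue
    return html
```

Passing the downloader in as fetch_page makes it easy to try the loop on canned pages before pointing it at real URLs, e.g. `collect_html(links)` once links comes from re.findall.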

N.B. You had a typo in the spelling of link (you had links). The same thing goes for paragraphs: the parentheses delineate the saved groups, so each hit is a tuple.
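Here is a rough illustration of the paragraph case on a made-up HTML string (the sample markup and variable names are assumptions). The first pattern is the one from the question; the re.DOTALL flag is an addition so that `.` also matches newlines inside a paragraph. The second pattern is one possible variant with a single group, which returns just the text between the tags.

```python
import re

# Hypothetical sample markup for illustration.
html = '<p>First paragraph.</p>\n<p class="note">Second,\nacross two lines.</p>'

# Pattern from the question: two groups, so each hit is a tuple and
# element [0] is the whole <p>...</p> tag.
paragraphs = re.findall(r'(<p(.*?)</p>)', html, re.DOTALL)
whole_tags = [paragraph[0] for paragraph in paragraphs]

# Variant with one group: \b stops "<p" from also matching tags such as <pre>,
# [^>]* skips any attributes, and the group keeps only the inner text.
inner = re.findall(r'<p\b[^>]*>(.*?)</p>', html, re.DOTALL)

for text in inner:
    print(text, "\n")
```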

