python - Follow links with regular expressions
I know how to find the links on a specific page with regular expressions:
import urllib2
import re
url = "www.something.com"
page = urllib2.urlopen(url)
html = page.read()
links = re.findall(r'"((http|ftp)s?://.*?)"', html)
However, I can't figure out how to follow the links and extract the <p> tags. I tried this:
for link in links:
    page += urllib2.urlopen(links)
    html += page.read()
paragraphs = re.findall(r'(<p(.*?)</p>)', html)
for paragraph in paragraphs:
    print paragraph[0], "\n"
How is this supposed to be done?
(Side note: this is a regex question, not a BeautifulSoup question.)
It looks like you have a few small errors in your code snippet. When you use re.findall, it “captures” the expressions in parentheses as groups and returns them as part of each hit. Thus, your links list (get it?) is not an array of strings, but an array of tuples, e.g.,
('https://s.yimg.com/os/mit/ape/w/d8f6e02/dark/partly_cloudy_day.png', 'http'), ('https://s.yimg.com/os/mit/ape/w/d8f6e02/dark/mostly_cloudy_day_night.png', 'http')
So you can update your loop to ignore the second part of each tuple like this:
for link, _ in links:
    # plain assignment here: a response object cannot be added to another with +=
    page = urllib2.urlopen(link)
    html += page.read()
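The tuple behaviour described above can be checked offline. The sketch below is written for Python 3 and uses a made-up HTML string (the `html` value and the example.com URLs are assumptions, not taken from the question). It also shows one possible alternative to unpacking: making the inner group non-capturing with `(?:...)`, so that `re.findall` returns plain strings.

```python
import re

# Hypothetical sample HTML standing in for the downloaded page.
html = '<a href="https://example.com/a">A</a> <a href="ftp://example.com/b">B</a>'

# Two capturing groups: findall returns a list of (url, scheme) tuples.
with_groups = re.findall(r'"((http|ftp)s?://.*?)"', html)

# (?:...) is non-capturing, so only one group remains and findall
# returns a plain list of strings.
plain = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

print(with_groups)
print(plain)
```

With the non-capturing form, the loop can stay as `for link in links:` without any tuple unpacking.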
N.B. You also had a typo in the spelling of link (you had links). The same thing goes for paragraphs: the parentheses delineate saved groups.
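Putting the pieces together, here is a minimal sketch of following each link and collecting the paragraph text. It assumes Python 3, where `urllib.request.urlopen` replaces `urllib2.urlopen`. The names `fetch`, `paragraphs_from_links`, `LINK_RE` and `PARA_RE` are hypothetical, and the paragraph pattern is a slightly stricter variant of the one in the question, not the asker's original.

```python
import re
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen

# Non-capturing scheme group, so findall yields plain URL strings.
LINK_RE = re.compile(r'"((?:http|ftp)s?://.*?)"')
# DOTALL lets a paragraph span several lines; \b keeps <pre> from matching.
PARA_RE = re.compile(r'<p\b[^>]*>(.*?)</p>', re.DOTALL | re.IGNORECASE)


def fetch(url):
    """Download a URL and return its body as text (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def paragraphs_from_links(start_html, fetch=fetch):
    """Follow every link found in start_html and collect the <p> contents."""
    paragraphs = []
    for link in LINK_RE.findall(start_html):
        try:
            linked_html = fetch(link)
        except (OSError, ValueError):
            continue  # skip links that fail to load
        paragraphs.extend(PARA_RE.findall(linked_html))
    return paragraphs
```

Each linked page is read into its own variable instead of being appended to the starting page's HTML, so the paragraphs come only from the followed links. Passing `fetch` as a parameter makes it possible to try the function without network access by supplying a stand-in.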