python - Follow links with regular expressions

- August 15, 2012

I know how to find links on a specific page with regular expressions:

    import urllib2
    import re

    url = "www.something.com"
    page = urllib2.urlopen(url)
    html = page.read()
    links = re.findall(r'"((http|ftp)s?://.*?)"', html)

However, I can't figure out how to follow those links and extract their <p> tags. I tried this:

    for link in links:
        page += urllib2.urlopen(links)
        html += page.read()

    paragraphs = re.findall(r'(<p(.*?)</p>)', html)

    for paragraph in paragraphs:
        print paragraph[0], "\n"

How is this supposed to be done?

(Sidenote: this is a regex question, not a BeautifulSoup question.)

It looks like you have a few small errors in your code snippet. When you use re.findall, it “captures” the expressions in parentheses as groups and returns them as part of each hit. Thus, your links list (get it?) is not an array of strings, but an array of tuples, e.g.,

    ('https://s.yimg.com/os/mit/ape/w/d8f6e02/dark/partly_cloudy_day.png', 'http'),
    ('https://s.yimg.com/os/mit/ape/w/d8f6e02/dark/mostly_cloudy_day_night.png', 'http')
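The tuple behaviour is easy to reproduce on a small string. The sketch below uses a made-up HTML snippet (the example.com URLs are placeholders, not from the question). It also shows one possible alternative: making the inner group non-capturing with `(?:...)`, so that re.findall returns plain strings.

```python
import re

# Hypothetical sample HTML, used only for illustration.
html = '<img src="https://example.com/a.png"> <a href="ftp://example.com/b">b</a>'

# Two capturing groups: each hit is a tuple (full URL, scheme prefix).
with_groups = re.findall(r'"((http|ftp)s?://.*?)"', html)

# Inner group made non-capturing: only one group remains,
# so each hit is just the URL string.
urls_only = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

print(with_groups)
print(urls_only)
```

With the non-capturing form there is no second tuple element to discard, so the original `for link in links:` loop header would work unchanged.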

So you can update your loop to ignore the second part of each tuple:

    for link, _ in links:
        # Assign rather than +=: response objects cannot be added together.
        page = urllib2.urlopen(link)
        html += page.read()

N.B. You also had a typo in the spelling of link (you had links). The same thing goes for paragraphs: the parentheses delineate saved groups, so each hit there is a tuple as well.
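Putting the pieces together, here is a minimal sketch of the whole follow-the-links step. It is not the asker's exact code: the `collect_paragraphs` helper, the `fetch` parameter, and the fake pages are all hypothetical names introduced for illustration. Passing the fetching function in keeps the example runnable without network access; in the question's setting it could be something like `lambda u: urllib2.urlopen(u).read()`. The paragraph pattern is also an assumption: it uses a single group plus re.DOTALL so that paragraphs spanning several lines are matched.

```python
import re

# Non-capturing scheme group, so findall returns plain URL strings.
LINK_RE = re.compile(r'"((?:http|ftp)s?://.*?)"')
# One group for the paragraph body; \b keeps <pre> and similar tags from matching.
PARA_RE = re.compile(r'<p\b[^>]*>(.*?)</p>', re.DOTALL | re.IGNORECASE)

def collect_paragraphs(start_html, fetch):
    """Follow every link found in start_html and gather the <p> bodies."""
    paragraphs = []
    for link in LINK_RE.findall(start_html):
        try:
            html = fetch(link)
        except IOError:
            # Skip links that cannot be fetched.
            continue
        paragraphs.extend(PARA_RE.findall(html))
    return paragraphs

# Hypothetical stand-in for real pages, so the sketch runs offline.
FAKE_PAGES = {
    "http://example.com/a": "<html><p>first</p><p class='x'>second</p></html>",
    "https://example.com/b": "<p>third\nline</p>",
}

def fake_fetch(url):
    if url not in FAKE_PAGES:
        raise IOError("no such page: " + url)
    return FAKE_PAGES[url]

start = ('<a href="http://example.com/a">a</a> '
         '<a href="https://example.com/b">b</a> '
         '<a href="ftp://example.com/missing">c</a>')
result = collect_paragraphs(start, fake_fetch)
print(result)
```

Each page is searched separately here instead of being appended to one growing `html` string, which keeps a `<p>` match from accidentally spanning two pages.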
