Python - regex with optional block of text


I'm using a regex to parse the structured text below, with caret symbols marking what I'm trying to match:

block 1
^^^^^^^
    subblock 1.1
        attrib a=a1
    subblock 1.2
        attrib b=b1
                 ^^
block 2
    subblock 2.1
        attrib a=a2
block 3
^^^^^^^
    subblock 3.1
        attrib a=a3
    subblock 3.2
        attrib b=b3
                 ^^

A subblock may or may not appear inside a block; for example, subblock 2.2 is absent above.

The expected match is [(block1, b1), (block3, b3)].

I tried:

/(capture block#)[\s\S]*?attrib\sb=(capture b#)/gm

But it ends up matching [(block1, b1), (block2, b3)].

Where am I going wrong with the regex?

You can use

(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+) 

See the regex demo.

The regex is based on the unroll-the-loop technique. Here is the explanation:

  • (?m) - the multiline modifier, which makes ^ match at the beginning of each line
  • (^block\s*\d+) - matches and captures block + optional whitespace + 1+ digits (Group 1)
  • .* - matches the rest of the line (the DOTALL option should not be on)
  • (?:\n(?!block\s*\d).*)* - matches zero or more following lines that do not start with the word block followed by optional whitespace and a digit (this way, the block boundary is set)
  • \battrib\s*b=(\w+) - matches the whole word attrib followed by 0+ whitespace characters and a literal b=, then matches and captures 1+ alphanumeric or underscore characters into Group 2 (note: (\w+) can be adjusted to the real data)
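The same pattern can also be written in verbose mode, so that each part listed above carries its own comment. This is only a sketch: the names BLOCK_WITH_B and sample are made up for illustration, the shortened sample text is an assumption, and the flags are passed as arguments instead of the inline (?m).

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
# BLOCK_WITH_B is a hypothetical name chosen for this sketch.
BLOCK_WITH_B = re.compile(
    r"""
    (^block\s*\d+)            # Group 1: block header at the start of a line
    .*                        # rest of the header line
    (?:\n(?!block\s*\d).*)*   # following lines that do not start a new block
    \battrib\s*b=(\w+)        # Group 2: value of "attrib b"
    """,
    re.MULTILINE | re.VERBOSE,
)

# Shortened, assumed sample: block 2 has no "attrib b" line.
sample = (
    "block 1\n"
    "    attrib b=b1\n"
    "block 2\n"
    "    attrib a=a2\n"
    "block 3\n"
    "    attrib b=b3"
)

pairs = BLOCK_WITH_B.findall(sample)
print(pairs)
```

Block 2 should be skipped here, because the negative lookahead keeps the match from running on into block 3.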

Python demo:

import re

p = re.compile(r'(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)')
s = "block 1\n    subblock 1.1\n        attrib a=a1\n    subblock 1.2\n        attrib b=b1\nblock 2\n    subblock 2.1\n        attrib a=a2\nblock 3\n    subblock 3.1\n        attrib a=a3\n    subblock 3.2\n        attrib b=b3"
print(p.findall(s))
# [('block 1', 'b1'), ('block 3', 'b3')]
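For comparison, here is a sketch of why the lazy attempt from the question crosses block boundaries: [\s\S]*? stops at the first attrib b= it can reach, even when that sits in a later block. The concrete LAZY pattern is an assumption, since the question only gave pseudocode placeholders for the two capture groups.

```python
import re

text = (
    "block 1\n    subblock 1.1\n        attrib a=a1\n"
    "    subblock 1.2\n        attrib b=b1\n"
    "block 2\n    subblock 2.1\n        attrib a=a2\n"
    "block 3\n    subblock 3.1\n        attrib a=a3\n"
    "    subblock 3.2\n        attrib b=b3"
)

# Assumed concrete form of the question's pseudocode pattern.
LAZY = re.compile(r"(?m)^(block\s*\d+)[\s\S]*?attrib\sb=(\w+)")

# Pattern from the answer: the lookahead stops at the next block header.
UNROLLED = re.compile(
    r"(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)"
)

lazy_result = LAZY.findall(text)
unrolled_result = UNROLLED.findall(text)

print(lazy_result)      # block 2 gets paired with b3 from block 3
print(unrolled_result)  # block 2 is skipped
```

Nothing in the lazy pattern forbids it from running past a line that starts a new block, which is what the negative lookahead in the answer's pattern adds.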
