Python - regex with optional block of text
I'm using a regex to parse the structured text below, with caret symbols marking what I'm trying to match:
block 1
^^^^^^^
    subblock 1.1
        attrib a=a1
    subblock 1.2
        attrib b=b1
                 ^^
block 2
    subblock 2.1
        attrib a=a2
block 3
^^^^^^^
    subblock 3.1
        attrib a=a3
    subblock 3.2
        attrib b=b3
                 ^^
A subblock may or may not appear inside a block; for example, block 2 has no subblock 2.2.
The expected match is [(block1, b1), (block3, b3)].
I tried:
/(capture block#)[\s\S]*?attrib\sb=(capture b#)/gm
but it ends up matching [(block1, b1), (block2, b3)].
Where am I going wrong with the regex?
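The behavior described above can be reproduced with a short sketch. The concrete pattern here is an assumption: the question only gives placeholders, so `(block\s*\d+)` and `(\w+)` stand in for "capture block#" and "capture b#". The lazy `[\s\S]*?` is free to run past the next block header, which is how block 2 ends up paired with b3.

```python
import re

# Sample text rebuilt from the question; the indentation is illustrative.
text = (
    "block 1\n"
    " subblock 1.1\n"
    " attrib a=a1\n"
    " subblock 1.2\n"
    " attrib b=b1\n"
    "block 2\n"
    " subblock 2.1\n"
    " attrib a=a2\n"
    "block 3\n"
    " subblock 3.1\n"
    " attrib a=a3\n"
    " subblock 3.2\n"
    " attrib b=b3"
)

# Assumed concrete form of the attempted pattern. Nothing stops the lazy
# [\s\S]*? from crossing a "block N" header while looking for "attrib b=".
lazy = re.compile(r'(block\s*\d+)[\s\S]*?attrib\sb=(\w+)')

lazy_result = lazy.findall(text)
print(lazy_result)  # [('block 1', 'b1'), ('block 2', 'b3')]
```

The match that starts at block 2 finds no `attrib b=` inside that block, so it keeps expanding into block 3 and consumes its header along the way.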
You can use:
(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)
See the regex demo.
The regex is based on the unroll-the-loop technique. Here is an explanation:
(?m) - multiline modifier, to make ^ match at the beginning of a line
(^block\s*\d+) - match and capture "block" + optional whitespace(s) + 1+ digits (Group 1)
.* - matches the rest of the line (the DOTALL option should not be on)
(?:\n(?!block\s*\d).*)* - match any following lines that do not start with the word "block" followed by optional whitespace(s) and a digit (this way, the block boundary is set)
\battrib\s*b=(\w+) - match the whole word "attrib" followed by 0+ whitespaces and a literal "b=", then match and capture 1+ alphanumerics or underscores with (\w+) (Group 2; note: this can be adjusted to fit the real data)
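The breakdown above can also be kept next to the pattern itself. This is a minimal sketch, assuming you are free to pass `re.VERBOSE` and `re.MULTILINE` as flags instead of the inline `(?m)`; the variable names and the shortened sample text are illustrative.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE so each piece
# carries its own comment. re.MULTILINE replaces the inline (?m).
pattern = re.compile(r"""
    (^block\s*\d+)            # Group 1: "block", optional whitespace, 1+ digits
    .*                        # rest of the header line
    (?:\n(?!block\s*\d).*)*   # following lines that do not start a new block
    \battrib\s*b=             # whole word "attrib", 0+ whitespace, literal "b="
    (\w+)                     # Group 2: the value of attribute b
""", re.MULTILINE | re.VERBOSE)

sample = (
    "block 1\n"
    " subblock 1.1\n"
    " attrib a=a1\n"
    " subblock 1.2\n"
    " attrib b=b1\n"
    "block 2\n"
    " subblock 2.1\n"
    " attrib a=a2\n"
    "block 3\n"
    " subblock 3.2\n"
    " attrib b=b3"
)

matches = pattern.findall(sample)
print(matches)  # [('block 1', 'b1'), ('block 3', 'b3')]
```

Because the negative lookahead stops the line loop at every block header, block 2 has no `attrib b=` within reach and is skipped rather than borrowing b3 from block 3.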
import re

p = re.compile(r'(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)')
s = "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\nblock 2\n subblock 2.1\n attrib a=a2\nblock 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
# findall returns one (block, value) tuple per block that contains "attrib b="
print(p.findall(s))
# [('block 1', 'b1'), ('block 3', 'b3')]
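If a single regex becomes hard to maintain, a line-by-line pass is one possible alternative. This is a different technique from the unrolled-loop regex above, not part of the original answer, and the helper name `find_b_attribs` is hypothetical. It assumes block headers start at the beginning of a line, as in the sample text.

```python
import re

def find_b_attribs(text):
    """Pair each 'attrib b=...' value with the block header above it."""
    results = []
    current_block = None
    for line in text.splitlines():
        # re.match anchors at the start of the line, so indented
        # "subblock N.M" lines are not mistaken for block headers.
        header = re.match(r'block\s*\d+', line)
        if header:
            current_block = header.group(0)
            continue
        attr = re.search(r'\battrib\s*b=(\w+)', line)
        if attr and current_block is not None:
            results.append((current_block, attr.group(1)))
    return results

sample = (
    "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\n"
    "block 2\n subblock 2.1\n attrib a=a2\n"
    "block 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
)
print(find_b_attribs(sample))  # [('block 1', 'b1'), ('block 3', 'b3')]
```

One difference in behavior: if a block contained several `attrib b=` lines, this version would report each of them, whereas the regex reports one match per block.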