Python - Regex with optional block of text
I'm using a regex to parse the structured text below, with the caret symbols marking what I'm trying to match:
block 1
^^^^^^^
  subblock 1.1
    attrib a=a1
  subblock 1.2
    attrib b=b1
             ^^
block 2
  subblock 2.1
    attrib a=a2
block 3
^^^^^^^
  subblock 3.1
    attrib a=a3
  subblock 3.2
    attrib b=b3
             ^^

A subblock may or may not appear inside a block, e.g. subblock 2.2.
The expected match is [(block1, b1), (block3, b3)].
I tried /(capture block#)[\s\S]*?attrib\sb=(capture b#)/gm but it ends up matching [(block1, b1), (block2, b3)].
Where am I going wrong with this regex?
You can use

(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)
The regex is based on the unroll-the-loop technique. Here is an explanation:
(?m) - multiline modifier to make ^ match the beginning of a line
(^block\s*\d+) - match and capture the word block + optional whitespace(s) + 1+ digits at the start of a line (Group 1)
.* - matches the rest of the line (no DOTALL option should be on)
(?:\n(?!block\s*\d).*)* - match any following lines that do not start with the word block followed by optional whitespace(s) followed by a digit (this way, the block boundary is set)
\battrib\s*b=(\w+) - match the whole word attrib followed by 0+ whitespaces, a literal b=, and then match and capture 1+ alphanumerics or underscores (Group 2) (note: (\w+) can be adjusted per the real data)
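The breakdown above can also be written into the pattern itself with re.VERBOSE, so each part carries its own comment. This is only a sketch: the names PATTERN and sample are made up for illustration, and the sample text assumes the layout shown in the question.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE.
# PATTERN and sample are illustrative names, not part of the original answer.
PATTERN = re.compile(
    r"""
    (^block\s*\d+)             # group 1: block, optional whitespace, 1+ digits at line start
    .*                         # rest of the block header line
    (?:\n(?!block\s*\d).*)*    # following lines that do not start a new block
    \battrib\s*b=(\w+)         # group 2: the value after attrib b=
    """,
    re.MULTILINE | re.VERBOSE,
)

# Sample text assumed from the question; block 2 has no "attrib b=" line.
sample = "\n".join([
    "block 1",
    " subblock 1.1",
    " attrib a=a1",
    " subblock 1.2",
    " attrib b=b1",
    "block 2",
    " subblock 2.1",
    " attrib a=a2",
    "block 3",
    " subblock 3.1",
    " attrib a=a3",
    " subblock 3.2",
    " attrib b=b3",
])

pairs = PATTERN.findall(sample)
print(pairs)  # [('block 1', 'b1'), ('block 3', 'b3')]
```

re.MULTILINE here plays the role of the inline (?m) flag, and re.VERBOSE ignores unescaped whitespace and # comments inside the pattern, which is safe because this pattern contains no literal spaces.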
import re

p = re.compile(r'(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)')
s = "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\nblock 2\n subblock 2.1\n attrib a=a2\nblock 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
print(p.findall(s))
# [('block 1', 'b1'), ('block 3', 'b3')]
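To see why the original attempt pairs block 2 with b3, it may help to run both patterns side by side. The lazy pattern below is only a guess at how the placeholders in the question were filled in, and the names lazy, unrolled and sample are illustrative.

```python
import re

# Sample text assumed from the question; block 2 has no "attrib b=" line.
sample = (
    "block 1\n subblock 1.1\n attrib a=a1\n subblock 1.2\n attrib b=b1\n"
    "block 2\n subblock 2.1\n attrib a=a2\n"
    "block 3\n subblock 3.1\n attrib a=a3\n subblock 3.2\n attrib b=b3"
)

# Assumed concrete form of the question's attempt. [\s\S]*? is lazy, but it
# still expands as far as needed, so it can run past the start of the next block.
lazy = re.compile(r"(block\s*\d+)[\s\S]*?attrib\sb=(\w+)")

# Pattern from the answer: the negative lookahead stops at the next block header.
unrolled = re.compile(r"(?m)(^block\s*\d+).*(?:\n(?!block\s*\d).*)*\battrib\s*b=(\w+)")

lazy_pairs = lazy.findall(sample)
unrolled_pairs = unrolled.findall(sample)

print(lazy_pairs)      # [('block 1', 'b1'), ('block 2', 'b3')]
print(unrolled_pairs)  # [('block 1', 'b1'), ('block 3', 'b3')]
```

The lazy quantifier only prefers the shortest match from a given starting point. Once the match starts at block 2, nothing forbids it from crossing into block 3 to find attrib b=, which is what the lookahead in the unrolled version prevents.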