Python - Regex - findall duplicates
I'm trying to match e-mails in HTML text using the following code in Python:
my_second_pat = r'((\w+)( *?))(@|[aA][tT]|\([aA][tT]\))(((( *?)(\w+)( *?))(\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)([eE][dD][uU]|[cC][oO][mM])'
matches = re.findall(my_second_pat, line)
for m in matches:
    s = "".join(m)
    email = "".join(s.split())
    res.append((name, 'e', email))
When I run it on line = shoham@stanford.edu
I get:
[('shoham', 'shoham', '', '@', 'stanford.', 'stanford.', 'stanford', '', 'stanford', '', '.', 'edu')]
What I expect:
[('shoham', '@', 'stanford.', 'edu')]
It matched as one string on regexpal.com, so I guess I'm having trouble with re.findall.
I'm new to both regex and Python. Optimizations/modifications are welcome.
Try this:
(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)([^@\s.]{2,}) 
If you need to limit it to .com and .edu:
(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu) 
Note that I have used the case-insensitive flag (?i) at the start of the regex, instead of using the syntax [eE].
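For context on the "duplicates": when a pattern contains more than one capturing group, re.findall returns one tuple per match with an entry for every group, nested ones included. The patterns above keep the separators in non-capturing groups (?:...), so only the three parts of the address come back. Here is a minimal sketch of how the second pattern might be used. The names EMAIL_PAT and find_emails are made up for illustration, and rejoining the parts with "@" and "." is an assumption about what you want stored.

```python
import re

# Second pattern from the answer: only user, domain and TLD are captured;
# the "@"/"at" and "."/"dot" separators sit in non-capturing groups.
EMAIL_PAT = re.compile(
    r"(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu)"
)


def find_emails(line):
    """Hypothetical helper: rebuild each match as user@domain.tld."""
    return [
        f"{user}@{domain}.{tld}"
        for user, domain, tld in EMAIL_PAT.findall(line)
    ]


print(EMAIL_PAT.findall("shoham@stanford.edu"))
# [('shoham', 'stanford', 'edu')]

print(find_emails("shoham AT stanford DOT edu"))
# ['shoham@stanford.edu']
```

As written, the pattern accepts only a single domain label before the TLD, so an address such as shoham@cs.stanford.edu would not match. If you need subdomains, the domain part would have to be extended.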