Python - Regex - findall duplicates
I'm trying to match e-mails in HTML text using the following code in Python:
import re

my_second_pat = r'((\w+)( *?))(@|[aA][tT]|\([aA][tT]\))(((( *?)(\w+)( *?))(\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)([eE][dD][uU]|[cC][oO][mM])'
matches = re.findall(my_second_pat, line)
for m in matches:
    s = "".join(m)
    email = "".join(s.split())
    res.append((name, 'e', email))
When I run it on line = shoham@stanford.edu
I get:
[('shoham', 'shoham', '', '@', 'stanford.', 'stanford.', 'stanford', '', 'stanford', '', '.', 'edu')]
What I expect:
[('shoham','@', 'stanford.', 'edu')]
It's matched as one string on regexpal.com, so I guess I'm having trouble with re.findall.
I'm new to both regex and Python. Any optimizations/modifications are welcome.
Try this:
(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)([^@\s.]{2,})
If you need to limit it to .com and .edu:
(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu)
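As a rough sketch of how that pattern could be used with re.findall (the find_emails helper and the sample strings are made up for illustration, not part of the original code):

```python
import re

# The .com/.edu pattern from above; only the three (...) groups capture,
# the (?:...) groups match but are left out of the findall result.
EMAIL_PAT = re.compile(
    r'(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu)'
)

def find_emails(line):
    """Hypothetical helper: rebuild user@domain.tld from each match."""
    # findall yields one (user, domain, tld) tuple per match
    return ['{}@{}.{}'.format(user, domain, tld)
            for user, domain, tld in EMAIL_PAT.findall(line)]

print(find_emails('shoham@stanford.edu'))        # ['shoham@stanford.edu']
print(find_emails('shoham at stanford dot edu')) # ['shoham@stanford.edu']
```

Because the separators are non-capturing, there is no need to join every tuple element and strip whitespace afterwards.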
Note that I have used the case-insensitive flag (?i) at the start of the regex, instead of using the syntax [eE].
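As for why the original code returned so many items: when a pattern contains capturing groups, re.findall returns one tuple per match with an entry for every group, nested ones included, and the original pattern has twelve of them. A minimal sketch of the same pattern, assuming you only want the four pieces listed in the question, turns the inner groups into non-capturing (?:...) groups and swaps the [aA]-style classes for re.IGNORECASE:

```python
import re

# Same structure as the pattern in the question, but only four groups capture:
# user, separator, domain part(s) with trailing dot, and top-level domain.
pat = re.compile(
    r'(\w+) *?'
    r'(@|at|\(at\))'
    r'((?:(?: *?\w+ *?)(?:\.|dot|\(dot\)))+)'
    r'(edu|com)',
    re.IGNORECASE,
)

matches = pat.findall('shoham@stanford.edu')
print(matches)  # [('shoham', '@', 'stanford.', 'edu')]
```

This is only checked against the sample line from the question; other inputs may need a stricter pattern.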