Python - Regex - findall duplicates -


I'm trying to match e-mails in HTML text using the following code in Python:

my_second_pat = r'((\w+)( *?))(@|[aA][tT]|\([aA][tT]\))(((( *?)(\w+)( *?))(\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)([eE][dD][uU]|[cC][oO][mM])'
matches = re.findall(my_second_pat, line)
for m in matches:
    s = "".join(m)
    email = "".join(s.split())
    res.append((name, 'e', email))

When I run it on line = 'shoham@stanford.edu'

I get:

[('shoham', 'shoham', '', '@', 'stanford.', 'stanford.', 'stanford', '', 'stanford', '', '.', 'edu')] 

What I expect:

[('shoham','@', 'stanford.', 'edu')] 

It's matched as one string on regexpal.com, so I guess I'm having trouble with re.findall.

I'm new to both regex and Python. Any optimizations/modifications are welcome.

Try this:

(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)([^@\s.]{2,}) 
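For context, re.findall returns one tuple element per capturing group, which is why the original pattern with its nested parentheses produced so many repeated pieces. Wrapping the separators in non-capturing groups (?:...) leaves only the three parts of interest. A minimal sketch, assuming the sample line from the question (the variable names here are just for illustration):

```python
import re

# Every (...) is a capturing group and shows up in findall's tuples;
# (?:...) groups are matched but not reported.
print(re.findall(r"(\w+)(@)(\w+)", "a@b"))    # [('a', '@', 'b')]
print(re.findall(r"(\w+)(?:@)(\w+)", "a@b"))  # [('a', 'b')]

# The pattern suggested above has exactly three capturing groups.
pattern = re.compile(
    r"(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)([^@\s.]{2,})"
)
print(pattern.groups)  # 3

line = "shoham@stanford.edu"
matches = pattern.findall(line)
print(matches)  # [('shoham', 'stanford', 'edu')]
```

Note that the separators themselves ('@' and '.') are no longer in the result, so they have to be added back when rebuilding the address.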


If you need to limit it to .com and .edu:

(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu) 
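As a rough sketch of how this restricted pattern could replace the loop in the question: the helper name extract_emails and the choice to lowercase the result are my own assumptions, not something from the original code.

```python
import re

# Restricted pattern from above: the last group only accepts com or edu.
EMAIL_PAT = re.compile(
    r"(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu)"
)

def extract_emails(line):
    """Return normalized addresses found in line (hypothetical helper)."""
    return [
        # Rebuild the address from the three captured parts; lowercasing
        # is an optional normalization step, not required by the regex.
        f"{user}@{domain}.{tld}".lower()
        for user, domain, tld in EMAIL_PAT.findall(line)
    ]

print(extract_emails("shoham@stanford.edu"))         # ['shoham@stanford.edu']
print(extract_emails("shoham at stanford dot edu"))  # ['shoham@stanford.edu']
print(extract_emails("someone@example.org"))         # []
```

Because each match is already a 3-tuple, there is no need to join every group and strip whitespace afterwards as the original loop did.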


Note that I have used the case-insensitive flag (?i) at the start of the regex, instead of using the syntax [eE].
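To illustrate, here is a small sketch comparing the inline flag, the equivalent re.IGNORECASE argument, and the spelled-out character classes. The three pattern names are only for this example.

```python
import re

inline = re.compile(r"(?i)edu|com")               # inline flag at the start
flagged = re.compile(r"edu|com", re.IGNORECASE)   # same effect via the flags argument
spelled = re.compile(r"[eE][dD][uU]|[cC][oO][mM]")  # manual per-letter classes

# All three accept the same inputs regardless of letter case.
results = [
    bool(p.fullmatch(text))
    for text in ("edu", "EDU", "Com")
    for p in (inline, flagged, spelled)
]
print(all(results))  # True
```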

