Python - Regex - findall duplicates

July 15, 2010

I'm trying to match e-mails in HTML text using the following code in Python:

import re

# name, line and res are defined elsewhere in the surrounding script
my_second_pat = r'((\w+)( *?))(@|[aA][tT]|\([aA][tT]\))(((( *?)(\w+)( *?))(\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)([eE][dD][uU]|[cC][oO][mM])'
matches = re.findall(my_second_pat, line)
for m in matches:
    s = "".join(m)
    email = "".join(s.split())
    res.append((name, 'e', email))

When run on line = 'shoham@stanford.edu'

I get:

[('shoham', 'shoham', '', '@', 'stanford.', 'stanford.', 'stanford', '', 'stanford', '', '.', 'edu')]

What I expect:

[('shoham','@', 'stanford.', 'edu')]

It's matched as one string on regexpal.com, so I guess I'm having trouble with re.findall.

I'm new to both regex and Python. Any optimizations/modifications are welcome.

Try this:

(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)([^@\s.]{2,})
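The duplicates in the question come from how re.findall reports groups: when a pattern has more than one capturing group, each match is returned as a tuple with one entry per group, so the twelve nested groups in the original pattern produce twelve entries. The pattern above keeps the separators in non-capturing groups, (?:...), which leaves only three captures. A minimal sketch of that behaviour, assuming the sample line from the question (the names EMAIL_PAT and matches are only for illustration):

```python
import re

# Pattern from the answer: only user, domain and TLD are capturing groups;
# the "@"/"at" and "."/"dot" separators sit in non-capturing (?:...) groups.
EMAIL_PAT = r'(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)([^@\s.]{2,})'

line = 'shoham@stanford.edu'
matches = re.findall(EMAIL_PAT, line)

# findall returns one tuple per match, one entry per capturing group.
print(re.compile(EMAIL_PAT).groups)  # number of capturing groups
print(matches)
```

With three capturing groups, each tuple holds just the user, the domain and the TLD, with nothing repeated.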


If you need to limit it to .com and .edu:

(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu)
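Since the separators are not captured, the address has to be rebuilt from the three groups rather than by joining the whole tuple as the question's loop does. A rough sketch of one way to do that; the helper name find_emails, the constant COM_EDU_PAT and the sample text are assumptions for illustration, not part of the original code:

```python
import re

# Restricted pattern from the answer: the TLD must be com or edu.
COM_EDU_PAT = re.compile(
    r'(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu)'
)

def find_emails(text):
    """Return normalized user@domain.tld strings found in text."""
    return ['{}@{}.{}'.format(user, domain, tld.lower())
            for user, domain, tld in COM_EDU_PAT.findall(text)]

# Hypothetical sample: a plain address, an obfuscated one, and a .org
# address that the restricted pattern should skip.
emails = find_emails('shoham@stanford.edu, shoham at stanford dot EDU, bob@example.org')
print(emails)
```

Unpacking the tuple by name keeps the loop readable and makes it obvious which group is which if the pattern changes later.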


Note that I have used the case-insensitive flag (?i) at the start of the regex, instead of using the syntax [eE].
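To illustrate the point about the flag, here is a small sketch comparing the three spellings on a made-up line (the variable names and the sample text are assumptions). The inline (?i) and the re.IGNORECASE argument are two ways of setting the same flag:

```python
import re

line = 'Shoham AT Stanford DOT EDU'

# Explicit character classes, as in the question's pattern.
explicit = re.findall(r'[eE][dD][uU]|[cC][oO][mM]', line)

# Inline flag at the start of the pattern, as in the answer.
inline = re.findall(r'(?i)edu|com', line)

# The same flag passed as an argument instead of inside the pattern.
by_flag = re.findall(r'edu|com', line, flags=re.IGNORECASE)

print(explicit, inline, by_flag)
```

All three find the same text; the flag versions are simply shorter and easier to read.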
