python - Filtering a dataframe based on a regex -


Say I have a dataframe my_df with a column 'brand', and I would like to drop any rows where brand is either toyota or bmw.

I thought the following would do it:

    my_regex = re.compile('^(bmw$|toyota$).*$')
    my_function = lambda x: my_regex.match(x.lower())
    my_df[~my_df['brand'].apply(my_function)]

But I get the error:

    ValueError: cannot index with vector containing NA / NaN values

Why? How can I filter my dataframe using a regex?

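I think re.match returns None when there is no match, so apply produces a column of match objects and None values rather than booleans, and that breaks the indexing. Below is an alternative solution using the pandas vectorized string methods. Note that the pandas string methods can handle null values: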

    >>> df = pd.DataFrame({'brand': ['bmw', 'ford', np.nan, None, 'toyota', 'audi']})
    >>> df
        brand
    0     bmw
    1    ford
    2     NaN
    3    None
    4  toyota
    5    audi

    [6 rows x 1 columns]

    >>> idx = df.brand.str.contains('^bmw$|^toyota$',
    ...                             flags=re.IGNORECASE, regex=True, na=False)
    >>> idx
    0     True
    1    False
    2    False
    3    False
    4     True
    5    False
    Name: brand, dtype: bool

    >>> df[~idx]
      brand
    1  ford
    2   NaN
    3  None
    5  audi

    [4 rows x 1 columns]
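For anyone who wants to run this outside the interpreter, here is a minimal self-contained sketch of the same idea. It also shows one way the original apply-based attempt could be repaired, by turning each regex result into a real boolean. The names is_excluded and matches_brand are made up for illustration, and the non-capturing group in the pattern is my own substitution for the '^bmw$|^toyota$' pattern above, not something from the question.

```python
import re

import numpy as np
import pandas as pd

# Sample data with mixed case and missing values (illustrative only).
df = pd.DataFrame({"brand": ["bmw", "Ford", np.nan, None, "TOYOTA", "audi"]})

# Vectorized approach: na=False makes missing values count as "no match",
# so the resulting mask is purely boolean and safe to index with.
is_excluded = df["brand"].str.contains(
    r"^(?:bmw|toyota)$", flags=re.IGNORECASE, regex=True, na=False
)
kept_vectorized = df[~is_excluded]

# apply-based approach: return a real bool instead of a match object or None.
pattern = re.compile(r"^(?:bmw|toyota)$", flags=re.IGNORECASE)


def matches_brand(value):
    # Hypothetical helper: non-string entries (NaN, None) never match.
    return isinstance(value, str) and pattern.match(value) is not None


mask = df["brand"].apply(matches_brand).astype(bool)
kept_apply = df[~mask]

print(kept_vectorized)
print(kept_apply)
```

Both masks should keep the same rows here. The vectorized version is usually the more idiomatic choice, while the apply version may be handy if the matching logic is more involved than a single regex.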
