python - Filtering a dataframe based on a regex
Say I have a dataframe my_df with a column 'brand', and I would like to drop any rows where brand is either toyota or bmw.

I thought the following would do it:

    my_regex = re.compile('^(bmw$|toyota$).*$')
    my_function = lambda x: my_regex.match(x.lower())
    my_df[~my_df['brand'].apply(my_function)]

But I get the error:

    ValueError: cannot index with vector containing NA / NaN values

Why? How can I filter my dataframe using a regex?
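
I think re.match returns None when there is no match, and that breaks the indexing, because the resulting column holds match objects and None rather than booleans. Below is an alternative solution using pandas vectorized string methods. Note that the pandas string methods can handle null values, and na=False makes missing entries count as non-matches: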

    >>> import re
    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame({'brand': ['bmw', 'ford', np.nan, None, 'toyota', 'audi']})
    >>> df
        brand
    0     bmw
    1    ford
    2     NaN
    3    None
    4  toyota
    5    audi

    [6 rows x 1 columns]
    >>> idx = df.brand.str.contains('^bmw$|^toyota$', flags=re.IGNORECASE, regex=True, na=False)
    >>> idx
    0     True
    1    False
    2    False
    3    False
    4     True
    5    False
    Name: brand, dtype: bool
    >>> df[~idx]
       brand
    1   ford
    2    NaN
    3   None
    5   audi

    [4 rows x 1 columns]
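
For reference, here is a minimal, self-contained sketch of the same idea as a script. The helper name drop_matching, the sample data, and the extra 'bmw x5' row are my own assumptions for illustration, not part of the question. It also shows one possible way to keep the original apply-based approach working, by turning the re.match result into a plain boolean and treating non-string values as non-matches.

```python
import re

import numpy as np
import pandas as pd


def drop_matching(frame, pattern, column="brand"):
    """Return the rows of `frame` whose `column` does NOT match `pattern`.

    Hypothetical helper for illustration. Missing values (NaN / None)
    are treated as non-matches, so those rows are kept.
    """
    matches = frame[column].str.contains(
        pattern, flags=re.IGNORECASE, regex=True, na=False
    )
    return frame[~matches]


# Sample data (assumed): mixed case, missing values, and a near-miss.
df = pd.DataFrame(
    {"brand": ["BMW", "ford", np.nan, None, "Toyota", "audi", "bmw x5"]}
)

# Anchored on both ends, so only exact brand names are dropped.
# A non-capturing group avoids pandas' warning about match groups.
pattern = r"^(?:bmw|toyota)$"
result = drop_matching(df, pattern)

# The apply-based variant: convert the match object (or None) to a real
# bool, and skip anything that is not a string.
my_regex = re.compile(pattern, re.IGNORECASE)
mask = df["brand"].apply(
    lambda x: isinstance(x, str) and my_regex.match(x) is not None
)
result_apply = df[~mask]

print(result)
```

Both variants should select the same rows here. The vectorized str.contains version is usually the simpler choice, since it deals with nulls through the na argument instead of an explicit isinstance check.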