python - Fill NA values in a pandas Series with a stop
I'm analyzing a time series and, based on some criteria, I can pick out the rows that are either the start or the end of events. At this point, the series looks like this (I've left out repetitive values for brevity):
The setup
    import numpy as np
    import pandas
    from pandas import Timestamp

    datadict = {'event': {
        Timestamp('2010-01-01 00:20:00', tz=None): 'event start',
        Timestamp('2010-01-01 00:30:00', tz=None): '--',
        Timestamp('2010-01-01 00:40:00', tz=None): '--',
        Timestamp('2010-01-01 00:50:00', tz=None): '--',
        Timestamp('2010-01-01 01:00:00', tz=None): '--',
        Timestamp('2010-01-01 01:10:00', tz=None): 'event end',
        Timestamp('2010-01-01 01:20:00', tz=None): '--',
        Timestamp('2010-01-01 02:20:00', tz=None): '--',
        Timestamp('2010-01-01 02:30:00', tz=None): 'event start',
        Timestamp('2010-01-01 02:40:00', tz=None): '--',
        Timestamp('2010-01-01 02:50:00', tz=None): '--',
        Timestamp('2010-01-01 03:00:00', tz=None): '--',
        Timestamp('2010-01-01 03:10:00', tz=None): '--',
        Timestamp('2010-01-01 03:20:00', tz=None): '--',
        Timestamp('2010-01-01 03:30:00', tz=None): 'event end',
    }}
    data = pandas.DataFrame.from_dict(datadict)

                               event
    2010-01-01 00:20:00  event start
    2010-01-01 00:30:00           --
    2010-01-01 00:40:00           --
    2010-01-01 00:50:00           --
    2010-01-01 01:00:00           --
    2010-01-01 01:10:00    event end
    2010-01-01 01:20:00           --
    2010-01-01 02:20:00           --
    2010-01-01 02:30:00  event start
    2010-01-01 02:40:00           --
    2010-01-01 02:50:00           --
    2010-01-01 03:00:00           --
    2010-01-01 03:10:00           --
    2010-01-01 03:20:00           --
    2010-01-01 03:30:00    event end
Here's what I'd like to achieve (ideally without for loops):
                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --            NA
    2010-01-01 02:20:00           --            NA
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2
    2010-01-01 03:40:00           --            NA
    2010-01-01 03:50:00           --            NA
Here's what I've tried
With optimistic assumptions about the quality of the data, I can get the event numbers like this:
    # Keep only the start/end rows, then number them in pairs (start, end).
    table = data[data.event != '--'].reset_index()
    table['event number'] = 1 + np.floor(table.index / 2)
    table = table.set_index('index')

                               event  event number
    index
    2010-01-01 00:20:00  event start             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 02:30:00  event start             2
    2010-01-01 03:30:00    event end             2
I can then join this to the original dataframe and use fillna with method='ffill':
    data2 = data.join(table[['event number']])
    data2['filled'] = data2['event number'].fillna(method='ffill')

                               event  event number  filled
    2010-01-01 00:20:00  event start             1       1
    2010-01-01 00:30:00           --           nan       1
    2010-01-01 00:40:00           --           nan       1
    2010-01-01 00:50:00           --           nan       1
    2010-01-01 01:00:00           --           nan       1
    2010-01-01 01:10:00    event end             1       1
    2010-01-01 01:20:00           --           nan       1  # <- d'oh
    2010-01-01 02:20:00           --           nan       1  # <- d'oh
    2010-01-01 02:30:00  event start             2       2
    2010-01-01 02:40:00           --           nan       2
    2010-01-01 02:50:00           --           nan       2
    2010-01-01 03:00:00           --           nan       2
    2010-01-01 03:10:00           --           nan       2
    2010-01-01 03:20:00           --           nan       2
    2010-01-01 03:30:00    event end             2       2
The problem
As you can see, the time between events (01:20 through 02:20) is being associated with event #1.
The question
Is there any way to skip over these sections without looping?
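You can achieve this by looking at the cumulative sum of the number of 'event start' rows and the number of 'event end' rows. Start by numbering every row with the running count of 'event start':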
    >>> data['event number'] = (data.event == 'event start').cumsum()
    >>> data
                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --             1
    2010-01-01 02:20:00           --             1
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2
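As a small aside, here is a minimal sketch of why the cumulative sum acts as a running event counter. The toy series `events` is made up for illustration and is not the data from the question.

```python
import pandas as pd

# Toy labels (hypothetical, for illustration only).
events = pd.Series(['event start', '--', 'event end', '--', 'event start', '--'])

# The comparison yields a boolean mask; cumsum() treats True as 1 and
# False as 0, so the total goes up by one at every 'event start' row.
is_start = events == 'event start'
counter = is_start.cumsum()

print(counter.tolist())  # [1, 1, 1, 1, 2, 2]
```

Note that the counter keeps its value after an 'event end' row, which is why the rows between events still need separate handling.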
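Now you need to set NaN where there is no event. Those places correspond to the rows where the cumulative sum of 'event start' equals the cumulative sum of 'event end', with the latter shifted by 1 row so that the 'event end' row itself still counts as part of the event: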
    >>> idx = data['event number'] == (data.event.shift(1) == 'event end').cumsum()
    >>> data.loc[idx, 'event number'] = np.nan
    >>> data
                               event  event number
    2010-01-01 00:20:00  event start             1
    2010-01-01 00:30:00           --             1
    2010-01-01 00:40:00           --             1
    2010-01-01 00:50:00           --             1
    2010-01-01 01:00:00           --             1
    2010-01-01 01:10:00    event end             1
    2010-01-01 01:20:00           --           nan
    2010-01-01 02:20:00           --           nan
    2010-01-01 02:30:00  event start             2
    2010-01-01 02:40:00           --             2
    2010-01-01 02:50:00           --             2
    2010-01-01 03:00:00           --             2
    2010-01-01 03:10:00           --             2
    2010-01-01 03:20:00           --             2
    2010-01-01 03:30:00    event end             2

    [15 rows x 2 columns]
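For reference, the two steps can also be folded into one self-contained sketch. This is only one possible arrangement of the same cumulative-sum idea: the helper name `number_events` and its `start`/`end` parameters are hypothetical, and it uses `Series.where` instead of assigning NaN in place. It assumes, like the code above, that starts and ends alternate cleanly.

```python
import pandas as pd

def number_events(events, start='event start', end='event end'):
    """Hypothetical helper: label rows with a running event number,
    leaving NaN for rows that fall outside any event."""
    # Running count of events that have started, up to and including each row.
    started = (events == start).cumsum()
    # Running count of events that ended strictly before each row; the
    # one-row shift keeps the 'event end' row inside its own event.
    ended = (events == end).shift(1, fill_value=False).cumsum()
    # A row belongs to an event only while more events have started than ended.
    return started.where(started > ended)

# Rebuild the sample data from the question.
index = pd.date_range('2010-01-01 00:20', '2010-01-01 01:20', freq='10min').append(
    pd.date_range('2010-01-01 02:20', '2010-01-01 03:30', freq='10min'))
labels = (['event start'] + ['--'] * 4 + ['event end'] + ['--'] * 2
          + ['event start'] + ['--'] * 5 + ['event end'])
data = pd.DataFrame({'event': labels}, index=index)

data['event number'] = number_events(data['event'])
print(data)
```

Because of the NaN entries the resulting column is floating point, so the event numbers display as 1.0 and 2.0. Rows before the first 'event start' also come out as NaN, since both counters are still zero there.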