python - Pandas Set Data From the Last Period As New DataFrame Column
I have a pandas DataFrame:

    import pandas as pd

    df = pd.DataFrame([
        ['a', '2014-01-01', '2014-01-07', 1.2],
        ['b', '2014-01-01', '2014-01-07', 2.5],
        ['c', '2014-01-01', '2014-01-07', 3.],
        ['a', '2014-01-08', '2014-01-14', 13.],
        ['b', '2014-01-08', '2014-01-14', 2.],
        ['c', '2014-01-08', '2014-01-14', 1.],
        ['a', '2014-01-15', '2014-01-21', 10.],
        ['a', '2014-01-21', '2014-01-27', 98.],
        ['b', '2014-01-21', '2014-01-27', -5.],
        ['c', '2014-01-21', '2014-01-27', -72.],
        ['a', '2014-01-22', '2014-01-28', 8.],
        ['b', '2014-01-22', '2014-01-28', 25.],
        ['c', '2014-01-22', '2014-01-28', -23.],
        ['a', '2014-01-22', '2014-02-22', 8.],
        ['b', '2014-01-22', '2014-02-22', 25.],
        ['c', '2014-01-22', '2014-02-22', -23.],
    ], columns=['group', 'start date', 'end date', 'value'])

and the output looks like this:
       group  start date    end date  value
    0      a  2014-01-01  2014-01-07    1.2
    1      b  2014-01-01  2014-01-07    2.5
    2      c  2014-01-01  2014-01-07    3.0
    3      a  2014-01-08  2014-01-14   13.0
    4      b  2014-01-08  2014-01-14    2.0
    5      c  2014-01-08  2014-01-14    1.0
    6      a  2014-01-15  2014-01-21   10.0
    7      a  2014-01-21  2014-01-27   98.0
    8      b  2014-01-21  2014-01-27   -5.0
    9      c  2014-01-21  2014-01-27  -72.0
    10     a  2014-01-22  2014-01-28    8.0
    11     b  2014-01-22  2014-01-28   25.0
    12     c  2014-01-22  2014-01-28  -23.0
    13     a  2014-01-22  2014-02-22    8.0
    14     b  2014-01-22  2014-02-22   25.0
    15     c  2014-01-22  2014-02-22  -23.0

I am trying to add a new column holding the value for the same group in the previous period (if it exists). So, the output should look like this:
       group  start date    end date  value  last period value
    0      a  2014-01-01  2014-01-07    1.2                NaN
    1      b  2014-01-01  2014-01-07    2.5                NaN
    2      c  2014-01-01  2014-01-07    3.0                NaN
    3      a  2014-01-08  2014-01-14   13.0                1.2
    4      b  2014-01-08  2014-01-14    2.0                2.5
    5      c  2014-01-08  2014-01-14    1.0                3.0
    6      a  2014-01-15  2014-01-21   10.0               13.0
    7      a  2014-01-21  2014-01-27   98.0                NaN
    8      b  2014-01-21  2014-01-27   -5.0                NaN
    9      c  2014-01-21  2014-01-27  -72.0                NaN
    10     a  2014-01-22  2014-01-28    8.0               10.0
    11     b  2014-01-22  2014-01-28   25.0                NaN
    12     c  2014-01-22  2014-01-28  -23.0                NaN
    13     a  2014-01-22  2014-02-22    8.0                NaN
    14     b  2014-01-22  2014-02-22   25.0                NaN
    15     c  2014-01-22  2014-02-22  -23.0                NaN

Notice that the rows with NaN do not have a corresponding value for the same group in the last period. So, rows that span 7 days (one week) need to be matched with the row for the same group from the previous week.
Suppose you compute the duration between start and end for each row:

    df['duration'] = df['end'] - df['start']

and suppose you also compute the previous start value based on the duration:

    df['prev'] = df['start'] - df['duration'] - pd.Timedelta(days=1)

Then you can express the desired DataFrame as the result of a merge between df and itself, where you merge rows whose group, duration and prev (in one DataFrame) match the group, duration and start (in the other DataFrame):
    import pandas as pd

    df = pd.DataFrame([
        ['a', '2014-01-01', '2014-01-07', 1.2],
        ['b', '2014-01-01', '2014-01-07', 2.5],
        ['c', '2014-01-01', '2014-01-07', 3.],
        ['a', '2014-01-08', '2014-01-14', 3.],
        ['b', '2014-01-08', '2014-01-14', 2.],
        ['c', '2014-01-08', '2014-01-14', 1.],
        ['a', '2014-01-15', '2014-01-21', 10.],
        ['a', '2014-01-21', '2014-01-27', 98.],
        ['b', '2014-01-21', '2014-01-27', -5.],
        ['c', '2014-01-21', '2014-01-27', -72.],
        ['a', '2014-01-22', '2014-01-28', 8.],
        ['b', '2014-01-22', '2014-01-28', 25.],
        ['c', '2014-01-22', '2014-01-28', -23.],
        ['a', '2014-01-22', '2014-02-22', 8.],
        ['b', '2014-01-22', '2014-02-22', 25.],
        ['c', '2014-01-22', '2014-02-22', -23.],
    ], columns=['group', 'start', 'end', 'value'])

    # Parse the date strings so that date arithmetic works.
    for col in ['start', 'end']:
        df[col] = pd.to_datetime(df[col])

    # Length of each period, and the start date of the period of the
    # same length that ends the day before this one begins.
    df['duration'] = df['end'] - df['start']
    df['prev'] = df['start'] - df['duration'] - pd.Timedelta(days=1)

    # Self-merge: a row's (group, duration, prev) must equal another
    # row's (group, duration, start). Unmatched rows get NaN.
    result = pd.merge(df, df[['group', 'duration', 'start', 'value']],
                      how='left',
                      left_on=['group', 'duration', 'prev'],
                      right_on=['group', 'duration', 'start'],
                      suffixes=('', '_y'))
    result = result[['group', 'start', 'end', 'value', 'value_y']]
    result = result.rename(columns={'value_y': 'prev value'})
    print(result)

yields
       group      start        end  value  prev value
    0      a 2014-01-01 2014-01-07    1.2         NaN
    1      b 2014-01-01 2014-01-07    2.5         NaN
    2      c 2014-01-01 2014-01-07    3.0         NaN
    3      a 2014-01-08 2014-01-14    3.0         1.2
    4      b 2014-01-08 2014-01-14    2.0         2.5
    5      c 2014-01-08 2014-01-14    1.0         3.0
    6      a 2014-01-15 2014-01-21   10.0         3.0
    7      a 2014-01-21 2014-01-27   98.0         NaN
    8      b 2014-01-21 2014-01-27   -5.0         NaN
    9      c 2014-01-21 2014-01-27  -72.0         NaN
    10     a 2014-01-22 2014-01-28    8.0        10.0
    11     b 2014-01-22 2014-01-28   25.0         NaN
    12     c 2014-01-22 2014-01-28  -23.0         NaN
    13     a 2014-01-22 2014-02-22    8.0         NaN
    14     b 2014-01-22 2014-02-22   25.0         NaN
    15     c 2014-01-22 2014-02-22  -23.0         NaN

In the comments, Artur Nowak asks about the time complexity of pd.merge. I believe it is doing an O(N + M) hash join, where N is the size of the hashed table and M is the size of the lookup table. Here is some code to test the performance of pd.merge as a function of DataFrame size empirically.
    import collections
    import string
    import timeit

    import numpy as np
    import pandas as pd
    from scipy import stats
    import matplotlib.pyplot as plt

    timing = collections.defaultdict(list)

    def make_df(ngroups, ndur, ndates):
        # string.ascii_uppercase replaces Python 2's string.uppercase
        groups = list(string.ascii_uppercase[:ngroups])
        durations = range(ndur)
        start = pd.date_range('2000-1-1', periods=ndates, freq='D')
        # One row per (start, duration, group) combination
        index = pd.MultiIndex.from_product(
            [start, durations, groups],
            names=['start', 'duration', 'group'])
        values = np.arange(len(index))
        df = pd.DataFrame({'value': values}, index=index).reset_index()
        df['end'] = df['start'] + pd.to_timedelta(df['duration'], unit='D')
        df = df.drop('duration', axis=1)
        df = df[['group', 'start', 'end', 'value']]
        df['duration'] = df['end'] - df['start']
        df['prev'] = df['start'] - df['duration'] - pd.Timedelta(days=1)
        return df

    def using_merge(df):
        result = pd.merge(df, df[['group', 'duration', 'start', 'value']],
                          how='left',
                          left_on=['group', 'duration', 'prev'],
                          right_on=['group', 'duration', 'start'],
                          suffixes=('', '_y'))
        return result

    # Number of start dates per run; each frame has 10 * 10 * n rows.
    ns = np.array([10**i for i in range(5)])
    for n in ns:
        timing['merge'].append(timeit.timeit(
            'using_merge(df)',
            'from __main__ import using_merge, make_df; '
            'df = make_df(10, 10, {})'.format(n),
            number=5))

    print(timing['merge'])
    # Fit a straight line to the timings to check for linear growth.
    slope, intercept, rval, pval, stderr = stats.linregress(ns, timing['merge'])
    print(slope, intercept, rval, pval, stderr)

    plt.plot(ns, timing['merge'], label='merge')
    plt.plot(ns, slope*ns + intercept)
    plt.legend(loc='best')
    plt.show()

This suggests that, for DataFrames of tens of thousands of rows, pd.merge's speed is linear.
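To make the O(N + M) hash-join idea concrete, here is a minimal pure-Python sketch. It is not pandas' actual implementation, only an illustration of the general technique: one pass builds a hash table over one side, and a second pass probes it from the other side. The function name `hash_left_join`, the `prev_value` field, and the use of integer day numbers instead of dates are all assumptions made for this example.

```python
from collections import defaultdict

def hash_left_join(left, right, left_keys, right_keys, value_field):
    """Left-join two lists of dicts using a hash table (illustrative only)."""
    # Build phase: a single pass over the right-hand rows, O(N).
    table = defaultdict(list)
    for row in right:
        table[tuple(row[k] for k in right_keys)].append(row[value_field])

    # Probe phase: a single pass over the left-hand rows, O(M) expected,
    # since each dictionary lookup takes constant time on average.
    joined = []
    for row in left:
        matches = table.get(tuple(row[k] for k in left_keys), [None])
        for value in matches:
            joined.append({**row, "prev_value": value})
    return joined

# Hypothetical toy data: 'start' and 'prev' are day numbers, not dates.
rows = [
    {"group": "a", "start": 1, "prev": -6, "value": 1.2},
    {"group": "a", "start": 8, "prev": 1, "value": 3.0},
    {"group": "b", "start": 8, "prev": 1, "value": 2.0},
]
joined = hash_left_join(rows, rows, ["group", "prev"], ["group", "start"], "value")
print([r["prev_value"] for r in joined])  # [None, 1.2, None]
```

Only the second row finds a match, because it is the only one whose (group, prev) pair equals some row's (group, start) pair. The others get None, the analogue of NaN in the left merge above.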
