python - Pandas Set Data From the Last Period As New DataFrame Column -


I have a pandas DataFrame:

    import pandas as pd

    df = pd.DataFrame([['a', '2014-01-01', '2014-01-07', 1.2],
                       ['b', '2014-01-01', '2014-01-07', 2.5],
                       ['c', '2014-01-01', '2014-01-07', 3.],
                       ['a', '2014-01-08', '2014-01-14', 13.],
                       ['b', '2014-01-08', '2014-01-14', 2.],
                       ['c', '2014-01-08', '2014-01-14', 1.],
                       ['a', '2014-01-15', '2014-01-21', 10.],
                       ['a', '2014-01-21', '2014-01-27', 98.],
                       ['b', '2014-01-21', '2014-01-27', -5.],
                       ['c', '2014-01-21', '2014-01-27', -72.],
                       ['a', '2014-01-22', '2014-01-28', 8.],
                       ['b', '2014-01-22', '2014-01-28', 25.],
                       ['c', '2014-01-22', '2014-01-28', -23.],
                       ['a', '2014-01-22', '2014-02-22', 8.],
                       ['b', '2014-01-22', '2014-02-22', 25.],
                       ['c', '2014-01-22', '2014-02-22', -23.],
                      ], columns=['group', 'start date', 'end date', 'value'])

and the output looks like this:

       group  start date    end date  value
    0      a  2014-01-01  2014-01-07    1.2
    1      b  2014-01-01  2014-01-07    2.5
    2      c  2014-01-01  2014-01-07    3.0
    3      a  2014-01-08  2014-01-14   13.0
    4      b  2014-01-08  2014-01-14    2.0
    5      c  2014-01-08  2014-01-14    1.0
    6      a  2014-01-15  2014-01-21   10.0
    7      a  2014-01-21  2014-01-27   98.0
    8      b  2014-01-21  2014-01-27   -5.0
    9      c  2014-01-21  2014-01-27  -72.0
    10     a  2014-01-22  2014-01-28    8.0
    11     b  2014-01-22  2014-01-28   25.0
    12     c  2014-01-22  2014-01-28  -23.0
    13     a  2014-01-22  2014-02-22    8.0
    14     b  2014-01-22  2014-02-22   25.0
    15     c  2014-01-22  2014-02-22  -23.0

I am trying to add a new column holding the value for the same group in the previous period (if it exists). So the output should look like this:

       group  start date    end date  value  last period value
    0      a  2014-01-01  2014-01-07    1.2                NaN
    1      b  2014-01-01  2014-01-07    2.5                NaN
    2      c  2014-01-01  2014-01-07    3.0                NaN
    3      a  2014-01-08  2014-01-14   13.0                1.2
    4      b  2014-01-08  2014-01-14    2.0                2.5
    5      c  2014-01-08  2014-01-14    1.0                3.0
    6      a  2014-01-15  2014-01-21   10.0               13.0
    7      a  2014-01-21  2014-01-27   98.0                NaN
    8      b  2014-01-21  2014-01-27   -5.0                NaN
    9      c  2014-01-21  2014-01-27  -72.0                NaN
    10     a  2014-01-22  2014-01-28    8.0               10.0
    11     b  2014-01-22  2014-01-28   25.0                NaN
    12     c  2014-01-22  2014-01-28  -23.0                NaN
    13     a  2014-01-22  2014-02-22    8.0                NaN
    14     b  2014-01-22  2014-02-22   25.0                NaN
    15     c  2014-01-22  2014-02-22  -23.0                NaN

Notice that the rows with NaN do not have a corresponding value for the same group in the last period. So rows that span 7 days (one week) need to be matched with the row for the same group from the previous week.

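Suppose you compute the duration between start and end for each row (with the two date columns named start and end and converted to datetimes):

    df['duration'] = df['end'] - df['start']

And suppose you compute the previous period's start value based on that duration:

    df['prev'] = df['start'] - df['duration'] - pd.Timedelta(days=1)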
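To make the date arithmetic concrete, here is a small worked example for a single period taken from the data above. It is only a sketch of the two formulas; the scalar names `start`, `end`, `duration` and `prev` are chosen for illustration and mirror the column names.

```python
import pandas as pd

# One period from the example data: 2014-01-08 through 2014-01-14.
start = pd.Timestamp('2014-01-08')
end = pd.Timestamp('2014-01-14')

# The end date is inclusive, so a one-week period has a 6-day difference.
duration = end - start

# Step back by the duration plus one day to land on the start
# of the period of the same length that immediately precedes this one.
prev = start - duration - pd.Timedelta(days=1)

print(duration)  # 6 days 00:00:00
print(prev)      # 2014-01-01 00:00:00
```

The extra one-day subtraction is what makes back-to-back periods line up: the previous week ends on 2014-01-07, the day before this one starts.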

Then you can express the desired DataFrame as the result of merging df with itself, where rows are matched when group, duration and prev (in one DataFrame) equal group, duration and start (in the other DataFrame):

    import pandas as pd

    df = pd.DataFrame([['a', '2014-01-01', '2014-01-07', 1.2],
                       ['b', '2014-01-01', '2014-01-07', 2.5],
                       ['c', '2014-01-01', '2014-01-07', 3.],
                       ['a', '2014-01-08', '2014-01-14', 3.],
                       ['b', '2014-01-08', '2014-01-14', 2.],
                       ['c', '2014-01-08', '2014-01-14', 1.],
                       ['a', '2014-01-15', '2014-01-21', 10.],
                       ['a', '2014-01-21', '2014-01-27', 98.],
                       ['b', '2014-01-21', '2014-01-27', -5.],
                       ['c', '2014-01-21', '2014-01-27', -72.],
                       ['a', '2014-01-22', '2014-01-28', 8.],
                       ['b', '2014-01-22', '2014-01-28', 25.],
                       ['c', '2014-01-22', '2014-01-28', -23.],
                       ['a', '2014-01-22', '2014-02-22', 8.],
                       ['b', '2014-01-22', '2014-02-22', 25.],
                       ['c', '2014-01-22', '2014-02-22', -23.],
                      ], columns=['group', 'start', 'end', 'value'])

    # Parse the date strings so that date arithmetic works.
    for col in ['start', 'end']:
        df[col] = pd.to_datetime(df[col])

    df['duration'] = df['end'] - df['start']
    df['prev'] = df['start'] - df['duration'] - pd.Timedelta(days=1)

    # Self-merge: each row's (group, duration, prev) is looked up among
    # the (group, duration, start) keys of the same DataFrame.
    result = pd.merge(df, df[['group', 'duration', 'start', 'value']], how='left',
                      left_on=['group', 'duration', 'prev'],
                      right_on=['group', 'duration', 'start'], suffixes=('', '_y'))
    result = result[['group', 'start', 'end', 'value', 'value_y']]
    result = result.rename(columns={'value_y': 'prev value'})
    print(result)

yields

       group      start        end  value  prev value
    0      a 2014-01-01 2014-01-07    1.2         NaN
    1      b 2014-01-01 2014-01-07    2.5         NaN
    2      c 2014-01-01 2014-01-07    3.0         NaN
    3      a 2014-01-08 2014-01-14    3.0         1.2
    4      b 2014-01-08 2014-01-14    2.0         2.5
    5      c 2014-01-08 2014-01-14    1.0         3.0
    6      a 2014-01-15 2014-01-21   10.0         3.0
    7      a 2014-01-21 2014-01-27   98.0         NaN
    8      b 2014-01-21 2014-01-27   -5.0         NaN
    9      c 2014-01-21 2014-01-27  -72.0         NaN
    10     a 2014-01-22 2014-01-28    8.0        10.0
    11     b 2014-01-22 2014-01-28   25.0         NaN
    12     c 2014-01-22 2014-01-28  -23.0         NaN
    13     a 2014-01-22 2014-02-22    8.0         NaN
    14     b 2014-01-22 2014-02-22   25.0         NaN
    15     c 2014-01-22 2014-02-22  -23.0         NaN
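The merge relies on each (group, duration, start) combination occurring at most once; if it were duplicated, the left merge would emit extra rows. As a hedged sketch, the same steps can be wrapped in a helper that asks pandas to check this through the `validate` argument of `pd.merge`. The function name `add_prev_value` and the small test frame are hypothetical and not part of the original answer.

```python
import pandas as pd

def add_prev_value(df):
    """Return a copy of df with a 'prev value' column (hypothetical helper)."""
    out = df.copy()
    out['duration'] = out['end'] - out['start']
    out['prev'] = out['start'] - out['duration'] - pd.Timedelta(days=1)
    lookup = out[['group', 'duration', 'start', 'value']]
    merged = pd.merge(out, lookup, how='left',
                      left_on=['group', 'duration', 'prev'],
                      right_on=['group', 'duration', 'start'],
                      suffixes=('', '_y'),
                      # Raise an error if a lookup key appears more than once.
                      validate='many_to_one')
    merged = merged[['group', 'start', 'end', 'value', 'value_y']]
    return merged.rename(columns={'value_y': 'prev value'})

# A tiny frame (assumed data): group 'a' has two consecutive weeks, 'b' only one.
small = pd.DataFrame({
    'group': ['a', 'a', 'b'],
    'start': pd.to_datetime(['2014-01-01', '2014-01-08', '2014-01-08']),
    'end': pd.to_datetime(['2014-01-07', '2014-01-14', '2014-01-14']),
    'value': [1.2, 3.0, 2.0],
})
result = add_prev_value(small)
print(result)
```

Only the second row of group 'a' finds a previous period, so it is the only row with a non-NaN 'prev value'.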

In the comments, Artur Nowak asks about the time complexity of pd.merge. I believe it is doing an O(N + M) hash join, where N is the size of the hashed table and M is the size of the lookup table. Here is some code to test the performance of pd.merge as a function of DataFrame size empirically.
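For intuition about where an O(N + M) bound would come from, here is a minimal pure-Python sketch of a left hash join over lists of dicts. It illustrates the general technique only and makes no claim about how pandas implements pd.merge internally; `hash_join_left` and the sample rows are made up for this example.

```python
def hash_join_left(left_rows, right_rows, left_key, right_key):
    """Left join two lists of dicts using a hash table (illustrative only)."""
    # Build phase: one pass over the right side, O(N) dictionary inserts.
    table = {}
    for row in right_rows:
        table.setdefault(right_key(row), []).append(row)

    # Probe phase: one pass over the left side, O(M) dictionary lookups
    # (plus the cost of emitting the matches that are found).
    joined = []
    for row in left_rows:
        for match in table.get(left_key(row), [None]):
            joined.append((row, match))
    return joined

left = [{'group': 'a', 'prev': 1}, {'group': 'b', 'prev': 5}]
right = [{'group': 'a', 'start': 1, 'value': 1.2}]
pairs = hash_join_left(left, right,
                       left_key=lambda r: (r['group'], r['prev']),
                       right_key=lambda r: (r['group'], r['start']))
```

Rows on the left without a partner are paired with None, which plays the role of the NaN entries in the merged DataFrame.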

    import collections
    import string
    import timeit

    import numpy as np
    import pandas as pd
    from scipy import stats
    import matplotlib.pyplot as plt

    timing = collections.defaultdict(list)

    def make_df(ngroups, ndur, ndates):
        # string.ascii_uppercase replaces Python 2's string.uppercase.
        groups = list(string.ascii_uppercase[:ngroups])
        durations = range(ndur)
        start = pd.date_range('2000-1-1', periods=ndates, freq='D')

        # One row per (start, duration, group) combination.
        index = pd.MultiIndex.from_product([start, durations, groups],
                                           names=['start', 'duration', 'group'])
        values = np.arange(len(index))
        df = pd.DataFrame({'value': values}, index=index).reset_index()
        df['end'] = df['start'] + pd.to_timedelta(df['duration'], unit='D')
        df = df.drop('duration', axis=1)
        df = df[['group', 'start', 'end', 'value']]

        df['duration'] = df['end'] - df['start']
        df['prev'] = df['start'] - df['duration'] - pd.Timedelta(days=1)
        return df

    def using_merge(df):
        result = pd.merge(df, df[['group', 'duration', 'start', 'value']], how='left',
                          left_on=['group', 'duration', 'prev'],
                          right_on=['group', 'duration', 'start'], suffixes=('', '_y'))
        return result

    # Time the merge for 1, 10, ..., 10**4 start dates.
    ns = np.array([10**i for i in range(5)])
    for n in ns:
        timing['merge'].append(timeit.timeit(
            'using_merge(df)',
            'from __main__ import using_merge, make_df; df = make_df(10, 10, {})'.format(n),
            number=5))

    print(timing['merge'])
    slope, intercept, rval, pval, stderr = stats.linregress(ns, timing['merge'])
    print(slope, intercept, rval, pval, stderr)

    # Plot the measured timings together with the fitted line.
    plt.plot(ns, timing['merge'], label='merge')
    plt.plot(ns, slope*ns + intercept)
    plt.legend(loc='best')
    plt.show()

This suggests that for DataFrames of tens of thousands of rows, the running time of pd.merge grows linearly with size.

[Figure: pd.merge timings plotted against ns, with the fitted regression line]

