python - pandas top n values when group name in multiple columns -

March 15, 2010

trying find top n values large dataframe. keys combinations of named objects in first 2 columns. however, i'd find max, regardless of column key located. better demonstrated example:

import itertools import pandas pd  np.random.seed(10)  pairs = [combo combo in itertools.combinations(['apple','banana','pear','orange'], 2)]  df = pd.dataframe(pairs, columns=['a','b']) df['score'] = np.random.rand(6)

the original dataframe:

in [2]: df out[2]:   b   score      0  apple   banana  0.771321      1  apple   pear    0.020752      2  apple   orange  0.633648      3  banana  pear    0.748804      4  banana  orange  0.498507      5  pear    orange  0.224797

here how accomplish task w/ sql, assuming have database table named fruits emulates df above:

uniq = pd.unique(df[['a', 'b']].values.ravel())  df_sql = pd.dataframe() fruit in uniq:     dfsql_tmp = pd.read_sql_query(     """select a,b,score fruits     = %s     or b = %s     order score desc     limit 1;""",     engine, params=[fruit, fruit])      df_sql = pd.concat([df_sql, dfsql_tmp], ignore_index=true)

this gets me i'm asking, top n scores each unique value (from union of df['a'] , df['b']). desired output:

in [5]: df_sql out[5]:   b   score      0  apple   banana  0.771321 #highest apple score      1  apple   banana  0.771321 #highest banana score      2  apple   orange  0.633648 #highest orange score      3  banana  pear    0.748804 #highest pear score

edit

this trick also, slow @ scale:

n=1 df_new = pd.dataframe() fruit in uniq:     df_tmp = df[(df['a'] == fruit) | (df['b'] == fruit)].sort_values('score', ascending=false).head(n)     df_new= pd.concat([df_new, df_tmp])

is there better method desired result? nested sql query doesn't scale. i'd rather perform action on 1 large df. it's important keep n, rather max or min.

this not pretty solution, , suspect there's better ones out there, here's crack @ it. creates ~550k row x 5 column dataframe , runs in 4 seconds on laptop.

import string import pandas pd import numpy np import itertools  np.random.seed(10) pairs = [combo combo in itertools.combinations(string.letters + string.digits, 4)]  df = pd.dataframe(pairs, columns=['a', 'b', 'c', 'd']) df['score'] = np.random.rand(len(df))  cols = ['a', 'b', 'c', 'd'] indexes = []  c in pd.concat([df[col] col in cols]).unique():     indexes.append(df[reduce(lambda x, y: x | y, [df[col] == c col in cols])]['score'].idxmax()) print df.ix[indexes]

add in .reset_index() @ end if don't want original indexes preserved in output.

for top n, instead of doing .idxmax(), sort reduced frame , take first n indexes .iloc[:n].

Search This Blog

Color

python - pandas top n values when group name in multiple columns -

edit

Comments

Post a Comment

Popular posts from this blog

java - pagination of xlsx file to XSSFworkbook using apache POI -

android - net_scheduler holding wakelock -

Unlimited choices in BASH case statement -