python - pandas top n values when group name in multiple columns -
trying find top n values large dataframe. keys combinations of named objects in first 2 columns. however, i'd find max, regardless of column key located. better demonstrated example:
import itertools import pandas pd np.random.seed(10) pairs = [combo combo in itertools.combinations(['apple','banana','pear','orange'], 2)] df = pd.dataframe(pairs, columns=['a','b']) df['score'] = np.random.rand(6) the original dataframe:
in [2]: df out[2]: b score 0 apple banana 0.771321 1 apple pear 0.020752 2 apple orange 0.633648 3 banana pear 0.748804 4 banana orange 0.498507 5 pear orange 0.224797 here how accomplish task w/ sql, assuming have database table named fruits emulates df above:
uniq = pd.unique(df[['a', 'b']].values.ravel()) df_sql = pd.dataframe() fruit in uniq: dfsql_tmp = pd.read_sql_query( """select a,b,score fruits = %s or b = %s order score desc limit 1;""", engine, params=[fruit, fruit]) df_sql = pd.concat([df_sql, dfsql_tmp], ignore_index=true) this gets me i'm asking, top n scores each unique value (from union of df['a'] , df['b']). desired output:
in [5]: df_sql out[5]: b score 0 apple banana 0.771321 #highest apple score 1 apple banana 0.771321 #highest banana score 2 apple orange 0.633648 #highest orange score 3 banana pear 0.748804 #highest pear score edit
this trick also, slow @ scale:
n=1 df_new = pd.dataframe() fruit in uniq: df_tmp = df[(df['a'] == fruit) | (df['b'] == fruit)].sort_values('score', ascending=false).head(n) df_new= pd.concat([df_new, df_tmp]) is there better method desired result? nested sql query doesn't scale. i'd rather perform action on 1 large df. it's important keep n, rather max or min.
this not pretty solution, , suspect there's better ones out there, here's crack @ it. creates ~550k row x 5 column dataframe , runs in 4 seconds on laptop.
import string import pandas pd import numpy np import itertools np.random.seed(10) pairs = [combo combo in itertools.combinations(string.letters + string.digits, 4)] df = pd.dataframe(pairs, columns=['a', 'b', 'c', 'd']) df['score'] = np.random.rand(len(df)) cols = ['a', 'b', 'c', 'd'] indexes = [] c in pd.concat([df[col] col in cols]).unique(): indexes.append(df[reduce(lambda x, y: x | y, [df[col] == c col in cols])]['score'].idxmax()) print df.ix[indexes] add in .reset_index() @ end if don't want original indexes preserved in output.
for top n, instead of doing .idxmax(), sort reduced frame , take first n indexes .iloc[:n].
Comments
Post a Comment