python - Use groups in scipy.stats.kruskal similar to R (CRAN) kruskal.test


I'm trying to replace rpy2 code in a Python script with pure Python (SciPy). In this context I need to replace the Kruskal-Wallis test (R: kruskal.test()) with its SciPy counterpart (Python: scipy.stats.kruskal).

scipy.stats.kruskal returns a similar H-statistic and p-value to R when comparing integers/floats only. However, I have difficulty applying it to groups represented by strings.

Below is a subsample of my data:

    y = [4.33917022422, 2.96541899883, 6.70475220836, 9.19889096119, 2.14087398016,
         5.39520023918, 1.58443224287, 3.59625224078, 4.01998599966, 2.58058624352]
    x = ['high_o2', 'high_o2', 'high_o2', 'high_o2', 'low_o2',
         'low_o2',  'low_o2',  'low_o2',  'mid_o2',  'mid_o2']

In R one would type:

    kruskal.test(y, as.factor(x))

Doing the same thing in Python (2.7) using SciPy (0.17):

    from scipy import stats
    stats.kruskal(y, x)

However, I get very low p-values (p < 1e-07) and quite high H-statistics (26) when using SciPy, which is incorrect. I have tried replacing x with a list of integer codes {0, 1, 2}, but with no improvement.

How can I tell SciPy to treat x as groups during ranking?

Each non-keyword argument passed to scipy.stats.kruskal is treated as a separate group of y-values. By passing x as one of the arguments, kruskal attempts to treat your label strings as though they were a second group of y-values. The strings will be cast to NaNs (which ought to raise a RuntimeWarning).
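
To make that calling convention concrete, here is a minimal sketch that splits the subsample from the question by hand and passes each group as its own positional argument. The variable names `high_o2`, `low_o2` and `mid_o2` are only illustrative; they do not appear in the question.

```python
from scipy import stats

# y-values from the question, split by hand according to their label in x
high_o2 = [4.33917022422, 2.96541899883, 6.70475220836, 9.19889096119]
low_o2 = [2.14087398016, 5.39520023918, 1.58443224287, 3.59625224078]
mid_o2 = [4.01998599966, 2.58058624352]

# Each positional argument is one group; there is no separate "labels" argument
h, p = stats.kruskal(high_o2, low_o2, mid_o2)
print(h, p)
```

Splitting by hand does not scale beyond a handful of groups, which is why the grouping is normally done programmatically.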

So rather than passing x directly, you need to group the y-values by label and pass each group as a separate input array to kruskal. For example:

    import numpy as np
    from scipy import stats

    # convert `y` to a numpy array for more convenient indexing
    y = np.array(y)

    # find the unique group labels and their corresponding indices
    label, idx = np.unique(x, return_inverse=True)

    # make a list of arrays containing the y-values corresponding to each unique label
    groups = [y[idx == i] for i, l in enumerate(label)]

    # use `*` to unpack the list as a sequence of arguments to `stats.kruskal`
    h, p = stats.kruskal(*groups)

    print(h, p)
    # 2.94545454545 0.22929927
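
If you would rather keep a call that looks like R's `kruskal.test(y, as.factor(x))`, the grouping step can be wrapped in a small function. The sketch below is one possible way to do it without NumPy; `kruskal_by_group` is a hypothetical helper name, not part of SciPy.

```python
from collections import defaultdict

from scipy import stats


def kruskal_by_group(values, labels):
    """Hypothetical helper: Kruskal-Wallis test on `values` grouped by `labels`."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    grouped = defaultdict(list)
    for value, label in zip(values, labels):
        grouped[label].append(value)
    # Each list of values becomes one positional argument, i.e. one group
    return stats.kruskal(*grouped.values())


y = [4.33917022422, 2.96541899883, 6.70475220836, 9.19889096119, 2.14087398016,
     5.39520023918, 1.58443224287, 3.59625224078, 4.01998599966, 2.58058624352]
x = ['high_o2', 'high_o2', 'high_o2', 'high_o2', 'low_o2',
     'low_o2', 'low_o2', 'low_o2', 'mid_o2', 'mid_o2']

h, p = kruskal_by_group(y, x)
print(h, p)
```

The order in which the groups are passed does not affect the result, since the test ranks all observations together before summing the ranks per group.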
