dplyr - R: Split weighted column into equal-sized buckets


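I typically use cut_number (from ggplot2, often loaded alongside dplyr) to split a column into buckets with approximately the same number of observations. However, my dataset is in a compact form where each row has a weight (the number of observations it represents).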

Example data frame:

df <- data.frame(
  x = c(18, 17, 18.5, 20, 20.5, 24, 24.4, 18.3, 31, 34, 39, 20, 19, 34, 23),
  weight = c(1, 10, 3, 6, 19, 20, 34, 66, 2, 3, 1, 6, 9, 15, 21)
)

If there were only one observation of x per row, I would use df$bucket <- cut_number(df$x, 3) to segment x into 3 buckets with approximately the same number of observations. How can I take into account the fact that each row is weighted by a number of observations? I'd like to avoid splitting each row into weight rows, since the original data frame already has millions of rows.
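One possible way to do this without expanding rows is to work with cumulative weights. The sketch below is only an illustration under stated assumptions: weighted_bucket is a hypothetical helper (not part of dplyr or ggplot2), it sorts by x, and it places each row in the bucket that contains the midpoint of that row's span of cumulative weight.

```r
df <- data.frame(
  x = c(18, 17, 18.5, 20, 20.5, 24, 24.4, 18.3, 31, 34, 39, 20, 19, 34, 23),
  weight = c(1, 10, 3, 6, 19, 20, 34, 66, 2, 3, 1, 6, 9, 15, 21)
)

# Hypothetical helper: assign rows to n buckets of roughly equal total weight.
weighted_bucket <- function(x, w, n) {
  ord <- order(x)                          # walk rows in increasing x
  w_sorted <- w[ord]
  # Midpoint of each row's span on the cumulative-weight axis
  mid <- cumsum(w_sorted) - w_sorted / 2
  # Map the midpoint onto buckets 1..n; pmin guards the upper edge
  bucket_sorted <- pmin(n, floor(mid * n / sum(w)) + 1)
  bucket <- integer(length(x))
  bucket[ord] <- as.integer(bucket_sorted) # restore original row order
  bucket
}

df$bucket <- weighted_bucket(df$x, df$weight, 3)
bucket_weight <- tapply(df$weight, df$bucket, sum)
print(bucket_weight)
```

Because a row is never split, a single heavy row can still leave the buckets uneven, and rows with identical x values are not guaranteed to share a bucket in general.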

Based on the comments, I think the function below may produce the interval set I am seeking. Apologies for the general un-R-ness of it:

dftest <- data.frame(x = 1:6, weight = c(1, 1, 1, 1, 4, 1))

f <- function(df, n) {
  # Target total weight per bucket
  interval <- round(sum(df$weight) / n)
  buckets <- vector(mode = "integer", length = nrow(df))
  bucketnum <- 1
  count <- 0
  for (i in seq_len(nrow(df))) {
    count <- count + df$weight[i]
    buckets[i] <- bucketnum
    # Once the running weight reaches the target, start the next bucket
    if (count >= interval) {
      bucketnum <- bucketnum + 1
      count <- 0
    }
  }
  return(buckets)
}

Running the function buckets the items as follows:

dftest$bucket <- f(dftest, 3)

#    x weight bucket
#  1 1      1      1
#  2 2      1      1
#  3 3      1      1
#  4 4      1      2
#  5 5      4      2
#  6 6      1      3

For the example data frame:

df$bucket <- f(df, 3)

#        x weight bucket
#  1  18.0      1      1
#  2  17.0     10      1
#  3  18.5      3      1
#  4  20.0      6      1
#  5  20.5     19      1
#  6  24.0     20      1
#  7  24.4     34      1
#  8  18.3     66      2
#  9  31.0      2      2
#  10 34.0      3      2
#  11 39.0      1      2
#  12 20.0      6      3
#  13 19.0      9      3
#  14 34.0     15      3
#  15 23.0     21      3
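A quick way to judge the result is to total the weight in each bucket and look at the range of x each bucket covers. The sketch below copies the bucket assignments from the output shown for the example data frame; the names bucket_weight, x_min and x_max are illustrative only.

```r
df <- data.frame(
  x = c(18, 17, 18.5, 20, 20.5, 24, 24.4, 18.3, 31, 34, 39, 20, 19, 34, 23),
  weight = c(1, 10, 3, 6, 19, 20, 34, 66, 2, 3, 1, 6, 9, 15, 21)
)
# Bucket assignments copied from the output shown for the example data frame
df$bucket <- c(rep(1, 7), rep(2, 4), rep(3, 4))

# Total number of observations represented by each bucket
bucket_weight <- tapply(df$weight, df$bucket, sum)

# Smallest and largest x that ended up in each bucket
x_min <- tapply(df$x, df$bucket, min)
x_max <- tapply(df$x, df$bucket, max)

print(data.frame(bucket_weight, x_min, x_max))
```

On this data the totals come out as 93, 72 and 51, and the x ranges of the three buckets overlap, because the loop walks the rows in their stored order. If the buckets are meant to be contiguous intervals of x, sorting first (for instance with df[order(df$x), ]) before bucketing is probably needed.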
