dplyr - R: Split weighted column into equal-sized buckets
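I use cut_number (exported by ggplot2, which I load alongside dplyr) to split a column into buckets with approximately the same number of observations, but my dataset is in a compact form where each row has a weight (the number of observations it represents).
Example data frame: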
    df <- data.frame(
      x = c(18, 17, 18.5, 20, 20.5, 24, 24.4, 18.3, 31, 34, 39, 20, 19, 34, 23),
      weight = c(1, 10, 3, 6, 19, 20, 34, 66, 2, 3, 1, 6, 9, 15, 21)
    )

If there were 1 observation of x per row, I would use

    df$bucket <- cut_number(df$x, 3)

to segment x into 3 buckets with approximately the same number of observations. How can I take into account the fact that each row is weighted by a number of observations? I'd like to avoid splitting each row into weight rows, since the original data frame already has millions of rows.
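To make the target concrete, here is a minimal sketch of the row-expansion approach that the question wants to avoid. It is only practical on small data. Base R's quantile and cut stand in for cut_number here so that the sketch has no package dependency; that substitution is an assumption of this example, not something taken from the question.

```r
df <- data.frame(
  x = c(18, 17, 18.5, 20, 20.5, 24, 24.4, 18.3, 31, 34, 39, 20, 19, 34, 23),
  weight = c(1, 10, 3, 6, 19, 20, 34, 66, 2, 3, 1, 6, 9, 15, 21)
)

# Expand each row into `weight` copies of x (only feasible for small data).
expanded <- rep(df$x, times = df$weight)

# Tercile breakpoints of the expanded observations.
breaks <- quantile(expanded, probs = seq(0, 1, length.out = 4), names = FALSE)

# Map each original row to the interval that contains its x value.
df$bucket <- cut(df$x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Total weight (number of observations) that lands in each bucket.
tapply(df$weight, df$bucket, sum)
```

Note that cut raises an error if two breakpoints coincide, which can happen when a single heavily weighted x value spans more than one quantile.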
Based on the comments, I think this may be the set of intervals you are seeking. Apologies for the general un-R-ness of it:
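    dftest <- data.frame(x = 1:6, weight = c(1, 1, 1, 1, 4, 1))

    f <- function(df, n) {
      # Target total weight per bucket.
      interval <- round(sum(df$weight) / n)
      buckets <- vector(mode = "integer", length = nrow(df))
      bucketnum <- 1
      count <- 0
      # Rows are processed in their current order, so sort by x first
      # if the buckets should correspond to intervals of x.
      for (i in seq_len(nrow(df))) {
        count <- count + df$weight[i]
        buckets[i] <- bucketnum
        # Once the running weight reaches the target, start the next bucket.
        if (count >= interval) {
          bucketnum <- bucketnum + 1
          count <- 0
        }
      }
      return(buckets)
    }

Running the function buckets the items as follows: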
    dftest$bucket <- f(dftest, 3)
    #   x weight bucket
    # 1 1      1      1
    # 2 2      1      1
    # 3 3      1      1
    # 4 4      1      2
    # 5 5      4      2
    # 6 6      1      3

For the example data from the question:
    df$bucket <- f(df, 3)
    #       x weight bucket
    # 1  18.0      1      1
    # 2  17.0     10      1
    # 3  18.5      3      1
    # 4  20.0      6      1
    # 5  20.5     19      1
    # 6  24.0     20      1
    # 7  24.4     34      1
    # 8  18.3     66      2
    # 9  31.0      2      2
    # 10 34.0      3      2
    # 11 39.0      1      2
    # 12 20.0      6      3
    # 13 19.0      9      3
    # 14 34.0     15      3
    # 15 23.0     21      3
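The loop above walks the rows in their existing order. A possible vectorized variant, sketched here as an alternative and not as part of the original answer, orders the rows by x first and places each row according to where the midpoint of its weight span falls in the cumulative weight. The helper name weighted_buckets and the midpoint rule are assumptions made for this illustration.

```r
# Hypothetical helper (not from the original answer): weighted bucket
# assignment using cumulative weights on rows ordered by x.
weighted_buckets <- function(x, weight, n) {
  ord <- order(x)                      # positions of rows sorted by x
  w <- weight[ord]
  cum_w <- cumsum(w)                   # running total of weight in x order
  total <- cum_w[length(cum_w)]
  # Midpoint of each row's weight span, as a fraction of the total weight.
  mid <- (cum_w - w / 2) / total
  # Map the fraction to a bucket number in 1..n.
  bucket_sorted <- pmin(as.integer(floor(mid * n)) + 1L, as.integer(n))
  # Put the bucket numbers back in the original row order.
  bucket <- integer(length(x))
  bucket[ord] <- bucket_sorted
  bucket
}

df <- data.frame(
  x = c(18, 17, 18.5, 20, 20.5, 24, 24.4, 18.3, 31, 34, 39, 20, 19, 34, 23),
  weight = c(1, 10, 3, 6, 19, 20, 34, 66, 2, 3, 1, 6, 9, 15, 21)
)
df$bucket <- weighted_buckets(df$x, df$weight, 3)

# Total weight per bucket: 77, 64 and 75 for this data.
tapply(df$weight, df$bucket, sum)
```

Because no row is expanded, memory use stays proportional to the number of rows. One caveat: rows that share the same x value can still land in different buckets when they straddle a boundary, so aggregating weights by x beforehand may be preferable if that matters.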