Closest value in different files, with different number of lines and other conditions (bash, awk, other)

March 15, 2014

I have to revive an old question, with a modification for long files.

I have the ages of two stars in two files (file1 and file2). The age of the stars is in column $1, and the rest of the columns, up to $13, hold information that I need to print at the end.

I am trying to find the ages at which the stars have the same age or the closest age. Since the files are large (~25000 lines), I don't want to search the whole array, because of speed issues. The files can also differ a lot in number of lines (say ~10000 in some cases).

I am not sure if this is the best way to solve the problem but, for lack of a better one, this is my idea. (If you have a faster and more efficient method, please post it.)

All values have 12 decimals of precision, and I am only concerned with the first column (where the age is).

And I need different loops.

Let's use this value from file1:

2.326062371284e+05

The first routine should search file2 for all the matches that contain

2.3260e+05

(This loop would search the whole array, but if there is a way to stop the search as soon as it reaches 2.3261, that would save time.)

If it finds only one, the output should be that value.

Usually it will find several lines, maybe up to 1000. In that case, it should search again against

2.32606e+05

among the lines found before. (It is a nested loop, I think.) The number of matches will then decrease to ~200.

At that moment, the routine should search for the best difference, within a tolerance X, between

2.326062371284e+05

and these 200 lines.
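Since the sample files are sorted by age, the successive prefix narrowing described above could be replaced by a binary search, which never scans the whole array. The following Python sketch assumes the file2 ages are already loaded into a sorted, non-empty list. The names `nearest_index`, `file2_ages` and `tolerance` are illustrative and not part of the question, and treating the tolerance as inclusive is also an assumption.

```python
from bisect import bisect_left


def nearest_index(sorted_ages, target):
    """Return the index of the value in sorted_ages closest to target.

    Assumes sorted_ages is non-empty and sorted in ascending order.
    """
    pos = bisect_left(sorted_ages, target)
    if pos == 0:
        return 0
    if pos == len(sorted_ages):
        return len(sorted_ages) - 1
    # The closest value is one of the two neighbours of the insertion point.
    before, after = sorted_ages[pos - 1], sorted_ages[pos]
    return pos if (after - target) < (target - before) else pos - 1


# First column of the sample file2 (assumed sorted in ascending order).
file2_ages = [2.210729203510e+05, 2.354895663228e+05, 2.499062122946e+05,
              2.643228582664e+05, 2.787395042382e+05, 2.921130362004e+05,
              3.054865681626e+05]

target = 2.326062371284e+05   # the value taken from file1
tolerance = 3000.0            # assumed to be an inclusive upper bound

idx = nearest_index(file2_ages, target)
# Keep the nearest age only if it lies within the tolerance.
match = file2_ages[idx] if abs(file2_ages[idx] - target) <= tolerance else None
print(idx, match)
```

Each lookup costs O(log n) comparisons, so the 1000-line and 200-line intermediate lists are not needed under this assumption.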

This way, having these files:

file1

    1.833800650355e+05 col2f1 col3f1 col4f1
    1.959443501406e+05 col2f1 col3f1 col4f1
    2.085086352458e+05 col2f1 col3f1 col4f1
    2.210729203510e+05 col2f1 col3f1 col4f1
    2.326062371284e+05 col2f1 col3f1 col4f1
    2.441395539059e+05 col2f1 col3f1 col4f1
    2.556728706833e+05 col2f1 col3f1 col4f1

file2

    2.210729203510e+05 col2f2 col3f2 col4f2
    2.354895663228e+05 col2f2 col3f2 col4f2
    2.499062122946e+05 col2f2 col3f2 col4f2
    2.643228582664e+05 col2f2 col3f2 col4f2
    2.787395042382e+05 col2f2 col3f2 col4f2
    2.921130362004e+05 col2f2 col3f2 col4f2
    3.054865681626e+05 col2f2 col3f2 col4f2

Output file3 (with a tolerance of 3000):

    2.210729203510e+05 2.210729203510e+05 col2f1 col2f2 col4f1 col3f2
    2.326062371284e+05 2.354895663228e+05 col2f1 col2f2 col4f1 col3f2

Important condition:

The output shouldn't contain repeated lines (star 1, at a fixed age, can't be matched with different ages of star 2, only with the closest one).
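One way to get at most one output line per file1 age without searching the whole array is a single forward walk over both files. The Python sketch below assumes both files are sorted by age in column 1 and treats the tolerance as an inclusive bound. `match_closest`, `FILE1` and `FILE2` are hypothetical names, and the sample data is embedded as strings instead of being read from disk.

```python
def match_closest(rows1, rows2, tolerance):
    """Pair each rows1 entry with its closest rows2 entry, within tolerance.

    rows1 and rows2 are lists of column lists sorted by the age in column 0.
    Each rows1 entry yields at most one pair, so no file1 line is repeated.
    """
    ages2 = [float(row[0]) for row in rows2]
    pairs = []
    j = 0  # position in rows2; it only ever moves forward
    for row1 in rows1:
        age1 = float(row1[0])
        # Advance while the next file2 age is at least as close to age1.
        while j + 1 < len(ages2) and abs(ages2[j + 1] - age1) <= abs(ages2[j] - age1):
            j += 1
        if ages2 and abs(ages2[j] - age1) <= tolerance:
            pairs.append((row1, rows2[j]))
    return pairs


FILE1 = """\
1.833800650355e+05 col2f1 col3f1 col4f1
1.959443501406e+05 col2f1 col3f1 col4f1
2.085086352458e+05 col2f1 col3f1 col4f1
2.210729203510e+05 col2f1 col3f1 col4f1
2.326062371284e+05 col2f1 col3f1 col4f1
2.441395539059e+05 col2f1 col3f1 col4f1
2.556728706833e+05 col2f1 col3f1 col4f1
"""

FILE2 = """\
2.210729203510e+05 col2f2 col3f2 col4f2
2.354895663228e+05 col2f2 col3f2 col4f2
2.499062122946e+05 col2f2 col3f2 col4f2
2.643228582664e+05 col2f2 col3f2 col4f2
2.787395042382e+05 col2f2 col3f2 col4f2
2.921130362004e+05 col2f2 col3f2 col4f2
3.054865681626e+05 col2f2 col3f2 col4f2
"""

rows1 = [line.split() for line in FILE1.splitlines()]
rows2 = [line.split() for line in FILE2.splitlines()]

# Same column order as the expected file3: both ages, then col2 of each file,
# then col4 of file1 and col3 of file2. Ages are printed as the original text.
lines = [" ".join([r1[0], r2[0], r1[1], r2[1], r1[3], r2[2]])
         for r1, r2 in match_closest(rows1, rows2, 3000.0)]
print("\n".join(lines))
```

Because the position in file2 never moves backwards, the whole pass is linear in the combined number of lines. A file2 line can still be paired with more than one file1 line if several of them fall within the tolerance.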

How would you solve this?

Super thanks!

PS: I've changed the question, since it was shown to me that my reasoning had some errors. Thanks!

Not an awk solution, but there comes a time when other solutions are great too, so here is an answer using R.

New answer with different data, not reading from a file this time, to bake an example:

    library(data.table)

    # sample data for the code; use fread to read your files and setnames to name the columns accordingly
    set.seed(123)
    data <- data.table(age=runif(20)*1e6, name=sample(state.name,20), sat=sample(mtcars$cyl,20), dens=sample(DNase$density,20))
    data2 <- data.table(age=runif(10)*1e6, name=sample(state.name,10), sat=sample(mtcars$cyl,10), dens=sample(DNase$density,10))

    setkey(data,'age')  # set the key for joining to the age column
    setkey(data2,'age') # set the key for joining to the age column

    # the result
    result=data[ # to get the whole data from file 1 and file 2 at the end
             data2[
               data, # search for each star of list 1
               .SD, # return the columns of file 2
               roll='nearest', by=.EACHI, # join on each line (left join) and find the nearest value
               .SDcols=c('age','name','dens')]
           ][!duplicated(age) & abs(i.age - age) < 1e3, .SD, .SDcols=c('age','i.age','name','i.name','dens','i.dens')] # filter duplicates in the first file and on the difference

    # write the results to a file (change the separator as you wish):
    write.table(format(result,digits=15,scientific=TRUE),"c:/test.txt",sep=" ")

Code:

    # nice package to have; install.packages('data.table') if it's not present
    library(data.table)

    # read the data (the text can be file names)
    stars1 <- fread("1.833800650355e+05
    1.959443501406e+05
    2.085086352458e+05
    2.210729203510e+05
    2.326062371284e+05
    2.441395539059e+05
    2.556728706833e+05")

    stars2 <- fread("2.210729203510e+05
    2.354895663228e+05
    2.499062122946e+05
    2.643228582664e+05
    2.787395042382e+05
    2.921130362004e+05
    3.054865681626e+05")

    # name the columns (not needed if the file has a header)
    colnames(stars1) <- "age"
    colnames(stars2) <- "age"

    # key the data tables (for a fast join with binary search later)
    setkey(stars1,'age')
    setkey(stars2,'age')

    # the result (more details below on what is happening here :))
    result=stars2[ stars1, age, roll="nearest", by=.EACHI]

    # rename the columns so we can filter the whole result
    setnames(result,make.unique(names(result)))

    # final filter on the difference
    result[abs(age.1 - age) < 3e3]

So the interesting part is the first 'join' on the two lists of star ages, searching for each age in stars1 the nearest one in stars2.

This gives (after column renaming):

    > result
            age    age.1
    1: 183380.1 221072.9
    2: 195944.4 221072.9
    3: 208508.6 221072.9
    4: 221072.9 221072.9
    5: 232606.2 235489.6
    6: 244139.6 249906.2
    7: 255672.9 249906.2

Now that we have the nearest for each, we filter those that are close enough (an absolute difference below 3 000 here):

    > result[abs(age.1 - age) < 3e3]
            age    age.1
    1: 221072.9 221072.9
    2: 232606.2 235489.6
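In the unfiltered result the same stars2 age appears next to several stars1 ages. If more than one of those pairs could survive the tolerance filter, one possible extra step is to keep, for each stars2 age, only the stars1 age closest to it. The Python sketch below illustrates that step on the rounded values printed above. `closest_pair_per_partner` is a hypothetical helper and not part of data.table.

```python
def closest_pair_per_partner(pairs, tolerance):
    """For each second-file age, keep only the closest first-file age.

    pairs is a list of (age, partner) tuples, where partner is the nearest
    second-file age already found for that first-file age.
    Pairs whose difference is not strictly below tolerance are dropped.
    """
    best = {}  # partner age -> closest first-file age seen so far
    for age, partner in pairs:
        diff = abs(partner - age)
        if diff >= tolerance:
            continue
        if partner not in best or diff < abs(partner - best[partner]):
            best[partner] = age
    return sorted((age, partner) for partner, age in best.items())


# (age, age.1) pairs copied from the printed nearest-join result.
pairs = [(183380.1, 221072.9), (195944.4, 221072.9), (208508.6, 221072.9),
         (221072.9, 221072.9), (232606.2, 235489.6), (244139.6, 249906.2),
         (255672.9, 249906.2)]

unique_pairs = closest_pair_per_partner(pairs, 3e3)
print(unique_pairs)
```

With a tolerance of 3 000 this returns the same two pairs as the filter above. The extra step only changes the outcome for larger tolerances, where both 244139.6 and 255672.9 would otherwise be paired with 249906.2.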
