R - Read in a file with a complicated format and tidy it


I need to read a file with a complex format into a data frame or data table. Below is a simplified version of the format: the simplest example I could come up with that still conveys the complexity of the real case.

    title = "sometitlehere"
    variables = "n","q[m3/hr]","gf[-]","pe[bar]","eff[%]",
    zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i=  4
      0    818.96002      0.00000      13.00000
             61.92762
      1    818.96002      0.00000      29.86776
             61.92762
    zone datapacking=point t="offdesign gf= 0.000 q= 200.00 rpm=4800.",i=  4
      0    200.00000      0.00000      13.00000
              0.00000
      1    200.00000      0.00000      37.79360
             27.12768
    zone datapacking=point t="offdesign gf=  0.000 q=1200.00 rpm=4800.",i=  4
      0   1200.00000      0.00000      13.00000
              0.00000
      1   1200.00000      0.00000      17.17662
             28.08889
    zone datapacking=point t="offdesign gf=  0.100 q= 200.00 rpm=4800.",i=  4
      0    200.00000      0.10000     13.00000
              0.00000
      1    188.40880      0.04463      30.91997
             22.54672
    zone datapacking=point t="offdesign gf= 0.100 q=1200.00 rpm=4800.",i=  4
      0   1200.00000      0.10000    13.00000
              0.00000
      1   1177.85608      0.08308     15.94177
             13.05620

Format explanation: the first line (title = "sometitlehere") is a kind of comment and can be skipped. The second line contains the prefixes of the variable names and their measurement units. Since I already know the names of the variables, this line can be skipped too. Then there are 2*n+1 "data blocks". Each data block is 5 lines long. The first is a title line, which contains the values of 4 variables, point, gfin, qin and rpm (and thus must be parsed). For example, the title line of the first block is

zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i=  4 

which corresponds to

      point gfin     qin  rpm
     design  0.0  818.96 4800
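A regular expression may be more forgiving here than counting tokens, since the sample file writes both "q= 818.96" (with a space) and "q=1200.00" (without). The following is only a sketch under that assumption: parse_title and value_of are hypothetical names introduced for illustration, and the leading space in the test line is assumed.

```r
# Hypothetical helper: extract point, gfin, qin and rpm from one zone title line.
parse_title <- function(line) {
  # Text between the double quotes, e.g. design gf= 0.000 q= 818.96 rpm=4800.
  label <- sub('.*"([^"]*)".*', "\\1", line)

  # Number that follows "<key>=", whether or not spaces separate them
  value_of <- function(key) {
    pattern <- paste0(".*\\b", key, "= *([0-9.]+).*")
    as.numeric(sub(pattern, "\\1", label, ignore.case = TRUE, perl = TRUE))
  }

  data.frame(
    point = sub(" .*", "", label),  # first word of the label
    gfin  = value_of("gf"),
    qin   = value_of("q"),
    rpm   = value_of("rpm"),
    stringsAsFactors = FALSE
  )
}

title_line <- ' zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i=  4'
title_info <- parse_title(title_line)
print(title_info)
```

Because the keys are matched by name, the same helper should not depend on how many spaces follow each "=" sign.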

Then I have 4 lines of numeric/integer data without strings. These 4 lines correspond to 2 lines of actual data, because the even lines hold just the last value of the odd lines! Together they contain the values of 8 variables: q1, q2, gf1, gf2, pe1, pe2, eff1 and eff2. In other words, the first data block (lines 3-7 in the sample file)

    zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i=  4
      0    818.96002      0.00000      13.00000
             61.92762
      1    818.96002      0.00000      29.86776
             61.92762

corresponds to the following entry in the data frame

      point gfin     qin  rpm     q1     q2 gf1 gf2 pe1      pe2     eff1     eff2
     design  0.0  818.96 4800 818.96 818.96   0   0  13 29.86776 61.92762 61.92762
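The wrapped numeric lines can be rejoined without handling them one by one, because scan() reads whitespace-separated numbers across line breaks. The sketch below assumes the layout described above (each data row is split over two lines, with eff alone on the second); the names block_lines, rows and wide are made up for this example.

```r
# The four numeric lines of the first block, copied from the sample file
block_lines <- c(
  "  0    818.96002      0.00000      13.00000",
  "         61.92762",
  "  1    818.96002      0.00000      29.86776",
  "         61.92762"
)

# scan() ignores line breaks, so the wrapped eff values rejoin their rows
values <- scan(text = block_lines, quiet = TRUE)
rows <- matrix(values, ncol = 5, byrow = TRUE,
               dimnames = list(NULL, c("n", "q", "gf", "pe", "eff")))

# Drop the n column and flatten column by column: q1, q2, gf1, gf2, ...
wide_names <- paste0(rep(c("q", "gf", "pe", "eff"), each = 2), 1:2)
wide <- as.data.frame(as.list(setNames(as.vector(rows[, -1]), wide_names)))
print(wide)
```

The resulting one-row data frame can then be placed next to the values parsed from the title line.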

Applying the same logic, the final data frame corresponding to the above input file should be

    > df
          point gfin     qin  rpm      q1        q2 gf1     gf2 pe1      pe2     eff1     eff2
    1    design  0.0  818.96 4800  818.96  818.9600 0.0 0.00000  13 29.86776 61.92762 61.92762
    2 offdesign  0.0  200.00 4800  200.00  200.0000 0.0 0.00000  13 37.79360  0.00000 27.12768
    3 offdesign  0.0 1200.00 4800 1200.00 1200.0000 0.0 0.00000  13 17.17662  0.00000 28.08889
    4 offdesign  0.1  200.00 4800  200.00  188.4088 0.1 0.04463  13 30.91997  0.00000 22.54672
    5 offdesign  0.1 1200.00 4800 1200.00 1177.8561 0.1 0.08308  13 15.94177  0.00000 13.05620

How can I go from the input file to the data frame, minimizing the level of manual intervention?

PS: of course the real file has thousands more data blocks, and more variables in each data block. This is just a simple example.

EDIT: I read the file with readLines, as suggested by a user, and got this far (testfile.dat is the file provided at the start of the question):

    # read test file testfile.dat

    # clear workspace
    rm(list=ls())
    gc()
    graphics.off()

    # read full file
    directory = "../test/"
    filename = "testfile.dat"
    fullpath = paste0(directory,filename)
    s = readLines(fullpath)
    # it looks like R can read the original file, which has more than
    # 60000 lines, in one sweep. Great!

    # remove the title line and the variables line
    s = s[-(1:2)]

    # how many data points?
    nstages = 1
    nlines = 2*(nstages+1)+1
    npoints = length(s)/nlines

    # parser function (still to be written)
    parse_point <- function(x) {}

    # lapply the parser function to s
    data_list = lapply(s,parse_point)

    # merge the list of data frames data_list into a single data frame
    data = do.call("rbind",data_list)

I think the lapply + do.call trick is neat, and it saves me the slowness of a for loop. The problem is that I don't know how to write a parser function that lapply can handle! Basically, lapply applies parse_point to one element of s at a time. That won't do: I need to parse 5 elements of s at a time, i.e., a whole data block:

    zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i=  4
      0    818.96002      0.00000      13.00000
             61.92762
      1    818.96002      0.00000      29.86776
             61.92762

Any suggestions? I don't need a full solution, just a hint on how to continue. Then I can go on and improve the solution myself.

EDIT 2: leaving aside the minute fact that I cannot lapply parse_point, I tried to concentrate on parse_point itself. And... great! I can at least correctly parse one data block:

    library(stringr)

    # token positions assume every line starts with whitespace,
    # so the first element after splitting is an empty string
    index = 1
    split_text_line = strsplit(s[index],split=" +")[[1]]
    point = str_sub(split_text_line[4],4)
    gfin = as.numeric(split_text_line[6])
    qin =  as.numeric(split_text_line[8])
    rpm = as.numeric(str_extract(split_text_line[9],"[:digit:]+"))
    index = index + 1
    split_text_line = strsplit(s[index],split=" +")[[1]]
    # convert to numeric so the columns are not stored as strings
    q1 = as.numeric(split_text_line[3])
    gf1 = as.numeric(split_text_line[4])
    pe1 = as.numeric(split_text_line[5])
    index = index + 1
    eff1 = as.numeric(str_trim(s[index]))
    index = index + 1
    split_text_line = strsplit(s[index],split=" +")[[1]]
    q2 = as.numeric(split_text_line[3])
    gf2 = as.numeric(split_text_line[4])
    pe2 = as.numeric(split_text_line[5])
    index = index + 1
    eff2 = as.numeric(str_trim(s[index]))
    df = data.frame(point=point, gfin=gfin, qin=qin, rpm=rpm, q1=q1, q2=q2,
                    gf1=gf1, gf2=gf2, pe1=pe1, pe2=pe2, eff1=eff1, eff2=eff2)

where s is the character vector generated by the script above. However, I still have the issue of applying the parsing algorithm to all the data blocks. I could use a for loop, but isn't there a faster way?

You need to think a little bit sideways here. You have a list of text lines, and you want to process them 5 at a time. So, pass lapply a list of indices instead of the list of data:

    lapply(seq(1, length(s), 5), function (x) { parse_point(s[x:(x+4)]) })

This will call parse_point once for each group of 5 lines in the source file.
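One detail worth checking when slicing by start index: in R the ":" operator binds tighter than "+", so the end of the range needs parentheses. A minimal sketch, using made-up placeholder lines in place of the real file contents:

```r
# Stand-in for the character vector read from the file: 3 blocks of 5 lines
s <- sprintf("block %d, line %d", rep(1:3, each = 5), rep(1:5, times = 3))

# ':' is evaluated before '+', so x:x+4 means (x:x)+4, a single index
x <- 1
print(x:x+4)      # one value
print(x:(x+4))    # five consecutive indices

# One start index per block, then slice five lines from each start
starts <- seq(1, length(s), by = 5)
blocks <- lapply(starts, function(x) s[x:(x + 4)])
print(blocks[[2]])
```

Each element of blocks is then a character vector of five lines, which is the shape a block-level parse_point would expect.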

You could also modify parse_point to take an array index x instead of a list of lines. Then it's just

    lapply(seq(1, length(s), 5), parse_point)

You may need to either unlist the result of lapply or consider using sapply instead.
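Which of these fits depends on what parse_point returns. If it returns a one-row data frame, as in the question, keeping lapply and stacking the pieces with do.call(rbind, ...) is probably the simplest route. The sketch below uses a hypothetical stand-in parser and placeholder lines, only to compare the shapes that lapply and sapply produce:

```r
# Hypothetical stand-in for the real parser: one block in, one-row data frame out
parse_point <- function(block) {
  data.frame(title = block[1], n_lines = length(block), stringsAsFactors = FALSE)
}

# Placeholder input: 3 blocks of 5 lines each
s <- sprintf("block %d, line %d", rep(1:3, each = 5), rep(1:5, times = 3))
starts <- seq(1, length(s), by = 5)

# lapply keeps each one-row data frame intact, so rbind can stack them
data_list <- lapply(starts, function(x) parse_point(s[x:(x + 4)]))
df <- do.call(rbind, data_list)
print(df)

# sapply tries to simplify; here that gives a list-matrix with one column per block
simplified <- sapply(starts, function(x) parse_point(s[x:(x + 4)]))
print(dim(simplified))
```

If the parser returned a plain numeric vector instead, sapply (or vapply) would give an ordinary matrix, which may be what the suggestion above has in mind.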

