R - Read in a file with a complicated format and tidy it
I have to read a file with a complex format into a data frame or data table. Below is a simplified version of the format: the simplest example I could build that still conveys the complexity of the real case.

    title = "sometitlehere"
    variables = "n","q[m3/hr]","gf[-]","pe[bar]","eff[%]",
    zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i= 4
    0 818.96002 0.00000 13.00000
    61.92762
    1 818.96002 0.00000 29.86776
    61.92762
    zone datapacking=point t="offdesign gf= 0.000 q= 200.00 rpm=4800.",i= 4
    0 200.00000 0.00000 13.00000
    0.00000
    1 200.00000 0.00000 37.79360
    27.12768
    zone datapacking=point t="offdesign gf= 0.000 q=1200.00 rpm=4800.",i= 4
    0 1200.00000 0.00000 13.00000
    0.00000
    1 1200.00000 0.00000 17.17662
    28.08889
    zone datapacking=point t="offdesign gf= 0.100 q= 200.00 rpm=4800.",i= 4
    0 200.00000 0.10000 13.00000
    0.00000
    1 188.40880 0.04463 30.91997
    22.54672
    zone datapacking=point t="offdesign gf= 0.100 q=1200.00 rpm=4800.",i= 4
    0 1200.00000 0.10000 13.00000
    0.00000
    1 1177.85608 0.08308 15.94177
    13.05620

Format explanation:

The first line (title = "sometitlehere") is a kind of comment and can be skipped.

The second line contains the prefixes of the variable names and the measurement units. Since I already know the names of the variables, this line can be skipped too.

Then there are 2*n+1 "data blocks". Each data block is 5 lines long. The first is a title line, which contains the values of 4 variables, point, gfin, qin and rpm (so it must be parsed). For example, the first block's title line

    zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i= 4

corresponds to

    point  gfin qin    rpm
    design 0.0  818.96 4800

Then come 4 lines of numeric/integer data without strings. These 4 lines correspond to 2 lines of actual data, because each even line holds only the last value of the odd line before it! Together they contain the values of 8 variables: q1, q2, gf1, gf2, pe1, pe2, eff1 and eff2. In other words, the first data block (lines 3-7 in the sample file)

    zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i= 4
    0 818.96002 0.00000 13.00000
    61.92762
    1 818.96002 0.00000 29.86776
    61.92762

corresponds to the following entry in the data frame:

    point  gfin qin    rpm  q1     q2     gf1 gf2 pe1 pe2      eff1     eff2
    design 0.0  818.96 4800 818.96 818.96 0   0   13  29.86776 61.92762 61.92762

Applying the same logic, the final data frame corresponding to the input file above should be

    > df
          point gfin     qin  rpm      q1        q2 gf1     gf2 pe1      pe2     eff1     eff2
    1    design  0.0  818.96 4800  818.96  818.9600 0.0 0.00000  13 29.86776 61.92762 61.92762
    2 offdesign  0.0  200.00 4800  200.00  200.0000 0.0 0.00000  13 37.79360  0.00000 27.12768
    3 offdesign  0.0 1200.00 4800 1200.00 1200.0000 0.0 0.00000  13 17.17662  0.00000 28.08889
    4 offdesign  0.1  200.00 4800  200.00  188.4088 0.1 0.04463  13 30.91997  0.00000 22.54672
    5 offdesign  0.1 1200.00 4800 1200.00 1177.8561 0.1 0.08308  13 15.94177  0.00000 13.05620

How can I go from the input file to the data frame while minimizing the level of manual intervention?
P.S. Of course the real file has thousands more data blocks and more variables in each data block. This is just a simple example.
EDIT: I read the file with readLines, as suggested by a user, and got this far (testfile.dat is the file provided at the start of the question):

    # read test file testfile.dat

    # clear workspace
    rm(list=ls())
    gc()
    graphics.off()

    # read full file
    directory = "../test/"
    filename = "testfile.dat"
    fullpath = paste0(directory,filename)
    s = readLines(fullpath)
    # looks like R can read the original file, which has more than 60000 lines, in one sweep. Great!!!

    # remove title line and variables line
    s=s[-2:-1]

    # how many data points?
    nstages = 1
    nlines = 2*(nstages+1)+1
    npoints = length(s)/nlines

    # parser function
    parse_point <- function(x) {}

    # lapply the parser function to s
    data_list=lapply(s,parse_point)

    # merge the list of data frames data_list into a single data frame
    data=do.call("rbind",data_list)

I think the lapply + do.call trick is neat, and it saves me the slowness of a for loop. The problem is that I don't know how to write a parser function that lapply can handle! Basically, lapply applies parse_point to one element of s at a time. That won't do: I need to parse 5 elements of s at a time, i.e., one data block:

    zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i= 4
    0 818.96002 0.00000 13.00000
    61.92762
    1 818.96002 0.00000 29.86776
    61.92762

Any suggestions? I don't need a full solution, just a hint on how to continue. I can go on and improve the solution myself.
EDIT 2: Leaving aside the minute fact that I cannot lapply parse_point, I tried to concentrate on parse_point itself. And... great! I can at least parse one data block correctly:

    library(stringr)

    # title line: point, gfin, qin, rpm
    # (these positions only line up if the line begins with whitespace,
    #  which makes element 1 of the split an empty string)
    index = 1
    split_text_line = strsplit(s[index],split=" +")[[1]]
    point = str_sub(split_text_line[4],4)
    gfin = as.numeric(split_text_line[6])
    qin = as.numeric(split_text_line[8])
    rpm = as.numeric(str_extract(split_text_line[9],"[:digit:]+"))

    # first data line: q1, gf1, pe1 (converted to numeric)
    index = index + 1
    split_text_line = strsplit(s[index],split=" +")[[1]]
    q1 = as.numeric(split_text_line[3])
    gf1 = as.numeric(split_text_line[4])
    pe1 = as.numeric(split_text_line[5])

    # continuation line: eff1
    index = index + 1
    eff1 = as.numeric(str_trim(s[index]))

    # second data line: q2, gf2, pe2 (converted to numeric)
    index = index + 1
    split_text_line = strsplit(s[index],split=" +")[[1]]
    q2 = as.numeric(split_text_line[3])
    gf2 = as.numeric(split_text_line[4])
    pe2 = as.numeric(split_text_line[5])

    # continuation line: eff2
    index = index + 1
    eff2 = as.numeric(str_trim(s[index]))

    df = data.frame(point=point, gfin=gfin, qin=qin, rpm=rpm,
                    q1=q1, q2=q2, gf1=gf1, gf2=gf2,
                    pe1=pe1, pe2=pe2, eff1=eff1, eff2=eff2)

where s is the character vector generated by the script above. However, I still have the issue of applying the parsing algorithm to all data blocks. With a for loop? Isn't there a faster way?
You need to think a little bit sideways here. You have a list of text lines, and you want to process them 5 at a time. So, pass lapply a list of indices rather than the list of data:

    lapply(seq(1,length(s), 5), function (x) { parse_point(s[x:(x+4)]) })

This will call parse_point on each group of 5 lines in the source file.
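One detail worth checking when slicing a block by index is R's operator precedence: `:` binds more tightly than `+`, so `x:x+4` is read as `(x:x)+4` and selects a single element, while `x:(x+4)` selects the whole block. A minimal sketch on a toy vector (the contents of `s` here are made up for illustration, and `starts` and `blocks` are hypothetical names):

```r
# Toy stand-in for the file: two 5-line blocks (illustrative content only)
s <- c("header A", "a1", "a2", "a3", "a4",
       "header B", "b1", "b2", "b3", "b4")

x <- 1
wrong <- s[x:x+4]     # evaluated as s[(x:x) + 4], i.e. s[5]: one element
right <- s[x:(x+4)]   # elements 1 to 5: the whole first block

# Start index of every block, then one 5-line slice per block
starts <- seq(1, length(s), by = 5)
blocks <- lapply(starts, function(i) s[i:(i + 4)])
```

Each element of `blocks` is then a character vector of 5 lines, which is the shape a block-wise parse_point would receive.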
Alternatively, you could modify parse_point to take the index x instead of a list of lines. Then it's just

    lapply(seq(1,length(s), 5), parse_point)

You may need to either unlist the result of lapply or consider using sapply instead.
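Putting the pieces together, here is one possible sketch of the whole pipeline in base R. It is not the asker's parser: it swaps the whitespace splitting for a regular expression on the title line (so it does not matter whether a blank follows `q=`) and reads the four numeric lines with `scan`. The sample lines are the first and last blocks from the question, with the line breaks assumed to fall as the question describes; the regex pattern and the name `starts` are assumptions made for this illustration. Since parse_point returns a one-row data frame, the results are combined with do.call(rbind, ...) as in the question rather than with unlist or sapply.

```r
# Two sample blocks taken from the question (line breaks assumed as described there)
s <- c(
  'zone datapacking=point t="design gf= 0.000 q= 818.96 rpm=4800.",i= 4',
  ' 0 818.96002 0.00000 13.00000',
  ' 61.92762',
  ' 1 818.96002 0.00000 29.86776',
  ' 61.92762',
  'zone datapacking=point t="offdesign gf= 0.100 q=1200.00 rpm=4800.",i= 4',
  ' 0 1200.00000 0.10000 13.00000',
  ' 0.00000',
  ' 1 1177.85608 0.08308 15.94177',
  ' 13.05620'
)

# Parse one 5-line block into a one-row data frame
parse_point <- function(block) {
  # Title line: capture the label and the three numbers in one pass;
  # " *" after each "=" tolerates zero or more blanks
  pattern <- 't="([A-Za-z]+) +gf= *([0-9.]+) +q= *([0-9.]+) +rpm= *([0-9.]+)"'
  m <- regmatches(block[1], regexec(pattern, block[1]))[[1]]
  # Numeric lines: 10 values in total, laid out as n, q, gf, pe, eff per point
  v <- scan(text = block[2:5], quiet = TRUE)
  data.frame(point = m[2],
             gfin = as.numeric(m[3]), qin = as.numeric(m[4]),
             rpm = as.numeric(m[5]),
             q1 = v[2], q2 = v[7], gf1 = v[3], gf2 = v[8],
             pe1 = v[4], pe2 = v[9], eff1 = v[5], eff2 = v[10])
}

# Apply the parser block by block and bind the rows together
starts <- seq(1, length(s), by = 5)
data_list <- lapply(starts, function(i) parse_point(s[i:(i + 4)]))
df <- do.call(rbind, data_list)
```

For a real file, `s` would come from readLines with the two header lines dropped, as in the question's script. With many thousands of blocks, growing the result through rbind may become the slow step, so it is worth timing on the full file.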