r - How to reshape a dataframe with "reoccurring" columns? -


i new data analysis r. got pre-formatted environmental observation-model dataset, example subset of shown below:

date                   site      obs    mod      site           obs    mod 2000-09-01 00:00:00    campus    na     61.63    city centre    66     56.69 2000-09-01 01:00:00    campus    52     62.55    city centre    na     54.75 2000-09-01 02:00:00    campus    52     63.52    city centre    56     54.65 

basically, data include time series of hourly observed , modelled concentrations of pollutant @ various sites in "reoccurring columns", i.e., site - obs - mod (in example showed 2 out of total 75 sites). read "wide" dataset in data frame, , wanted reshape "narrower" format as:

date                   site           obs    mod 2000-09-01 00:00:00    campus         na     61.63 2000-09-01 01:00:00    campus         52     62.55 2000-09-01 02:00:00    campus         52     63.52 2000-09-01 00:00:00    city centre    66     56.69 2000-09-01 01:00:00    city centre    na     54.75 2000-09-01 02:00:00    city centre    56     54.65 

i believed should use package "reshape2" this. firstly tried melt , dcast dataset:

test.melt <- melt(test.data, id.vars = "date", measure.vars = c("site", "obs", "mod")) 

however, returned half of data, i.e., records of site(s) ("city centre") following first 1 ("campus") cut off:

date                   variable    value 2001-01-01 00:00:00    site        campus 2001-01-01 01:00:00    site        campus 2001-01-01 02:00:00    site        campus 2001-01-01 00:00:00    obs         na 2001-01-01 01:00:00    obs         52 2001-01-01 02:00:00    obs         52 2001-01-01 00:00:00    mod         61.63 2001-01-01 01:00:00    mod         62.55 2001-01-01 02:00:00    mod         63.52 

i tried recast:

test.recast <- recast(test.data, date ~ site + obs + mod) 

however, returned error message:

error in eval(expr, envir, enclos) : object 'site' not found 

i have tried search previous questions have not found similar scenarios (correct me if wrong). please me this?

many in advance!

you might better off using base r reshape after doing variable name cleanup.

here's data.

test <- read.table(header = true, stringsasfactors=false, text = "date             site  obs    mod    site             obs    mod '2000-09-01 00:00:00'  campus   na  61.63    'city centre'    66     56.69 '2000-09-01 01:00:00'  campus   52  62.55    'city centre'    na     54.75 '2000-09-01 02:00:00'  campus   52  63.52    'city centre'    56     54.65") test #                  date   site obs   mod      site.1 obs.1 mod.1 # 1 2000-09-01 00:00:00 campus  na 61.63 city centre    66 56.69 # 2 2000-09-01 01:00:00 campus  52 62.55 city centre    na 54.75 # 3 2000-09-01 02:00:00 campus  52 63.52 city centre    56 54.65 

if did correctly, should names got: @chase mentions in answer, "recurring column names bit of oddity , not normal r behaviour"--so we've got fix that.

note: both of these options generate "time" variable can go ahead , drop. might want keep in case wanted reshape wide format.

  • option 1: if got names did (which should have), solution simple. first site, append "0" site name , use base r reshape:

    names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".") test <- reshape(test, direction = "long",                  idvar = "date", varying = 2:ncol(test)) rownames(test) <- null # reshape makes ugly rownames test #                  date time        site obs   mod # 1 2000-09-01 00:00:00    0      campus  na 61.63 # 2 2000-09-01 01:00:00    0      campus  52 62.55 # 3 2000-09-01 02:00:00    0      campus  52 63.52 # 4 2000-09-01 00:00:00    1 city centre  66 56.69 # 5 2000-09-01 01:00:00    1 city centre  na 54.75 # 6 2000-09-01 02:00:00    1 city centre  56 54.65 
  • option 2: if have duplicated column names, fix still easy, , follows same logic. first, create nicer column names (easy using rep()), , use reshape() described above.

    names(test)[-1] <- paste(names(test)[-1],                           rep(1:((ncol(test)-1)/3), each = 3), sep = ".") test <- reshape(test, direction = "long",                  idvar = "date", varying = 2:ncol(test)) rownames(test) <- null  ### or, more convenient: # names(test) <- make.unique(names(test)) # names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".") # test <- reshape(test, direction = "long",  #                 idvar = "date", varying = 2:ncol(test)) # rownames(test) <- null 
  • optional step: data in form still not totally "long". if required, required 1 more step:

    require(reshape2) melt(test, id.vars = c("date", "site", "time")) #                   date        site time variable value # 1  2000-09-01 00:00:00      campus    0      obs    na # 2  2000-09-01 01:00:00      campus    0      obs 52.00 # 3  2000-09-01 02:00:00      campus    0      obs 52.00 # 4  2000-09-01 00:00:00 city centre    1      obs 66.00 # 5  2000-09-01 01:00:00 city centre    1      obs    na # 6  2000-09-01 02:00:00 city centre    1      obs 56.00 # 7  2000-09-01 00:00:00      campus    0      mod 61.63 # 8  2000-09-01 01:00:00      campus    0      mod 62.55 # 9  2000-09-01 02:00:00      campus    0      mod 63.52 # 10 2000-09-01 00:00:00 city centre    1      mod 56.69 # 11 2000-09-01 01:00:00 city centre    1      mod 54.75 # 12 2000-09-01 02:00:00 city centre    1      mod 54.65 

update (to try address questions comments)

  1. the reshape() documentation quite confusing. it's best work through few examples understanding of how works. specifically, "time" not have refer time ("date" in problem), more for, say, panel data, records collected @ different times same id. in case, "id" in original data "date" column. other potential "id" site, not in way data organized.

    imagine, moment, if data looked this:

    test1 <- structure(list(date = structure(1:3,      .label = c("2000-09-01 00:00:00",      "2000-09-01 01:00:00", "2000-09-01 02:00:00"), class = "factor"),      obs.campus = c(na, 52l, 52l), mod.campus = c(61.63, 62.55,      63.52), obs.citycentre = c(66l, na, 56l), mod.citycentre = c(56.69,      54.75, 54.65)), .names = c("date", "obs.campus", "mod.campus",  "obs.citycentre", "mod.citycentre"), class = "data.frame", row.names = c(na,  -3l)) test1 #                  date obs.campus mod.campus obs.citycentre mod.citycentre # 1 2000-09-01 00:00:00         na      61.63             66          56.69 # 2 2000-09-01 01:00:00         52      62.55             na          54.75 # 3 2000-09-01 02:00:00         52      63.52             56          54.65 

    now try reshape(test1, direction = "long", idvar = "date", varying = 2:ncol(test1)). you'll see reshape() sees site names "time" (that can overridden adding "timevar = "site"" reshape command).

    when direction = "long", must specify columns vary "time". in case, columns except first, hence use of 2:ncol(test) "varying".

  2. test2? that?

  3. question under @chase's answer: think misunderstand how melt() supposed work. basically, tries "skinniest" form of data. in case, skinniest form "optional step" described above since date + site minimum required comprise unique id variable. (i "time" can safely dropped.)

    once data in format described in "optional step" (we'll assume output has been stored "test.melt", can pivot table around in different ways. demonstration of mean that, try following , see do.

    dcast(test.melt, date + site ~ variable) dcast(test.melt, date ~ variable + site) dcast(test.melt, variable + site ~ date) dcast(test.melt, variable + date ~ site) 

    it not easy have flexibility if stop @ "option 1" or "option 2".


update (a few years later)

melt "data.table" can "melt" multiple columns in similar way reshape does. should work whether or not column names duplicated.

you can try following:

measure <- c("site", "obs", "mod") melt(as.data.table(test), measure.vars = patterns(measure), value.name = measure) #                   date variable        site obs   mod # 1: 2000-09-01 00:00:00        1      campus  na 61.63 # 2: 2000-09-01 01:00:00        1      campus  52 62.55 # 3: 2000-09-01 02:00:00        1      campus  52 63.52 # 4: 2000-09-01 00:00:00        2 city centre  66 56.69 # 5: 2000-09-01 01:00:00        2 city centre  na 54.75 # 6: 2000-09-01 02:00:00        2 city centre  56 54.65 

Comments

Popular posts from this blog

java - pagination of xlsx file to XSSFworkbook using apache POI -

Unlimited choices in BASH case statement -

apache - How do I stop my index.php being run twice for every user -