R: read.table producing unexpected results on tab-delimited data file
I'm a noob at R. Running RStudio on Windows, I'm having a heck of a time trying to understand what's happening with the following read.table command.
continents = read.table("country2continent.tsv", sep = "\t", col.names = c("country", "continent"), fileEncoding = "UTF-8", strip.white = TRUE)
Questions:
If I try to print a column of the data on the command line with the "continents$country" command, the data is totally garbled. I examined the garbled data and found special characters such as "\t" embedded in it. Do I have to get rid of the special characters that are causing the issues?
If I view the continents data frame in RStudio, it looks correct, except that examining the data frame shows row 61 has an issue. It should read "cote d'ivoire africa" but instead reads "cote divoire africa". In that row (row 61), the apostrophe is missing in divoire and there is a tab between divoire and africa. There are also a lot of country/continent pairs following "cote d'ivoire africa" that did not get their own rows. Any suggestions on how to fix this?
As per rawr's request, here is a snippet of sample data including the problematic row 61:
algeria africa angola africa benin africa botswana africa burkina faso africa burundi africa cameroon africa cape verde africa central african republic africa chad africa comoros africa congo - brazzaville africa congo - kinshasa africa côte d’ivoire africa djibouti africa egypt africa equatorial guinea africa eritrea africa ethiopia africa gabon africa gambia africa ghana africa guinea africa guinea-bissau africa kenya africa lesotho africa liberia africa libya africa madagascar africa malawi africa mali africa mauritania africa mauritius africa mayotte africa morocco africa mozambique africa namibia africa niger africa nigeria africa rwanda africa réunion africa saint helena africa senegal africa seychelles africa sierra leone africa somalia africa south africa africa sudan africa swaziland africa são tomé , príncipe africa tanzania africa togo africa tunisia africa uganda africa western sahara africa zambia africa zimbabwe africa eritrea , ethiopia africa south sudan africa sao tome , principe africa cote d'ivoire africa reunion africa congo, dem. rep. africa congo, rep. africa anguilla americas antigua , barbuda americas argentina americas aruba americas bahamas americas barbados americas belize americas bermuda americas bolivia americas brazil americas british virgin islands americas canada americas cayman islands americas chile americas colombia americas costa rica americas cuba americas dominica americas dominican republic americas ecuador americas el salvador americas falkland islands americas french guiana americas greenland americas grenada americas guadeloupe americas guatemala americas guyana americas haiti americas honduras americas jamaica americas martinique americas mexico americas montserrat americas netherlands antilles americas nicaragua americas panama americas paraguay americas peru americas puerto rico americas st. barthélemy americas st. kitts , nevis americas st. lucia americas st. martin americas st. pierre , miquelon americas st. 
vincent , grenadines americas suriname americas trinidad , tobago americas turks , caicos islands americas virgin islands (u.s.) americas united states americas uruguay americas venezuela americas st.-pierre-et-miquelon americas st. helena americas sint maarten (dutch part) americas falkland (malvinas) americas curaçao americas pitcairn americas cocos island americas afghanistan asia armenia asia azerbaijan asia bahrain asia bangladesh asia bhutan asia brunei asia cambodia asia china asia cyprus asia georgia asia hong kong, china asia india asia indonesia asia iran asia iraq asia israel asia japan asia jordan asia kazakhstan asia kuwait asia kyrgyzstan asia laos asia lebanon asia macao, china asia malaysia asia maldives asia mongolia asia myanmar [burma] asia nepal asia neutral zone asia north korea asia oman asia pakistan asia west bank , gaza asia people's democratic republic of yemen asia philippines asia qatar asia saudi arabia asia singapore asia south korea asia sri lanka asia syria asia taiwan asia tajikistan asia thailand asia timor-leste asia turkey asia turkmenistan asia united arab emirates asia uzbekistan asia vietnam asia yemen asia myanmar asia lao asia united korea (former) asia south yemen (former) asia north yemen (former) asia albania europe andorra europe austria europe belarus europe belgium europe bosnia , herzegovina europe bulgaria europe croatia europe cyprus europe czech republic europe denmark europe east germany europe estonia europe faroe islands europe finland europe france europe germany europe gibraltar europe greece europe guernsey europe hungary europe iceland europe ireland europe isle of man europe italy europe jersey europe latvia europe liechtenstein europe lithuania europe luxembourg europe macedonia europe malta europe metropolitan france europe moldova europe monaco europe montenegro europe netherlands europe norway europe poland europe portugal europe romania europe russia europe san marino europe serbia europe serbia , 
montenegro europe slovakia europe slovenia europe spain europe svalbard , jan mayen europe sweden europe switzerland europe ukraine europe ussr europe united kingdom europe vatican city europe Åland islands europe Åland europe west germany europe yugoslavia europe serbia excluding kosova europe serbia excluding kosovo europe slovak republic europe svalbard europe kosovo europe kyrgyz republic europe czechoslovakia europe macedonia europe macedonia, fyr europe channel islands europe faeroe islands europe holy see europe akrotiri , dhekelia europe american samoa oceania antarctica oceania australia oceania bouvet island oceania british indian ocean territory oceania christmas island oceania cocos [keeling] islands oceania cook islands oceania fiji oceania french polynesia oceania french southern territories oceania guam oceania heard island , mcdonald islands oceania kiribati oceania marshall islands oceania micronesia oceania nauru oceania new caledonia oceania new zealand oceania niue oceania norfolk island oceania northern mariana islands oceania palau oceania papua new guinea oceania pitcairn islands oceania samoa oceania solomon islands oceania south georgia , south sandwich islands oceania tokelau oceania tonga oceania tuvalu oceania u.s. minor outlying islands oceania vanuatu oceania wallis et futuna oceania micronesia, fed. sts. oceania cook oceania
I copied the data into a text file called countries.tsv and ran the following code. There may be a way to use read.table directly, but this was easier for me.
    ## read in each line of data as a character string
    rl <- readLines('~/desktop/countries.tsv')

    ## separate the last word (continent) from the rest of the string
    ## assumes the second column is _only_ 1 word
    ## (.*)      1st capture group: any character, any number of times
    ## \\s+      followed by 1 or more white spaces
    ## ([a-z]+)$ 2nd capture group: letters a-z, 1 or more times,
    ##           up to the end of the line $
    ## \\1;\\2   take the 2 capture groups and separate them with a semicolon
    txt <- gsub('(.*)\\s+([a-z]+)$', '\\1;\\2', rl, ignore.case = TRUE)

    txt[c(1:5, 60:62)]
    # [1] "algeria;africa"       "angola ;africa"
    # [3] "benin ;africa"        "botswana ;africa"
    # [5] "burkina faso ;africa" "sao tome , principe ;africa"
    # [7] "cote d'ivoire ;africa" "reunion;africa"
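As a quick check of the pattern above without needing the file, here is a minimal sketch on a made-up vector. The names x and out and the three sample strings are illustrative assumptions, not part of the original data file; each string uses a single tab between country and continent.

```r
# Hypothetical sample: country and continent separated by one tab
x <- c("algeria\tafrica",
       "burkina faso\tafrica",
       "cote d'ivoire\tafrica")

# Same substitution as above: everything before the last run of
# whitespace is group 1, the final word is group 2
out <- gsub('(.*)\\s+([a-z]+)$', '\\1;\\2', x, ignore.case = TRUE)
print(out)
```

Each element should come back as country;continent, with the space inside "burkina faso" and the apostrophe in "cote d'ivoire" left untouched, because only the whitespace immediately before the final word is replaced.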
So now we have a semicolon-separated vector of strings, and we can use the text= argument of read.table, which is straightforward. Note that since you have irregular quotes, e.g. in line 61 as you pointed out, you should disable quoting with quote = ''.
    dd <- read.table(text = txt, sep = ';', quote = '', stringsAsFactors = FALSE,
                     col.names = c("country", "continent"), strip.white = TRUE)

    str(dd)
    # 'data.frame': 290 obs. of 2 variables:
    #  $ country  : chr "algeria" "angola" "benin" "botswana" ...
    #  $ continent: chr "africa" "africa" "africa" "africa" ...

    dd[c(1:5, 60:62), ]
    #                country continent
    # 1              algeria    africa
    # 2               angola    africa
    # 3                benin    africa
    # 4             botswana    africa
    # 5         burkina faso    africa
    # 60 sao tome , principe    africa
    # 61       cote d'ivoire    africa
    # 62             reunion    africa
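Since the original file is already tab-delimited, quote = '' may also be enough on its own, without the intermediate semicolon step. The sketch below is an assumption rather than something tested on the full file: it writes three made-up rows to a temporary file (tmp and direct are illustrative names) and reads them back with sep = "\t" and quoting disabled. The default for read.table is quote = "\"'", so an apostrophe is otherwise treated as a quoting character.

```r
# Hypothetical three-row sample written to a temporary tab-separated file
tmp <- tempfile(fileext = ".tsv")
writeLines(c("sao tome and principe\tafrica",
             "cote d'ivoire\tafrica",
             "people's democratic republic of yemen\tasia"), tmp)

# quote = "" disables quote handling, so apostrophes stay ordinary characters
direct <- read.table(tmp, sep = "\t", quote = "",
                     col.names = c("country", "continent"),
                     stringsAsFactors = FALSE, strip.white = TRUE)
print(direct)

# Clean up the temporary file
unlink(tmp)
```

If this works on the real file, the same call with "country2continent.tsv" in place of tmp, plus the fileEncoding = "UTF-8" argument from the question, would be the direct equivalent.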