R: read.table producing unexpected results on a tab-delimited data file


I'm new to R. I'm running RStudio on Windows, and I'm having a heck of a time trying to understand what's happening with the following read.table command.

continents = read.table("country2continent.tsv", sep = "\t", col.names = c("country", "continent"), fileEncoding = "UTF-8", strip.white = TRUE)

Questions:

  1. If I try to print a column of the data on the command line with the "continents$country" command, the data is totally garbled. I examined the garbled data and found special characters such as "\t" embedded in it. Do I have to get rid of the special characters that are causing the issues?

  2. If I view the continents data frame in RStudio, it looks mostly correct. However, examining the data frame shows that row 61 has an issue. It should read "cote d'ivoire africa" but instead reads "cote divoire africa". In that row (row 61), the apostrophe is missing in divoire and there is a tab between divoire and africa. There are also a lot of country/continent pairs following "cote d'ivoire africa" that did not get their own rows. Any suggestions on how to fix this?

As per rawr's request, here is a snippet of the sample data, including the problematic row 61:

algeria africa angola  africa benin   africa botswana    africa burkina faso    africa burundi africa cameroon    africa cape verde  africa central african republic    africa chad    africa comoros africa congo - brazzaville africa congo - kinshasa    africa côte d’ivoire   africa djibouti    africa egypt   africa equatorial guinea   africa eritrea africa ethiopia    africa gabon   africa gambia  africa ghana   africa guinea  africa guinea-bissau   africa kenya   africa lesotho africa liberia africa libya   africa madagascar  africa malawi  africa mali    africa mauritania  africa mauritius   africa mayotte africa morocco africa mozambique  africa namibia africa niger   africa nigeria africa rwanda  africa réunion africa saint helena    africa senegal africa seychelles  africa sierra leone    africa somalia africa south africa    africa sudan   africa swaziland   africa são tomé , príncipe   africa tanzania    africa togo    africa tunisia africa uganda  africa western sahara  africa zambia  africa zimbabwe    africa eritrea , ethiopia    africa south sudan africa sao tome , principe   africa cote d'ivoire   africa reunion africa congo, dem. rep.    africa congo, rep. 
africa anguilla    americas antigua , barbuda americas argentina   americas aruba   americas bahamas americas barbados    americas belize  americas bermuda americas bolivia americas brazil  americas british virgin islands  americas canada  americas cayman islands  americas chile   americas colombia    americas costa rica  americas cuba    americas dominica    americas dominican republic  americas ecuador americas el salvador americas falkland islands    americas french guiana   americas greenland   americas grenada americas guadeloupe  americas guatemala   americas guyana  americas haiti   americas honduras    americas jamaica americas martinique  americas mexico  americas montserrat  americas netherlands antilles    americas nicaragua   americas panama  americas paraguay    americas peru    americas puerto rico americas st. barthélemy  americas st. kitts , nevis americas st. lucia   americas st. martin  americas st. pierre , miquelon americas st. vincent , grenadines  americas suriname    americas trinidad , tobago americas turks , caicos islands    americas virgin islands (u.s.)   americas united states   americas uruguay americas venezuela   americas st.-pierre-et-miquelon  americas st. 
helena  americas sint maarten (dutch part)   americas falkland (malvinas)  americas curaçao americas pitcairn    americas cocos island    americas afghanistan asia armenia asia azerbaijan  asia bahrain asia bangladesh  asia bhutan  asia brunei  asia cambodia    asia china   asia cyprus  asia georgia asia hong kong, china    asia india   asia indonesia   asia iran    asia iraq    asia israel  asia japan   asia jordan  asia kazakhstan  asia kuwait  asia kyrgyzstan  asia laos    asia lebanon asia macao, china    asia malaysia    asia maldives    asia mongolia    asia myanmar [burma] asia nepal   asia neutral zone    asia north korea asia oman    asia pakistan    asia west bank , gaza  asia people's democratic republic of yemen   asia philippines asia qatar   asia saudi arabia    asia singapore   asia south korea asia sri lanka   asia syria   asia taiwan  asia tajikistan  asia thailand    asia timor-leste asia turkey  asia turkmenistan    asia united arab emirates    asia uzbekistan  asia vietnam asia yemen   asia myanmar asia lao asia united korea (former)   asia south yemen (former)    asia north yemen (former)    asia albania europe andorra europe austria europe belarus europe belgium europe bosnia , herzegovina  europe bulgaria    europe croatia europe cyprus  europe czech republic  europe denmark europe east germany    europe estonia europe faroe islands   europe finland europe france  europe germany europe gibraltar   europe greece  europe guernsey    europe hungary europe iceland europe ireland europe isle of man europe italy   europe jersey  europe latvia  europe liechtenstein   europe lithuania   europe luxembourg  europe macedonia   europe malta   europe metropolitan france europe moldova europe monaco  europe montenegro  europe netherlands europe norway  europe poland  europe portugal    europe romania europe russia  europe san marino  europe serbia  europe serbia , montenegro   europe slovakia    europe slovenia    europe spain   europe svalbard , jan mayen 
 europe sweden  europe switzerland europe ukraine europe ussr    europe united kingdom  europe vatican city    europe Åland islands   europe Åland   europe west germany    europe yugoslavia  europe serbia excluding kosova europe serbia excluding kosovo europe slovak republic europe svalbard    europe kosovo  europe kyrgyz republic europe czechoslovakia  europe macedonia   europe macedonia, fyr  europe channel islands europe faeroe islands  europe holy see    europe akrotiri , dhekelia   europe american samoa  oceania antarctica  oceania australia   oceania bouvet island   oceania british indian ocean territory  oceania christmas island    oceania cocos [keeling] islands oceania cook islands    oceania fiji    oceania french polynesia    oceania french southern territories oceania guam    oceania heard island , mcdonald islands   oceania kiribati    oceania marshall islands    oceania micronesia  oceania nauru   oceania new caledonia   oceania new zealand oceania niue    oceania norfolk island  oceania northern mariana islands    oceania palau   oceania papua new guinea    oceania pitcairn islands    oceania samoa   oceania solomon islands oceania south georgia , south sandwich islands    oceania tokelau oceania tonga   oceania tuvalu  oceania u.s. minor outlying islands oceania vanuatu oceania wallis et futuna    oceania micronesia, fed. sts.   oceania cook oceania 

I copied the data to a text file called countries.tsv and ran the following code. There may be a way to use read.table directly, but this was easier for me.

    ## read in each line of the data as a character string
    rl <- readLines('~/desktop/countries.tsv')

    ## separate the last word (continent) from the rest of the string
    ## assumes the second column is _only_ 1 word

    ## (.*)        1st capture group: any character, any number of times
    ## \\s+        followed by 1 or more whitespace characters
    ## ([a-z]+)$   2nd capture group: letters a-z, 1 or more times,
    ##               up to the end of the line $

    ## \\1;\\2     take the 2 capture groups and separate them with a semicolon
    txt <- gsub('(.*)\\s+([a-z]+)$', '\\1;\\2', rl, ignore.case = TRUE)

    txt[c(1:5, 60:62)]
    # [1] "algeria;africa"                 "angola ;africa"
    # [3] "benin  ;africa"                 "botswana   ;africa"
    # [5] "burkina faso   ;africa"         "sao tome , principe  ;africa"
    # [7] "cote d'ivoire  ;africa"         "reunion;africa"
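One caveat with this approach: gsub() returns any string that does not match the pattern unchanged, so a blank line or a line with no continent would pass through without a semicolon. Here is a minimal sketch of a check for that, using a few made-up sample lines rather than the real file (the names rl_demo, pattern, matched, unmatched_lines and txt_demo are only for illustration):

```r
# Hypothetical sample lines (not the real file); the last one is blank.
rl_demo <- c("algeria africa",
             "burkina faso    africa",
             "cote d'ivoire   africa",
             "")
pattern <- "(.*)\\s+([a-z]+)$"

# gsub() leaves non-matching strings as they are, so flag them first.
matched <- grepl(pattern, rl_demo, ignore.case = TRUE)
unmatched_lines <- rl_demo[!matched]

# Only rewrite the lines that really have a "name <space> continent" shape.
txt_demo <- gsub(pattern, "\\1;\\2", rl_demo[matched], ignore.case = TRUE)

print(txt_demo)
print(unmatched_lines)
```

If unmatched_lines is not empty, those lines are worth looking at by hand before the vector goes into read.table.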

So now we have a semicolon-separated vector of strings, and we can use text= in read.table, which is straightforward. Note that since the data has irregular quotes, e.g. in line 61 as you pointed out, we disable quoting with quote = ''.

    dd <- read.table(text = txt, sep = ';', quote = '', stringsAsFactors = FALSE,
                     col.names = c("country","continent"), strip.white = TRUE)

    str(dd)
    # 'data.frame': 290 obs. of  2 variables:
    #   $ country  : chr  "algeria" "angola" "benin" "botswana" ...
    #   $ continent: chr  "africa" "africa" "africa" "africa" ...

    dd[c(1:5, 60:62), ]
    #                  country continent
    # 1                algeria    africa
    # 2                 angola    africa
    # 3                  benin    africa
    # 4               botswana    africa
    # 5           burkina faso    africa
    # 60 sao tome , principe    africa
    # 61         cote d'ivoire    africa
    # 62               reunion    africa
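To illustrate why the quote setting matters, here is a small sketch on a made-up five-row sample with real tab separators (tsv_demo, bad and good are hypothetical names, and this is not the original file). read.table defaults to quote = "\"'", so an apostrophe opens a quoted string that runs until the next apostrophe. That would explain both the missing apostrophe in row 61 and the rows that ended up folded into a single field with embedded "\t" characters.

```r
# Hypothetical sample with real tabs; two of the names contain an apostrophe.
tsv_demo <- paste(
  "Algeria\tAfrica",
  "Cote d'Ivoire\tAfrica",
  "Reunion\tAfrica",
  "People's Democratic Republic of Yemen\tAsia",
  "Qatar\tAsia",
  sep = "\n"
)

# Default quoting: everything between the two apostrophes counts as one
# quoted string, so the lines in between are merged into a single field.
bad <- read.table(text = tsv_demo, sep = "\t",
                  col.names = c("country", "continent"),
                  stringsAsFactors = FALSE, strip.white = TRUE)

# quote = "" switches quote handling off, so every line becomes its own row.
good <- read.table(text = tsv_demo, sep = "\t", quote = "",
                   col.names = c("country", "continent"),
                   stringsAsFactors = FALSE, strip.white = TRUE)

print(nrow(bad))                        # fewer rows than lines in the sample
print(nrow(good))                       # one row per line
print(any(grepl("\t", good$country)))   # no tabs left inside the names
```

Assuming the file on disk really does have one tab per line, the same call with the file name in place of text = (plus fileEncoding = "UTF-8", as in the question) may be all that is needed. I have not run that against the original file.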
