Tuesday, June 18, 2013

R - Grammar of Graphics Figure 1.1 redone with ggplot


An Example (1.4)


  • first we have to get the data which are birth and death rates of the year 1990, therefore we use the world bank data (and of course, there is a package WDI providing direct access)
  • so load the package and download a list of indicators (WDIcache)
  • the resulting data frame contains indicator and name, and additional information (data description, source)

require(WDI)
indicators <- as.data.frame(WDIcache()$series)
names(indicators)
[1] "indicator"          "name"               "description"       
[4] "sourceDatabase"     "sourceOrganization"
  • then we have to find our two parameters of interest: the crude birth and the crude death rate:
    • we use grep on the column name to find them

grep("crude",indicators$name)
[1] 6553 6554
  • so we know that there are two lines containing the string crude in the name variable, so let's show them
indicators[grep("crude",indicators$name),]
indicator                                 name
6553 SP.DYN.CBRT.IN Birth rate, crude (per 1,000 people)
6554 SP.DYN.CDRT.IN Death rate, crude (per 1,000 people)
                                                                                                                                                                                                                                                                                                   description
6553 Crude birth rate indicates the number of live births occurring during the year, per 1,000 population estimated at midyear. Subtracting the crude death rate from the crude birth rate provides the rate of natural increase, which is equal to the rate of population change in the absence of migration.
6554      Crude death rate indicates the number of deaths occurring during the year, per 1,000 population estimated at midyear. Subtracting the crude death rate from the crude birth rate provides the rate of natural increase, which is equal to the rate of population change in the absence of migration.
                   sourceDatabase
6553 World Development Indicators
6554 World Development Indicators
                                                                                                                                                                                                                                                                                                                                                                                                                         sourceOrganization
6553 (1) United Nations Population Division. World Population Prospects, (2) United Nations Statistical Division. Population and Vital Statistics Reprot (various years), (3) Census reports and other statistical publications from national statistical offices, (4) Eurostat: Demographic Statistics, (5) Secretariat of the Pacific Community: Statistics and Demography Programme, and (6) U.S. Census Bureau: International Database.
6554 (1) United Nations Population Division. World Population Prospects, (2) United Nations Statistical Division. Population and Vital Statistics Reprot (various years), (3) Census reports and other statistical publications from national statistical offices, (4) Eurostat: Demographic Statistics, (5) Secretariat of the Pacific Community: Statistics and Demography Programme, and (6) U.S. Census Bureau: International Database.
  • now we now the indicators - and we can download the desired data (and just in case: we download them for the years 1980–2012)
rate.data <- WDI(country="all",indicator=indicators[grep("crude",indicators$name),1],start=1980,end=2012)
rate.data.1990 <- rate.data[rate.data$year==1990,]
head(rate.data.1990)
iso2c                                 country year SP.DYN.CBRT.IN
11     1A                              Arab World 1990       34.71824
44     1W                                   World 1990       25.85034
77     4E   East Asia & Pacific (developing only) 1990       22.84601
110    7E Europe & Central Asia (developing only) 1990       17.90782
143    8S                              South Asia 1990       32.91213
176    AD                                 Andorra 1990             NA
    SP.DYN.CDRT.IN
11        8.262026
44        9.266023
77        6.997506
110      10.065765
143      10.703840
176             NA


  • now we have to remove aggregations from these data (such as the European Union, Middle East, etc)
    • we use grepl on the iso2c column (which does the same as grep but returns logical values (of course you can also use grep again))
    • the iso2c is a two character, standardized country code
    • we eliminate all rows with iso2c starting with X, equals EU, containing a number, or equals ZQ, ZJ, ZG, or ZF

rate.data.1990 <- rate.data.1990[!grepl("^X|EU|\\d|Z[QJGF]",rate.data.1990$iso2c,perl=T),]
head(rate.data.1990)


    iso2c              country year SP.DYN.CBRT.IN SP.DYN.CDRT.IN
176    AD              Andorra 1990             NA             NA
209    AE United Arab Emirates 1990         25.916          2.794
242    AF          Afghanistan 1990         52.449         22.062
275    AG  Antigua and Barbuda 1990         20.100          6.800
308    AL              Albania 1990         24.610          5.909
341    AM              Armenia 1990         21.215          7.738
  • so that's the data frame
  • now the graphics
require(ggplot2)
require(gridExtra)
require(scales)

rate.data.1990$lab <- ifelse(rbinom(size=1,n=nrow(rate.data.1990),prob=0.2)==1,rate.data.1990$country,"")
ggplot(rate.data.1990,aes(x=SP.DYN.CBRT.IN,y=SP.DYN.CDRT.IN)) +
    stat_density2d(aes(colour= ..level..),bins=6,h=c(11,9),geom="density2d") +
    scale_x_continuous("Birth Rate", limits = c(0,60),breaks=seq(0,60,by=10),expand=c(0,1)) +
    scale_y_continuous("Death Rate", limits = c(0,30),breaks=seq(0,30,by=10),expand=c(0,1)) +
    scale_colour_gradientn(colours=c("blue","green","lightgreen","red"),guide="none") +
    geom_text(aes(label=lab),size=3,position = "jitter") +
    geom_abline(intersect=0,slope=1) +
    coord_fixed() +
    annotate("text",x=20,y=20,angle=45, label="Zero Population Growth",vjust=-0.5,size=4) +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill="transparent"),
        axis.text = element_text(colour="black"),
        axis.ticks = element_line(colour="black"),
        axis.line = element_line(colour="black")
        ) 


  • note:
    • the plot in chapter 1.4 of the book uses an epanechnikov kernel, whereas ggplot uses a normal one
    • I choose randomly about 20% of the country names as labels (because I was to lazy to pick up exactly those used in the book)


No comments :

Post a Comment