Sunday, June 12, 2011

R - ggplot2: qplot

ggplot

1 ggplot intro

1.1 first examples ggplot

The next few lines are the first examples from the ggplot2 book by H. Wickham, the author of the package. They are just an overview of what is possible with qplot(); the diamonds data set is part of the ggplot2 package.

  • first load the package
  • load the data set
  • create a small data set from diamonds using sample() (1000 rows)
  • have a look at the data (head())
library(ggplot2)
data(diamonds)
dsmall <- diamonds[sample(nrow(diamonds),1000),]
head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
  • and str()
str(diamonds)
'data.frame':   53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Factor w/ 5 levels "Fair","Good",..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Factor w/ 8 levels "I1","SI2","SI1",..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
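Because sample() draws random rows, dsmall differs on every run. A minimal sketch of a repeatable sample follows; the call to set.seed() and the seed value 42 are assumptions added here, not part of the original example.

```r
library(ggplot2)

# set.seed() is an addition for illustration: it fixes the random number
# generator so that the same 1000 rows are drawn on every run.
set.seed(42)  # 42 is an arbitrary, hypothetical seed
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

dim(dsmall)   # number of rows and columns of the sample
```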

Now we can start:

1.1.1 Simple Barcharts and Histograms

A bar chart of the variable cut, which is a factor with five levels:

qplot(cut, data=diamonds, geom="bar")
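The heights of the bars are simply the number of rows per level of cut. As a small sketch, the same counts can be computed with base R's table(); the variable name cut_counts is only chosen for this illustration.

```r
library(ggplot2)

# Count the rows per level of cut; these counts are the bar heights.
cut_counts <- table(diamonds$cut)
cut_counts

# The counts add up to the number of rows in the data set.
sum(cut_counts) == nrow(diamonds)
```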

Trying the same with a numeric (continuous) variable, e.g. depth, gives a histogram:

qplot(depth, data=diamonds, geom="histogram")

If we look at this picture, we notice that ggplot has apparently set the range of the x-axis too wide. Type

range(diamonds$depth)
[1] 43 79

which gives the min and the max of the depths.

If you still want to change the visible part of the x-axis, you can do it with the xlim argument:

qplot(depth, xlim=c(55,70), data=diamonds, geom="histogram")
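One thing to keep in mind: xlim sets the limits of the x scale, so depths outside 55 to 70 are dropped before the bins are counted (ggplot2 reports this in a warning). If you only want to zoom, a possible alternative is coord_cartesian(), written here in the ggplot() form. This is a sketch, not part of the book examples; the binwidth of 0.5 and the name p_zoom are arbitrary choices.

```r
library(ggplot2)

# Zoom in on depths between 55 and 70 without discarding any observations.
p_zoom <- ggplot(diamonds, aes(depth)) +
  geom_histogram(binwidth = 0.5) +    # 0.5 is an arbitrary choice
  coord_cartesian(xlim = c(55, 70))   # changes the view, not the data

# print(p_zoom)  # draws the plot
```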

Besides the image, R prints the following line for the histogram calls above:

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

So if you want to change the width of the bins, add the binwidth argument.

Now we have a look at the distribution of carat of the diamonds and change this argument:

qplot(carat, data=diamonds, geom="histogram", binwidth=1)

qplot(carat, data=diamonds, geom="histogram", binwidth=0.1)

qplot(carat, data=diamonds, geom="histogram", binwidth=0.01)

Each step reveals more detail: the narrower the bins, the finer the structure of the distribution that becomes visible in the image.
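To get a feeling for what the three binwidth values mean, here is a rough sketch that estimates how many bins are needed to cover the range of carat. The exact number of bins ggplot2 draws may differ slightly, because it also depends on where the bin boundaries are placed.

```r
library(ggplot2)

carat_range <- diff(range(diamonds$carat))  # width of the carat axis
binwidths <- c(1, 0.1, 0.01)                # the three values used above

# Rough number of bins needed to cover the range for each binwidth.
n_bins <- ceiling(carat_range / binwidths)
data.frame(binwidth = binwidths, approx_bins = n_bins)
```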

1.1.2 Density Plots

For continuous variables you can use density instead of histogram.

qplot(carat, data=diamonds, geom="density")
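The curve is a kernel density estimate, i.e. a smoothed version of the histogram. As a sketch of what is being computed, base R's density() produces the same kind of estimate; whether its default settings match those of the plot exactly is an assumption I have not checked.

```r
library(ggplot2)

# Kernel density estimate of carat with base R's default settings.
d <- density(diamonds$carat)

d$bw                              # bandwidth chosen by the default rule
head(data.frame(x = d$x, y = d$y))  # first grid points of the curve
```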

If we want to compare different groups defined by a factor, we simply add the colour argument. Here we use the variable diamonds$color.

qplot(carat, data=diamonds, geom="density", colour=color)

Too many curves on one plot? No problem: we add a facets argument, which splits the plot above into as many panels as there are levels of color:

qplot(carat, data=diamonds, geom="density", facets=color~., colour=color)

And if we want to fill the curve (in the same color):

qplot(carat, data=diamonds, geom="density", facets=color~., colour=color, fill=color)
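For comparison, here is a sketch of how the same plot might be written with the full ggplot() syntax, which is useful because recent ggplot2 releases mark qplot() as deprecated. The name p_dens is only used for this illustration.

```r
library(ggplot2)

# Outline and fill are both mapped to the factor color.
p_dens <- ggplot(diamonds, aes(carat, colour = color, fill = color)) +
  geom_density() +
  facet_grid(color ~ .)   # one panel per level of color, stacked vertically

# print(p_dens)  # draws the plot
```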

If you want to put two plots side by side in one image, use grid.arrange() from the gridExtra package. Install the package (if it is not installed yet) via install.packages("gridExtra").

First we load the library:
library(gridExtra)

Now we can look at the densities depending on color on the one hand and clarity on the other:

p1 <- qplot(carat, data=diamonds, geom="density", facets=clarity~., fill=clarity)
p2 <- qplot(carat, data=diamonds, geom="density", facets=color~., fill=color)
grid.arrange(p1, p2, ncol=2)

1.1.3 Scatter Plots

If we give two arguments, x and y, to qplot(), we get back a scatter plot, with which we can investigate the relationship between them:

qplot(carat, price, data=diamonds)

qplot() accepts functions of variables as arguments:

qplot(log(carat), log(price), data=diamonds)
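A possible alternative to transforming the variables yourself is the log argument of qplot(). The sketch below logs both axes; note that it uses a base-10 log scale and keeps the axis labels in the original units, so the picture has the same shape but different tick labels. The name p_log is only used for this illustration.

```r
library(ggplot2)

# Same relationship, but the axes are labelled in carat and price units.
p_log <- qplot(carat, price, data = diamonds, log = "xy")

# print(p_log)  # draws the plot
```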

By using the colour argument, you can colour the points by a factor variable. In this example I use the column color of the diamonds data frame to define the groups. The further argument alpha changes the transparency; it is a number in the range [0,1], where 0 means completely transparent and 1 completely opaque. I() is an R function that stands for "as is": it tells qplot() to use the value 1/7 directly instead of treating it as a variable to map.

qplot(log(carat), log(price), data=diamonds, colour=color, alpha=I(1/7))
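A short sketch of what I() does: it only marks the value with the class "AsIs". In the ggplot() form (shown here as one possible equivalent, not taken from the book examples) the same effect is achieved by setting alpha outside aes().

```r
library(ggplot2)

class(I(1/7))   # "AsIs": the value is marked to be used unchanged

# ggplot() form: alpha is set outside aes(), so it is a constant
# and does not get a legend.
p_alpha <- ggplot(diamonds, aes(log(carat), log(price), colour = color)) +
  geom_point(alpha = 1/7)

# print(p_alpha)  # draws the plot
```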

Instead of colour you can use the shape argument, which is especially helpful when you are forced to create black-and-white graphics. Unfortunately shape can only deal with a maximum of 6 levels, so I chose the column cut. And, of course, it is more appropriate to use a smaller data set. Additionally we use the size argument to change the size of the points according to the volume of the diamonds (the product x*y*z).

qplot(log(carat), log(price), data=dsmall, shape=cut, size=x*y*z)
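To see which factor columns fit within the limit of six shapes, you can count their levels. A minimal sketch (the name lv is only used for this illustration):

```r
library(ggplot2)

# Number of levels of each factor column in diamonds.
lv <- sapply(diamonds[c("cut", "color", "clarity")], nlevels)
lv

# TRUE for columns that fit within the six-shape limit mentioned above.
lv <= 6
```

Only cut, with its five levels, qualifies; color has seven levels and clarity eight.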

Via the geom argument (which is useful in lots of other ways) we can add a smoother; because we want to keep the points, we also add point. You can turn off the drawing of the confidence interval with the argument se=FALSE.

qplot(log(carat), log(price), data=dsmall, geom=c("point","smooth"))
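The se=FALSE option is not shown in the call above. Here is a sketch of the same plot without the confidence band, written in the ggplot() form so that se is passed only to the smoother. The seed value and the name p_smooth are assumptions made for this illustration.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so that the sample is repeatable
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

p_smooth <- ggplot(dsmall, aes(log(carat), log(price))) +
  geom_point() +
  geom_smooth(se = FALSE)   # smoother without the confidence band

# print(p_smooth)  # draws the plot
```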

With method you can also change the smoother: loess, gam, lm, and so on.

qplot(log(carat), log(price), data=dsmall, geom=c("point","smooth"), method="lm")
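With method="lm" the smoother is a straight line on the log-log scale. As a sketch, the corresponding model can be fitted directly with lm() to read off intercept and slope; the seed and the name fit are assumptions made for this illustration, so the numbers depend on the sample drawn.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so that the sample is repeatable
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# Linear model of log(price) on log(carat).
fit <- lm(log(price) ~ log(carat), data = dsmall)
coef(fit)   # intercept and slope on the log-log scale
```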

Author: Mandy

Date: 2011-06-12 13:04:25 CEST

HTML generated by org-mode 7.4 in emacs 23
