Getting to Know R

Getting Started
Doing Simple Calculations
Making Simple Plots
Creating Random Distributions
Decisions and Loops
Matrix Manipulations
Reading/Writing Data
Dataframes
Fitting Curves to Data
Additional Packages
Integration and Differentiation
And Much, Much More

For general computations and associated documentation, the statistical programming language R has become a popular choice. The R programming environment (a) is available on all major platforms (Mac, PC, unix/linux), (b) has many easy-to-use (and many not-so-easy-to-use) built-in functions for plotting data, generating curves, manipulating matrices, etc., and best of all (c) is freely available for download over the internet. For information and free downloads, visit http://r-project.org and https://www.rstudio.com/.

There will be many options available to the R user for performing many of the typical operations that are described below. The ones shown in this document should help the new user get started with R, but a quick web search can easily bring up many optional methods and the user is encouraged to explore the vast resources available online.

Getting Started

After installing R (and, perhaps, Rstudio) on your system and initializing,

in the Console window, type commands and hit Return:

34 + 17

## [1] 51

sqrt(49)

## [1] 7

besselI(0.25,1)

## [1] 0.1259791

pi

## [1] 3.141593

in the Source window, select lines that you wish to execute and (a) hit Run, or (b) hit Command-Return (MAC) or Control-Return (PC)
if have a text file located in a directory /dir/filename that contains R commands, can type source("/dir/filename") (quotation marks must be included) in the console window and hit Return to execute the commands.

Doing Simple Calculations

Define variables and perform calculations…

E0 = 938; W = 8000
sqrt( 1 - (E0/(E0+W))^2 )

## [1] 0.994478

# or, define a variable:
beta = sqrt( 1 - (E0/(E0+W))^2 )
# just typing the name of the variable displays its value:
beta

## [1] 0.994478

Create a function as follows:

beta = function(x){ sqrt( 1 - (E0/(E0+x))^2 ) }
beta(8000)

## [1] 0.994478

Make a list of numbers and perform a calculation on all of them

x = c(0,2,5,9,23)
beta(x)

## [1] 0.0000000 0.0651981 0.1028413 0.1375393 0.2174718

Round off our answers to 3 digits

round(beta(x),3)

## [1] 0.000 0.065 0.103 0.138 0.217

Making Simple Plots

x = c(0,1,2,3,4,5,6,7,8,9)
y = beta(x)
x

##  [1] 0 1 2 3 4 5 6 7 8 9

##  [1] 0.00000000 0.04613883 0.06519810 0.07978741 0.09205725 0.10284134
##  [7] 0.11256761 0.12149051 0.12977598 0.13753935

plot(x,y)

Plot a curve of a function

curve(beta,0,10)

Plot the points, and add a curve to the same plot; also, add some new labels to the axes

plot(x,y,
   xlab="proton kinetic energy [MeV]", ylab="v/c", main="Velocity vs. Kinetic Energy")
curve(beta, 0, 10, add=TRUE, col="blue")

Creating Random Distributions

# Uniform Distribution of Npts points between 0 and 1
Npts = 500
r = runif(Npts,0,1)
plot(r)

hist(r)

# Normal (Gaussian) Distribution of Npts points, with mean=0, sigma=sig
sig=2
set.seed(231)  # set the random number seed to get same results each run
r = rnorm(Npts,0,sig)
plot(r)

hist(r)
curve(Npts/(sqrt(2*pi)*sig)*exp(-x^2/(2*sig^2)), add=TRUE, col="blue")

Decisions and Loops

imax=10
x = array(c(1:imax))
x

##  [1]  1  2  3  4  5  6  7  8  9 10

i = 1
while (i<imax){
  x[i+1] = x[i] + 12
  i = i+1
}
x

##  [1]   1  13  25  37  49  61  73  85  97 109

Can use ifelse(test,yes,no) to perform tests and make decisions

imax=10
x = runif(imax,-1,1)
round(x,2)

##  [1] -0.17  0.29 -0.56  0.82 -0.32 -0.25 -0.25  0.24 -0.70  0.50

i = 0
while (i<imax){
  i = i+1
  x[i] = ifelse(x[i]<0, x[i]+15, x[i])
}
round(x,2)

##  [1] 14.83  0.29 14.44  0.82 14.68 14.75 14.75  0.24 14.30  0.50

Matrix Manipulations

A matrix is essentially an ordered list; so, create a list of numbers and tell R where to break the list into rows or columns.

M = matrix( c(1,2,3,4), ncol=2, byrow=TRUE)
M

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

Nice to type the input this way, to visually emphasize the arrangement:

A = matrix( c(1, 2,
              3, 4  ), ncol=2, byrow=TRUE)
A

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

B = matrix( c(5, 6,
              7, 8  ), ncol=2, byrow=TRUE)
B

##      [,1] [,2]
## [1,]    5    6
## [2,]    7    8

Multiply matrices – but be careful! If you write A * B, this will take each element of A (say a11) and multiply by the corresponding element in B (say b11). What we want is a real matrix multiply – performed using the %*% operator:

C = A * B
C

##      [,1] [,2]
## [1,]    5   12
## [2,]   21   32

WRONG!!

Here’s the right way (in R ):

C = A %*% B
C

##      [,1] [,2]
## [1,]   19   22
## [2,]   43   50

Take the transpose using t(), determinant using det()

t(A)

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

det(A)

## [1] -2

To find the inverse of a matrix, we use the solve() function:

solve(A)

##      [,1] [,2]
## [1,] -2.0  1.0
## [2,]  1.5 -0.5

Check:

  A      %*% solve(A)

##      [,1]         [,2]
## [1,]    1 1.110223e-16
## [2,]    0 1.000000e+00

solve(A) %*%   A

##      [,1]         [,2]
## [1,]    1 4.440892e-16
## [2,]    0 1.000000e+00

The function solve(A,B) solves the equation \(A\vec{x} = B\) for \(\vec{x}\). If \(B\) is missing in the solve() command, it is taken to be the identity and the inverse of \(A\) is returned.

Multiply a vector by a matrix to get a new vector; still use the operator %*%…

x = c(2,1)
x

## [1] 2 1

t(t(x))  # trick, to see it in a "usual" (up/down) vector format

##      [,1]
## [1,]    2
## [2,]    1

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

A %*% x

##      [,1]
## [1,]    4
## [2,]   10

Reading/Writing Data

Write a set of numbers to a file, and read them back in…

x = runif(500,3,9)
y = rnorm(500,2,6)
write.table(cbind(x,y), file="data.txt", row.names=FALSE)

data = read.table("data.txt", header = TRUE)
head(data)  # shows the first few lines...

##          x         y
## 1 8.405329  4.752421
## 2 8.990993 -6.413817
## 3 3.436147  2.626524
## 4 5.367637  7.377437
## 5 3.300938  1.870243
## 6 3.387021 -8.907860

xx = data[[1]]  # first column
yy = data[[2]]  # second column

plot(xx,yy)
abline(h=c(mean(yy),mean(yy)+c(-1,1)*sd(yy)),
   lty=c(1,2,2),lwd=2,col=c("red","blue","blue"))
abline(v=mean(xx),lty=1,lwd=2,col="green")

Note: The abline() function adds either horizontal, vertical, or general (\(y = a + bx\)) lines to an existing plot. The argument “lty” depicts the style of the line (solid, dotted, etc.).

Dataframes

Often times we wish to create or maybe read in a table of data, perhaps from measurements that were taken or from the results of a computer simulation, and take a quick look at the numbers, make simple plots, or pursue other manipulations. Above we read in data from a file and then identified new variables for each of the two columns of numbers. In reality, the read.table command created a “dataframe” in R which makes many new options available for our use.

# Create some sample data and write to a file, with a header line
x = rnorm(50,0,2)
y = rnorm(50,0,5)
p = (y/5*2-5*x)/10
z = 2-(x+runif(50,-2,2))/2

filename = "data2.txt"
write.table(cbind(x,y,z,p), file=filename, row.names = FALSE)

# Look at the first few lines of the file we created:
system("head -n 5 data2.txt")  # here, for unix/mac system

Next, we read the file back in, creating a dataframe named data2.

filename = "data2.txt"
data2 = read.table(file=filename,header=TRUE,sep="")
# What have we created?
attributes(data2)

## $names
## [1] "x" "y" "z" "p"
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

head(data2)  # look at first few lines of the dataframe

##            x          y        z          p
## 1 -1.4275733  0.9341935 3.406161  0.7511544
## 2 -1.3736954  3.1395635 1.877720  0.8124302
## 3 -1.3159787 -4.6992528 2.435235  0.4700192
## 4  1.7580206 -8.1432030 1.282641 -1.2047384
## 5 -0.6537441 -2.4643359 2.723834  0.2282986
## 6  0.5998719 -9.1421571 2.220741 -0.6656223

Note that the names of the columns in our dataframe are taken from the header (first line) of the text file we read in. To access the data for one of the variables (columns) from our dataframe data we use the dollar sign:

length(data2$x) # number of entries for "x"

## [1] 50

data2$x[23]     # value of 23rd entry for "x"

## [1] 1.479732

print(sqrt(abs(data2$x)))  # perform calculation on all values of "x"

##  [1] 1.1948110 1.1720475 1.1471611 1.3259037 0.8085444 0.7745140 1.5291475
##  [8] 0.4660131 1.4790423 0.7585373 0.6813876 1.6949926 1.2965852 0.8788676
## [15] 1.3323534 1.8204247 0.6084816 0.8739309 1.4903011 1.9915934 1.4277139
## [22] 1.6439511 1.2164426 1.4701206 1.7993016 0.5850132 1.8616033 0.4317011
## [29] 1.0924151 1.6296967 0.2154265 1.6189016 1.4255884 1.4386191 0.3784294
## [36] 0.9138128 0.5334032 1.3422886 0.9582665 1.9949085 1.4001613 0.4762613
## [43] 1.3154949 1.2312756 1.6939402 1.3210473 0.5674360 0.5617848 0.9786093
## [50] 1.3660924

pairs(data2)    # create plots of each varialble vs. every other variable

plot(data2$x,data2$p)  # plot one variable vs. another variable

You can create a subset of your data set:

# find subset where abs(x)<2 AND abs(p)<1:
data.sub = subset(data2, subset=( (abs(data2$x)<2) & (abs(data2$p)<1)) )
head(data.sub)

##            x          y        z          p
## 1 -1.4275733  0.9341935 3.406161  0.7511544
## 2 -1.3736954  3.1395635 1.877720  0.8124302
## 3 -1.3159787 -4.6992528 2.435235  0.4700192
## 5 -0.6537441 -2.4643359 2.723834  0.2282986
## 6  0.5998719 -9.1421571 2.220741 -0.6656223
## 8  0.2171682 -1.7014461 2.699227 -0.1766420

plot(data.sub$x,data.sub$p)  # plot one variable vs. another variable

#  note: a & b     "a is true AND b is true" (intersection)
#        a | b     "a is true OR  b is true"    (union)
#         !a            "a is NOT true"        (negation)
#      other logical options:  a< b  a<=b  a==b  a>=b   a!=b

You can also add columns to your data frame using the already existing data if required. An example:

data2$a = sqrt(data2$x^2 + data2$y^2)
head(data2)

##            x          y        z          p        a
## 1 -1.4275733  0.9341935 3.406161  0.7511544 1.706072
## 2 -1.3736954  3.1395635 1.877720  0.8124302 3.426937
## 3 -1.3159787 -4.6992528 2.435235  0.4700192 4.880039
## 4  1.7580206 -8.1432030 1.282641 -1.2047384 8.330810
## 5 -0.6537441 -2.4643359 2.723834  0.2282986 2.549575
## 6  0.5998719 -9.1421571 2.220741 -0.6656223 9.161817

Fitting Curves to Data

Let’s take our previous plot and fit the data to a linear model, finding the intercept and slope, and adding the corresponding line to the plot.

fitparams = lm(p ~ x, data.sub)
fitparams

## 
## Call:
## lm(formula = p ~ x, data = data.sub)
## 
## Coefficients:
## (Intercept)            x  
##     0.02114     -0.46927

plot(data.sub$x,data.sub$p)
abline(fitparams,lty=2,col="red")

Note that the abline function plots a straight line directly using the fit parameters. One could also access the values of the fit parameters and use curve to draw the line. (See the next example.)

More complicated curves can be fit to data as follows. First, let’s create (or, perhaps, read in) some data:

xdat = sort(runif(100,0,10))
ydat = 17.4*exp(-xdat/2.687) + rnorm(100, 0, 0.216) + 1.25
plot(xdat,ydat, ylim=c(0,20))

Next, create a fit function with variable parameters:

fcnFt = function(x,A,TAU,B){
  A * exp(-x/TAU) + B
}

Fitting can be performed using the nls function (“non-linear least squares”). As with most fitting routines, it is important to attempt to get good starting parameters, usually by close examination of the data and making good guesses.

Cfit = nls(ydat ~ fcnFt(xdat,A,TAU,B), start=list(A= 20, TAU = 3, B=0))
Cfit

## Nonlinear regression model
##   model: ydat ~ fcnFt(xdat, A, TAU, B)
##    data: parent.frame()
##      A    TAU      B 
## 17.610  2.647  1.294 
##  residual sum-of-squares: 4.11
## 
## Number of iterations to convergence: 3 
## Achieved convergence tolerance: 6.723e-07

Plot the results together:

plot(xdat,ydat, ylim=c(0,20))
A   = coef(Cfit)[1]
TAU = coef(Cfit)[2]
B   = coef(Cfit)[3]
curve(fcnFt(x,A,TAU,B),add=TRUE,col="red")

Additional Packages

The R community has created a vast number of packages that extend the functionality of the base system. From Rstudio, for example, new packages can be downloaded to become available for use on your system by clicking on the “Packages” tab and clicking on “Install” – a pop-up window will open. In the pop-up window begin typing the name of the package of interest, and select the package from the list that appears.

For example, click “Install” as noted above, and begin typing “ggplot2” (a common and very useful graphics package). Once selected, click on “Install” and the package will be saved to your system.

Once the above steps are successfully performed, in R the newly installed package (ggplot2 in our example) can be used by typing the command:

library(ggplot2)

and then one can use the functions of the package. For example:

library(ggplot2)
x = runif(5000,0,1)
y = rnorm(5000,3,2)
pltdf = data.frame(cbind(x,y))
ggplot(pltdf) + geom_bin2d(aes(x,y))

Integration and Differentiation

Base R has built-in functions for performing numerical integration and for evaluating derivatives. Further manipulations can be performed using various packages that can be imported. We first examine the numerical integration and differentiation of single-variable functions. Next, we look at making estimates of integrals and derivatives given pairs of numerical data.

Numerical Integration of a Function

A = 3
B = 5
fcn1  = function(x){ 
  A + ( x^3 - 5*(x+B)^2 )/100 
}

To integrate the function fcn1 between two values of x:

fcnInt = integrate(fcn1,-5,10)
fcnInt

## 12.1875 with absolute error < 2.1e-13

Note that the printed solution is text with a value and an error estimate. We often want the numerical results to use in future calculations. To access them,

valInt = fcnInt$value
errInt = fcnInt$abs.error

valInt; errInt

## [1] 12.1875

## [1] 2.05242e-13

Differentiation of a Function

Base R can differentiate simple functions using the D function. We first define an expression of the function we wish to differentiate:

fcnExp = expression( A + ( x^3 - 5*(x+B)^2 )/100 )

The function D finds the derivative, here with respect to variable x:

fcnDiff = D(fcnExp,'x')
fcnDiff

## (3 * x^2 - 5 * (2 * (x + B)))/100

Note that the output is itself an expression. It can be further simplified using the simplify function (from the Deriv package):

library(Deriv)
Simplify(fcnDiff)

## (3 * x^2 - 10 * (B + x))/100

or, in one step:

Simplify( D( expression( A + ( x^3 - 5*(x+B)^2 )/100 ), 'x') )

## (3 * x^2 - 10 * (B + x))/100

Next, using our expression fcnDiff above, let’s create a numerical function that can actually be evaluated. To do so, we “evaluate” the expression and create a function based upon that evaluation:

dfcn1 = function(x){ 
  eval(fcnDiff) 
}

Now we can make plots of our original function and its derivative:

curve(fcn1, -10,15, ylab="fcn1(x) and dfcn1(x)")
curve(dfcn1, lty=2, col="red", add=TRUE)
abline(h=0,v=0)

Symbolic Integration of a Function

The base R package and the package Deriv do not have symbolic integration capabilities. Several R packages exist that can interface to an external computer algebra system (CAS). For instance, the package Ryacas interfaces to the Yacas CAS which runs on linux and on PCs.

Numerical Integration and Differentiation of Data

Often times one just has a set of data x and y and one is interested in estimates of \(\int_a^b y(x) dx\) or \(dy/dx(x)\), etc., based on those data. Base R has built-in functions for summing up values and for finding smoothed differences given vectors of data.

Let’s make up some data:

# here, 100 random numbers between -10 and 15, sorted low-to-high
xdat = sort(runif(100,-10,15))     
# take fcn1(x) of these data, but add random noise to the answer
ydat = fcn1(xdat)+rnorm(100,0,0.1) 
plot(xdat,ydat)

Let’s take these data and use approxfun to create a function that interpolates our ydat data and can then be used in the integration/differentiation process:

yApprox = approxfun(xdat,ydat)
curve(yApprox(x),-10,15)

Notice that yApprox is an R function that can be called and thus can be integrated.

Integration

Next, we can set up a sequence of x-values equally spaced by dx. Let’s avoid the end points, so as to not worry about convergence in these regions for our purposes. (This can be dealt with later, if necessary.) We then use the approximate function to give y-values and sum them up – multiplied by dx – to get an estimate of our integral:

dx = 0.1
xtst  = seq(-9,14,by=dx)
ytst  = yApprox(xtst)
yint  = sum(ytst, na.rm=TRUE)*dx
yint

## [1] 33.37528

Note that our original data set has 100 points between -10 and 15, or an average \(dx\) = 0.25. So, interpolation using much smaller values of dx won’t be very meaningful.

Let’s compare this result with the integration of our original function:

yINT  = integrate(fcn1,-9,14)
yINT

## 33.25417 with absolute error < 5.6e-13

print( paste("differs by", round( (yint-yINT$value)/yINT$value*100,3), "%"))

## [1] "differs by 0.364 %"

Differentiation

For differentation we can use the diff function in base R:

dydx = diff(ytst)/dx
plot( xtst[-1] - dx/2,  dydx)
curve(dfcn1(x),add=TRUE,col="red",lwd=2)

In the above, we are taking ytst[i] - ytst[i-1] and dividing by dx to estimate a derivative. As can be reasoned, if ytst has \(N\) elements, then the result for dydx will have \(N-1\) elements. Hence, we take xtst[-1] to delete the first element of xtst and make the two vectors the same length so we can plot them. We thus need to subtract dx/2 from the xtst values to re-center. We also see that the derivative function computed this way is very noisy.

We can make the result more smooth by using the lag parameter of the diff function. This iterates ytst[i] - ytst[i-lag] from lag = 1 up to its given value (lg in the example below), and tends to smooth the result. The derivative and its “centering” must then be dealt with appropriately as the number of points in the vector dydx below will be less by an amount lg; the derivative approximation will only be valid in the more central region of the data. As a rule of thumb, try using a value of lag \(\approx 2/dx\).

lg = 20    # here, a "lag" value on the scale of 2/dx 
dydx = diff(ytst,lag=lg)/dx/lg
plot( xtst[-c(1:lg)] - dx*lg/2,    dydx)
curve(dfcn1(x),add=TRUE,col="red",lwd=2)

or, altogether:

plot(xdat, ydat, ylab="ydat and dydx", main="Data and their Derivative")
points( xtst[-c(1:lg)] - dx*lg/2, dydx, pch="+", col="red")
abline(h=0,v=0)

And Much, Much More

text manipulations; complex plotting control; optimizations and artificial intellegence; complex statistical analyses and modeling; markdown documents, web sites, blogs, exams, …

Visit: https://www.r-project.org/ and https://www.rstudio.com for more information.

Getting to Know `R`

A short tutorial

Mike Syphers

last update: 10 Mar 2020