R

Installing R

  1. Navigate to https://cran.r-project.org/mirrors.html.
  2. Choose a location near you under USA and click the hyperlink.
  3. On this page, there are downloads available for each system. Click the link for your system.
  4. Click base
  5. Click the Download link at the top, choose Save File, and proceed through the download and install.

GUI for R

We recommend installing a GUI for R, such as RStudio or Tinn-R. This will provide a nice interface for working with R.

ANOVA

One-Way

  1. Create vectors of the data that you want to run an ANOVA on. In this case, the response variable (y) is the selling price (in thousands of dollars) and x identifies the salesperson, each of whom sold 4 robots at the listed prices. Note that in this example it is necessary to use strings for x instead of numbers. The number assigned to each salesperson is an arbitrary label: the problem would be the same if we named the salespeople "Rebecca", "Rachel", and "Raymond" instead of "1", "2", and "3". R must be told explicitly to read these labels as strings (categories) and not as numbers; otherwise it would treat x as a quantitative variable.

    > SellingPrices <- c(10,14,13,12,11,16,14,15,11,13,12,15)
    > Salesperson <- c("1","1","1","1","2","2","2","2","3","3","3","3")

  2. Next, make a data frame for the variables, which enables R to read them as one set of data instead of two independent columns. Setting stringsAsFactors = TRUE stores Salesperson as a categorical variable (a factor); this was the default behavior before R 4.0.

    > values <- data.frame(SellingPrices, Salesperson, stringsAsFactors = TRUE)
    > summary(values)

    ##  SellingPrices   Salesperson
    ##  Min.   :10.00   1:4
    ##  1st Qu.:11.75   2:4
    ##  Median :13.00   3:4
    ##  Mean   :13.00
    ##  3rd Qu.:14.25
    ##  Max.   :16.00

  3. Use aov(y ~ x, data = your data frame) to run a one-way ANOVA. Then use the summary() function to display the ANOVA table.

    > SellingPrice.aov <- aov(SellingPrices ~ Salesperson, data = values)

    > summary(SellingPrice.aov)

    ##             Df Sum Sq Mean Sq F value Pr(>F)
    ## Salesperson  2    6.5    3.25   0.929   0.43
    ## Residuals    9   31.5    3.50
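
The F value and P-value in the table above can also be pulled out of the summary for use in later calculations. The following is a minimal sketch that assumes the same data as the example; the names fit, anova_table, f_value, p_value, and p_check are illustrative and not part of the original example.

```r
SellingPrices <- c(10, 14, 13, 12, 11, 16, 14, 15, 11, 13, 12, 15)
Salesperson <- factor(c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3"))

fit <- aov(SellingPrices ~ Salesperson)

# summary() of an aov fit is a list; its first element is the ANOVA table
anova_table <- summary(fit)[[1]]

# The first row belongs to Salesperson, the second to Residuals
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# The upper-tail F probability with 2 and 9 degrees of freedom gives the same P-value
p_check <- pf(f_value, df1 = 2, df2 = 9, lower.tail = FALSE)
```

Printing f_value and p_value should show the same numbers as the summary table, to more decimal places.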

Read more about making your own ANOVAs here.

Binomial Distribution

Binomial Probability (pdf)

  1. Into the console type dbinom(successes, number of trials, probability of success). The result is shown in row [1].

    > dbinom(0, 4, 1/6)
    [1] 0.4822530864

Binomial Probability Distribution

  1. Into the console type dbinom(range of successes, number of trials, probability of success), giving the successes as a range such as 0:4. The results are shown in row [1].

    > dbinom(0:4, 4, 1/6)
    [1] 0.4822530864 0.3858024691 0.1157407407 0.0154320988 0.0007716049

Note: Here we have entered 0:4 for successes to calculate the entire distribution at once. The first probability (0.4822530864) in the output corresponds to 0 successes, the second (0.3858024691) corresponds to 1 success, and so on. Individual probabilities can be found by instead entering a single number here, such as "dbinom(0, 4, 1/6)".

Binomial Probability (cdf)

  1. Into the console type pbinom(successes, number of trials, probability of success). The result is shown in row [1].

    > pbinom(11,20,0.4)
    [1] 0.9434736
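
The cdf is the running total of the individual binomial probabilities, which the following sketch checks for the example above. The variable names are illustrative only.

```r
n <- 20   # number of trials
p <- 0.4  # probability of success

# P(X <= 11) from the cdf
cdf_value <- pbinom(11, n, p)

# The same probability as the sum of the individual probabilities for 0 through 11
sum_value <- sum(dbinom(0:11, n, p))

# P(X >= 12) is the complement, also available with lower.tail = FALSE
upper_value <- pbinom(11, n, p, lower.tail = FALSE)
```

Here cdf_value and sum_value agree, and cdf_value plus upper_value equals 1.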

Chi-Square Distribution

Critical Value

  1. Into the console type "qchisq(1-a, degrees of freedom)". The chi-square critical value corresponding to probability a in the right tail is returned. The result is shown in row [1].

    > qchisq(.99,13)
    [1] 27.68825
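
As a quick check on the critical value, the area to its right should equal a. The sketch below assumes a = 0.01 and 13 degrees of freedom, matching the example; the variable names are illustrative.

```r
alpha <- 0.01
dof <- 13

# Critical value with area alpha in the right tail
crit <- qchisq(1 - alpha, dof)

# Equivalent call that states the right-tail area directly
crit_upper <- qchisq(alpha, dof, lower.tail = FALSE)

# Area to the right of the critical value; this recovers alpha
tail_area <- pchisq(crit, dof, lower.tail = FALSE)
```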

Left Tailed Probability (cdf)

To find the left-tailed probability (cdf) for a χ² test statistic, that is, the area to the left of x, use pchisq(x, degrees of freedom, lower.tail = TRUE).

> pchisq(120,2,lower.tail = TRUE)
## [1] 1

Read more about chi-square distribution probability distributions here.

Right Tailed Probability (cdf)

To find the right-tailed probability for a χ² test statistic, which is the p-value for a right-tailed test, use pchisq(x, degrees of freedom, lower.tail = FALSE).

> pchisq(120,2,lower.tail = FALSE)
## [1] 8.756511e-27

Read more about chi-square distribution probability distributions here.

Confidence Intervals

Proportion

  1. To make a confidence interval for a proportion, use the binom.test(x, n) function, where x is the number of cases (successes) and n is the total sample size. Example: You take a sample of 10 people, and 5 of them are female. The confidence level defaults to 95%.

  2. You can change the confidence level using the parameter conf.level.

    > binom.test(5,10)
    Exact binomial test
    data: 5 and 10
    number of successes = 5, number of trials = 10, p-value = 1
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval:
    0.187086 0.812914
    sample estimates:
    probability of success
    0.5
    > binom.test(5,10,conf.level=.90)
    Exact binomial test
    data: 5 and 10
    number of successes = 5, number of trials = 10, p-value = 1
    alternative hypothesis: true probability of success is not equal to 0.5
    90 percent confidence interval:
    0.2224411 0.7775589
    sample estimates:
    probability of success
    0.5
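
If only the interval is needed, it can be read from the object that binom.test() returns instead of from the printed report. A minimal sketch, with illustrative variable names:

```r
result <- binom.test(5, 10, conf.level = 0.95)

# conf.int holds the lower and upper limits of the interval
ci <- result$conf.int
lower <- ci[1]
upper <- ci[2]

# estimate holds the sample proportion
p_hat <- unname(result$estimate)
```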

t-Interval

  1. To make a t-interval you will need your data saved in an array.

  2. You can perform the t-interval calculation using the function t.test(). The function will automatically calculate the necessary sample statistics. Confidence level defaults to 95%.

  3. You can change the confidence level using the parameter conf.level.

    > age=c(24,25,27,33,35,37)
    > t.test(age)
    One Sample t-test
    data: age
    t = 13.365, df = 5, p-value = 4.195e-05
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
    24.36464 35.96870
    sample estimates:
    mean of x
    30.16667
    > t.test(age,conf.level=.90)
    One Sample t-test
    data: age
    t = 13.365, df = 5, p-value = 4.195e-05
    alternative hypothesis: true mean is not equal to 0
    90 percent confidence interval:
    25.61853 34.71481
    sample estimates:
    mean of x
    30.16667
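
To see what t.test() calculates, the same 95% interval can be built by hand from the sample mean, the sample standard deviation, and a t critical value with n - 1 degrees of freedom. This is a sketch for comparison; the variable names are illustrative.

```r
age <- c(24, 25, 27, 33, 35, 37)
n <- length(age)

# Critical value with 2.5% in each tail and n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Margin of error: critical value times the standard error of the mean
moe <- t_crit * sd(age) / sqrt(n)
by_hand <- c(mean(age) - moe, mean(age) + moe)

# The interval reported by t.test() for comparison
from_t_test <- as.numeric(t.test(age)$conf.int)
```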

z-Interval

To make a z-interval you will use R to make the calculation by hand.

  1. To make a z-interval you will need your data saved in an array.

  2. Save the following parameters in a variable.

    1. Sample mean

    2. Sample standard deviation

    3. Sample size

  3. Now determine the critical value using the qnorm() function. Decide on your confidence level; the typical options are 90%, 95%, or 99%. Subtract that percentage from 100%, cut it in half, and convert it to a decimal to use in the qnorm() function. For example, for a 95% confidence interval, take half of 5%, or 2.5% (.025). Note that qnorm(.025) returns the negative (lower-tail) critical value, -1.959964; the margin of error uses its absolute value.

  4. Calculate the Margin of Error by multiplying the critical value times standard deviation, then dividing by square root of sample size.

  5. Calculate the confidence interval by adding and subtracting the margin of error from the sample mean.

    > age=c(24,25,27,33,35,37)
    > xbar=mean(age)
    > s=sd(age)
    > n=length(age)
    > z=qnorm(.025,lower.tail=FALSE)
    > z
    [1] 1.959964
    > MOE=z*s/sqrt(n)
    > xbar-MOE
    [1] 25.74286
    > xbar+MOE
    [1] 34.59048
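
The steps above can be collected into a small helper so that the confidence level is easy to change. This is a sketch only: z_interval is a hypothetical function name, not a built-in R function, and it uses the sample standard deviation as the steps above do.

```r
# Hypothetical helper: z-interval for the mean of a numeric vector x
z_interval <- function(x, conf.level = 0.95) {
  alpha <- 1 - conf.level
  z <- qnorm(1 - alpha / 2)            # positive critical value
  moe <- z * sd(x) / sqrt(length(x))   # margin of error
  c(lower = mean(x) - moe, upper = mean(x) + moe)
}

age <- c(24, 25, 27, 33, 35, 37)
ci_95 <- z_interval(age)
ci_90 <- z_interval(age, conf.level = 0.90)
```

Using the upper-tail critical value keeps the margin of error positive, so the lower limit is always listed first.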

Two Sample t-Interval (Independent Samples)

  1. To make a Two Sample t-interval you will need each sample's data saved in a separate array.

  2. You can perform the t-interval calculation using the function t.test(). The function will automatically calculate the necessary sample statistics. Confidence level defaults to 95%.

  3. You can change the confidence level using the parameter conf.level.

    > agemen=c(24,25,27,33,35,37)
    > agewomen=c(24,34,22,18,33,25)
    > t.test(agemen,agewomen)
    Welch Two Sample t-test
    data: agemen and agewomen
    t = 1.2184, df = 9.837, p-value = 0.2515
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
    -3.470084 11.803418
    sample estimates:
    mean of x mean of y
    30.16667 26.00000
    > t.test(agemen,agewomen,conf.level=.90)
    Welch Two Sample t-test
    data: agemen and agewomen
    t = 1.2184, df = 9.837, p-value = 0.2515
    alternative hypothesis: true difference in means is not equal to 0
    90 percent confidence interval:
    -2.041873 10.375206
    sample estimates:
    mean of x mean of y
    30.16667 26.00000

Two Sample z-Interval

To make a two sample z-interval you will use R to make the calculation by hand.

  1. To make a two sample z-interval you will need each sample's data saved in a separate array. Save the following parameters in variables:

    1. Sample mean 1 and 2

    2. Sample variance 1 and 2

    3. Sample size 1 and 2

  2. Now determine the critical value using the qnorm() function. Decide on your confidence level; the typical options are 90%, 95%, or 99%. Subtract that percentage from 100%, cut it in half, and convert it to a decimal to use in the qnorm() function. For example, for a 95% confidence interval, take half of 5%, or 2.5% (.025). Note that qnorm(.025) returns the negative (lower-tail) critical value, -1.959964; the margin of error uses its absolute value.

  3. Calculate the Margin of Error by multiplying the critical value times the square root of the sum of each variance divided by its sample size.

  4. Calculate the sample difference by subtracting the two sample means.

  5. Calculate the confidence interval by adding and subtracting the margin of error from the sample difference.

    > agemen=c(24,25,27,33,35,37)
    > agewomen=c(24,34,22,18,33,25)
    > meanM=mean(agemen)
    > meanW=mean(agewomen)
    > varM=var(agemen)
    > varW=var(agewomen)
    > nM=length(agemen)
    > nW=length(agewomen)
    > z=qnorm(.025,lower.tail=FALSE)
    > z
    [1] 1.959964
    > MOE=z*sqrt(varM/nM+varW/nW)
    > meanDiff=meanM-meanW
    > meanDiff-MOE
    [1] -2.53585
    > meanDiff+MOE
    [1] 10.86918

Counting

Combination

The number of combinations can be found by using combn. Input the number of objects first (36) followed by the number of objects taken at a time (5). combn returns a matrix with one combination per column, so the ncol command counts the number of combinations in this case.

> ncol(combn(36,5))
## [1] 376992
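
When only the count is needed, base R's choose() function returns it directly without building every combination first. A short sketch comparing the two approaches; the variable names are illustrative.

```r
# Count of ways to choose 5 objects from 36, computed directly
n_combinations <- choose(36, 5)

# The same count obtained by listing every combination and counting the columns
n_from_combn <- ncol(combn(36, 5))
```

For large inputs choose() is much faster, because combn has to store every combination in memory.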

Read more about programming your combinations here.

Factorial

Use the factorial() function to find the factorial.

> factorial(5)
## [1] 120

Read more about how to use the factorial function here.

Permutation

Since there is no simple command for a permutation like there is for combinations, it is easiest to calculate a permutation by using what we know about combinations. The number of permutations is simply the number of combinations multiplied by k!, the factorial of the number selected at a time. Without the number of columns function (ncol), combn would return a matrix listing every combination instead of a count.

> ncol(combn(7,3))*factorial(3)
## [1] 210
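
The same count also follows from the formula n!/(n - k)!. The sketch below wraps it in a helper; n_permutations is a hypothetical name, not a built-in R function.

```r
# Hypothetical helper: number of ways to arrange k objects chosen from n
n_permutations <- function(n, k) {
  factorial(n) / factorial(n - k)
}

perm_7_3 <- n_permutations(7, 3)

# Cross-check using the combinations approach: nCk times k!
perm_check <- choose(7, 3) * factorial(3)
```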

Data Manipulation

Sorting

To sort data by ascending or descending order, use the sort() function. In this example, we will sort the ages of 25 employees at a clothing department store (Example 3.4.2).

  1. Create a list of all the ages to be included. This list (or vector) is called "Ages" here. Alternatively, you could extract a column or row of data from an Excel file to sort it.

    > Ages <-c(32,21,24,19,61,18,18,16,16,35,39,17,22,21,60,18,53,18,57,63,28,20,29,35,45)

  2. Use sort(x, decreasing=FALSE) to sort in ascending order (smallest to largest). Set decreasing=TRUE for descending order (largest to smallest).

    > sort(Ages, decreasing=FALSE)
    ## [1] 16 16 17 18 18 18 18 19 20 21 21 22 24 28 29 32 35 35 39 45 53 57 60
    ## [24] 61 63
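
sort() works on a single vector. To sort the rows of a data frame by one of its columns, a common approach is order(), which returns the positions that would put the values in order. The sketch below assumes a made-up employees data frame with an id column added purely for illustration.

```r
Ages <- c(32, 21, 24, 19, 61, 18, 18, 16, 16, 35, 39, 17, 22, 21, 60, 18, 53,
          18, 57, 63, 28, 20, 29, 35, 45)

# order() gives the index positions; indexing with them reproduces sort()
age_order <- order(Ages)
sorted_ages <- Ages[age_order]

# Hypothetical data frame: one row per employee
employees <- data.frame(id = seq_along(Ages), age = Ages)

# Reorder whole rows by age, ascending and then descending
employees_by_age <- employees[order(employees$age), ]
oldest_first <- employees[order(employees$age, decreasing = TRUE), ]
```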

Read more about sorting your data here.

Descriptive Statistics

One Variable

There are a variety of functions and packages that can be used to calculate descriptive statistics. In addition to the base functions in R, packages such as "mosaic" can be used to calculate descriptive statistics.

> list <- c(4,10,7,15)

One of the easiest ways to see the mean, median, maximum, and minimum of a data set is to use the summary() function. Note that there is no simple function to find the mode of a data set.

> summary(list)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    4.00    6.25    8.50    9.00   11.25   15.00
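
Although base R has no dedicated function for the statistical mode, one can be written with table(), which counts how often each value occurs. A minimal sketch, assuming numeric data; sample_mode is a hypothetical helper name.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric vector
sample_mode <- function(x) {
  counts <- table(x)                             # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]  # value(s) with the highest count
  as.numeric(modes)
}

# 10 appears twice, every other value once
repeated_mode <- sample_mode(c(4, 10, 7, 15, 10))

# With no repeated values, every value is returned
no_repeat_mode <- sample_mode(c(4, 10, 7, 15))
```

If several values tie for the highest count, all of them are returned, so a data set with no repeats returns every value.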

The standard deviation, standard error, variance, range, and sum can also be calculated easily.

> se <- sd(list)/sqrt(length(list)) #Standard error calculation which is the standard deviation divided by the square root of the list's length
> se
## [1] 2.345208
> sd(list) #Standard deviation function
## [1] 4.690416
> var(list) #Variance function
## [1] 22
> range.value <- (max(list)-min(list)) #Calculation for range which is the maximum value minus the minimum value
> range.value
## [1] 11
> sum(list) #Sum of the values
## [1] 36

Alternatively, you can install the "mosaic" package in R to use the favstats function. This will show you the minimum, maximum, mean, median, standard deviation, and count, among other basic descriptors.

> install.packages("mosaic")
> library("mosaic")
> favstats(list)

##  min   Q1 median    Q3 max mean       sd n missing
##    4 6.25    8.5 11.25  15    9 4.690416 4       0

F-Distribution

F-Probability (cdf)

Once you find the test statistic F and the degrees of freedom, then you can plug your values into the function to find the P-value.

F-statistic: 0.9286

Degrees of freedom: 2,9

P(F>0.9286): lower.tail=FALSE

> pf(0.9286,2,9,lower.tail=FALSE)
## [1] 0.4298936

Read more about F-Probability (cdf) here.

Graphs

Bar Charts

  1. Create vectors (lines of your data) for x and y. Here, x is the Lack of Parental Involvement and y is the Percentage of Frequency Distribution for the responses.

    > Lack_of_Parental_Involvement <- c("Very serious", "Somewhat serious", "Not very serious", "Not a problem", "Not sure")
    > Percents <- c(56,27,9,6,3)

  2. Plot the bar chart using barplot(). Percents gives the heights of the bars, and names.arg labels each bar with the responses for Lack of Parental Involvement. xlab and ylab label the axes, main sets the title, and ylim sets the y-axis to show response percentages from 0-60%.

    > barplot(Percents, names.arg=Lack_of_Parental_Involvement,main="Lack of Parental Involvement Louis Harris Poll", xlab="Response Categories", ylab="Percentage of Frequency Distribution", ylim=c(0,60))

Bar Chart

Read more about bar charts here.

Box Plots

  1. Write your data as lists of numbers (vectors). For this example, the five lowest and five highest season win totals from the years 1967-2010 were included for each of the Braves, Cubs, Dodgers, and Yankees baseball teams.

    > Braves <- c(50,54,61,63,65,106,104,103,101,101)
    > Cubs <- c(38,49,61,64,65,103,97,97,96,93)
    > Dodgers <- c(58,63,63,71,73,102,98,95,95,95)
    > Yankees <- c(59,67,70,71,72,114,103,103,103,101)

  2. For a boxplot with multiple columns, it is necessary to create a data frame which puts each data list as a column with 10 rows.

    > BaseballTeams <- data.frame(Braves,Cubs,Dodgers,Yankees)
    > BaseballTeams

    ##    Braves Cubs Dodgers Yankees
    ## 1      50   38      58      59
    ## 2      54   49      63      67
    ## 3      61   61      63      70
    ## 4      63   64      71      71
    ## 5      65   65      73      72
    ## 6     106  103     102     114
    ## 7     104   97      98     103
    ## 8     103   97      95     103
    ## 9     101   96      95     103
    ## 10    101   93      95     101

  3. Plot the boxplot using the data frame and appropriate labels.

    > boxplot(BaseballTeams,ylim=c(35,120),xlab= "Baseball Team",ylab="Wins per Season",main="Box Plot of the Number of Franchise Wins per Season 1967-2010",col=c("blue","yellow","red","green"),cex.main=.8)

    Boxplot Graph

Read more about making your own boxplots here.

Choropleth Map (County)

  1. Open RStudio.

  2. Create a new project by clicking "File" at the top of the window, and then selecting "New Project".

  3. In the pop-up window, click "New Directory" and then "Empty Project". Name it "Choropleth Map", and save it wherever you prefer.

  4. Upon creating your project, a section of your screen will show up with some text on it. Below the text there will be a ">" symbol and the cursor should show up to the right of the symbol. This is called the R console, and this is where we will write our statements.

  5. First, we will install all the necessary packages needed to create a choropleth map. Enter the following statements in the R console, pressing Enter after each statement.

    > install.packages("choroplethr")

    > install.packages("choroplethrMaps")

    > library(choroplethr)

    > library(choroplethrMaps)

    Note: the install.packages() statements only need to be run once and the specified package will be installed permanently. However, installing a package is different from loading a package into R. We need to load packages into R using the library() function every time we open a new R session.

  6. Now that we have all of the necessary tools installed and loaded, it is time to load in our data. For county level data, the choropleth map package in R requires data to be in comma separated value format (.csv) and organized in the following way:

          A        B
    1     region   value
    2     1001     8437
    3     1003     39710
    4     1005     2354
    5     1007     1664
    6     1009     5080
    7     1011     1031
    8     1013     2032
    9     1015     13818
    10    1017     2759

    The first column must have a header titled "region" containing the geographic indicator (FIPS/county codes), and the second column must have a header titled "value" that contains the value of the variable of interest associated with the geographic indicator. Make sure that there is no extra formatting in either column (no commas, symbols, text, etc.) Save the correctly formatted .csv file in the same location that you created your new R project directory. You should see the file name appear in the "Files" pane.

    For this example, we will use the US County Data found on the web resource.

  7. Once the .csv is saved in the correct location, we can load it into R via the R console. Navigate to the console, type in the following statement, and press Enter.

    > mapData <- read.csv("county_data.csv")

    Note: The "<-" symbol denotes assignment. We are assigning the data from the .csv file to a variable named "mapData" so that we can easily access it for future use.

  8. Once loaded, you should see an item in the Data panel with the name "mapData". By clicking on the item in the data panel, we can view the data that was loaded into R. The data should only have 2 variables (region, value). Now, we will create the choropleth map using a single R statement. Type the following into the console and press Enter.

    > county_choropleth(mapData)

    Depending on which FIPS codes are/aren't included in your dataset, you may get a warning message, but if there are no actual errors, the graph should still be generated like the following.

  9. After a couple seconds, R will generate a map of the United States with the county regions shaded depending on the associated value. You can specify a title for the maps, and a title for the legend by typing the statement with some additional input or parameters.

    > county_choropleth(mapData, title = "Number of Residents with a Bachelor's Degree or Higher, 2011-2015", legend = "Number of Residents with Degree")

    The previous command will generate the following plot.

    The plot can be saved as an image by clicking the "Export" button above the graph and selecting "Save as Image…"
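
The two-column layout described in step 6 can also be produced and checked from within R before mapping. The sketch below uses only base R and the first five rows of the example table; map_rows and csv_path are illustrative names, and a temporary file stands in for a file in your project directory.

```r
# Build a data frame with exactly the two required headers
map_rows <- data.frame(
  region = c(1001, 1003, 1005, 1007, 1009),  # county FIPS codes
  value  = c(8437, 39710, 2354, 1664, 5080)  # variable of interest
)

# Write without row names so the file contains only the two columns
csv_path <- tempfile(fileext = ".csv")
write.csv(map_rows, csv_path, row.names = FALSE)

# Read it back and confirm the layout before passing it to the map function
mapData <- read.csv(csv_path)
headers_ok <- identical(names(mapData), c("region", "value"))
numeric_ok <- is.numeric(mapData$region) && is.numeric(mapData$value)
```

If headers_ok or numeric_ok is FALSE, the file likely has extra formatting, such as commas or text, that needs to be removed.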

Histogram

  1. Create a list of values (a vector) for your histogram. In this case, we are using the heart rate of 50 students.

    > HeartRate <-
    c(77,84,79,90,67,84,82,74,88,75,69,81,94,68,65,86,78,79,79,70,83,83,84,82,93,80,81,80,87,80,62,98,77,83,82,80,82,73,85,77,77,79,81,70,72,85,84,80,74,83)

  2. Create a histogram with breaks using the number at the beginning of the interval (56.5 is an example here). This can be accomplished by using hist(). To create a histogram where the frequency of values is on the y-axis, make sure that the interval breaks are equally spaced.

    > hist(HeartRate,breaks=c(56.5,66.5,76.5,86.5,96.5,106.5),main="Histogram of Heart Rates (per min.) of 50 Students", xlab="Heart Rates (per min.)")

    Histogram
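
To see the class frequencies behind the bars without drawing anything, hist() can be called with plot = FALSE. A minimal sketch using the same data and breaks; the name bins is illustrative.

```r
HeartRate <- c(77, 84, 79, 90, 67, 84, 82, 74, 88, 75, 69, 81, 94, 68, 65, 86,
               78, 79, 79, 70, 83, 83, 84, 82, 93, 80, 81, 80, 87, 80, 62, 98,
               77, 83, 82, 80, 82, 73, 85, 77, 77, 79, 81, 70, 72, 85, 84, 80,
               74, 83)

# Equally spaced breaks from 56.5 to 106.5 in steps of 10
class_breaks <- seq(56.5, 106.5, by = 10)

# plot = FALSE returns the histogram object instead of drawing it
bins <- hist(HeartRate, breaks = class_breaks, plot = FALSE)

# One count per class; together they account for all 50 students
class_counts <- bins$counts
```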

Read more about making your own histograms here.

Normal Probability Plot

  1. Define your data variable by loading a datafile or entering a set of single variable data as a vector (in this case the data set was small so we defined it by hand).

  2. Then, use qqnorm and qqline to create the plot and draw the trendline.

    > exampleData<-c(20, 32, 14, 23, 27, 23, 29, 24, 23, 19)

    > qqnorm(exampleData, datax=TRUE)

    > qqline(exampleData, datax=TRUE)

    Note: If you left off ", datax=TRUE" the plot would be drawn with the sample quantiles on the y-axis instead. Our materials typically show the data on the x-axis so we have adjusted this argument, but regardless of which axis has the data, you are looking for the points to follow a line.

Scatterplot

  1. For this example, we are using the High School Completion and Crime Rate data from Hawkes Stat. Delete the title (High School Completion and Crime Rate 2014) of the dataset from the top while saving it to your computer.

  2. Upload the data set to R. To do this, type in the name you want to save the data set as (in this case school_and_crime). To read the data into R, use read.csv(file="",header=TRUE,sep=","). Inside the "" you should put the path name to the file. You should then be able to view your data set in the Global Environment. More information about doing this can be found here.

    > school_and_crime <- read.csv(file="", header=TRUE, sep=",")

  3. Next you can plot your data. The $ are used to call a particular column of data from your file. In this case, the Crime Rate Data column is read as Crime.Rate..per.100.000 by R and the High School Completion column is read by R as High.School.Completion. Finally, label your axes and title.

    > plot(school_and_crime$Crime.Rate..per.100.000,school_and_crime$High.School.Completion, xlab="Crime Rate (per 100,000)", ylab="Completion Rate",main="High School Completion Rate and Crime Rate",ylim=c(65,95))

    Scatterplot

Read more about programming scatterplots here.

Hypothesis Testing

z-Test

  1. Unless you choose to install a package in R, you will have to create your own z-test. There are a number of ways to accomplish this, but one way is to make a function that calculates the z-score and a separate command to calculate the P-value. In this case we are using the parameters x̄ (x.bar), μ0 (mu), σ (sd), and sample size (n) for our z-test. To calculate the z-score, we use the equation:

    $z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

    > z.score = function(x.bar, mu, sd, n){
    (x.bar-mu)/(sd/sqrt(n))}

  2. Now plug in the values for the z-score function. Saving it as a new output will be useful for calculating the P-value and other test statistics.

    >z_output <- z.score(16200,16000,2500,1000)
    >z_output
    [1] 2.529822

  3. Calculate the P-value for P( z ≥ 2.53) = P( z ≤ -2.53). As always, be careful to correctly evaluate the P-value depending on if you want the upper, lower, or two-tailed probability. Then evaluate if the p-value convinces you to reject or fail to reject the null hypothesis.

    >p=pnorm(-z_output)
    >p

    ## [1] 0.005706018

    alpha = 0.01
    if (alpha > p) {
    print("Reject null hypothesis")
    } else {
    print ("Fail to reject the null hypothesis")
    }
    [1] "Reject null hypothesis"
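
The z-score and P-value steps can be combined into one function that also handles the direction of the test. This is a sketch under the same assumptions as the example (known σ, summary statistics only); z_test and its alternative argument are hypothetical names chosen to resemble t.test(), not part of base R.

```r
# Hypothetical helper: z-test from summary statistics
z_test <- function(x.bar, mu, sd, n,
                   alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  z <- (x.bar - mu) / (sd / sqrt(n))
  p <- switch(alternative,
    less      = pnorm(z),                       # lower-tail area
    greater   = pnorm(z, lower.tail = FALSE),   # upper-tail area
    two.sided = 2 * pnorm(-abs(z))              # both tails
  )
  list(z = z, p.value = p)
}

# Same values as the example, tested in the upper tail
result <- z_test(16200, 16000, 2500, 1000, alternative = "greater")
```

With alternative = "greater" the function reproduces the upper-tail P-value from step 3; the two-sided P-value would be twice as large.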

Read more about z-tests here.

t-Test

  1. There is a t.test option in R, but without a vector or list of data, it is necessary to create your own function. There are several ways to accomplish this, but one way is to make a function that calculates the t-score and a separate command to calculate the P-value. In this case we are using the parameters x̄ (x.bar), μ0 (mu), s (s), and sample size (n) for our t-test. To calculate the t-score, we use the equation:

    $t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$

    > t.score = function(x.bar, mu, s, n){
    (x.bar-mu)/(s/sqrt(n))}

  2. Now plug in the values for the t-score function. Saving it as a new output will be useful for calculating the p-value and other test statistics.

    >t_output <- t.score(29,35,8,20)
    >t_output
    [1] -3.354102

  3. Calculate the P-value. As always, be careful to correctly evaluate the P-value depending on if you want the upper, lower, or two-tailed probability. Then evaluate if the P-value convinces you to reject or fail to reject the null hypothesis.

    >alpha = 0.01

    >p=2*pt(-abs(t_output), df=19)
    >p

    ## [1] 0.003332838

    if (alpha > p) {
    print("Reject null hypothesis")
    } else {
    print ("Fail to reject the null hypothesis")
    }
    [1] "Reject null hypothesis"
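
The same decision can be reached with the critical value approach: compare the size of the t-score with the critical value for the chosen alpha. A short sketch using the numbers from the example; the variable names are illustrative.

```r
# t-score from the summary statistics in the example
t_output <- (29 - 35) / (8 / sqrt(20))

alpha <- 0.01
dof <- 20 - 1

# Two-tailed critical value: alpha / 2 in each tail
t_crit <- qt(1 - alpha / 2, df = dof)

# Reject when the t-score falls beyond the critical value in either tail
reject_by_critical_value <- abs(t_output) > t_crit

# The P-value approach gives the same decision
p <- 2 * pt(-abs(t_output), df = dof)
reject_by_p_value <- p < alpha
```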

Read more about t-tests here.

Normal Distribution

Normal Probability (cdf)

  1. Use the function pnorm(z/x, mean = mu, sd = standard deviation, lower.tail = TRUE)

    Z/x: provide the z score or x value
    mu: if left off assumed to be 0
    standard deviation: if left off assumed to be 1
    lower.tail: TRUE if left off. Include lower.tail = FALSE if you need the probability of observing a value above the x or z you provided.

Examples

  1. P(z > 1.37)

    > pnorm(1.37, lower.tail = FALSE)
    [1] 0.08534345

  2. P(z < 1.37)

    > pnorm(1.37)
    [1] 0.9146565

  3. P(X < 50) with mean 25 and standard deviation 10

    > pnorm(50,mean=25,sd=10)
    [1] 0.9937903
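
A probability between two values is the difference of two cdf values. The sketch below finds P(15 < X < 50) for the same mean and standard deviation as example 3; the lower bound of 15 is chosen here only for illustration.

```r
# P(15 < X < 50) with mean 25 and standard deviation 10
p_between <- pnorm(50, mean = 25, sd = 10) - pnorm(15, mean = 25, sd = 10)

# The same probability after converting both bounds to z-scores
z_lower <- (15 - 25) / 10
z_upper <- (50 - 25) / 10
p_between_z <- pnorm(z_upper) - pnorm(z_lower)
```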

Poisson Distribution

Poisson Probability (cdf)

  1. Enter ppois(x, lambda=mean). The probability is shown in output row [1].

    > ppois(1,lambda=0.5)
    [1] 0.909796

Poisson Probability (pdf)

  1. Enter dpois(x, lambda=mean). The probability is shown in output row [1].

    > dpois(0, lambda = .5)
    [1] 0.6065307

Regression

Confidence Intervals for Slope and y-Intercept

To find the confidence interval for the slope and y-intercept of a linear regression, run your regression using the lm() function, then use the confint() function. Inside of this you give the model, and the confidence level desired.

> Y = c(12, 11, 12, 12, 13, 16, 13, 18, 11, 14)
> X = c(50, 51, 62, 45, 63, 76, 53, 68, 51, 74)
> model = lm(Y~X)
> confint(model,level=0.95)

                  2.5 %     97.5 %
(Intercept) -2.74485180 10.9761507
X            0.03920706  0.2671791
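
For reference, confint() is doing the usual estimate ± critical value × standard error calculation. The sketch below reproduces it by hand from the coefficient table; the variable names other than Y, X, and model are illustrative.

```r
Y <- c(12, 11, 12, 12, 13, 16, 13, 18, 11, 14)
X <- c(50, 51, 62, 45, 63, 76, 53, 68, 51, 74)
model <- lm(Y ~ X)

# Estimates and standard errors from the coefficient table
coefs <- summary(model)$coefficients
est <- coefs[, "Estimate"]
se <- coefs[, "Std. Error"]

# t critical value with the residual degrees of freedom (n - 2 here)
t_crit <- qt(0.975, df = df.residual(model))

# One row per coefficient: intercept first, then slope
by_hand <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)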

Read more about confidence intervals here.

Correlation Coefficient

Revisiting the scatterplot we made previously (see Graphs > Scatterplot for information on the data set), we can calculate its correlation using cor(x, y). The output indicates a moderately negative correlation, which makes sense given the scatterplot.

> cor(school_and_crime$Crime.Rate..per.100.000,school_and_crime$High.School.Completion)

[1] -0.4262846

Regression Prediction Intervals

To find the confidence interval for the mean value of y given x, run your regression, then use the predict() function. Inside of this you give the model, the data to predict on as shown below, the type of interval, and the confidence level desired. For a simple linear regression, still use the newdata=list() notation.

> daughter <- c(65, 65, 61, 69, 67, 59, 69, 70, 68, 70, 70, 65, 70)
> mother <- c(64, 66, 62, 70, 70, 58, 66, 66, 64, 67, 65, 66, 68)
> father <- c(73, 70, 72, 72, 72, 63, 75, 75, 72, 69, 77, 70, 74)
> m1 <- lm(daughter~mother+father)
> predict(m1,newdata=list(mother=64, father=74),interval="confidence",level=0.95)

fit lwr upr
1 66.82968 64.7847 68.87465

To find the prediction interval for an individual value of y given x, change the interval type to prediction.

> predict(m1,newdata=list(mother=64, father=74),interval="prediction",level=0.95)

       fit     lwr      upr
1 66.82968 61.5329 72.12646
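
The two interval types share the same fitted value but differ in width, because a prediction interval also accounts for the variation of an individual observation around the mean. A sketch comparing them on the same model; new_family, conf, pred, and the width variables are illustrative names.

```r
daughter <- c(65, 65, 61, 69, 67, 59, 69, 70, 68, 70, 70, 65, 70)
mother <- c(64, 66, 62, 70, 70, 58, 66, 66, 64, 67, 65, 66, 68)
father <- c(73, 70, 72, 72, 72, 63, 75, 75, 72, 69, 77, 70, 74)
m1 <- lm(daughter ~ mother + father)

# A data frame works for newdata as well as a list
new_family <- data.frame(mother = 64, father = 74)

conf <- predict(m1, newdata = new_family, interval = "confidence", level = 0.95)
pred <- predict(m1, newdata = new_family, interval = "prediction", level = 0.95)

# Width of each interval: upper limit minus lower limit
conf_width <- conf[1, "upr"] - conf[1, "lwr"]
pred_width <- pred[1, "upr"] - pred[1, "lwr"]
```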

Read more about predictions here.

Simple Linear Regression

  1. Create a list of values (vectors) for the x and y variables. Age will go on the x-axis and AskingPrice on the y-axis for this example.

    > Age <- c(1,1,2,2,2,3,3,4,4,5,5,6,6,6)
    > AskingPrice <- c(17850,18000,15195,16995,15625,14935,14879,14460,13586,13050,13495,9150,9950,10995)

  2. Create a regression line using the command lm(y ~ x) for linear model.

    > RegressionLine <- lm(AskingPrice ~ Age)

  3. Plot the points, then add the fitted linear model. The lines() function can be used for this; the points should be sorted by the x-variable, with the fitted values put in the same order, before the line is drawn.

    > plot(Age, AskingPrice,xlab="Age (Years)",ylab="AskingPrice",main="Asking Price versus Age (Years)")
    > lines(sort(Age),fitted(RegressionLine)[order(Age)])

    Line Graph

  4. To get a summary of the linear regression line, use the summary() function.

    > summary(RegressionLine)
    lm(formula = AskingPrice ~ Age)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1574.94  -582.30    50.25   533.37  1357.83

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  19198.3      524.9   36.58 1.12e-13
    Age          -1412.2      131.8  -10.71 1.69e-07

    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: 868.7 on 12 degrees of freedom
    Multiple R-squared: 0.9053, Adjusted R-squared: 0.8975
    F-statistic: 114.8 on 1 and 12 DF, p-value: 1.692e-07
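
Individual pieces of the summary can be extracted for further use, for example to predict the asking price at a given age. A minimal sketch on the same data; the age of 3 years is chosen only for illustration, and the names b, r_squared, predicted_price, and by_formula are illustrative.

```r
Age <- c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6)
AskingPrice <- c(17850, 18000, 15195, 16995, 15625, 14935, 14879, 14460,
                 13586, 13050, 13495, 9150, 9950, 10995)
RegressionLine <- lm(AskingPrice ~ Age)

# Intercept and slope as a named vector
b <- coef(RegressionLine)

# Proportion of variation in AskingPrice explained by Age
r_squared <- summary(RegressionLine)$r.squared

# Predicted asking price for a 3-year-old car, two equivalent ways
predicted_price <- predict(RegressionLine, newdata = data.frame(Age = 3))
by_formula <- b[["(Intercept)"]] + b[["Age"]] * 3
```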

Read more about simple linear regression and plotting linear regression points here: Simple Linear Regression | The Default Scatterplot Function.

Sampling

Random Samples

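See the examples below. Enter the values you would like to sample from in an array, named as you like. sample(x) returns the whole array in random order, that is, a sample without replacement with as many values as are in the array. Give a second argument to set the sample size, as in sample(x, 2). By default, the function samples without replacement, but you may specify replace=TRUE. Because the samples are random, your output will differ from the output shown.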

Example 1
> x<-c(1,2,3,5,6,7)
> sample(x)
[1] 1 7 2 3 5 6
> sample(x, replace=TRUE)
[1] 3 5 5 3 2 1
> sample(x,2)
[1] 1 5

Example 2
> x<-1:5
> sample(x)
[1] 1 3 2 4 5
> sample(x,4,replace=TRUE)
[1] 4 4 2 2
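
To make a random sample reproducible, for instance so that classmates get the same result, set the random seed first with set.seed(). In the sketch below the seed value 42 is arbitrary, and the variable names are illustrative.

```r
x <- c(1, 2, 3, 5, 6, 7)

# Setting the same seed before each call gives the same sample
set.seed(42)
first_draw <- sample(x, 3)

set.seed(42)
second_draw <- sample(x, 3)

same_sample <- identical(first_draw, second_draw)
```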

t-Distribution

Inverse t

  1. Enter qt(probability, df=degrees of freedom), where probability is the area to the left of the t-value. The t-value is shown in output row [1].

    > qt(0.975, df=18)
    [1] 2.100922
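
The probability 0.975 in the example corresponds to a 95% confidence level with 2.5% in each tail. The sketch below makes that conversion explicit; t_critical is a hypothetical helper name, not a built-in function.

```r
# Hypothetical helper: two-tailed t critical value for a confidence level
t_critical <- function(conf.level, df) {
  alpha <- 1 - conf.level
  qt(1 - alpha / 2, df = df)
}

t_95 <- t_critical(0.95, df = 18)

# pt() is the inverse of qt(): the area to the left of t_95 is 0.975
area_below <- pt(t_95, df = 18)
```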

Hypergeometric Distribution

Hypergeometric Distribution

To find the probability of successes P(X=0), P(X=1), and P(X=2), use the dhyper(number of successes in the sample of size n, number of possible successes, number of possible failures, number of draws) function. This can equivalently be written as dhyper(x, k, N-k, n).

> dhyper(0,2,28,16)
## [1] 0.2091954
> dhyper(1,2,28,16)
## [1] 0.5149425
> dhyper(2,2,28,16)
## [1] 0.2758621
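
All three probabilities can be found in one call by giving a range for the number of successes, and phyper() gives the cumulative probability. A short sketch with the same parameters as above; the variable names are illustrative.

```r
# P(X = 0), P(X = 1), P(X = 2) in one call
probs <- dhyper(0:2, 2, 28, 16)

# X can only be 0, 1, or 2 here, so the probabilities add up to 1
total <- sum(probs)

# Cumulative probability P(X <= 1)
p_at_most_1 <- phyper(1, 2, 28, 16)
```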

Read more about hypergeometric distributions here.