TRY R

Data Visualisation & Data Science

Dublin – 13th March 2016

With Big Data gaining more and more importance every day, R is growing in parallel and is rapidly becoming the leading programming language in statistics and data science. Every year the number of R users grows, and organisations use it in their day-to-day activities to examine and mine their data for future business purposes.

**Using R Studio**

To begin using R, head to r-project.org to download and install R for your desktop or laptop. It runs on Windows, OS X and Linux. The Try R course from Code School (tryr.codeschool.com) is not the only online course available: if you are experiencing issues with the set-up, it is worth mentioning DataCamp, which also gives you a certificate of completion to link to your LinkedIn account! …but there are 63 chapters to go through, versus the 8 of Try R.

Here is my completion of the tryr.codeschool.com course.

This is what R Studio looks like once installed:

The top left window is where you’ll probably do most of your work. That’s the R code editor allowing you to create a file with multiple lines of R code — or open an existing file — and then run the entire file or portions of it.

Bottom left is the interactive console where you can type in R statements one line at a time. Any lines of code that are run from the editor window also appear in the console.

The top right window shows your workspace, which includes a list of objects currently in memory. There’s also a history tab with a list of your prior commands; what’s handy there is that you can select one, some or all of those lines of code and one-click to send them either to the console or to whatever file is active in your code editor.

The window at bottom right shows a plot if you’ve created a data visualization with your R code. There’s a history of previous plots and an option to export a plot to an image file or PDF. This window also shows external packages (R extensions) that are available on your system, files in your working directory and help files when called from the console.

**Basic data types in R**

R works with numerous data types. Some of the most basic types to get started are:

Decimal values like 4.5 are called **numerics**.

Natural numbers like 4 are called **integers**. Integers are also numerics.

Boolean values (TRUE or FALSE) are called **logical** (TRUE can be abbreviated to T and FALSE to F).

Text (or string) values are called **characters**.
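If you want to check these types for yourself, here is a small sketch using class(). The variable names are just my own examples. One thing to keep in mind: when you type 4 on its own, R stores it as a numeric; adding the L suffix (4L) asks for a true integer.

```r
# Example values of each basic type (names are illustrative only)
my_numeric   <- 4.5
my_integer   <- 4L          # the L suffix makes this an integer
my_logical   <- TRUE
my_character <- "some text"

# class() tells you the type R is using for each value
class(my_numeric)     # "numeric"
class(my_integer)     # "integer"
class(my_logical)     # "logical"
class(my_character)   # "character"

# Integers also count as numerics
is.numeric(my_integer)  # TRUE
```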

**What is a Variable?**

A basic concept in (statistical) programming is called a variable.

A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.

You can assign a value 4 to a variable x with the command: **x <- 4**

Suppose you have a fruit basket with five apples. As a data analyst in training (if you are a beginner like me 🙂), you want to store the number of apples in a variable with the name my_apples.

The following code in the editor: **my_apples <- 5**, assigns the value 5 to my_apples.

Every tasty fruit basket needs oranges, so you decide to add six oranges. As a data analyst, your reflex is to immediately create the variable **my_oranges** and assign the value 6 to it. Next, you want to calculate how many pieces of fruit you have in total. Since you have given meaningful names to these values, you can now code this in a clear way:

**my_apples + my_oranges**
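Put together, the fruit basket steps could look like the sketch below. Storing the result in my_fruit is optional; it is simply a convenient name for the total.

```r
# Five apples and six oranges
my_apples  <- 5
my_oranges <- 6

# Add the two variables and store the total
my_fruit <- my_apples + my_oranges
my_fruit   # 11
```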

Common knowledge tells you not to add apples and oranges. But hey, that is what you just did, no :-)? The **my_apples** and **my_oranges** variables both contained a number in the example above. The + operator works with numeric variables in R. If you really tried to add “apples” and “oranges”, by assigning a text value to the variable my_oranges, it would not work. You would be trying to assign the sum of a numeric and a character variable to the variable **my_fruit**. This is not possible. See the error returned below.

# Assign a value to the variable called 'my_apples'

my_apples <- 5

# Assign a text value to the variable 'my_oranges'

my_oranges <- "six"

my_oranges

# New variable that contains the total amount of fruit

my_fruit <- my_apples + my_oranges

my_fruit

The results returned by your console would be the following:

Error in my_apples + my_oranges : non-numeric argument to binary operator

> my_fruit

Error: object 'my_fruit' not found

That makes sense: R cannot add a number to a piece of text, so my_fruit was never created.
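If you would like to see the problem without stopping your script, one option is to check the types first and catch the error with tryCatch(). This is only a sketch of one way to do it; the name error_text is my own, and the exact wording of the message depends on your R language settings.

```r
my_apples  <- 5
my_oranges <- "six"

# The two variables have different types
class(my_apples)    # "numeric"
class(my_oranges)   # "character"

# Try the addition; if it fails, keep the error message instead of stopping
error_text <- tryCatch(
  my_apples + my_oranges,
  error = function(e) conditionMessage(e)
)

# In an English locale this holds "non-numeric argument to binary operator"
error_text
```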

**Exploratory Vs. Explanatory**

Data visualisation is an essential component of your skill set as a data scientist.

Data visualisation is statistics and design combined in a meaningful and appropriate way. In other words, data visualisation is a form of graphical data analysis.

It is important to understand the difference between exploratory plots and explanatory plots.

**Exploratory** visualisations are easily generated, data heavy and intended for a small specialist audience. Their primary purpose is graphical data analysis.

**Explanatory** visualisations are labour intensive, data specific and intended for a broader audience, for example in publications or presentations. They are part of the communication process.

As data analysts, our job is to explore our data, but also to explain it to a specific audience.

**Effectiveness of Insect Sprays: introduction.**

In the example below, I look at the effectiveness of insect sprays: the counts of insects killed in agricultural experimental units treated with different insecticides.

My dataset contains 72 observations on 2 variables.

[,1] count (numeric): the insect count.

[,2] spray (factor): the type of spray.

Taken from Source: *‘Beall, G., (1942) The Transformation of data from entomological field experiments, Biometrika, 29, 243–262.’*

count spray

1 10 A

2 7 A

3 20 A

4 14 A

5 14 A

6 12 A

7 10 A

8 23 A

9 17 A

10 20 A

11 14 A

12 13 A

13 11 B

14 17 B

15 21 B

16 11 B

17 16 B

18 14 B

19 17 B

20 17 B

21 19 B

22 21 B

23 7 B

24 13 B

25 0 C

26 1 C

27 7 C

28 2 C

29 3 C

30 1 C

31 2 C

32 1 C

33 3 C

34 0 C

35 1 C

36 4 C

37 3 D

38 5 D

39 12 D

40 6 D

41 4 D

42 3 D

43 5 D

44 5 D

45 5 D

46 5 D

47 2 D

48 4 D

49 3 E

50 5 E

51 3 E

52 5 E

53 3 E

54 6 E

55 1 E

56 1 E

57 3 E

58 2 E

59 6 E

60 4 E

61 11 F

62 9 F

63 15 F

64 22 F

65 15 F

66 16 F

67 13 F

68 10 F

69 26 F

70 26 F

71 24 F

72 13 F
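You do not have to type this table in yourself. InsectSprays ships with R in the built-in datasets package, so a quick way to load it and check its shape could look like this (a minimal sketch):

```r
# Load the built-in dataset into your workspace
data(InsectSprays)

# Structure: 72 observations of 2 variables (count and spray)
str(InsectSprays)

# First six rows
head(InsectSprays)

# Number of rows and the six spray types
nrow(InsectSprays)            # 72
levels(InsectSprays$spray)    # "A" "B" "C" "D" "E" "F"
```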

In our example, we are interested in the relationship between these two variables (the insect count, which is numeric, and the type of spray used, which is a categorical factor rather than a continuous variable), so the most obvious first step is to make a scatter plot, like the one below.
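I cannot say this is exactly the command behind the plot described here, but one way to get a scatter-style plot of count per spray is stripchart(). The jitter setting is my own choice, so that identical counts do not hide each other.

```r
data(InsectSprays)

# One point per observation, grouped by spray on the x axis.
# method = "jitter" nudges overlapping points sideways so they stay visible.
stripchart(count ~ spray, data = InsectSprays,
           vertical = TRUE, method = "jitter",
           xlab = "Type of spray", ylab = "Insect count",
           main = "InsectSprays data")
```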

**Effectiveness of Insect Sprays: explanatory**

We begin to explore our data, which reveals almost immediately that sprays C, D and E are the least effective, while further analysis is required for sprays A, B and F.

The same data can be displayed differently using the boxplot command, which produces box-and-whisker plot(s) of the given (grouped) values.

It is noticeable that for each of sprays C and D there is one value which is an anomaly compared to the other sampled data, in both cases higher than the others (7 insects for C and 12 for D).

This graph was made with the following command:

boxplot(count ~ spray, data = InsectSprays, xlab = "Type of spray", ylab = "Insect count", main = "InsectSprays data", varwidth = TRUE, col = "lightblue")
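To back up what the eye sees in the plots, you can also ask R for the numbers. The sketch below is my own addition: tapply() gives the mean count per spray, and calling boxplot() with plot = FALSE returns the statistics behind the box plot, including the points it flags as outliers. The names spray_means and flagged are just examples.

```r
data(InsectSprays)

# Mean insect count for each spray
spray_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)
round(spray_means, 2)
#     A     B     C     D     E     F
# 14.50 15.33  2.08  4.92  3.50 16.67

# Ask boxplot() for its statistics without drawing anything
flagged <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)

# Values drawn as separate points, and the sprays they belong to
flagged$out                    # 7 12
flagged$names[flagged$group]   # "C" "D"
```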

**Understanding R: a bit of arithmetic…**

If you are even newer to this than a dummy :), as I was, then let’s start with a bit of arithmetic!

In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators:

Addition: +

Subtraction: -

Multiplication: *

Division: /

Exponentiation: ^

Modulo: %%

The last two might need some explaining:

The ^ exponentiation raises the number to its left to the power of the number to its right: for example 3^2 is 9.

The modulo returns the remainder of the division of the number to the left by the number on its right, for example 5 modulo 3 or 5 %% 3 is 2.
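Here is each operator in action, as a small sketch you can paste into the console. The variable names are only there so the results are easy to refer to.

```r
addition       <- 5 + 5    # 10
subtraction    <- 5 - 5    # 0
multiplication <- 3 * 5    # 15
division       <- 10 / 2   # 5
exponentiation <- 3^2      # 9
modulo         <- 5 %% 3   # 2

# Print them all at once
c(addition, subtraction, multiplication, division, exponentiation, modulo)
```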

**Create a vector**

Feeling lucky? You better, because we are going on a trip to the City of Sins, also known as “Statisticians Paradise” ;-).

Thanks to R and your new data-analytical skills, you will learn how to uplift your performance at the tables and fire off your career as a professional gambler. This chapter will show how you can easily keep track of your betting progress and how you can do some simple analyses on past actions. Next Stop, Vegas Baby… VEGAS!!

Let us focus first!

On your way from rags to riches, you will make extensive use of vectors. Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data. For example, you can store your daily gains and losses in the casinos.

In R, you create a vector with the combine function c(). You place the vector elements, separated by commas, between the parentheses. For example:

numeric_vector <- c(1, 2, 3)

character_vector <- c("a", "b", "c")

boolean_vector <- c(TRUE, FALSE)

Once you have created these vectors in R, you can use them to do calculations.
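As a quick illustration of such calculations, here is a sketch with the numeric vector from above. The names doubled, total and n_elements are my own.

```r
numeric_vector <- c(1, 2, 3)

# Arithmetic is applied to every element of the vector
doubled <- numeric_vector * 2         # 2 4 6

# Some functions summarise the whole vector
total      <- sum(numeric_vector)     # 6
n_elements <- length(numeric_vector)  # 3
```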

After one week in Las Vegas and still zero Ferraris in your garage, you decide that it is time to start using your data analytical superpowers.

Before doing a first analysis, you decide to first collect all the winnings and losses for the last week:

For **poker_vector**:

On Monday you won 140$

Tuesday you lost 50$

Wednesday you won 20$

Thursday you lost 120$

Friday you won 240$

For **roulette_vector**:

On Monday you lost 24$

Tuesday you lost 50$

Wednesday you won 100$

Thursday you lost 350$

Friday you won 10$

You only played poker and roulette, since there was a delegation of mediums that occupied the craps tables. To be able to use this data in R, you decide to create the variables poker_vector and roulette_vector.

> poker_vector <- c(140, -50, 20, -120, 240)

> roulette_vector <- c(-24, -50, 100, -350, 10)

As a data analyst, it is important to have a clear view on the data that you are using. Understanding what each element refers to is therefore essential.

We created a vector with your winnings over the week. Each vector element refers to a day of the week, but it is hard to tell which element belongs to which day. It would be nice if you could show that in the vector itself.
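R lets you do exactly that with the names() function. A minimal sketch for the poker winnings:

```r
poker_vector <- c(140, -50, 20, -120, 240)

# Attach a day name to each element
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Printing the vector now shows the day above each value
poker_vector

# The names can also be read back
names(poker_vector)
```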

**Calculating total winnings**

Now that you have the poker and roulette winnings nicely as named vectors, you can start doing some data analytical magic.

You might want to find out:

1) What was your overall profit or loss per day of the week?

2) Have you lost money over the week in total?

3) Are you winning/losing money on poker or on roulette?

To get the answers, we have to do arithmetic calculations on vectors.

It is important to know that if you sum two vectors in R, it takes the element-wise sum.
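A tiny sketch makes the element-wise idea concrete. The vectors a_vector and b_vector are made-up examples, not part of the casino data.

```r
a_vector <- c(1, 2, 3)
b_vector <- c(4, 5, 6)

# Element-wise sum: the same as c(1 + 4, 2 + 5, 3 + 6)
total_vector <- a_vector + b_vector
total_vector   # 5 7 9
```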

**Very important consideration: if you want to become a good statistician, you have to become lazy.** (If you are already lazy, chances are high you are one of those exceptional, natural-born statistical talents.)

In the previous examples you probably experienced that it is boring and frustrating to type and retype information such as the days of the week. However, when you look at it from a higher perspective, there is a more efficient way to do this, namely, to assign the days of the week vector to a variable!

Just like you did with your poker and roulette returns, you can also create a variable that contains the days of the week. This way you can use and re-use it.

# Create the variable 'days_vector'

days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Assign the names of the days to 'roulette_vector' and 'poker_vector'

names(poker_vector) <- days_vector

names(roulette_vector) <- days_vector

# Total daily winnings, then plot roulette against poker

total_daily <- poker_vector + roulette_vector

total_daily

Monday Tuesday Wednesday Thursday Friday

116 -100 120 -470 250

plot(poker_vector, roulette_vector)

Based on the previous analysis, it looks like you had a mix of good and bad days. This is not what your ego expected, and you wonder if there may be a (very very very) tiny chance you have lost money over the week in total?

A function that helps you to answer this question is sum(). It calculates the sum of all elements of a vector. For example, to calculate the total amount of money you have lost/won with poker you do:

total_poker <- sum(poker_vector)

# Poker winnings from Monday to Friday

poker_vector <- c(140, -50, 20, -120, 240)

# Roulette winnings from Monday to Friday

roulette_vector <- c(-24, -50, 100, -350, 10)

# Give names to both 'poker_vector' and 'roulette_vector'

days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

names(roulette_vector) <- days_vector

names(poker_vector) <- days_vector

# Total winnings with poker

total_poker <- sum(poker_vector)

plot(poker_vector,roulette_vector)

# Up to you now:

total_roulette <- sum(roulette_vector)

total_week <- sum(poker_vector,roulette_vector)

plot(total_week)

This displays a single dot at -84: a total loss of 84$ over the week.

Oops, it seems like you are losing money. Time to rethink and adapt your strategy! This will require some deeper analysis…

After a short brainstorm in your hotel’s jacuzzi, you realize that a possible explanation might be that your skills in roulette are not as well developed as your skills in poker. So maybe your total gains in poker are higher (or > ) than in roulette…?!

so…

> # Calculate total gains for poker and roulette

> total_poker <- sum(poker_vector)

> total_roulette <- sum(roulette_vector)

> total_poker

[1] 230

> total_roulette

[1] -314

> # Check if you realized higher total gains in poker than in roulette

> answer <- total_poker > total_roulette

> answer

[1] TRUE

**Vector selection: the good times**

Your guess seemed to be right. It appears that the poker game is more your cup of tea than roulette.

Another possible route for investigation is your performance at the beginning of the working week compared to the end of it. You did have a couple of Margarita cocktails at the end of the week…

To answer that question, you only want to focus on a selection of poker_vector. In other words, our goal is to select specific elements of the vector. To select elements of a vector, you can use square brackets. Between the square brackets, you indicate which elements to select. For example, to select the first element of the vector, you type poker_vector[1]. To select the second element of the vector, you type poker_vector[2], etc.

> # Define a new variable based on a selection

> poker_wednesday <- poker_vector[3]

> poker_wednesday

Wednesday

20

How about analysing your results for the week now?

To select multiple elements from a vector, put a vector of positions between the square brackets. For example, suppose you want to select the first and the fifth day of the week: use the vector c(1,5) between the square brackets. The code below selects the first and fifth elements of poker_vector:

poker_vector[c(1,5)]

> # Define a new variable based on a selection

> poker_midweek <- poker_vector[c(2,3,4)]

> poker_midweek

Tuesday Wednesday Thursday

-50 20 -120

> Poker_earlyweek=poker_vector[c(1:3)]

> Poker_earlyweek

Monday Tuesday Wednesday

140 -50 20

> poker_endweek=poker_vector[c(4,5)]

> poker_endweek

Thursday Friday

-120 240

> mean(Poker_earlyweek)

[1] 36.66667

> mean(poker_midweek)

[1] -50

> mean(poker_endweek)

[1] 60

Another way to select elements from a vector is by using the names of the vector elements (Monday, Tuesday, …) instead of their numeric positions. For example,

poker_vector["Monday"]

will select the first element of poker_vector since “Monday” is the name of that first element.
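The same idea works for several names at once. As a sketch (poker_start is my own example name), here are the first three days selected by name and averaged:

```r
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Select several elements by passing a vector of names
poker_start <- poker_vector[c("Monday", "Tuesday", "Wednesday")]
poker_start

# Average winnings over those three days
mean(poker_start)   # 36.66667
```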

**Selection by comparison**

By making use of comparison operators, we can approach the previous question in a more proactive way.

The (logical) comparison operators known to R are:

< for less than

> for greater than

<= for less than or equal to

>= for greater than or equal to

== for equal to each other

!= not equal to each other

Stating 6 > 5 returns TRUE. The nice thing about R is that you can use these comparison operators also on vectors. For example, the statement c(4,5,6) > 5 returns: FALSE FALSE TRUE. In other words, you test for every element of the vector if the condition stated by the comparison operator is TRUE or FALSE. Do not take our word for it! Try it in the console ;-).

Behind the scenes, R recycles the value 5 when you execute c(4,5,6) > 5. R wants to do an element-wise comparison of each element in c(4,5,6) with each corresponding element in 5. However, 5 is not a vector of length three. To solve this, R automatically replicates the value 5 to generate a vector of three elements, c(5, 5, 5) and then carries out the element-wise comparison.
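You can convince yourself of the recycling with a short sketch that compares the short form with the spelled-out form. The names short_way and long_way are mine.

```r
# Comparing against a single value...
short_way <- c(4, 5, 6) > 5

# ...gives the same result as comparing against the value repeated three times
long_way <- c(4, 5, 6) > c(5, 5, 5)

short_way                        # FALSE FALSE  TRUE
identical(short_way, long_way)   # TRUE
```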

# What days of the week did you make money on poker?

> selection_vector <- poker_vector > 0

> selection_vector

Monday Tuesday Wednesday Thursday Friday

TRUE FALSE TRUE FALSE TRUE

> selection_vector <- roulette_vector > 0

> selection_vector

Monday Tuesday Wednesday Thursday Friday

FALSE FALSE TRUE FALSE TRUE

Working with comparisons will make your data analytical life easier. Instead of selecting a subset of days to investigate yourself (like before), you can simply ask R to return only those days where you realized a positive return for poker.

You used selection_vector <- poker_vector > 0 to find the days on which you had a positive poker return. Now, you would like to know not only the days on which you won, but also how much you won on those days.

You can select the desired elements by putting selection_vector between the square brackets that follow poker_vector. This works because, by default, R only selects those elements where selection_vector is TRUE. For selection_vector this means where poker_vector > 0.

> # What days of the week did you make money on poker?

> selection_vector <- poker_vector > 0

> # Select from poker_vector these days

> poker_winning_days <- poker_vector[selection_vector]

> poker_winning_days

Monday Wednesday Friday

140 20 240

Just like you did for poker, you also want to know those days where you realised a positive return for roulette.

> # What days of the week did you make money on roulette?

> selection_vector <- roulette_vector > 0

> # Select from roulette_vector these days

> roulette_winning_days <- roulette_vector[selection_vector]

> roulette_winning_days

Wednesday Friday

100 10
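Once you are comfortable with this, the comparison can go straight inside the square brackets, so no separate selection vector is needed. The sketch below also adds up the winning days; the names poker_wins and roulette_wins are my own.

```r
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Put the comparison directly between the brackets
poker_wins <- poker_vector[poker_vector > 0]
roulette_wins <- roulette_vector[roulette_vector > 0]

# Which days were winning days, and how much was won in total on them?
names(poker_wins)      # "Monday" "Wednesday" "Friday"
sum(poker_wins)        # 400
names(roulette_wins)   # "Wednesday" "Friday"
sum(roulette_wins)     # 110
```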

OK, I hope you feel a bit more confident about R now: although we have only scratched the surface, at the end of the day it is not so difficult, believe me! So keep going, discover it and enjoy the visualisation of data with plots and charts!