```
x <- 1
print(x)
```

`[1] 1`

`x`

`[1] 1`

`msg <- "hello"`

Introduction to data types and objects in R

module 4, week 4, R, programming

Published September 20, 2022

Material for this lecture was borrowed and adopted from

At the R prompt we type expressions. The `<-` symbol is the assignment operator.

The grammar of the language determines **whether an expression is complete or not**.

```
Error: <text>:2:0: unexpected end of input
1: x <- ## Incomplete expression
^
```

The `#` character indicates a **comment**.

Anything to the right of the `#` (including the `#` itself) is ignored. **This is the only comment character in R**.

Unlike some other languages, R does not support multi-line comments or comment blocks.

When a complete expression is entered at the prompt, **it is evaluated and the result of the evaluated expression is returned**.

The result may be **auto-printed**.

The `[1]` shown in the output indicates that `x` is a vector and `5` is its first element.
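A minimal sketch of evaluation and auto-printing, assuming `x` has been assigned the value `5` as the sentence above implies:

```r
x <- 5      # assignment: the expression is evaluated, nothing is printed
x           # auto-printing: typing the name prints [1] 5
print(x)    # explicit printing gives the same output
```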

Typically with **interactive work**, we **do not explicitly print objects** with the `print()` function; it is much easier to just auto-print them by typing the name of the object and hitting return/enter.

However, when **writing scripts, functions, or longer programs**, there is sometimes a **need to explicitly print objects** because auto-printing does not work in those settings.

When an R vector is printed you will notice that an index for the vector is printed in square brackets `[]` on the side. For example, see this integer sequence of length 20.
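As a sketch, any integer sequence of length 20 will do; the values `11:30` here are an illustrative assumption. How many elements appear per printed line depends on the width of your console.

```r
x <- 11:30   # an integer sequence of length 20 (illustrative values)
x            # each printed line starts with the index of its first element
length(x)    # 20
```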

The numbers in the square brackets are not part of the vector itself; they are merely part of the **printed output**.

The most basic type of R object is a **vector**.

There is really only one rule about vectors in R, which is that

A vector can only contain objects of the same class

To understand what we mean here, we need to dig a little deeper. We will come back to this in just a minute.

There are two types of **vectors** in R:

- **Atomic vectors**:
  - **logical**: `FALSE`, `TRUE`, and `NA`
  - **integer** (and **doubles**): these are known collectively as **numeric** vectors (or real numbers)
  - **complex**: complex numbers
  - **character**: the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data
  - **raw**: used to store fixed-length sequences of bytes. These are not commonly used directly in data analysis and I won’t cover them here.
- **Lists**, which are sometimes called **recursive vectors** because lists can contain other lists.

[**Source**: R 4 Data Science]

Empty vectors can be created with the `vector()` function.

The `c()` function can be used to **create vectors of objects** by **concatenating** things together.
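A short sketch of both functions; the variable names and values are illustrative assumptions:

```r
v <- vector("numeric", length = 4)  # "empty" numeric vector, filled with 0
v

x <- c(0.5, 0.6)        # numeric
y <- c(TRUE, FALSE)     # logical
z <- c("a", "b", "c")   # character
w <- 9:29               # integer sequence
u <- c(1 + 0i, 2 + 4i)  # complex
```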

So, I know I said there is one rule about vectors:

A vector can only contain objects of the same class

But of course, like any good rule, there is an exception, which is a **list** (which we will get to in greater detail a bit later).

For now, just know a **list** is **represented as a vector** but can **contain objects of different classes**. Indeed, that’s usually why we use them.

**Integer** and **double** vectors are known collectively as **numeric vectors**.

In R, numbers are doubles by default.

To make an integer, place an `L` after the number:
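For instance (the value `4` is an arbitrary choice), `typeof()` shows the difference the suffix makes:

```r
typeof(4)         # "double"
typeof(4L)        # "integer"
4 == 4L           # TRUE: the values compare equal
identical(4, 4L)  # FALSE: the underlying types differ
```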

Numbers in R are generally treated as **numeric objects** (i.e. double precision real numbers).

This means that even if you see a number like “1” or “2” in R, which you might think of as integers, they are likely represented behind the scenes as numeric objects (so something like “1.00” or “2.00”).

This isn’t important most of the time…except when it is!

If you **explicitly want an integer**, you need to specify the `L` suffix. So entering `1` in R gives you a numeric object; entering `1L` explicitly gives you an integer object.

R objects can have **attributes**, which are like **metadata for the object**.

These metadata can be very useful in that they **help to describe the object**.

For example, **column names** on a data frame help to tell us what data are contained in each of the columns. Some examples of R object attributes are

- names, dimnames
- dimensions (e.g. matrices, arrays)
- class (e.g. integer, numeric)
- length
- other user-defined attributes/metadata

Attributes of an object (if any) can be accessed using the `attributes()` function. Not all R objects contain attributes, in which case the `attributes()` function returns `NULL`.
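A small sketch contrasting an object without attributes and one with a dimension attribute; the objects `x` and `m` are illustrative assumptions:

```r
x <- 1:3
attributes(x)               # NULL: a plain vector carries no attributes

m <- matrix(1:6, nrow = 2)
attributes(m)               # a list with a single element, $dim, equal to 2 3
```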

However, every **vector** has two key properties:

- Its **type**, which you can determine with `typeof()`.

```
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
```

`[1] "character"`

` [1] 1 2 3 4 5 6 7 8 9 10`

`[1] "integer"`

- Its **length**, which you can determine with `length()`.
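Both properties can be checked on the built-in `letters` vector and on an integer sequence; a minimal sketch:

```r
typeof(letters)   # "character"
length(letters)   # 26

typeof(1:10)      # "integer"
length(1:10)      # 10
```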

There are occasions when **different classes of R objects get mixed together**.

Sometimes this happens by accident but it can also happen on purpose.
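As a minimal sketch of such mixing (the values are illustrative assumptions), each call to `c()` below combines two classes, and `class()` reports what R ends up storing:

```r
y <- c(1.7, "a")    # numeric and character
z <- c(TRUE, 2)     # logical and numeric
w <- c("a", TRUE)   # character and logical

class(y)   # "character"
class(z)   # "numeric"
class(w)   # "character"
```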

Why is this happening?

In each case above, we are **mixing objects of two different classes** in a vector.

But remember that the only rule about vectors says this is not allowed?

When different objects are mixed in a vector, **coercion** occurs so that **every element in the vector is of the same class**.

In the example above, we see the effect of **implicit coercion**.

What R tries to do is find a way to represent all of the objects in the vector in a reasonable fashion. Sometimes this does exactly what you want and…sometimes not.

For example, combining a numeric object with a character object will create a character vector, because numbers can usually be easily represented as strings.

Objects can be explicitly coerced from one class to another using the `as.*()` functions, if available.

`[1] "integer"`

`[1] 0 1 2 3 4 5 6`

`[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE`

`[1] "0" "1" "2" "3" "4" "5" "6"`
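Outputs like those shown above can be reproduced with a sketch such as the following, assuming the object being coerced is the integer sequence `0:6`:

```r
x <- 0:6
class(x)          # "integer"
as.numeric(x)     # 0 1 2 3 4 5 6
as.logical(x)     # FALSE for 0, TRUE for every non-zero value
as.character(x)   # "0" "1" "2" "3" "4" "5" "6"
```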

Sometimes, **R can’t figure out how to coerce an object** and this can result in `NA`s being produced.

`Warning: NAs introduced by coercion`

`[1] NA NA NA`

`[1] NA NA NA`

When nonsensical coercion takes place, you will usually get a warning from R.
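A sketch of coercion that cannot succeed, assuming a character vector of letters as the input:

```r
x <- c("a", "b", "c")
as.numeric(x)   # NA NA NA, with a warning that NAs were introduced by coercion
as.logical(x)   # NA NA NA
```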

**Matrices** are **vectors with a dimension attribute**.

- The **dimension attribute** is **itself an integer vector** of length 2 (number of rows, number of columns)

```
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
```

`[1] 2 3`

```
$dim
[1] 2 3
```
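Output of this shape can be produced with a sketch like the one below, assuming an empty 2 x 3 matrix named `m`:

```r
m <- matrix(nrow = 2, ncol = 3)  # no data supplied, so every cell is NA
m
dim(m)          # 2 3
attributes(m)   # a list whose only element is $dim
```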

Matrices are **constructed column-wise**, so entries can be thought of starting in the “upper left” corner and running down the columns.
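To see the column-wise fill order, here is a small sketch with illustrative values `1:6`:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
m[, 1]   # first column: 1 2
m[1, ]   # first row:    1 3 5
```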

Matrices can also be created directly from vectors by adding a dimension attribute.

` [1] 1 2 3 4 5 6 7 8 9 10`

```
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
```
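A sketch consistent with the output above, assuming the vector is `1:10` and the dimensions are 2 rows by 5 columns:

```r
m <- 1:10
m                   # still a plain integer vector
dim(m) <- c(2, 5)   # adding a dimension attribute turns it into a matrix
m
```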

Matrices can be created by **column-binding** or **row-binding** with the `cbind()` and `rbind()` functions.
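A brief sketch with two illustrative vectors `x` and `y`:

```r
x <- 1:3
y <- 10:12
cbind(x, y)   # 3 rows, 2 columns named x and y
rbind(x, y)   # 2 rows named x and y, 3 columns
```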

Lists are a special type of **vector** that **can contain elements of different classes**. Lists are a very important data type in R and you should get to know them well.

Lists can be explicitly created using the `list()` function, which takes an arbitrary number of arguments.

We can also create an empty list of a prespecified length with the `vector()` function.
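Both approaches can be sketched as follows; the list contents are illustrative assumptions:

```r
x <- list(1, "a", TRUE, 1 + 4i)   # four elements, four different classes
sapply(x, class)                  # "numeric" "character" "logical" "complex"

empty <- vector("list", length = 5)   # five elements, each NULL
length(empty)                         # 5
```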

**Factors** are used to represent **categorical data** and can be unordered or ordered. One can think of a factor as an integer vector where each integer has a **label**.

Using factors with labels is **better** than using integers because factors are self-describing.

Factor objects can be created with the `factor()` function.

```
[1] yes yes no yes no
Levels: no yes
```

```
x
 no yes
  2   3
```

```
[1] 2 2 1 2 1
attr(,"levels")
[1] "no" "yes"
```
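The three outputs above are consistent with a sketch like this one, assuming a factor `x` built from five yes/no responses:

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))
x            # prints the values and the levels
table(x)     # counts per level
unclass(x)   # the underlying integer codes plus the "levels" attribute
```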

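Factors may be created for you when you read in a dataset using a function like `read.table()`.

- Before R 4.0.0, those functions **defaulted to creating factors when they encountered data that look like characters or strings**. Since R 4.0.0 the default is `stringsAsFactors = FALSE`, so character columns stay character unless you request factors explicitly.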

The order of the levels of a factor can be set using the `levels` argument to `factor()`. This can be important in linear modeling because the first level is used as the baseline level.
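A sketch of setting the level order, using illustrative yes/no data:

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))
levels(x)   # "no" "yes": alphabetical by default, so "no" is the baseline

x <- factor(c("yes", "yes", "no", "yes", "no"),
            levels = c("yes", "no"))
levels(x)   # "yes" "no": "yes" is now the baseline level
```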

**Missing values** are denoted by `NA` or `NaN` for undefined mathematical operations.

- `is.na()` is used to test objects if they are `NA`
- `is.nan()` is used to test for `NaN`
- `NA` values have a class also, so there are integer `NA`, character `NA`, etc.
- A `NaN` value is also `NA` but the converse is not true
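The last point can be checked with a small sketch; the vectors are illustrative assumptions:

```r
x <- c(1, 2, NA, 10, 3)
is.na(x)    # FALSE FALSE  TRUE FALSE FALSE
is.nan(x)   # all FALSE: NA is not NaN

y <- c(1, 2, NaN, NA, 4)
is.na(y)    # FALSE FALSE  TRUE  TRUE FALSE: NaN counts as NA
is.nan(y)   # FALSE FALSE  TRUE FALSE FALSE
```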

**Data frames** are used to store **tabular data** in R. They are an important type of object in R and are used in a variety of statistical modeling applications. Hadley Wickham’s package dplyr has an optimized set of functions designed to work efficiently with data frames.

Data frames are **represented as a special type of list** where **every element of the list has to have the same length**.

- Each element of the list can be thought of as a column
- The length of each element of the list is the number of rows

Unlike matrices, **data frames can store different classes of objects in each column**. Matrices must have every element be the same class (e.g. all integers or all numeric).

In addition to column names, indicating the names of the variables or predictors, data frames have a special attribute called `row.names` which indicates information about each row of the data frame.

Data frames are usually created by reading in a dataset using `read.table()` or `read.csv()`. However, data frames can also be created explicitly with the `data.frame()` function or they can be coerced from other types of objects like lists.

```
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
```

`[1] 4`

`[1] 2`

```
$names
[1] "foo" "bar"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4
```
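A sketch that would produce output like the above, assuming a data frame `x` with an integer column `foo` and a logical column `bar`:

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
x
nrow(x)         # 4
ncol(x)         # 2
attributes(x)   # names, class, and row.names
```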

Data frames can be converted to a matrix by calling `data.matrix()`. While it might seem that the `as.matrix()` function should be used to coerce a data frame to a matrix, almost always, what you want is the result of `data.matrix()`.

```
     foo bar
[1,]   1   1
[2,]   2   1
[3,]   3   0
[4,]   4   0
```

```
$dim
[1] 4 2
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "foo" "bar"
```
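The conversion can be sketched as follows, again assuming a data frame `x` with columns `foo` and `bar`:

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
m <- data.matrix(x)   # logical values become 1 (TRUE) and 0 (FALSE)
m
attributes(m)         # dim plus dimnames; the row names are NULL
```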

R objects can have **names**, which is very useful for writing readable code and self-describing objects.

Here is an example of assigning names to an integer vector.

`NULL`

```
   New York     Seattle Los Angeles
          1           2           3
```

`[1] "New York" "Seattle" "Los Angeles"`

```
$names
[1] "New York" "Seattle" "Los Angeles"
```
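A sketch matching the sequence of outputs above, assuming the integer vector is `1:3`:

```r
x <- 1:3
names(x)        # NULL: no names have been set yet
names(x) <- c("New York", "Seattle", "Los Angeles")
x               # values are now printed under their names
names(x)
attributes(x)   # names are stored as an attribute
```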

**Lists can also have names**, which is often very useful.

```
$`Los Angeles`
[1] 1
$Boston
[1] 2
$London
[1] 3
```

`[1] "Los Angeles" "Boston" "London" `
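A named list like the one printed above can be sketched as:

```r
x <- list("Los Angeles" = 1, Boston = 2, London = 3)
x          # each element is printed under its name
names(x)   # "Los Angeles" "Boston" "London"
x$Boston   # elements can be extracted by name: 2
```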

**Matrices can have both column and row names**.

```
  c d
a 1 3
b 2 4
```

Column names and row names can be set separately using the `colnames()` and `rownames()` functions.
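Both ways of naming can be sketched as below. The 2 x 2 matrix and the first set of names follow the output above; the replacement names `h`, `f`, `x`, `z` are illustrative assumptions.

```r
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))   # row names first, then column names
m

colnames(m) <- c("h", "f")   # replace only the column names
rownames(m) <- c("x", "z")   # replace only the row names
m
```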

There are a variety of different built-in data types in R. In this chapter we have reviewed the following:

- atomic classes: numeric, logical, character, integer, complex
- vectors, lists
- factors
- missing values
- data frames and matrices

All R objects can have attributes that help to describe what is in the object. Perhaps the most useful attribute is names, such as column and row names in a data frame, or simply names in a vector or list. Attributes like dimensions are also important as they can modify the behavior of objects, like turning a vector into a matrix.

Here are some post-lecture questions to help you think about the material discussed.