Python for R users

Introduction to using Python in R and the reticulate package

Stephanie Hicks (https://stephaniehicks.com/), Department of Biostatistics, Johns Hopkins (https://www.jhsph.edu)
10-14-2021

Pre-lecture materials

Read ahead

Before class, you can prepare by reading the following materials:

  1. https://rstudio.github.io/reticulate
  2. https://py-pkgs.org/02-setup
  3. The Python Tutorial

Acknowledgements

Material for this lecture was borrowed and adapted from

Learning objectives

At the end of this lesson you will:

Python in R Markdown

For this lesson, we will be using the reticulate R package, which provides a set of tools for interoperability between Python and R. The package includes facilities for:

Figure 1: reticulate R package logo

[Source: Rstudio]

Installing Python: If you would like recommendations on installing Python, I like this resource: https://py-pkgs.org/02-setup#installing-python

What’s happening under the hood?: reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability.

If you are an R developer who uses Python for some of your work, or a member of a data science team that uses both languages, reticulate can make your life better!

Let’s try it out. Before we get started, you will need to install the package if you have not already done so:

install.packages("reticulate")

We will also load the here and tidyverse packages for our lesson:

python path

By default, reticulate uses the version of Python found on your PATH:

Sys.which("python3.9")
python3.9 
       "" 

The use_python() function enables you to specify an alternate version, for example:

use_python("/usr/<new>/<path>/local/bin/python")

For example, I can define the path explicitly:

use_python("/Users/shicks/opt/miniconda3/bin/python3.9", required = TRUE)

Calling Python

There are a variety of ways to integrate Python code into your R projects:

  1. Python in R Markdown — A new Python language engine for R Markdown that supports bi-directional communication between R and Python (R chunks can access Python objects and vice-versa).

  2. Importing Python modules — The import() function enables you to import any Python module and call its functions directly from R.

  3. Sourcing Python scripts — The source_python() function enables you to source a Python script the same way you would source() an R script (Python functions and objects defined within the script become directly available to the R session).

  4. Python REPL — The repl_python() function creates an interactive Python console within R. Objects you create within Python are available to your R session (and vice-versa).

Below I will focus on introducing the first and last of these. However, before we do that, let’s cover some Python basics.

Python basics

Python is a high-level, object-oriented programming language that is useful to know for anyone analyzing data. The most important thing to know before learning Python is that everything in Python is an object. There is no separate compile step, and you do not need to declare the type of a variable or allocate memory for it before using it. The syntax is easy to learn and easy to read.
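
As a small sketch of what "no need to define the type of variables" means in practice (the variable name and values are arbitrary), the same name can be bound first to an integer and then to a string without any declaration:

```python
x = 3                      # x refers to an int; no declaration was needed
print(type(x).__name__)

x = "three"                # the same name can later refer to a str
print(type(x).__name__)
```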

There is a large scientific community contributing to Python. Some of the most widely used libraries in Python are numpy, scipy, pandas, and matplotlib.

start python

There are two modes you can write Python code in: interactive mode or script mode. If you open a terminal on a UNIX-like system, you can simply type python (or python3) in the shell:

python3

and the interactive mode will open up. You can write code in the interactive mode and Python will interpret the code using the python interpreter.

Another way to pass code to Python is to store code in a file ending in .py, and execute the file in the script mode using

python3 myscript.py
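
The lecture does not show what myscript.py contains. As a hypothetical sketch only (the function name main and the message are assumptions), a script meant for script mode might look like this:

```python
# myscript.py: hypothetical contents, for illustration only
import sys


def main():
    # Build a short message that reports the running Python version
    version = "%d.%d" % (sys.version_info.major, sys.version_info.minor)
    return "Hello from Python " + version


if __name__ == "__main__":
    # This branch runs when the file is executed as a script
    print(main())
```

The `if __name__ == "__main__":` guard is a common convention: the code under it runs when the file is executed directly, but not when the file is imported as a module.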

To check what version of Python you are using, type the following in the shell:

python3 --version

objects in python

Everything in Python is an object. Think of an object as a data structure that contains both data and functions. Variables, functions and modules are all objects. We can operate on these objects with operators (e.g. addition, subtraction, concatenation), define and apply functions to them, test them in conditional statements (e.g. if and else statements), or iterate over them.

Not all objects are required to have attributes and methods to operate on the objects in Python, but everything is an object (i.e. all objects can be assigned to a variable or passed as an argument to a function). A user can work with built-in defined classes of objects or can create new classes of objects. Using these objects, a user can perform operations on the objects by modifying / interacting with them.
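
To make "all objects can be assigned to a variable or passed as an argument to a function" concrete, here is a minimal sketch in which a function is itself treated as an object. The names shout, loud and apply_twice are made up for illustration:

```python
def shout(text):
    """Return the text in upper case followed by an exclamation mark."""
    return text.upper() + "!"


# A function is an object: it can be bound to another name
loud = shout


# It can also be passed as an argument to another function
def apply_twice(func, value):
    return func(func(value))


print(type(3), type("crab"), type(shout))
print(loud("go ravens"))
print(apply_twice(shout, "go"))
```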

variables

Variable names are case sensitive, can contain letters, numbers and underscores, cannot begin with a number, cannot contain illegal characters (such as spaces or hyphens) and cannot be one of the 35 reserved keywords in Python 3:

“False, None, True, and, as, assert, async, await, break, class, continue, def, del, elif, else, except, finally, for, from, global, if, import, in, is, lambda, nonlocal, not, or, pass, raise, return, try, while, with, yield”
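
The rules above can be checked from within Python. The sketch below uses the standard keyword module and the str.isidentifier() method; the helper is_valid_name and the candidate names are hypothetical, chosen only to exercise each rule:

```python
import keyword


def is_valid_name(name):
    # A usable variable name is a valid identifier that is not a reserved keyword
    return name.isidentifier() and not keyword.iskeyword(name)


# Candidate names, made up for illustration
candidates = ["crab_count", "CrabCount", "crabs2", "2crabs", "crab-count", "class"]

for name in candidates:
    print(name, "->", "valid" if is_valid_name(name) else "invalid")
```

The list of keywords for the interpreter you are running is available as keyword.kwlist.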

operators

2 ** 3
8
x = 3 
x > 1 and x <= 5
True
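
A few more operators, as a short sketch to try in the interpreter (the values are arbitrary):

```python
print(7 / 2)              # true division
print(7 // 2)             # floor division
print(7 % 2)              # remainder
print("crab" + "cake")    # + concatenates strings
print("doo " * 3)         # * repeats a string

x = 3
print(1 < x <= 5)         # comparisons can be chained
print(not x > 1 or x == 3)
```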

format operators

When % is applied to a string, it acts as the format operator: it tells Python how to insert the values on its right (usually a tuple) into the placeholders of the string on its left. For example,

print('In %d days, I have eaten %g %s.' % (5, 3.5, 'crabs'))
In 5 days, I have eaten 3.5 crabs.
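
The % operator is not the only way to format strings. As a sketch for comparison, the same sentence can be built with str.format() or with an f-string (available from Python 3.6). The variable names here are made up for illustration:

```python
days, amount, food = 5, 3.5, "crabs"

# The % format operator, as in the example above
old_style = "In %d days, I have eaten %g %s." % (days, amount, food)

# str.format() fills the {} placeholders in order
new_style = "In {} days, I have eaten {} {}.".format(days, amount, food)

# An f-string evaluates the names inside the braces
f_string = f"In {days} days, I have eaten {amount} {food}."

print(old_style)
print(old_style == new_style == f_string)
```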

functions

Python contains a small list of very useful built-in functions. All other functions need to be defined by the user or imported from modules. For a more detailed list of the built-in functions in Python, see Built-in Python Functions.

The first function we will discuss, type(), reports the type of any object, which is very useful when handling multiple data types (remember, everything in Python is an object). Here are some of the main types you will encounter:

If we asked for the type of a string “Let’s go Ravens!”

type("Let's go Ravens!")
<class 'str'>

This would return the str type.
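
As a short sketch, type() can be applied to values of several common built-in types. The example values are arbitrary and the selection of types is my own, not necessarily the list intended in the lecture:

```python
examples = [42, 3.14, "Let's go Ravens!", True, [1, 2, 3], {"a": 1}, None]

for value in examples:
    # type(value).__name__ gives the bare type name, e.g. 'int'
    print(repr(value), "is of type", type(value).__name__)
```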

You have also seen how to use the print() function. It accepts one or more arguments and prints them to the screen:

print("Let's go Ravens!")
Let's go Ravens!

new functions

New functions can be defined using the def keyword.

def new_world(): 
    return 'Hello world!'
    
print(new_world())
Hello world!

The first line of the function (the header) must start with def, followed by the name of the function (which can contain underscores), parentheses (with any parameters inside them) and a colon. When the function is called, arguments passed by keyword can be given in any order.

The rest of the function (the body) must be indented consistently; four spaces is the convention. If you define a function in interactive mode, the interpreter will print an ellipsis (...) to let you know the function is not complete. To complete the function, enter an empty line (not necessary in a script).

To return a value from a function, use return. The function will immediately terminate and not run any code written past this point.

def squared(x):
    """ Return the square of a  
        value """
    return x ** 2

print(squared(4))
16
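
Two points from the text can be seen in one small sketch: return ends the function immediately, and keyword arguments can be supplied in any order. The function describe and its parameters are hypothetical:

```python
def describe(value, power=2, label="result"):
    """Return a short text summary of value raised to power."""
    if power < 0:
        # return ends the function here; nothing below runs for this input
        return "negative powers are not handled in this sketch"
    return "%s: %g" % (label, value ** power)


print(describe(4))
# Keyword arguments can be given in any order
print(describe(label="cube", power=3, value=2))
print(describe(4, power=-1))
```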

Note: Python has its own version of R's ... (dots) argument, written *args (example also from docs.python.org):

def concat(*args, sep="/"):
    return sep.join(args)

concat("a", "b", "c")
'a/b/c'

iteration

Iterative loops can be written with the for, while and break statements.

Defining a for loop is similar to defining a new function. The header ends with a colon and the body is indented. The function range(n) takes an integer n and produces the sequence of integers from 0 to n - 1. for loops are not just for counters: they can iterate through many types of objects, such as strings, lists and dictionaries.

for i in range(3):
  print('Baby shark, doo doo doo doo doo doo!')
Baby shark, doo doo doo doo doo doo!
Baby shark, doo doo doo doo doo doo!
Baby shark, doo doo doo doo doo doo!
print('Baby shark!')
Baby shark!
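
As a sketch of the other cases mentioned above, the following loops iterate over a string, a list and a dictionary, and then use while with break. All values are made up for illustration:

```python
# Iterate over the characters of a string
for letter in "Rav":
    print(letter)

# Iterate over a list, with enumerate() supplying a counter
for i, team in enumerate(["Ravens", "Orioles"]):
    print(i, team)

# Iterate over the key/value pairs of a dictionary
for key, value in {"a": 1, "b": 2}.items():
    print(key, value)

# A while loop with break: stop at the first integer whose square exceeds 20
n = 0
while True:
    n += 1
    if n ** 2 > 20:
        break
print(n)
```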

The function len() returns the number of items in an object, such as the number of characters in a string:

x = 'Baby shark!'
len(x)
11

methods for each type of object (dot notation)

For strings, lists and dictionaries, there is a set of methods you can use to manipulate the objects. In general, methods are called with dot notation: the name of the object, followed by a dot (or period), followed by the name of the method.

x = "Hello Baltimore!"
x.split()
['Hello', 'Baltimore!']
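
A few more methods in dot notation, one group each for strings, lists and dictionaries. This is a sketch with arbitrary example values:

```python
x = "Hello Baltimore!"
print(x.upper())                    # string method: returns a new string
print(x.replace("Hello", "Bye"))    # strings are immutable, so x is unchanged

teams = ["Ravens"]
teams.append("Orioles")             # list method: modifies the list in place
print(teams)

counts = {"a": 1, "b": 2}
print(list(counts.keys()))          # dictionary method: the keys, in insertion order
print(counts.get("c", 0))           # get() returns a default when the key is missing
```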

Data structures

We have already seen lists. Python has other data structures built in, such as dictionaries, which map keys to values:

# named my_dict rather than dict, so the built-in dict type is not shadowed
my_dict = {"a" : 1, "b" : 2}
my_dict['a']
1
my_dict['b']
2

More about data structures can be found in the Python docs.
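
As a brief sketch of how the main built-in data structures differ (all names and values here are made up for illustration):

```python
# list: ordered and mutable
crabs = [3, 1, 2]
crabs.sort()

# tuple: ordered but immutable
point = (39.29, -76.61)

# set: unordered collection of unique values
carriers = {"UA", "AA", "UA"}

# dictionary: maps keys to values; new keys can be added by assignment
delays = {"UA": -4.0}
delays["AA"] = -2.0

print(crabs, point[0], len(carriers), delays["AA"])
```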

reticulate

Python engine within R Markdown

The reticulate package includes a Python engine for R Markdown with the following features:

  1. Run Python chunks in a single Python session embedded within your R session (shared variables/state between Python chunks).

  2. Printing of Python output, including graphical output from matplotlib.

  3. Access to objects created within Python chunks from R using the py object (e.g. py$x would access an x variable created within Python from R).

  4. Access to objects created within R chunks from Python using the r object (e.g. r.x would access an x variable created within R from Python).

Built-in conversion for many Python object types is provided, including NumPy arrays and Pandas data frames.

From Python to R

As an example, you can use Pandas to read and manipulate data then easily plot the Pandas data frame using ggplot2:

Let’s first create a flights.csv dataset in R:

if(!file.exists(here("data", "flights.csv"))){
  readr::write_csv(nycflights13::flights, 
                   file = here("data", "flights.csv"))
}

Use Python to read in the file and do some data wrangling

import pandas
flights_path = "/Users/shicks/Documents/github/teaching/jhustatcomputing2021/data/flights.csv"
flights = pandas.read_csv(flights_path)
flights = flights[flights['dest'] == "ORD"]
flights = flights[['carrier', 'dep_delay', 'arr_delay']]
flights = flights.dropna()
flights
       carrier  dep_delay  arr_delay
5           UA       -4.0       12.0
9           AA       -2.0        8.0
25          MQ        8.0       32.0
38          AA       -1.0       14.0
57          AA       -4.0        4.0
...        ...        ...        ...
336645      AA      -12.0      -37.0
336669      UA       -7.0      -13.0
336675      MQ       -7.0      -11.0
336696      B6       -5.0      -23.0
336709      AA      -13.0      -38.0

[16566 rows x 3 columns]
head(py$flights)
   carrier dep_delay arr_delay
5       UA        -4        12
9       AA        -2         8
25      MQ         8        32
38      AA        -1        14
57      AA        -4         4
70      UA         9        20
py$flights_path 
[1] "/Users/shicks/Documents/github/teaching/jhustatcomputing2021/data/flights.csv"
class(py$flights)
[1] "data.frame"
class(py$flights_path)
[1] "character"

Next, we can use R to visualize the Pandas DataFrame. The data frame is converted to an R data.frame when it is accessed through the py object.

ggplot(py$flights, aes(x = carrier, y = arr_delay)) + 
  geom_point() + 
  geom_jitter()

Note that the reticulate Python engine is enabled by default within R Markdown whenever reticulate is installed.

From R to Python

Use R to read and manipulate data

library(tidyverse)
flights <- read_csv(here("data","flights.csv")) %>%
  filter(dest == "ORD") %>%
  select(carrier, dep_delay, arr_delay) %>%
  na.omit()

flights
# A tibble: 16,566 × 3
   carrier dep_delay arr_delay
   <chr>       <dbl>     <dbl>
 1 UA             -4        12
 2 AA             -2         8
 3 MQ              8        32
 4 AA             -1        14
 5 AA             -4         4
 6 UA              9        20
 7 UA              2        21
 8 AA             -6       -12
 9 MQ             39        49
10 B6             -2        15
# … with 16,556 more rows

Use Python to print R dataframe

If you recall, we can access objects created within R chunks from Python using the r object (e.g. r.x would access an x variable created within R from Python). We can then ask for the first ten rows using the head() method of the Pandas DataFrame.

r.flights.head(10)
  carrier  dep_delay  arr_delay
0      UA       -4.0       12.0
1      AA       -2.0        8.0
2      MQ        8.0       32.0
3      AA       -1.0       14.0
4      AA       -4.0        4.0
5      UA        9.0       20.0
6      UA        2.0       21.0
7      AA       -6.0      -12.0
8      MQ       39.0       49.0
9      B6       -2.0       15.0

import python modules

You can use the import() function to import any Python module and call it from R. For example, this code imports the Python os module and calls its listdir() function:

os <- import("os")
os$listdir(".")
[1] "python-for-r-users_files" "python-for-r-users.Rmd"  
[3] "python-for-r-users.html" 

Functions and other data within Python modules and classes can be accessed via the $ operator (analogous to the way you would interact with an R list, environment, or reference class).
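
For comparison, here is a sketch of the same call written directly in Python, where members of a module are reached with a dot rather than with $. The variable name entries is arbitrary:

```python
import os

# R (via reticulate):  os <- import("os"); os$listdir(".")
# Python: the module is imported directly and its members are reached with a dot
entries = os.listdir(".")

print(type(entries).__name__)
print(sorted(entries)[:3])
```

In Python the result is a list of strings; the R output above shows it arriving in R as a character vector.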

Imported Python modules support code completion and inline help:

Figure 2: Using reticulate tab completion

[Source: Rstudio]

Similarly, we can import the pandas library:

pd <- import('pandas')
test <- pd$read_csv(here("data","flights.csv"))
head(test)
  year month day dep_time sched_dep_time dep_delay arr_time
1 2013     1   1      517            515         2      830
2 2013     1   1      533            529         4      850
3 2013     1   1      542            540         2      923
4 2013     1   1      544            545        -1     1004
5 2013     1   1      554            600        -6      812
6 2013     1   1      554            558        -4      740
  sched_arr_time arr_delay carrier flight tailnum origin dest
1            819        11      UA   1545  N14228    EWR  IAH
2            830        20      UA   1714  N24211    LGA  IAH
3            850        33      AA   1141  N619AA    JFK  MIA
4           1022       -18      B6    725  N804JB    JFK  BQN
5            837       -25      DL    461  N668DN    LGA  ATL
6            728        12      UA   1696  N39463    EWR  ORD
  air_time distance hour minute            time_hour
1      227     1400    5     15 2013-01-01T10:00:00Z
2      227     1416    5     29 2013-01-01T10:00:00Z
3      160     1089    5     40 2013-01-01T10:00:00Z
4      183     1576    5     45 2013-01-01T10:00:00Z
5      116      762    6      0 2013-01-01T11:00:00Z
6      150      719    5     58 2013-01-01T10:00:00Z
class(test)
[1] "data.frame"

or the scikit-learn python library:

skl_lr <- import("sklearn.linear_model")

Calling python scripts

source_python("secret_functions.py")
subject_1 <- read_subject("secret_data.csv")
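
The lecture does not show secret_functions.py. Purely as a hypothetical sketch of what such a script could contain, the following defines a read_subject() function with the standard csv module; the file layout and the return type are assumptions. After source_python(), a function defined this way in the script becomes callable from R.

```python
# secret_functions.py: hypothetical contents, not the lecture's actual file
import csv


def read_subject(path):
    """Read a CSV file and return its rows as a list of dictionaries."""
    with open(path, newline="") as handle:
        # DictReader uses the first row of the file as the column names
        return [dict(row) for row in csv.DictReader(handle)]
```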

Calling the python repl

If you want to work with Python interactively you can call the repl_python() function, which provides a Python REPL embedded within your R session.

Objects created within the Python REPL can be accessed from R using the py object exported from reticulate. For example:

Figure 3: Using the repl_python() function

[Source: Rstudio]

That is, objects created in the Python REPL persist after you exit it. For example, typing x = 4 in the REPL makes py$x equal to 4 in R after you exit.

Enter exit within the Python REPL to return to the R prompt.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Hicks (2021, Oct. 14). Statistical Computing: Python for R users. Retrieved from https://stephaniehicks.com/jhustatcomputing2021/posts/2021-10-14-python-for-r-users/

BibTeX citation

@misc{hicks2021python,
  author = {Hicks, Stephanie},
  title = {Statistical Computing: Python for R users},
  url = {https://stephaniehicks.com/jhustatcomputing2021/posts/2021-10-14-python-for-r-users/},
  year = {2021}
}