PandA Python Training 1: Intro to Python

Andrew Lau

In [2]:
# this just gets the notebook to print all the output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Contents

  • Introduction

    • Notebooks and IDE
    • What is Python?
    • Python vs R vs SAS
  • Python Crash Course

    • Syntax
    • Control structures
    • Data Types/Structures
    • Functions
    • Object oriented programming
    • Modules

Introduction

Notebooks and IDE

Before we begin, let's talk a bit about notebooks and IDEs.

Jupyter Notebooks

  • What you are looking at in your web browser (hopefully, if you got it working, otherwise a PDF) is a Jupyter Notebook
  • It was designed with data science and reproducible research in mind
  • It acts as a data scientist's notebook, allowing note taking, documentation and mathematical notation as well as, most importantly, running code and storing results, all in the one document
  • These notes are live and interactive! You can run and edit the code during this session!

IDEs

  • An IDE (Integrated Development Environment) is a 'code editor'. R Studio is an IDE.
  • Some good IDEs:
    • Spyder (feels like R Studio, made for Data Science, probably easiest to get used to). If you have installed Anaconda, you already have it. Hit your Windows key and type 'spyder' and you should see it.
    • Visual Studio
    • Sublime Text
    • Eclipse

What is Python

  • Unlike R or SAS, Python is a general purpose programming language that can do basically anything
  • Python's most popular applications are:
    • Web development (powering backends using the Django/Flask frameworks). Websites that use some Python include Google, YouTube, Quora, Dropbox, Yahoo!, Reddit, Instagram and Spotify.
    • Data science and machine learning. Heavy users of machine learning such as Google and Amazon are likely to have Python driving a lot of it.
    • Scripting. Python's readability and speed of development lend themselves well to quick scripts that automate tasks
    • Education. Python is probably the most popular language for teaching computer science at unis.
  • Python can do virtually anything:
    • games (not the best for making games though)
    • breaching security networks
    • developing desktop applications (not the best for this, however Spyder (a desktop app) is written in Python)
    • robotics
    • raspberry pi
  • Python's design was heavily influenced by ABC, a language created for teaching programming. This is part of why the Python language has been built to value simplicity and readability.
    • Pythonic: code which is simple, clear, concise, maintainable and uses Python the way it was intended.
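
As a small illustration of what 'Pythonic' can mean in practice, here is a sketch comparing two ways of squaring the even numbers in a list. The variable names (numbers, squares_loop, squares_pythonic) are made up for this example, and list comprehensions are covered properly later in these notes.

```python
numbers = [1, 2, 3, 4, 5, 6]

# works, but is more verbose than it needs to be: looping over index positions
squares_loop = []
for i in range(len(numbers)):
    if numbers[i] % 2 == 0:
        squares_loop.append(numbers[i] ** 2)

# arguably more Pythonic: loop over the items directly with a list comprehension
squares_pythonic = [n ** 2 for n in numbers if n % 2 == 0]

print(squares_loop)
print(squares_pythonic)
```

Both versions build the same list; the second simply says what it does in one readable line.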

Python vs R vs SAS

Python:

  • General purpose language developed by computer scientists/software engineers
  • More popular with computer scientists and software engineers. More popular on Kaggle.
  • Has better developed deep learning packages (e.g. tensorflow, keras, pytorch)
  • Much easier to integrate into systems (e.g. live machine learning models, integrating a model into a website backend)
  • Because Python is a general purpose programming language, it is easier to build software and systems that have machine learning or AI in them
  • Lighter footprint, advantageous for cloud computing
  • Data science modules are far more unified (a more cohesive system of modules)
  • The syntax, naming conventions and how things work are far more consistent, as there are a few key packages that basically everyone uses and contributes to (compared to the plethora of R packages maintained by a few individuals)

R:

  • Built from ground up for statistical analysis
  • Better for performing traditional statistical analysis (e.g. significance tests, ANOVA, statistical inference, linear models...)
  • More popular with those from stats/maths backgrounds
  • Has niche stats libraries/packages not found in Python
  • Can be better for plotting / EDA (e.g. ggplot2)

SAS

  • Closed source, with far, far fewer packages available
  • Is not really used at big tech companies like Amazon, Google, Facebook or by "high end" data scientists
  • Much harder to find help for online. Python and R have far more users and a strong culture of contributing to open source projects, so it is far easier to find help for them online
  • ML and AI capabilities pale in comparison to Python/R
  • Graphs and plotting in SAS are not great
  • Expensive

Overview of the language

  • We will cover the basics of the Python language today
  • Python standard libraries

Syntax

  • Python uses whitespace instead of brackets to delimit blocks (indicate the start and end of if statements, functions etc)
  • This was a conscious design choice to encourage clean code by making indentation mandatory and ditching a lot of brackets.
In [3]:
# indentation and new lines delimit the block of code that belongs to
# a loop, function or if statement
for i in range(5):
    print(i, "hello")
    
# function and if statement
def double_num(x):
    if x > 5:
        return 2 * x
    return x / 2

# hashes are used to comment. press ctrl + '/' to (un)comment
print("double_num(2):", double_num(2))  # you can also inline comment
print("double_num(6):", double_num(6))
"""
Use three quotation marks to do block commenting.
"""
0 hello
1 hello
2 hello
3 hello
4 hello
double_num(2): 1.0
double_num(6): 12
Out[3]:
'\nUse three quotation marks to do block commenting.\n'

Control Structures

if statements

In [4]:
x = 5
if x > 2:
    print("x is greater than 2")
    
if x == 5:  # note that unlike SAS, the equality operator "==" is different to the assignment operator "="
    print("x is 5")
x is greater than 2
x is 5
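
The cell above only uses plain if statements. Python also has elif and else for handling alternative cases. A minimal sketch (the function name describe_number and its thresholds are made up for illustration):

```python
def describe_number(x):
    # conditions are checked from top to bottom; only the first matching branch runs
    if x > 5:
        return "big"
    elif x > 2:
        return "medium"
    else:
        return "small"

for value in [1, 3, 10]:
    print(value, "is", describe_number(value))
```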

for loops

In [5]:
# in Python we can loop over many things (we say they are 'iterable')
for i in range(10):  # range(x) lets us loop over the integers 0 to x - 1
    print(i)
    
my_list = ['t', 'o', 'n', 'y']

for letter in my_list:  # we can loop over lists
    print(letter)
0
1
2
3
4
5
6
7
8
9
t
o
n
y

while loops

In [6]:
x = 0
while x < 10:
    print(x)
    x += 1
0
1
2
3
4
5
6
7
8
9

Data Types

The basic data types in Python are:

In [7]:
# integers
2
3
4
Out[7]:
2
Out[7]:
3
Out[7]:
4
In [8]:
# floats - you can think of these as decimals, (floats are scientific notation in binary to
# allow efficient storage of decimals and very big/small numbers)
2.234
3.2309
Out[8]:
2.234
Out[8]:
3.2309
In [9]:
# strings - series of characters
"hello"
'bye'  # you can use single or double quotation marks to delimit strings
# if you want single (double) quotation marks in your string, you can use the double (single) 
# quotation marks to delimit
"single 'quotations'"
'double "quotations"'
# there are many useful string operations
[x for x in dir(str) if not x.startswith('_')]  # this is a list comprehension, more on this later
Out[9]:
'hello'
Out[9]:
'bye'
Out[9]:
"single 'quotations'"
Out[9]:
'double "quotations"'
Out[9]:
['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
In [10]:
bad = 'bad'
bad.capitalize()
bad.upper()
Out[10]:
'Bad'
Out[10]:
'BAD'
In [11]:
# booleans
True
False
# data types can be 'cast' into other data types. The boolean True has the value 1 and False 0:
True + 1
50 - False
Out[11]:
True
Out[11]:
False
Out[11]:
2
Out[11]:
50
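
The cell above relies on Python quietly treating booleans as numbers. You can also cast explicitly with the built-in int(), float(), str() and bool() functions. A short sketch (the variable names are made up for illustration):

```python
as_int = int("42")        # string -> integer
as_float = float("3.5")   # string -> float
as_str = str(10) + "%"    # integer -> string, so it can be joined to another string
truncated = int(3.99)     # float -> integer drops the decimal part (it does not round)
as_bool = bool(0)         # zero casts to False

print(as_int, as_float, as_str, truncated, as_bool)
```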
In [12]:
# zero is considered False, and other numbers (and non-empty strings) True
def is_true(number):
    if number:
        print(number, "is True")
    else:
        print(number, "is False")

for number in [-100, -0.5, 0, 0.5, 100, 'Jessie']:
    is_true(number)
-100 is True
-0.5 is True
0 is False
0.5 is True
100 is True
Jessie is True
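
The same idea extends beyond numbers: empty strings and empty containers count as False in an if statement, while non-empty ones count as True. A small sketch (values and results are names made up for this example):

```python
values = ["", "text", [], [0], None]

# bool() shows how each value would be treated in an if statement
results = [bool(value) for value in values]

for value, result in zip(values, results):
    print(repr(value), "->", result)
```

Note that [0] is True because the list is not empty, even though the item inside it is zero.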

Data Structures

The standard data structures (things that can hold data) in Python are:

  • lists
  • tuples
  • dictionaries
  • sets

Lists

Perhaps the most versatile and commonly used data structure in Python

In [13]:
# lists can hold anything! from basic data types like integers, floats, strings...
[1, 2, 3, 'one', 'two', 'three', 1.0, 2.0, 3.0]

# to other data structures...
[['this is a str in a list'], {'this is a str in a set'}, {'apple':'red', 'strawberry':'red'}]

# to functions...
def add_2(x):
    return x + 2

def add_3(x):
    return x + 3

[add_2, add_3]

# and any other 'object' <- more on this later!
Out[13]:
[1, 2, 3, 'one', 'two', 'three', 1.0, 2.0, 3.0]
Out[13]:
[['this is a str in a list'],
 {'this is a str in a set'},
 {'apple': 'red', 'strawberry': 'red'}]
Out[13]:
[<function __main__.add_2(x)>, <function __main__.add_3(x)>]
In [14]:
# accessing elements of a list
my_list = [1, 2, 3, 4, 5]
my_list[0]  # note that Python (and a lot of other languages) start their indexing at 0 rather than 1!
my_list[3]
Out[14]:
1
Out[14]:
4
In [15]:
# 'slicing' - accessing subsets of a list
my_list[0:2]  # the index at the right of the colon operator is not included in the slice
my_list[2:4]
my_list[:2]
my_list[2:]
Out[15]:
[1, 2]
Out[15]:
[3, 4]
Out[15]:
[1, 2]
Out[15]:
[3, 4, 5]
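
Slicing also accepts negative indices and an optional step. A brief sketch using the same list (the variable names on the left are made up for illustration):

```python
my_list = [1, 2, 3, 4, 5]

last_item = my_list[-1]        # negative indices count back from the end
last_two = my_list[-2:]        # the last two elements
every_second = my_list[::2]    # a third number sets the step size
reversed_list = my_list[::-1]  # a step of -1 walks the list backwards

print(last_item, last_two, every_second, reversed_list)
```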
In [16]:
# generating lists
# one of the most common ways to build a list is to start with an empty one and build it up with a loop
x = []  # initialise empty list
for i in range(5):
    x.append(i)
x
Out[16]:
[0, 1, 2, 3, 4]
In [17]:
# list comprehensions are a more compact way to build a list
y = [i for i in range(5)]
In [18]:
# the dir() function gives you a list of an object's attributes/methods <- more on this later
dir(list)
Out[18]:
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']
In [19]:
# use a 'list comprehension' to find all the list methods
# [DO ACTION TO SOMETHING for THAT SOMETHING in ITERABLE if CONDITION]
['list.' + something for something in dir(list) if not something.startswith('_')]  
Out[19]:
['list.append',
 'list.clear',
 'list.copy',
 'list.count',
 'list.extend',
 'list.index',
 'list.insert',
 'list.pop',
 'list.remove',
 'list.reverse',
 'list.sort']
In [20]:
[x for x in dir(str) if not x.startswith('_')] 
Out[20]:
['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
In [21]:
my_string = 'pandas'
my_string
my_string = my_string.upper()
my_string
Out[21]:
'pandas'
Out[21]:
'PANDAS'
In [22]:
# useful list methods and the help function
help(list.extend)
help(list.append)
help(list.pop)
help(list.sort)
Help on method_descriptor:

extend(self, iterable, /)
    Extend list by appending elements from the iterable.

Help on method_descriptor:

append(self, object, /)
    Append object to the end of the list.

Help on method_descriptor:

pop(self, index=-1, /)
    Remove and return item at index (default last).
    
    Raises IndexError if list is empty or index is out of range.

Help on method_descriptor:

sort(self, /, *, key=None, reverse=False)
    Stable sort *IN PLACE*.
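
To see the four methods from the help output in action, here is a minimal sketch (the list name scores and its values are made up for illustration):

```python
scores = [3, 1]

scores.append(2)         # add a single item to the end -> [3, 1, 2]
scores.extend([5, 4])    # add every item from another iterable -> [3, 1, 2, 5, 4]
removed = scores.pop()   # remove and return the last item (4)
scores.sort()            # sorts the list in place and returns None

print(removed)
print(scores)
```

Because sort() works in place, writing scores = scores.sort() would replace the list with None, which is a common slip.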

Tuples

In [23]:
# like lists, tuples can hold any object
my_tup = ('p', 'a', 'n', 'd', 'a', 123, 2.3, dir())
# and they can be sliced
my_tup[:5]
# but they are 'immutable', meaning they cannot be altered once they have been defined
# useful if you need to store something that you DO NOT want changed after creation
Out[23]:
('p', 'a', 'n', 'd', 'a')
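
A quick sketch of what 'immutable' means in practice. It uses try/except, which is not covered in these notes, purely to catch the error so the cell keeps running; new_tup is a name made up for this example.

```python
my_tup = ('p', 'a', 'n', 'd', 'a')

try:
    my_tup[0] = 'P'  # tuples do not support item assignment
except TypeError as error:
    print("TypeError:", error)

# to 'change' a tuple, build a new one instead
new_tup = ('P',) + my_tup[1:]
print(new_tup)
```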

Sets

As in maths, a set in Python is an unordered collection with no duplicate elements. Uses of sets:

  • checking for membership of a set
    • you can do this with a list, but it is less efficient
  • set unions, differences and intersections
  • when you need a data structure that holds only unique values
In [24]:
[x for x in dir(set) if not x.startswith('_')]
Out[24]:
['add',
 'clear',
 'copy',
 'difference',
 'difference_update',
 'discard',
 'intersection',
 'intersection_update',
 'isdisjoint',
 'issubset',
 'issuperset',
 'pop',
 'remove',
 'symmetric_difference',
 'symmetric_difference_update',
 'union',
 'update']
In [25]:
stuff_powerlifters_like = {"bench", "squat"}
stuff_weightlifters_like = {"clean and jerk", "snatch", "squat"}
stuff_powerlifters_like
stuff_weightlifters_like
stuff_powerlifters_like.add("deadlift")
stuff_powerlifters_like.add("bench")  # adding an existing element does nothing
# can perform mathematical set operations
stuff_powerlifters_like.intersection(stuff_weightlifters_like)
stuff_powerlifters_like.intersection(stuff_powerlifters_like)
Out[25]:
{'bench', 'squat'}
Out[25]:
{'clean and jerk', 'snatch', 'squat'}
Out[25]:
{'squat'}
Out[25]:
{'bench', 'deadlift', 'squat'}
In [26]:
'squat' in stuff_powerlifters_like
Out[26]:
True
In [27]:
'foam rolling' in stuff_powerlifters_like
Out[27]:
False
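
The other uses listed at the start of this section (unions, differences and keeping only unique values) can be sketched as follows. The set names here are made up for illustration.

```python
powerlifts = {"bench", "squat", "deadlift"}
olympic_lifts = {"clean and jerk", "snatch", "squat"}

all_lifts = powerlifts.union(olympic_lifts)             # items in either set
only_powerlifts = powerlifts.difference(olympic_lifts)  # items in the first set but not the second

# converting a list to a set drops duplicates (the original order is not kept)
unique_letters = set(['p', 'a', 'n', 'd', 'a'])

print(sorted(all_lifts))
print(sorted(only_powerlifts))
print(sorted(unique_letters))
```

The results are passed through sorted() before printing only because sets have no guaranteed order.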

Dictionaries

A mapping from a unique key to a value (you can think of it like a VLOOKUP in Excel, except the lookup key has to be unique)

In [28]:
ML_models = {'Linear Regression Models': 'Regressors',
             'Decision Trees': 'Regressors or Classifiers',
             'SVMs': 'Regressors or Classifiers',
             'Naive Bayes Models': 'Classifiers',
             'K Nearest Neighbors Models': 'Regressors or Classifiers'}
ML_models
Out[28]:
{'Linear Regression Models': 'Regressors',
 'Decision Trees': 'Regressors or Classifiers',
 'SVMs': 'Regressors or Classifiers',
 'Naive Bayes Models': 'Classifiers',
 'K Nearest Neighbors Models': 'Regressors or Classifiers'}
In [29]:
# can access dictionary values using the index notation
ML_models['Naive Bayes Models']
ML_models['K Nearest Neighbors Models']

for model in ML_models:  # can iterate over dictionaries
    print(model, "are", ML_models[model])
Out[29]:
'Classifiers'
Out[29]:
'Regressors or Classifiers'
Linear Regression Models are Regressors
Decision Trees are Regressors or Classifiers
SVMs are Regressors or Classifiers
Naive Bayes Models are Classifiers
K Nearest Neighbors Models are Regressors or Classifiers
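
Dictionaries can also be updated after they are created. A short sketch of adding, overwriting and safely looking up entries (the colours dictionary is made up for illustration):

```python
colours = {'apple': 'red', 'banana': 'yellow'}

colours['lime'] = 'green'    # assigning to a new key adds an entry
colours['apple'] = 'green'   # assigning to an existing key overwrites its value

# .get() returns a default instead of raising a KeyError when the key is missing
missing = colours.get('grape', 'unknown')
print(missing)

# .items() yields (key, value) pairs, which avoids a second lookup inside the loop
for fruit, colour in colours.items():
    print(fruit, "is", colour)
```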

Functions

  • Functions let you create reusable code
  • A function definition begins with def func_name(arg_1, arg_2, ...):
In [30]:
# an example of a recursive function
def fibonacci(x=5):  # you can set a default parameter value
    """
    It is good practice to add documentation at the beginning of your function.
    When someone calls help() on your function, they will see the documentation.
    """
    if x <= 1:
        return 1  # the function call ends when a return statement is reached, so an else statement is not needed
    return fibonacci(x - 2) + fibonacci(x - 1)  # if no return statement is reached, None is returned
    
# list comprehension
[fibonacci(x) for x in range(10)]

# default parameter value
fibonacci()

help(fibonacci) 
Out[30]:
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
Out[30]:
8
Help on function fibonacci in module __main__:

fibonacci(x=5)
    It is good practice to add documentation at the beginning of your function.
    When someone calls help() on your function, they will see the documentation.
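
Two details from the comments above are worth a separate sketch: arguments can be passed by position or by name, and a function without a return statement gives back None. The functions greet and no_return are made up for illustration.

```python
def greet(name, greeting="Hello"):
    """Return a greeting for name, using greeting as the opening word."""
    return greeting + ", " + name + "!"

def no_return(x):
    x * 2  # the result is computed but never returned

print(greet("Andrew"))                      # positional argument, default greeting
print(greet("Andrew", greeting="Welcome"))  # keyword argument overrides the default
print(no_return(3))                         # no return statement, so this is None
```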

Object Oriented Programming

Something to be aware of in Python is that everything is an object. The full details of object oriented programming are beyond the scope of this training, but there are some things to be aware of:

  • objects have attributes and methods, accessed with a dot - '.'
    • object.attribute - attributes are data about that object, stored in the object
    • object.method() - methods are functions associated with an object and as such need a set of brackets to call them object.method(). Like all functions, these methods may take arguments.
  • classes are types of objects. A class can be thought of as the prototype/sketch of a type of object. All objects in that class will 'inherit' attributes/methods from that class.
  • objects are 'instantiated' (brought to life, created) by calling the 'constructor' of a class, which creates an instance of that class

This is all probably a bit confusing, so let's go through some examples.

In [31]:
# everything in Python is an object. For example, strings are objects
my_string = "python"  # this instantiates an instance/object of the class string
my_string_2 = "pandas" # this instantiates another instance/object of the class string

# the below are string attributes and methods. all strings have these.
[x for x in dir(my_string) if not x.startswith("_")]
[x for x in dir(my_string_2) if not x.startswith("_")]
# both the above strings have inherited the same attributes/methods from the overarching string class.
Out[31]:
['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
Out[31]:
['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
In [32]:
# we can access string methods like this
my_string.upper()  # returns the string with everything made into upper case
my_string_2.isdigit()  # returns whether every character in the string is a digit
Out[32]:
'PYTHON'
Out[32]:
False
In [33]:
# lists are objects as well
higher_level_languages = ['Python', 'Java Script']  # this creates an instance/object of the class list
lower_level_languages = ['C', 'x86 Assembly']
# all lists have the below attributes/methods, including the ones we just made
[x for x in dir(higher_level_languages) if not x.startswith("_")]
Out[33]:
['append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']
In [34]:
higher_level_languages.extend(lower_level_languages)
higher_level_languages
Out[34]:
['Python', 'Java Script', 'C', 'x86 Assembly']

You can also create your own classes!
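
As a rough sketch of what defining a class looks like, the toy Lifter class below (entirely made up for illustration) has two attributes and two methods:

```python
class Lifter:
    """A toy class, made up for illustration."""

    def __init__(self, name, best_squat):
        # __init__ runs when an object is created and stores its attributes
        self.name = name
        self.best_squat = best_squat

    def describe(self):
        # methods receive the object itself as their first argument, 'self'
        return self.name + " squats " + str(self.best_squat) + "kg"

    def add_weight(self, kilos):
        # methods can update the object's attributes
        self.best_squat += kilos

lifter = Lifter("Jessie", 100)  # calling the class creates an instance
lifter.add_weight(5)
print(lifter.name)        # attribute access: no brackets
print(lifter.describe())  # method call: brackets needed
```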

We will see next week that machine learning models are objects, and that a basic understanding of how objects work will be needed to use them.

Modules

  • Modules are Python scripts
  • You import modules to access pre-written functions, variables and classes
  • Similar to R's library() function
  • For data science, you will need:
    • import pandas as pd
    • import numpy as np
    • import sklearn
  • You can write your own modules (a module is just a .py file) and then import them into your code with import your_module, provided the file is in the same directory
In [35]:
import pandas as pd  # import a module under an alias
from matplotlib import pyplot as plt # you can import specific things from a module with this syntax

# use a dot '.' to access things from that module, in this case the DataFrame class
my_df = pd.DataFrame({'x':[1, 2, 3, 4, 5], 'y':[2, 4, 6, 8, 10]})
# the DataFrame class has a plot method.
my_df.plot()
plt.show()

from math import sqrt  # import a specific thing (sqrt function) from a module. you can now call sqrt().
import math  # import the whole module, you will need to specify what you are calling with math.THING.

print("sqrt(4) is", sqrt(4))
print("math.pi is", math.pi)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x168c35eb4a8>
<Figure size 640x480 with 1 Axes>
sqrt(4) is 2.0
math.pi is 3.141592653589793
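
Writing your own module can be sketched as below. Normally you would just save a .py file next to your notebook or script; here the file is written to a temporary folder so the example is self-contained. The module name my_helpers and its function triple are made up for illustration.

```python
import os
import sys
import tempfile

# write a tiny module to a temporary folder
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_helpers.py"), "w") as f:
    f.write("def triple(x):\n    return 3 * x\n")

# Python searches the folders listed in sys.path when importing, so add ours
sys.path.insert(0, module_dir)

import my_helpers  # the module name is the file name without .py

print(my_helpers.triple(4))
```

If the .py file sits in the same directory as your code, the sys.path step is not needed.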