Graphs are data structures composed of a set of objects (nodes) and pairwise relationships between them (edges). Notably, edges can have properties, like a direction or a weight.
Graphs can be represented as:
A common type of graph in computer science is the grid, in which nodes are laid out in a grid and each node is connected to the nodes directly above, below, left and right of it.
A tree is a graph in which there is only one path between every pair of nodes. Some concepts related to trees are: root, the (only) node on level 1; parent, the connected node in the level above; child, a connected node in the level below; and leaf, a node with no children. Importantly, a tree has only one root. A very useful type of tree is the binary tree, in which every node has at most two children.
Often trees are represented using classes. Specifically, we would have an object Node like:
class Node:
    def __init__(self, val=None):
        self.val = val
        self.left = None
        self.right = None
We would keep a reference to the root, and build a tree by successively creating new nodes and assigning them to .left or .right.
(Min-)Heaps are binary trees in which the value of every parent is lower than or equal to that of any of its children. This gives them their most interesting property: the minimum element is always on top. (Similarly, in max-heaps, the maximum stands at the root.) Because of that, they are also called priority queues. A famous problem that can be solved with heaps is computing the running median of a data stream.
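As a sketch of that problem, we can keep two heaps: a max-heap (with negated values) holding the smaller half of the stream, and a min-heap holding the larger half. The function name running_median is illustrative:

```python
import heapq

def running_median(stream):
    lo, hi = [], []  # lo: max-heap via negation; hi: min-heap
    for x in stream:
        # push onto lo, then move its largest element to hi,
        # so every element of lo is <= every element of hi
        heapq.heappush(lo, -x)
        heapq.heappush(hi, -heapq.heappop(lo))
        # rebalance so that len(lo) is len(hi) or len(hi) + 1
        if len(hi) > len(lo):
            heapq.heappush(lo, -heapq.heappop(hi))
        yield -lo[0] if len(lo) > len(hi) else (-lo[0] + hi[0]) / 2

print(list(running_median([5, 2, 8, 1])))  # [5, 3.5, 5, 3.5]
```

Both pushes and pops cost \(O(\log n)\), so each new median costs \(O(\log n)\) instead of the \(O(n \log n)\) of re-sorting.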
In Python, heapq provides an implementation of the heap. Any populated list can be transformed in-place into a heap:
import heapq

x = [5, 123, 8, 3, 2, 6, -5]
heapq.heapify(x)
print(x)
[-5, 2, 5, 3, 123, 6, 8]
The elements have been reordered to represent a heap: each parent node is indexed by \(k\), and its children by \(2k+1\) and \(2k+2\).
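We can verify this parent/child relation directly on the heapified list:

```python
import heapq

x = [5, 123, 8, 3, 2, 6, -5]
heapq.heapify(x)

# every parent at index k is <= its children at 2k+1 and 2k+2
for k in range(len(x)):
    for child in (2 * k + 1, 2 * k + 2):
        if child < len(x):
            assert x[k] <= x[child]
print("heap property holds")
```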
Let’s see some common operations:
heapq.heappush(x, -10)
print(x)
[-10, -5, 5, 2, 123, 6, 8, 3]
heapq.heappop(x)
-10
We can also combine the two operations:
heapq.heappushpop(x, -7) # [-5, 2, 5, 3, 123, 6, 8]
-7
heapq.heapreplace(x, -7) # [-7, 2, 5, 3, 123, 6, 8]
-5
Let’s examine the time complexity of each operation:
Note: Heaps are great for recovering the smallest element, but not the \(k^{th}\) smallest one. BSTs might be more appropriate for that.
Binary search trees (BSTs) are binary trees in which every node meets two properties:
They provide a good balance between insertion and search speeds:
The time complexity of both is \(O(\log n)\) when the tree is balanced; otherwise it is \(O(n)\). (Balanced trees are those whose height is small compared to the number of nodes. Visually, they look full and all branches look similarly long.) As a caveat, no operation takes constant time on a BST.
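A minimal sketch of insertion and search (the names BSTNode, insert and search are illustrative):

```python
class BSTNode:
    def __init__(self, val):
        self.val = val
        self.left = None
        self.right = None

def insert(node, val):
    # walk down until we find an empty spot that preserves the ordering
    if node is None:
        return BSTNode(val)
    if val < node.val:
        node.left = insert(node.left, val)
    else:
        node.right = insert(node.right, val)
    return node

def search(node, val):
    # each comparison discards one whole subtree: O(log n) when balanced
    if node is None:
        return False
    if val == node.val:
        return True
    return search(node.left, val) if val < node.val else search(node.right, val)

root = None
for v in [8, 3, 10, 1, 6]:
    root = insert(root, v)
print(search(root, 6), search(root, 7))  # True False
```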
Tries (from retrieval) are trees that store strings:
Due to its nature, tries excel at two things:
These two properties make them excellent at handling spell checking and autocomplete functions.
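A minimal trie sketch using nested dictionaries (the class name and the "$" end-of-word marker are illustrative choices):

```python
class Trie:
    def __init__(self):
        self.root = {}  # each node is a dict mapping a character to a child node

    def insert(self, word):
        node = self.root
        for char in word:
            node = node.setdefault(char, {})
        node["$"] = True  # end-of-word marker

    def contains(self, word):
        node = self._walk(word)
        return node is not None and "$" in node

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

    def _walk(self, s):
        # follow the characters of s down the tree, or bail out
        node = self.root
        for char in s:
            if char not in node:
                return None
            node = node[char]
        return node

trie = Trie()
trie.insert("car")
trie.insert("card")
print(trie.contains("car"), trie.contains("ca"), trie.starts_with("ca"))
# True False True
```

Both lookups cost \(O(L)\) in the length of the query, independently of how many words are stored.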
Union-finds, also known as Disjoint-sets, store a collection of non-overlapping sets. Internally, sets are represented as directed trees, in which every member points towards the root of the tree. The root is just another member, which we call the representative. Union-finds provide two methods:
Union-finds can be represented as an array, in which every member of the universal set is one element. Members linked to a set take as value the index of another member of the set, often the root. Consequently, members that are the only members of a set take their own value. The same goes for the root. While this eliminates many meaningful pairwise relationships between the elements, it speeds up the two core operations.
Every set has a property, the rank, which approximates its depth. Union is performed by rank: the root with the highest rank is picked as the new root. Find performs an additional step, called path compression, in which every member in the path to the root will be directly bound to the root. This increases the cost of that find operation, but keeps the tree shallow and the paths short, and hence speeds up subsequent find operations.
Here is a Python implementation:
class UnionFind:
    def __init__(self, size):
        self.parent = [i for i in range(size)]
        self.rank = [0] * size

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # Path compression
        return self.parent[x]

    def union(self, x, y):
        rootX = self.find(x)
        rootY = self.find(y)
        if rootX != rootY:
            # union by rank: the root with the highest rank wins;
            # the rank only grows when both ranks are equal
            if self.rank[rootX] > self.rank[rootY]:
                self.parent[rootY] = rootX
            elif self.rank[rootX] < self.rank[rootY]:
                self.parent[rootX] = rootY
            else:
                self.parent[rootY] = rootX
                self.rank[rootX] += 1
Bloom filters are data structures to probabilistically check whether an element is a member of a set. They can be used when false positives are acceptable, but false negatives are not: for instance, when we have a massive data set and want to quickly discard all the elements that are not part of a specific set.
The core structure underlying bloom filters is a bit array, which makes it highly compact in memory. When initialized, all the positions are set to 0. When inserting a given element, we apply multiple hash functions to it, each of which would map the element to a bucket in the array. This would be the element’s “signature”. Then, we would set the value of each of these buckets to 1. To probabilistically verify if an element is in the array, we would compute its signature and examine if all the buckets take a value of 1.
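A minimal sketch of this scheme, deriving several hash functions from hashlib.sha256 by salting; the sizes and the hash choice are illustrative, not tuned:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size  # conceptually a bit array

    def _buckets(self, item):
        # derive num_hashes bucket indices by salting one digest
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for bucket in self._buckets(item):
            self.bits[bucket] = 1

    def might_contain(self, item):
        # False means definitely absent; True means probably present
        return all(self.bits[b] for b in self._buckets(item))

bf = BloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))   # True
print(bf.might_contain("banana"))  # almost certainly False
```

A real implementation would pack the bits and size the array from the expected number of elements and the tolerated false-positive rate.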
A linked list is a DAG in which almost every node has exactly one inbound edge and one outbound edge. The exceptions are the head, a node with no inbound edge, and the tail, a node with no outbound edge. Like arrays, linked lists are ordered. However, they have one key difference: insertions in the middle of an array are expensive (\(O(n)\)), since they require copying all the items of the array, while they are cheap in the linked list (\(O(1)\)), since they only require changing two pointers.
This is an implementation of a linked list:
class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

a = Node("A")
b = Node("B")
c = Node("C")
d = Node("D")

a.next = b
b.next = c
c.next = d
Divide and conquer algorithms work by breaking down a problem into two or more smaller subproblems of the same type. These subproblems are tackled recursively, until the subproblem is simple enough to have a trivial solution. Then, the solutions are combined in a bottom-up fashion. For examples in sorting, see merge sort and quick sort.
The input of interval problems is a list of lists, each of which contains a pair [start_i, end_i]
representing an interval. Typical questions revolve around how much they overlap with each other, or inserting and merging a new element.
Note: There are many corner cases, like no intervals, intervals which end and start at the same time, or intervals that fully contain other intervals. Make sure to think them through.
Note: If the intervals are not sorted, the first step is almost always sorting them, either by start or by end. This usually brings the time complexity to \(O(n \log n)\). In some cases we need to perform two sorts, by start and end separately, before merging them. This produces the sequence of events that are happening.
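As an example of the sort-then-merge pattern, here is a sketch that merges overlapping intervals (the function name is illustrative):

```python
def merge_intervals(intervals):
    # sort by start, then fold each interval into the last merged one
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # overlap: extend the previous interval if needed
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

print(merge_intervals([[1, 3], [8, 10], [2, 6], [15, 18]]))
# [[1, 6], [8, 10], [15, 18]]
```

The sort dominates: \(O(n \log n)\) time, followed by a single \(O(n)\) pass.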
Sorting consists of arranging the elements of an input array according to some criterion. There are multiple ways to sort an input, each offering different trade-offs:
I implement a couple of those below. Their complexities are as follows:
| Algorithm | Time complexity | Space complexity |
|---|---|---|
| Selection | \(O(n^2)\) | \(O(1)\) |
| Bubble | \(O(n^2)\) | \(O(1)\) |
| Merge | \(O(n \log n)\) | \(O(n)\) |
| Quicksort | \(O(n \log n)\) (average) | \(O(\log n)\) |
| Topological | \(O(|V| + |E|)\) | \(O(|V|)\) |
def selection_sort(x):
    for i in range(len(x)):
        curr_max, curr_max_idx = float("-inf"), None
        for j in range(len(x) - i):
            if x[j] > curr_max:
                curr_max = x[j]
                curr_max_idx = j
        x[~i], x[curr_max_idx] = x[curr_max_idx], x[~i]
    return x

selection_sort([3,5,1,8,-1])
def bubble_sort(x):
    for i in range(len(x) - 1):
        for j in range(i + 1, len(x)):
            if x[i] > x[j]:
                x[i], x[j] = x[j], x[i]
    return x

bubble_sort([3,5,1,8,-1])
def merge_sort(x):
    # base case
    if len(x) <= 1:
        return x
    # recursively sort the two halves
    mid = len(x) // 2
    sorted_left = merge_sort(x[:mid])
    sorted_right = merge_sort(x[mid:])
    # merge the two sorted halves
    i = j = 0
    merged = []
    while i < len(sorted_left) and j < len(sorted_right):
        if sorted_left[i] < sorted_right[j]:
            merged.append(sorted_left[i])
            i += 1
        else:
            merged.append(sorted_right[j])
            j += 1
    # slicing forgives out-of-bounds starts, so these lines also
    # work when i >= len(sorted_left) or j >= len(sorted_right)
    merged.extend(sorted_left[i:])
    merged.extend(sorted_right[j:])
    return merged

merge_sort([3,5,1,8,-1])
def quick_sort(x):
    if len(x) <= 1:
        return x
    pivot = x[-1]  # preferable to modifying the input with x.pop()
    lower = []
    higher = []
    # populate lower and higher in one loop,
    # instead of two list comprehensions
    for num in x[:-1]:
        if num <= pivot:
            lower.append(num)
        else:
            higher.append(num)
    return quick_sort(lower) + [pivot] + quick_sort(higher)

quick_sort([3,5,1,8,-1])
Traversing a linked list simply consists of passing through every element. We can do that starting from the head, following the pointer to the next node, and so on.
For instance, this algorithm stores all the values into an array:
class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

def create_list():
    a = Node("A")
    b = Node("B")
    c = Node("C")
    d = Node("D")
    a.next = b
    b.next = c
    c.next = d
    return a

def fetch_values(head):
    curr = head
    values = []
    while curr:
        values.append(curr.val)
        curr = curr.next
    return values

a = create_list()
fetch_values(a)
['A', 'B', 'C', 'D']
Or recursively:
def fetch_values(node):
    if not node: return []
    return [node.val] + fetch_values(node.next)
fetch_values(a)
['A', 'B', 'C', 'D']
def find_value(node, target):
    if not node: return False
    elif node.val == target: return True
    return find_value(node.next, target)

find_value(a, "A")  # True
find_value(b, "A")  # False
Often multiple pointers are needed in order to perform certain operations on the list, like reversing it or deleting an element in the middle.
def reverse_list(head):
    left, curr = None, head
    while curr:
        right = curr.next
        curr.next = left
        left, curr = curr, right
    return left
fetch_values(reverse_list(a))
['D', 'C', 'B', 'A']
a = create_list()
x = Node("X")
y = Node("Y")
x.next = y
def merge(head_1, head_2):
    tail = head_1
    curr_1, curr_2 = head_1.next, head_2
    counter = 0
    while curr_1 and curr_2:
        if counter & 1:
            tail.next = curr_1
            curr_1 = curr_1.next
        else:
            tail.next = curr_2
            curr_2 = curr_2.next
        tail = tail.next
        counter += 1
    if curr_1: tail.next = curr_1
    elif curr_2: tail.next = curr_2
    return head_1
fetch_values(merge(a, x))
['A', 'X', 'B', 'Y', 'C', 'D']
Using two pointers that iterate the list at different speeds can help with multiple problems: finding the middle of a list, detecting cycles, or finding the element at a certain distance from the end. For instance, this is how you would use this technique to find the middle node:
def find_middle(head):
    fast = slow = head
    while fast and fast.next:
        fast = fast.next.next
        slow = slow.next
    return slow.val
a = create_list()
print(find_middle(a))
TODO
TODO
TODO
TODO
A very useful algorithm to know is how to iterate a BST in order, from the smallest to the largest value in the tree. It has a very compact recursive implementation:
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def inorder_traversal(root):
    if root:
        return inorder_traversal(root.left) + [root.val] + inorder_traversal(root.right)
    else:
        return []
However, a non-recursive implementation might be more easily adaptable to other problems:
def inorder_traversal(root):
    output = []
    stack = []
    while root or stack:
        while root:
            stack.append(root)
            root = root.left
        root = stack.pop()
        output.append(root.val)
        root = root.right
    return output
For instance, to find the \(k^{th}\) smallest element:
def find_k_smallest(root, k):
    stack = []
    while root or stack:
        while root:
            stack.append(root)
            root = root.left
        root = stack.pop()
        k -= 1
        if k == 0:
            return root.val
        root = root.right
    return None
# Construct the BST
# 3
# / \
# 1 4
# \
# 2
root = TreeNode(3)
root.left = TreeNode(1)
root.right = TreeNode(4)
root.left.right = TreeNode(2)
find_k_smallest(root, 2)
2
TODO
TODO
The bread and butter of graph problems are traversal algorithms. Let’s study them.
In a depth-first traversal (DFT), given a starting node, we recursively visit each of its neighbors before moving to the next one. In a 2D grid, it would involve picking a direction, and following it until we reach a bound. Then we would pick another direction, and do the same. Essentially, the exploration path looks like a snake.
The data structure underlying DFT is a stack:
Let’s see an explicit implementation of the stack:
graph = {
"a": {"b", "c"},
"b": {"d"},
"c": {"e"},
"d": {"f"},
"e": set(),
"f": set(),
}
def depth_first_print(graph: dict[str, set[str]], seed: str) -> None:
    stack = [seed]
    while stack:
        curr_node = stack.pop()
        print(curr_node)
        stack.extend(graph[curr_node])
depth_first_print(graph, "a")
a
b
d
f
c
e
Alternatively, we can use a recursive approach and an implicit stack:
def depth_first_print(graph: dict[str, set[str]], seed: str) -> None:
    print(seed)
    for neighbor in graph[seed]:
        depth_first_print(graph, neighbor)
depth_first_print(graph, "a")
a
c
e
b
d
f
For a graph with nodes \(V\) and edges \(E\), the time complexity is \(O(|V|+|E|)\) and the space complexity is \(O(|V|)\).
Note: Watch out for cycles. Without explicit handling, we might get stuck in infinite traversals. We can keep track of which nodes we have visited using a set, and exit early as soon as we re-visit one.
Note: Some corner cases are the empty graph, graphs with one or two nodes, graphs with multiple components and graphs with cycles.
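A sketch of the iterative traversal extended with a visited set; it returns the visiting order instead of printing it, to make the result easy to inspect:

```python
graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},  # note the cycle a -> b -> a
    "c": set(),
    "d": set(),
}

def depth_first_traverse(graph, seed):
    stack, visited, order = [seed], set(), []
    while stack:
        curr_node = stack.pop()
        if curr_node in visited:
            continue  # already seen: skip, so cycles cannot trap us
        visited.add(curr_node)
        order.append(curr_node)
        stack.extend(graph[curr_node])
    return order

print(depth_first_traverse(graph, "a"))
# e.g. ['a', 'c', 'b', 'd'] (neighbor order is not deterministic with sets)
```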
In a breadth-first traversal (BFT), given a starting node, we first visit its neighbors, then their neighbors, and so on.
In a 2D grid, it doesn’t favour any direction. Instead, it looks like a water ripple.
The data structure underlying BFT is a queue:
Let’s see an implementation:
graph = {
"a": {"b", "c"},
"b": {"d"},
"c": {"e"},
"d": {"f"},
"e": set(),
"f": set(),
}
from collections import deque
def breadth_first_print(graph: dict[str, set[str]], seed: str) -> None:
    queue = deque([seed])
    while queue:
        curr_node = queue.popleft()
        print(curr_node)
        queue.extend(graph[curr_node])
breadth_first_print(graph, "a")
a
b
c
d
e
f
For a graph with nodes \(V\) and edges \(E\), the time complexity is \(O(|V|+|E|)\) and the space complexity is \(O(|V|)\).
A topological sort (or top sort) is an algorithm whose input is a DAG, and whose output is an array such that every node appears after all the nodes that point at it. (Note that, in the presence of cycles, there is no valid topological sorting.) The algorithm looks like this:
Put together, the time complexity of top sort is \(O(|V| + |E|)\), and the space complexity, \(O(|V|)\).
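A common way to implement it is Kahn's algorithm, which repeatedly removes nodes with no remaining inbound edges; a sketch:

```python
from collections import deque

def topological_sort(graph):
    # count inbound edges for every node
    indegree = {node: 0 for node in graph}
    for neighbors in graph.values():
        for neighbor in neighbors:
            indegree[neighbor] += 1
    # start from the nodes nothing points at
    queue = deque(node for node, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            indegree[neighbor] -= 1
            if indegree[neighbor] == 0:
                queue.append(neighbor)
    # in the presence of a cycle, some nodes never reach indegree 0
    return order if len(order) == len(graph) else None

dag = {"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()}
print(topological_sort(dag))  # one valid order, e.g. ['a', 'c', 'b', 'd']
```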
TODO
TODO
TODO
As with graph-related problems, problems involving trees often require traversals, either depth or breadth first. The same principles and data structures apply. For a tree with \(n\) nodes, the time complexity is \(O(n)\), and the space complexity is \(O(n)\). If the tree is balanced, depth-first traversal has a space complexity of \(O(\log n)\).
The two pointer approach can be used in problems involving searching, comparing and modifying elements in a sequence. A naive approach would involve two loops, and hence take \(O(n^2)\) time. Instead, in the two pointer approach we have two pointers storing indexes, and, by moving them in a coordinated way, we can reduce the complexity down to \(O(n)\). Generally speaking, the two pointers can either move in the same direction, or in opposite directions.
Note: Some two pointer problems require the sequence to be sorted to move the pointers efficiently. For instance, to find the two elements that produce a sum, having a sorted array is key to know which pointer to increase or decrease.
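A sketch of that example: on a sorted array, a sum below the target means the left pointer must move right, and a sum above it means the right pointer must move left (the function name is illustrative):

```python
def two_sum_sorted(nums, target):
    # nums must be sorted for the pointer moves to be justified
    left, right = 0, len(nums) - 1
    while left < right:
        total = nums[left] + nums[right]
        if total == target:
            return nums[left], nums[right]
        elif total < target:
            left += 1   # need a bigger sum
        else:
            right -= 1  # need a smaller sum
    return None

print(two_sum_sorted([1, 3, 4, 6, 9], 10))  # (1, 9)
```

Without sortedness we could not tell which pointer to move, which is why a sort (or a hash map) comes first.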
Note: Sometimes we need to iterate an \(m \times n\) table. While we can use two pointers for that, we can do it with a single pointer \(i \in [0, m \times n)\): row = i // n, col = i % n.
Sliding window problems are a type of same-direction pointer problems. They are optimization problems involving contiguous sequences (substrings, subarrays, etc.), particularly involving cumulative properties. The general approach consists of starting with two pointers, st and ed, at the beginning of the sequence. We keep track of the cumulative property and update it as the window expands or contracts. We keep increasing ed until we find a window that meets our constraint. Then, we try to reduce the window by increasing st, until it doesn't meet the constraint anymore. Then, we go back to increasing ed, and so on.
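As a sketch, consider finding the length of the shortest contiguous subarray whose sum reaches a target; the names st and ed mirror the description above, and the function name is illustrative:

```python
def min_subarray_len(nums, target):
    # smallest window whose sum is >= target (0 if none exists)
    st = 0
    window_sum = 0
    best = float("inf")
    for ed in range(len(nums)):
        window_sum += nums[ed]        # expand the window
        while window_sum >= target:   # constraint met: try to shrink
            best = min(best, ed - st + 1)
            window_sum -= nums[st]
            st += 1
    return 0 if best == float("inf") else best

print(min_subarray_len([2, 3, 1, 2, 4, 3], 7))  # 2  (the window [4, 3])
```

Each element is added and removed at most once, so the whole pass is \(O(n)\).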
Permutation problems can be tackled by recursion.
Backtracking is a family of algorithms characterized by:
Since solutions are built incrementally, backtracking can be visualized as a depth-first search on a tree. At each node, the algorithm checks whether it will lead to a valid solution. If the answer is negative, it backtracks to the parent node and continues the process.
Note: Because of the need to backtrack, a recursive implementation of the DFS is often more convenient, since undoing a step simply involves invoking return. A stack might require a more elaborate implementation.
As we will see in a few examples, the solution to a backtracking problem looks like this:
def solve(candidate):
    if is_solution(candidate):
        output(candidate)
        return
    for child in get_children(candidate):
        if is_valid(child):
            place(child)
            solve(child)
            remove(child)
A famous application of backtracking is solving the eight queens puzzle:
The eight queens puzzle is the problem of placing eight chess queens on an 8×8 chessboard so that no two queens threaten each other; thus, a solution requires that no two queens share the same row, column, or diagonal. There are 92 solutions.
I present here a solution, which mirrors the recipe presented above:
board = []

def under_attack(row, col):
    for row_i, col_i in board:
        if row_i == row or col_i == col:
            return True
        # check the diagonals
        if abs(row_i - row) == abs(col_i - col):
            return True
    return False

def eight_queens(row=0, count=0):
    if row == 8:
        return count + 1
    for col in range(8):
        # check the constraints: the explored square
        # is not under attack
        if not under_attack(row, col):
            board.append((row, col))
            # explore a (so-far) valid path
            count = eight_queens(row + 1, count)
            # backtrack!
            board.pop()
    return count

total_solutions = eight_queens()
print(f"Total solutions: {total_solutions}")
Total solutions: 92
from pprint import pprint

board = [[0, 0, 0, 1, 0, 0, 0, 0, 5],
         [0, 0, 0, 0, 0, 4, 0, 1, 0],
         [1, 0, 3, 0, 0, 8, 4, 2, 7],
         [0, 0, 1, 7, 4, 6, 0, 9, 0],
         [0, 0, 6, 0, 3, 2, 1, 0, 8],
         [0, 3, 2, 5, 8, 0, 6, 0, 4],
         [0, 0, 7, 8, 0, 0, 0, 4, 0],
         [0, 0, 5, 0, 2, 7, 9, 8, 0],
         [0, 0, 0, 4, 6, 0, 0, 0, 0]]

def is_valid(board, row, col, num):
    block_row, block_col = (row // 3) * 3, (col // 3) * 3
    for i in range(9):
        if board[row][i] == num:
            return False
        elif board[i][col] == num:
            return False
        if board[block_row + i // 3][block_col + i % 3] == num:
            return False
    return True

def solve(board):
    for row in range(9):
        for col in range(9):
            if board[row][col]:
                continue
            for num in range(1, 10):
                if is_valid(board, row, col, num):
                    board[row][col] = num
                    if solve(board):
                        return True
                    board[row][col] = 0
            return False
    return True

if solve(board):
    pprint(board)
else:
    print("No solution exists.")
[[2, 7, 4, 1, 9, 3, 8, 6, 5],
[6, 5, 8, 2, 7, 4, 3, 1, 9],
[1, 9, 3, 6, 5, 8, 4, 2, 7],
[5, 8, 1, 7, 4, 6, 2, 9, 3],
[7, 4, 6, 9, 3, 2, 1, 5, 8],
[9, 3, 2, 5, 8, 1, 6, 7, 4],
[3, 2, 7, 8, 1, 9, 5, 4, 6],
[4, 6, 5, 3, 2, 7, 9, 8, 1],
[8, 1, 9, 4, 6, 5, 7, 3, 2]]
def permute(nums):
    res = []
    size = len(nums)
    if not size: return [[]]
    for i in range(size):
        # exclude element i
        rest = nums[:i] + nums[i+1:]
        perms = [[nums[i]] + x for x in permute(rest)]
        res.extend(perms)
    return res
The hallmark of a dynamic programming problem is overlapping subproblems.
The key to the problem is identifying the smallest input, the case for which the answer is trivially simple.
We have two strategies:
Draw a strategy!!
Recursion is a technique to solve problems which in turn depend on solving smaller subproblems. It permeates many other methods, like backtracking, merge sort, quick sort, binary search or tree traversal.
Recursive functions have two parts:
The space complexity of recursion will be, at least, the length of the stack which accumulates all the function calls.
Note: CPython’s recursion limit is 1,000. This can limit the depth of the problems we can tackle.
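We can observe both the limit and the resulting RecursionError directly (a small illustrative check; the function name depth is hypothetical):

```python
import sys

def depth(n):
    # each nested call adds one frame to the call stack
    return 1 if n == 0 else 1 + depth(n - 1)

print(sys.getrecursionlimit())  # usually 1000 in CPython

try:
    depth(sys.getrecursionlimit() + 100)
except RecursionError:
    print("hit the recursion limit")
```

The limit can be raised with sys.setrecursionlimit, but an iterative rewrite with an explicit stack is usually the safer fix.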
TODO
In DP, combining recursion and memoization is a powerful way to trade space complexity for time complexity. Specifically, since problems are overlapping, it is likely we are solving the same subproblems over and over, which can get expensive due to recursion. Caching them can greatly improve the speed of our algorithm.
Here is a recipe for solving these problems (from here):
The computational complexity will be impacted by two factors:
- m: the average length of the elements of the input. For instance, if the input is a list, m = len(input); if it is an integer, it is m = input. This will impact the height of the tree.
- n: the length of the input. This will impact the branching factor. For instance, if the input is a list, n = len(input).

Brute force: for every node, we have n options. Usually, the time complexity of DP problems will be exponential, \(O(n^m \cdot k)\), where \(k\) is the complexity of a single recursive call. The memory complexity is the call stack, \(O(m)\).
Memoized: memoization reduces the branching factor by storing previous results. In other words, it trades time complexity for space complexity; usually both become polynomial.
TODO
Taken from here:
Some caveats:
These are some materials that helped me understand dynamic programming (the order matters!):
def how_sum(target: int, nums: list[int], memo: dict = {}) -> None | list[int]:
    # note: the mutable default memo persists across top-level calls
    if target == 0: return []
    if target < 0: return None
    if target in memo: return memo[target]
    for num in nums:
        solution = how_sum(target - num, nums, memo)
        if solution is not None:
            memo[target] = solution + [num]
            return memo[target]
    memo[target] = None
    return None

how_sum(300, [7, 14])
def best_sum(target: int, nums: list[int], memo: dict = {}) -> None | list[int]:
    if target in memo: return memo[target]
    if target == 0: return []
    if target < 0: return None
    memo[target] = None
    length_best_solution = float("inf")
    for num in nums:
        solution = best_sum(target - num, nums, memo)
        if solution is not None and len(solution) < length_best_solution:
            memo[target] = solution + [num]
            length_best_solution = len(memo[target])
    return memo[target]

print(best_sum(7, [5, 3, 4, 7]))
print(best_sum(8, [1, 4, 5]))
print(best_sum(100, [1, 2, 5, 25]))
def can_construct(target: str, dictionary: list, memo: dict = {}) -> bool:
    if target in memo: return memo[target]
    if not target: return True
    memo[target] = False
    for word in dictionary:
        if target.startswith(word):
            new_target = target.removeprefix(word)
            if can_construct(new_target, dictionary, memo):
                memo[target] = True
                break
    return memo[target]

print(can_construct("abcdef", ["ab", "abc", "cd", "def", "abcd"]))
print(can_construct("skateboard", ["bo", "rd", "ate", "t", "ska", "sk", "boar"]))
print(can_construct("eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeef", ["e", "ee", "eee", "eeee", "eeeee", "eeeee"]))
def count_construct(target: str, dictionary: list, memo: dict = {}) -> int:
    if target in memo: return memo[target]
    if not target: return 1
    memo[target] = 0
    for word in dictionary:
        if target.startswith(word):
            new_target = target.removeprefix(word)
            memo[target] += count_construct(new_target, dictionary, memo)
    return memo[target]

print(count_construct("abcdef", ["ab", "abc", "cd", "def", "abcd"]))
print(count_construct("purple", ["purp", "p", "ur", "le", "purpl"]))
print(count_construct("skateboard", ["bo", "rd", "ate", "t", "ska", "sk", "boar"]))
print(count_construct("eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeef", ["e", "ee", "eee", "eeee", "eeeee", "eeeee"]))
def all_construct(target: str, dictionary: list, memo: dict = {}) -> list[list[str]]:
    if target in memo: return memo[target]
    if not target: return [[]]
    memo[target] = []
    for word in dictionary:
        if target.startswith(word):
            new_target = target.removeprefix(word)
            constructs = all_construct(new_target, dictionary, memo)
            constructs = [[word] + c for c in constructs]
            memo[target].extend(constructs)
    return memo[target]

print(all_construct("abcdef", ["ab", "abc", "cd", "def", "abcd", "ef", "c"]))
print(all_construct("purple", ["purp", "p", "ur", "le", "purpl"]))
print(all_construct("skateboard", ["bo", "rd", "ate", "t", "ska", "sk", "boar"]))
print(all_construct("eeeeeeeeeeeeeeeeeeeeef", ["e", "ee", "eee", "eeee", "eeeee", "eeeee"]))
def fib_t(n: int) -> int:
    table = [0] * (n + 2)
    table[1] = 1
    for i in range(n):
        table[i + 1] += table[i]
        table[i + 2] += table[i]
    return table[n]
print(fib_t(6))
print(fib_t(50))
def grid_traveler(m: int, n: int) -> int:
    grid = [[0] * (n + 1) for _ in range(m + 1)]
    grid[1][1] = 1
    for i in range(m + 1):
        for j in range(n + 1):
            if (i + 1) <= m:
                grid[i + 1][j] += grid[i][j]
            if (j + 1) <= n:
                grid[i][j + 1] += grid[i][j]
    return grid[m][n]
print(grid_traveler(1, 1))
print(grid_traveler(2, 3))
print(grid_traveler(3, 2))
print(grid_traveler(3, 3))
print(grid_traveler(18, 18))
def can_sum_t(target: int, nums: list) -> bool:
    """
    Complexity:
    - Time: O(m*n)
    - Space: O(m)
    """
    grid = [False] * (target + 1)
    grid[0] = True
    for i in range(len(grid)):
        if not grid[i]:
            continue
        for num in nums:
            # strict <: i + num == len(grid) would be out of range
            if (i + num) < len(grid):
                grid[i + num] = True
    return grid[target]
print(can_sum_t(7, [2 ,3])) # True
print(can_sum_t(7, [5, 3, 4])) # True
print(can_sum_t(7, [2 ,4])) # False
print(can_sum_t(8, [2, 3, 5])) # True
print(can_sum_t(300, [7, 14])) # False
def how_sum_t(target: int, nums: list[int]) -> None | list[int]:
    """
    Complexity:
    - Time: O(m*n^2)
    - Space: O(m*n)
    """
    grid = [None] * (target + 1)
    grid[0] = []
    for i in range(len(grid)):
        if grid[i] is None:
            continue
        for num in nums:
            if (i + num) < len(grid):
                grid[i + num] = grid[i].copy()
                grid[i + num].append(num)
    return grid[target]
print(how_sum_t(7, [2 ,3])) # [2, 2, 3]
print(how_sum_t(7, [5, 3, 4, 7])) # [3, 4]
print(how_sum_t(7, [2 ,4])) # None
print(how_sum_t(8, [2, 3, 5])) # [2, 2, 2, 2]
print(how_sum_t(300, [7, 14])) # None
def best_sum_t(target: int, nums: list[int]) -> None | list[int]:
    """
    Complexity:
    - Time: O(m*n^2)
    - Space: O(m^2)
    """
    grid = [None] * (target + 1)
    grid[0] = []
    for i in range(len(grid)):
        if grid[i] is None:
            continue
        for num in nums:
            if (i + num) < len(grid):
                if grid[i + num] is None or len(grid[i + num]) > len(grid[i]):
                    grid[i + num] = grid[i].copy()
                    grid[i + num].append(num)
    return grid[target]
print(best_sum_t(7, [2 ,3])) # [2, 2, 3]
print(best_sum_t(7, [5, 3, 4, 7])) # [7]
print(best_sum_t(7, [2 ,4])) # None
print(best_sum_t(8, [2, 3, 5])) # [5, 3]
print(best_sum_t(300, [7, 14])) # None
def can_construct_t(target: str, words: list[str]) -> bool:
    """
    Complexity:
    - Time: O(m^2*n)
    - Space: O(m)
    """
    grid = [False] * (len(target) + 1)
    grid[0] = True
    for i in range(len(grid)):
        if not grid[i]:
            continue
        prefix = target[:i]
        for word in words:
            if (i + len(word)) >= len(grid):
                continue
            if target.startswith(prefix + word):
                grid[i + len(word)] = True
    return grid[len(target)]
print(can_construct_t("abcdef", ["ab", "abc", "cd", "def", "abcd"])) # True
print(can_construct_t("skateboard", ["bo", "rd", "ate", "t", "ska", "sk", "boar"])) # False
print(can_construct_t("enterapotentpot", ["a", "p", "ent", "enter", "ot", "o", "t"])) # True
print(can_construct_t("eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeef", ["e", "ee", "eee", "eeee", "eeeee", "eeeee"])) # False
def count_construct_t(target: str, words: list[str]) -> int:
    """
    Complexity:
    - Time: O(m^2*n)
    - Space: O(m)
    """
    grid = [0] * (len(target) + 1)
    grid[0] = 1
    for i in range(len(grid)):
        if not grid[i]:
            continue
        for word in words:
            if (i + len(word)) >= len(grid):
                continue
            prefix = target[:i]
            if target.startswith(prefix + word):
                grid[i + len(word)] += grid[i]
    return grid[len(target)]
print(count_construct_t("abcdef", ["ab", "abc", "cd", "def", "abcd"])) # 1
print(count_construct_t("purple", ["purp", "p", "ur", "le", "purpl"])) # 2
print(count_construct_t("skateboard", ["bo", "rd", "ate", "t", "ska", "sk", "boar"])) # 0
print(count_construct_t("enterapotentpot", ["a", "p", "ent", "enter", "ot", "o", "t"])) # 4
print(count_construct_t("eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeef", ["e", "ee", "eee", "eeee", "eeeee", "eeeee"])) # 0
from copy import deepcopy

def all_construct_t(target: str, words: list[str]) -> list[list[str]]:
    """
    Complexity:
    - Time: O(n^m)
    - Memory: O(n^m)
    """
    grid = [[] for _ in range(len(target) + 1)]
    grid[0] = [[]]
    for i in range(len(grid)):
        if not grid[i]:
            continue
        for word in words:
            if (i + len(word)) > len(grid):
                continue
            prefix = target[:i]
            if target.startswith(prefix + word):
                new_constructs = deepcopy(grid[i])
                for x in new_constructs:
                    x.append(word)
                if grid[i + len(word)]:
                    grid[i + len(word)].extend(new_constructs)
                else:
                    grid[i + len(word)] = new_constructs
    return grid[len(target)]
print(all_construct_t("abcdef", ["ab", "abc", "cd", "def", "abcd", "ef", "c"])) # [['ab', 'cd', 'ef'], ['ab', 'c', 'def'], ['abc', 'def'], ['abcd', 'ef']]
print(all_construct_t("purple", ["purp", "p", "ur", "le", "purpl"])) # [['purp', 'le'], ['p', 'ur', 'p', 'le']]
print(all_construct_t("skateboard", ["bo", "rd", "ate", "t", "ska", "sk", "boar"])) # []
print(all_construct_t("enterapotentpot", ["a", "p", "ent", "enter", "ot", "o", "t"])) # [['enter', 'a', 'p', 'ot', 'ent', 'p', 'ot'], ['enter', 'a', 'p', 'ot', 'ent', 'p', 'o', 't'], ['enter', 'a', 'p', 'o', 't', 'ent', 'p', 'ot'], ['enter', 'a', 'p', 'o', 't', 'ent', 'p', 'o', 't']]
Stakeholders will sometimes come to us with problems, and we might need to produce a good algorithmic solution pretty quickly; say 45-60 minutes. This is a template on how to tackle these situations.
If our stakeholder is prepared, they might come with a written down problem statement. They might share it with us ahead of our meeting or right at the start.
While it can be tempting to implement a solution right away, it is worth spending some time drafting the problem. After all, our stakeholder might have given it some thought already, and could be able to point us in the right direction.
During the implementation phase, it might help to go from the big picture to the small picture. Start by defining the global flow of the program, calling unimplemented functions with clear names. This will allow you to make sure your proposal make sense before getting entangled in the specifics.
It is important that our stakeholder can follow our logic throughout:
Once you have a working solution, revisit it:
Once our solution is ready, it might be a good idea to give it a go. Simply call your function on a few examples. Consider:
If some examples fail, we need to debug our code. Throw in a few print statements, predict what you expect to see, and go for it.
After successfully presenting a solution, our stakeholder might have some follow-up questions:
Note: some algorithms have implicit and potentially unexpected behaviors. Ctrl + F “Note:” in order to find some of them.
Pandas provides several data structures, out of which two are particularly popular: Series and DataFrames.
A Series is a vector-like structure that extends NumPy vectors.
import pandas as pd
x = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
x
a 0
b 1
c 2
d 3
dtype: int64
The Series stores the data as a NumPy vector, inheriting its advantages and disadvantages. But computations on Series come with extra overhead, since Pandas puts extra effort into handling missing values.
DataFrames are matrix-like structures, which build on top of Series. They can be created in multiple ways, some of which are:
data = {'Column1': [1, 2, 3], 'Column2': ['A', 'B', 'C']}
df = pd.DataFrame(data)
data = [{'Column1': 1, 'Column2': 'A'}, {'Column1': 2, 'Column2': 'B'}, {'Column1': 3, 'Column2': 'C'}]
df = pd.DataFrame(data)
The DataFrame stores data as multiple Series with a shared index. While the data of a Series lives altogether, the different Series of a DataFrame are scattered in memory. In consequence, adding a new column to a DataFrame is fast: Pandas just needs to add its reference to the registry.
As with NumPy vectors, we can access a Series’ elements using their positional indices. But, furthermore, a Series has an index, a hash map structure which allows us to access each element in the array using a label:
.iloc[]
uses the positional indices, and slicing works as usual: x.iloc[2:3]
c 2
dtype: int64
.loc[]
uses labels, and slicing includes both beginning and end: x.loc["c":"d"]
c 2
d 3
dtype: int64
DataFrames also have a .loc[]
and an .iloc[]
indexers, which accept columns as a second argument.
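As a sketch of this two-argument form (the row labels and column names below are made up for the example):

```python
import pandas as pd

df = pd.DataFrame(
    {"Column1": [1, 2, 3], "Column2": ["A", "B", "C"]},
    index=["x", "y", "z"],
)

# label-based: rows "x" through "y" (inclusive), column "Column2"
print(df.loc["x":"y", "Column2"])

# position-based: first two rows, second column
print(df.iloc[:2, 1])
```

Both calls select the same elements here; they only differ in whether we address them by label or by position.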
Thanks to their dictionary-like properties, indexes allow us to access an element in constant time. However, non-unique indexes might lead to a worst-case \(O(n)\) lookup time.
Unless otherwise specified, the index gets initialized to a (lazy) enumeration of the rows/items. We can access the index using .index
, and revert it to this default behaviour using .reset_index(drop=True)
. Note that indexes are immutable, to ensure data integrity. In other words, adding or deleting entries will not alter the index of the remaining elements, in contrast to the positional index.
A MultiIndex is an index in which each key is a (unique) tuple. We can create them from lists of lists or of tuples, from DataFrames, or from the cross-product of two iterables:
x = pd.Series([1,2,3,4])
class_1 = ["foo", "bar"]
class_2 = [1, 2]
index = pd.MultiIndex.from_product((class_1, class_2),
# the name of the levels themselves
names = ["first", "second"])
x.index = index
x
first second
foo 1 1
2 2
bar 1 3
2 4
dtype: int64
As shown above, the items within a particular position in the tuple do not need to be unique within that position. This allows us to select subgroups using partial indexes:
x["foo"]
second
1 1
2 2
dtype: int64
x[:, 1]
first
foo 1
bar 3
dtype: int64
Like NumPy, Pandas distinguishes between viewing an object and copying it.
TODO
While Python offers multiple ways of formatting strings (i.e., combining predefined text and variables), F-strings are particularly elegant:
constants = {
"pi": 3.14159265358979323846,
"sqrt(2)": 1.41421356237309504880,
"Euler's number": 2.71828182845904523536
}
for name, value in constants.items():
print(f"{name} = {value}")
pi = 3.141592653589793
sqrt(2) = 1.4142135623730951
Euler's number = 2.718281828459045
The float variables can be rounded to a given precision:
for name, value in constants.items():
print(f"{name} = {value:.3f}")
pi = 3.142
sqrt(2) = 1.414
Euler's number = 2.718
Values can also be formatted to occupy a minimum fixed width:
for name, value in constants.items():
print(f"{name:10} = {value:.3f}")
pi = 3.142
sqrt(2) = 1.414
Euler's number = 2.718
Note that the string “Euler’s number” exceeds the minimum length of 10, and is hence represented as is.
Strings placed next to each other are automatically concatenated:
assert "foo" "bar" == "foo" + "bar"
This is useful to cleanly produce long strings while respecting a certain maximum line length:
message = "Hello, " \
"World!"
print(message)
Hello, World!
enumerate
with an offsetThe enumerate
function creates a lazy generator over an iterable that will return a tuple (index, item). It can take a second parameter, to indicate the first index to start counting from:
x = ["a", "b", "c"]
for idx, item in enumerate(x, 10):
print(f"{idx}: {item}")
10: a
11: b
12: c
zip
and itertools.zip_longest
The zip
function combines two or more iterators, generating a lazy generator which yields the next item from each. It is particularly useful to handle related lists that have the same length:
numbers = [1, 2, 3]
squared = [x**2 for x in numbers]
for number, square in zip(numbers, squared):
print(f"The square of {number} is {square}.")
The square of 1 is 1.
The square of 2 is 4.
The square of 3 is 9.
However, when the two iterables have different lengths, zip
will only emit as many elements as the shortest of them:
xs = list(range(4))
ys = list(range(5))
for x, y in zip(xs, ys):
print(x, y)
0 0
1 1
2 2
3 3
When we do not wish this truncation to happen, itertools.zip_longest
might be what we need:
from itertools import zip_longest
xs = list(range(4))
ys = list(range(5))
for x, y in zip_longest(xs, ys):
print(x, y)
0 0
1 1
2 2
3 3
None 4
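If None is not a convenient placeholder, zip_longest also accepts a fillvalue argument to pad the shorter iterable with something else:

```python
from itertools import zip_longest

xs = list(range(4))
ys = list(range(5))

# the shorter iterable is padded with the fillvalue
pairs = list(zip_longest(xs, ys, fillvalue=-1))
print(pairs)  # [(0, 0), (1, 1), (2, 2), (3, 3), (-1, 4)]
```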
The list.sort
method orders a list’s elements in ascending order. It will work as long as the items have defined the <
comparison operator, as is the case for floats, integers and strings. However, in some cases that operator might not be implemented, or might not be making the comparison that we care about. The key
argument is helpful in those cases:
class Animal:
def __init__(self, name, weight):
self.name = name
self.weight = weight
def __repr__(self):
return f"Animal({self.name}, {self.weight})"
animals = [
Animal("whale", 100000),
Animal("sea lion", 200),
Animal("lion", 200),
Animal("possum", 2.5)
]
# sort by weight
animals.sort(key = lambda x: x.weight)
print(animals)
[Animal(possum, 2.5), Animal(sea lion, 200), Animal(lion, 200), Animal(whale, 100000)]
# sort by name
animals.sort(key = lambda x: x.name)
print(animals)
[Animal(lion, 200), Animal(possum, 2.5), Animal(sea lion, 200), Animal(whale, 100000)]
As shown, key
takes a function which will receive an item, and output a comparable value. If we want to order first by weight, then by name, we just need to combine both in a tuple:
# sort by weight, then name
animals.sort(key = lambda x: (x.weight, x.name))
print(animals)
[Animal(possum, 2.5), Animal(lion, 200), Animal(sea lion, 200), Animal(whale, 100000)]
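Two related knobs worth knowing: the reverse argument flips the order, and the built-in sorted returns a new list instead of sorting in place. A quick sketch:

```python
weights = [100000, 2.5, 200]

# sorted() returns a new list, leaving the original untouched
descending = sorted(weights, reverse=True)

print(descending)  # [100000, 200, 2.5]
print(weights)     # [100000, 2.5, 200]
```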
The walrus operator (:=
) allows us to assign variables in the middle of expressions:
def is_divisor(x, y):
"""
Check if y is a divisor of x.
Parameters:
- x (int): The dividend.
- y (int): The potential divisor.
Returns:
tuple: A tuple containing a boolean indicating whether y is a divisor of x,
and the remainder when x is divided by y. If y is a divisor, the
boolean is True, and the remainder is 0; otherwise, the boolean is
False, and the remainder is the result of x % y.
"""
if remainder := x % y:
return False, remainder
else:
return True, 0
print(is_divisor(10, 5))
(True, 0)
print(is_divisor(10, 3))
(False, 1)
The walrus operator is present in the first line of the is_divisor
function. It allows two things to happen at once. First, the if
clause will evaluate the expression x % y
(false if the remainder is 0; true if it’s any other number). Additionally, it is setting the remainder
variable to x % y
. This makes the code easier to understand, since remainder
is only defined if it is going to be used.
We can use underscores _
as visual separators between any pair of digits in integers, floats or complex numbers:
assert 10_000_000 == 10000000
assert 1_100.3 == 1100.3
I find this particularly useful when dealing with large numbers.
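The same separators work in the other direction, when formatting output:

```python
big = 10_000_000

# format specifiers accept "_" and "," as grouping characters
print(f"{big:_}")  # 10_000_000
print(f"{big:,}")  # 10,000,000
```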
Most decimal floating-point numbers cannot be represented as binary floating-point numbers. Instead, computers just store an approximation. This behavior is not evident by just asking Python to display a number, since it will round it:
print(0.1)
0.1
However, if we request Python to give more significant digits:
format(0.1, '.20g')
'0.10000000000000000555'
While this approximation error is smaller than \(2^{-53}\), it is enough to cause problems:
assert .1 + .2 == .3
AssertionError
Luckily, we can get around it with a little extra work:
import math
assert math.isclose(.1 + .2, .3)
assert round(.1 + .2, ndigits=1) == round(.3, ndigits=1)
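When exact decimal arithmetic matters (e.g., for monetary amounts), the stdlib’s decimal module sidesteps the problem altogether, at the cost of speed:

```python
from decimal import Decimal

# constructed from strings, Decimals store the decimal digits exactly
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# beware: constructing from a float inherits its binary approximation
print(Decimal(0.1))
```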
TODO
In this series, we have seen multiple examples in which the type of a variable is specified. For instance:
x: dict[int, int] = {0: 0, 1: 1}
def pretty_print(x: str, prefix: str | None = None) -> None:
prefix = f"{prefix}: " if prefix else ""
print(f"{prefix}{x.title()}.")
Note that type hints are a relatively recent addition to Python. Type hints from recent versions of Python might produce parsing errors on older versions.
The stdlib’s typing
module gives many options to control type hints. (Widely used packages bring their own typing hints, like numpy.) Below I explore some interesting features.
The decorator @typing.overload
allows us to overload functions, that is, to have a function behave differently depending on the argument type.
from typing import overload
@overload
def square(x: int) -> int:
...
@overload
def square(x: list[int]) -> list[int]:
...
def square(x: list[int] | int) -> list[int] | int:
if isinstance(x, list):
return [square(_x) for _x in x]
return x * x
Python is a dynamically typed language. Hence, type hints are just that: hints. However, we can run mypy on our entire codebase to check that types are used correctly.
Some people are really concerned by performance. Their concern is such that they are willing to sacrifice code readability for minor gains in performance. Such people might get satisfaction from replacing arithmetic operations involving integers by bitwise operations. Since those act directly on the bit representation of the integer, they can be more efficient. Despite compilers performing some optimization of their own, there is some somewhat old evidence supporting that bitwise operations are faster. I describe below some common optimizations.
The >>
and the <<
operators shift the bit representation to the left and to the right, respectively. This can be used to quickly divide or multiply integers by powers of two:
x = 0b101 # 5
# shift to the right by 1
# 0b101 -> 0b10
# equivalent to 5 // 2**1
5 >> 1 # 2
# shift to the left by 4
# 0b101 -> 0b1010000
# 5 * 2**4
5 << 4 # 80
The &
operator is the bitwise AND operator. When we use &
between any integer and a 1, we are effectively checking if the last bit is a 1 (odd) or a 0 (even):
# 0b1110 & 0b0001 = 0b0000 = 0
assert not 14 & 1
# 0b1111 & 0b0001 = 0b0001 = 1
assert 15 & 1
The ~
operator is the complement operator, which switches 1s by 0s and vice versa. Let’s see it in action:
# 0b01 -> 0b10
assert ~1 == -2
assert ~-2 == 1
Since the first bit represents the sign, it has the effect of turning \(x\) into \(-x - 1\). This is useful when we need to simultaneously iterate the front and the back of a list:
def is_palindrome(word: str) -> bool:
return all([word[i] == word[~i] for i in range(len(word) // 2)])
assert is_palindrome("kayak")
assert not is_palindrome("dog")
Handling exceptions with try: ... except: ...
is common in Python code. But there are some additional nuances:
y = list()
x = 1
try:
x + 1
y.append(1)
{}[1]
# we can handle multiple, specific exceptions
except TypeError:
print(f"Can't sum an integer and a {type(x)}.")
except AttributeError:
print(f"Can't append to {type(y)}.")
# we can still add a catch-all exception
except:
# we can throw our own exception
raise Exception("Something went wrong.")
# behavior if no error is raised
else:
print("All good.")
# a block that will be run no matter what,
# usually good for clean up
finally:
print("Thanks anyway.")
A context manager is a programming construct that makes it easy to allocate and release resources. It is useful to handle file operations, network connections or database transactions, when it is important to release the resource when we are done with it. They can be used using the with
statement. The context manager class needs two methods: __enter__
, to setup the resource, and __exit__
, to clean up and release the resource.
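As a minimal sketch (the class name and the log list are made up for the example):

```python
class ManagedResource:
    """A toy context manager that records its own lifecycle."""

    def __enter__(self):
        # acquire the resource and return it to the "as" target
        self.log = ["acquired"]
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # release the resource, even if the block raised an exception
        self.log.append("released")
        # returning False propagates any exception that occurred
        return False

with ManagedResource() as resource:
    resource.log.append("used")

print(resource.log)  # ['acquired', 'used', 'released']
```

For simple cases, contextlib.contextmanager offers a generator-based shortcut for the same pattern.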
A language is statically typed when variables have types, i.e., the types of the variables are checked before execution (usually at compilation). In contrast, in dynamically typed languages variable names don’t have types, runtime values (objects) do, i.e., the variable types are checked during execution.
It is often said that whether a language is compiled or interpreted is an “implementation detail”. That is, we should separate Python, the programming language itself, from its specific implementation (like CPython, IronPython or PyPy). Nonetheless, the most popular implementations, indeed, behave like interpreters. More specifically, they execute code in two steps:
1. Compiling the source code into an intermediate representation called “bytecode” (files ending in *.pyc, stored in __pycache__).
2. Executing that bytecode on the Python virtual machine.
Note that the “compilation” step is quite different from what it would involve for a so-called compiled language, like C or C++. For the latter, we would end up with an independent executable. Furthermore, CPython puts emphasis on executing the code quickly. Hence, it spends little time optimizing the executable. On the other hand, compilation in C/C++ can take a significant amount of time, as these optimizations take place.
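We can peek at this bytecode with the stdlib’s dis module (the function below is just an example; the exact opcodes vary between Python versions):

```python
import dis

def add_one(x):
    return x + 1

# print the bytecode instructions the function compiles down to
dis.dis(add_one)
```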
The Python interpreter comes with some predefined types:
- numeric types (int, float, complex)
- the boolean type (bool)
- sequence types (list, tuple, range)
- the text sequence type (str)
- binary sequence types (bytes, bytearray, memoryview)
- set types (set, frozenset)
- the mapping type (dict)
- type annotation types (Generic Alias, Union)
- the null object (None) and others
and others)Python has two kinds of data types, mutable and immutable, which respectively can and cannot be modified after being created. Mutable data types include lists, dictionaries and sets; immutable data types, integers, floats, booleans, strings and tuples. Let’s see an example:
# an (immutable) int(1) object is created
# both x and y point at it
x = y = 1
assert x is y
# we change the value of x. since integers are immutable,
# a new int(x + 1) is created to store that value, and
# x is assigned that new reference
x += 1
# x and y don't refer to the same object anymore
assert x != y
assert x is not y
Let’s compare this behaviour to that of a mutable object:
# a list is created, and both x and y point at it
x = y = [1]
assert x is y
# we change the value of x. since lists are mutable,
# the original list gets altered
x.append(2)
# x and y still refer to the same object
assert x == y
assert x is y
Interestingly, and as we saw when examining the refcount, Python leverages this immutability:
# one might expect two separate int(1) objects to be created
x = 1
y = 1
assert x is not y
AssertionError
In other words, 1 (and other common objects, like 0 or True
) are singletons. That way, Python does not need to keep allocating memory for new objects that are used very often. This does not happen for more unique immutable objects:
x = 8457
y = 8457
assert x is not y
Mutability also has implications on memory allocation. Python knows at runtime how much memory an immutable data type requires. However, the memory requirements of mutable containers will change as we add and remove elements. Hence, to add new elements quickly if needed, Python allocates more memory than is strictly needed.
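We can observe this over-allocation with sys.getsizeof (the exact sizes are an implementation detail, so the sketch below only prints them):

```python
import sys

x = []
sizes = []
for i in range(20):
    x.append(i)
    sizes.append(sys.getsizeof(x))

# the reported size stays flat for several appends, then jumps:
# Python allocates room for more elements than we actually use
print(sizes)
```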
A namespace is a mapping from names to objects. In fact, underlying a namespace there is a dictionary: its keys are symbolic names (e.g., x
) and its values, the object they reference (e.g., an integer with a value of 8). During the execution of a typical Python program, multiple namespaces are created, each with its own lifetime. There are four types of namespaces:
print
, int
or len
.globals()
.locals()
Namespaces are related to scopes, which are the parts of the code in which a specific set of namespaces can be accessed. When Python needs to look up a name, it resolves it by examining the namespaces using the LEGB rule: it starts at the Local namespace; if unsuccessful, it moves to the Enclosing namespace; then the Global, and lastly the Builtin. By default, assignments and deletions happen on the local namespace. However, this behaviour can be altered using the nonlocal
and global
statements:
def enclosing_test():
foo = "enclosed"
print(f"Inside enclosing_test, foo = {foo}")
def local_test():
foo = "local"
print(f"Inside local_test, foo = {foo}")
local_test()
print(f"After local_test, foo = {foo}")
def nonlocal_test():
nonlocal foo
foo = "nonlocal"
print(f"Inside nonlocal_test, foo = {foo}")
nonlocal_test()
print(f"After nonlocal_test, foo = {foo}")
def global_test():
global foo
foo = "global"
print(f"Inside global_test, foo = {foo}")
global_test()
print(f"After global_test, foo = {foo}")
foo = "original"
print(f"At the beginning, foo = {foo}")
enclosing_test()
print(f"Finally, foo = {foo}")
At the beginning, foo = original
Inside enclosing_test, foo = enclosed
Inside local_test, foo = local
After local_test, foo = enclosed
Inside nonlocal_test, foo = nonlocal
After nonlocal_test, foo = nonlocal
Inside global_test, foo = global
After global_test, foo = nonlocal
Finally, foo = global
In CPython, all the objects live in a private heap. Memory management is handled exclusively by the Python memory manager. In other words, and in contrast to languages like C, the user has no way to directly manipulate items in memory. The Python heap is further subdivided into arenas to reduce data fragmentation.
When an object is created, the memory manager allocates some memory for it in the heap, and its reference is stored in the relevant namespace.
Conversely, the garbage collector is an algorithm that deallocates objects when they are no longer needed. The main mechanism uses the reference count of the object: when it falls to 0, its memory is deallocated. However, the garbage collector also watches for objects that still have a non-zero refcount, but have become inaccessible, for instance:
# create a list - refcount = 1
x = []
# add a reference to itself - refcount = 2
x.append(x)
# delete the original reference - refcount = 1
del x
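We can watch the cycle detector do its job: gc.collect() triggers a collection and returns the number of objects it found unreachable. (Automatic collection is paused below to make the example deterministic.)

```python
import gc

gc.disable()  # pause automatic collection for a deterministic demo

x = []
x.append(x)  # the list references itself: a reference cycle
del x        # the cycle is now unreachable, but its refcount is not 0

# the cyclic garbage collector finds and reclaims it
unreachable = gc.collect()
gc.enable()

print(unreachable)  # at least 1
```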
The GIL (Global Interpreter Lock) is a mechanism that makes CPython thread-safe by allowing only one thread to execute Python bytecode at a time. This vastly simplifies CPython’s implementation and writing extensions for it, since thread safety is not a concern. It also leads to faster single-threaded applications. However, CPU-bound tasks cannot be sped up by multithreading: the threads will run sequentially, never in parallel. Multithreading can still speed up I/O-bound operations.
When parallel processing is needed, Python can still achieve it, for instance via the multiprocessing module, which sidesteps the GIL by spawning separate processes, or via C extensions that release the GIL during heavy computations, as NumPy does.
A Python module is simply a file containing Python functions, classes, constants and runnable code. When we want to use them, we need to import the module using the import
statement. For instance:
import numpy as np
This imports the NumPy package as a module object and binds it to the name np.
There are multiple things that Python recognizes as modules:
TO EXPAND
Dictionaries and sets are ideal data structures to collect items when 1. the data has no intrinsic order; and 2. elements can be retrieved using keys, i.e., so-called hashable objects (further described below). The underpinnings of dictionaries and sets are very similar, a data structure called the hash map. Similarly to lists and tuples, we can visualize it as a finite number of memory buckets, each of which can store a reference to an object. The number of buckets is called the capacity. Each possible key maps univocally to a bucket, thanks to the hash function. Hence, lookups, insertions and deletions are performed in constant time. The difference between dictionaries and sets lies in what goes in the bucket: dictionaries store key-value pairs, while sets only store keys.
Hash maps rely on the key objects implementing a hash method (.__hash__()
). This function maps each object to a fixed-byte integer. When we apply the hash()
function to the object, its hash method gets called. The hash method needs to be fast, since all operations on a hash map are limited by its speed. In addition, it needs to be deterministic and produce fixed-length values. For reasons we will go over later, a hashable object needs to additionally implement either an __eq__
operator, or a __cmp__
operator.
Let’s see some examples using the immutable builtins:
hash(1) # 1
hash(1.) # 1
hash("1") # 6333942777250828306
hash((1)) # 1
hash((1, 1)) # 8389048192121911274
hash((1., 1.)) # 8389048192121911274
Imagine that our hash map has a capacity of 16, i.e., we have 16 buckets indexed from 0 to 15. However, as we just saw, hashing can produce very large integers, which makes it impossible to use the integer as a bucket index. To keep things small, let’s say our object’s hash is 62. To map 62 to one of our 16 buckets, we need an additional operation called mask. A simple mask is the modulo function using the capacity, which will always produce a value between 0 and 15:
62 % 16
14
In other words, 62 would map to bucket number 14. However, there are faster masks. Specifically, Python uses the bitwise AND (&
):
# bin(62) = 0b111110
# &
# bin(15) = 0b001111
# ------------------
# 0b001110 = 14
62 & 15
14
Note that since the number of buckets will often be much smaller than the hash, &
is effectively operating only on the tails.
If the number of empty buckets is large enough, a new key can be mapped to an empty bucket with high probability. (Empty buckets contain a NULL
value.) In that case, we will simply store the key-value pair in the bucket.
However, it is possible that the bucket is already occupied. In that case, Python will first check if the keys are equal. That is why 1. we store the key object in the bucket; and 2. the key object needs to implement .__eq__
or .__cmp__
. If there is a match, the key-value pair gets updated. Otherwise, we have a collision, i.e., two keys map to the same bucket. In that case, a deterministic exploration of other buckets starts. The details of this process, called probing, are beyond the scope of this article. Once an empty bucket is found, the key-value pair will be stored there. At lookup time, this probing process will be followed, until either the right key or an empty bucket is found.
When a value is deleted, we cannot simply overwrite it with a NULL
. That would make the bucket identical to a “virgin” bucket, and potentially disrupt the probing strategy, leading to inconsistent results. Instead, a special value is written (sometimes called a “turd”). Nonetheless, this memory is not wasted: another key-value pair can take its place if needed, without compromising the integrity of the data.
Note: User-defined classes are hashable by default, even if they don’t implement __hash__
, __eq__
or __cmp__
. The hash is computed using the object’s id()
, and all objects compare unequal.
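A quick demonstration (the class names are made up):

```python
class Plain:
    pass

# hashable by default: the hash derives from the object's id(),
# and two distinct instances compare unequal
a, b = Plain(), Plain()
assert isinstance(hash(a), int)
assert a != b

# however, defining __eq__ without __hash__ sets __hash__ to None,
# which makes instances unhashable
class WithEq:
    def __eq__(self, other):
        return isinstance(other, WithEq)

try:
    hash(WithEq())
    hashable = True
except TypeError:
    hashable = False

print(hashable)  # False
```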
Understanding what underlies dictionaries and sets allows us to estimate the time complexity of the different operations.
Lookup: given a key, in the vast majority of cases a lookup is done in \(O(1)\). The actual time depends on how fast the hash function is. Collisions are the main hurdle to lookup: in the worst case, all keys collide, meaning we have to iterate over all the elements to find ours. In that case, complexity is \(O(n)\). Luckily, a good hash function ensures that collisions are very rare.
Insertion: for similar reasons, the amortized time complexity is \(O(1)\) and the worst case is \(O(n)\).
Deletion: for similar reasons, the amortized time complexity is \(O(1)\) and the worst case is \(O(n)\).
Resizing: Python doubles the capacity of a dictionary when it becomes 2/3 full. Similarly, the capacity of a set gets quadrupled when it becomes 2/3 full. When such a thing happens, all key-value pairs need to be relocated into their new buckets. This is a pretty expensive step, albeit very infrequent, which keeps the amortized insertion complexity at \(O(1)\).
Putting it all together, this is a very rough implementation of a dictionary:
from collections.abc import Hashable
from typing import Any
class Dictionary:
"""
A dictionary implementation using linear probing for collision resolution.
"""
def __init__(self, capacity: int = 1024) -> None:
"""
Initialize the Dictionary with a given capacity.
Parameters:
- capacity (int): The initial capacity of the dictionary.
Returns:
- None
"""
# virgin buckets are set to None, deleted buckets to False
self.__buckets = [None for _ in range(capacity)]
self.__size = capacity
self.__n_items = 0
def __setitem__(self, key: Hashable, value: Any) -> None:
"""
Set the value for a given key in the dictionary.
Parameters:
- key (Hashable): The key to be inserted.
- value (Any): The value associated with the key.
Returns:
- None
"""
idx = self.__find_key_bucket(key)
# only count the item when the key is new, not on updates
if not isinstance(self.__buckets[idx], tuple):
    self.__n_items += 1
self.__buckets[idx] = (key, value)
self.__resize_check()
def __getitem__(self, key: Hashable) -> Any:
"""
Get the value associated with a given key from the dictionary.
Parameters:
- key (Hashable): The key whose value needs to be retrieved.
Returns:
- Any: The value associated with the given key.
Raises:
- KeyError: If the key is not found in the dictionary.
"""
idx = self.__find_key_bucket(key)
return self.__buckets[idx][1]
def __contains__(self, key: Hashable) -> bool:
"""
Check if the dictionary contains a given key.
Parameters:
- key (Hashable): The key to check for existence in the dictionary.
Returns:
- bool: True if the key is present in the dictionary, False otherwise.
"""
try:
idx = self.__find_key_bucket(key)
except KeyError:
return False
return bool(self.__buckets[idx])
def __delitem__(self, key: Hashable) -> None:
"""
Delete the entry for a given key from the dictionary.
Parameters:
- key (Hashable): The key to be deleted.
Returns:
- None
Raises:
- KeyError: If the key is not found in the dictionary.
"""
idx = self.__find_key_bucket(key)
self.__buckets[idx] = False
self.__n_items -= 1
def __resize_check(self) -> None:
"""
Check if resizing of the dictionary is necessary based on the load factor (2/3).
Returns:
- None
"""
if self.__n_items < (self.__size * 2 / 3):
return
new_size = 4 * self.__size
dummy_dict = Dictionary(new_size)
# reinsert all existing key-value pairs
for bucket in self.__buckets:
if not bucket:
continue
key, value = bucket
dummy_dict[key] = value
self.__size = new_size
self.__buckets = dummy_dict.__buckets
self.__n_items = dummy_dict.__n_items
def __find_key_bucket(self, key: Hashable) -> int:
"""
Find the index of the bucket corresponding to a given key.
Parameters:
- key (Hashable): The key to be found.
Returns:
- int: The index of the bucket.
Raises:
- KeyError: If the key is not found in the dictionary.
"""
idx = hash(key) & (self.__size - 1)
n_iters = 0
# find an empty bucket (either None or False)
while self.__buckets[idx]:
# stop once we have checked all buckets
if n_iters >= self.__size:
raise KeyError
if isinstance(self.__buckets[idx], tuple):
if self.__buckets[idx][0] == key:
break
idx += 1
n_iters += 1
if idx >= self.__size:
idx = 0
return idx
Let’s see it in action:
# under each command, I show the internal bucket state
my_dict = Dictionary(4)
# [None, None, None, None]
my_dict[1] = "a"
# [None, (1, 'a'), None, None]
my_dict[2] = "b"
# [None, (1, 'a'), (2, 'b'), None]
# inserting a third item makes it 75% full, triggering a resize
my_dict[5] = "c"
# [None, (1, 'a'), (2, 'b'), None, None, (5, 'c'), None,
# None, None, None, None, None, None, None, None, None]
my_dict[140] = "d"
# [None, (1, 'a'), (2, 'b'), None, None, (5, 'c'), None,
# None, None, None, None, None, (140, 'd'), None, None, None]
my_dict[1] = 2
# [None, (1, 2), (2, 'b'), None, None, (5, 'c'), None, None,
# None, None, None, None, (140, 'd'), None, None, None]
1 in my_dict # True
del my_dict[1]
# [None, False, (2, 'b'), None, None, (5, 'c'), None, None,
# None, None, None, None, (140, 'd'), None, None, None]
1 in my_dict # False
my_dict[1] = "x"
# [None, (1, 'x'), (2, 'b'), None, None, (5, 'c'), None,
# None, None, None, None, None, (140, 'd'), None, None, None]
my_dict[1] # x
my_dict[2] # b
Dictionaries have two star operators: *
and **
. Let’s see how they work:
ingredients = {
"carrots": 3,
"tomatoes": 2,
"lettuces": 1,
}
print({*ingredients})
{'tomatoes', 'carrots', 'lettuces'}
*
unpacked the keys, which went into a set.
print({**ingredients})
{'carrots': 3, 'tomatoes': 2, 'lettuces': 1}
**
unpacked the key-value pairs, which went into a new dictionary.
Since Python 3.6 (and as a language guarantee since 3.7), Python dictionaries preserve insertion order, i.e., the items are printed in the same order in which they were inserted in the dictionary:
ingredients = {
"carrots": 3,
"tomatoes": 2,
"lettuces": 1,
}
print(ingredients)
{'carrots': 3, 'tomatoes': 2, 'lettuces': 1}
ingredients = {
"lettuces": 1,
"tomatoes": 2,
"carrots": 3,
}
print(ingredients)
{'lettuces': 1, 'tomatoes': 2, 'carrots': 3}
Since Python 3.9, there are three ways of merging two dictionaries. Two of them are equivalent: unpacking using the **
operator and the merge operator |
:
dairy_1 = {
"cheese": 5,
"yogurt": 4
}
dairy_2 = {
"cheese": 3,
"paneer": 2
}
{**dairy_1, **dairy_2}
{'cheese': 3, 'yogurt': 4, 'paneer': 2}
dairy_1 | dairy_2
{'cheese': 3, 'yogurt': 4, 'paneer': 2}
These options create a new dictionary with all the key-value pairs. As one might expect, the key insertion order is preserved from left to right. Note that when there are shared keys, the last value is kept.
An alternative is dict.update()
, which merges the dictionaries in place, updating the values when a key is shared:
dairy_1.update(dairy_2)
print(f"dairy_1 = {dairy_1}")
print(f"dairy_2 = {dairy_2}")
dairy_1 = {'cheese': 3, 'yogurt': 4, 'paneer': 2}
dairy_2 = {'cheese': 3, 'paneer': 2}
This is more memory efficient, since it does not create a new dictionary. However, it is not desirable if we want to keep the original dictionaries.
dict.setdefault
to set and fetch valuesThe dict.setdefault
method is useful to assign a value to a key if and only if the key is missing:
ingredients = {
"carrots": 3,
"tomatoes": 2,
"lettuces": 1,
}
ingredients.setdefault("carrots", 0)
ingredients.setdefault("pineapples", 0)
print(f"Number of carrots: {ingredients['carrots']}")
print(f"Number of pineapples: {ingredients['pineapples']}")
Number of carrots: 3
Number of pineapples: 0
However, and despite its name, dict.setdefault
will also fetch the value (either the preexisting one, or the newly created):
carrots = ingredients.setdefault("carrots", 0)
print(f"Number of carrots: {carrots}")
Number of carrots: 3
defaultdict
when there is a single default valueThe collections.defaultdict
goes one step beyond. They are a good replacement for dictionaries when there is a unique default value. Its first argument is a function which returns the default value. It will be called if and only if the key is missing:
from collections import defaultdict
ingredients_dd = defaultdict(lambda: 0)
for ingredient, amount in ingredients.items():
ingredients_dd[ingredient] = amount
print(ingredients_dd)
print(f"Number of cabbages: {ingredients_dd['cabbages']}")
print(ingredients_dd)
defaultdict(<function <lambda> at 0x1014eb6d0>, {'carrots': 3, 'tomatoes': 2, 'lettuces': 1, 'pineapples': 0})
Number of cabbages: 0
defaultdict(<function <lambda> at 0x1014eb6d0>, {'carrots': 3, 'tomatoes': 2, 'lettuces': 1, 'pineapples': 0, 'cabbages': 0})
Note that the ingredients_dd
contains an item for cabbages which was never explicitly inserted. defaultdict
not only allows us to write simpler code, but is also more efficient than setdefault
, since it avoids unnecessary calls to the default factory. For instance, ingredients.setdefault("carrot", set())
would instantiate a new set even if the key carrot
already exists; defaultdict
would avoid that call.
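A classic use case is grouping items, passing list as the default factory:

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# missing keys get a fresh empty list, so we can append right away
groups = defaultdict(list)
for word in words:
    groups[word[0]].append(word)

print(dict(groups))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```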
collections.Counter
to countThe collections.Counter
is a type of dictionary specialized in counting objects, i.e., the values are integers. It can be initialized from an existing dictionary:
from collections import Counter
ingredients_counter = Counter(ingredients)
print(ingredients_counter)
Counter({'carrots': 3, 'tomatoes': 2, 'lettuces': 1, 'pineapples': 0})
By default missing keys have a value of 0, but they are not inserted:
print(f"Number of cabbages: {ingredients_counter['cabbage']}")
print(ingredients_counter)
Number of cabbages: 0
Counter({'carrots': 3, 'tomatoes': 2, 'lettuces': 1, 'pineapples': 0})
Counters extend dictionaries in interesting ways. For instance, they make it easy to find the elements with the most counts:
ingredients_counter.most_common(1)
[('carrots', 3)]
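Counters also support arithmetic that plain dictionaries lack; a small sketch (the pantry and recipe contents are invented for illustration):

```python
from collections import Counter

pantry = Counter({"carrots": 3, "tomatoes": 2})
recipe = Counter({"carrots": 1, "tomatoes": 1, "onions": 1})

# subtraction keeps only the positive counts
left_over = pantry - recipe
print(left_over)   # Counter({'carrots': 2, 'tomatoes': 1})
```

Note that "onions" is dropped entirely: Counter subtraction discards zero and negative counts.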
Since tuples are hashable objects, they can be used as keys:
# store ingredients and purchase date
ingredients = {
("carrots", "2024-01-04"): 3,
("tomatoes", "2024-01-13"): 2,
("carrots", "2024-01-13"): 1,
}
Then, composite keys are used like this:
ingredients["carrots", "2024-01-13"]
1
zip to create dictionaries from lists
When we have two lists of the same length, we can quickly combine them into a dictionary using zip:
ingredient_list = ["carrots", "tomatoes", "lettuces"]
counts = [3, 2, 1]
ingredients = dict(zip(ingredient_list, counts))
print(ingredients)
{'carrots': 3, 'tomatoes': 2, 'lettuces': 1}
In plain terms, the CPU is the part of the computer that carries out the computations themselves. It does so in a stream of discrete operations, called CPU instructions. Roughly, one instruction is computed for each clock cycle: a 3 GHz CPU executes 3 billion instructions each second. CPUs are fast.
The RAM is the (temporary) memory of the computer, where the data lives. We can picture it as a grid of buckets, each of which can contain 1 byte of information. (From now on, I will use the terms “bytes” and “buckets” interchangeably, depending on whether I want to emphasize the metaphor or the data.) The grid aspect of it is important:
One byte, 8 bits, can take 256 different values. In consequence, it can store an integer from 0 to 255. If we need to store a larger value, we need to use multiple bytes.
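We can see this directly in Python with int.to_bytes: 255 fits in a single byte, but 1000 needs two:

```python
# 255 is the largest value a single byte can hold
assert (255).to_bytes(1, byteorder="little") == b"\xff"

# 1000 does not fit in one byte; it spans two buckets
b = (1000).to_bytes(2, byteorder="little")
assert len(b) == 2
assert int.from_bytes(b, byteorder="little") == 1000
```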
The CPU can interact with the RAM in a limited number of ways:
When we execute a program, it interacts with the operating system's kernel to handle its resources. The memory assigned to a program is split into two components: the stack and the heap. They work differently, and serve different purposes.
The stack is the memory that serves as scratch space for the program. It handles function calls, local variables and context. It works, appropriately, as a stack: a Last-In-First-Out (LIFO) structure with two operations, push and pop. For instance, when a function is called, a new block (a frame) is pushed on top of the stack. This frame will contain local variables and other information. When the function returns, the frame is popped.
The heap is the memory set aside for memory allocation: when a program needs more memory, it places a request to the kernel, which will allocate the requested amount from the heap, i.e., reserve it for the program’s use. Once the program does not need that chunk anymore, it will deallocate it, i.e., hand it back to the operating system. Unlike the stack, allocations and deallocations on the heap do not follow a specific order. Instead, it is the task of the program to keep track of what data is stored where, which parts are still used, and which ones are not and can be deallocated. Specifically, many programming languages have a routine called the garbage collector. The garbage collector monitors which pieces of data won’t be needed anymore, and periodically deallocates them. Note that allocations, deallocations and garbage collector’s runs are expensive.
Reading a byte directly from RAM takes around 500 CPU cycles. This can be sped up by copying the data to the CPU’s cache, which is closer to the CPU and much faster, but can only hold a small amount of data. When the CPU needs a piece of data, it first checks whether it is already available in the cache. In that case, retrieving it just takes a couple of cycles. If it is not, we are in a situation known as a cache miss, in which the program can’t proceed until the required data is retrieved. The amount of time lost in a single cache miss is minuscule; however, if cache misses are common in a program, they quickly add up.
CPUs have a component, called the prefetcher, which tries to mitigate cache misses. To achieve that, it actively tries to predict which pieces of data the CPU will need in the near future, and preemptively copies them to the cache. For instance, it assumes that data that lives together works together: when the CPU needs data stored in a particular memory address, the prefetcher fetches the whole cache line. Also, when data is accessed in a predictable manner, the prefetcher will learn and exploit this pattern.
The registers are the small-sized slots the CPU acts on. Traditionally they were 8 bytes in size, just enough for one 64-bit integer or float. For instance, adding two such numbers requires the CPU to use two registers. Modern CPUs, however, have specialized larger registers, of up to 64 bytes. While they can hold massive 512-bit numbers, it is more interesting to have them hold vectors, e.g., of eight 64-bit numbers or sixteen 32-bit numbers. This unlocks efficient vectorized operations, or single instruction, multiple data (SIMD).
Similarly to the computer’s memory, lists and tuples can be visualized as a sequence of equally-sized buckets. Each bucket can store a fixed-length integer (e.g., 64 bits in modern computers), representing the memory address of an object. The buckets are located consecutively in memory, in a data structure known as an array. When Python instantiates an array, it will request \(N\) consecutive buckets from the kernel. Out of those, the first bucket stores the length of the array, and the remaining \(N - 1\) will store the elements. However, lists are stored in so-called dynamic arrays, while tuples are stored in static arrays. Let’s explore why:
Lookup: since the buckets in the array are equally-sized and consecutive, we can quickly retrieve any item by knowing where the array starts and the index of its bucket. For instance, if our array starts at bucket index 1403, and our bucket is index 5 within the array, we simply need to go to bucket index 1408. Hence, accessing a given index is \(O(1)\).
Search: if we need to find a particular object in an unsorted array, we need to perform a linear search. This algorithm has a complexity \(O(n)\). If the array has been sorted, we can use binary search, which is \(O(\log n)\).
Sort: Python uses Timsort, a combination of heuristics, and insertion and merge sort. Best case is \(O(n)\), worst case is \(O(n \log n)\).
Insertion: we can replace an existing element in \(O(1)\). Appending a new element at the end with append() is also \(O(1)\) in most cases; however, when the list’s underlying array is full (the worst case), it must be resized, and the insertion becomes \(O(n)\).
Deletion: removing the last element with pop() is \(O(1)\); deleting an arbitrary element with del is \(O(n)\), since all subsequent elements must be shifted.
Insertion: though tuples are immutable, we can consider the combination of two tuples into a longer one as an insertion operation. If they have sizes \(m\) and \(n\), each item needs to be copied to the new tuple. Hence, the complexity is \(O(m+n)\).
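The search complexities above can be illustrated with the standard library's bisect module, which implements binary search over sorted lists:

```python
import bisect

x = [3, 8, 15, 23, 42]   # binary search requires a sorted list

# locate 23 in O(log n)
i = bisect.bisect_left(x, 23)
assert i == 3 and x[i] == 23

# a missing value returns its insertion point
assert bisect.bisect_left(x, 10) == 2
```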
The list.sort
method orders a list’s elements in ascending order. It will work as long as the items have defined the <
comparison operator, as is the case for floats, integers and strings. However, in some cases that operator might not be implemented, or might not be making the comparison that we care about. The key
argument can be helpful in those cases:
class Animal:
def __init__(self, name, weight):
self.name = name
self.weight = weight
def __repr__(self):
return f"Animal({self.name}, {self.weight})"
animals = [
Animal("whale", 100000),
Animal("sea lion", 200),
Animal("lion", 200),
Animal("possum", 2.5)
]
# sort by weight
animals.sort(key = lambda x: x.weight)
print(animals)
[Animal(possum, 2.5), Animal(sea lion, 200), Animal(lion, 200), Animal(whale, 100000)]
# sort by name
animals.sort(key = lambda x: x.name)
print(animals)
[Animal(lion, 200), Animal(possum, 2.5), Animal(sea lion, 200), Animal(whale, 100000)]
As shown, key
takes a function which will receive an item, and output a comparable value. If we want to order first by weight, then by name, we just need to combine both in a tuple:
# sort by weight, then by name
animals.sort(key = lambda x: (x.weight, x.name))
print(animals)
[Animal(possum, 2.5), Animal(lion, 200), Animal(sea lion, 200), Animal(whale, 100000)]
When comparing tuples, Python first compares the initial elements. If they are equal, it then compares the second one, and so on.
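A couple of quick checks illustrate this element-by-element comparison (values taken from the animals above):

```python
# first elements differ: they alone decide the ordering
assert (2.5, "possum") < (200, "lion")

# first elements tie: the second ones (the names) break the tie
assert (200, "lion") < (200, "sea lion")
```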
We can use unpacking to swap multiple elements of a list in place, without requiring additional temporary variables:
x = [1, 2, 3]
x[1], x[0] = x[0], x[1]
print(x)
[2, 1, 3]
In some cases we might want to unpack a tuple whose length we don’t know a priori. In such cases, we can use a starred expression, which will receive all the values that are not captured by another item:
x = [1, 2, 3, 4, 5]
first, second, *middle, last = x
print(first, second, middle, last)
1 2 [3, 4] 5
If there is nothing left to unpack, the starred item will become an empty list:
x = [1, 2, 3]
first, second, *middle, last = x
print(first, second, middle, last)
1 2 [] 3
Starred elements cannot be used without non-starred elements (e.g., *all = x
); multiple starred expressions cannot be used together either (e.g., first, *middle1, *middle2 = x
).
collections.deque for queuing problems
When dealing with queuing problems, we can use collections.deque, a double-ended queue that supports efficient appends and pops at both ends:
from collections import deque
q = deque()
q.append(1)
q.append(2)
q.appendleft(3)
q.appendleft(4)
print(q)
deque([4, 3, 1, 2])
q.pop()
2
q.popleft()
4
A deque is more efficient than a list when dealing with queuing problems: appends and pops are \(O(1)\) at both ends, whereas inserting or popping at the front of a list is \(O(n)\).
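One handy feature worth knowing: deque accepts a maxlen argument, which turns it into a fixed-size sliding window:

```python
from collections import deque

# deque(maxlen=n) keeps only the n most recent items,
# which makes it handy for sliding-window problems
window = deque(maxlen=3)
for i in range(5):
    window.append(i)

print(window)   # deque([2, 3, 4], maxlen=3)
```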
collections.ChainMap
It is often said that in Python, everything is an object: builtins, functions, classes, instances, etc. Thus, improving our understanding of objects is key to mastering Python. In this first post I explore some general concepts related to objects.
Simply put, an object is a data structure with an internal state (a set of variables) and a behaviour (a set of functions). The “class” is the template to create new objects or instances. New classes are defined using the class keyword, and new objects are instantiated by calling the class name.
Every object has, at least, three properties: a reference, a class, and a refcount.
A reference is a pointer, a way to access the memory address that stores the object. It can be associated to a name, or an element in a collection:
# create a new integer object, and
# copy its reference to the name "a"
a = 1
# create a new integer object, and
# append its reference to the list "x"
x = list()
x.append(1)
We can retrieve the memory address using id()
, represented as an integer:
id(a)
4342270592
Note that the assignment operator (=
) never makes a copy of the value being assigned, it just copies the reference. Similarly, the del
operator never deletes an object, just a reference to it. We can check if two names point to the same memory location using is
:
x = [1, 2, 3]
y = [1, 2, 3]
z = x
# same reference?
assert x is z
assert id(x) == id(z)
assert x is not y
# same value?
assert x == y
A class is the type of the object (e.g., a float, or a str). Each object contains a pointer to its class, as we will see below. We can know an object’s class using the type
builtin:
type(1)
<class 'int'>
type("1")
<class 'str'>
type(1.)
<class 'float'>
Similarly, we can check if an object is an instance of a given class:
assert isinstance(1, int)
assert isinstance("1", str)
assert isinstance(1., float)
The refcount is a counter that keeps track of how many references point to an object. Its value gets increased by 1 when, for instance, an object gets assigned to a new name. It gets decreased by 1 when a name goes out of scope or is explicitly deleted (del
). When the refcount reaches 0, its object’s memory will be reclaimed by the garbage collector.
In principle, we can access the refcounts of a variable using sys.getrefcount
:
import sys
x = []
sys.getrefcount(x)
2
Note that the output of getrefcount is always increased by 1, as the function itself holds a reference to the variable. Let’s see another example:
sys.getrefcount(1)
218
I expected that a newly created integer would have a refcount of 1. However, the actual number is much higher. The explanation is that CPython caches small integers (from -5 to 256) and reuses them everywhere, so many references to 1 already exist. Accordingly, less common integers have smaller refcounts:
assert sys.getrefcount(123456) < sys.getrefcount(2)
On top of these three properties, objects have additional properties and methods that encode their state and behaviors. For instance, the float
class has an additional property that stores the numerical value, as well as multiple methods that enable algebraic operations. A user-defined object will have an arbitrary number of attributes and methods.
Notably, the None
object has no other properties. It is a singleton: only one such object exists:
a = None
b = None
# same object, despite independent assignments
assert a is b
Objects are first-class citizens in Python. In other words, they can be assigned to names, passed as arguments, and returned from functions:
def pretty_print(x: str):
print(x.title() + ".")
pp = pretty_print
pp("hey there")
Hey there.
from typing import Callable
def format(x: str, formatter: Callable[[str], None]):
formatter(x)
format("hey there", pretty_print)
Hey there.
def formatter_factory():
return pretty_print
formatter_factory()("hey there")
Hey there.
As mentioned above, =
does not copy objects, only references. If we need to copy an object, we need to use the copy
module. There are two kinds of copies:
copy.copy copies the object, but any reference it stores just gets copied as a reference, i.e., not the whole referenced object.
copy.deepcopy recursively copies the object, all the objects it references, and so on.
from copy import copy
x = [1, 2, [3, 4]]
# copies the two first integers, but only
# the reference to the 3rd element
y = copy(x)
x[2].append(5)
print(y[2])
[3, 4, 5]
from copy import deepcopy
x = [1, 2, [3, 4]]
# copies the two first integers as
# well as the list
y = deepcopy(x)
x[2].append(5)
print(y[2])
[3, 4]
Python allows us to define our own classes:
class Animal:
phylum = "metazoan"
def __init__(self, name, weight):
self.name = name
self.weight = weight
self.__favorite = True
def eat(self):
self.weight += 1
print("chompchomp")
Below I zoom in on some interesting features.
In other languages, a class’ attributes can be set as protected (only accessible within the class and subclasses) or as private (only accessible within the class). While you can always modify attributes from the outside in Python, the language emulates protected and private attributes by prepending one or two underscores respectively:
whale = Animal("whale", 100000)
whale.__favorite # a private attribute
AttributeError: 'Animal' object has no attribute '__favorite'
If we want to access that attribute, we need to put some extra effort:
print(whale._Animal__favorite)
True
However, and rather confusingly, this is valid, because name mangling only applies inside the class body; outside it, whale.__favorite simply creates a brand new attribute:
whale.__favorite = False
print(whale._Animal__favorite)
print(whale.__favorite)
True
False
Two dictionaries underlie each object, and are accessible using instance.__dict__
and Class.__dict__
. The first one is the instance-specific dictionary, unique to that instance and containing its writable attributes:
whale = Animal("whale", 100000)
print(whale.__dict__)
{'name': 'whale', 'weight': 100000, '_Animal__favorite': True}
Note that private attributes like __favorite
appear with an altered name of the form _{class name}{attribute}
.
Similarly, each class has its own dictionary, containing the data and functions used by all instances (class’ methods, the attributes defined at the class level, etc.):
Animal.__dict__
mappingproxy({'__module__': '__main__', 'phylum': 'metazoan', '__init__': <function Animal.__init__ at 0x103236b90>, 'eat': <function Animal.eat at 0x103236c20>, '__dict__': <attribute '__dict__' of 'Animal' objects>, '__weakref__': <attribute '__weakref__' of 'Animal' objects>, '__doc__': None})
For instance, this is where the Animal.eat()
method lives. This dictionary is shared by all the instances, which is why every non-static method requires the instance to be passed as the first argument. Under the hood, when we call an instance’s method, Python finds the method in the class dictionary and passes the instance as first argument. But we can also do it explicitly:
Animal.__dict__["eat"]()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Animal.eat() missing 1 required positional argument: 'self'
Animal.__dict__["eat"](whale)
chompchomp
Both dictionaries are linked by instance.__class__
, which is assigned to the class object:
assert whale.__class__.__dict__ == Animal.__dict__
As we saw, an attribute might exist in either dictionary. To find an attribute at runtime, Python will first search instance.__dict__
, and then Class.__dict__
if unsuccessful.
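We can verify this lookup order with a minimal sketch (a stripped-down Animal class, for illustration):

```python
class Animal:
    phylum = "metazoan"   # lives in Animal.__dict__

whale = Animal()
assert "phylum" not in whale.__dict__      # found via the class dictionary
assert whale.phylum == "metazoan"

whale.phylum = "chordate"                  # now shadows the class attribute
assert whale.__dict__["phylum"] == "chordate"
assert Animal.phylum == "metazoan"         # the class dictionary is untouched
```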
__slots__
helps with memory optimizationThe instance’s dictionary keeps the class flexible, since we can add new attributes at any time:
whale.medium = "water"
print(whale.__dict__)
{'name': 'whale', 'weight': 100000, '_Animal__favorite': True, 'medium': 'water'}
__slots__
allows us to fix the possible attributes a priori, allowing Python to reserve the exact amount of memory needed and to bypass the creation of the dictionary:
class EfficientAnimal:
__slots__ = ["name", "weight", "__favorite"]
phylum = "metazoan"
def __init__(self, name, weight):
self.name = name
self.weight = weight
self.__favorite = True
dog = EfficientAnimal("dog", 10)
dog.__dict__
AttributeError: 'EfficientAnimal' object has no attribute '__dict__'. Did you mean: '__dir__'?
In addition to the memory optimizations, this approach also helps to prevent bugs caused by typos in variable names:
dog.namme = "puppy"
AttributeError: 'EfficientAnimal' object has no attribute 'namme'
__slots__
: https://wiki.python.org/moin/UsingSlots
The main selling point of NumPy is the speed-up in computations it offers. However, to understand that, we need to first understand the underpinnings of the NumPy array, or ndarray
. The ndarray
is made up of two components: an array (the data buffer) containing the data, and the metadata, containing information about the data buffer.
The data buffer is an array of elements of the same type:
import numpy as np
# ndarrays can contain multiple data types, e.g.,
# integers
np.array([1, 2, 3])
# floats
np.array([1., 2., 3.])
# 1-character string
np.array(["1", "2", "3"])
# Python references - which, as we will see soon,
# greatly undoes the benefits of NumPy
np.array([int, float, str])
Let’s compare this to a Python builtin: a list containing floats. Remember that each of the 64-bit buckets does not store the float, but a 64-bit reference to the float object. This keeps lists flexible, since each bucket can contain a reference to any object, float or not. But it comes with a memory overhead since, on top of the reference, we need to store the object itself, which is more complex than a naked float value. In contrast, an ndarray
stores only the floats themselves, requiring less than 25% of the memory. Furthermore, in the list of floats, the data is fragmented: the list itself, and all the objects it references are scattered across memory. In contrast, the whole data buffer lives in a single block of memory.
The metadata includes important information about the data buffer. We can access the metadata like this:
x = np.array([1, 2, 3, 4, 5, 6])
x.__array_interface__
{'data': (105553130143936, False),
'descr': [('', '<i8')],
'shape': (6,),
'strides': None,
'typestr': '<i8',
'version': 3}
For instance, each ndarray
has a data type, or dtype
, which specifies what type of elements it contains (e.g., float64
, float16
or int32
). If we need further memory savings, we can consider reducing the numerical precision:
np.array([1., 2., 3.], dtype = "float16").nbytes # 6
np.array([1., 2., 3.], dtype = "float64").nbytes # 24
Vectorization refers to performing multiple computations at once. As an example:
# non-vectorized
xs = [1, 2, 3]
ys = [4, 5, 6]
# we sequentially iterate the
# pairs and compute the sum
[x + y for x, y in zip(xs, ys)]
[5, 7, 9]
# vectorized
import numpy as np
xs = np.array(xs)
ys = np.array(ys)
# all 3 sums happen at once
xs + ys
array([5, 7, 9])
Importantly, a vectorized operation is vastly faster than its explicit for
loop counterpart. To understand why, we need to take a step back and look at how the CPU and the RAM interact.
First, Python’s native data structures are highly fragmented: the list is an object with an attribute containing a reference to an array, which in turn stores references to float objects. The object, the array and the floats are scattered across memory. This severely hampers the prefetcher, and cache misses are common. (Potentially, this could be solved by arrays
. But they seem to have their own downsides.) Keeping data together, as ndarrays
do, leads to fewer cache misses. (Additional gains are possible by reducing the number of cache lines an array spans, e.g., by aligning their beginning to the memory grid. So far I have found some indications, but no strong sources, supporting that NumPy attempts this too.)
Second, Python is dynamically typed. This means that every operation between two numbers becomes a complex interaction between two heavy data structures. Internally, Python needs to find out the types of the objects, recover their values, run the computation, and store the result in a new object. Statically typed languages avoid much of this overhead.
Vectorization has another meaning in hardware. Specifically, it refers to SIMD, the ability of the CPU to handle multiple numbers in a single instruction. CPython does not leverage SIMD, nor does it give us access to it. However, many NumPy functions are implemented in low-level languages that take advantage of this instruction set, leading to even faster code. We won’t benefit from this optimization, though, when using Python-implemented vector operations, for instance custom transformations of our data.
By default, ndarrays
store matrices in a row-major order, that is, as a concatenation of the rows of the matrix. In other words, elements from the same row live close together (sharing cache lines), but elements in the same column might live far apart. Since retrieving one element copies its whole cache line into the CPU cache, row operations are fast. In contrast, column operations are slow, since they require fetching as many cache lines as there are rows. Let’s see one example:
import numpy as np
import time
# create a large matrix
n_rows = 100_000
n_cols = 100_000
# the default order is "C",
# which confusingly refers to row-major
matrix = np.ones((n_rows, n_cols), order = "C", dtype = "int8")
# time row operation
start_time = time.time()
_ = matrix.sum(axis = 1) # sum the columns row-wise
end_time = time.time()
row_time = end_time - start_time
print("Time taken for the row operation:", row_time)
# time column operation
start_time = time.time()
_ = matrix.sum(axis = 0) # sum the rows col-wise
end_time = time.time()
col_time = end_time - start_time
print("Time taken for the column operation:", col_time)
Time taken for the row operation: 8.663719177246094
Time taken for the column operation: 9.641089916229248
(Note that ndarray.sum()
calls an efficient, low-level function. The gap is much larger for most user-defined functions.)
Consistently, shifting to column-major order (order = "F") produces the opposite result. Carefully considering the operations we will be carrying out can have a major impact.
NumPy introduced an important but tricky concept: data views. A view is just a new way to access the data buffer of an existing ndarray
, with different metadata. Some operations produce views, like basic indexing (i.e., using single indexes and slices):
x = np.array([1, 2, 3, 4, 5, 6])
y = x[:2]
# x and y point to the same data buffer
assert x.__array_interface__['data'][0] == y.__array_interface__['data'][0]
We can check if an ndarray is a view using the base attribute:
assert x.base is None
assert y.base is not None
assert y.base is x
Other operations return a copy of the data:
x = np.array([1, 2, 3, 4, 5, 6])
# advanced indexing
# (e.g., integer or boolean arrays)
y = x[[0, 1]]
z = x[x > 4]
assert y.base is None
assert z.base is None
# arithmetic operations
y = x + 1
assert y.base is None
Other functions, like numpy.reshape
or numpy.ravel
, will produce a view whenever possible and a copy otherwise.
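A quick check of both behaviors, using the base attribute introduced above:

```python
import numpy as np

x = np.arange(6)

# reshaping a contiguous array only changes metadata: it is a view
y = x.reshape(2, 3)
assert y.base is x

# raveling a transposed (non-contiguous) array cannot reuse
# the buffer, so it triggers a copy
z = y.T.ravel()
assert z.base is None
```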
Lastly, note that some operations happen in-place:
x = np.array([1, 2, 3, 4, 5, 6])
pointer_1 = x.__array_interface__['data'][0]
x += 1
pointer_2 = x.__array_interface__['data'][0]
assert pointer_1 == pointer_2
assert x.base is None
Let’s see some specific cases.
Casting is the conversion of data from one type into another. The simplest form of casting is simply changing the type of an ndarray
:
x = np.array([1, 2, 3], dtype = "int8") # x.dtype is int8
y = x.astype("float16")
In general, casting triggers a copy:
assert y.base is None
Casting commonly occurs when performing arithmetic operations between different types. In those cases, NumPy picks the smallest type that can safely represent both operands without losing precision. For instance, x + y
involves int8
and float16
. Since the latter can represent the former, the type of the sum will be float16
.
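We can verify these promotion rules directly:

```python
import numpy as np

x = np.array([1, 2, 3], dtype="int8")
y = np.array([1., 2., 3.], dtype="float16")

# float16 can represent every int8 value: the sum is float16
assert (x + y).dtype == np.float16

# two integer types promote to the larger integer type
a = np.array([1], dtype="int8")
b = np.array([1], dtype="int32")
assert (a + b).dtype == np.int32
```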
We can explicitly trigger a reinterpretation of the data buffer under another type using ndarray.view()
. Note that this produces a view, not a re-casting of the original ndarray
. Let’s see an example:
x = np.array([1, 2], dtype = "uint8")
binary_repr = ''.join(format(byte, '08b') for byte in x.data.tobytes())
print(binary_repr)
0000000100000010
This is what our array looks like in memory. We can see the binary encoding as an np.uint8
of 1 (00000001
) and of 2 (00000010
). ndarray.view()
can reinterpret this string of 16 bits as an np.int16
:
assert x.view("int16").byteswap()[0] == int(binary_repr, base = 2) # 258
Ignore the byteswap() call; it accounts for the fact that, on little-endian machines, the least significant byte is stored first.
Understanding how exposures, such as gene expression, cause complex phenotypes is a key question in biology. While randomized controlled trials are the gold standard for establishing causality, their use on humans is unethical, necessitating the use of observational studies. Crucially, observational studies cannot directly prove causality: they can produce biased estimates and be affected by unknown confounders and reverse causality. Interestingly, SNPs are generally equivalent to a randomized treatment, as they are randomized at conception and fixed throughout life, making them unaffected by confounders (though LD and population structure are exceptions to the former, and canalization to the latter). Hence, when this is true, we can indeed establish causality. This is the case between siblings, for which the inheritance of each SNP is completely randomized. However, in the general population, the distribution of variants is not completely random. For instance, due to assortative mating, people with similar heritable phenotypes will tend to get together. In general, traits that are more genetically proximal (e.g., molecular traits like gene expression) are less prone to these biases.
Mendelian Randomization (MR) studies the causal effect of an exposure on an outcome using genetic variants (SNPs). When certain assumptions are met (Key assumptions and how to verify them), it can address the aforementioned issues. MR leverages this directed acyclic graph (DAG).
Our goal is to precisely estimate the effect of an exposure on a trait. However, the presence of confounders that are either unobserved or difficult to measure makes this impossible. An instrument, such as the SNP, is a variable that modifies the exposure and is not affected by the confounders. Since there are no backdoor paths between the SNP and the outcome, any effect between them must occur through the exposure.
In mathematical terms, we aim to estimate the effect of an exposure $X$ on an outcome $Y$ using an instrument $Z$. Since $Z$ is a SNP, it will usually take values in 0, 1 and 2 (the number of minor alleles). Assume some potential confounders affecting both $X$ and $Y$. The most basic MR protocol assumes linearity, and consists of three steps:
The estimated causal effect $\beta_{\hat{X}Y}$ is unconfounded! This is usually performed via two-stage least squares. Alternatively, we can obtain $\beta_{\hat{X}Y}$ without computing it explicitly via the Wald ratio estimator: $\beta_{\hat{X}Y} = \frac {\beta_{ZY}} {\beta_{ZX}}$. Note that the standard error of $\beta_{\hat{X}Y}$ also needs to be computed.
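The Wald ratio can be illustrated with a toy simulation (all effect sizes, the confounder U and the sample size below are invented for illustration): the naive regression of Y on X is biased by the confounder, while the ratio of the two SNP regressions recovers the true causal effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

Z = rng.binomial(2, 0.3, n)            # SNP: number of minor alleles (0, 1, 2)
U = rng.normal(size=n)                 # unobserved confounder
X = 0.4 * Z + U + rng.normal(size=n)   # exposure; true effect of Z on X is 0.4
Y = 0.5 * X + U + rng.normal(size=n)   # outcome; true causal effect of X is 0.5

# naive regression of Y on X is confounded by U (biased upwards)
beta_naive = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# Wald ratio: beta_ZY / beta_ZX recovers the causal effect
beta_zx = np.cov(Z, X)[0, 1] / np.var(Z, ddof=1)
beta_zy = np.cov(Z, Y)[0, 1] / np.var(Z, ddof=1)
beta_wald = beta_zy / beta_zx

print(round(beta_naive, 2), round(beta_wald, 2))
```

With a large sample, beta_wald lands close to the true 0.5, while beta_naive is pulled far above it by the confounder.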
MR relies on three assumptions:
Other common, but optional, assumptions are linearity and homogeneity (the effect does not vary across strata in the population, like SNP levels or sex).
Although these assumptions are sometimes violated, we have ways to detect such cases:
MR methods can be classified according to multiple criteria:
One-sample: the SNP, exposure and outcome are measured on the same samples.
Example: two-stage least squares.
Two-sample: the SNP-exposure relationship is measured on a set of samples, and the exposure-outcome on another one.
Example: Wald ratio estimate
MR provides an ethical way to find causal links between exposures and diseases.
Exposure: it can be simple (e.g., gene expression), or complex (e.g., body mass index).
Instruments: we will use as many SNPs associated with the exposure of interest as possible (e.g., a polygenic score). This will capture the full genetic architecture of the trait as well as violations of the assumptions.
In the clinical setting, our interest is learning about modifiable causes of disease, i.e., those that can be treated.
Exposure: typically, a protein, since it needs to be targetable by a small molecule, a monoclonal antibody, etc. The expectation is that it will affect a complex biomarker through vertical pleiotropy (e.g., drugs that lower blood LDL do not target LDL directly, but hamper LDL/cholesterol synthesis).
Instruments: SNPs affecting the gene/protein of interest, often cis-pQTL. Trans-pQTL can also be used, but they are more likely to partake in horizontal pleiotropy.
We can use multivariable MR to get tissue-specific exposures, and partition the effect on the phenotype. Two disclaimers are in order: GWAS hits often colocalize with eQTLs in multiple tissues; and expression control in the diseased state might not match that in the healthy state.
Multivariable MR studies the causal impact of multiple exposures. This allows to consider complex causal pathways, even those involving mediators.