Skip to content

4. Snippets and Recipes

Kamal Banga edited this page Aug 1, 2019 · 8 revisions

math mini-tour

from statistics import mean
from math import log, exp, factorial
data = [3, 5, 14, 20, 25]
geometric_mean = exp(mean(map(log, data)))

e = sum(1/factorial(x) for x in range(10)) # 2.718; base of natural log

Flattening a nested sequence

from collections import Iterable

def flatten(items):
    for x in items:
        if isinstance(x, Iterable):
            yield from flatten(x)
    else:
        yield x

items = [1, 2, [3, 4, [5, 6], 7], 8]

list(flatten(items)) # [1, 2, 3, 4, 5, 6, 7, 8]

here,

yield from flatten(x)

is equivalent to

for i in flatten(x):
    yield i

This recipe won't work for iterables of strings because str object is an Iterable too

def flatten(items):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            yield from flatten(x)
    else:
        yield x

For, names = ['Ram', ['John'], 'Shahid'], list(flatten(names)) gives ['Ram', 'John', 'Shahid']


Random Variables

What happens if you sum a bunch of random variables and then sample from it?

from random import random
from collections import Counter
rvs = Counter(round(sum(random() for _ in range(300))) for _ in range(1000))
for i in range(min(rvs), max(rvs) + 1):
  print('-' * rvs[i])

What does the plot 📊 look like? We take sum of uniform random numbers, round it to integer and finally plot histogram of frequencies of resulting integers multiple times. The plot closely follows Normal distribution (also called bell curve 🔔). This is the statement of the Central Limit Theorem.


Iterator tools 🐎 💨

itertools are a bunch of superb, fast, memory-efficient tools for iterators. For e.g., take combinations (nCk):

from itertools import combinations
import math
  1. Powerset: Powerset of a set S is the set of all subsets of S. In other words powerset is the set of all subsets of size k for all k and subsets of size k of a set is just a combination
powerset = chain.from_iterable(combinations(ls, r) for r in range(len(ls)+1))
  1. Pythagorean triplets: Triplets (a, b, c) such that a^2 + b^2 = c^2
for x, y, z in combinations(range(100, 3)):
  if math.gcd(x, y) == 1 and x**2 + y**2 == z**2:
    print(x, y, z)
  1. Sensitivity Conjecture: If you consider all n-bit words and take any set of more than half of them (i.e. at least 1+2^(n-1)), then this set will contain some word for which there are at least √n other words in the set that differ in exactly one bit position.

Verifying for n = 4,

n = 4
m = math.sqrt(n)

def degree(word, s):
    """How many other words differ in one bit position"""
    return sum(1 for w in s if bin(word ^ w).count('1') == 1)

words = range(2**n)  # All n-bit words
for s in combinations(words, 1 + 2 ** (n - 1)):
    assert any(degree(word, s) >= m for word in s)

The idea here is, to consider any set of more than half of all n-bit words you need to take a combination. Another way to code the same would be to use all instead of for-loop with an assert

all(any(degree(word, s) >= m for word in s) for s in combinations(words, 1 + 2 ** (n - 1)))

Classes

from math import pow

class Polynomial:
    def __init__(self, *coeffs):
        self.coeffs = coeffs

    def __add__(self, other):
        if not isinstance(other, Polynomial):
            raise TypeError(f'{other} is not an instance of Polynomial')
        new_coeffs = []
        for x, y in zip_longest(reversed(self.coeffs), reversed(other.coeffs), fillvalue=0):
            new_coeffs.append(x + y)
        return Polynomial(*list(reversed(new_coeffs)))

    def __repr__(self):
        return f'Polynomial({self.coeffs})'

    def __len__(self):
        return len(self.coeffs)

    def __call__(self, value):
        return sum(coeff * pow(value, idx)
                for idx, coeff in enumerate(reversed(self.coeffs)))
>>> p1 = Polynomial(1, 2, 7)
>>> p2 = Polynomial(5, 3, 0, 2)
>>> p1
Polynomial((1, 2, 7))
>>> print(len(p1))
3
>>> p1 + p2
Polynomial((5, 4, 2, 9))
>>> p2(4)
370.0

McIlroy vs Knuth and word counts

Jon Bentley had a regular column called “Programming Pearls” in the Communications of the ACM. In 1986 he got interested in literate programming, so he asked Donald Knuth to write a program in that style as a guest column and Doug McIlroy to write a literary-style critique of it. The program was

Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

Knuth's 10+ pages of Pascal program used a clever, purpose-built data structure for keeping track of the words and frequency counts. McIlroy replied with following script

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

Doing the same task in Python is trivial using the collections.Counter class. A Counter is a dict subclass where elements are stored as keys and their counts as values. The Counter class is similar to C++ multisets.

from collections import Counter

def wordcount(filepath, n=10):
    with open(filepath) as handle:
        words = handle.read().lower().split()
    return Counter(words).most_common(n)
for w, c in wordcount('tales.txt', 5):
    print(w, c)

prints

the 5247
and 3948
to 3614
of 3019
he 2021

DFS

def dfs(graph, node, visited=None):
	if visited is None:
		visited = set()
	if node in visited:
		return
	print(node)
	visited.add(node)
	children = graph.get(node)
	if children:
		for child in children:
			dfs(graph, child, visited)
	return

Quicksort

def qsort(L):
    if len(L) <= 1: return L
    return qsort([le for le in L[1:] if le <= L[0]]) + L[0:1] + \
           qsort([gt for gt in L[1:] if gt > L[0]])

Bits 0️⃣1️⃣

def is_power_of_2(x):
  return x & (x-1) == 0

Fuzzy Matching

Let's say you are trying to understand the intent of a search query and want to match input text to one of your catalogue brands,

brands = ['bare', 'alvin kelly', 'firetrap', 'victorias secret', 'bagforever', 'topshe', 'sublime', 'snobz', 'allen cooper', 'chavvis jewel', 'harry potter', 'status quo', 'neha taneja', 'elie saab', 'aady austin', 'disrupt', 'mark & keith', 'regal', 'tarusa', 'indian silks', 'bluestone', 'marcha ballerina', 'georgia pacific', 'hidekraft', '8 one 9', 'avira home', 'rangeelo rajasthan', 'adaira', 'thomas & friends', 'kids ville', 'betty jackson', 'my lil berry', 'jacques bogart', 'bolts and barrels', 'old spice', 'martini', 'simba', 'yolo-you only live once', 'yves rocher', 'jumps and bumps', 'arah', 'man of steel', 'guava', 'jazzup', 'mrignain', 'ucla', 'sitara', 'true soles', 'hubland', 'john varvatos', 'talinum', 'v-moda', 'seven for all souls', 'tanya sharma for stylista', 'minions', 'dsigner', 'aujjessa', 'beauty co seoul', 'emper', 'yellow jeans', 'rangrezz', 'espiral by the darling trap', 'alba botanica', 'orewa', 'organic therapie', 'happy face', 'marie comfort', 'ventoland', 'la plazeite', 'alaia paris', 'kids on board', 'f gear', 'chhabra xclusive', 'fritzberg', 'j. del pozo', 'crosscreek', 'indigo nation', 'trend arrest', 'justanned', 'clayspray', 'simpsons', 'osaiz', 'paani puri', 'swiss design', 'little kangaroos', 'akavya', 'travel blue', 'adidas', 'danish design', 'timex', 'evogue me', 'dezignare', 'cottstrings', 'van heusen', 'miss chase', 'bay island', 'ligalz', 'andrew scott', 'vedantika herbals', 'joy & mario', 'ananda in the himalayas']

you could use the fuzzywuzzy library

from fuzzywuzzy import process

process.extractOne('addidas', brands) # ('adidas', 92)

Linear Algebra

Effective Rank

Many matrix-based algorithms' "goodness" depends on the rank of your data matrix. For e.g., collaborative filtering assumes the original user-item-rating matrix is low-rank. But the concept of rank is of little help in real life since it's very sensitive to minute values.

import numpy as np
eps = 1e-5
m1 = np.array([[1, 1], [1, 1]])
m2 = np.array([[1, 1+eps], [1, 1]])

Straightforward rank gives values 1 and 2 simultaneously,

from numpy.linalg import matrix_rank as rank
rank(m1), rank(m2)
# 1, 2

Singular Value Decomposition

Using Singular Value Decomposition(SVD), we can make up something akin to effective rank. The diagonal elements in SVD are scalars called eigenvalues, which indicate how material a dimension is to form the original matrix.

from numpy.linalg import svd

def eff_rank(mat, tolerance=1e-4):
  _, sigma, _ = svd(mat)
  return len([s for s in sigma if s > tolerance])

This gives us something more intuitive and hopefully helpful,

eff_rank(m1), eff_rank(m2)
# 1, 1

Consistency in Investing

Avg CAGR; Warren Buffet consistency quote; dirichlet distribution for a simplex

Compound interest is the eighth wonder of the world. He who understands it, earns it ... he who doesn't ... pays it.

compound = lambda x: (1+1/x)**x
>>> compound(1)
2.0
>>> compound(10)
2.59
>>> compound(100)
2.705
>>> compound(1000)
2.7169
from operator import mul
from functools import reduce
from random import randint
from statistics import stdev

def avg_cagr(*percents):
    amount = reduce(mul, [1+p/100 for p in percents])
    return amount**(1/len(percents)) - 1

def random_sum(n=5, amount=100):
    '''Returns n random numbers which sum to "amount"'''
    randoms = [randint(1, amount) for _ in range(n)]
    div_factor = sum(randoms)/amount
    randoms_normalized = [round(r/div_factor, 2) for r in randoms]
    return randoms_normalized
    
def best_returns(n=5, amount=100, iterations=100):
    results = []
    for _ in range(iterations):
        rands = random_sum(n, amount)
        returns = avg_cagr(*rands)
        deviation = stdev(rands)
        results.append((deviation, returns))
    return sorted(results, key=lambda x: x[1], reverse=True)

Binary Search

Python's standard library has bisect module that provides methods for bisection on sorted lists. The bisect_left function locates the proper insertion point for a given item in a (sorted) list to maintain sorted order.

from bisect import bisect_left

def binary_search(arr, x, lo=0, hi=None):  # can't use a to specify default for hi
    hi = hi or len(arr)  # when hi is None it defaults to length of array
    pos = bisect_left(arr, x, lo, hi)  # find insertion position
    return (pos if pos != hi and arr[pos] == x else -1)  # don't walk off the end

Why can't we pass default argument of len(arr) to hi like hi=len(arr)?
Here you could also write hi = hi if hi is not None else len(a). See or operator

Limits

lim x→1 (1-xi)/(1-xn) evaluates to i/n (using L'Hospital's Rule). To simulate this function, we will need arbitrary precision floating point for which we will use the standard library module decimal.

from decimal import Decimal
def f(x, i, n):
  return Decimal((1-x**i)/(1-x**n))

f(1-1e-10, 5, 10) returns Decimal('0.5')

Memory usage (in GB)

import os
import psutil

def mem_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss/1e9

Split a list into evenly sized chunks

chunks function 👇 is a generator

def chunks(l, n):
    '''Yield successive n-sized chunks from l.'''
    for i in range(0, len(l), n):
        yield l[i:i+n]

Practice

Grouping: Given a list of user scores, group them user-wise.

user_scores = [('joe', 110), ('john', 120), ('joe', 95), ('john', 88)]
from collections import defaultdict
user_scores_dict = defaultdict(list)
for user, score in user_scores.items():
  user_scores_dict[user].append(score)

Direct links

Iterators

Bell Curve

Clone this wiki locally