4. Snippets and Recipes
from statistics import mean
from math import log, exp, factorial
data = [3, 5, 14, 20, 25]
geometric_mean = exp(mean(map(log, data)))
e = sum(1/factorial(x) for x in range(10)) # 2.718; base of natural log
from collections.abc import Iterable
def flatten(items):
    for x in items:
        if isinstance(x, Iterable):
            yield from flatten(x)
        else:
            yield x
items = [1, 2, [3, 4, [5, 6], 7], 8]
list(flatten(items)) # [1, 2, 3, 4, 5, 6, 7, 8]
Here,
yield from flatten(x)
is equivalent to
for i in flatten(x):
    yield i
This recipe won't work for iterables of strings, because a str object is itself an Iterable, so the recursion never bottoms out on single characters. Special-casing str and bytes fixes that:
def flatten(items):
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            yield from flatten(x)
        else:
            yield x
Now, for names = ['Ram', ['John'], 'Shahid'],
list(flatten(names))
gives ['Ram', 'John', 'Shahid'].
What happens if you sum a bunch of random variables and then sample from it?
from random import random
from collections import Counter
rvs = Counter(round(sum(random() for _ in range(300))) for _ in range(1000))
for i in range(min(rvs), max(rvs) + 1):
    print('-' * rvs[i])
What does the plot 📊 look like? We sum 300 uniform random numbers, round the sum to an integer, repeat this 1000 times, and print a histogram of the frequencies of the resulting integers. The histogram closely follows a Normal distribution (also called the bell curve 🔔). This is the Central Limit Theorem in action.
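As a quick numeric check (a small sketch reusing the rvs Counter from above), we can compare the observed counts against what a Normal distribution predicts: the sum of 300 Uniform(0, 1) draws has mean 150 and variance 300/12 = 25, i.e. standard deviation 5.
from statistics import NormalDist

normal = NormalDist(mu=150, sigma=5)       # mean 300 * 0.5, std sqrt(300 / 12)
for i in range(min(rvs), max(rvs) + 1):
    expected = 1000 * normal.pdf(i)        # expected count out of the 1000 samples
    print(f'{i:3d}  observed {rvs[i]:3d}  expected {expected:5.1f}')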
itertools
is a collection of superb, fast, memory-efficient tools for working with iterators. For example, take combinations
(nCk):
from itertools import chain, combinations
import math
- Powerset: The powerset of a set S is the set of all subsets of S. In other words, it is the union of the subsets of size k for every k, and the size-k subsets are exactly the k-combinations (a small usage sketch follows this list):
powerset = chain.from_iterable(combinations(ls, r) for r in range(len(ls) + 1))
- Pythagorean triplets: triplets (a, b, c) such that a^2 + b^2 = c^2; the gcd check below keeps only the primitive ones
for x, y, z in combinations(range(1, 100), 3):
    if math.gcd(x, y) == 1 and x**2 + y**2 == z**2:
        print(x, y, z)
- Sensitivity Conjecture: If you consider all n-bit words and take any set of more than half of them (i.e. at least 1+2^(n-1)), then this set will contain some word for which there are at least √n other words in the set that differ in exactly one bit position.
Verifying for n = 4:
n = 4
m = math.sqrt(n)

def degree(word, s):
    """How many other words differ in one bit position"""
    return sum(1 for w in s if bin(word ^ w).count('1') == 1)

words = range(2**n)  # All n-bit words
for s in combinations(words, 1 + 2 ** (n - 1)):
    assert any(degree(word, s) >= m for word in s)
The idea here is that to consider every set containing more than half of all n-bit words, you enumerate the combinations of that size. Another way to write the same check is to use all instead of the for-loop with an assert:
all(any(degree(word, s) >= m for word in s) for s in combinations(words, 1 + 2 ** (n - 1)))
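Back to the powerset recipe above: a minimal usage sketch, assuming a small example list ls.
from itertools import chain, combinations

ls = [1, 2, 3]  # hypothetical example input
powerset = chain.from_iterable(combinations(ls, r) for r in range(len(ls) + 1))
print(list(powerset))
# [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]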
from math import pow
from itertools import zip_longest

class Polynomial:
    def __init__(self, *coeffs):
        self.coeffs = coeffs  # highest-degree coefficient first

    def __add__(self, other):
        if not isinstance(other, Polynomial):
            raise TypeError(f'{other} is not an instance of Polynomial')
        new_coeffs = []
        for x, y in zip_longest(reversed(self.coeffs), reversed(other.coeffs), fillvalue=0):
            new_coeffs.append(x + y)
        return Polynomial(*reversed(new_coeffs))

    def __repr__(self):
        return f'Polynomial({self.coeffs})'

    def __len__(self):
        return len(self.coeffs)

    def __call__(self, value):
        return sum(coeff * pow(value, idx)
                   for idx, coeff in enumerate(reversed(self.coeffs)))
>>> p1 = Polynomial(1, 2, 7)
>>> p2 = Polynomial(5, 3, 0, 2)
>>> p1
Polynomial((1, 2, 7))
>>> print(len(p1))
3
>>> p1 + p2
Polynomial((5, 4, 2, 9))
>>> p2(4)
370.0
McIlroy vs Knuth and word counts
Jon Bentley had a regular column called “Programming Pearls” in the Communications of the ACM. In 1986 he got interested in literate programming, so he asked Donald Knuth to write a program in that style as a guest column and Doug McIlroy to write a literary-style critique of it. The program was
Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
Knuth's 10+ page Pascal program used a clever, purpose-built data structure for keeping track of the words and frequency counts. McIlroy replied with the following shell script:
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
Doing the same task in Python is trivial using the collections.Counter
class. A Counter is a dict subclass where elements are stored as keys and their counts as values. The Counter class is similar to C++ multisets.
from collections import Counter
def wordcount(filepath, n=10):
    with open(filepath) as handle:
        words = handle.read().lower().split()
    return Counter(words).most_common(n)

for w, c in wordcount('tales.txt', 5):
    print(w, c)
prints
the 5247
and 3948
to 3614
of 3019
he 2021
def dfs(graph, node, visited=None):
    """Depth-first traversal of a graph given as a dict of node -> list of children."""
    if visited is None:
        visited = set()
    if node in visited:
        return
    print(node)
    visited.add(node)
    children = graph.get(node)
    if children:
        for child in children:
            dfs(graph, child, visited)
    return
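A quick usage sketch, with a hypothetical adjacency-list graph:
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}  # hypothetical example graph
dfs(graph, 'A')
# prints A, B, D, C (each node exactly once, depth-first)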
def qsort(L):
    if len(L) <= 1:
        return L
    return (qsort([le for le in L[1:] if le <= L[0]]) + L[0:1] +
            qsort([gt for gt in L[1:] if gt > L[0]]))
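For example (note this builds new lists at every level rather than sorting in place):
qsort([3, 1, 4, 1, 5, 9, 2, 6])  # [1, 1, 2, 3, 4, 5, 6, 9]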
def is_power_of_2(x):
    return x > 0 and x & (x - 1) == 0  # the x > 0 guard excludes 0, which the bit trick alone would accept
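The trick works because a power of two has exactly one set bit; subtracting 1 clears that bit and sets every bit below it, so the AND is zero only for powers of two. For example:
assert is_power_of_2(8)         # 0b1000 & 0b0111 == 0
assert not is_power_of_2(12)    # 0b1100 & 0b1011 == 0b1000
assert not is_power_of_2(0)     # excluded by the x > 0 guard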
Let's say you are trying to understand the intent of a search query and want to match input text to one of your catalogue brands,
brands = ['bare', 'alvin kelly', 'firetrap', 'victorias secret', 'bagforever', 'topshe', 'sublime', 'snobz', 'allen cooper', 'chavvis jewel', 'harry potter', 'status quo', 'neha taneja', 'elie saab', 'aady austin', 'disrupt', 'mark & keith', 'regal', 'tarusa', 'indian silks', 'bluestone', 'marcha ballerina', 'georgia pacific', 'hidekraft', '8 one 9', 'avira home', 'rangeelo rajasthan', 'adaira', 'thomas & friends', 'kids ville', 'betty jackson', 'my lil berry', 'jacques bogart', 'bolts and barrels', 'old spice', 'martini', 'simba', 'yolo-you only live once', 'yves rocher', 'jumps and bumps', 'arah', 'man of steel', 'guava', 'jazzup', 'mrignain', 'ucla', 'sitara', 'true soles', 'hubland', 'john varvatos', 'talinum', 'v-moda', 'seven for all souls', 'tanya sharma for stylista', 'minions', 'dsigner', 'aujjessa', 'beauty co seoul', 'emper', 'yellow jeans', 'rangrezz', 'espiral by the darling trap', 'alba botanica', 'orewa', 'organic therapie', 'happy face', 'marie comfort', 'ventoland', 'la plazeite', 'alaia paris', 'kids on board', 'f gear', 'chhabra xclusive', 'fritzberg', 'j. del pozo', 'crosscreek', 'indigo nation', 'trend arrest', 'justanned', 'clayspray', 'simpsons', 'osaiz', 'paani puri', 'swiss design', 'little kangaroos', 'akavya', 'travel blue', 'adidas', 'danish design', 'timex', 'evogue me', 'dezignare', 'cottstrings', 'van heusen', 'miss chase', 'bay island', 'ligalz', 'andrew scott', 'vedantika herbals', 'joy & mario', 'ananda in the himalayas']
you could use the fuzzywuzzy library
from fuzzywuzzy import process
process.extractOne('addidas', brands) # ('adidas', 92)
Many matrix-based algorithms' "goodness" depends on the rank of your data matrix. For example, collaborative filtering assumes the original user-item-rating matrix is low-rank. But the concept of rank
is of little help in real life, since it's very sensitive to tiny perturbations in the values.
import numpy as np
eps = 1e-5
m1 = np.array([[1, 1], [1, 1]])
m2 = np.array([[1, 1+eps], [1, 1]])
Straightforward rank gives 1 and 2 respectively,
from numpy.linalg import matrix_rank as rank
rank(m1), rank(m2)
# 1, 2
Using Singular Value Decomposition (SVD), we can make up something akin to an effective rank. SVD factors the matrix so that the diagonal elements are non-negative scalars called singular values, which indicate how much each dimension contributes to forming the original matrix.
from numpy.linalg import svd

def eff_rank(mat, tolerance=1e-4):
    _, sigma, _ = svd(mat)
    return len([s for s in sigma if s > tolerance])
This gives us something more intuitive and hopefully helpful,
eff_rank(m1), eff_rank(m2)
# 1, 1
Avg CAGR; Warren Buffett consistency quote; Dirichlet distribution for a simplex
Compound interest is the eighth wonder of the world. He who understands it, earns it ... he who doesn't ... pays it.
compound = lambda x: (1+1/x)**x
>>> compound(1)
2.0
>>> compound(10)
2.59
>>> compound(100)
2.705
>>> compound(1000)
2.7169
from operator import mul
from functools import reduce
from random import randint
from statistics import stdev
def avg_cagr(*percents):
    amount = reduce(mul, [1 + p/100 for p in percents])
    return amount**(1/len(percents)) - 1

def random_sum(n=5, amount=100):
    '''Returns n random numbers which sum to "amount"'''
    randoms = [randint(1, amount) for _ in range(n)]
    div_factor = sum(randoms)/amount
    randoms_normalized = [round(r/div_factor, 2) for r in randoms]
    return randoms_normalized

def best_returns(n=5, amount=100, iterations=100):
    results = []
    for _ in range(iterations):
        rands = random_sum(n, amount)
        returns = avg_cagr(*rands)
        deviation = stdev(rands)
        results.append((deviation, returns))
    return sorted(results, key=lambda x: x[1], reverse=True)
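A minimal usage sketch (the inputs are random, so the numbers printed are illustrative only):
for deviation, returns in best_returns(n=5, amount=100, iterations=10)[:3]:
    print(f'avg CAGR {returns:.2%} with std deviation {deviation:.2f}')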
Python's standard library has a bisect
module that provides functions for bisection on sorted lists. The bisect_left
function locates the proper insertion point for a given item in a (sorted) list to maintain sorted order.
from bisect import bisect_left

def binary_search(arr, x, lo=0, hi=None):  # can't write hi=len(arr) as the default; see below
    hi = hi or len(arr)                    # when hi is None it defaults to the length of the array
    pos = bisect_left(arr, x, lo, hi)      # find insertion position
    return pos if pos != hi and arr[pos] == x else -1  # don't walk off the end
Why can't we pass len(arr) as the default argument for hi, like hi=len(arr)? Because default values are evaluated once, at function definition time, when arr is not yet bound. Inside the function you could also write hi = hi if hi is not None else len(arr), which (unlike the or form) keeps an explicit hi=0. See the or
operator
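For example:
arr = [2, 3, 5, 7, 11, 13]
binary_search(arr, 7)   # 3 (index of 7)
binary_search(arr, 8)   # -1 (not present)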
lim x→1 of (1 - x^i)/(1 - x^n) evaluates to i/n: by L'Hospital's Rule, differentiating numerator and denominator gives lim x→1 of (i·x^(i-1))/(n·x^(n-1)) = i/n.
To simulate this function near x = 1 we need more precision than plain floats give, for which we will use the standard library module decimal
.
from decimal import Decimal

def f(x, i, n):
    x = Decimal(str(x))  # do the arithmetic in Decimal rather than float
    return (1 - x**i) / (1 - x**n)
f(1-1e-10, 5, 10)
returns a value very close to Decimal('0.5'), i.e. i/n = 5/10.
import os
import psutil

def mem_usage():
    '''Resident memory of the current process, in GB.'''
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1e9
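For example, to see roughly how much memory a big allocation takes (assumes psutil is installed):
before = mem_usage()
big = list(range(10_000_000))  # allocate something sizeable
print(f'{mem_usage() - before:.2f} GB used by the list')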
The chunks
function 👇 is a generator:
def chunks(l, n):
    '''Yield successive n-sized chunks from l.'''
    for i in range(0, len(l), n):
        yield l[i:i+n]
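For example:
list(chunks([1, 2, 3, 4, 5, 6, 7], 3))  # [[1, 2, 3], [4, 5, 6], [7]]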
Grouping: Given a list of user scores, group them user-wise.
user_scores = [('joe', 110), ('john', 120), ('joe', 95), ('john', 88)]
from collections import defaultdict
user_scores_dict = defaultdict(list)
for user, score in user_scores:  # user_scores is a list of (user, score) tuples, so iterate it directly (no .items())
    user_scores_dict[user].append(score)
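which leaves
dict(user_scores_dict)  # {'joe': [110, 95], 'john': [120, 88]}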