Skip to content

Latest commit

 

History

History
1399 lines (1073 loc) · 47.7 KB

File metadata and controls

1399 lines (1073 loc) · 47.7 KB

Advanced Python Topics

Important Knowledge Points

  • How to use comprehensions

    prices = {
        'AAPL': 191.88,
        'GOOG': 1186.96,
        'IBM': 149.24,
        'ORCL': 48.44,
        'ACN': 166.89,
        'FB': 208.09,
        'SYMC': 21.29
    }
    # Build a new dictionary with stocks whose price is greater than 100
    prices2 = {key: value for key, value in prices.items() if value > 100}
    print(prices2)

    Note: Comprehensions can be used to make lists, sets, and dictionaries.

  • A pitfall of nested lists

    names = ['Guan Yu', 'Zhang Fei', 'Zhao Yun', 'Ma Chao', 'Huang Zhong']
    courses = ['Chinese', 'Math', 'English']
    # Enter the scores of five students in three courses
    # Wrong - see http://pythontutor.com/visualize.html#mode=edit
    # scores = [[None] * len(courses)] * len(names)
    scores = [[None] * len(courses) for _ in range(len(names))]
    for row, name in enumerate(names):
        for col, course in enumerate(courses):
            scores[row][col] = float(input(f'Please enter {name}\'s {course} score: '))
            print(scores)

    Python Tutor - VISUALIZE CODE AND GET LIVE HELP

  • The heapq module (heap sort)

    """
    Find the largest or smallest N elements in a list
    Heap structure (max heap / min heap)
    """
    import heapq
    
    list1 = [34, 25, 12, 99, 87, 63, 58, 78, 88, 92]
    list2 = [
        {'name': 'IBM', 'shares': 100, 'price': 91.1},
        {'name': 'AAPL', 'shares': 50, 'price': 543.22},
        {'name': 'FB', 'shares': 200, 'price': 21.09},
        {'name': 'HPQ', 'shares': 35, 'price': 31.75},
        {'name': 'YHOO', 'shares': 45, 'price': 16.35},
        {'name': 'ACME', 'shares': 75, 'price': 115.65}
    ]
    print(heapq.nlargest(3, list1))
    print(heapq.nsmallest(3, list1))
    print(heapq.nlargest(2, list2, key=lambda x: x['price']))
    print(heapq.nlargest(2, list2, key=lambda x: x['shares']))
  • The itertools module

    """
    Iteration tools module
    """
    import itertools
    
    # Produce all permutations of ABCD
    itertools.permutations('ABCD')
    # Produce 3-element combinations from ABCDE
    itertools.combinations('ABCDE', 3)
    # Produce the Cartesian product of ABCD and 123
    itertools.product('ABCD', '123')
    # Produce an infinite cycle of ABC
    itertools.cycle(('A', 'B', 'C'))
  • The collections module

    Common utility classes:

    • namedtuple: a named tuple. It is a class factory. It accepts a type name and an attribute list to create a class.
    • deque: a double-ended queue. It is another way to implement a list. In Python, a list is based on an array, but deque is based on a doubly linked list. So when you need to add or delete elements at both ends, deque has better performance, and its asymptotic time complexity is O(1).
    • Counter: a subclass of dict. The key is the element, and the value is the count of that element. Its most_common() method can help us get the elements that appear most often. I think the inheritance relationship between Counter and dict is open to discussion. According to the CARP principle, it would be more reasonable to design the relationship between Counter and dict as an association relationship.
    • OrderedDict: a subclass of dict. It records the order in which key-value pairs are inserted. It looks like it has the behavior of both a dictionary and a linked list.
    • defaultdict: similar to a dictionary type, but it can get a default value for a key through a factory function. Compared with the setdefault() method of a dictionary, this is more efficient.
    """
    Find the elements that appear most often in a sequence
    """
    from collections import Counter
    
    words = [
        'look', 'into', 'my', 'eyes', 'look', 'into', 'my', 'eyes',
        'the', 'eyes', 'the', 'eyes', 'the', 'eyes', 'not', 'around',
        'the', 'eyes', "don't", 'look', 'around', 'the', 'eyes',
        'look', 'into', 'my', 'eyes', "you're", 'under'
    ]
    counter = Counter(words)
    print(counter.most_common(3))

Data Structures and Algorithms

  • Algorithm: the method and steps used to solve a problem.

  • To judge whether an algorithm is good or bad, we usually look at asymptotic time complexity and asymptotic space complexity.

  • Big-O notation for asymptotic time complexity:

    • O(c) - constant time complexity - Bloom filters / hash storage
    • O(log_2n) - logarithmic time complexity - binary search
    • O(n) - linear time complexity - sequential search / counting sort
    • O(n * log_2n) - log-linear time complexity - advanced sorting algorithms (merge sort, quicksort)
    • O(n^2) - square time complexity - simple sorting algorithms (selection sort, insertion sort, bubble sort)
    • O(n^3) - cubic time complexity - Floyd's algorithm / matrix multiplication
    • O(2^n) - geometric-series time complexity - Tower of Hanoi
    • O(n!) - factorial time complexity - traveling salesman problem / NPC

  • Sorting algorithms (selection sort, bubble sort, merge sort) and searching algorithms (sequential search and binary search)

    def select_sort(items, comp=lambda x, y: x < y):
        """Simple selection sort"""
        items = items[:]
        for i in range(len(items) - 1):
            min_index = i
            for j in range(i + 1, len(items)):
                if comp(items[j], items[min_index]):
                    min_index = j
            items[i], items[min_index] = items[min_index], items[i]
        return items
    def bubble_sort(items, comp=lambda x, y: x > y):
        """Bubble sort"""
        items = items[:]
        for i in range(len(items) - 1):
            swapped = False
            for j in range(len(items) - 1 - i):
                if comp(items[j], items[j + 1]):
                    items[j], items[j + 1] = items[j + 1], items[j]
                    swapped = True
            if not swapped:
                break
        return items
    def bubble_sort(items, comp=lambda x, y: x > y):
        """Cocktail sort (an upgraded version of bubble sort)"""
        items = items[:]
        for i in range(len(items) - 1):
            swapped = False
            for j in range(len(items) - 1 - i):
                if comp(items[j], items[j + 1]):
                    items[j], items[j + 1] = items[j + 1], items[j]
                    swapped = True
            if swapped:
                swapped = False
                for j in range(len(items) - 2 - i, i, -1):
                    if comp(items[j - 1], items[j]):
                        items[j], items[j - 1] = items[j - 1], items[j]
                        swapped = True
            if not swapped:
                break
        return items
    def merge(items1, items2, comp=lambda x, y: x < y):
        """Merge (merge two sorted lists into one sorted list)"""
        items = []
        index1, index2 = 0, 0
        while index1 < len(items1) and index2 < len(items2):
            if comp(items1[index1], items2[index2]):
                items.append(items1[index1])
                index1 += 1
            else:
                items.append(items2[index2])
                index2 += 1
        items += items1[index1:]
        items += items2[index2:]
        return items
    
    
    def merge_sort(items, comp=lambda x, y: x < y):
        return _merge_sort(list(items), comp)
    
    
    def _merge_sort(items, comp):
        """Merge sort"""
        if len(items) < 2:
            return items
        mid = len(items) // 2
        left = _merge_sort(items[:mid], comp)
        right = _merge_sort(items[mid:], comp)
        return merge(left, right, comp)
    def seq_search(items, key):
        """Sequential search"""
        for index, item in enumerate(items):
            if item == key:
                return index
        return -1
    def bin_search(items, key):
        """Binary search"""
        start, end = 0, len(items) - 1
        while start <= end:
            mid = (start + end) // 2
            if key > items[mid]:
                start = mid + 1
            elif key < items[mid]:
                end = mid - 1
            else:
                return mid
        return -1
  • Common algorithms:

    • Exhaustive method, also called brute-force cracking. Check all possibilities until the correct answer is found.
    • Greedy method. When solving a problem, always make what seems to be the best choice at the current moment. It does not try to find the best solution, but quickly finds a satisfactory solution.
    • Divide-and-conquer method. Break a complex problem into two or more same or similar subproblems, then keep breaking the subproblems into smaller subproblems until they can be solved directly. At last, combine the solutions of the subproblems to get the solution of the original problem.
    • Backtracking method. The backtracking method is also called the trial method. Search forward according to the better-choice condition. When the search reaches some step and finds that the original choice is not good enough or cannot reach the goal, go back one step and choose again.
    • Dynamic programming. The basic idea is also to break the problem to be solved into several subproblems, solve these subproblems first and save their solutions, so that a large amount of repeated computation can be avoided.

    Example of the exhaustive method: the hundred-money hundred-chickens problem and the five people dividing fish problem.

    # A rooster costs 5 yuan, a hen costs 3 yuan, and 3 chicks cost 1 yuan
    # Use 100 yuan to buy 100 chickens, how many roosters / hens / chicks are there
    for x in range(20):
        for y in range(33):
            z = 100 - x - y
            if 5 * x + 3 * y + z // 3 == 100 and z % 3 == 0:
                print(x, y, z)
    
    # Five people A, B, C, D, and E caught fish together one night and then fell asleep
    # The next morning, A woke up first. He divided the fish into 5 parts, threw away 1 extra fish,
    # and took his own part. B woke up second and also divided the fish in the same way.
    # Then C, D, and E woke up one after another and divided the fish in the same way.
    # Ask how many fish they caught at least
    fish = 6
    while True:
        total = fish
        enough = True
        for _ in range(5):
            if (total - 1) % 5 == 0:
                total = (total - 1) // 5 * 4
            else:
                enough = False
                break
        if enough:
            print(fish)
            break
        fish += 5

    Example of the greedy method: suppose a thief has a backpack that can hold at most 20 kilograms of stolen goods. He breaks into a house and finds the items shown in the table below. Clearly, he cannot put all the items into the backpack, so he must decide which items to take away and which items to leave.

    Name Price (USD) Weight (kg)
    Computer 200 20
    Radio 20 4
    Clock 175 10
    Vase 50 2
    Book 10 1
    Oil Painting 90 9
    """
    Greedy method: when solving a problem, always make what seems to be the best
    choice at the current moment. It does not try to find the best solution, but
    quickly finds a satisfactory solution.
    Input:
    20 6
    Computer 200 20
    Radio 20 4
    Clock 175 10
    Vase 50 2
    Book 10 1
    OilPainting 90 9
    """
    class Thing(object):
        """Item"""
    
        def __init__(self, name, price, weight):
            self.name = name
            self.price = price
            self.weight = weight
    
        @property
        def value(self):
            """Price-weight ratio"""
            return self.price / self.weight
    
    
    def input_thing():
        """Input item information"""
        name_str, price_str, weight_str = input().split()
        return name_str, int(price_str), int(weight_str)
    
    
    def main():
        """Main function"""
        max_weight, num_of_things = map(int, input().split())
        all_things = []
        for _ in range(num_of_things):
            all_things.append(Thing(*input_thing()))
        all_things.sort(key=lambda x: x.value, reverse=True)
        total_weight = 0
        total_price = 0
        for thing in all_things:
            if total_weight + thing.weight <= max_weight:
                print(f'The thief took {thing.name}')
                total_weight += thing.weight
                total_price += thing.price
        print(f'Total value: {total_price} dollars')
    
    
    if __name__ == '__main__':
        main()

    Example of the divide-and-conquer method: quicksort.

    """
    Quicksort - choose a pivot to divide the elements. The left side is all
    smaller than the pivot, and the right side is all larger than the pivot
    """
    def quick_sort(items, comp=lambda x, y: x <= y):
        items = list(items)[:]
        _quick_sort(items, 0, len(items) - 1, comp)
        return items
    
    
    def _quick_sort(items, start, end, comp):
        if start < end:
            pos = _partition(items, start, end, comp)
            _quick_sort(items, start, pos - 1, comp)
            _quick_sort(items, pos + 1, end, comp)
    
    
    def _partition(items, start, end, comp):
        pivot = items[end]
        i = start - 1
        for j in range(start, end):
            if comp(items[j], pivot):
                i += 1
                items[i], items[j] = items[j], items[i]
        items[i + 1], items[end] = items[end], items[i + 1]
        return i + 1

    Example of the backtracking method: Knight's tour.

    """
    Recursive backtracking: also called the trial method. Search forward according
    to the better-choice condition. When the search reaches some step and finds
    that the original choice is not good enough or cannot reach the goal, then go
    back one step and choose again. Classic problems include knight's tour, eight
    queens, and maze path finding.
    """
    import sys
    import time
    
    SIZE = 5
    total = 0
    
    
    def print_board(board):
        for row in board:
            for col in row:
                print(str(col).center(4), end='')
            print()
    
    
    def patrol(board, row, col, step=1):
        if row >= 0 and row < SIZE and \
            col >= 0 and col < SIZE and \
            board[row][col] == 0:
            board[row][col] = step
            if step == SIZE * SIZE:
                global total
                total += 1
                print(f'The {total}th way: ')
                print_board(board)
            patrol(board, row - 2, col - 1, step + 1)
            patrol(board, row - 1, col - 2, step + 1)
            patrol(board, row + 1, col - 2, step + 1)
            patrol(board, row + 2, col - 1, step + 1)
            patrol(board, row + 2, col + 1, step + 1)
            patrol(board, row + 1, col + 2, step + 1)
            patrol(board, row - 1, col + 2, step + 1)
            patrol(board, row - 2, col + 1, step + 1)
            board[row][col] = 0
    
    
    def main():
        board = [[0] * SIZE for _ in range(SIZE)]
        patrol(board, SIZE - 1, SIZE - 1)
    
    
    if __name__ == '__main__':
        main()

    Example of dynamic programming: the maximum value of the sum of elements in a sublist.

    Note: A sublist means a list made up of elements with continuous indexes in the list. The elements in the list are of type int, and they may include positive integers, 0, and negative integers. The program inputs the elements in the list and outputs the maximum value of the sum of elements in a sublist. For example:

    Input: 1 -2 3 5 -3 2

    Output: 8

    Input: 0 -2 3 5 -1 2

    Output: 9

    Input: -9 -2 -3 -5 -3

    Output: -2

    def main():
        items = list(map(int, input().split()))
        overall = partial = items[0]
        for i in range(1, len(items)):
            partial = max(items[i], partial + items[i])
            overall = max(partial, overall)
        print(overall)
    
    
    if __name__ == '__main__':
        main()

    Note: The easiest solution to think of for this problem is to use two nested loops, but the time performance of the code becomes very bad. By using the idea of dynamic programming and only two more variables, the original problem with O(N^2) complexity becomes O(N).

Ways to Use Functions

  • Treat functions as "first-class citizens"

    • Functions can be assigned to variables
    • Functions can be used as parameters of functions
    • Functions can be used as return values of functions
  • How to use higher-order functions (filter, map, and their replacements)

    items1 = list(map(lambda x: x ** 2, filter(lambda x: x % 2, range(1, 10))))
    items2 = [x ** 2 for x in range(1, 10) if x % 2]
  • Positional parameters, variable parameters, keyword parameters, named keyword parameters

  • Metadata of parameters (a code readability issue)

  • How to use anonymous functions and inline functions (lambda functions)

  • Closures and the scope problem

    • Python's LEGB order for searching variables (Local >>> Embedded >>> Global >>> Built-in)

    • The roles of the global and nonlocal keywords

      global: declare or define a global variable (either directly use an existing variable in the global scope, or define a variable and put it into the global scope).

      nonlocal: declare that you are using a variable in a nested scope (the variable must exist in the nested scope, otherwise an error is raised).

  • Decorator functions (using decorators and removing decorators)

    Example: a decorator that outputs the execution time of a function.

    from functools import wraps
    from time import time
    
    
    def record_time(func):
        """Custom decorator for decorating a function"""
    
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time()
            result = func(*args, **kwargs)
            print(f'{func.__name__}: {time() - start} seconds')
            return result
    
        return wrapper

    If the decorator does not want to be coupled with the print function, we can write a parameterized decorator.

    from functools import wraps
    from time import time
    
    
    def record(output):
        """Parameterized decorator"""
    
        def decorate(func):
    
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time()
                result = func(*args, **kwargs)
                output(func.__name__, time() - start)
                return result
    
            return wrapper
    
        return decorate
    from functools import wraps
    from time import time
    
    
    class Record():
        """Define a decorator by defining a class"""
    
        def __init__(self, output):
            self.output = output
    
        def __call__(self, func):
    
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time()
                result = func(*args, **kwargs)
                self.output(func.__name__, time() - start)
                return result
    
            return wrapper

    Note: Because the function with decorator function is decorated with @wraps, we can get the function or class before decoration by func.__wrapped__, and then remove the effect of the decorator.

    Example: use a decorator to implement the singleton pattern.

    from functools import wraps
    
    
    def singleton(cls):
        """Decorator for decorating a class"""
        instances = {}
    
        @wraps(cls)
        def wrapper(*args, **kwargs):
            if cls not in instances:
                instances[cls] = cls(*args, **kwargs)
            return instances[cls]
    
        return wrapper
    
    
    @singleton
    class President:
        """President, singleton class"""
        pass

    Tip: The code above uses a closure. I do not know whether you have noticed it. There is also one small problem. The code above does not implement a thread-safe singleton. How should it be done if we want to implement a thread-safe singleton?

    Thread-safe singleton decorator.

    from functools import wraps
    from threading import RLock
    
    
    def singleton(cls):
        """Thread-safe singleton decorator"""
        instances = {}
        locker = RLock()
    
        @wraps(cls)
        def wrapper(*args, **kwargs):
            if cls not in instances:
                with locker:
                    if cls not in instances:
                        instances[cls] = cls(*args, **kwargs)
            return instances[cls]
    
        return wrapper

    Tip: The code above uses with context syntax to do the lock operation, because the lock object itself is a context manager object, it supports the magic methods __enter__ and __exit__. In the wrapper function, we first do one check without lock, and then do one check with lock. This does better in performance than checking directly with lock. If the object has already been created, there is no need to lock again. We can directly return the object.

Object-Oriented Knowledge

  • The three pillars: encapsulation, inheritance, polymorphism

    Example: salary settlement system.

    """
    Monthly salary settlement system:
    department manager 15000 every month
    programmer 200 per hour
    salesperson 1800 base salary plus 5% commission of sales
    """
    from abc import ABCMeta, abstractmethod
    
    
    class Employee(metaclass=ABCMeta):
        """Employee, abstract class"""
    
        def __init__(self, name):
            self.name = name
    
        @abstractmethod
        def get_salary(self):
            """Settle monthly salary, abstract method"""
            pass
    
    
    class Manager(Employee):
        """Department manager"""
    
        def get_salary(self):
            return 15000.0
    
    
    class Programmer(Employee):
        """Programmer"""
    
        def __init__(self, name, working_hour=0):
            self.working_hour = working_hour
            super().__init__(name)
    
        def get_salary(self):
            return 200.0 * self.working_hour
    
    
    class Salesman(Employee):
        """Salesperson"""
    
        def __init__(self, name, sales=0.0):
            self.sales = sales
            super().__init__(name)
    
        def get_salary(self):
            return 1800.0 + self.sales * 0.05
    
    
    class EmployeeFactory:
        """Factory for creating employees.
        Factory pattern: use a factory to decouple the object user and the object.
        """
    
        @staticmethod
        def create(emp_type, *args, **kwargs):
            """Create employee"""
            all_emp_types = {'M': Manager, 'P': Programmer, 'S': Salesman}
            cls = all_emp_types[emp_type.upper()]
            return cls(*args, **kwargs) if cls else None
    
    
    def main():
        """Main function"""
        emps = [
            EmployeeFactory.create('M', 'Cao Cao'),
            EmployeeFactory.create('P', 'Xun Yu', 120),
            EmployeeFactory.create('P', 'Guo Jia', 85),
            EmployeeFactory.create('S', 'Dian Wei', 123000),
        ]
        for emp in emps:
            print(f'{emp.name}: {emp.get_salary():.2f} yuan')
    
    
    if __name__ == '__main__':
        main()
  • The relationship between classes and classes

    • is-a relationship: inheritance
    • has-a relationship: association / aggregation / composition
    • use-a relationship: dependency

    Example: poker game.

    """
    Experience: symbolic constants are always better than literal constants.
    Enumeration type is the best choice to define symbolic constants.
    """
    from enum import Enum, unique
    
    import random
    
    
    @unique
    class Suite(Enum):
        """Suit"""
    
        SPADE, HEART, CLUB, DIAMOND = range(4)
    
        def __lt__(self, other):
            return self.value < other.value
    
    
    class Card:
        """Card"""
    
        def __init__(self, suite, face):
            """Initialization method"""
            self.suite = suite
            self.face = face
    
        def show(self):
            """Show the face of the card"""
            suites = ['♠︎', '♥︎', '♣︎', '♦︎']
            faces = ['', 'A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
            return f'{suites[self.suite.value]}{faces[self.face]}'
    
        def __repr__(self):
            return self.show()
    
    
    class Poker:
        """Poker"""
    
        def __init__(self):
            self.index = 0
            self.cards = [Card(suite, face)
                          for suite in Suite
                          for face in range(1, 14)]
    
        def shuffle(self):
            """Shuffle cards, random order"""
            random.shuffle(self.cards)
            self.index = 0
    
        def deal(self):
            """Deal cards"""
            card = self.cards[self.index]
            self.index += 1
            return card
    
        @property
        def has_more(self):
            return self.index < len(self.cards)
    
    
    class Player:
        """Player"""
    
        def __init__(self, name):
            self.name = name
            self.cards = []
    
        def get_one(self, card):
            """Get one card"""
            self.cards.append(card)
    
        def sort(self, comp=lambda card: (card.suite, card.face)):
            """Arrange the cards in hand"""
            self.cards.sort(key=comp)
    
    
    def main():
        """Main function"""
        poker = Poker()
        poker.shuffle()
        players = [Player('East Heretic'), Player('West Poison'),
                   Player('South Emperor'), Player('North Beggar')]
        while poker.has_more:
            for player in players:
                player.get_one(poker.deal())
        for player in players:
            player.sort()
            print(player.name, end=': ')
            print(player.cards)
    
    
    if __name__ == '__main__':
        main()

    Note: The code above uses Emoji characters to represent the four suits of poker cards. On some systems that do not support Emoji characters, they may not display.

  • Object copying, deep copy / deep clone and shallow copy / shadow clone

  • Garbage collection, circular reference, and weak reference

    Python uses automatic memory management. This mechanism is based on reference counting, and at the same time introduces mark-sweep and generational collection as supporting strategies.

    typedef struct _object {
        /* reference count */
        int ob_refcnt;
        /* object pointer */
        struct _typeobject *ob_type;
    } PyObject;
    /* macro to increase reference count */
    #define Py_INCREF(op)   ((op)->ob_refcnt++)
    /* macro to decrease reference count */
    #define Py_DECREF(op) \
        if (--(op)->ob_refcnt != 0) \
            ; \
        else \
            __Py_Dealloc((PyObject *)(op))

    Situations that cause reference count +1:

    • The object is created, for example a = 23
    • The object is referenced, for example b = a
    • The object is passed as a parameter into a function, for example f(a)
    • The object is stored in a container as an element, for example list1 = [a, a]

    Situations that cause reference count -1:

    • The alias of the object is explicitly destroyed, for example del a
    • The alias of the object is assigned a new object, for example a = 24
    • An object leaves its scope, for example when function f finishes execution, local variables in f will, global variables will not
    • The container containing the object is destroyed, or the object is deleted from the container

    Reference counting may cause the problem of circular reference, and circular reference will cause memory leak, as shown in the code below. To solve this problem, Python introduces mark-sweep and generational collection. When an object is created, it is put in generation 1. If the object survives the garbage check in generation 1, then the object is put into generation 2. In the same way, if the object survives the garbage check in generation 2, then the object is put into generation 3.

    # Circular reference causes memory leak.
    # Besides reference counting, Python also introduces mark cleanup and generational collection.
    # Before Python 3.6, if __del__ magic method was overridden, circular reference handling failed.
    # If you do not want to cause circular reference, you can use weak reference.
    list1 = []
    list2 = []
    list1.append(list2)
    list2.append(list1)

    The following situations will cause garbage collection:

    • Call gc.collect()
    • The counter of the gc module reaches the threshold
    • The program exits

    If two objects in a circular reference both define the __del__ method, the gc module will not destroy these unreachable objects, because the gc module does not know which object's __del__ method should be called first. This problem was solved in Python 3.6.

    We can also solve the circular reference problem by creating weak references through the weakref module.

  • Magic attributes and methods, please refer to A Guide to Python Magic Methods

    There are a few small questions for everyone to think about:

    • Can custom objects use operators to do operations?
    • Can custom objects be put into set? Can they remove duplicates?
    • Can custom objects be used as keys of dict?
    • Can custom objects use context syntax?
  • Mixin

    Example: custom dictionary limits that key-value pairs can be set only when the specified key does not exist.

    class SetOnceMappingMixin:
        """Custom mixin class"""
        __slots__ = ()
    
        def __setitem__(self, key, value):
            if key in self:
                raise KeyError(str(key) + ' already set')
            return super().__setitem__(key, value)
    
    
    class SetOnceDict(SetOnceMappingMixin, dict):
        """Custom dictionary"""
        pass
    
    
    my_dict = SetOnceDict()
    try:
        my_dict['username'] = 'jackfrued'
        my_dict['username'] = 'hellokitty'
    except KeyError:
        pass
    print(my_dict)
  • Metaprogramming and metaclass

    Objects are created through classes. Classes are created through metaclasses. Metaclasses provide the meta information for creating classes. All classes directly or indirectly inherit from object. All metaclasses directly or indirectly inherit from type.

    Example: use metaclass to implement singleton pattern.

    import threading
    
    
    class SingletonMeta(type):
        """Custom metaclass"""
    
        def __init__(cls, *args, **kwargs):
            cls.__instance = None
            cls.__lock = threading.RLock()
            super().__init__(*args, **kwargs)
    
        def __call__(cls, *args, **kwargs):
            if cls.__instance is None:
                with cls.__lock:
                    if cls.__instance is None:
                        cls.__instance = super().__call__(*args, **kwargs)
            return cls.__instance
    
    
    class President(metaclass=SingletonMeta):
        """President, singleton class"""
        pass
  • Object-oriented design principles

    • Single responsibility principle, SRP: a class only does what it should do, the design of the class should have high cohesion
    • Open-closed principle, OCP: software entities should be open for extension and closed for modification
    • Dependency inversion principle, DIP: programming facing abstraction, this has already been weakened in weakly typed languages
    • Liskov substitution principle, LSP: at any time, a subclass object can replace a parent class object
    • Interface segregation principle, ISP: interfaces should be small and specialized, not big and complete. In Python there is no concept of interface
    • Composition aggregate reuse principle, CARP: prefer using strong association relationship instead of inheritance relationship to reuse code
    • Least knowledge principle, Law of Demeter, LoD: do not send messages to objects that have no necessary connection

    Note: The bold letters above together are called the SOLID principles in object-oriented programming.

  • GoF design patterns

    • Creational patterns: singleton, factory, builder, prototype
    • Structural patterns: adapter, facade, proxy
    • Behavioral patterns: iterator, observer, state, strategy

    Example: pluggable hash algorithm, strategy pattern.

    class StreamHasher:
        """Hash digest generator"""
    
        def __init__(self, alg='md5', size=4096):
            self.size = size
            alg = alg.lower()
            self.hasher = getattr(__import__('hashlib'), alg.lower())()
    
        def __call__(self, stream):
            return self.to_digest(stream)
    
        def to_digest(self, stream):
            """Generate digest in hexadecimal form"""
            for buf in iter(lambda: stream.read(self.size), b''):
                self.hasher.update(buf)
            return self.hasher.hexdigest()
    
    
    def main():
        """Main function"""
        hasher1 = StreamHasher()
        with open('Python-3.7.6.tgz', 'rb') as stream:
            print(hasher1.to_digest(stream))
        hasher2 = StreamHasher('sha1')
        with open('Python-3.7.6.tgz', 'rb') as stream:
            print(hasher2(stream))
    
    
    if __name__ == '__main__':
        main()

Iterators and Generators

  • An iterator is an object that implements the iterator protocol.

    • In Python there is no keyword like protocol or interface to define a protocol.
    • In Python, magic methods are used to represent protocols.
    • The magic methods __iter__ and __next__ are the iterator protocol.
    class Fib(object):
        """Iterator"""
    
        def __init__(self, num):
            self.num = num
            self.a, self.b = 0, 1
            self.idx = 0
    
        def __iter__(self):
            return self
    
        def __next__(self):
            if self.idx < self.num:
                self.a, self.b = self.b, self.a + self.b
                self.idx += 1
                return self.a
            raise StopIteration()
  • A generator is an iterator in a syntax-simplified version.

    def fib(num):
        """Generator"""
        a, b = 0, 1
        for _ in range(num):
            a, b = b, a + b
            yield a
  • Generators evolve into coroutines.

    A generator object can use the send() method to send data, and the sent data becomes the value obtained by the yield expression in the generator function. In this way, generators can be used as coroutines. Simply speaking, coroutines are subprograms that can cooperate with each other.

    def calc_avg():
        """Calculate average value in a stream way"""
        total, counter = 0, 0
        avg_value = None
        while True:
            value = yield avg_value
            total, counter = total + value, counter + 1
            avg_value = total / counter
    
    
    gen = calc_avg()
    next(gen)
    print(gen.send(10))
    print(gen.send(20))
    print(gen.send(30))

Concurrent Programming

Python has three solutions for concurrent programming: multi-threading, multi-process, and asynchronous I/O. The good side of concurrent programming is that it can improve the execution efficiency of the program and improve user experience. The bad side is that concurrent programs are not easy to develop and debug, and to other programs they are not friendly either.

  • Multi-threading: Python provides the Thread class, together with Lock, Condition, Event, Semaphore, and Barrier. Python has GIL to prevent multiple threads from executing local bytecode at the same time. This lock is necessary for CPython, because the memory management of CPython is not thread-safe. Because of the existence of GIL, multi-threading cannot use the multi-core features of CPU.

    """
    Interview question: what are the difference and relationship between process and thread?
    Process: the basic unit for the operating system to allocate memory.
    One process can contain one or more threads.
    Thread: the basic unit for the operating system to allocate CPU.
    Concurrent programming
    1. Improve execution performance: let parts in the program that have no causal relationship execute concurrently.
    2. Improve user experience: let time-consuming operations not cause fake death of the program.
    """
    import glob
    import os
    import threading
    
    from PIL import Image
    
    PREFIX = 'thumbnails'
    
    
    def generate_thumbnail(infile, size, format='PNG'):
        """Generate thumbnail of specified image file"""
        file, ext = os.path.splitext(infile)
        file = file[file.rfind('/') + 1:]
        outfile = f'{PREFIX}/{file}_{size[0]}_{size[1]}.{ext}'
        img = Image.open(infile)
        img.thumbnail(size, Image.ANTIALIAS)
        img.save(outfile, format)
    
    
    def main():
        """Main function"""
        if not os.path.exists(PREFIX):
            os.mkdir(PREFIX)
        for infile in glob.glob('images/*.png'):
            for size in (32, 64, 128):
                # Create and start thread
                threading.Thread(
                    target=generate_thumbnail,
                    args=(infile, (size, size))
                ).start()
    
    
    if __name__ == '__main__':
        main()

    The situation where multiple threads compete for resources.

    """
    Multi-threaded programs are usually easier if there is no competition for resources.
    When multiple threads compete for critical resources,
    if there is no necessary protection measure, data disorder will happen.
    Note: critical resource means the resource competed for by multiple threads.
    """
    import time
    import threading
    
    from concurrent.futures import ThreadPoolExecutor
    
    
    class Account(object):
        """Bank account"""
    
        def __init__(self):
            self.balance = 0.0
            self.lock = threading.Lock()
    
        def deposit(self, money):
            # Use a lock to protect critical resources
            with self.lock:
                new_balance = self.balance + money
                time.sleep(0.001)
                self.balance = new_balance
    
    
    def main():
        """Main function"""
        account = Account()
        # Create thread pool
        pool = ThreadPoolExecutor(max_workers=10)
        futures = []
        for _ in range(100):
            future = pool.submit(account.deposit, 1)
            futures.append(future)
        # Close thread pool
        pool.shutdown()
        for future in futures:
            future.result()
        print(account.balance)
    
    
    if __name__ == '__main__':
        main()

    Modify the program above. Start 5 threads to deposit money into the account, and 5 threads to take money from the account. When taking money, if the balance is not enough, then pause the thread and wait. To reach the goal above, we need to schedule the deposit and withdraw threads. When the balance is not enough, the withdraw thread pauses and releases the lock, and after the deposit thread puts money in, it should notify the withdraw thread, so that it wakes up from the paused state. We can use Condition in the threading module to implement thread scheduling. This object is also created based on a lock. The code is shown below:

    """
    Multiple threads compete for one resource: protect critical resource, use lock, Lock / RLock
    Multiple threads compete for multiple resources, thread count > resource count: semaphore, Semaphore
    Scheduling of multiple threads: pause thread execution / wake up waiting thread, Condition
    """
    from concurrent.futures import ThreadPoolExecutor
    from random import randint
    from time import sleep
    
    import threading
    
    
    class Account:
        """Bank account"""
    
        def __init__(self, balance=0):
            self.balance = balance
            lock = threading.RLock()
            self.condition = threading.Condition(lock)
    
        def withdraw(self, money):
            """Withdraw money"""
            with self.condition:
                while money > self.balance:
                    self.condition.wait()
                new_balance = self.balance - money
                sleep(0.001)
                self.balance = new_balance
    
        def deposit(self, money):
            """Deposit money"""
            with self.condition:
                new_balance = self.balance + money
                sleep(0.001)
                self.balance = new_balance
                self.condition.notify_all()
    
    
    def add_money(account):
        while True:
            money = randint(5, 10)
            account.deposit(money)
            print(threading.current_thread().name,
                  ':', money, '====>', account.balance)
            sleep(0.5)
    
    
    def sub_money(account):
        while True:
            money = randint(10, 30)
            account.withdraw(money)
            print(threading.current_thread().name,
                  ':', money, '<====', account.balance)
            sleep(1)
    
    
    def main():
        account = Account()
        with ThreadPoolExecutor(max_workers=15) as pool:
            for _ in range(5):
                pool.submit(add_money, account)
            for _ in range(10):
                pool.submit(sub_money, account)
    
    
    if __name__ == '__main__':
        main()
  • Multi-process: multi-process can effectively solve the problem of GIL. The main class for implementing multi-process is Process. Other helper classes are similar to those in the threading module. For sharing data between processes, pipes and sockets can be used. In the multiprocessing module there is a Queue class. It provides a queue shared by multiple processes based on pipe and lock mechanism. Below is one example from the official documentation about multi-process and process pool.

    """
    Use of multi-process and process pool.
    Because of GIL, multi-threading cannot use the multi-core features of CPU.
    For compute-intensive tasks, we should consider using multi-process.
    time python3 example22.py
    real    0m11.512s
    user    0m39.319s
    sys     0m0.169s
    After using multi-process, the actual execution time is 11.512 seconds,
    while the user time 39.319 seconds is about 4 times the actual execution time.
    This proves that our program used the multi-core features of CPU through multi-process,
    and this computer has a 4-core CPU.
    """
    import concurrent.futures
    import math
    
    PRIMES = [
        1116281,
        1297337,
        104395303,
        472882027,
        533000389,
        817504243,
        982451653,
        112272535095293,
        112582705942171,
        112272535095293,
        115280095190773,
        115797848077099,
        1099726899285419
    ] * 5
    
    
    def is_prime(n):
        """Judge prime number"""
        if n % 2 == 0:
            return False
    
        sqrt_n = int(math.floor(math.sqrt(n)))
        for i in range(3, sqrt_n + 1, 2):
            if n % i == 0:
                return False
        return True
    
    
    def main():
        """Main function"""
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
                print('%d is prime: %s' % (number, prime))
    
    
    if __name__ == '__main__':
        main()

    Key point: Comparison between multi-threading and multi-process.

    The following situations need multi-threading:

    1. The program needs to maintain many shared states, especially mutable states. Lists, dictionaries, and sets in Python are all thread-safe, so using threads instead of processes to maintain shared state has relatively small cost.
    2. The program will spend a lot of time on I/O operations, and does not have much need for parallel computing, and does not need to occupy too much memory.

    The following situations need multi-process:

    1. The program executes compute-intensive tasks, for example bytecode operations, data processing, scientific computing.
    2. The input of the program can be split into blocks in parallel, and the operation results can be merged.
    3. The program has no limit at all in memory use, and does not strongly depend on I/O operations, for example reading and writing files and sockets.
  • Asynchronous processing: choose tasks from the task queue of the scheduler. The scheduler executes these tasks in an interleaving way. We cannot guarantee the tasks will execute in some specific order, because the execution order depends on whether one task in the queue is willing to give CPU processing time to another task. Asynchronous tasks are usually implemented through multi-task cooperative processing. Because the execution time and order are uncertain, we need callback style programming or a future object to get the execution result of the task. Python 3 supports asynchronous processing through the asyncio module and the await and async keywords, they officially became keywords in Python 3.7.

    """
    Asynchronous I/O, async / await
    """
    import asyncio
    
    
    def num_generator(m, n):
        """Number generator in specified range"""
        yield from range(m, n + 1)
    
    
    async def prime_filter(m, n):
        """Prime number filter"""
        primes = []
        for i in num_generator(m, n):
            flag = True
            for j in range(2, int(i ** 0.5 + 1)):
                if i % j == 0:
                    flag = False
                    break
            if flag:
                print('Prime =>', i)
                primes.append(i)
    
            await asyncio.sleep(0.001)
        return tuple(primes)
    
    
    async def square_mapper(m, n):
        """Square mapper"""
        squares = []
        for i in num_generator(m, n):
            print('Square =>', i * i)
            squares.append(i * i)
    
            await asyncio.sleep(0.001)
        return squares
    
    
    def main():
        """Main function"""
        loop = asyncio.get_event_loop()
        future = asyncio.gather(prime_filter(2, 100), square_mapper(1, 100))
        future.add_done_callback(lambda x: print(x.result()))
        loop.run_until_complete(future)
        loop.close()
    
    
    if __name__ == '__main__':
        main()

    Note: The code above uses the get_event_loop function to get the system default event loop. Through the gather function, we can get a future object. The add_done_callback of the future object can add a callback function to run when execution is completed. The run_until_complete method of the loop object can wait for the coroutine execution result obtained through the future object.

    Python has a third-party library named aiohttp. It provides asynchronous HTTP client and server. This third-party library can work together with the asyncio module, and provides support for Future objects. Python 3.6 introduced async and await to define asynchronously executed functions and create asynchronous context, and in Python 3.7 they officially became keywords. The code below asynchronously gets pages from 5 URLs and uses named capture groups in regular expressions to extract the website titles.

    import asyncio
    import re
    
    import aiohttp
    
    PATTERN = re.compile(r'\<title\>(?P<title>.*)\<\/title\>')
    
    
    async def fetch_page(session, url):
        async with session.get(url, ssl=False) as resp:
            return await resp.text()
    
    
    async def show_title(url):
        async with aiohttp.ClientSession() as session:
            html = await fetch_page(session, url)
            print(PATTERN.search(html).group('title'))
    
    
    def main():
        urls = ('https://www.python.org/',
                'https://git-scm.com/',
                'https://www.jd.com/',
                'https://www.taobao.com/',
                'https://www.douban.com/')
        loop = asyncio.get_event_loop()
        cos = [show_title(url) for url in urls]
        loop.run_until_complete(asyncio.wait(cos))
        loop.close()
    
    
    if __name__ == '__main__':
        main()

    Key point: Comparison between asynchronous I/O and multi-process.

    When the program does not need real concurrency or parallelism, but depends more on asynchronous processing and callback, asyncio is a very good choice. If there is a large amount of waiting and sleeping in the program, asyncio should also be considered. It is very suitable for writing Web application servers that do not need real-time data processing.

    Python also has many third-party libraries for handling parallel tasks, such as joblib and PyMP. In actual development, to improve the scalability and concurrency of the system, there are usually two ways: vertical scaling, increase the processing ability of a single node, and horizontal scaling, change one single node into multiple nodes. We can use message queues to decouple the application. A message queue is equal to an extended version of the multi-thread synchronous queue. Applications on different machines are equal to threads, and the shared distributed message queue is the original Queue in the program. The most popular and standardized implementation of message queue, message-oriented middleware, is AMQP, Advanced Message Queuing Protocol. AMQP comes from the financial industry, and provides queueing, routing, reliable transmission, security, and other functions. The most famous implementations include Apache ActiveMQ, RabbitMQ, and so on.

    To make tasks asynchronous, we can use the third-party library named Celery. Celery is a distributed task queue written in Python. It works by using distributed messages, and can use RabbitMQ or Redis as the message broker in the backend.