Commit 601ade6: Upload introduction chapter
1 parent 93ea2ba, 10 files changed: 328 additions & 0 deletions
# Application Scenarios of Machine Learning Systems

A machine learning framework is used in diverse scenarios, giving rise
to a range of *machine learning systems*. In a broad sense, a machine
learning system is a collective term for the software and hardware
systems that facilitate and execute machine learning applications.
Figure :numref:`intro/system-ecosystem` provides an overview of the
various application scenarios for machine learning systems.

![Application scenarios of machine learning systems](../img/intro/system-ecosystem.png)
:label:`intro/system-ecosystem`

1. **Federated learning:** Laws and regulations on user privacy and
   data protection prevent many machine learning applications from
   accessing user data directly for model training. Federated
   learning, built on top of a machine learning framework, enables
   such applications to train models without centralizing the data.

2. **Recommender systems:** Incorporating machine learning (especially
   deep learning) into recommender systems has achieved major success
   over the past few years. Compared with traditional rule-based
   recommender systems, deep-learning-based ones can analyze massive
   user feature data more effectively, bringing huge improvements in
   the accuracy and timeliness of recommendations.

3. **Reinforcement learning:** Because reinforcement learning collects
   data and trains models in a distinctive way, dedicated reinforcement
   learning systems need to be developed on top of a machine learning
   framework.

4. **Explainable AI:** As machine learning becomes increasingly
   popular in key areas such as finance, healthcare, and governmental
   affairs, developing explainable AI systems based on a machine
   learning framework is gaining wider attention.

5. **Robotics:** Robotics is another area where machine learning
   frameworks are gaining popularity. Compared with traditional robot
   vision methods, machine learning methods have achieved enormous
   success in robot tasks such as automatic feature extraction, target
   recognition, and path planning.

6. **Graph learning:** Graphs are among the most widely used data
   structures and express large volumes of Internet data, for
   instance, social network graphs and product relationship graphs.
   Machine learning algorithms have proven effective for analyzing
   large-scale graph data, and a machine learning system designed to
   process graph data is referred to as a graph learning system.

7. **Scientific computing:** Scientific computing covers a wide range
   of traditional fields (such as electromagnetic simulation, graphics,
   and weather forecasting) in which many large-scale problems can be
   solved effectively by machine learning methods. Developing special
   machine learning systems for scientific computing is therefore
   becoming an increasingly common practice.

8. **Scheduling of machine learning clusters:** A machine learning
   cluster consists of heterogeneous processors, heterogeneous
   networks, and even heterogeneous storage devices. At the same time,
   its computing tasks often share common execution characteristics,
   such as iterative execution built on the collective communication
   operator AllReduce. To account for both the heterogeneity of
   devices and these common task characteristics, a machine learning
   cluster often requires a special scheduling method.

9. **Quantum computing:** Quantum computers are generally realized
   through a hybrid architecture, in which quantum computing is
   performed by quantum hardware and the simulation of quantum
   computers is performed by classical computers. Many simulation
   systems (such as TensorFlow Quantum and MindQuantum) are built on
   a machine learning framework because the simulation often requires
   massive matrix computations and gradient computation.

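The AllReduce operator mentioned under cluster scheduling can be illustrated with a small simulation. The sketch below shows only what the collective computes (after AllReduce every worker holds the element-wise sum of all workers' vectors), not an efficient ring implementation; the function name is ours.

```python
# Naive simulation of an AllReduce-sum: each worker holds a local
# gradient vector; afterwards every worker holds the element-wise sum.
# Real systems implement this with efficient ring or tree algorithms.

def allreduce_sum(worker_grads):
    """worker_grads: list of equal-length gradient lists, one per worker."""
    length = len(worker_grads[0])
    total = [sum(g[i] for g in worker_grads) for i in range(length)]
    # Every worker receives the same reduced vector.
    return [list(total) for _ in worker_grads]

# Example: two workers, each with a two-element gradient.
result = allreduce_sum([[1.0, 2.0], [3.0, 4.0]])
# Each worker now holds [4.0, 6.0].
```

Dividing the reduced vector by the worker count turns the same primitive into the gradient averaging used by data-parallel training.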
There are far too many machine learning systems for this book to cover
them all in depth. Instead, we aim to provide a system designer's
perspective on several core systems used in federated learning,
recommender systems, reinforcement learning, explainable AI, and
robotics.
# Book Organization and Intended Audience

This book adopts a level-by-level approach to discussing the design
principles and implementation practices of machine learning systems.
The **Framework Design** part starts by introducing key concepts that
framework users need to understand, including programming interface
design and computational graphs. It then describes the frontend and
backend techniques used in AI compilers, as well as key techniques for
processing data, deploying models, and distributing training across
multiple machines. The **Application Scenarios** part elaborates on
several important types of machine learning systems, such as federated
learning and recommender systems, aiming to equip readers with useful
knowledge for deploying and operating machine learning frameworks in
different application scenarios.

This book is intended for the following readers:

1. **Students:** This book provides a wealth of design principles and
   hands-on experience with machine learning systems. Such knowledge
   will help students better understand the theoretical pros and cons
   as well as the practical challenges of machine learning algorithms.

2. **Researchers:** This book aims to help researchers tackle various
   challenges in machine learning implementation and to guide them
   through the design of next-generation machine learning algorithms
   meant to solve large-scale practical problems.

3. **Developers:** We also hope this book will give developers a
   profound understanding of the internal architecture of a machine
   learning system. Such knowledge will move them a step further in
   developing new functions for their applications, debugging system
   performance issues, and even customizing a machine learning system
   to their business needs.
# Design Objectives of Machine Learning Frameworks

*Machine learning frameworks* (e.g., TensorFlow, PyTorch, and MindSpore)
were designed and implemented so that machine learning algorithms could
be developed efficiently for different applications. Broadly speaking,
these frameworks share the following design objectives.

1. **Neural network programming:** The huge success of deep learning
   has solidified neural networks as the core of many machine learning
   applications. People need to customize neural networks to meet
   specific application requirements; such customization typically
   produces architectures such as convolutional neural networks (CNNs)
   and self-attention networks. Developing, training, and deploying
   these networks calls for generic system software.

2. **Automatic differentiation:** Training a neural network involves
   repeatedly computing gradients from the training data, data
   annotations, and a loss function in order to iteratively improve
   the model parameters. Computing gradients manually is complex and
   time-consuming, so a machine learning framework is expected to
   compute them automatically from the neural network program a
   developer provides. This process is called automatic
   differentiation.

3. **Data management and processing:** Data is the key to machine
   learning. A system must handle several types of data, including
   training, validation, and test datasets as well as model
   parameters. A machine learning system should be able to read,
   store, and preprocess (e.g., augment and cleanse) these types of
   data by itself.

4. **Model training and deployment:** A machine learning model is
   expected to deliver optimal performance. To achieve this, we use an
   optimization method, for example, mini-batch stochastic gradient
   descent (SGD), to repeatedly compute gradients over many iterative
   steps. This process is called training. Once training is complete,
   the trained model can be deployed to an inference device.

5. **Hardware accelerators:** Many core operations in machine learning
   can be expressed as matrix computation. To accelerate such
   computation, machine learning developers leverage specially
   designed hardware components referred to as hardware accelerators
   or AI chips.

6. **Distributed training:** As the volume of training data and the
   number of neural network parameters grow, the memory used by a
   machine learning system far exceeds what a single machine can
   provide. A machine learning framework should therefore be able to
   train models across distributed machines.

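The automatic differentiation objective above (item 2) can be made concrete with a minimal sketch. Real frameworks implement the far more general reverse-mode differentiation over a computational graph; the dual-number forward mode below is only the simplest possible illustration, and all names are ours.

```python
# Forward-mode automatic differentiation via dual numbers: each value
# carries its derivative alongside it, and the arithmetic rules
# propagate both. No manual gradient derivation is needed.

class Dual:
    def __init__(self, value, deriv):
        self.value = value   # f(x)
        self.deriv = deriv   # f'(x)

    def __add__(self, other):
        # Sum rule: (f + g)' = f' + g'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (f * g)' = f'g + fg'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def f(x):
    # f(x) = x * x + x, so analytically f'(x) = 2x + 1
    return x * x + x

x = Dual(3.0, 1.0)   # seed derivative dx/dx = 1
y = f(x)
# y.value == 12.0 and y.deriv == 7.0, matching f(3) and f'(3)
```

The developer only writes `f`; the derivative comes out of the arithmetic rules automatically, which is exactly the convenience the objective describes.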
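The mini-batch SGD loop from the model training objective (item 4) can be sketched in a few lines of pure Python. In a real framework the gradient would come from automatic differentiation and the loop would run on accelerators; the learning rate, batch size, and toy dataset here are illustrative choices of ours.

```python
# Mini-batch SGD fitting y = w * x with squared loss.
import random

random.seed(0)
data = [(float(x), 2.0 * x) for x in range(10)]   # ground truth: y = 2x

w = 0.0      # model parameter, initialized at zero
lr = 0.01    # learning rate (illustrative choice)
for step in range(200):
    batch = random.sample(data, 4)                # draw a mini-batch
    # d/dw of the squared error (w*x - y)^2 is 2*x*(w*x - y)
    grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
    w -= lr * grad                                # SGD update

# w converges toward the true slope 2.0
```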
Early attempts to build such frameworks relied on traditional tools:
*neural network libraries* (e.g., Theano and Caffe) and *data
processing frameworks* (e.g., Apache Spark and Google's Pregel), but
the results were disappointing. Although the neural network libraries
of the time supported neural network development, automatic
differentiation, and hardware accelerators, they lacked the ability to
manage and process large datasets, deploy models, or execute models in
a distributed manner, which made them inadequate for developing
today's product-level machine learning applications. Conversely,
data-parallel computing frameworks were already mature in distributed
execution and data management but were unsuitable for developing
neural-network-centered machine learning applications because they
lacked support for neural networks, automatic differentiation, and
accelerators.

These drawbacks led many enterprise developers and university
researchers to design and implement their own machine learning
software frameworks from scratch. In only a few short years, numerous
machine learning frameworks emerged; well-known examples include
TensorFlow, PyTorch, MindSpore, MXNet, PaddlePaddle, OneFlow, and
CNTK. These frameworks significantly boosted the development of AI in
both upstream and downstream industries. Table
:numref:`intro-comparison` lists the differences between machine
learning frameworks and related systems.

:Differences between machine learning frameworks and related systems

| Design Method               | Neural Network | Automatic Differentiation | Data Management | Training and Deployment | Accelerator | Distributed Training |
|-----------------------------|----------------|---------------------------|-----------------|-------------------------|-------------|----------------------|
| Neural network libraries    | Yes            | Yes                       | No              | No                      | Yes         | No                   |
| Data processing frameworks  | No             | No                        | Yes             | No                      | No          | Yes                  |
| Machine learning frameworks | Yes            | Yes                       | Yes             | Yes                     | Yes         | Yes                  |

:label:`intro-comparison`
chapter_introduction/Index.md
# Introduction

This chapter aims to provide readers with a comprehensive understanding
of machine learning systems by describing the applications of machine
learning and summarizing the design objectives and basic composition
principles of such systems.
# Machine Learning Applications

In general terms, machine learning is a technology that learns useful
knowledge from data. There are a variety of machine learning methods,
including supervised learning, unsupervised learning, and reinforcement
learning.

1. In supervised learning, the mapping between inputs and outputs is
   made known to the machine through labeled examples. For example, a
   discrete label can be assigned to each input image.

2. In unsupervised learning, input data is provided to the machine
   without any labels. For example, to distinguish cats and dogs among
   a group of images, a machine must learn by itself the
   characteristics of cats and dogs in order to separate them. Such
   unsupervised classification is also called clustering.

3. In reinforcement learning, an algorithm running on the machine
   improves itself automatically to achieve a task objective in a
   given learning environment. A well-known example is AlphaGo, in
   which the rules of Go serve as the learning environment and winning
   the game is set as the task objective.

Machine learning is applied in a variety of fields: computer vision,
natural language processing (NLP), and intelligent decision-making, to
name just a few. Computer vision covers image-based applications such
as facial recognition, object recognition, target tracking, human pose
estimation, and image understanding, and it is widely used in
autonomous driving, smart cities, smart security, and other scenarios.

NLP involves both text- and speech-related applications, including
language translation, text-to-speech and speech-to-text conversion,
text understanding, and text style transfer. NLP and computer vision
overlap in many respects. For instance, to generate text descriptions
for images, or to generate or process images based on text, machines
need to handle both language and image data.

Intelligent decision-making is usually achieved through technical means
such as computer vision, NLP, reinforcement learning, and cybernetics.
It is widely used in many scenarios, such as robotics, autonomous
driving, games, recommender systems, smart factories, and smart grids.

These machine learning applications use different underlying
algorithms, such as support vector machines (SVMs), logistic
regression, and naive Bayes, based on their needs and characteristics.
In recent years, deep learning has progressed significantly thanks to
the availability of massive data, advances in neural network
algorithms, and the maturity of hardware accelerators. Yet despite the
wide variety of machine learning algorithms, the vast majority of the
computation still comes down to vector and matrix operations,
regardless of whether classical or deep learning algorithms are
employed. In this book, we therefore focus on machine learning systems
that employ neural networks.
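To illustrate the point that most of this computation reduces to vector and matrix operations, here is the forward pass of a single fully connected layer written as a plain matrix-vector product. This is a minimal pure-Python sketch of ours; frameworks dispatch the identical operation to BLAS libraries or hardware accelerators.

```python
# A fully connected layer is a matrix-vector product plus a bias and a
# nonlinearity -- the operation pattern that dominates both classical
# and deep learning workloads.

def matvec(matrix, vector):
    # Plain matrix-vector product (what accelerators are built for).
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(values):
    return [max(0.0, v) for v in values]

def dense_forward(weights, bias, inputs):
    pre = [p + b for p, b in zip(matvec(weights, inputs), bias)]
    return relu(pre)

W = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, -1.0]
out = dense_forward(W, b, [2.0, 1.0])
# out == [1.0, 0.5]
```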
# Machine Learning Framework Architecture

Figure :numref:`intro/framework-architecture` shows the basic
architecture of a typical, complete machine learning framework.

![Architecture of a machine learning framework](../img/intro/framework-architecture.png)
:label:`intro/framework-architecture`

1. **Programming interfaces:** A machine learning framework needs to
   provide programming interfaces, usually in high-level languages
   such as Python, to cater to the diverse backgrounds of machine
   learning developers. At the same time, the framework's core is
   typically implemented in low-level languages (e.g., C and C++) so
   that operating system features (e.g., thread management and network
   communication) and hardware accelerators can be used efficiently
   for optimal performance.

2. **Computational graph:** Machine learning applications, though
   written against different programming interfaces, need to share the
   same backend at run time, and computational graph technology is key
   to realizing this backend. A computational graph defines the user's
   machine learning program: its nodes represent computational
   operations, and its edges represent the dependencies between those
   operations.

3. **Compiler frontend:** Once a computational graph is built, the
   machine learning framework analyzes and optimizes it through the
   compiler frontend, which provides key functions such as
   intermediate representation, automatic differentiation, type
   inference, and static analysis.

4. **Compiler backend and runtime:** After the computational graph has
   been analyzed and optimized, the framework uses the compiler
   backend and runtime to optimize for the underlying hardware. In
   addition to refining operator selection and scheduling order,
   common optimizations analyze properties such as L2/L3 cache sizes
   and instruction pipeline lengths to match the hardware's
   specifications.

5. **Heterogeneous processors:** A machine learning application is
   co-executed by central processing units (CPUs) and hardware
   accelerators (such as NVIDIA GPUs, Huawei Ascend processors, and
   Google TPUs). During execution, non-matrix operations (e.g.,
   complex data preprocessing and computational graph scheduling) are
   handled by CPUs, whereas matrix operations and certain frequently
   used machine learning operators (e.g., Transformer and convolution
   operators) are performed by hardware accelerators.

6. **Data processing:** A machine learning application needs to
   perform complex preprocessing on raw data and manage large numbers
   of training, validation, and test datasets. The data processing
   module (e.g., TensorFlow's tf.data module or PyTorch's DataLoader)
   is responsible for these data-centered operations.

7. **Model deployment:** Besides model training, model deployment is
   another key function of a machine learning framework. Model
   compression technologies such as model conversion, quantization,
   and distillation enable models to run on hardware with limited
   memory. Model operators must also be optimized for specific
   inference platforms (e.g., NVIDIA Orin). Furthermore, to protect a
   model (e.g., to deny unauthorized reads), model obfuscation must be
   considered in the framework's design.

8. **Distributed training:** A machine learning model is usually
   trained in parallel across distributed compute nodes. Common
   parallel training methods include data parallelism, model
   parallelism, hybrid parallelism, and pipeline parallelism, all of
   which are usually implemented through remote procedure calls (RPC),
   collective communication, or a parameter server.
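The computational graph described in item 2 can be sketched minimally: nodes carry operations, edges carry dependencies, and evaluation follows dependency order. This toy design is ours and is far simpler than any real framework's graph, which would also carry shapes, types, and gradients.

```python
# Minimal computational graph: a node holds an operation and references
# to its input nodes (the incoming dependency edges).

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "add", or "mul"
        self.inputs = inputs  # upstream nodes
        self.value = value    # only used by "const" nodes

def evaluate(node):
    # Recursing over inputs visits nodes in dependency order, so every
    # input is computed before the node that consumes it.
    if node.op == "const":
        return node.value
    args = [evaluate(n) for n in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError(f"unknown op {node.op!r}")

# Graph for (a + b) * c with a=2, b=3, c=4.
a, b, c = Node("const", value=2), Node("const", value=3), Node("const", value=4)
result = evaluate(Node("mul", (Node("add", (a, b)), c)))
# result == 20
```

Because the graph is an explicit data structure, a framework can analyze and rewrite it (the compiler frontend's job) before any value is computed.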
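The quantization mentioned under model deployment (item 7) can be illustrated by a simple post-training scheme that maps float weights to 8-bit integers with a single scale factor, shrinking storage roughly fourfold at some cost in precision. This is a hypothetical minimal sketch of ours; production toolchains use calibration data and per-channel scales.

```python
# Post-training weight quantization to a symmetric int8 range.

def quantize(weights):
    scale = max(abs(w) for w in weights) / 127.0   # one scale per tensor
    q = [round(w / scale) for w in weights]        # ints in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each restored weight is within one scale step of the original.
```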

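Data parallelism, the most common of the distributed training methods in item 8, can be sketched as follows: each worker holds a model replica, computes gradients on its own shard of the mini-batch, and the gradients are averaged (in practice via AllReduce) before a shared update. The workers here are simulated in one process, and all names and numbers are illustrative.

```python
# One step of simulated data-parallel SGD for the model y = w * x.

def local_gradient(w, shard):
    # Gradient of the squared error (w*x - y)^2, averaged over the shard.
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_workers, lr):
    size = len(batch) // num_workers
    shards = [batch[i * size:(i + 1) * size] for i in range(num_workers)]
    grads = [local_gradient(w, s) for s in shards]  # run in parallel in practice
    avg = sum(grads) / num_workers                  # gradient averaging (AllReduce-mean)
    return w - lr * avg                             # identical update on every replica

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = data_parallel_step(0.0, batch, num_workers=2, lr=0.05)
# One step moves w from 0.0 toward the true slope 2.0 (here to 1.5)
```

Because every replica applies the same averaged gradient, all replicas stay synchronized without ever exchanging the raw training data.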