Commit 601ade6: Upload introduction chapter
1 parent 93ea2ba, 10 files changed: 328 additions & 0 deletions
# Application Scenarios of Machine Learning Systems

A machine learning framework is used in diverse scenarios, giving rise
to a range of *machine learning systems*. In a broad sense, a machine
learning system is a collective term for the software and hardware
systems that facilitate and execute machine learning applications.
Figure :numref:`intro/system-ecosystem` provides an overview of the
various application scenarios for machine learning systems.

![Application scenarios of machine learning systems](../img/intro/system-ecosystem.png)
:label:`intro/system-ecosystem`

1. **Federated learning:** Laws and regulations on user privacy and
   data protection prevent many machine learning applications from
   accessing user data directly for model training. Federated
   learning, built on top of a machine learning framework, enables
   such applications to train models without centralizing the data.

2. **Recommender systems:** Incorporating machine learning (especially
   deep learning) into recommender systems has achieved major success
   over the past few years. Compared with traditional rule-based
   recommender systems, deep-learning-based ones can analyze massive
   user feature data more effectively, bringing huge improvements in
   the accuracy and timeliness of recommendations.

3. **Reinforcement learning:** Because reinforcement learning collects
   data and trains models in a distinctive way, dedicated reinforcement
   learning systems need to be developed on top of a machine learning
   framework.

4. **Explainable AI:** As machine learning becomes increasingly
   popular in key areas such as finance, healthcare, and governmental
   affairs, developing explainable AI systems based on a machine
   learning framework is gaining wider attention.

5. **Robotics:** Robotics is another area where machine learning
   frameworks are gaining popularity. Compared with traditional robot
   vision methods, machine learning methods have achieved enormous
   success in robot tasks such as automatic feature extraction, target
   recognition, and path planning.

6. **Graph learning:** Graphs are among the most widely used data
   structures and express large volumes of Internet data, for
   instance, social network graphs and product relationship graphs.
   Machine learning algorithms have proven effective for analyzing
   large-scale graph data, and a machine learning system designed to
   process graph data is referred to as a graph learning system.

7. **Scientific computing:** Scientific computing covers a wide range
   of traditional fields (such as electromagnetic simulation, graphics,
   and weather forecasting) in which many large-scale problems can be
   solved effectively by machine learning methods. Developing special
   machine learning systems for scientific computing is therefore
   becoming an increasingly common practice.

8. **Scheduling of machine learning clusters:** A machine learning
   cluster consists of heterogeneous processors, heterogeneous
   networks, and even heterogeneous storage devices. At the same time,
   its computing tasks often share common execution characteristics,
   such as iterative execution built on the collective communication
   operator AllReduce. To account for both the heterogeneity of
   devices and these common task characteristics, a machine learning
   cluster often requires a special scheduling method.

9. **Quantum computing:** Quantum computers are generally realized
   through a hybrid architecture, in which quantum computing is
   performed by quantum hardware and the simulation of quantum
   computers is performed by classical computers. Many simulation
   systems (such as TensorFlow Quantum and MindQuantum) are built on
   a machine learning framework because the simulation often requires
   massive matrix computations and gradient computation.

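The AllReduce operator mentioned under cluster scheduling can be illustrated with a small simulation. The sketch below shows only what the collective computes (after AllReduce every worker holds the element-wise sum of all workers' vectors), not an efficient ring implementation; the function name is ours.

```python
# Naive simulation of an AllReduce-sum: each worker holds a local
# gradient vector; afterwards every worker holds the element-wise sum.
# Real systems implement this with efficient ring or tree algorithms.

def allreduce_sum(worker_grads):
    """worker_grads: list of equal-length gradient lists, one per worker."""
    length = len(worker_grads[0])
    total = [sum(g[i] for g in worker_grads) for i in range(length)]
    # Every worker receives the same reduced vector.
    return [list(total) for _ in worker_grads]

# Example: two workers, each with a two-element gradient.
result = allreduce_sum([[1.0, 2.0], [3.0, 4.0]])
# Each worker now holds [4.0, 6.0].
```

Dividing the reduced vector by the worker count turns the same primitive into the gradient averaging used by data-parallel training.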
There are far too many machine learning systems for this book to cover
them all in depth. Instead, we aim to provide a system designer's
perspective on several core systems used in federated learning,
recommender systems, reinforcement learning, explainable AI, and
robotics.
# Book Organization and Intended Audience

This book adopts a level-by-level approach to discussing the design
principles and implementation practices of machine learning systems.
The **Framework Design** part starts by introducing key concepts that
framework users need to understand, including programming interface
design and computational graphs. It then describes the frontend and
backend techniques used in AI compilers, as well as key techniques for
processing data, deploying models, and distributing training across
multiple machines. The **Application Scenarios** part elaborates on
several important types of machine learning systems, such as federated
learning and recommender systems, aiming to equip readers with useful
knowledge for deploying and operating machine learning frameworks in
different application scenarios.

This book is intended for the following readers:

1. **Students:** This book provides a wealth of design principles and
   hands-on experience with machine learning systems. Such knowledge
   will help students better understand the theoretical pros and cons
   as well as the practical challenges of machine learning algorithms.

2. **Researchers:** This book aims to help researchers tackle various
   challenges in machine learning implementation and to guide them
   through the design of next-generation machine learning algorithms
   meant to solve large-scale practical problems.

3. **Developers:** We also hope this book will give developers a
   profound understanding of the internal architecture of a machine
   learning system. Such knowledge will move them a step further in
   developing new functions for their applications, debugging system
   performance issues, and even customizing a machine learning system
   to their business needs.
# Design Objectives of Machine Learning Frameworks

*Machine learning frameworks* (e.g., TensorFlow, PyTorch, and MindSpore)
were designed and implemented so that machine learning algorithms could
be developed efficiently for different applications. Broadly speaking,
these frameworks share the following design objectives.

1. **Neural network programming:** The huge success of deep learning
   has solidified neural networks as the core of many machine learning
   applications. People need to customize neural networks to meet
   specific application requirements; such customization typically
   produces architectures such as convolutional neural networks (CNNs)
   and self-attention networks. Developing, training, and deploying
   these networks calls for generic system software.

2. **Automatic differentiation:** Training a neural network involves
   repeatedly computing gradients from the training data, data
   annotations, and a loss function in order to iteratively improve
   the model parameters. Computing gradients manually is complex and
   time-consuming, so a machine learning framework is expected to
   compute them automatically from the neural network program a
   developer provides. This process is called automatic
   differentiation.

3. **Data management and processing:** Data is the key to machine
   learning. A system must handle several types of data, including
   training, validation, and test datasets as well as model
   parameters. A machine learning system should be able to read,
   store, and preprocess (e.g., augment and cleanse) these types of
   data by itself.

4. **Model training and deployment:** A machine learning model is
   expected to deliver optimal performance. To achieve this, we use an
   optimization method, for example, mini-batch stochastic gradient
   descent (SGD), to repeatedly compute gradients over many iterative
   steps. This process is called training. Once training is complete,
   the trained model can be deployed to an inference device.

5. **Hardware accelerators:** Many core operations in machine learning
   can be expressed as matrix computation. To accelerate such
   computation, machine learning developers leverage specially
   designed hardware components referred to as hardware accelerators
   or AI chips.

6. **Distributed training:** As the volume of training data and the
   number of neural network parameters grow, the memory used by a
   machine learning system far exceeds what a single machine can
   provide. A machine learning framework should therefore be able to
   train models across distributed machines.

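The automatic differentiation objective above (item 2) can be made concrete with a minimal sketch. Real frameworks implement the far more general reverse-mode differentiation over a computational graph; the dual-number forward mode below is only the simplest possible illustration, and all names are ours.

```python
# Forward-mode automatic differentiation via dual numbers: each value
# carries its derivative alongside it, and the arithmetic rules
# propagate both. No manual gradient derivation is needed.

class Dual:
    def __init__(self, value, deriv):
        self.value = value   # f(x)
        self.deriv = deriv   # f'(x)

    def __add__(self, other):
        # Sum rule: (f + g)' = f' + g'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (f * g)' = f'g + fg'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def f(x):
    # f(x) = x * x + x, so analytically f'(x) = 2x + 1
    return x * x + x

x = Dual(3.0, 1.0)   # seed derivative dx/dx = 1
y = f(x)
# y.value == 12.0 and y.deriv == 7.0, matching f(3) and f'(3)
```

The developer only writes `f`; the derivative comes out of the arithmetic rules automatically, which is exactly the convenience the objective describes.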
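The mini-batch SGD loop from the model training objective (item 4) can be sketched in a few lines of pure Python. In a real framework the gradient would come from automatic differentiation and the loop would run on accelerators; the learning rate, batch size, and toy dataset here are illustrative choices of ours.

```python
# Mini-batch SGD fitting y = w * x with squared loss.
import random

random.seed(0)
data = [(float(x), 2.0 * x) for x in range(10)]   # ground truth: y = 2x

w = 0.0      # model parameter, initialized at zero
lr = 0.01    # learning rate (illustrative choice)
for step in range(200):
    batch = random.sample(data, 4)                # draw a mini-batch
    # d/dw of the squared error (w*x - y)^2 is 2*x*(w*x - y)
    grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
    w -= lr * grad                                # SGD update

# w converges toward the true slope 2.0
```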
Early attempts to build such frameworks relied on traditional tools:
*neural network libraries* (e.g., Theano and Caffe) and *data
processing frameworks* (e.g., Apache Spark and Google's Pregel), but
the results were disappointing. Although the neural network libraries
of the time supported neural network development, automatic
differentiation, and hardware accelerators, they lacked the ability to
manage and process large datasets, deploy models, or execute models in
a distributed manner, which made them inadequate for developing
today's product-level machine learning applications. Conversely,
data-parallel computing frameworks were already mature in distributed
execution and data management but were unsuitable for developing
neural-network-centered machine learning applications because they
lacked support for neural networks, automatic differentiation, and
accelerators.

These drawbacks led many enterprise developers and university
researchers to design and implement their own machine learning
software frameworks from scratch. In only a few short years, numerous
machine learning frameworks emerged; well-known examples include
TensorFlow, PyTorch, MindSpore, MXNet, PaddlePaddle, OneFlow, and
CNTK. These frameworks significantly boosted the development of AI in
both upstream and downstream industries. Table
:numref:`intro-comparison` lists the differences between machine
learning frameworks and related systems.

:Differences between machine learning frameworks and related systems

| Design Method               | Neural Network | Automatic Differentiation | Data Management | Training and Deployment | Accelerator | Distributed Training |
|-----------------------------|----------------|---------------------------|-----------------|-------------------------|-------------|----------------------|
| Neural network libraries    | Yes            | Yes                       | No              | No                      | Yes         | No                   |
| Data processing frameworks  | No             | No                        | Yes             | No                      | No          | Yes                  |
| Machine learning frameworks | Yes            | Yes                       | Yes             | Yes                     | Yes         | Yes                  |

:label:`intro-comparison`
chapter_introduction/Index.md
# Introduction

This chapter aims to provide readers with a comprehensive understanding
of machine learning systems by describing the applications of machine
learning and summarizing the design objectives and basic composition
principles of such systems.
# Machine Learning Applications

In general terms, machine learning is a technology that learns useful
knowledge from data. There are a variety of machine learning methods,
including supervised learning, unsupervised learning, and reinforcement
learning.

1. In supervised learning, the mapping between inputs and outputs is
   made known to the machine through labeled examples. For example, a
   discrete label can be assigned to each input image.

2. In unsupervised learning, input data is provided to the machine
   without any labels. For example, to distinguish cats and dogs among
   a group of images, a machine must learn by itself the
   characteristics of cats and dogs in order to separate them. Such
   unsupervised classification is also called clustering.

3. In reinforcement learning, an algorithm running on the machine
   improves itself automatically to achieve a task objective in a
   given learning environment. A well-known example is AlphaGo, in
   which the rules of Go serve as the learning environment and winning
   the game is set as the task objective.

Machine learning is applied in a variety of fields: computer vision,
natural language processing (NLP), and intelligent decision-making, to
name just a few. Computer vision covers image-based applications such
as facial recognition, object recognition, target tracking, human pose
estimation, and image understanding, and it is widely used in
autonomous driving, smart cities, smart security, and other scenarios.

NLP involves both text- and speech-related applications, including
language translation, text-to-speech and speech-to-text conversion,
text understanding, and text style transfer. NLP and computer vision
overlap in many respects. For instance, to generate text descriptions
for images, or to generate or process images based on text, machines
need to handle both language and image data.

Intelligent decision-making is usually achieved through technical means
such as computer vision, NLP, reinforcement learning, and cybernetics.
It is widely used in many scenarios, such as robotics, autonomous
driving, games, recommender systems, smart factories, and smart grids.

These machine learning applications use different underlying
algorithms, such as support vector machines (SVMs), logistic
regression, and naive Bayes, based on their needs and characteristics.
In recent years, deep learning has progressed significantly thanks to
the availability of massive data, advances in neural network
algorithms, and the maturity of hardware accelerators. Yet despite the
wide variety of machine learning algorithms, the vast majority of the
computation still comes down to vector and matrix operations,
regardless of whether classical or deep learning algorithms are
employed. In this book, we therefore focus on machine learning systems
that employ neural networks.
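To illustrate the point that most of this computation reduces to vector and matrix operations, here is the forward pass of a single fully connected layer written as a plain matrix-vector product. This is a minimal pure-Python sketch of ours; frameworks dispatch the identical operation to BLAS libraries or hardware accelerators.

```python
# A fully connected layer is a matrix-vector product plus a bias and a
# nonlinearity -- the operation pattern that dominates both classical
# and deep learning workloads.

def matvec(matrix, vector):
    # Plain matrix-vector product (what accelerators are built for).
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(values):
    return [max(0.0, v) for v in values]

def dense_forward(weights, bias, inputs):
    pre = [p + b for p, b in zip(matvec(weights, inputs), bias)]
    return relu(pre)

W = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, -1.0]
out = dense_forward(W, b, [2.0, 1.0])
# out == [1.0, 0.5]
```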
# Machine Learning Framework Architecture

Figure :numref:`intro/framework-architecture` shows the basic
architecture of a typical, complete machine learning framework.

![Architecture of a machine learning framework](../img/intro/framework-architecture.png)
:label:`intro/framework-architecture`

1. **Programming interfaces:** A machine learning framework needs to
   provide programming interfaces, usually in high-level languages
   such as Python, to cater to the diverse backgrounds of machine
   learning developers. At the same time, the framework's core is
   typically implemented in low-level languages (e.g., C and C++) so
   that operating system features (e.g., thread management and network
   communication) and hardware accelerators can be used efficiently
   for optimal performance.

2. **Computational graph:** Machine learning applications, though
   written against different programming interfaces, need to share the
   same backend at run time, and computational graph technology is key
   to realizing this backend. A computational graph defines the user's
   machine learning program: its nodes represent computational
   operations, and its edges represent the dependencies between those
   operations.

3. **Compiler frontend:** Once a computational graph is built, the
   machine learning framework analyzes and optimizes it through the
   compiler frontend, which provides key functions such as
   intermediate representation, automatic differentiation, type
   inference, and static analysis.

4. **Compiler backend and runtime:** After the computational graph has
   been analyzed and optimized, the framework uses the compiler
   backend and runtime to optimize for the underlying hardware. In
   addition to refining operator selection and scheduling order,
   common optimizations analyze properties such as L2/L3 cache sizes
   and instruction pipeline lengths to match the hardware's
   specifications.

5. **Heterogeneous processors:** A machine learning application is
   co-executed by central processing units (CPUs) and hardware
   accelerators (such as NVIDIA GPUs, Huawei Ascend processors, and
   Google TPUs). During execution, non-matrix operations (e.g.,
   complex data preprocessing and computational graph scheduling) are
   handled by CPUs, whereas matrix operations and certain frequently
   used machine learning operators (e.g., Transformer and convolution
   operators) are performed by hardware accelerators.

6. **Data processing:** A machine learning application needs to
   perform complex preprocessing on raw data and manage large numbers
   of training, validation, and test datasets. The data processing
   module (e.g., TensorFlow's tf.data module or PyTorch's DataLoader)
   is responsible for these data-centered operations.

7. **Model deployment:** Besides model training, model deployment is
   another key function of a machine learning framework. Model
   compression technologies such as model conversion, quantization,
   and distillation enable models to run on hardware with limited
   memory. Model operators must also be optimized for specific
   inference platforms (e.g., NVIDIA Orin). Furthermore, to protect a
   model (e.g., to deny unauthorized reads), model obfuscation must be
   considered in the framework's design.

8. **Distributed training:** A machine learning model is usually
   trained in parallel across distributed compute nodes. Common
   parallel training methods include data parallelism, model
   parallelism, hybrid parallelism, and pipeline parallelism, all of
   which are usually implemented through remote procedure calls (RPC),
   collective communication, or a parameter server.
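The computational graph described in item 2 can be sketched minimally: nodes carry operations, edges carry dependencies, and evaluation follows dependency order. This toy design is ours and is far simpler than any real framework's graph, which would also carry shapes, types, and gradients.

```python
# Minimal computational graph: a node holds an operation and references
# to its input nodes (the incoming dependency edges).

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "add", or "mul"
        self.inputs = inputs  # upstream nodes
        self.value = value    # only used by "const" nodes

def evaluate(node):
    # Recursing over inputs visits nodes in dependency order, so every
    # input is computed before the node that consumes it.
    if node.op == "const":
        return node.value
    args = [evaluate(n) for n in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError(f"unknown op {node.op!r}")

# Graph for (a + b) * c with a=2, b=3, c=4.
a, b, c = Node("const", value=2), Node("const", value=3), Node("const", value=4)
result = evaluate(Node("mul", (Node("add", (a, b)), c)))
# result == 20
```

Because the graph is an explicit data structure, a framework can analyze and rewrite it (the compiler frontend's job) before any value is computed.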
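The quantization mentioned under model deployment (item 7) can be illustrated by a simple post-training scheme that maps float weights to 8-bit integers with a single scale factor, shrinking storage roughly fourfold at some cost in precision. This is a hypothetical minimal sketch of ours; production toolchains use calibration data and per-channel scales.

```python
# Post-training weight quantization to a symmetric int8 range.

def quantize(weights):
    scale = max(abs(w) for w in weights) / 127.0   # one scale per tensor
    q = [round(w / scale) for w in weights]        # ints in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each restored weight is within one scale step of the original.
```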

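Data parallelism, the most common of the distributed training methods in item 8, can be sketched as follows: each worker holds a model replica, computes gradients on its own shard of the mini-batch, and the gradients are averaged (in practice via AllReduce) before a shared update. The workers here are simulated in one process, and all names and numbers are illustrative.

```python
# One step of simulated data-parallel SGD for the model y = w * x.

def local_gradient(w, shard):
    # Gradient of the squared error (w*x - y)^2, averaged over the shard.
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_workers, lr):
    size = len(batch) // num_workers
    shards = [batch[i * size:(i + 1) * size] for i in range(num_workers)]
    grads = [local_gradient(w, s) for s in shards]  # run in parallel in practice
    avg = sum(grads) / num_workers                  # gradient averaging (AllReduce-mean)
    return w - lr * avg                             # identical update on every replica

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = data_parallel_step(0.0, batch, num_workers=2, lr=0.05)
# One step moves w from 0.0 toward the true slope 2.0 (here to 1.5)
```

Because every replica applies the same averaged gradient, all replicas stay synchronized without ever exchanging the raw training data.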