# Design Objectives of Machine Learning Frameworks

*Machine learning frameworks* (e.g., TensorFlow, PyTorch, and MindSpore)
are designed and implemented so that machine learning algorithms can be
developed efficiently for different applications. Broadly speaking,
these frameworks share the following design objectives.

1. **Neural network programming:** The huge success of deep learning
   has made neural networks the core of many machine learning
   applications. Developers need to customize neural networks to meet
   specific application requirements; such customization has produced,
   for example, convolutional neural networks (CNNs) and self-attention
   networks. Developing, training, and deploying these networks calls
   for generic system software.

2. **Automatic differentiation:** Training a neural network involves
   repeatedly computing the gradients of a loss function, evaluated on
   training data and its annotations, with respect to the model
   parameters, and using those gradients to iteratively improve the
   parameters. Computing gradients by hand is complex and
   time-consuming, so a machine learning framework is expected to
   compute them automatically from the neural network program that
   developers provide. This process is called automatic differentiation
   (see the first sketch following this list).

3. **Data management and processing:** Data is the key to machine
   learning. A system must handle several kinds of data, including
   training, validation, and test datasets, as well as model
   parameters. A machine learning framework should be able to read,
   store, and preprocess these data by itself; data augmentation and
   cleansing are typical preprocessing steps. A data pipeline sketch
   follows this list.

4. **Model training and deployment:** A machine learning model is
   expected to perform as well as possible on its task. Achieving this
   requires an optimization method, such as mini-batch stochastic
   gradient descent (SGD), that repeatedly computes gradients through
   multi-step iteration. This process is called training and is
   illustrated in the training-loop sketch following this list. Once
   training is complete, the trained model can be deployed to an
   inference device.

5. **Hardware accelerators:** Many core operations in machine learning
   reduce to matrix computation. To accelerate such computation,
   machine learning developers leverage specially designed hardware
   components referred to as hardware accelerators or AI chips; see the
   device placement sketch following this list.

6. **Distributed training:** As the volume of training data and the
   number of neural network parameters grow, the memory required by a
   machine learning system far exceeds what a single machine can
   provide. A machine learning framework should therefore be able to
   train models across distributed machines.

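To make the first two objectives concrete, the sketch below defines a
small network and lets the framework compute the gradients
automatically. It uses PyTorch purely as an illustration; the layer
sizes and the toy data are arbitrary assumptions, not taken from any
particular application.

```python
import torch
import torch.nn as nn

# A tiny fully connected network; the layer sizes are arbitrary,
# illustrative choices.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.randn(2, 4)  # a toy batch of two 4-feature samples
y = torch.randn(2, 1)  # toy annotations (targets)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()  # automatic differentiation populates all gradients

# Each parameter now holds the gradient of the loss with respect to it.
print(model[0].weight.grad.shape)  # torch.Size([8, 4])
```
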
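Objectives 3 and 4 typically come together in a training loop: the
framework's data utilities batch and shuffle the dataset, and an
optimizer applies mini-batch SGD. Another illustrative PyTorch sketch;
the synthetic data and the hyperparameters (batch size, learning rate,
number of epochs) are assumptions made for the example.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data standing in for a real dataset; a real pipeline would
# also apply preprocessing such as augmentation or cleansing.
features = torch.randn(100, 4)
targets = torch.randn(100, 1)
loader = DataLoader(TensorDataset(features, targets),
                    batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Mini-batch SGD: repeatedly compute gradients and update parameters.
for epoch in range(5):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
```
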
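For objective 5, frameworks usually expose accelerators through a
device abstraction: moving the model and the data onto a device
offloads the underlying matrix computation to it. A minimal PyTorch
sketch, which falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

# Pick a hardware accelerator if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 1).to(device)  # parameters move to the device
x = torch.randn(2, 4).to(device)    # input data moves to the device

# The matrix computation inside the layer now runs on the accelerator.
y = model(x)
```
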
Early attempts to build such frameworks relied on existing tools:
*neural network libraries* (e.g., Theano and Caffe) and *data
processing frameworks* (e.g., Apache Spark and Google's Pregel). The
results were disappointing. Neural network libraries of the time
supported neural network development, automatic differentiation, and
hardware accelerators, but they could not manage and process large
datasets, deploy models, or execute models in a distributed manner, so
they fell short of what production-grade machine learning applications
require. Data-parallel computing frameworks, conversely, were already
mature in distributed execution and data management, yet they lacked
support for neural networks, automatic differentiation, and
accelerators, which made them unsuitable for neural network-centered
machine learning applications.

These drawbacks led many enterprise developers and university
researchers to design and implement their own machine learning
frameworks from scratch. Within only a few years, numerous machine
learning frameworks emerged; well-known examples include TensorFlow,
PyTorch, MindSpore, MXNet, PaddlePaddle, OneFlow, and CNTK. These
frameworks significantly boosted the development of AI in both
upstream and downstream industries. Table :numref:`intro-comparison`
lists the differences between machine learning frameworks and related
systems.

:Differences between machine learning frameworks and related systems

| System Type                 | Neural Network | Automatic Differentiation | Data Management | Training and Deployment | Accelerator | Distributed Training |
|-----------------------------|----------------|---------------------------|-----------------|-------------------------|-------------|----------------------|
| Neural network libraries    | Yes            | Yes                       | No              | No                      | Yes         | No                   |
| Data processing frameworks  | No             | No                        | Yes             | No                      | No          | Yes                  |
| Machine learning frameworks | Yes            | Yes                       | Yes             | Yes                     | Yes         | Yes                  |
:label:intro-comparison