Paddle Refactorization Overall Design
- PaddlePaddle represents the training and inference of DL models as computation graphs.
- Graphs are constructed by a Python program.
- A graph is composed of variables and operators.
- A graph should be able to be serialized for distributed training.
- There are two stages to process the graph:
  - compile: run a Python program to generate a protobuf message representation of the graph and send it to the C++ library/binaries, and
  - run: construct class `Variable` and `OperatorBase` instances and run them.
|           | compile time      | runtime          |
|-----------|-------------------|------------------|
| Data      | `VarDesc` (proto) | `Variable` (C++) |
| Operation | `OpDesc` (proto)  | `Operator` (C++) |
- The user uses Python code to describe the computation.
- Compile time: generate the graph.
- Compile time: check, optimize, and transform the graph.
  - Check data sizes and attributes.
  - Infer the shapes of data.
  - Plan and reuse memory.
  - Generate the backward and optimization parts of the graph.
  - Split the graph for distributed training.
- Runtime: run the graph.
- The graph serves as an intermediate representation (IR) between compile time and runtime:

      Compile Time -> IR -> Runtime

- Optimization:

      Compile Time -> IR -> Optimized IR -> Runtime

- Send automatically partitioned IR to different nodes:
  - Automatic data parallel

        Compile Time
        |-> Single GPU IR
        |-> [trainer-IR-0, trainer-IR-1, pserver-IR]
            |-> Node-0 (runs trainer-IR-0)
            |-> Node-1 (runs trainer-IR-1)
            |-> Node-2 (runs pserver-IR)

  - Automatic model parallel (planned for future)
- `Operator` is the fundamental building block of the user interface.
  - An operator stores input/output variable names and attributes.
  - The `InferShape` interface is used to infer output variable shapes from its input shapes.
  - Use `Run` to compute `output` variables from `input` variables.
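To make this interface concrete, below is a minimal C++ sketch of such an operator base class. It only illustrates the design described above and is not Paddle's actual declaration; the member names `inputs_`, `outputs_`, `attrs_`, and the `Attribute` placeholder are assumptions.

```cpp
#include <map>
#include <string>
#include <vector>

class Scope;          // variable storage, described later in this document
class DeviceContext;  // device handle passed to Run

using Attribute = std::string;  // placeholder; real attributes are typed values

// Sketch of the operator abstraction: it stores only the *names* of its
// inputs/outputs plus attributes; the data lives in Variables in a Scope.
class OperatorBase {
 public:
  virtual ~OperatorBase() {}

  // Infer output variable shapes from the input shapes.
  virtual void InferShape(const Scope& scope) const = 0;

  // Compute output variables from input variables on a given device.
  virtual void Run(const Scope& scope, const DeviceContext& dev_ctx) const = 0;

 protected:
  std::map<std::string, std::vector<std::string>> inputs_;
  std::map<std::string, std::vector<std::string>> outputs_;
  std::map<std::string, Attribute> attrs_;
};
```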
- `OpWithKernel` inherits `Operator`.
- `OpWithKernel` contains a kernel map.
  - `OpWithKernel::Run` gets the device's kernel and invokes `OpKernel::Compute`.
  - `OpKernelKey` is the map key. It currently contains only the device place, but may include the data type later.
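A rough sketch of this kernel-map dispatch follows; the `Place` struct and the ordering of `OpKernelKey` are simplified assumptions made only so the example compiles.

```cpp
#include <map>
#include <memory>

// Simplified device place; the real key currently holds the place only and
// may later include the data type as well.
struct Place { bool is_gpu = false; };

struct OpKernelKey {
  Place place;
  bool operator<(const OpKernelKey& other) const {
    return place.is_gpu < other.place.is_gpu;  // ordering for the map
  }
};

class OpKernel {
 public:
  virtual ~OpKernel() {}
  virtual void Compute(/* execution context */) const = 0;
};

// OpWithKernel keeps a kernel map and dispatches by OpKernelKey in Run.
class OpWithKernel /* : public OperatorBase */ {
 public:
  void Run(const Place& place) const {
    kernels_.at(OpKernelKey{place})->Compute();  // pick the device's kernel
  }

 protected:
  std::map<OpKernelKey, std::unique_ptr<OpKernel>> kernels_;
};
```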
- Why separate the kernel and the operator?
  - Separate GPU and CPU code.
    - Make Paddle able to run without a GPU.
  - Let one operator (which is the user interface) contain many implementations.
    - The same `mul` op can have different FP16 and FP32 kernels, and different MKL and Eigen kernels.
- `Eigen::Tensor` contains basic math and element-wise functions.
  - Note that `Eigen::Tensor` has a broadcast implementation.
  - Limit the number of `tensor.device(dev) =` statements in your code.
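For instance, an element-wise computation written with `Eigen::Tensor` (the unsupported Eigen Tensor module) might look like the sketch below; the function name and tensor sizes are made up for illustration.

```cpp
#include <unsupported/Eigen/CXX11/Tensor>

// z = x * y + x, evaluated element-wise in a single fused expression.
// Note the single `tensor.device(dev) =` statement, as recommended above.
void ElementwiseExample() {
  Eigen::Tensor<float, 1> x(4), y(4), z(4);
  x.setConstant(2.0f);
  y.setConstant(3.0f);

  Eigen::DefaultDevice cpu;   // a GPU kernel would pass a GPU device instead
  z.device(cpu) = x * y + x;
}
```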
- `thrust::transform` and `std::transform`.
  - `thrust` has the same API as the C++ standard library. Using `transform`, one can quickly implement a customized element-wise kernel.
  - `thrust` also has more complex APIs, like `scan`, `reduce`, and `reduce_by_key`.
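As a sketch, a custom element-wise kernel via `transform` could look like this on the CPU; `thrust::transform` over device vectors follows the same shape on the GPU. The kernel itself (`a[i] * scale + b[i]`) is just an example.

```cpp
#include <algorithm>
#include <vector>

// out[i] = a[i] * scale + b[i]; std::transform on the CPU.
// thrust::transform with device vectors would have the same structure on the GPU.
void ScaleAddKernel(const std::vector<float>& a,
                    const std::vector<float>& b,
                    float scale,
                    std::vector<float>* out) {
  out->resize(a.size());
  std::transform(a.begin(), a.end(), b.begin(), out->begin(),
                 [scale](float x, float y) { return x * scale + y; });
}
```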
- Hand-write `GPUKernel` and `CPU` code.
  - Do not write them in `.h` files. CPU kernels should be in `.cc` files and GPU kernels in `.cu` files. (`GCC` cannot compile GPU code.)
We need a method to build mappings between Op type names and Op classes.
Maintain a map whose key is the type name and whose value is the corresponding Op constructor:

`op_type(string)` -> `OpInfo`

`OpInfo`:
- `creator`: the Op constructor.
- `grad_op_type`: the type of the gradient Op.
- `proto`: the Op's Protobuf, including inputs, outputs, and required attributes.
- `checker`: used to check attributes.
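A simplified sketch of such a registry entry and map follows; the exact field types are assumptions, kept only to mirror the list above.

```cpp
#include <functional>
#include <map>
#include <string>

class OperatorBase;   // the operator base sketched earlier
class OpProto;        // protobuf description of inputs/outputs/attributes
class OpAttrChecker;  // attribute validation

struct OpInfo {
  std::function<OperatorBase*()> creator;  // the Op constructor
  std::string grad_op_type;                // type name of the gradient Op
  OpProto* proto = nullptr;                // inputs, outputs, required attributes
  OpAttrChecker* checker = nullptr;        // used to check attributes
};

// The registry: op type name -> OpInfo.
using OpInfoMap = std::map<std::string, OpInfo>;
```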
The Op maker class's constructor takes `proto` and `checker`; they are completed during the maker's construction (e.g. `ScaleOpMaker`).
- `REGISTER_OP(op_type, op_class, op_maker_class, grad_op_type, grad_op_class)`
- `REGISTER_OP_WITHOUT_GRADIENT(op_type, op_class, op_maker_class)`
- The `USE` macros make sure the registration process is executed and linked.
- Write the Op class, as well as its gradient Op class if there is one.
- Write the Op maker class. In the constructor, describe its inputs, outputs, and attributes.
- Invoke the macro `REGISTER_OP`. The macro will
  - call the maker class to complete `proto` and `checker`, and
  - with the completed `proto` and `checker`, build a new key-value pair in the `OpInfoMap`.
- Invoke the `USE` macro where the Op is used, to make sure it is linked.
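To illustrate how a registration macro can "make sure the registration process is executed and linked", here is a toy, self-contained sketch. It is not Paddle's `REGISTER_OP`; it only shows the underlying idea of a static object inserting a key-value pair into the registry map at program start-up.

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>

struct ToyOpInfo {
  std::function<void()> creator;  // stand-in for the real Op constructor
};

// The registry map, keyed by op type name.
std::map<std::string, ToyOpInfo>& Registry() {
  static std::map<std::string, ToyOpInfo> instance;
  return instance;
}

// Constructing a static Registrar inserts the entry before main() runs,
// so linking the object file that contains the macro triggers registration.
struct Registrar {
  Registrar(const std::string& type, std::function<void()> creator) {
    Registry()[type] = ToyOpInfo{std::move(creator)};
  }
};

#define TOY_REGISTER_OP(op_type, creator_expr) \
  static Registrar registrar_##op_type(#op_type, creator_expr)

// "Register" a hypothetical scale op.
TOY_REGISTER_OP(scale, [] { std::cout << "create scale op\n"; });

int main() {
  Registry().at("scale").creator();  // look up by type name and construct
  return 0;
}
```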
- The backward module maps each forward Op to its backward Op.
- Input: a graph of forward operators.
- Output: a graph of backward operators.
- Corner cases in construction:
  - shared variable => insert an `Add` operator to combine gradients
  - no gradient => insert a `fill_zero_grad` operator
  - recursive NetOp => call `Backward` recursively
  - RNN Op => recursively call `Backward` on the stepnet
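The outline below is purely illustrative (the names and signature are assumptions, not Paddle's actual `Backward` code); it only restates, in code-comment form, where each corner case enters the construction.

```cpp
#include <string>
#include <vector>

class OperatorBase;  // the operator base sketched earlier

// Illustrative outline of building the backward network for one forward op.
OperatorBase* Backward(const OperatorBase& forward_op,
                       const std::vector<std::string>& no_grad_vars) {
  // * Shared variable: when two backward sub-ops write the same gradient
  //   variable, insert an `Add` operator to sum their contributions.
  // * No gradient: for variables in no_grad_vars, insert a `fill_zero_grad`
  //   operator so downstream ops still receive a defined gradient.
  // * Recursive NetOp: call Backward recursively on each sub-operator,
  //   in reverse order.
  // * RNN Op: call Backward recursively on the step network (stepnet).
  return nullptr;  // sketch only
}
```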
- `Tensor` is an n-dimensional array with a type.
  - Only dims and data pointers are stored in `Tensor`.
  - All operations on `Tensor` are written in `Operator` or global functions.
  - Variable-length tensors use the `LoDTensor` design.
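A minimal sketch of a tensor that stores only its dims and a type-erased data pointer; the method name `mutable_data` and the allocation strategy here are simplified assumptions.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Only the dimensions and a data pointer live in the tensor;
// all math is implemented in operators or free functions.
class Tensor {
 public:
  template <typename T>
  T* mutable_data(const std::vector<int64_t>& dims) {
    dims_ = dims;
    size_t count = 1;
    for (int64_t d : dims) count *= static_cast<size_t>(d);
    buffer_.reset(new char[count * sizeof(T)]);
    return reinterpret_cast<T*>(buffer_.get());
  }

  const std::vector<int64_t>& dims() const { return dims_; }

 private:
  std::vector<int64_t> dims_;
  std::unique_ptr<char[]> buffer_;  // type-erased storage
};
```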
- `Variable` holds the inputs and outputs of an operator, which are not just `Tensor`s.
  - For example, `step_scopes` in RNN is a variable and not a tensor.
- `Scope` is where variables are stored.
  - `map<string /* var name */, Variable>`
- `Scope` has a hierarchical structure. A local scope can get variables from its parent scope.
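A hedged sketch of this hierarchical scope; `NewVar`/`FindVar` are used here as illustrative names, and `Variable` is reduced to a placeholder.

```cpp
#include <map>
#include <memory>
#include <string>

class Variable {};  // placeholder for the Variable described above

// Scope: a map from variable name to Variable, with lookup falling back
// to the parent scope when the name is not found locally.
class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  Variable* NewVar(const std::string& name) {
    auto& slot = vars_[name];
    if (!slot) slot.reset(new Variable());
    return slot.get();
  }

  Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return it->second.get();
    return parent_ ? parent_->FindVar(name) : nullptr;
  }

 private:
  std::map<std::string, std::unique_ptr<Variable>> vars_;
  const Scope* parent_;
};
```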
- A `Block` as an operator is more intuitive than `RNNOp`.
- It offers a new interface `Eval(targets)` to deduce the minimal block to `Run`.
- It fits the compile-time/runtime separation design.
  - During compilation, the `SymbolTable` stores `VarDesc`s and `OpDesc`s and serializes them into a `BlockDesc`.
  - When the graph executes, a `Block` with the `BlockDesc` passed in creates the `Op`s and `Var`s and then invokes `Run`.
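A rough sketch of that compile-time/runtime split for a block; the class layouts here are illustrative assumptions rather than the real `BlockDesc`/`Block` definitions.

```cpp
#include <string>
#include <vector>

class Scope;  // runtime variable storage, as sketched earlier

// Compile time: descriptions only.
struct VarDesc { std::string name; };
struct OpDesc  { std::string type; };

// The SymbolTable's contents serialize into a BlockDesc-like message.
struct BlockDesc {
  std::vector<VarDesc> vars;
  std::vector<OpDesc>  ops;
};

// Runtime: a Block built from the BlockDesc creates Variables and Operators
// in a Scope, then runs the operators in order.
class Block {
 public:
  explicit Block(const BlockDesc& desc) : desc_(desc) {}

  void Run(Scope* scope) const {
    (void)scope;
    // for each desc_.vars: create the variable in the scope
    // for each desc_.ops:  create the operator and call its Run
  }

 private:
  BlockDesc desc_;
};
```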
- Take Paddle/books as the main line; the requirements of the models motivate framework refactoring.
- Model migration
  - Framework development gives priority support to model migration; for example,
    - the MNIST demo needs a Python interface,
    - the RNN models require the framework to support `LoDTensor`.
  - Determine some timelines.
  - Heavily-relied Ops need to be migrated first.
  - Different models can be migrated in parallel.
- Improve the framework at the same time.
- Accept imperfection; concentrate on solving the specific problem at the right price.
- Compare the performance of migrated models with the old ones.
- Follow the Google C++ style.
- Build the automatic workflow for generating Python/C++ documentation.
- The documentation of layers and ops should be written inside the code.
- Take documentation quality into account when doing PRs.
- Preview the documentation, and read and improve it from the user's perspective.