Paddle Refactorization Overall Design
- PaddlePaddle represents the training and inference of DL models as computation graphs.
- Graphs are constructed by a Python program.
- A graph is composed of variables and operators.
- A graph should be able to be serialized for distributed training.
- There are two stages to process the graph:
  - compile: run a Python program to generate a protobuf message representation of the graph and send it to the C++ library/binaries, and
  - run: construct class `Variable` and `OperatorBase` instances and run them.
|           | compile time      | runtime          |
|-----------|-------------------|------------------|
| Data      | `VarDesc` (proto) | `Variable` (C++) |
| Operation | `OpDesc` (proto)  | `Operator` (C++) |
- The user uses Python code to describe the computation.
- Compile time: generate the graph.
- Compile time: check, optimize, and transform the graph.
  - Check data sizes and attributes.
  - Infer the shapes of data.
  - Plan and reuse memory.
  - Generate the backward and optimization parts of the graph.
  - Split the graph for distributed training.
- Runtime: run the graph.
- The graph serves as an intermediate representation (IR) between compile time and runtime:

      Compile Time -> IR -> Runtime

- Optimization:

      Compile Time -> IR -> Optimized IR -> Runtime

- Send automatically partitioned IR to different nodes:
  - Automatic data parallel

        Compile Time
        |-> Single GPU IR
        |-> [trainer-IR-0, trainer-IR-1, pserver-IR]
            |-> Node-0 (runs trainer-IR-0)
            |-> Node-1 (runs trainer-IR-1)
            |-> Node-2 (runs pserver-IR)

  - Automatic model parallel (planned for future)
- `Operator` is the fundamental building block of the user interface.
  - An operator stores input/output variable names and attributes.
  - The `InferShape` interface is used to infer output variable shapes from its input shapes.
  - Use `Run` to compute `output` variables from `input` variables.
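To make this interface concrete, below is a minimal C++ sketch of such an operator base class. It only illustrates the design described above and is not Paddle's actual declaration; the member names `inputs_`, `outputs_`, `attrs_`, and the `Attribute` placeholder are assumptions.

```cpp
#include <map>
#include <string>
#include <vector>

class Scope;          // variable storage, described later in this document
class DeviceContext;  // device handle passed to Run

using Attribute = std::string;  // placeholder; real attributes are typed values

// Sketch of the operator abstraction: it stores only the *names* of its
// inputs/outputs plus attributes; the data lives in Variables in a Scope.
class OperatorBase {
 public:
  virtual ~OperatorBase() {}

  // Infer output variable shapes from the input shapes.
  virtual void InferShape(const Scope& scope) const = 0;

  // Compute output variables from input variables on a given device.
  virtual void Run(const Scope& scope, const DeviceContext& dev_ctx) const = 0;

 protected:
  std::map<std::string, std::vector<std::string>> inputs_;
  std::map<std::string, std::vector<std::string>> outputs_;
  std::map<std::string, Attribute> attrs_;
};
```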
- `OpWithKernel` inherits `Operator`.
- `OpWithKernel` contains a kernel map.
  - `OpWithKernel::Run` gets the device's kernel and invokes `OpKernel::Compute`.
  - `OpKernelKey` is the map key. It currently contains only the device place, but may include the data type later.
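A rough sketch of this kernel-map dispatch follows; the `Place` struct and the ordering of `OpKernelKey` are simplified assumptions made only so the example compiles.

```cpp
#include <map>
#include <memory>

// Simplified device place; the real key currently holds the place only and
// may later include the data type as well.
struct Place { bool is_gpu = false; };

struct OpKernelKey {
  Place place;
  bool operator<(const OpKernelKey& other) const {
    return place.is_gpu < other.place.is_gpu;  // ordering for the map
  }
};

class OpKernel {
 public:
  virtual ~OpKernel() {}
  virtual void Compute(/* execution context */) const = 0;
};

// OpWithKernel keeps a kernel map and dispatches by OpKernelKey in Run.
class OpWithKernel /* : public OperatorBase */ {
 public:
  void Run(const Place& place) const {
    kernels_.at(OpKernelKey{place})->Compute();  // pick the device's kernel
  }

 protected:
  std::map<OpKernelKey, std::unique_ptr<OpKernel>> kernels_;
};
```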
- Why separate the kernel and the operator?
  - Separate GPU and CPU code.
    - Make Paddle able to run without a GPU.
  - Let one operator (which is the user interface) contain many implementations.
    - The same `mul` op can have different FP16 and FP32 kernels, and different MKL and Eigen kernels.
- `Eigen::Tensor` contains basic math and element-wise functions.
  - Note that `Eigen::Tensor` has a broadcast implementation.
  - Limit the number of `tensor.device(dev) =` statements in your code.
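For instance, an element-wise computation written with `Eigen::Tensor` (the unsupported Eigen Tensor module) might look like the sketch below; the function name and tensor sizes are made up for illustration.

```cpp
#include <unsupported/Eigen/CXX11/Tensor>

// z = x * y + x, evaluated element-wise in a single fused expression.
// Note the single `tensor.device(dev) =` statement, as recommended above.
void ElementwiseExample() {
  Eigen::Tensor<float, 1> x(4), y(4), z(4);
  x.setConstant(2.0f);
  y.setConstant(3.0f);

  Eigen::DefaultDevice cpu;   // a GPU kernel would pass a GPU device instead
  z.device(cpu) = x * y + x;
}
```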
- `thrust::transform` and `std::transform`.
  - `thrust` has the same API as the C++ standard library. Using `transform`, one can quickly implement a customized element-wise kernel.
  - `thrust` also has more complex APIs, like `scan`, `reduce`, and `reduce_by_key`.
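As a sketch, a custom element-wise kernel via `transform` could look like this on the CPU; `thrust::transform` over device vectors follows the same shape on the GPU. The kernel itself (`a[i] * scale + b[i]`) is just an example.

```cpp
#include <algorithm>
#include <vector>

// out[i] = a[i] * scale + b[i]; std::transform on the CPU.
// thrust::transform with device vectors would have the same structure on the GPU.
void ScaleAddKernel(const std::vector<float>& a,
                    const std::vector<float>& b,
                    float scale,
                    std::vector<float>* out) {
  out->resize(a.size());
  std::transform(a.begin(), a.end(), b.begin(), out->begin(),
                 [scale](float x, float y) { return x * scale + y; });
}
```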
- Hand-write `GPUKernel` and `CPU` code.
  - Do not write them in `.h` files. CPU kernels should be in `.cc` files and GPU kernels in `.cu` files. (`GCC` cannot compile GPU code.)
We need a method to build mappings between Op type names and Op classes.
Maintain a map whose key is the type name and whose value is the corresponding Op constructor:

`op_type(string)` -> `OpInfo`

`OpInfo`:
- `creator`: the Op constructor.
- `grad_op_type`: the type of the gradient Op.
- `proto`: the Op's Protobuf, including inputs, outputs, and required attributes.
- `checker`: used to check attributes.
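A simplified sketch of such a registry entry and map follows; the exact field types are assumptions, kept only to mirror the list above.

```cpp
#include <functional>
#include <map>
#include <string>

class OperatorBase;   // the operator base sketched earlier
class OpProto;        // protobuf description of inputs/outputs/attributes
class OpAttrChecker;  // attribute validation

struct OpInfo {
  std::function<OperatorBase*()> creator;  // the Op constructor
  std::string grad_op_type;                // type name of the gradient Op
  OpProto* proto = nullptr;                // inputs, outputs, required attributes
  OpAttrChecker* checker = nullptr;        // used to check attributes
};

// The registry: op type name -> OpInfo.
using OpInfoMap = std::map<std::string, OpInfo>;
```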
The Op maker class's constructor takes `proto` and `checker`; they are completed during the maker's construction (e.g. `ScaleOpMaker`).
- `REGISTER_OP(op_type, op_class, op_maker_class, grad_op_type, grad_op_class)`
- `REGISTER_OP_WITHOUT_GRADIENT(op_type, op_class, op_maker_class)`
- The `USE` macros make sure the registration process is executed and linked.
- Write the Op class, as well as its gradient Op class if there is one.
- Write the Op maker class. In the constructor, describe its inputs, outputs, and attributes.
- Invoke the macro `REGISTER_OP`. The macro will
  - call the maker class to complete `proto` and `checker`, and
  - with the completed `proto` and `checker`, build a new key-value pair in the `OpInfoMap`.
- Invoke the `USE` macro where the Op is used, to make sure it is linked.
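To illustrate how a registration macro can "make sure the registration process is executed and linked", here is a toy, self-contained sketch. It is not Paddle's `REGISTER_OP`; it only shows the underlying idea of a static object inserting a key-value pair into the registry map at program start-up.

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <string>

struct ToyOpInfo {
  std::function<void()> creator;  // stand-in for the real Op constructor
};

// The registry map, keyed by op type name.
std::map<std::string, ToyOpInfo>& Registry() {
  static std::map<std::string, ToyOpInfo> instance;
  return instance;
}

// Constructing a static Registrar inserts the entry before main() runs,
// so linking the object file that contains the macro triggers registration.
struct Registrar {
  Registrar(const std::string& type, std::function<void()> creator) {
    Registry()[type] = ToyOpInfo{std::move(creator)};
  }
};

#define TOY_REGISTER_OP(op_type, creator_expr) \
  static Registrar registrar_##op_type(#op_type, creator_expr)

// "Register" a hypothetical scale op.
TOY_REGISTER_OP(scale, [] { std::cout << "create scale op\n"; });

int main() {
  Registry().at("scale").creator();  // look up by type name and construct
  return 0;
}
```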
- The backward module maps each forward Op to its backward Op.
- Input: a graph of forward operators.
- Output: a graph of backward operators.
- Corner cases in construction:
  - shared variable => insert an `Add` operator to combine gradients
  - no gradient => insert a `fill_zero_grad` operator
  - recursive NetOp => call `Backward` recursively
  - RNN Op => recursively call `Backward` on the stepnet
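The outline below is purely illustrative (the names and signature are assumptions, not Paddle's actual `Backward` code); it only restates, in code-comment form, where each corner case enters the construction.

```cpp
#include <string>
#include <vector>

class OperatorBase;  // the operator base sketched earlier

// Illustrative outline of building the backward network for one forward op.
OperatorBase* Backward(const OperatorBase& forward_op,
                       const std::vector<std::string>& no_grad_vars) {
  // * Shared variable: when two backward sub-ops write the same gradient
  //   variable, insert an `Add` operator to sum their contributions.
  // * No gradient: for variables in no_grad_vars, insert a `fill_zero_grad`
  //   operator so downstream ops still receive a defined gradient.
  // * Recursive NetOp: call Backward recursively on each sub-operator,
  //   in reverse order.
  // * RNN Op: call Backward recursively on the step network (stepnet).
  return nullptr;  // sketch only
}
```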
- `Tensor` is an n-dimensional array with a type.
  - Only dims and data pointers are stored in `Tensor`.
  - All operations on `Tensor` are written in `Operator` or global functions.
  - Variable-length tensors use the `LoDTensor` design.
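A minimal sketch of a tensor that stores only its dims and a type-erased data pointer; the method name `mutable_data` and the allocation strategy here are simplified assumptions.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Only the dimensions and a data pointer live in the tensor;
// all math is implemented in operators or free functions.
class Tensor {
 public:
  template <typename T>
  T* mutable_data(const std::vector<int64_t>& dims) {
    dims_ = dims;
    size_t count = 1;
    for (int64_t d : dims) count *= static_cast<size_t>(d);
    buffer_.reset(new char[count * sizeof(T)]);
    return reinterpret_cast<T*>(buffer_.get());
  }

  const std::vector<int64_t>& dims() const { return dims_; }

 private:
  std::vector<int64_t> dims_;
  std::unique_ptr<char[]> buffer_;  // type-erased storage
};
```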
- `Variable` holds the inputs and outputs of an operator, which are not just `Tensor`s.
  - For example, `step_scopes` in RNN is a variable and not a tensor.
- `Scope` is where variables are stored.
  - `map<string /* var name */, Variable>`
- `Scope` has a hierarchical structure. A local scope can get variables from its parent scope.
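A hedged sketch of this hierarchical scope; `NewVar`/`FindVar` are used here as illustrative names, and `Variable` is reduced to a placeholder.

```cpp
#include <map>
#include <memory>
#include <string>

class Variable {};  // placeholder for the Variable described above

// Scope: a map from variable name to Variable, with lookup falling back
// to the parent scope when the name is not found locally.
class Scope {
 public:
  explicit Scope(const Scope* parent = nullptr) : parent_(parent) {}

  Variable* NewVar(const std::string& name) {
    auto& slot = vars_[name];
    if (!slot) slot.reset(new Variable());
    return slot.get();
  }

  Variable* FindVar(const std::string& name) const {
    auto it = vars_.find(name);
    if (it != vars_.end()) return it->second.get();
    return parent_ ? parent_->FindVar(name) : nullptr;
  }

 private:
  std::map<std::string, std::unique_ptr<Variable>> vars_;
  const Scope* parent_;
};
```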
- A `Block` as an operator is more intuitive than `RNNOp`.
- It offers a new interface `Eval(targets)` to deduce the minimal block to `Run`.
- It fits the compile-time/runtime separation design.
  - During compilation, the `SymbolTable` stores `VarDesc`s and `OpDesc`s and serializes them into a `BlockDesc`.
  - When the graph executes, a `Block` with the `BlockDesc` passed in creates the `Op`s and `Var`s and then invokes `Run`.
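A rough sketch of that compile-time/runtime split for a block; the class layouts here are illustrative assumptions rather than the real `BlockDesc`/`Block` definitions.

```cpp
#include <string>
#include <vector>

class Scope;  // runtime variable storage, as sketched earlier

// Compile time: descriptions only.
struct VarDesc { std::string name; };
struct OpDesc  { std::string type; };

// The SymbolTable's contents serialize into a BlockDesc-like message.
struct BlockDesc {
  std::vector<VarDesc> vars;
  std::vector<OpDesc>  ops;
};

// Runtime: a Block built from the BlockDesc creates Variables and Operators
// in a Scope, then runs the operators in order.
class Block {
 public:
  explicit Block(const BlockDesc& desc) : desc_(desc) {}

  void Run(Scope* scope) const {
    (void)scope;
    // for each desc_.vars: create the variable in the scope
    // for each desc_.ops:  create the operator and call its Run
  }

 private:
  BlockDesc desc_;
};
```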
- Take Paddle/books as the main line; the requirements of the models motivate framework refactoring.
- Model migration
  - Framework development gives priority support to model migration; for example,
    - the MNIST demo needs a Python interface,
    - the RNN models require the framework to support `LoDTensor`.
  - Determine some timelines.
  - Heavily-relied Ops need to be migrated first.
  - Different models can be migrated in parallel.
- Improve the framework at the same time.
- Accept imperfection; concentrate on solving the specific problem at the right price.
- Compare the performance of migrated models with the old ones.
- Follow the Google C++ style.
- Build the automatic workflow for generating Python/C++ documentation.
- The documentation of layers and ops should be written inside the code.
- Take documentation quality into account when doing PRs.
- Preview the documentation, and read and improve it from the user's perspective.