Commit 069bf57

tye1 and Chao1Han authored

[doc]update optimize usage (#1714)

update torch.xpu.optimize to ipex.optimize

* add comments to torch.xpu.optimize in xpu/utils.py
* Add torch.xpu.optimize in api_doc.rst
* Add torch.xpu.optimize example

Co-authored-by: chaohan <[email protected]>
1 parent 3f1ac30 commit 069bf57

File tree

8 files changed: +173 −101 lines changed


csrc/include/xpu/Stream.h (+4 −4)

@@ -21,10 +21,10 @@
 
 namespace xpu {
 
-/// Get a sycl queue from a c10 stream. Generate a dpcpp stream from c10 stream,
-/// and get dpcpp queue.
+/// Get a sycl queue from a c10 stream. Generate a sycl stream from c10 stream,
+/// and get sycl queue.
 /// @param stream: c10 stream.
-/// @returns: dpcpp queue.
+/// @returns: sycl queue.
 IPEX_API sycl::queue& get_queue_from_stream(c10::Stream stream);
 
-} // namespace xpu
+} // namespace xpu

docs/index.rst (+1 −1)

@@ -13,7 +13,7 @@ Intel® Extension for PyTorch* is structured as shown in the following figure:
    :align: center
    :alt: Architecture of Intel® Extension for PyTorch*
 
-PyTorch components are depicted with white boxes and Intel extensions are with blue boxes. Extra performance of the extension comes from optimizations for both eager mode and graph mode. In eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers, and INT8 quantization API. Further performance boosting is available by converting the eager-mode model into graph mode via extended graph fusion passes. For the XPU backend, optimized operators and kernels are implemented and registered through PyTorch dispatching mechanism. These operators and kernels are accelerated from native vectorization feature and matrix calculation feature of Intel GPU hardware. In graph mode, further operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.
+PyTorch components are depicted with white boxes and Intel extensions are with blue boxes. Extra performance of the extension comes from optimizations for both eager mode and graph mode. In eager mode, the PyTorch frontend is extended with custom Python modules (such as fusion modules), optimal optimizers, and INT8 quantization API. Further performance boosting is available by converting the eager-mode model into graph mode via extended graph fusion passes. For the XPU device, optimized operators and kernels are implemented and registered through PyTorch dispatching mechanism. These operators and kernels are accelerated from native vectorization feature and matrix calculation feature of Intel GPU hardware. In graph mode, further operator fusions are supported to reduce operator/kernel invocation overheads, and thus increase performance.
 
 Intel® Extension for PyTorch* utilizes the `DPC++ <https://github.com/intel/llvm#oneapi-dpc-compiler>`_ compiler that supports the latest `SYCL* <https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html>`_ standard and also a number of extensions to the SYCL* standard, which can be found in the `sycl/doc/extensions <https://github.com/intel/llvm/tree/sycl/sycl/doc/extensions>`_ directory. Intel® Extension for PyTorch* also integrates `oneDNN <https://github.com/oneapi-src/oneDNN>`_ and `oneMKL <https://github.com/oneapi-src/oneMKL>`_ libraries and provides kernels based on that. The oneDNN library is used for computation intensive operations. The oneMKL library is used for fundamental mathematical operations.

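As a concrete illustration of the eager-to-graph flow described in the changed paragraph above, here is a minimal sketch. It assumes an XPU build of Intel® Extension for PyTorch* with a device available; the model and input shape are illustrative placeholders, not part of this commit.

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device and ipex.optimize

# Illustrative model and input; any torch.nn.Module would do.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).to("xpu").eval()
example_input = torch.randn(1, 3, 224, 224, device="xpu")

# Eager mode: apply the extension's module/operator optimizations.
model = ipex.optimize(model)

# Graph mode: trace the eager model so the extended fusion passes can apply.
with torch.no_grad():
    traced = torch.jit.trace(model, example_input)
    traced = torch.jit.freeze(traced)
    output = traced(example_input)
```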
docs/tutorials/api_doc.rst (+21)

@@ -6,6 +6,27 @@ General
 
 .. currentmodule:: intel_extension_for_pytorch
 .. autofunction:: optimize
+
+
+
+`torch.xpu.optimize` is an alternative to the `optimize` API in Intel® Extension for PyTorch*, providing identical usage for the XPU device only.
+The motivation for adding this alias is to unify the coding style in user scripts based on the `torch.xpu` module.
+
+.. code-block:: python
+
+   >>> # bfloat16 inference case.
+   >>> model = ...
+   >>> model.load_state_dict(torch.load(PATH))
+   >>> model.eval()
+   >>> optimized_model = torch.xpu.optimize(model, dtype=torch.bfloat16)
+   >>> # running evaluation step.
+   >>> # bfloat16 training case.
+   >>> optimizer = ...
+   >>> model.train()
+   >>> optimized_model, optimized_optimizer = torch.xpu.optimize(model, dtype=torch.bfloat16, optimizer=optimizer)
+   >>> # running training step.
+
+
 .. currentmodule:: intel_extension_for_pytorch.xpu
 .. StreamContext
 .. can_device_access_peer

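To make the alias concrete, below is a slightly fuller, hedged sketch of the usage documented above. It assumes an XPU build of Intel® Extension for PyTorch*; `SimpleNet` and the tensor shapes are illustrative stand-ins, not part of the commit.

```python
import torch
import intel_extension_for_pytorch as ipex  # importing ipex makes torch.xpu.optimize available

# SimpleNet is a hypothetical placeholder model.
class SimpleNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(64, 8)

    def forward(self, x):
        return self.fc(x)

model = SimpleNet().to("xpu").eval()

# torch.xpu.optimize mirrors ipex.optimize for the XPU device, so the two
# calls below should be interchangeable for an XPU model.
optimized_model = torch.xpu.optimize(model, dtype=torch.bfloat16)
# optimized_model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad():
    output = optimized_model(torch.randn(4, 64, device="xpu"))
```

For training, the optimizer is passed alongside the model, as in the rst example above.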