
Commit 9b3c00d

[DATA] Introduce new RebatchDataset to replace rebatch and rectify
1 parent 7fcffe6 commit 9b3c00d

29 files changed: +947 −2086 lines

README.md

Lines changed: 15 additions & 20 deletions
@@ -11,11 +11,8 @@ recommender systems on heterogeneous cluster.
 ## Features
 
 - Memory-efficient loading of categorical data
-
 - GPU-efficient orchestration of embedding layers
-
 - Communication-efficient training and evaluation at scale
-
 - Easy to use with existing AI workflows
 
 ## Usage
@@ -26,8 +23,8 @@ A minimal example:
 import tensorflow as tf
 import hybridbackend.tensorflow as hb
 
-ds = hb.data.ParquetDataset(filenames, batch_size=batch_size)
-ds = ds.apply(hb.data.parse())
+ds = hb.data.Dataset.from_parquet(filenames)
+ds = ds.batch(batch_size)
 # ...
 
 with tf.device('/gpu:0'):
@@ -44,16 +41,16 @@ more information.
 
 `pip install {PACKAGE}`
 
-`{PACKAGE}` | Dependency | Python | CUDA | GLIBC | Data Opt. | Embedding Opt. | Parallelism Opt.
------------ | ---------- | ------- | ---- | ----- | --------- | -------------- | -----------------
-[hybridbackend-deeprec2208-cu114](https://pypi.org/project/hybridbackend-deeprec2208-cu114/) | [DeepRec 22.08](https://github.com/alibaba/DeepRec/tree/deeprec2208) `1` | 3.6 | 11.4 | >=2.27 | ✓ | ✓ | ✓
-[hybridbackend-tf115-cu118](https://pypi.org/project/hybridbackend-tf115-cu118/) | [TensorFlow 1.15](https://github.com/NVIDIA/tensorflow) `2` | 3.8 | 11.8 | >=2.31 | ✓ | ✓ | ✓
-[hybridbackend-tf115-cu100](https://pypi.org/project/hybridbackend-tf115-cu100/) | [TensorFlow 1.15](https://github.com/tensorflow/tensorflow/tree/r1.15) | 3.6 | 10.0 | >=2.27 | ✓ | ✓ | ✗
-[hybridbackend-tf115-cpu](https://pypi.org/project/hybridbackend-tf115-cpu/) | [TensorFlow 1.15](https://github.com/tensorflow/tensorflow/tree/r1.15) | 3.6 | - | >=2.24 | ✓ | ✗ | ✗
+| `{PACKAGE}` | Dependency | Python | CUDA | GLIBC | Data Opt. | Embedding Opt. | Parallelism Opt. |
+| ----------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | ------ | ---- | ------ | --------- | -------------- | ---------------- |
+| [hybridbackend-tf115-cu118](https://pypi.org/project/hybridbackend-tf115-cu118/) | [TensorFlow 1.15](https://github.com/NVIDIA/tensorflow) `1` | 3.8 | 11.8 | >=2.31 | ✓ | ✓ | ✓ |
+| [hybridbackend-tf115-cu100](https://pypi.org/project/hybridbackend-tf115-cu100/) | [TensorFlow 1.15](https://github.com/tensorflow/tensorflow/tree/r1.15) | 3.6 | 10.0 | >=2.27 | ✓ | ✓ | ✗ |
+| [hybridbackend-tf115-cpu](https://pypi.org/project/hybridbackend-tf115-cpu/) | [TensorFlow 1.15](https://github.com/tensorflow/tensorflow/tree/r1.15) | 3.6 | - | >=2.24 | ✓ | ✗ | ✗ |
+| [hybridbackend-deeprec2208-cu114](https://pypi.org/project/hybridbackend-deeprec2208-cu114/) | [DeepRec 22.08](https://github.com/alibaba/DeepRec/tree/deeprec2208) `2` | 3.6 | 11.4 | >=2.27 | ✓ | ✓ | ✓ |
 
-> `1`: Suggested docker image: `dsw-registry.cn-shanghai.cr.aliyuncs.com/pai/tensorflow-training:1.15PAI-gpu-py36-cu114-ubuntu18.04`
+> `1`: Suggested docker image: `nvcr.io/nvidia/tensorflow:22.12-tf1-py3`
 
-> `2`: Suggested docker image: `nvcr.io/nvidia/tensorflow:22.12-tf1-py3`
+> `2`: Suggested docker image: `dsw-registry.cn-shanghai.cr.aliyuncs.com/pai/tensorflow-training:1.15PAI-gpu-py36-cu114-ubuntu18.04`
 
 ### Method 2: Build from source
 
@@ -66,13 +63,11 @@ HybridBackend is licensed under the [Apache 2.0 License](LICENSE).
 ## Community
 
 - Please see [Contributing Guide](https://github.com/alibaba/HybridBackend/blob/main/CONTRIBUTING.md)
-before your first contribution.
-
+  before your first contribution.
 - Please [register as an adopter](https://github.com/alibaba/HybridBackend/blob/main/ADOPTERS.md)
-if your organization is interested in adoption. We will discuss
-[RoadMap](https://github.com/alibaba/HybridBackend/blob/main/ROADMAP.md) with
-registered adopters in advance.
-
+  if your organization is interested in adoption. We will discuss
+  [RoadMap](https://github.com/alibaba/HybridBackend/blob/main/ROADMAP.md) with
+  registered adopters in advance.
 - Please cite [HybridBackend](https://ieeexplore.ieee.org/document/9835450) in your publications if it helps:
 
 ```text
@@ -90,4 +85,4 @@ registered adopters in advance.
 If you would like to share your experiences with others, you are welcome to
 contact us in DingTalk:
 
-[![dingtalk](https://github.com/alibaba/HybridBackend/raw/main/docs/images/dingtalk.png)](https://qr.dingtalk.com/action/joingroup?code=v1,k1,VouhbeuTwXYEgaLzSOE8o6VF2kTHVJ8lw5h93WbZW8o=&_dt_no_comment=1&origin=11)
+[![dingtalk](https://github.com/alibaba/HybridBackend/raw/main/docs/images/dingtalk.png)](https://qr.dingtalk.com/action/joingroup?code=v1,k1,VouhbeuTwXYEgaLzSOE8o6VF2kTHVJ8lw5h93WbZW8o=&_dt_no_comment=1&origin=11)
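The usage hunk above is the headline API change: the old two-step `ParquetDataset(...)` plus `apply(hb.data.parse())` pattern becomes a builder followed by a plain `batch()` call. A runnable sketch of the new minimal example under TF 1.15, assuming `filenames` points at real Parquet files; the iterator plumbing is an assumption, not part of the diff:

```python
import tensorflow as tf
import hybridbackend.tensorflow as hb

filenames = ['/path/to/day_0.parquet']  # assumed sample input
batch_size = 4096                       # assumed batch size

ds = hb.data.Dataset.from_parquet(filenames)  # new builder introduced here
ds = ds.batch(batch_size)                     # plain tf.data-style batching
it = tf.data.make_one_shot_iterator(ds)
batch = it.get_next()  # dict of column tensors keyed by Parquet field name
```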

hybridbackend/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -20,6 +20,6 @@
 from __future__ import division
 from __future__ import print_function
 
-__version__ = '0.7.0a2'
+__version__ = '0.8.0'
 __author__ = 'Alibaba Group Holding Limited'
 __copyright__ = '2021 Alibaba Group Holding Limited'

hybridbackend/run.py

Lines changed: 7 additions & 0 deletions
@@ -69,6 +69,8 @@ def run(command):
     command: Function or command to run
   '''
   visible_devices = _query_visible_devices()
+  local_world_size_str = str(len(visible_devices))
+
   port = int(os.getenv('HB_RUN_BASE_PORT', '20001'))
   device_to_ports = []
   for d in visible_devices:
@@ -126,6 +128,8 @@ def run(command):
   new_tf_config['task']['type'] = task_type
   new_tf_config['task']['index'] = task_id
   os.environ['TF_CONFIG'] = json.dumps(new_tf_config)
+  os.environ['TF_TASK_TYPE'] = str(task_type)
+  os.environ['TF_TASK_INDEX'] = str(task_id)
   os.environ['CUDA_VISIBLE_DEVICES'] = ''
   os.environ['HB_OP_OPTIMIZATION_DISABLED'] = '1'
   if callable(command):
@@ -165,7 +169,10 @@ def run(command):
   gpu_tf_config['task']['index'] = gpu_index
   gpu_env = os.environ.copy()
   gpu_env['TF_CONFIG'] = json.dumps(gpu_tf_config)
+  gpu_env['TF_TASK_TYPE'] = gpu_tf_config['task']['type']
+  gpu_env['TF_TASK_INDEX'] = str(gpu_tf_config['task']['index'])
   gpu_env['CUDA_VISIBLE_DEVICES'] = device
+  gpu_env['LOCAL_WORLD_SIZE'] = local_world_size_str
   if interop_threads_gpu:
     gpu_env['TF_NUM_INTEROP_THREADS'] = str(interop_threads_gpu)
   if intraop_threads_gpu:
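The three environment variables added here (`TF_TASK_TYPE`, `TF_TASK_INDEX`, `LOCAL_WORLD_SIZE`) let a launched worker learn its role without re-parsing `TF_CONFIG`. A minimal sketch of a hypothetical consumer, not part of this commit:

```python
import os

# Exported by hybridbackend/run.py for each spawned task:
task_type = os.getenv('TF_TASK_TYPE', 'chief')              # e.g. 'chief' or 'worker'
task_index = int(os.getenv('TF_TASK_INDEX', '0'))           # rank within its task type
local_world_size = int(os.getenv('LOCAL_WORLD_SIZE', '1'))  # visible devices on this host

print(f'{task_type}:{task_index} runs with {local_world_size} local device(s)')
```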

hybridbackend/tensorflow/benchmarks/data_benchmark_parquet.py

Lines changed: 2 additions & 2 deletions
@@ -51,13 +51,13 @@ def benchmark(params):
   with tf.Graph().as_default():
     step = tf.train.get_or_create_global_step()
     if params.baseline:
-      ds = hb.data.TabularDataset.from_parquet(params.filenames)
+      ds = hb.data.Dataset.from_parquet(params.filenames)
       ds = ds.map(lambda data: data)  # Prevent fusion
       if params.shuffle:
         ds = ds.shuffle(params.batch_size * 10)
       ds = ds.batch(params.batch_size, drop_remainder=True)
     else:
-      ds = hb.data.TabularDataset.from_parquet(params.filenames)
+      ds = hb.data.Dataset.from_parquet(params.filenames)
       if params.shuffle:
         ds = ds.shuffle_batch(
             params.batch_size, drop_remainder=True,
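The benchmark's two branches differ only in where shuffling happens: the baseline shuffles a row buffer and then batches in a separate stage, while the other path uses the fused `shuffle_batch`. A side-by-side sketch, assuming `filenames` and `batch_size` are in scope; arguments beyond those visible in the diff are elided:

```python
import hybridbackend.tensorflow as hb

ds = hb.data.Dataset.from_parquet(filenames)

# Baseline: two dataset stages, shuffle then batch.
baseline = ds.shuffle(batch_size * 10).batch(batch_size, drop_remainder=True)

# Fused path: one stage doing both, as exercised by the benchmark.
fused = hb.data.Dataset.from_parquet(filenames).shuffle_batch(
    batch_size, drop_remainder=True)
```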
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+/* Copyright 2021 Alibaba Group Holding Limited. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#if HYBRIDBACKEND_TENSORFLOW
+
+#if GOOGLE_CUDA
+#define EIGEN_USE_GPU
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#include <limits>
+
+#include <tensorflow/core/framework/register_types.h>
+#include <tensorflow/core/framework/tensor.h>
+#include <tensorflow/core/public/version.h>
+
+#include "hybridbackend/common/atomic.cu.h"
+#include "hybridbackend/tensorflow/common/device_functions.h"
+#include "hybridbackend/tensorflow/common/slice_sum.h"
+
+namespace tensorflow {
+
+using GPUDevice = Eigen::GpuDevice;
+
+namespace hybridbackend {
+
+namespace functor {
+
+template <typename T, int32 N = 256>
+__global__ void SliceSumKernel(const int32 num_rows, const int32 num_cols,
+                               const int32 col, const T* input, T* output_total,
+                               T* output) {
+  for (int32 idx : CudaGridRangeX(num_rows)) {
+    const T v = input[idx * num_cols + col];
+    output[idx] = v;
+    atomicAdd(output_total, v);
+  }
+}
+
+template <typename T>
+struct SliceSum<GPUDevice, T> {
+  void operator()(const int32 num_rows, const int32 num_cols, const int32 col,
+                  const T* input, T* output_total, T* output,
+                  const Eigen::GpuDevice& d) {
+    CudaLaunch(SliceSumKernel<T>, num_rows, 0, d, nullptr, num_rows, num_cols,
+               col, input, output_total, output);
+  }
+};
+
+template struct SliceSum<GPUDevice, int32>;
+template struct SliceSum<GPUDevice, int64>;
+template struct SliceSum<GPUDevice, uint32>;
+template struct SliceSum<GPUDevice, uint64>;
+
+template <typename T, int32 N = 256>
+__global__ void GroupSliceSumKernel(const int32 num_rows, const int32 num_cols,
+                                    const int32 col, const int32 num_inputs,
+                                    const T* inputs, T* output_totals,
+                                    T** outputs) {
+  for (int32 idx : CudaGridRangeX(num_inputs * num_rows)) {
+    const int32 s = idx / num_rows;
+    const int32 sidx = idx % num_rows;
+    const T v = inputs[idx * num_cols + col];
+    outputs[s][sidx] = v;
+    atomicAdd(output_totals + s, v);
+  }
+}
+
+template <typename T>
+struct SliceSumN<GPUDevice, T> {
+  void operator()(const int32 num_rows, const int32 num_cols, const int32 col,
+                  const int32 num_inputs, const T* inputs, T* output_totals,
+                  T** outputs, const Eigen::GpuDevice& d) {
+    CudaLaunch(GroupSliceSumKernel<T>, num_inputs * num_rows, 0, d, nullptr,
+               num_rows, num_cols, col, num_inputs, inputs, output_totals,
+               outputs);
+  }
+};
+
+template struct SliceSumN<GPUDevice, int32>;
+template struct SliceSumN<GPUDevice, int64>;
+template struct SliceSumN<GPUDevice, uint32>;
+template struct SliceSumN<GPUDevice, uint64>;
+
+}  // namespace functor
+}  // namespace hybridbackend
+}  // namespace tensorflow
+
+#endif  // GOOGLE_CUDA
+#endif  // HYBRIDBACKEND_TENSORFLOW
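On the host side the kernel's contract is simple: copy column `col` out of a row-major `num_rows x num_cols` matrix and accumulate its sum (the GPU kernel does the accumulation with `atomicAdd`). A NumPy model of that contract, for reference only:

```python
import numpy as np

def slice_sum(matrix: np.ndarray, col: int):
  """Reference semantics of SliceSum: one column plus its total."""
  output = matrix[:, col].copy()  # output[idx] = input[idx * num_cols + col]
  output_total = output.sum()     # accumulated with atomicAdd in the kernel
  return output_total, output

matrix = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int64)
total, column = slice_sum(matrix, col=1)
assert total == 12 and column.tolist() == [2, 4, 6]
```

`SliceSumN` generalizes this to `num_inputs` stacked matrices, writing one column and one total per input.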

hybridbackend/tensorflow/data/__init__.py

Lines changed: 1 addition & 2 deletions
@@ -29,8 +29,7 @@
 from hybridbackend.tensorflow.data.prefetch.ops import Iterator
 from hybridbackend.tensorflow.data.rebatch.dataset import RebatchDataset
 from hybridbackend.tensorflow.data.rebatch.dataset import rebatch
-from hybridbackend.tensorflow.data.rectify.dataset import rectify
-from hybridbackend.tensorflow.data.tabular.dataset import TabularDataset
+from hybridbackend.tensorflow.data.tabular.dataset import Dataset
 
 # HybridBackend operators must be loaded before TensorFlow operators to
 # make AWS SDK implementation correct.
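This import list defines the migration surface: `rectify` disappears and `TabularDataset` is re-exported as `Dataset`, with `rebatch`/`RebatchDataset` absorbing rectify's job. A hedged before/after sketch for downstream code, with the old call shape reconstructed from this diff and `filenames`/`batch_size` assumed in scope:

```python
import hybridbackend.tensorflow as hb

# Before this commit:
#   ds = hb.data.TabularDataset.from_parquet(filenames)
#   ds = ds.apply(hb.data.rectify(...))  # module removed here; arguments elided

# After this commit:
ds = hb.data.Dataset.from_parquet(filenames)  # TabularDataset -> Dataset
ds = ds.apply(hb.data.rebatch(batch_size))    # rebatch now covers rectify's role
```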
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+/* Copyright 2021 Alibaba Group Holding Limited. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#ifndef HYBRIDBACKEND_TENSORFLOW_DATA_REBATCH_BUFFER_H_
+#define HYBRIDBACKEND_TENSORFLOW_DATA_REBATCH_BUFFER_H_
+
+#include <deque>
+#include <vector>
+
+#include <tensorflow/core/framework/tensor.h>
+#include <tensorflow/core/lib/random/philox_random.h>
+#include <tensorflow/core/lib/random/random.h>
+#include <tensorflow/core/lib/random/random_distributions.h>
+
+namespace tensorflow {
+namespace hybridbackend {
+
+struct RebatchBufferItem {
+ public:
+  RebatchBufferItem(int64 batch_size, const std::vector<Tensor>& components)
+      : batch_size(batch_size), components(components) {}
+  int64 batch_size;
+  std::vector<Tensor> components;
+};
+
+class RebatchBuffer {
+ public:
+  RebatchBuffer(const DataTypeVector& output_dtypes,
+                const std::vector<PartialTensorShape>& output_shapes,
+                const std::vector<int32>& field_ranks);
+
+  int64 size() const { return size_; }
+
+  Status Put(const std::vector<Tensor>& input_tensors, const int64 num_rows);
+
+  Status PutSlice(const std::vector<Tensor>& input_tensors,
+                  const int64 row_start, const int64 row_limit);
+
+  Status Shuffle(random::SingleSampleAdapter<random::PhiloxRandom>& generator,
+                 const int64 num_rows);
+
+  Status Take(Allocator* alloc, std::vector<Tensor>* output_tensors,
+              const int64 num_rows);
+
+ private:
+  Status TakeDense(Allocator* alloc, std::vector<Tensor>* output_tensors,
+                   std::vector<Tensor>* residual_tensors, const int64 num_rows,
+                   const int64 remained_rows, const int64 rank,
+                   const int64 col);
+
+  Status TakeSparse(Allocator* alloc, std::vector<Tensor>* output_tensors,
+                    std::vector<Tensor>* residual_tensors,
+                    const int64 num_rows, const int64 remained_rows,
+                    const int64 rank, const int64 col);
+
+  const DataTypeVector& output_dtypes_;
+  const std::vector<PartialTensorShape>& output_shapes_;
+  const std::vector<int32> field_ranks_;
+
+  int64 size_;
+  std::deque<RebatchBufferItem> items_;
+};
+
+}  // namespace hybridbackend
+}  // namespace tensorflow
+
+#endif  // HYBRIDBACKEND_TENSORFLOW_DATA_REBATCH_BUFFER_H_
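Conceptually, `RebatchBuffer` absorbs input batches of arbitrary size (`Put`/`PutSlice`) and emits fixed-size batches (`Take`), carrying residual rows forward; `Shuffle` permutes buffered rows. A toy Python model of the Put/Take contract; the real class holds per-field Tensors and handles dense and sparse fields separately:

```python
from collections import deque

class RebatchBufferModel:
  """Toy model: rows go in at any batch size, come out at a fixed one."""

  def __init__(self):
    self._items = deque()
    self._size = 0  # total buffered rows, mirrors RebatchBuffer::size()

  def put(self, rows):
    self._items.append(list(rows))
    self._size += len(rows)

  def take(self, num_rows):
    if self._size < num_rows:
      return None  # not enough rows buffered yet; caller keeps putting
    out = []
    while len(out) < num_rows:
      item = self._items.popleft()
      need = num_rows - len(out)
      out.extend(item[:need])
      if len(item) > need:  # residual rows return to the front of the queue
        self._items.appendleft(item[need:])
    self._size -= num_rows
    return out

buf = RebatchBufferModel()
buf.put([1, 2, 3])
buf.put([4, 5])
assert buf.take(4) == [1, 2, 3, 4]  # one row (5) remains buffered
```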

hybridbackend/tensorflow/data/rebatch/dataset.py

Lines changed: 7 additions & 14 deletions
@@ -23,6 +23,8 @@
 import inspect
 
 # pylint: disable=ungrouped-imports
+from hybridbackend.tensorflow.data.dataframe import input_fields
+
 try:
   from tensorflow.python.data.ops.dataset_ops import DatasetV2 as _dataset  # pylint: disable=unused-import
 
@@ -43,28 +45,19 @@
 
 def rebatch(
     batch_size,
-    min_batch_size=None,
-    fields=None,
     drop_remainder=False,
-    num_parallel_scans=1):
+    fields=None):
   r'''Create a `RebatchDataset`.
 
   Args:
     batch_size: Maxium number of samples in an output batch.
-    min_batch_size: (Optional.) Minimum number of samples in a non-final
-      batch. Same to `batch_size` by default.
-    fields: (Optional.) List of DataFrame fields. Fetched from `input_dataset`
-      by default.
     drop_remainder: (Optional.) If True, smaller final batch is dropped.
      `False` by default.
-    num_parallel_scans: (Optional.) Number of concurrent scans against fields
-      of input dataset.
+    fields: (Optional.) List of DataFrame fields. Fetched from `input_dataset`
+      by default.
   '''
  def _apply_fn(dataset):
    return RebatchDataset(
-        dataset, batch_size,
-        min_batch_size=min_batch_size,
-        fields=fields,
-        drop_remainder=drop_remainder,
-        num_parallel_scans=num_parallel_scans)
+        dataset, input_fields(dataset, fields), batch_size,
+        drop_remainder=drop_remainder)
  return _apply_fn
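The slimmed-down `rebatch` now takes only `batch_size`, `drop_remainder` and `fields`, resolving fields eagerly via `input_fields` instead of deferring to `RebatchDataset`. A usage sketch under the new signature, assuming `filenames` is in scope; whether an explicit upstream `.batch()` is still needed is not shown in this diff:

```python
import hybridbackend.tensorflow as hb

ds = hb.data.Dataset.from_parquet(filenames)
ds = ds.apply(hb.data.rebatch(1024, drop_remainder=True))  # emit 1024-row batches
```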
