Commit 3bee923

[feat] add vocab file for features (#97)
1 parent bb604c4 commit 3bee923

19 files changed, +247 -51 lines changed

data/test/id_vocab_dict_2

+4
@@ -0,0 +1,4 @@
+0
+<OOV> 1
+abc 2
+efg 2

data/test/id_vocab_dict_3

+4
@@ -0,0 +1,4 @@
+xyz 0
+<OOV> 1
+abc 2
+efg 2

data/test/id_vocab_list_0

+4
@@ -0,0 +1,4 @@
+<BLK>
+<OOV>
+abc
+efg

data/test/id_vocab_list_1

+4
@@ -0,0 +1,4 @@
+xyz
+<OOV>
+abc
+efg
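
The fixture files in this commit use two layouts: list form (one word per line) and dict form (a word and an explicit index separated by a space). The sketch below shows one way such files could be read into a word-to-index mapping. `parse_list_vocab` and `parse_dict_vocab` are hypothetical helpers written for illustration; they are not part of pyfg or TorchEasyRec. The assumption that a list-form index equals the zero-based line position is inferred from the expected values in the accompanying unit test, not from pyfg documentation.

```python
from typing import Dict, Iterable


def parse_list_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """List form: one word per line; index = zero-based line position (assumed)."""
    return {line.rstrip("\n"): pos for pos, line in enumerate(lines)}


def parse_dict_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """Dict form: '<word> <index>' per line, split on the last space."""
    vocab = {}
    for line in lines:
        word, _, index = line.rstrip("\n").rpartition(" ")
        vocab[word] = int(index)
    return vocab


# Contents of data/test/id_vocab_list_1 and data/test/id_vocab_dict_3.
list_vocab = parse_list_vocab(["xyz", "<OOV>", "abc", "efg"])
dict_vocab = parse_dict_vocab(["xyz 0", "<OOV> 1", "abc 2", "efg 2"])

# Embedding rows needed under this model: largest index + 1.
list_size = max(list_vocab.values()) + 1
dict_size = max(dict_vocab.values()) + 1
```

Under these assumptions the sizes agree with the `expected_num_embeddings` values used for the same files in `test_id_feature_with_vocab_file`.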

docs/source/feature/data.md

+2-2
@@ -159,11 +159,11 @@ sample_weight_fields: 'col_name'
 - --ODPS_CONFIG_FILE_PATH: this environment variable points to the odpscmd configuration file
 - Install pyfg in an exclusive resource group of [DataWorks](https://workbench.data.aliyun.com/): open "Resource Group List", click "O&M Assistant" in the "Actions" column of a scheduling resource group, then "Create Command" (choose manual input), then "Run Command"
 ```shell
-/home/tops/bin/pip3 install http://tzrec.oss-cn-beijing.aliyuncs.com/third_party/pyfg039-0.3.9-cp37-cp37m-linux_x86_64.whl
+/home/tops/bin/pip3 install http://tzrec.oss-cn-beijing.aliyuncs.com/third_party/pyfg044-0.4.4-cp37-cp37m-linux_x86_64.whl
 ```
 - Create a `PyODPS 3` node in DataWorks to run FG, and set the bizdate parameter in the node's scheduling parameters
 ```
-from pyfg039 import offline_pyfg
+from pyfg044 import offline_pyfg
 offline_pyfg.run(
     o,
     input_table="YOU_PROJECT.TABLE_NAME",

docs/source/feature/feature.md

+10-20
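@@ -14,7 +14,7 @@ TorchEasyRec feature types, including IdFeature, RawFeature, ComboFeature
 
 - **init_fn**: embedding initializer for the feature. Not required by default; to customize it, set any built-in torch initialization function, e.g. `nn.init.uniform_,a=-0.01,b=0.01`
 
-- **default_value**: default value of the feature. If the default value is "", there is no default value, and the embedding of an empty feature is a zero vector in the model. Note: this is the default value before `bucktize`. `bucktize` settings include `hash_bucket_size`/`vocab_list`/`boundaries`
+- **default_value**: default value of the feature. If the default value is "", there is no default value, and the embedding of an empty feature is a zero vector in the model. Note: this is the default value before `bucketize`. `bucketize` settings include `num_buckets`/`hash_bucket_size`/`vocab_list`/`vocab_dict`/`vocab_file`/`boundaries`
 
 - **separator**: multi-value separator used by FG when the input is of string type; defaults to `\x1d`. An ARRAY type is the recommended way to represent multiple values, since it gives better training and inference performance
 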

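@@ -86,6 +86,11 @@ feature_configs {
 
 - **vocab_dict**: vocabulary in dict form, suited to cases where several words must be encoded to the same index. **Indices must start from 2**: index 0 is reserved for the default value and index 1 for out-of-vocabulary words
 
+- **vocab_file**: path to a vocabulary file in list or dict form, suited to features that have many values but can still be enumerated. No indices are reserved, so the **default_bucketize_value** parameter must be set
+
+  - List form: one word per line
+  - Dict form: one word and its index per line, separated by a space
+
 - **zch**: zero-collision hash; admission and eviction policies for IDs can be configured, see the [documentation](../zch.md)
 
 - **weighted**: whether this is a weighted ID feature; the input format is `k1:v1\x1dk2:v2`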
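
Because `vocab_file` reserves no indices, the file and `default_bucketize_value` have to agree on where unknown words go. One possible arrangement, mirroring the `<OOV>` entry in the test fixtures, is to put an explicit placeholder line in the file and point `default_bucketize_value` at its index. The sketch below builds such a list-form vocabulary from raw values; `build_list_vocab`, `OOV_TOKEN`, and the `min_count` threshold are hypothetical names chosen for illustration and are not part of TorchEasyRec or pyfg.

```python
from collections import Counter
from typing import Iterable, List, Tuple

OOV_TOKEN = "<OOV>"  # hypothetical placeholder, mirrors the test fixtures


def build_list_vocab(
    values: Iterable[str], min_count: int = 1
) -> Tuple[List[str], int]:
    """Return list-form vocab lines and the index of the OOV placeholder."""
    counts = Counter(values)
    # Keep words seen at least min_count times; sort for a stable file.
    words = sorted(w for w, c in counts.items() if c >= min_count)
    lines = [OOV_TOKEN] + words
    return lines, 0  # the placeholder is the first line, so its index is 0


lines, oov_index = build_list_vocab(["abc", "efg", "abc", "hij"], min_count=2)
vocab_text = "\n".join(lines) + "\n"  # what would be written to the vocab file
```

Here `oov_index` is the value one would pass as `default_bucketize_value`, assuming a list-form index equals the zero-based line position.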
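@@ -240,21 +245,13 @@ feature_configs: {
 
 If the Map values are discrete, or `need_key=true`, you can set:
 
-- **hash_bucket_size**: size of the hash bucket.
-- **num_buckets**: number of buckets; usable only when the input is of integer type
-- **vocab_list**: vocabulary list, suited to features with few, enumerable values.
-- **vocab_dict**: vocabulary in dict form, suited to cases where several words must be encoded to the same index. **Indices must start from 2**: index 0 is reserved for the default value and index 1 for out-of-vocabulary words
-- **zch**: zero-collision hash; admission and eviction policies for IDs can be configured, see the [documentation](../zch.md)
 - **value_dim**: defaults to 1; can be set to 0, in which case multi-value ID output is supported
+- Other settings are the same as IdFeature
 
 If the Map values are continuous, you can set:
 
-- **boundaries**: bucket boundary values.
-- **normalizer**: transformation applied to continuous features, same as RawFeature
 - **value_dim**: defaults to 1; output dimension of the continuous value
-- **value_separator**: separator for continuous values
-- **mlp**: transforms the feature to `embedding_dim` dimensions with a single-layer MLP
-- **autodis**: transforms the feature to `embedding_dim` dimensions with the AutoDis module, see the [AutoDis documentation](../autodis.md)
+- Other settings are the same as RawFeature
 
 ## MatchFeature: primary/secondary key dictionary lookup feature
 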
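@@ -283,20 +280,13 @@ feature_configs: {
 
 If the Map values are discrete, or `show_pkey=true``show_skey=true`, you can set:
 
-- **hash_bucket_size**: size of the hash bucket.
-- **num_buckets**: number of buckets; usable only when the input is of integer type
-- **vocab_list**: vocabulary list, suited to features with few, enumerable values.
-- **vocab_dict**: vocabulary in dict form, suited to cases where several words must be encoded to the same index. **Indices must start from 2**: index 0 is reserved for the default value and index 1 for out-of-vocabulary words
-- **zch**: zero-collision hash; admission and eviction policies for IDs can be configured, see the [documentation](../zch.md)
 - **value_dim**: defaults to 1; can be set to 0, in which case multi-value ID output is supported
+- Other settings are the same as IdFeature
 
 If the Map values are continuous, you can set:
 
-- **boundaries**: bucket boundary values.
-- **normalizer**: transformation applied to continuous features, same as RawFeature
 - **value_dim**: currently only value_dim=1 is supported
-- **mlp**: transforms the feature to `embedding_dim` dimensions with a single-layer MLP
-- **autodis**: transforms the feature to `embedding_dim` dimensions with the AutoDis module, see the [AutoDis documentation](../autodis.md)
+- Other settings are the same as RawFeature
 
 ## ExprFeature: expression feature
 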
docs/source/usage/serving.md

+1-1
@@ -43,7 +43,7 @@ cat << EOF > tzrec_rank.json
       }
     }
   ],
-  "processor":"easyrec-torch-0.7"
+  "processor":"easyrec-torch-1.0"
 }
 EOF

requirements/runtime.txt

+2-2
@@ -7,8 +7,8 @@ graphlearn @ https://tzrec.oss-cn-beijing.aliyuncs.com/third_party/graphlearn-1.
 graphlearn @ https://tzrec.oss-cn-beijing.aliyuncs.com/third_party/graphlearn-1.3.3-cp310-cp310-linux_x86_64.whl ; python_version=="3.10"
 grpcio-tools<1.63.0
 pandas
-pyfg @ https://tzrec.oss-cn-beijing.aliyuncs.com/third_party/pyfg-0.3.9-cp311-cp311-linux_x86_64.whl ; python_version=="3.11"
-pyfg @ https://tzrec.oss-cn-beijing.aliyuncs.com/third_party/pyfg-0.3.9-cp310-cp310-linux_x86_64.whl ; python_version=="3.10"
+pyfg @ https://tzrec.oss-cn-beijing.aliyuncs.com/third_party/pyfg-0.4.4-cp311-cp311-linux_x86_64.whl ; python_version=="3.11"
+pyfg @ https://tzrec.oss-cn-beijing.aliyuncs.com/third_party/pyfg-0.4.4-cp310-cp310-linux_x86_64.whl ; python_version=="3.10"
 pyodps>=0.12.0
 scikit-learn
 tensorboard

tzrec/features/combo_feature.py

+6
@@ -62,6 +62,9 @@ def num_embeddings(self) -> int:
             num_embeddings = len(self.vocab_list)
         elif len(self.vocab_dict) > 0:
             num_embeddings = max(list(self.vocab_dict.values())) + 1
+        elif len(self.vocab_file) > 0:
+            self.init_fg()
+            num_embeddings = self._fg_op.vocab_list_size()
         else:
             raise ValueError(
                 f"{self.__class__.__name__}[{self.name}] must set hash_bucket_size"
@@ -126,4 +129,7 @@ def fg_json(self) -> List[Dict[str, Any]]:
         elif len(self.vocab_dict) > 0:
             fg_cfg["vocab_dict"] = self.vocab_dict
             fg_cfg["default_bucketize_value"] = self.default_bucketize_value
+        elif len(self.vocab_file) > 0:
+            fg_cfg["vocab_file"] = self.vocab_file
+            fg_cfg["default_bucketize_value"] = self.default_bucketize_value
         return [fg_cfg]

tzrec/features/feature.py

+15
@@ -656,6 +656,21 @@ def vocab_dict(self) -> Dict[str, int]:
                 self._vocab_dict = {}
         return self._vocab_dict
 
+    @property
+    def vocab_file(self) -> str:
+        """Vocab file."""
+        if self.config.HasField("vocab_file"):
+            if not self.config.HasField("default_bucketize_value"):
+                raise ValueError(
+                    "default_bucketize_value must be set when use vocab_file."
+                )
+            vocab_file = self.config.vocab_file
+            if self.config.HasField("asset_dir"):
+                vocab_file = os.path.join(self.config.asset_dir, vocab_file)
+            return vocab_file
+        else:
+            return ""
+
     @property
     def default_bucketize_value(self) -> int:
         """Default bucketize value."""

tzrec/features/id_feature.py

+14
@@ -87,6 +87,9 @@ def num_embeddings(self) -> int:
             num_embeddings = len(self.vocab_list)
         elif len(self.vocab_dict) > 0:
             num_embeddings = max(list(self.vocab_dict.values())) + 1
+        elif len(self.vocab_file) > 0:
+            self.init_fg()
+            num_embeddings = self._fg_op.vocab_list_size()
         else:
             raise ValueError(
                 f"{self.__class__.__name__}[{self.name}] must set hash_bucket_size"
@@ -175,6 +178,10 @@ def fg_json(self) -> List[Dict[str, Any]]:
             fg_cfg["vocab_dict"] = self.vocab_dict
             fg_cfg["default_bucketize_value"] = self.default_bucketize_value
             fg_cfg["value_type"] = "string"
+        elif len(self.vocab_file) > 0:
+            fg_cfg["vocab_file"] = self.vocab_file
+            fg_cfg["default_bucketize_value"] = self.default_bucketize_value
+            fg_cfg["value_type"] = "string"
         elif self.config.HasField("num_buckets"):
             fg_cfg["num_buckets"] = self.config.num_buckets
             if self.config.default_value:
@@ -188,3 +195,10 @@ def fg_json(self) -> List[Dict[str, Any]]:
         else:
             fg_cfg["value_dim"] = 0
         return [fg_cfg]
+
+    def assets(self) -> Dict[str, str]:
+        """Asset file paths."""
+        assets = {}
+        if len(self.vocab_file) > 0:
+            assets["vocab_file"] = self.vocab_file
+        return assets
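
The path returned by `assets()` comes from the `vocab_file` property added in tzrec/features/feature.py, which checks that `default_bucketize_value` is set and prefixes `asset_dir` when one is configured. The following standalone sketch restates that logic with plain arguments instead of a protobuf config, so it can be run without TorchEasyRec. `resolve_vocab_file` is a hypothetical function name, and treating `None` or an empty string as "field not set" is an assumption standing in for `HasField`.

```python
import os
from typing import Optional


def resolve_vocab_file(
    vocab_file: Optional[str],
    default_bucketize_value: Optional[int],
    asset_dir: Optional[str] = None,
) -> str:
    """Return the effective vocab file path, or "" when none is configured."""
    if not vocab_file:
        # Stand-in for `not config.HasField("vocab_file")`.
        return ""
    if default_bucketize_value is None:
        # vocab_file reserves no indices, so a fallback index is mandatory.
        raise ValueError(
            "default_bucketize_value must be set when vocab_file is used."
        )
    if asset_dir:
        return os.path.join(asset_dir, vocab_file)
    return vocab_file
```

For example, `resolve_vocab_file("vocab.txt", 1, "assets")` joins the two path components, while `resolve_vocab_file("vocab.txt", None)` raises `ValueError`.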

tzrec/features/id_feature_test.py

+53
@@ -413,6 +413,59 @@ def test_id_feature_with_num_buckets(
         np.testing.assert_allclose(parsed_feat.values, np.array(expected_values))
         np.testing.assert_allclose(parsed_feat.lengths, np.array(expected_lengths))
 
+    @parameterized.expand(
+        [
+            ["", "data/test/id_vocab_list_0", 4, [2, 3, 1], [2, 0, 1]],
+            ["xyz", "data/test/id_vocab_list_1", 4, [2, 3, 0, 1], [2, 1, 1]],
+            ["", "data/test/id_vocab_dict_2", 3, [2, 2, 1], [2, 0, 1]],
+            ["xyz", "data/test/id_vocab_dict_3", 3, [2, 2, 0, 1], [2, 1, 1]],
+        ],
+        name_func=test_util.parameterized_name_func,
+    )
+    def test_id_feature_with_vocab_file(
+        self,
+        default_value,
+        vocab_file,
+        expected_num_embeddings,
+        expected_values,
+        expected_lengths,
+    ):
+        id_feat_cfg = feature_pb2.FeatureConfig(
+            id_feature=feature_pb2.IdFeature(
+                feature_name="id_feat",
+                embedding_dim=16,
+                vocab_file=vocab_file,
+                default_bucketize_value=1,
+                expression="user:id_str",
+                pooling="mean",
+                default_value=default_value,
+            )
+        )
+
+        id_feat = id_feature_lib.IdFeature(id_feat_cfg, fg_mode=FgMode.FG_NORMAL)
+
+        expected_emb_bag_config = EmbeddingBagConfig(
+            num_embeddings=expected_num_embeddings,
+            embedding_dim=16,
+            name="id_feat_emb",
+            feature_names=["id_feat"],
+            pooling=PoolingType.MEAN,
+        )
+        self.assertEqual(repr(id_feat.emb_bag_config), repr(expected_emb_bag_config))
+        expected_emb_config = EmbeddingConfig(
+            num_embeddings=expected_num_embeddings,
+            embedding_dim=16,
+            name="id_feat_emb",
+            feature_names=["id_feat"],
+        )
+        self.assertEqual(repr(id_feat.emb_config), repr(expected_emb_config))
+
+        input_data = {"id_str": pa.array(["abc\x1defg", "", "hij"])}
+        parsed_feat = id_feat.parse(input_data)
+        self.assertEqual(parsed_feat.name, "id_feat")
+        np.testing.assert_allclose(parsed_feat.values, np.array(expected_values))
+        np.testing.assert_allclose(parsed_feat.lengths, np.array(expected_lengths))
+
 
 if __name__ == "__main__":
     unittest.main()
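
The expected `values` and `lengths` in `test_id_feature_with_vocab_file` can be reasoned about with a small pure-Python model. It is a simplified sketch and not the pyfg implementation: it assumes rows are split on `\x1d`, that an empty row takes `default_value` when one is set and otherwise contributes no tokens, and that unknown words map to `default_bucketize_value`. The `bucketize` function and the inline vocabularies are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

SEP = "\x1d"  # multi-value separator used in the test input


def bucketize(
    rows: List[str],
    vocab: Dict[str, int],
    default_value: str,
    default_bucketize_value: int,
) -> Tuple[List[int], List[int]]:
    """Map each row to vocab indices; return flat values and per-row lengths."""
    values: List[int] = []
    lengths: List[int] = []
    for row in rows:
        if row == "":
            # An empty row uses the default value if set, else yields nothing.
            tokens = [default_value] if default_value else []
        else:
            tokens = row.split(SEP)
        values.extend(vocab.get(t, default_bucketize_value) for t in tokens)
        lengths.append(len(tokens))
    return values, lengths


rows = ["abc\x1defg", "", "hij"]
list_0 = {"<BLK>": 0, "<OOV>": 1, "abc": 2, "efg": 3}  # id_vocab_list_0
dict_3 = {"xyz": 0, "<OOV>": 1, "abc": 2, "efg": 2}    # id_vocab_dict_3

no_default = bucketize(rows, list_0, "", 1)
with_default = bucketize(rows, dict_3, "xyz", 1)
```

Under these assumptions the model reproduces the expectations of the first and fourth parameterized cases: the unknown word `hij` falls back to index 1, and the empty row either contributes nothing or resolves `xyz` to index 0.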

tzrec/features/lookup_feature.py

+16
@@ -85,6 +85,7 @@ def is_sparse(self) -> bool:
                 or self.config.HasField("num_buckets")
                 or len(self.vocab_list) > 0
                 or len(self.vocab_dict) > 0
+                or len(self.vocab_file) > 0
                 or len(self.config.boundaries) > 0
             )
         return self._is_sparse
@@ -102,6 +103,9 @@ def num_embeddings(self) -> int:
             num_embeddings = len(self.vocab_list)
         elif len(self.vocab_dict) > 0:
             num_embeddings = max(list(self.vocab_dict.values())) + 1
+        elif len(self.vocab_file) > 0:
+            self.init_fg()
+            num_embeddings = self._fg_op.vocab_list_size()
         else:
             num_embeddings = len(self.config.boundaries) + 1
         return num_embeddings
@@ -235,6 +239,11 @@ def fg_json(self) -> List[Dict[str, Any]]:
             fg_cfg["default_bucketize_value"] = self.default_bucketize_value
             fg_cfg["value_type"] = "string"
             fg_cfg["needDiscrete"] = True
+        elif len(self.vocab_file) > 0:
+            fg_cfg["vocab_file"] = self.vocab_file
+            fg_cfg["default_bucketize_value"] = self.default_bucketize_value
+            fg_cfg["value_type"] = "string"
+            fg_cfg["needDiscrete"] = True
         elif len(self.config.boundaries) > 0:
             fg_cfg["boundaries"] = list(self.config.boundaries)
 
@@ -247,3 +256,10 @@ def fg_json(self) -> List[Dict[str, Any]]:
         if raw_fg_cfg is not None:
             fg_cfgs.append(raw_fg_cfg)
         return fg_cfgs
+
+    def assets(self) -> Dict[str, str]:
+        """Asset file paths."""
+        assets = {}
+        if len(self.vocab_file) > 0:
+            assets["vocab_file"] = self.vocab_file
+        return assets

tzrec/features/match_feature.py

+16
@@ -87,6 +87,7 @@ def is_sparse(self) -> bool:
                 or self.config.HasField("num_buckets")
                 or len(self.config.vocab_list) > 0
                 or len(self.config.vocab_dict) > 0
+                or len(self.config.vocab_file) > 0
                 or len(self.config.boundaries) > 0
             )
         return self._is_sparse
@@ -104,6 +105,9 @@ def num_embeddings(self) -> int:
             num_embeddings = len(self.vocab_list) + 1
         elif len(self.vocab_dict) > 0:
             num_embeddings = max(list(self.vocab_dict.values())) + 1
+        elif len(self.vocab_file) > 0:
+            self.init_fg()
+            num_embeddings = self._fg_op.vocab_list_size()
         else:
             num_embeddings = len(self.config.boundaries) + 1
         return num_embeddings
@@ -208,10 +212,22 @@ def fg_json(self) -> List[Dict[str, Any]]:
             fg_cfg["default_bucketize_value"] = self.default_bucketize_value
             fg_cfg["value_type"] = "string"
             fg_cfg["needDiscrete"] = True
+        elif len(self.vocab_file) > 0:
+            fg_cfg["vocab_file"] = self.vocab_file
+            fg_cfg["default_bucketize_value"] = self.default_bucketize_value
+            fg_cfg["value_type"] = "string"
+            fg_cfg["needDiscrete"] = True
         elif len(self.config.boundaries) > 0:
             fg_cfg["boundaries"] = list(self.config.boundaries)
 
         if fg_cfg["needDiscrete"]:
             fg_cfg["value_dim"] = self.value_dim
             # del fg_cfg["combiner"]
         return [fg_cfg]
+
+    def assets(self) -> Dict[str, str]:
+        """Asset file paths."""
+        assets = {}
+        if len(self.vocab_file) > 0:
+            assets["vocab_file"] = self.vocab_file
+        return assets

tzrec/features/sequence_feature.py

+3
@@ -327,6 +327,9 @@ def fg_json(self) -> List[Dict[str, Any]]:
         elif len(self.config.vocab_dict) > 0:
             fg_cfg["vocab_dict"] = self.vocab_dict
             fg_cfg["default_bucketize_value"] = self.default_bucketize_value
+        elif len(self.vocab_file) > 0:
+            fg_cfg["vocab_file"] = self.vocab_file
+            fg_cfg["default_bucketize_value"] = self.default_bucketize_value
         if self.config.HasField("value_dim"):
             fg_cfg["value_dim"] = self.config.value_dim
         else:
