Force overwrite existing filesystem protocol #5894

baskrahmer · 2023-05-24T21:41:52Z

HuggingFaceDocBuilderDev · 2023-05-25T04:13:07Z

The documentation is not available anymore as the PR was closed or merged.

albertvillanova

Thanks for the fix, @baskrahmer.

To fix the code quality issue, could you please run:

make style

albertvillanova

The tests are OK now. Thank you!

github-actions · 2023-05-25T06:52:08Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009139 / 0.011353 (-0.002214)	0.005634 / 0.011008 (-0.005374)	0.129587 / 0.038508 (0.091079)	0.038298 / 0.023109 (0.015189)	0.428149 / 0.275898 (0.152251)	0.443744 / 0.323480 (0.120264)	0.007501 / 0.007986 (-0.000485)	0.005999 / 0.004328 (0.001671)	0.100796 / 0.004250 (0.096546)	0.053236 / 0.037052 (0.016184)	0.423868 / 0.258489 (0.165379)	0.460110 / 0.293841 (0.166269)	0.041255 / 0.128546 (-0.087291)	0.013790 / 0.075646 (-0.061856)	0.438398 / 0.419271 (0.019127)	0.063086 / 0.043533 (0.019553)	0.414826 / 0.255139 (0.159687)	0.460652 / 0.283200 (0.177453)	0.121223 / 0.141683 (-0.020460)	1.754430 / 1.452155 (0.302275)	1.900037 / 1.492716 (0.407320)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.027222 / 0.018006 (0.009216)	0.617666 / 0.000490 (0.617176)	0.022443 / 0.000200 (0.022243)	0.000820 / 0.000054 (0.000766)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030397 / 0.037411 (-0.007014)	0.125732 / 0.014526 (0.111206)	0.149805 / 0.176557 (-0.026752)	0.234048 / 0.737135 (-0.503087)	0.143108 / 0.296338 (-0.153231)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.631189 / 0.215209 (0.415980)	6.182871 / 2.077655 (4.105216)	2.635730 / 1.504120 (1.131610)	2.231429 / 1.541195 (0.690235)	2.438360 / 1.468490 (0.969870)	0.861170 / 4.584777 (-3.723607)	5.785984 / 3.745712 (2.040272)	2.758358 / 5.269862 (-2.511504)	1.678095 / 4.565676 (-2.887582)	0.105961 / 0.424275 (-0.318314)	0.013659 / 0.007607 (0.006052)	0.762943 / 0.226044 (0.536898)	7.774399 / 2.268929 (5.505471)	3.319027 / 55.444624 (-52.125598)	2.700248 / 6.876477 (-4.176229)	3.008581 / 2.142072 (0.866509)	1.122522 / 4.805227 (-3.682705)	0.214832 / 6.500664 (-6.285832)	0.085281 / 0.075469 (0.009811)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.647610 / 1.841788 (-0.194177)	18.178316 / 8.074308 (10.104008)	21.199177 / 10.191392 (11.007785)	0.247063 / 0.680424 (-0.433361)	0.030443 / 0.534201 (-0.503758)	0.512527 / 0.579283 (-0.066757)	0.640758 / 0.434364 (0.206394)	0.639986 / 0.540337 (0.099649)	0.760113 / 1.386936 (-0.626823)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008293 / 0.011353 (-0.003060)	0.005360 / 0.011008 (-0.005648)	0.102932 / 0.038508 (0.064424)	0.037457 / 0.023109 (0.014347)	0.444114 / 0.275898 (0.168216)	0.512855 / 0.323480 (0.189375)	0.007030 / 0.007986 (-0.000956)	0.004954 / 0.004328 (0.000625)	0.095757 / 0.004250 (0.091507)	0.051239 / 0.037052 (0.014187)	0.471118 / 0.258489 (0.212629)	0.517764 / 0.293841 (0.223923)	0.041953 / 0.128546 (-0.086593)	0.013748 / 0.075646 (-0.061898)	0.118089 / 0.419271 (-0.301182)	0.060159 / 0.043533 (0.016626)	0.466011 / 0.255139 (0.210872)	0.489180 / 0.283200 (0.205980)	0.123250 / 0.141683 (-0.018433)	1.714738 / 1.452155 (0.262584)	1.838571 / 1.492716 (0.345855)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.267792 / 0.018006 (0.249785)	0.624313 / 0.000490 (0.623824)	0.007315 / 0.000200 (0.007115)	0.000136 / 0.000054 (0.000082)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033751 / 0.037411 (-0.003661)	0.122819 / 0.014526 (0.108293)	0.148270 / 0.176557 (-0.028286)	0.198581 / 0.737135 (-0.538554)	0.144845 / 0.296338 (-0.151494)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.620631 / 0.215209 (0.405422)	6.224665 / 2.077655 (4.147010)	2.856592 / 1.504120 (1.352473)	2.525089 / 1.541195 (0.983894)	2.600198 / 1.468490 (1.131708)	0.872038 / 4.584777 (-3.712739)	5.571650 / 3.745712 (1.825937)	5.907643 / 5.269862 (0.637782)	2.348770 / 4.565676 (-2.216906)	0.111665 / 0.424275 (-0.312610)	0.013886 / 0.007607 (0.006278)	0.762154 / 0.226044 (0.536109)	7.792686 / 2.268929 (5.523758)	3.601122 / 55.444624 (-51.843503)	2.939412 / 6.876477 (-3.937064)	2.973430 / 2.142072 (0.831358)	1.065016 / 4.805227 (-3.740211)	0.221701 / 6.500664 (-6.278963)	0.088157 / 0.075469 (0.012688)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.771061 / 1.841788 (-0.070727)	18.826926 / 8.074308 (10.752618)	21.283830 / 10.191392 (11.092438)	0.239233 / 0.680424 (-0.441191)	0.026159 / 0.534201 (-0.508042)	0.487074 / 0.579283 (-0.092209)	0.623241 / 0.434364 (0.188877)	0.600506 / 0.540337 (0.060169)	0.691271 / 1.386936 (-0.695665)

baskrahmer added 2 commits May 24, 2023 23:37

create test for overwriting filesystem registries

c4c8bb3

set clobber to True and raise a warning when overwriting registry

2500f6f

albertvillanova requested changes May 25, 2023

albertvillanova changed the title ~~Incompatibility datalab~~ Force overwrite existing filesystem protocol May 25, 2023

formatting

e4e9e3b

albertvillanova approved these changes May 25, 2023

albertvillanova merged commit 1bbe2c3 into huggingface:main May 25, 2023
