This repository was archived by the owner on Feb 2, 2024. It is now read-only.

Commit da003ef

Refactor read_csv to support converters (#993)
* Refactor read_csv to support converters

  Motivation: support using pd.read_csv with converters, to avoid costly unboxing of string columns read with pyarrow read_csv when returning a DF from objmode.

* Adds StdStringView Numba type to hstr_ext

  Motivation: an optimization that avoids a copy when creating NRT-manageable unicode instances from string data stored in native extensions.

* Adds zip and dict builtins overloads to support easy literal dict ctor

  Motivation: there is no easy way to create Numba LiteralStrKeyDict objects for const dicts with many elements. This adds a special overload for the dict builtin that creates a LiteralStrKeyDict from a tuple of pairs ('col_name', col_data).

* Replacing zip overload builtin with internal sdc_tuple_zip function

  Details: the zip builtin is already overloaded in Numba and has priority over user-defined overloads. Hence, when we want to zip two single-element tuples, e.g. zip(('A', ), (1, )), the builtin function will match and type inference will unliteral all tuples, producing iter objects (which are always homogeneous in Numba). That is, the literality of the objects will be lost. Using sdc_zip_tuples explicitly avoids this problem.

* Fixing issue with literal dict ctor with single element
* Moving stringlib to native
* Fixing refcnt issue and adding tests
* Adding rewrite for dict(zip()) calls
* Fixing str_view_to_float impl and tests
* Fixing refcnt problem with pyarrow table ptr
* Fixing bugs found in failed tests and examples
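As a rough illustration of the two ideas above (per-column converters, and building a column dict from a tuple of ('col_name', col_data) pairs), here is a minimal pure-Python sketch. It uses only the standard library `csv` module rather than pandas, pyarrow or Numba, so it shows the shape of the data, not SDC's implementation. The helpers `tuple_zip` and `read_csv_columns` are hypothetical names introduced for this example.

```python
import csv
import io


def tuple_zip(names, columns):
    """Pair two equal-length tuples element-wise.

    Hypothetical stand-in for an explicit tuple-zipping helper: it returns
    a tuple of pairs instead of a lazy iterator.
    """
    if len(names) != len(columns):
        raise ValueError("names and columns must have the same length")
    return tuple((names[i], columns[i]) for i in range(len(names)))


def read_csv_columns(text, converters):
    """Read CSV text into a dict of column name -> list of values.

    `converters` maps a column name to a callable applied to each cell
    string; columns without a converter are kept as strings.
    """
    reader = csv.reader(io.StringIO(text))
    header = tuple(next(reader))
    columns = tuple([] for _ in header)
    for row in reader:
        for name, column, cell in zip(header, columns, row):
            convert = converters.get(name, str)
            column.append(convert(cell))
    # Build the result from a tuple of ('col_name', col_data) pairs.
    return dict(tuple_zip(header, columns))


csv_text = "A,B\n1,2.5\n3,4.5\n"
frame = read_csv_columns(csv_text, converters={"A": int, "B": float})
print(frame)  # {'A': [1, 3], 'B': [2.5, 4.5]}
```

In plain Python, `dict(zip(names, columns))` would give the same result. The explicit tuple of pairs matters only in the Numba setting the commit message describes, where an iterator would lose the literal types of the elements.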
1 parent dfaa715 commit da003ef

12 files changed, +1606 -1171 lines changed

sdc/datatypes/hpat_pandas_functions.py

+300 -267
Large diffs are not rendered by default.

sdc/datatypes/hpat_pandas_series_functions.py

+3 -2
@@ -72,6 +72,7 @@
 from sdc import sdc_autogenerated
 from sdc.functions import numpy_like
 from sdc.hiframes.api import isna
+from sdc.hiframes.join import setitem_arr_nan
 from sdc.datatypes.hpat_pandas_groupby_functions import init_series_groupby
 from sdc.utilities.prange_utils import parallel_chunks
 from sdc.extensions.indexes.indexes_generic import sdc_indexes_join_outer, sdc_fix_indexes_join
@@ -679,7 +680,7 @@ def sdc_pandas_series_setitem_idx_bool_array_align_impl(self, idx, value):
     if self_index_value in map_index_to_position:
         series_data[i] = value._data[map_index_to_position[self_index_value]]
     else:
-        sdc.hiframes.join.setitem_arr_nan(series_data, i)
+        setitem_arr_nan(series_data, i)
 
 else:
     # if value has no index - nothing to reindex and assignment is made along positions set by idx mask
@@ -734,7 +735,7 @@ def sdc_pandas_series_setitem_idx_bool_series_align_impl(self, idx, value):
         value_index_pos = map_value_index_to_position[idx_index_value]
         self._data[self_index_pos] = value._data[value_index_pos]
     else:
-        sdc.hiframes.join.setitem_arr_nan(self._data, map_self_index_to_position[idx_index_value])
+        setitem_arr_nan(self._data, map_self_index_to_position[idx_index_value])
 else:
     # use filtered index values to create a set mask, then make assignment to self
     # using this mask (i.e. the order of filtered indices in self.index does not matter)
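The hunks above touch the index-aligned branch of boolean-mask setitem: for every masked position, the label is looked up in the value's index, and the position is set to NaN when the label is missing. A minimal pure-Python sketch of that logic follows, assuming float data held in plain lists. `setitem_nan` is a hypothetical stand-in for `setitem_arr_nan`, whose real behavior is not shown in this diff, and `setitem_bool_mask_align` is an illustrative name, not an SDC function.

```python
import math


def setitem_nan(data, i):
    """Hypothetical stand-in for setitem_arr_nan: mark position i as missing."""
    data[i] = math.nan


def setitem_bool_mask_align(self_data, self_index, mask, value_data, value_index):
    """Assign into a copy of self_data where mask is True, aligning by label.

    For each selected position, the label from self_index is looked up in
    value_index; if it is absent, the position is set to NaN.
    """
    map_index_to_position = {label: pos for pos, label in enumerate(value_index)}
    series_data = list(self_data)
    for i, selected in enumerate(mask):
        if not selected:
            continue
        self_index_value = self_index[i]
        if self_index_value in map_index_to_position:
            series_data[i] = value_data[map_index_to_position[self_index_value]]
        else:
            setitem_nan(series_data, i)
    return series_data


result = setitem_bool_mask_align(
    self_data=[1.0, 2.0, 3.0],
    self_index=["a", "b", "c"],
    mask=[True, False, True],
    value_data=[10.0, 30.0],
    value_index=["a", "x"],
)
print(result)  # [10.0, 2.0, nan]
```

Position 0 takes the value labelled "a", position 1 is untouched because the mask is False, and position 2 becomes NaN because label "c" does not appear in the value's index.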
