doc/manual.rst
+21 -14
Original file line number
Diff line number
Diff line change
@@ -301,23 +301,29 @@ Other
301
301
Supported formats for these training and testing pairs are: np.ndarray,
302
302
pd.DataFrame, scipy.sparse.csr_matrix and python lists.
303
303
304
-
If your data contains categorical values (in the features or targets), autosklearn will automatically encode your
305
-
data using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
306
-
for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_
307
-
for multidimensional data.
308
-
309
-
Regarding the features, there are two methods to guide *auto-sklearn* to properly encode categorical columns:
304
+
Regarding the features, there are multiple things to consider:
310
305
311
306
* Providing an X_train/X_test numpy array with the optional flag feat_type. For further details, you
312
307
can check the Example :ref:`sphx_glr_examples_40_advanced_example_feature_types.py`.
313
308
* You can provide a pandas DataFrame, with properly formatted columns. If a column has numerical
314
-
dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. If the
315
-
column has a categorical/boolean class, it will be encoded. If the column is of any other type
316
-
(Object or Timeseries), an error will be raised. For further details on how to properly encode
317
-
your data, you can check the Pandas Example
318
-
`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_).
319
-
If you are working with time series, it is recommended that you follow this approach
309
+
dtype, *auto-sklearn* will not encode it and it will be passed directly to scikit-learn. *auto-sklearn*
310
+
supports both categorical and string column types. Please ensure that you are using the correct
311
+
dtype for your task. By default *auto-sklearn* treats object and string columns as strings and
312
+
encodes the data using `sklearn.feature_extraction.text.CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_.
313
+
* If your data contains categorical values (in the features or targets), ensure that you explicitly label them as categorical.
314
+
Data labeled as categorical is encoded using a `sklearn.preprocessing.LabelEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>`_
315
+
for unidimensional data and a `sklearn.preprocessing.OrdinalEncoder <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html>`_ for multidimensional data.
316
+
* For further details on how to properly encode your data, you can check the Pandas Example
317
+
`Working with categorical data <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_. If you are working with time series, it is recommended that you follow this approach
320
318
`Working with time data <https://stats.stackexchange.com/questions/311494/>`_.
319
+
* If you prefer not to use the string option at all, you can disable it. In this case
320
+
objects, strings and categorical columns are encoded as categorical.
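
The dtype advice in the list above can be sketched with pandas alone. This is a minimal illustration, not auto-sklearn's own logic: the toy DataFrame, its column names, and the helper ``describe_column`` are all hypothetical and exist only to show how a column can be labeled explicitly as categorical or string before it is passed on.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame(
    {
        "age": [23, 41, 35, 52],
        "color": ["red", "blue", "red", "green"],
        "review": ["great fit", "too small", "love it", "poor quality"],
    }
)

# Explicitly label the column holding a small set of labels as categorical.
df["color"] = df["color"].astype("category")
# Explicitly label the free-text column with pandas' string dtype.
df["review"] = df["review"].astype("string")


def describe_column(series: pd.Series) -> str:
    """Hypothetical helper (not part of auto-sklearn): name a column's dtype family."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        return "categorical"
    if pd.api.types.is_bool_dtype(series):
        return "boolean"
    if pd.api.types.is_numeric_dtype(series):
        return "numerical"
    # Anything left over (object or string dtype) is reported as string here.
    return "string"


# Map every column name to the dtype family it was labeled with.
kinds = {name: describe_column(df[name]) for name in df.columns}
print(kinds)
```

Checking the dtypes this way before fitting makes it easy to spot a column that was left as ``object`` by accident and would therefore be treated as a string rather than as categorical.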