Pandas dataframe input
Pull request #2426 introduces a generic extensible framework for VW to understand structured Pandas dataframes.
The class DFtoVW in vowpalwabbit.pyvw takes as input a pandas.DataFrame and special types (SimpleLabel, Feature, Namespace) that specify the desired VW conversion.
These types make extensive use of a class Col, which refers to a given column in the user-specified dataframe.
A simpler interface, DFtoVW.from_colnames, can also be used for simple use cases. Its main benefit is that the user does not need to use the specific types.
Below are some usages of this class. They all rely on the following pandas.DataFrame called df:
   house_id  need_new_roof  price  sqft   age  year_built
0       id1              0   0.23  0.25  0.05        2006
1       id2              1   0.18  0.15  0.35        1976
2       id3              0   0.53  0.32  0.87        1924
Let's say we want to build a VW dataset with the target need_new_roof and the features age and year_built:
from vowpalwabbit.pyvw import DFtoVW
conv = DFtoVW.from_colnames(y="need_new_roof", x=["age", "year_built"], df=df)
Then we can use the method process_df:
conv.process_df()
which outputs the following list:
['0 | 0.05 2006', '1 | 0.35 1976', '0 | 0.87 1924']
This list can then be consumed directly by the method pyvw.model.learn.
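A minimal sketch of that consumption step, assuming a model object that exposes a learn method accepting one VW-format string per call. The helper train_on_examples and its passes parameter are illustrative and not part of the library; the commented-out lines assume that pyvw.vw is the model class in the installed version, which may differ between releases.

```python
def train_on_examples(model, examples, passes=1):
    """Feed each VW-format string to model.learn; return the number of calls made.

    `model` is any object with a learn(str) method (assumption).
    """
    n_calls = 0
    for _ in range(passes):
        for example in examples:
            model.learn(example)
            n_calls += 1
    return n_calls


# The list returned by conv.process_df() in the example above.
examples = ['0 | 0.05 2006', '1 | 0.35 1976', '0 | 0.87 1924']

# With vowpalwabbit installed, usage might look like this (assumption):
#     from vowpalwabbit import pyvw
#     model = pyvw.vw(quiet=True)
#     train_on_examples(model, examples)
#     model.finish()
```

Keeping the loop in a small function makes it easy to repeat passes over the same converted list without converting the dataframe again.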
The class DFtoVW also allows the following patterns in its default constructor:
- tag
- (named) namespaces, with scaling factor
- (named) features, with constant feature possible
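As a rough illustration of how the three patterns above map onto a VW text line, here is a minimal pure-Python sketch that reproduces the strings shown in the examples on this page. The helpers format_namespace and format_example are hypothetical and not part of vowpalwabbit; they only mimic the layout of a label, an optional tag, and one or more |namespace groups.

```python
def format_namespace(features, name=None, scale=None):
    """Build one '|name:scale feat:value ...' group.

    `features` is a list of (feature_name_or_None, value) pairs.
    The scale is only emitted for a named namespace.
    """
    head = "|"
    if name is not None:
        head += name
        if scale is not None:
            head += ":" + str(scale)
    body = " ".join(
        f"{feat_name}:{value}" if feat_name is not None else str(value)
        for feat_name, value in features
    )
    return f"{head} {body}"


def format_example(label, namespaces, tag=None):
    """Join label, optional tag and namespace groups into one line."""
    # With a tag, the first '|' follows the tag directly;
    # without one, a space separates the label from the first '|'.
    left = f"{label} {tag}" if tag is not None else f"{label} "
    return left + " ".join(namespaces)


line = format_example(
    label=0,
    tag="id1",
    namespaces=[
        format_namespace([("sqm", 0.25)], name="Imperial", scale=0.092),
        format_namespace([(None, 0.23), (None, 0.05)], name="DoubleIt", scale=2),
    ],
)
print(line)  # 0 id1|Imperial:0.092 sqm:0.25 |DoubleIt:2 0.23 0.05
```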
To use these more complex patterns, we need to import the corresponding types:
from vowpalwabbit.pyvw import SimpleLabel, Namespace, Feature, Col
Let's create a VW dataset that includes a named namespace (with scaling) and a named feature:
conv = DFtoVW(
    df=df,
    label=SimpleLabel(Col("need_new_roof")),
    namespaces=Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm"))
)
conv.process_df()
which yields:
['0 |Imperial:0.092 sqm:0.25',
 '1 |Imperial:0.092 sqm:0.15',
 '0 |Imperial:0.092 sqm:0.32']
Let's create a more complex example with a tag and multiple namespaces with multiple features:
conv = DFtoVW(
    df=df,
    label=SimpleLabel(Col("need_new_roof")),
    tag=Col("house_id"),
    namespaces=[
        Namespace(name="Imperial", value=0.092, features=Feature(value=Col("sqft"), name="sqm")),
        Namespace(name="DoubleIt", value=2, features=[Feature(value=Col("price")), Feature(Col("age"))])
    ]
)
conv.process_df()
which yields:
['0 id1|Imperial:0.092 sqm:0.25 |DoubleIt:2 0.23 0.05',
 '1 id2|Imperial:0.092 sqm:0.15 |DoubleIt:2 0.18 0.35',
 '0 id3|Imperial:0.092 sqm:0.32 |DoubleIt:2 0.53 0.87']
- The class DFtoVW and the specific types are located in vowpalwabbit/pyvw.py. The class only depends on the pandas module.
- The code includes docstrings.
- 8 tests are included in tests/test_pyvw.py.
- This framework does not yet handle multiline examples or more complex label types.
- To convert a very large dataset that cannot fit in RAM, one can use the pandas import option chunksize and process one chunk at a time. This functionality could be implemented directly in the class using a generator. The generator would then either be consumed by a VW learning interface or be written to an external file (for conversion purposes only).
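The chunked conversion described in the last item could be sketched as follows. This is not implemented in DFtoVW; the function names iter_vw_lines, write_vw_file and vw_lines_from_csv are hypothetical, and the default chunk size is an arbitrary choice. The pandas and vowpalwabbit imports are deferred to call time so that the generic generator can be used and tested on its own.

```python
def iter_vw_lines(chunks, convert):
    """Lazily yield VW-format lines, converting one chunk at a time."""
    for chunk in chunks:
        # Only the current chunk and its converted lines are held in memory.
        yield from convert(chunk)


def write_vw_file(lines, path):
    """Write one example per line; return how many lines were written."""
    count = 0
    with open(path, "w") as out:
        for line in lines:
            out.write(line + "\n")
            count += 1
    return count


def vw_lines_from_csv(csv_path, y, x, chunksize=10_000):
    """Wire the generator to pandas.read_csv and DFtoVW.from_colnames."""
    import pandas as pd
    from vowpalwabbit.pyvw import DFtoVW

    def convert(chunk):
        # Resetting the index makes each chunk look like a fresh dataframe
        # (a precaution; it may not be required).
        chunk = chunk.reset_index(drop=True)
        return DFtoVW.from_colnames(y=y, x=x, df=chunk).process_df()

    # With chunksize set, read_csv returns an iterator of dataframes.
    return iter_vw_lines(pd.read_csv(csv_path, chunksize=chunksize), convert)
```

The resulting generator can be passed to write_vw_file for a conversion-only run, or iterated over and fed to a model's learn method one line at a time.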