v0.4.0 (June 2025)

@jeremymanning released this 14 Jun 11:59

🚀 High-Performance Polars Backend + Simplified Text API

🎯 Key Features

⚡ NEW: High-Performance Polars Backend (2-100x faster!)

  • Dual DataFrame Support: Choose between pandas (default) or Polars backends
  • Zero Code Changes: Add backend='polars' to any operation for instant speedups
  • Comprehensive Coverage: All data types (arrays, text, files) work with both backends
  • Smart Type Preservation: DataFrames maintain their type when no backend is specified
  • Global Configuration: Set default backend preference with set_dataframe_backend('polars')
  • Cross-Backend Conversion: Seamlessly convert between pandas and Polars DataFrames
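
To make the interplay between the explicit `backend` argument, type preservation, and the global default concrete, here is a minimal sketch of one plausible resolution order. It is not datawrangler's code: `resolve_backend`, `set_default_backend`, and `_default_backend` are hypothetical names, and ranking type preservation above the global default is an assumption that the notes above do not spell out.

```python
from typing import Optional

VALID_BACKENDS = ("pandas", "polars")

# Hypothetical stand-in for a process-wide setting; not the library's own state.
_default_backend = "pandas"


def set_default_backend(name: str) -> None:
    """Record a global default backend (illustrative only)."""
    global _default_backend
    if name not in VALID_BACKENDS:
        raise ValueError(f"unknown backend: {name!r}")
    _default_backend = name


def resolve_backend(data: object, backend: Optional[str] = None) -> str:
    """Pick a backend name for `data` under the assumed precedence."""
    # 1. An explicit backend argument always wins.
    if backend is not None:
        if backend not in VALID_BACKENDS:
            raise ValueError(f"unknown backend: {backend!r}")
        return backend
    # 2. An existing DataFrame keeps its own type (assumed to outrank the default).
    top_level_module = type(data).__module__.split(".")[0]
    if top_level_module in VALID_BACKENDS and type(data).__name__ == "DataFrame":
        return top_level_module
    # 3. Everything else falls back to the global default.
    return _default_backend
```

Under these assumptions, a plain list or array resolves to whatever the global default is, while a pandas DataFrame stays pandas even after the default has been switched to Polars.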

📊 Performance Gains with Polars

  • Array Processing: 2-100x faster conversion for large datasets
  • Text Embeddings: 3-10x faster document processing
  • Memory Efficiency: 30-70% reduction in memory usage
  • Parallel Processing: Built-in multi-core optimization
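
The ranges above are broad, so it can help to measure the speedup on your own data. The following is a small, library-agnostic timing sketch using only the standard library; `time_call` and `speedup` are hypothetical helper names, not part of datawrangler.

```python
import time
from statistics import median
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock seconds over `repeats` calls of `fn`."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(
    baseline: Callable[[], object],
    candidate: Callable[[], object],
    repeats: int = 5,
) -> float:
    """Return how many times faster `candidate` ran than `baseline`."""
    baseline_s = time_call(baseline, repeats)
    # Guard against a zero reading on coarse timers.
    candidate_s = max(time_call(candidate, repeats), 1e-12)
    return baseline_s / candidate_s
```

With datawrangler installed, something like `speedup(lambda: dw.wrangle(large_array), lambda: dw.wrangle(large_array, backend='polars'))` would compare the two backends on a given array. A value above 1 means the Polars call was faster on that input.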

🎨 Simplified Text Model API (80% reduction in verbosity)

  • Simple String Format: {'model': 'all-MiniLM-L6-v2'} now works everywhere
  • Automatic Normalization: All model formats are converted to a unified dict internally
  • List Support: Lists of models work with simplified format
  • Full Backward Compatibility: All existing verbose syntax continues working
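
As a rough illustration of what normalizing to a unified dict could look like, here is a minimal sketch. It is an assumption about the general idea, not the library's implementation: `normalize_model` and `normalize_models` are hypothetical names, and the target shape is taken from the verbose `{'model', 'args', 'kwargs'}` form.

```python
from typing import Any, Dict, List, Union

ModelSpec = Union[str, Dict[str, Any]]


def normalize_model(spec: ModelSpec) -> Dict[str, Any]:
    """Expand a model name or partial dict into {'model', 'args', 'kwargs'}."""
    if isinstance(spec, str):
        # Simplified form: just the model name.
        return {"model": spec, "args": [], "kwargs": {}}
    if isinstance(spec, dict):
        if "model" not in spec:
            raise ValueError("model dict needs a 'model' key")
        # Verbose form: fill in any missing 'args' / 'kwargs'.
        return {
            "model": spec["model"],
            "args": list(spec.get("args", [])),
            "kwargs": dict(spec.get("kwargs", {})),
        }
    raise TypeError(f"unsupported model spec: {type(spec).__name__}")


def normalize_models(specs: Union[ModelSpec, List[ModelSpec]]) -> List[Dict[str, Any]]:
    """Accept one spec or a list of specs and always return a list of dicts."""
    if isinstance(specs, (list, tuple)):
        return [normalize_model(s) for s in specs]
    return [normalize_model(specs)]


print(normalize_model("all-MiniLM-L6-v2"))
# → {'model': 'all-MiniLM-L6-v2', 'args': [], 'kwargs': {}}
```

Because both the string and the verbose dict end up in the same shape, downstream code only needs to handle one format, which is what makes the older syntax keep working.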

📋 Quick Start Examples

High-Performance Processing

import datawrangler as dw
import numpy as np

# Large dataset example
large_array = np.random.rand(50000, 20)

# Traditional pandas backend
pandas_df = dw.wrangle(large_array)  # Default

# High-performance Polars backend (2-100x faster!)
polars_df = dw.wrangle(large_array, backend='polars')

# Set global preference
from datawrangler.core.configurator import set_dataframe_backend
set_dataframe_backend('polars')  # All operations now use Polars

Simplified Text Processing

# Before v0.4.0 (verbose)
text_kwargs = {
    'model': {
        'model': 'all-MiniLM-L6-v2',
        'args': [],
        'kwargs': {}
    }
}

# After v0.4.0 (simplified!)
text_kwargs = {'model': 'all-MiniLM-L6-v2'}

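# Example input: any list of strings (these two documents are placeholders)
texts = ['This is the first document.', 'This is the second document.']

# Works with Polars for 3-10x faster text processing
fast_embeddings = dw.wrangle(texts, text_kwargs=text_kwargs, backend='polars')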

🔧 Additional Improvements

- Google Colab Fix: Eliminated installation warning popup
- Cleaner Dependencies: Removed redundant configparser
- Enhanced Documentation: All examples updated for both backends
- API Consistency: Fixed all docstring examples to use public API

📈 When to Use Each Backend

- Use pandas for: Small datasets, complex index operations, maximum ecosystem compatibility
- Use Polars for: Large datasets, performance-critical applications, memory efficiency

🚀 Installation

pip install --upgrade pydata-wrangler

# For full ML capabilities including sentence-transformers
pip install --upgrade "pydata-wrangler[hf]"
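
After upgrading, one way to confirm which release is installed is to query the package metadata. This sketch uses only the standard library; `installed_version` and `version_tuple` are hypothetical helper names, and the version parsing is deliberately simple (it reads only the leading digits of the first three components).

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile
from typing import Optional, Tuple


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def version_tuple(text: str) -> Tuple[int, ...]:
    """Parse 'major.minor.patch' into a tuple, ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


found = installed_version("pydata-wrangler")
if found is None:
    print("pydata-wrangler is not installed")
elif version_tuple(found) >= (0, 4, 0):
    print(f"pydata-wrangler {found}: Polars backend available")
else:
    print(f"pydata-wrangler {found}: upgrade needed for the Polars backend")
```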

🧪 Verified Quality

- ✅ All 45 tests passing
- ✅ Documentation builds successfully
- ✅ Full backward compatibility maintained
- ✅ Comprehensive API examples tested

This release maintains full backward compatibility while delivering the performance improvements of the Polars backend and a simpler text model API.