Releases: aws/aws-sdk-pandas
AWS SDK for pandas 2.19.0
Noteworthy
- Glue Data Quality is now supported; check out the tutorial 🔥
- Delta lake support by @fvaleye
- New DynamoDB read_items method by @a-slice-of-py
Features & enhancements
- feat: add read_items to dynamodb module by @a-slice-of-py in #1877
- Add deltalake support in AWS S3 with Pandas by @fvaleye in #1834
- support for pagination for timestream.list_databases list_tables by @cnfait in #1846
- (feat) glue data quality by @kukushking in #1861
- Add unit test for evaluating two rulesets at once by @LeonLuttenberger in #1871
- (enhancement) Minor - wr.redshift.copy - pass through commit_transaction by @kukushking in #1878
- (enhancement): Extend get and update ruleset DQ methods by @jaidisido in #1882
- enhancement: Adding filter to quicksight delete_all methods by @malachi-constant in #1913
- enhancement: Support optional measure_name in wr.timestream.write() by @malachi-constant in #1925
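To illustrate the pagination item above (#1846), here is a minimal sketch of following NextToken until the service stops returning one. It is not the library's implementation: list_all_databases and StubClient are hypothetical names, and the response shape (a "Databases" list plus an optional "NextToken") is assumed to match the boto3 timestream-write client.

```python
from typing import Any, Dict, List, Optional


def list_all_databases(client: Any) -> List[str]:
    """Collect Timestream database names across all pages.

    `client` is assumed to behave like a boto3 "timestream-write" client.
    """
    names: List[str] = []
    next_token: Optional[str] = None
    while True:
        # Only send NextToken once the service has handed one back.
        kwargs: Dict[str, Any] = {"NextToken": next_token} if next_token else {}
        response = client.list_databases(**kwargs)
        names.extend(db["DatabaseName"] for db in response.get("Databases", []))
        next_token = response.get("NextToken")
        if not next_token:
            return names


class StubClient:
    """Hypothetical stand-in that serves two pages, for illustration only."""

    _pages: Dict[Optional[str], Dict[str, Any]] = {
        None: {"Databases": [{"DatabaseName": "db_a"}], "NextToken": "page-2"},
        "page-2": {"Databases": [{"DatabaseName": "db_b"}]},
    }

    def list_databases(self, NextToken: Optional[str] = None) -> Dict[str, Any]:
        return self._pages[NextToken]


print(list_all_databases(StubClient()))
```

With a real client, the stub would be replaced by a boto3 "timestream-write" client; the loop itself stays the same.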
Bug fixes
- (fix) Check if timezone is present in column metadata by @kukushking in #1840
- (fix) Include numpy==1.23.4 && poetry update by @kukushking in #1850
- Fix apply_configs decorator causing function signature to be lost by @LeonLuttenberger in #1858
- forward use_threads to _validate_schemas_from_files by @robert-schmidtke in #1869
- (fix) Minor - KeyError in wr.opensearch.search && cleanup tests by @kukushking in #1879
- (fix): missing timestamp data type in Timestream by @jaidisido in #1881
- Fix the Athena cache unit test errors by @LeonLuttenberger in #1883
- (fix): Handle None in databases data types by @jaidisido in #1892
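The decorator fix listed above (#1858) concerns a general Python pitfall: a wrapping decorator hides the wrapped function's name and signature unless it copies the metadata across. The sketch below shows the usual remedy with functools.wraps. The names apply_defaults and read_table are hypothetical, and this is not necessarily how the library's own apply_configs decorator was fixed.

```python
import functools
import inspect
from typing import Any, Callable


def apply_defaults(function: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical decorator that forwards calls unchanged."""

    @functools.wraps(function)  # copies __name__, __doc__ and sets __wrapped__
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        return function(*args, **kwargs)

    return wrapper


@apply_defaults
def read_table(database: str, table: str, use_threads: bool = True) -> str:
    """Return a qualified table name (placeholder body for the sketch)."""
    return f"{database}.{table}"


# inspect.signature follows __wrapped__, so the original parameters are visible.
print(read_table.__name__)
print(list(inspect.signature(read_table).parameters))
```

Without the functools.wraps line, the reported name would be "wrapper" and the signature would collapse to (*args, **kwargs), which is what breaks IDE hints and generated documentation.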
Documentation
- Document the create_csv_table function's sensitivity to column order by @LeonLuttenberger in #1923
- (docs) Add extension for ipython console highlighting by @kukushking in #1841
- (feat) Minor - add sphinx copy button for code blocks by @kukushking in #1854
Tests
- Test infra: Add NAT gateway IP addresses to base stack SSM parameters by @LeonLuttenberger in #1847
- Testing: Update Opensearch test output and fixture by @malachi-constant in #1848
- (test-infra) Enable SSE, enforce HTTPS, enable node-to-node encryption by @kukushking in #1851
- (tests) add workaround to enable deltalake to use AWS profile creds by @kukushking in #1934
- Enable warn_unused_ignores for MyPy by @LeonLuttenberger in #1860
- Increase coverage for dynamodb write by @LeonLuttenberger in #1893
- Add tests for S3 wait functions by @LeonLuttenberger in #1896
- Increase coverage for s3.delete* by @LeonLuttenberger in #1897
- Increase S3 tests coverage by @jaidisido in #1909
- Add coverage report to tox by @LeonLuttenberger in #1874
- Add coverage section to pyproject by @jaidisido in #1911
- Deps: Update wheel 0.37.1 -> 0.38.1 by @malachi-constant in #1904
- Add minimum coverage by @LeonLuttenberger in #1927
- refactor: quicksight test resources as fixtures by @malachi-constant in #1928
New Contributors
- @fvaleye made their first contribution in #1834
- @robert-schmidtke made their first contribution in #1869
- @a-slice-of-py made their first contribution in #1877
Thanks
We thank the following contributors/users for their work on this release:
@jaidisido, @kukushking, @LeonLuttenberger, @cnfait, @malachi-constant, @mdavis-xyz, @dydc, @enricomarchesin
Full Changelog: 2.18.0...2.19.0
AWS SDK for pandas 2.18.0
Noteworthy
- Pyarrow 10 support 🔥 by @kukushking in #1731
- Lambda layers now available in af-south-1 (Cape Town) 🌍 by @malachi-constant
Features & enhancements
- Add unload_approach to athena.read_sql_table by @jaidisido in #1634
- Pass additional partition projection params to wr.s3.to_parquet & cat… by @kukushking in #1627
- Regenerate poetry.lock with no update by @cnfait in #1663
- Upgrading poetry installed in workflow by @cnfait in #1677
- Improve bucketing series generation by casting only the required columns by @kukushking in #1664
- Add get_query_executions generating DataFrames from Athena query executions detail by @KhueNgocDang in #1676
- Dependency: Set Pandas Version != 1.5.0 due to memory leak by @malachi-constant in #1688
- read_csv: read file as binary when encoding_errors is set to ignore by @cnfait in #1723
- Deps: Remove upper bound limit on 'python' version by @malachi-constant in #1720
- (enhancement) Redshift: Adding 'primary_keys' to parameter validation by @malachi-constant in #1728
- Add describe_log_streams and filter_log_events to the CloudWatch module by @KhueNgocDang in #1785
- Update lambda layers with pyarrow 10 by @kukushking in #1758
- Add ctas_write_compression argument to athena.read_sql_query by @LeonLuttenberger in #1795
- Add auto termination policy to EMR by @vikramsg in #1818
- timestream.query: add QueryId and NextToken to df attributes by @cnfait in #1821
- Add support for boto3 kwargs to timestream.create_table by @cnfait in #1819
- Adding args to submit spark step by @vikramsg in #1826
Bug fixes
- Fix athena.read_sql_query for empty table and chunk size not returning an empty frame generator by @LeonLuttenberger in #1685
- Fixing index column validation in s3.read_parquet() validate schema by @malachi-constant in #1735
- Bug: Replace extra_registries with extra_public_registries by @vikramsg in #1757
- Fix: map datatype issue of athena by @pal0064 in #1753
- Fix Redshift commands breaking with hyphenated table names by @LeonLuttenberger in #1762
- Add correct service names for timestream boto3 clients by @malachi-constant in #1716
- Allow read partitions with extra = in the value by @kukushking in #1779
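As an illustration of the last item above (#1779), Hive-style partition segments of the form key=value can be parsed safely by splitting on the first "=" only, so a value that itself contains "=" survives intact. This is a minimal sketch under that assumption; parse_partitions is a hypothetical helper, not the library's internal function.

```python
from typing import Dict


def parse_partitions(path: str) -> Dict[str, str]:
    """Extract key=value partition segments from an object path."""
    partitions: Dict[str, str] = {}
    # Drop the final segment (the object name) and inspect the directories.
    prefix = path.rsplit("/", 1)[0]
    for segment in prefix.split("/"):
        if "=" in segment:
            # maxsplit=1 keeps any further "=" characters inside the value.
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


print(parse_partitions("s3://bucket/table/filter=a=b/year=2022/file.parquet"))
```

A plain segment.split("=") would return three pieces for filter=a=b and break tuple unpacking, which is the kind of failure this item describes.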
Documentation
- Update install page in docs with screenshot of new managed layer name by @LeonLuttenberger in #1636
- Remove semicolon from python code eol in s3 tutorial by @cnfait in #1673
- Consistent kernel for jupyter notebooks by @cnfait in #1674
- Correct a few typos in our ipynb tutorials by @cnfait in #1694
- Fix broken links in readme by @lucasasmith in #1702
- Typos in comments and docs by @mycaule in #1761
Tests
- Support for test infrastructure in private subnets by @cnfait in #1698
- Upgrade engine versions to match defaults from aws console by @cnfait in #1709
- Set redshift and Neptune clusters removal policy to destroy by @cnfait in #1675
- Upgrade pytest-xdist by @LeonLuttenberger in #1760
- Fix timestream endpoint tests by @LeonLuttenberger in #1781
New Contributors
- @lucasasmith made their first contribution in #1702
- @vikramsg made their first contribution in #1757
- @mycaule made their first contribution in #1761
- @pal0064 made their first contribution in #1753
Thanks
We thank the following contributors/users for their work on this release:
@lucasasmith, @vikramsg, @mycaule, @pal0064, @LeonLuttenberger, @cnfait, @malachi-constant, @kukushking, @jaidisido
Full Changelog: 2.17.0...2.18.0
3.0.0rc2
What's Changed
- (enhancement): Enable missing unit tests and Redshift, Athena, LF load tests by @jaidisido in #1736
- (enhancement): configure scheduling options, remove dependencies on internal ray impl by @kukushking in #1734
- (testing): Enable Athena and Redshift tests, and address errors by @LeonLuttenberger in #1721
- (feat): Make tqdm progress reporting opt-in by @kukushking in #1741
Full Changelog: 3.0.0rc1...3.0.0rc2
3.0.0rc1
What's Changed
- (enhancement): Move RayLogger out of non-distributed modules by @jaidisido in #1686
- (perf): Distribute data types inference by @jaidisido in #1692
- (docs): Update config tutorial to include new configuration values by @LeonLuttenberger in #1696
- (fix): partition block overwriting by @kukushking in #1695
- (refactor): Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in #1699
- (docs): Improve documentation on running SDK for pandas at scale by @jaidisido in #1697
- (enhancement): Apply modin repartitioning where required only by @jaidisido in #1701
- (enhancement): Remove local from ray.init call by @jaidisido in #1708
- (feat): Validate partitions along row axis, add warning by @kukushking in #1700
- (feat): Expand SQL formatter to LakeFormation by @LeonLuttenberger in #1684
- (feat): Distribute parquet datasource and add missing features, enable all tests by @kukushking in #1711
- (convention): Add Arrow prefix to parquet datasource for consistency by @jaidisido in #1724
- (perf): Distribute Timestream write with executor by @jaidisido in #1715
Full Changelog: 3.0.0b3...3.0.0rc1
3.0.0b3
What's Changed
- (feat): Add partitioning on block level by @kukushking in #1653
- (refactor): Make room for additional distributed engines by @jaidisido in #1646
- (feat): Distribute s3 write text by @LeonLuttenberger in #1631
- (docs): Add "Introduction to Ray" Tutorial by @LeonLuttenberger in #1661
- (fix): Return address config param by @kukushking in #1660
- (refactor): Enable new engines with custom dispatching and other constructs by @jaidisido in #1666
- (deps): Uptick modin to 0.16 by @jaidisido in #1659
Full Changelog: 3.0.0b2...3.0.0b3
3.0.0b2
What's Changed
- (feat) Update to Ray 2.0 by @kukushking in #1635
- (feat) Ray logging by @malachi-constant in #1623
- (enhancement): Reduce LOC in S3 write methods create_table by @jaidisido in #1626
- (docs) Tutorial: Run SDK for pandas job on ray cluster by @malachi-constant in #1616
Full Changelog: 3.0.0b1...3.0.0b2
3.0.0b1
What's Changed
- (test) Consolidate unit and load tests by @jaidisido in #1525
- (feat) Distribute S3 read text by @LeonLuttenberger in #1567
- (feat) Distribute s3 wait_objects by @LeonLuttenberger in #1539
- (test) Ray Load Tests CDK Stack and Instructions for Load Testing by @malachi-constant in #1583
- (fix) Fix S3 read text with version ID was not working by @LeonLuttenberger in #1587
- (feat) Add distributed s3 write parquet by @kukushking in #1526
- (fix) Distribute write text regression, change to singledispatch, add repartitioning utility by @kukushking in #1611
- (enhancement) Optimise distributed s3.read_text to load data in chunks by @LeonLuttenberger in #1607
Full Changelog: 3.0.0a2...3.0.0b1
AWS SDK for pandas 2.17.0
New Functionalities
- RedshiftDataAPI serverless support 🔥 #1530
  - Check out the tutorial
- Add get_query_results to the Athena module #1496
  - Check out the function documentation
- Add generate_create_query to the Athena module #1514
  - Check out the function documentation
Enhancements
- Returning empty DataFrame for empty TimeStream query #1430
- Added support for INSERT IGNORE for mysql.to_sql #1429
- Added use_column_names to redshift.copy akin to redshift.to_sql #1437
- Enable passing kwargs to redshift.connect #1467
- Add timestream_endpoint_url property to the config #1483
- Add support for upserting to an empty Glue table #1579
Documentation
- Fix typos in documentation #1434
Bug Fix
- validate_schema=True for wr.s3.read_parquet breaks with partition columns and dataset=True #1426
- wr.neptune.to_property_graph failing for Neptune version 1.1.1.0 #1407
- ValueError when using opensearch.index_df with documents with an array field #1444
- Missing catalog_id in wr.catalog.create_database #1480
- Check for pair of brackets in query preparation for Athena cache #1529
- Fix wrong type hint for TagColumnOperation in quicksight.create_athena_dataset #1570
- s3.to_json compression parameter is passed twice when dataset=True #1585
- Cast Athena array, map & struct types to pandas object #1581
- In the OpenSearch module, use SSL only for HTTPS (port 443) #1603
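The OpenSearch item above (#1603) can be pictured as deriving the TLS setting from the port instead of hard-coding it. The sketch below is an assumption about the general idea, not the library's code: connection_kwargs is a hypothetical helper, and the key names are assumed to mirror the keyword arguments accepted by the opensearch-py client constructor.

```python
from typing import Any, Dict


def connection_kwargs(host: str, port: int = 443) -> Dict[str, Any]:
    """Build keyword arguments for an OpenSearch-style client."""
    # Only the HTTPS port enables TLS in this sketch.
    use_ssl = port == 443
    return {
        "hosts": [{"host": host, "port": port}],
        "use_ssl": use_ssl,
        "verify_certs": use_ssl,
    }


print(connection_kwargs("localhost", 9200))
print(connection_kwargs("search.example.com"))
```

This lets a local, non-TLS endpoint (for example a development container on port 9200) be reached without a certificate error, while the default port keeps TLS on.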
Noteworthy
AWS Lambda Managed Layers
Since the last release, the library has been accepted as an official SDK for AWS and rebranded as AWS SDK for pandas 🚀. The module names in Python will remain the same. One noteworthy change, however, is that the AWS Lambda Managed Layer has been renamed from AWSDataWrangler to AWSSDKPandas.
You can view the ARN value for the layers here.
PyArrow 7 Support
For platforms without PyArrow 7 support (e.g. MWAA, EMR, Glue PySpark Job):
pip install pyarrow==2 awswrangler
Thanks
We thank the following contributors/users for their work on this release:
@bechbd, @maxispeicher, @timgates42, @aeeladawy, @KhueNgocDang, @szemek, @malachi-constant, @cnfait, @jaidisido, @LeonLuttenberger, @kukushking
3.0.0a2
This is a pre-release for the Wrangler@Scale project
What's Changed
- (feat): Add directory for Distributed Wrangler Load Tests by @malachi-constant in #1464
- (CI): Distribute tests in tox config by @malachi-constant in #1469
- (feat): Distribute s3 delete objects by @malachi-constant in #1474
- (CI): Enable new CI pipeline for standard & distributed tests by @malachi-constant in #1481
- (feat): Refactor to distribute s3.read_parquet by @jaidisido in #1513
- (bug): s3 delete tests failing in distributed codebase by @malachi-constant in #1517
Full Changelog: 3.0.0a1...3.0.0a2
3.0.0a1
This is a pre-release for the Wrangler@Scale project
What's Changed
- (feat): Add distributed config flag and initialise method by @jaidisido in #1389
- (feat): Add distributed Lake Formation read by @jaidisido in #1397
- (feat): Distribute S3 select over multiple paths and scan ranges by @jaidisido in #1445
- (refactor): Refactor threading/ray; add single-path distributed s3 select impl by @kukushking in #1446
Full Changelog: 2.16.1...3.0.0a1