@WilliamJudge94
Contributor

@WilliamJudge94 WilliamJudge94 commented Sep 10, 2025

This PR modernizes the project's dependencies and expands Python version support, addressing compatibility issues and updating deprecated libraries.

Changes Made

Python Version Support

  • FEAT: adding python 3.13 to tests - Added Python 3.13 to CI/CD test matrix for future compatibility

Dependency Modernization

  • FEAT: updating imghdr to use filetype instead - Replaced deprecated imghdr with modern filetype library
  • BUG: removing old imghdr tests - Cleaned up obsolete test cases for removed dependency
  • FEAT: using minify-html - Replaced deprecated htmlmin with modern minify-html library
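
For readers making the same imghdr-to-filetype swap, a minimal sketch follows. It assumes the `filetype` package's `guess()` function, which returns a match object with `extension` and `mime` attributes, or `None` when nothing matches. The helper name `detect_image_extension` and the small fallback signature table are hypothetical and are not taken from this PR; the fallback exists only so the snippet runs without the dependency installed. One behavioural difference worth checking: `imghdr.what()` reported JPEG files as `jpeg`, while `filetype` reports the extension `jpg`.

```python
from typing import Optional

try:
    import filetype  # third-party: pip install filetype
except ImportError:  # keep the sketch runnable without the dependency
    filetype = None

# Illustrative fallback only: a few well-known magic numbers.
_FALLBACK_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"GIF8": "gif",
}


def detect_image_extension(data: bytes) -> Optional[str]:
    """Return an image extension such as 'png' or 'jpg', or None."""
    if filetype is not None:
        kind = filetype.guess(data)
        if kind is not None and kind.mime.startswith("image/"):
            return kind.extension
        return None
    for signature, extension in _FALLBACK_SIGNATURES.items():
        if data.startswith(signature):
            return extension
    return None


# Synthetic headers: real magic numbers followed by zero padding.
png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 24
jpeg_header = b"\xff\xd8\xff\xe0" + b"\x00" * 28
print(detect_image_extension(png_header))
print(detect_image_extension(jpeg_header))
```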

HTML Processing Improvements

  • BUG: use minify-html instead of htmlmin - Migrated from htmlmin to more efficient minify-html
  • BUG: kwargs difference from minify-html - Fixed parameter compatibility issues with new HTML minifier
  • BUG: change html search param for valid output - Updated HTML parameter handling for correct output
  • BUG: updating html checks due to minify-html - Adjusted HTML validation logic for new minification library
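
The keyword-argument mismatch mentioned above can be illustrated with a small adapter. This is a sketch under assumptions, not the code from this PR: `minify_report_html` is a hypothetical helper, the mapping from htmlmin's `remove_comments` flag to minify-html's `keep_comments` is one plausible translation, and the regex fallback exists only so the snippet runs when `minify-html` is not installed.

```python
import re

try:
    import minify_html  # third-party: pip install minify-html
except ImportError:  # keep the sketch runnable without the dependency
    minify_html = None


def minify_report_html(html: str, remove_comments: bool = True) -> str:
    """Minify HTML, accepting an htmlmin-style `remove_comments` flag."""
    if minify_html is not None:
        # htmlmin: remove_comments=True  <->  minify-html: keep_comments=False
        return minify_html.minify(html, keep_comments=not remove_comments)
    # Crude stand-in (assumption): collapse whitespace runs only.
    return re.sub(r"\s+", " ", html).strip()


page = "<div>\n    <p>  Hello   world  </p>\n</div>\n"
minified = minify_report_html(page)
print(len(page), "->", len(minified))
```

minify-html may also drop optional closing tags and attribute quotes, so checks that search the output for exact markup strings can need adjusting after the switch.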

Testing Strategy

  • Verify Python 3.13 compatibility across all test suites
  • Confirm HTML processing works correctly with minify-html
  • Validate filetype library functionality replaces imghdr completely

Impact

  • Compatibility: Extends Python support to include 3.13
  • Performance: Improved HTML minification with more efficient library
  • Maintenance: Removes deprecated dependencies (imghdr, htmlmin)

Breaking Changes

None - all changes maintain backward compatibility while modernizing the underlying dependencies.

@WilliamJudge94
Contributor Author

WilliamJudge94 commented Sep 17, 2025

@fabclmnt I apologize for not running pre-commit beforehand. The commit messages and pre-commit checks should now pass with the updates in 778d0aa and 712732a.

@WilliamJudge94
Contributor Author

WilliamJudge94 commented Sep 17, 2025

@fabclmnt, based on the latest test results, there seem to be a couple of remaining issues:

  1. typing.io does not exist anymore in Python 3.13. This affects the PySpark module. I can work towards updating the PySpark version for ydata-profiling.

  2. It appears as though a test data download link is broken. How would you like me to proceed? https://github.com/ydataai/ydata-profiling/actions/runs/17631150209/job/50623808947#step:9:3659
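
For context on the first issue: the long-deprecated `typing.io` and `typing.re` submodules were removed in Python 3.13, while the same names remain importable from `typing` itself. A minimal sketch of the portable import, with `read_header` as a hypothetical function used only for illustration:

```python
import io
from typing import BinaryIO  # available from `typing` on all supported versions

# The old spelling fails on Python 3.13+ with ModuleNotFoundError,
# which is a subclass of ImportError:
#     from typing.io import BinaryIO


def read_header(stream: BinaryIO, size: int = 4) -> bytes:
    """Read the first `size` bytes from a binary stream."""
    return stream.read(size)


print(read_header(io.BytesIO(b"abcdef")))
```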

@fabclmnt
Collaborator

Hi @WilliamJudge94,

Thanks for your contribution!

  1. Yes, let’s go ahead and update the PySpark version and see if that resolves the issue. Since we need to support PySpark 4.0.0, some code adjustments may be required.

  2. NASA has updated its dataset API. The current URL should be replaced with:
    https://data.nasa.gov/docs/legacy/meteorite_landings/Meteorite_Landings.csv
    instead of:
    https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD
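
One way to keep such a change in a single place is a small lookup from retired URLs to their replacements. The sketch below uses only the two URLs from this comment; `RETIRED_URLS` and `resolve_dataset_url` are hypothetical names, not part of ydata-profiling.

```python
# Maps a retired dataset URL to its current location (hypothetical helper).
RETIRED_URLS = {
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD":
        "https://data.nasa.gov/docs/legacy/meteorite_landings/Meteorite_Landings.csv",
}


def resolve_dataset_url(url: str) -> str:
    """Return the replacement for a retired URL, or the URL unchanged."""
    return RETIRED_URLS.get(url, url)


old = "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD"
print(resolve_dataset_url(old))
```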

Really excited about this PR, thanks for pushing this forward!

@WilliamJudge94
Contributor Author

  1. Updated PySpark to version ≥ 4.0. I couldn’t run the PySpark tests locally yet, but I’ll debug and add support for running them inside the .devcontainer.
  2. Updated all URLs.
  3. All tests pass locally, except for the PySpark tests that haven’t been run.

@WilliamJudge94
Contributor Author

@fabclmnt

  1. I got the .devcontainer working with Spark—the issue was that Java wasn’t installed. Since .devcontainer is in your .gitignore, I didn’t push the changes. If you’d like the container to auto-install Java, you can add this feature:
"features": {
  "jupyterlab": "latest",
  "ghcr.io/devcontainers/features/java:1": {
    "version": "17"
  }
}
  2. I updated the Spark config to disable ANSI mode. In PySpark 4.0, ANSI mode is enabled by default, which makes “silent” conversions (e.g., NaN → decimal) raise errors. Disabling it allows these conversions to pass, and this change makes the Spark backend tests succeed.
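
A minimal sketch of the ANSI-mode setting described above, assuming a local test session. `spark.sql.ansi.enabled` is the Spark SQL option in question; `TEST_SPARK_CONF` and `build_test_session` are hypothetical names, and the PySpark import is deferred so the snippet can be loaded without Spark or Java installed.

```python
TEST_SPARK_CONF = {
    # ANSI mode is on by default in Spark 4.0; turn it off for the tests.
    "spark.sql.ansi.enabled": "false",
}


def build_test_session(app_name: str = "profiling-tests"):
    """Create a local SparkSession with the test configuration applied."""
    # Imported lazily: requires pyspark and a Java runtime.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[1]").appName(app_name)
    for key, value in TEST_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_test_session()` in a test fixture would apply the setting to every query in that session; whether the project sets it this way or through another configuration path is not shown in this thread.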

@emcek

emcek commented Sep 19, 2025

@fabclmnt Looking forward to this! Are there any plans for a new ydata-profiling release once this PR is merged?

@fabclmnt
Collaborator

@WilliamJudge94 thank you for your contribution!
Already approved the workflows to run again.

@emcek If everything is OK with the tests, we will be releasing these changes by Monday (Sep 22nd).

Collaborator

@fabclmnt fabclmnt left a comment

🚀

@emcek

emcek commented Sep 19, 2025

@WilliamJudge94 Some tests are failing (a timeout connecting to nasa.gov and an ImportError involving 'typing')

@fabclmnt
Collaborator

I updated the develop branch with the required fixes. We will be moving forward with the merge of this PR.

@WilliamJudge94 Thank you for your contribution 🚀

@fabclmnt fabclmnt merged commit de97bd4 into ydataai:develop Sep 19, 2025
15 checks passed
@WilliamJudge94
Contributor Author

@fabclmnt You are very welcome! Thank you so much for the merge approval. Looking forward to the next release!

@fabclmnt
Collaborator

@WilliamJudge94 and @emcek this is now released.

ydata-profiling 4.17.0
