Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DSIP-78][Data Quality] Remove data quality module #16794

Open
wants to merge 7 commits into
base: dev
Choose a base branch
from

Conversation

SbloodyS
Copy link
Member

Purpose of the pull request

close #16728

Brief change log

Verify this pull request

This pull request is code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(or)

Pull Request Notice

Pull Request Notice

If your pull request contain incompatible change, you should also add it to docs/docs/en/guide/upgrede/incompatible.md

Copy link

sonarcloud bot commented Nov 13, 2024

Please retry analysis of this Pull-Request directly on SonarQube Cloud

@davidzollo
Copy link
Contributor

-1,there're no conclusion for now, please don't do this op

@SbloodyS
Copy link
Member Author

-1,there're no conclusion for now, please don't do this op

Based on the discussion of issue #16728, we can draw the following conclusions.

  1. More than half of the maintainers agreed.
  2. Half of the users who disagree don't understand the difference between DS and Hadoop and data-quality. It's meaningless.
  3. The remaining half users want to refactor the data-quality module instead of removing it.

No one among the maintainers on the current community is willing to refactor this module. And data-quality module has seriously blocked the progress of #16098 which is is very important for the next version's release. So I think the conclusion of removal is obvious.

@davidzollo
Copy link
Contributor

-1,there're no conclusion for now, please don't do this op

Based on the discussion of issue #16728, we can draw the following conclusions.

  1. More than half of the maintainers agreed.
  2. Half of the users who disagree don't understand the difference between DS and Hadoop and data-quality. It's meaningless.
  3. The remaining half users want to refactor the data-quality module instead of removing it.

No one among the maintainers on the current community is willing to refactor this module. And data-quality module has seriously blocked the progress of #16098 which is is very important for the next version's release. So I think the conclusion of removal is obvious.

Apache emphasizes achieving consensus. If there are significant differences and we cannot be resolved in the short term, we may choose to temporarily shelve the proposal, allowing much more time for more information before re-discussing.

Considering this is a major decision, I suggest you send a vote email to the dev mailing list.

@SbloodyS
Copy link
Member Author

Apache emphasizes achieving consensus.

First of all, the current consensus does not require all people to reach an agreement, but only more than half. For more info you can take a look at apache/comdev-site#189

If there are significant differences and we cannot be resolved in the short term, we may choose to temporarily shelve the proposal, allowing much more time for more information before re-discussing.

Like I said, no one among the maintainers on the current community is willing to refactor this module. How to achieve this without anyone willing to take responsibility?

Considering this is a major decision, I suggest you send a vote email to the dev mailing list.

I think issue and dev mail list have the same meaning on this issue. And all active PMC/Committer have been included in this issue.

@davidzollo
Copy link
Contributor

Apache emphasizes achieving consensus.

First of all, the current consensus does not require all people to reach an agreement, but only more than half. For more info you can take a look at apache/comdev-site#189

If there are significant differences and we cannot be resolved in the short term, we may choose to temporarily shelve the proposal, allowing much more time for more information before re-discussing.

Like I said, no one among the maintainers on the current community is willing to refactor this module. How to achieve this without anyone willing to take responsibility?

Considering this is a major decision, I suggest you send a vote email to the dev mailing list.

I think issue and dev mail list have the same meaning on this issue. And all active PMC/Committer have been included in this issue.

Github Issue can't instead of mail, especially for big event.
I think we can try to find some contributors who'd like to refactor this module if this function is finally kept.

By the way, i don't disagree with removing this module; I’m just concerned about the significant impact it will have on users. What I see is that many users want to keep this module, while the maintainers are inclined to remove it to make refactoring easier. Therefore, this decision requires great caution. If a vote does not take place in the dev mailing list, then it should happen in private, not in an issue thread.

@SbloodyS
Copy link
Member Author

SbloodyS commented Nov 14, 2024

Github Issue can't instead of mail, especially for big event. I think we can try to find some contributors who'd like to refactor this module if this function is finally kept.

By the way, i don't disagree with removing this module; I’m just concerned about the significant impact it will have on users. What I see is that many users want to keep this module, while the maintainers are inclined to remove it to make refactoring easier. Therefore, this decision requires great caution. If a vote does not take place in the dev mailing list, then it should happen in private, not in an issue thread.

I don't think this module is a big event. This feature was introduced in PR #4830 Since Feb 21, 2021 without any mailing or github issue discussion. For more than three years, apart from this PR author, no contributor has contributed to this function, and there are endless issue of bugs and improvement, and there is no substantial code change in this function. This author also doesn't want to maintain this function and vote +1 for removal.

@davidzollo
Copy link
Contributor

davidzollo commented Nov 14, 2024

At that time, there was no DSIP mechanism in place, and much of the communication took place during community meetings. The feature I’m referring to took 8-9 months to develop and implement. If this is not a major feature, I believe that statement would be inaccurate.

@SbloodyS
Copy link
Member Author

At that time, there was no DSIP mechanism in place, and much of the communication took place during community meetings. The feature I’m referring to took 8-9 months to develop and implement. If this is not a major feature, I believe that statement would be inaccurate.

I don't agree with this opinion. Since dolphinscheduler is focusing the modern data orchestration platform. Data-Quality is focusing accuracy and consistency of data. The relationship between two of them is equivalent to Flink and Flink-CDC. There are many mature examples at present.

Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add ddl to drop the dq table.

@SbloodyS
Copy link
Member Author

We should add ddl to drop the dq table.

We can remove the create table ddl in the init sql and give some drop table ddl to users in docs to let users decide whether or not to execute it instead of execute it by default. WDYT?

@zhongjiajie
Copy link
Member

We should add ddl to drop the dq table.

We can remove the create table ddl in the init sql and give some drop table ddl to users in docs to let users decide whether or not to execute it instead of execute it by default. WDYT?

How about giving entry point script to remove exists table and package it into binary tarball instead of document, cause we support only two of databases in prod, and we should keep thing easy to use

@SbloodyS
Copy link
Member Author

We should add ddl to drop the dq table.

We can remove the create table ddl in the init sql and give some drop table ddl to users in docs to let users decide whether or not to execute it instead of execute it by default. WDYT?

How about giving entry point script to remove exists table and package it into binary tarball instead of document, cause we support only two of databases in prod, and we should keep thing easy to use

I think it will be more traversal for some users to provide users with operating DLLs intuitively since this is a dangerous operation for drop table...

@zhongjiajie
Copy link
Member

For proposal only, it seems inconsistency exists, although some of the PMCs in #16728 already agreed to remove it, David is right and he could challenge it and ask to vote in the dev mailing list. So maybe should vote in dev mail thread, And I think the vote result should be the ONLY result of this PR continue or not instead of personal emotions.
I found out author @zixi0825 also voted +1 in #16728 (comment), and @qingwli, who has maintained data quality for some time, voted -1 for removing it.

And for this feature, I think it should better act as a plugin instead of a built-in function. Especially since it has many bugs and CVEs and not team member want to maintain it. So I personally would vote +1 for removing it.

And BTW, the time cost for the feature development should not as a standard to measure the importance or not.

@zhongjiajie
Copy link
Member

zhongjiajie commented Nov 14, 2024

We should add ddl to drop the dq table.

We can remove the create table ddl in the init sql and give some drop table ddl to users in docs to let users decide whether or not to execute it instead of execute it by default. WDYT?

How about giving entry point script to remove exists table and package it into binary tarball instead of document, cause we support only two of databases in prod, and we should keep thing easy to use

I think it will be more traversal for some users to provide users with operating DLLs intuitively since this is a dangerous operation for drop table...

I mean we should add new bash script like https://github.com/apache/dolphinscheduler/blob/dev/dolphinscheduler-tools/src/main/bin/migrate-lineage.sh which is separated form upgrade-schema.sh, for easy use of end users. But it not important thing, the exists data should be keep and we can provide document to notice our users

@SbloodyS
Copy link
Member Author

I mean we should add new bash script like https://github.com/apache/dolphinscheduler/blob/dev/dolphinscheduler-tools/src/main/bin/migrate-lineage.sh which is separated form upgrade-schema.sh, for easy use of end users. But it not important thing, the exists data should be keep and we can provide document to notice our users

Ok.

@SbloodyS
Copy link
Member Author

For proposal only, it seems inconsistency exists, although some of the PMCs in #16728 already agreed to remove it, David is right and he could challenge it and ask to vote in the dev mailing list. So maybe should vote in dev mail thread, And I think the vote result should be the ONLY result of this PR continue or not instead of personal emotions. I found out author @zixi0825 also voted +1 in #16728 (comment), and @qingwli, who has maintained data quality for some time, voted -1 for removing it.

And for this feature, I think it should better act as a plugin instead of a built-in function. Especially since it has many bugs and CVEs and not team member want to maintain it. So I personally would vote +1 for removing it.

And BTW, the time cost for the feature development should not as a standard to measure the importance or not.

I will raise a vote in dev mail list.

@zhongjiajie
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DSIP-78][Data Quality] Suggest remove data quality module
4 participants