[DSIP-78][Data Quality] Suggest remove data quality module #16728

Open
2 tasks done
Tracked by #14102
SbloodyS opened this issue Oct 23, 2024 · 24 comments · May be fixed by #16794

@SbloodyS
Member

SbloodyS commented Oct 23, 2024

Search before asking

  • I have searched the existing DSIPs and found no similar one.

Motivation

The current data quality task type is barely usable. Since version 3.2.0 it has not worked properly, which effectively leaves it abandoned.

I searched the issue list and found that the data quality task has many open bug reports that no one is maintaining, as well as many CVEs.

Most importantly, data quality is tightly coupled to the current code base. As a result, dependencies can't be optimized, the binary package size can't be reduced, and the code maintenance cost is extremely high.

So I suggest removing it.

Design Detail

No response

Compatibility, Deprecation, and Migration Plan

No response

Test Plan

No response

@ruanwenjun
Member

+1. If we don't remove this, we should rewrite it; it is currently blocking the integration of the code base.

@zhongjiajie
Member

zhongjiajie commented Oct 23, 2024

I generally agree with it. Data quality is an important module of our project: we not only have a data quality task type but also a data quality result sub-module in the UI. If we want to remove it, we should earn enough +1 votes in our community.

@davidzollo
Contributor

Yes, I think this operation should earn enough +1 votes.

> I generally agree with it. Data quality is an important module of our project: we not only have a data quality task type but also a data quality result sub-module in the UI. If we want to remove it, we should earn enough +1 votes in our community.

@zixi0825
Member

+1

@BruceWong96
Contributor

+1. I think data quality should be split out of DS and maintained as an independent plug-in.

@Kinsonx

Kinsonx commented Oct 24, 2024

+1. The current data quality task type is barely used in our team.

@maikouliujian

+1

@SbloodyS SbloodyS changed the title [DSIP-][Data Quality] Suggest remove data quality module [DSIP-78][Data Quality] Suggest remove data quality module Oct 24, 2024
@qingwli
Member

qingwli commented Oct 24, 2024

-1 from me.

We use data quality in our scenario; I can maintain or rewrite this module.

@SbloodyS
Member Author

> -1 from me.
>
> We use data quality in our scenario; I can maintain or rewrite this module.

You can create a new DSIP issue and put the full design details in it for discussion.

@davidzollo
Contributor

Reminder: this is not a vote yet; it is currently a discussion. Please give your opinions in as much detail as possible, which will help the community make a better choice. By the way, voting should take place on the dev mailing list.

@fuchanghai
Member

+1. In fact, some things can be added as plug-ins, and DS should focus on its core functions. If DS wants to build an ecosystem but does what other projects are already good at, and does it worse than they do, DS will be criticized by users. Moreover, other projects will not cooperate with DS because of the overlapping functions. This is not a good idea.

@William-GuoWei
Contributor

I have a question: users have used this module in their production environments. If we remove the DQ module, how can they upgrade?

@SbloodyS
Member Author

> I have a question: users have used this module in their production environments. If we remove the DQ module, how can they upgrade?

We'll add it to incompatible change docs.

@victorsheng

+1

@lishiyucn
Contributor

+1. I support removing this module. The code quality is poor and there are multiple CVE vulnerabilities, which are fatal problems. In addition, data quality management is closely tied to business data and is not a core strength of our platform.

@lianchaoqi

-1 from me.
I think this is an important part of big data. DolphinScheduler is mainly used by data developers; if you abandon this function, they can only write their own programs to verify the data. But what if the data developers cannot write code?

@raymondchen-byte

+1
A scheduler should focus on scheduling itself. Data quality should be part of the data middleware, and I recommend splitting it out into an independent data quality platform. That platform should support various database engines or popular computing engine gateways, such as Kyuubi, and monitor table quality from the data middleware side. It can then report back to the scheduling task through the DS API whether the task should be blocked, forming a complete data loop.

@sdhzwc
Contributor

sdhzwc commented Oct 24, 2024

-1 from me.
Data quality is a very good function for me. I don't need to integrate third-party open source plug-ins to do this, which reduces the complexity of using, operating, and maintaining the system.

@yangtzelsl

-1 from me.
Data quality is a very good function for me. I don't need to integrate third-party open source plug-ins to do this, which reduces the complexity of using, operating, and maintaining the system.

@kuangye098

+1. It depends too heavily on external components and falls outside DS's own domain. I support removal.

@niyanchun

niyanchun commented Oct 25, 2024

Our team uses data quality, although it's hard to use, so I would prefer a rewrite to a removal.

@ChenShuai1981

ChenShuai1981 commented Oct 25, 2024

-1 from me.

I agree with moving the DQ codebase out of the main dolphinscheduler project into a separate ecosystem project, similar to how Apache Flink CDC and the Flink Kafka connector relate to Apache Flink.

https://github.com/apache/flink-cdc
https://github.com/apache/flink-connector-kafka
https://github.com/apache/flink

The question is NOT whether to vote to remove the DQ module; rather, we should refactor the DQ module to improve its code quality and make it extensible enough to support various DQ tasks.

It cannot be denied that DQ is a good application scenario for dolphinscheduler.

Unfortunately, the current DQ task can NOT pass its result to downstream tasks, so it can NOT fulfil the requirement of automatic data reconciliation.

@344970961

-1. It depends on the Hadoop/Spark frameworks and has too many bugs.

@gaotong521

I think this module should not be removed. Data quality is a very important part of big data, and the current mainstream big data products all have a data quality module. If it is removed, developers may need to write their own programs to complete quality check tasks, which is very painful for them. Our team is using data quality, although it's hard to use, so I hope it will be rewritten instead of deleted.

@SbloodyS SbloodyS self-assigned this Nov 13, 2024
@SbloodyS SbloodyS linked a pull request Nov 13, 2024 that will close this issue