-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Supporting third-party engines for all map
and apply
methods
#61125
Comments
Or… not. As in apply, a better api would be bodo.map(…) |
Do you mean a better API for the users? I think method chaining makes it the opposite: df = (pandas.read_parquet("temperature_series.parquet")
.assign(temperature=lambda df: df["temperature"].astype(float))
df = (bodo.apply(df, to_celsius)
.mean()) compared to: df = (pandas.read_parquet("temperature_series.parquet")
.assign(temperature=lambda df: df["temperature"].astype(float))
.apply(df, to_celsius, engine=bodo.jit)
.mean()) For internal APIs, I think it's also better to have a consistent, clear, well defined and well documented API. For example, this design prevents inconsistencies such as: my_series.map(func, na_action="ignore")
bodo.map(my_series, func, skip_na=True)
numba.map(my_series, func, process_missing=False) To me it's a clearer better API for both users, and the developers of the execution engines to provide a unified way to run things. Would you also prefer that we remove the plotting functions in pandas, and let users just use the matplotlib, bokeh, plotly APIs? Seems like the same exact case to me. Also, you didn't comment on #60668 (comment), but I think the final API should mostly address your concerns on having to maintain more code (will be the opposite if we manage to move the numba engine to the numba codebase). And also minimizes your other concern on users reporting problems to us. Surely we may get some, but I think for most users it will be obvious that if you are importing another module, and passing it as the execution engine for an operation, and you get an exception from that module, is not to pandas who you should report any issue. @jbrockmendel can you please provide more detailed and constructive feedback on the specific problem you see. As I'm sure you are aware, these short negative comments aren't being helpful in finding the best solution and moving forward with it. |
Short negative comments signal "someone is against this", and are appropriate especially when typing on a tablet. What is not appropriate is self-merging PRs like your apply PR. The best solution here is to do nothing. Adding keywords that no one will use makes the API more complicated solely for the marketing benefit of bodo. We wouldn't even be considering it if they weren't paying us. I find that corrupt and gross. |
Thank you for your candid feedback @jbrockmendel, I think we are making progress and getting closer to the root causes of your concerns. You want to do nothing as adding the So, this seems to be first about not finding this functionality useful. Which I don't think it's also the main concern, as you didn't express it when the So I guess all really is about your last point. First, Bodo doesn't pay me. Bodo made a generous contribution to pandas so we could work on things that are mutually beneficial to the project and to them. And I'll be working paid with those funds as much as you do and are planning to do. Not sure what's your thought process to think you'll work in tasks that you consider ethical and in the interest of pandas, and assume I will sale my soul and just represent the interest of Bodo by implementing something that isn't good for pandas. I do think allowing executors to be implemented as third-party projects is a great addition to pandas. In the same way I thought implementing the plotting backends was a good improvement and I implemented that (as a volunteer, in the pure interest on pandas), and in the same way I thought it was a great idea to allow implementing IO connector as third-party packages and I proposed PDEP-9 to work on that, again as volunteer work. I see the work here as equivalent to those, and in line with what I think pandas needs the most, move complexity elsewhere by implementing internal APIs, and allow innovation to happen in projects more efficient than pandas. I'm totally fine with your distrust for my ethics, and your hatred for Bodo. You surely have your reasons for that. But we need to find a solution for the disagreement here, and . Would you be happy with the changes here if we don't mention Bodo in the API does of |
A wall of text doesn't change the fact that there are zero use cases that are not well-served by |
I'm still confused, sorry. What exactly is getting more complicated in the pandas API? The only user facing change will be And why is What users asked for the numba engine? I don't see any reference in #54666 to an issue, and my guess is that @lithomas1 implemented it as work paid by Quansight. If that's the case, again, I fail to see the double standards in your reasoning. There is literally zero Bodo specific code in the work being done or proposed. Just a minor reference to bodo next to numba when listing the known engined. We are talking about a refactor that should make our code better, make IMHO clearer to users that jit errors aren't pandas errors, by making the |
You're right that my objection on the code is not that strong. If this didn't feel corrupt, I would be -1 on it but not care enough to engage. On that note, I'm going to back off on this. -1 if a vote comes up. Enjoy your weekend. |
That makes sense. Probably worth getting the new steering committee involved, at least to get some other opinions. @Dr-Irv @WillAyd @mroeschke @rhshadrach @jorisvandenbossche, do you mind sharing your opiniom here, and if appropriate, advice on how to move forward please? |
If we want to standardize using One concern I have is how we test this type of change that's dependent on an external engine, but maybe we already do that for Also, changes to the API like this are really a core-team decision - the steering committee doesn't necessarily need to get involved. |
Thanks @Dr-Irv for your feedback Writing a PDEP is not a problem, but personally I think it's not needed. This i not by any means a big change to pandas, it's just adding an optional keywork argument to some methods, with minimal code additions. For context, this was implemented in a significantly more complex way to I personally don't understand the double standards here, and I don't think this work would get more attention than all the PRs we constantly merged, if it was not for Brock's concerns that this is "corrupt" since a company gave pandas funds, and this is one of the features they are interested on. So, personally, I don't think we need a PDEP to discuss and approve the techincal details here, which are trivial amd has already been merged without much discussion in the past. And I don't think a discussion about morality and corruption with the whole pandas team will be helpful in any way. So, I think just an opinion or recommendation from the steering committee would be more helpful. |
PDEP-1 states:
So by that definition I don't think the proposal here warrants a PDEP If the objection here is not about the technical aspects but on the nature of funding, then it would be helpful to understand why we think this is corrupt. I don't think that the fact that this is funded and partially serves the desires of the funder is enough to make that claim. Marc already cited the Quansight work as an example, but many team members have been employed at Microsoft, Anaconda, Nvidia, Intel, etc... over the years. I don't recall any of those engagements having this level of scrutiny. |
I also do not believe that this decision requires a PDEP. I understand Brock's perspective and do agree that the motivation for this feature is slightly different than team members who have been paid to maintain & develop pandas. The difference, as I and maybe Brock sees it, is that the development of this feature is directly self-serving to a technology created by the funding source, a for-profit company I assume, especially if it had been implemented like in #60622. However, I believe Marc's implementation, #61032, struck a good balance of making this feature agnostic to Bodo by allowing any developer/organizaiton to accelerate UDF operations in pandas. |
Based on comments above, I'm convinced otherwise that a PDEP isn't needed. |
In #54666 and #61032 we introduce the
engine
parameter toDataFrame.apply
which allows users to run the operation with a third-party engine.The rest of
apply
andmap
methods can also benefit from this.In a first phase we can do:
Series.map
Series.apply
DataFrame.map
Then we can continue with the transform and group by ones.
The text was updated successfully, but these errors were encountered: