Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Supporting third-party engines for all map and apply methods #61125

Open
datapythonista opened this issue Mar 15, 2025 · 13 comments
Open

ENH: Supporting third-party engines for all map and apply methods #61125

datapythonista opened this issue Mar 15, 2025 · 13 comments
Labels
Apply Apply, Aggregate, Transform, Map

Comments

@datapythonista
Copy link
Member

datapythonista commented Mar 15, 2025

In #54666 and #61032 we introduce the engine parameter to DataFrame.apply which allows users to run the operation with a third-party engine.

The rest of apply and map methods can also benefit from this.

In a first phase we can do:

  • Series.map
  • Series.apply
  • DataFrame.map

Then we can continue with the transform and group by ones.

@datapythonista datapythonista added the Apply Apply, Aggregate, Transform, Map label Mar 15, 2025
@jbrockmendel
Copy link
Member

Or… not. As in apply, a better api would be bodo.map(…)

@datapythonista
Copy link
Member Author

datapythonista commented Mar 15, 2025

Or… not. As in apply, a better api would be bodo.map(…)

Do you mean a better API for the users? I think method chaining makes it the opposite:

df = (pandas.read_parquet("temperature_series.parquet")
            .assign(temperature=lambda df: df["temperature"].astype(float))

df = (bodo.apply(df, to_celsius)
          .mean())

compared to:

df = (pandas.read_parquet("temperature_series.parquet")
            .assign(temperature=lambda df: df["temperature"].astype(float))
            .apply(df, to_celsius, engine=bodo.jit)
            .mean())

For internal APIs, I think it's also better to have a consistent, clear, well defined and well documented API. For example, this design prevents inconsistencies such as:

my_series.map(func, na_action="ignore")
bodo.map(my_series, func, skip_na=True)
numba.map(my_series, func, process_missing=False)

To me it's a clearer better API for both users, and the developers of the execution engines to provide a unified way to run things. Would you also prefer that we remove the plotting functions in pandas, and let users just use the matplotlib, bokeh, plotly APIs? Seems like the same exact case to me.

Also, you didn't comment on #60668 (comment), but I think the final API should mostly address your concerns on having to maintain more code (will be the opposite if we manage to move the numba engine to the numba codebase). And also minimizes your other concern on users reporting problems to us. Surely we may get some, but I think for most users it will be obvious that if you are importing another module, and passing it as the execution engine for an operation, and you get an exception from that module, is not to pandas who you should report any issue.

@jbrockmendel can you please provide more detailed and constructive feedback on the specific problem you see. As I'm sure you are aware, these short negative comments aren't being helpful in finding the best solution and moving forward with it.

@jbrockmendel
Copy link
Member

jbrockmendel commented Mar 15, 2025

these short negative comments aren't being helpful in finding the best solution and moving forward with it

Short negative comments signal "someone is against this", and are appropriate especially when typing on a tablet. What is not appropriate is self-merging PRs like your apply PR.

The best solution here is to do nothing. Adding keywords that no one will use makes the API more complicated solely for the marketing benefit of bodo. We wouldn't even be considering it if they weren't paying us. I find that corrupt and gross.

@datapythonista
Copy link
Member Author

Thank you for your candid feedback @jbrockmendel, I think we are making progress and getting closer to the root causes of your concerns.

You want to do nothing as adding the engine parameter makes the API more complicated and nobody will use it. But the engine parameter was introduced more than a year and a half ago in #54666, so this doesn't seem to make a lot of sense. I asked you before, is your proposal to remove that parameter? Because the work I've been doing and I'm planning doesn't add more parameters except for consistency with DataFrame.apply. And what it aims is to reduce complexity by leaving out of pandas the JIT compilation with numba, which we already have. Your concerns previously addressed on increased maintenance burden, and JIT errors being reported to pandas should also benefit from this.

So, this seems to be first about not finding this functionality useful. Which I don't think it's also the main concern, as you didn't express it when the engine keyword was introduced. And there are surely use cases, I'm pretty sure Thomas wouldn't implement this in the first place. I see myself using this for my use cases when using apply with a function that is jittable. And even yourself pointed out in previous comments that this could also be used for parallel. And of course Bodo has some use cases from their potential clients to want us to work on this. We can find very specific use cases if that's helpful, but I think this is clearly not the main issue. I'm pretty sure there are many other parts of the pandas API that you don't find useful, and I don't think you put the energy and time of boycotting them as you do with this one.

So I guess all really is about your last point. First, Bodo doesn't pay me. Bodo made a generous contribution to pandas so we could work on things that are mutually beneficial to the project and to them. And I'll be working paid with those funds as much as you do and are planning to do. Not sure what's your thought process to think you'll work in tasks that you consider ethical and in the interest of pandas, and assume I will sale my soul and just represent the interest of Bodo by implementing something that isn't good for pandas.

I do think allowing executors to be implemented as third-party projects is a great addition to pandas. In the same way I thought implementing the plotting backends was a good improvement and I implemented that (as a volunteer, in the pure interest on pandas), and in the same way I thought it was a great idea to allow implementing IO connector as third-party packages and I proposed PDEP-9 to work on that, again as volunteer work. I see the work here as equivalent to those, and in line with what I think pandas needs the most, move complexity elsewhere by implementing internal APIs, and allow innovation to happen in projects more efficient than pandas.

I'm totally fine with your distrust for my ethics, and your hatred for Bodo. You surely have your reasons for that. But we need to find a solution for the disagreement here, and . Would you be happy with the changes here if we don't mention Bodo in the API does of apply et. al.? That's the only of your points that kind of make sense to me.

@jbrockmendel
Copy link
Member

jbrockmendel commented Mar 16, 2025

A wall of text doesn't change the fact that there are zero use cases that are not well-served by bodo.apply(...) (a user who simply must use method chaining can use .pipe). There are zero users who have asked for any of this (unlike with the numba engine). This is purely making the pandas API/code more complicated for the sake of bodo's marketing.

@datapythonista
Copy link
Member Author

I'm still confused, sorry.

What exactly is getting more complicated in the pandas API? The only user facing change will be df.apply(func, engine="numba", engine_kwargs={"nogil": True}) by df.apply(func, engine=numba.jit(nogil=True)). Or is it adding the engine option to Series.map and others?

And why is bodo.apply the way to go, and numba.apply or numba_pandas.apply is not?

What users asked for the numba engine? I don't see any reference in #54666 to an issue, and my guess is that @lithomas1 implemented it as work paid by Quansight. If that's the case, again, I fail to see the double standards in your reasoning.

There is literally zero Bodo specific code in the work being done or proposed. Just a minor reference to bodo next to numba when listing the known engined. We are talking about a refactor that should make our code better, make IMHO clearer to users that jit errors aren't pandas errors, by making the engine parameter generic and not hardcoded in pandas. And it has the extra benefit of allowing Bodo, Blosc2 and probably others in the future to plug to pandas in a standard way. I understand you are not a big of these changes. But I fail to see your strong objection given how small they are, and how regardless of Bodo, they seem to contribute positively to pandas in all ways.

@jbrockmendel
Copy link
Member

But I fail to see your strong objection given how small they are

You're right that my objection on the code is not that strong. If this didn't feel corrupt, I would be -1 on it but not care enough to engage. On that note, I'm going to back off on this. -1 if a vote comes up. Enjoy your weekend.

@datapythonista
Copy link
Member Author

That makes sense. Probably worth getting the new steering committee involved, at least to get some other opinions. @Dr-Irv @WillAyd @mroeschke @rhshadrach @jorisvandenbossche, do you mind sharing your opiniom here, and if appropriate, advice on how to move forward please?

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Mar 18, 2025

If we want to standardize using engine for these types of things (bodo, numba, or some future engine), and because it affects different parts of the pandas API, I think a PDEP might be warranted. But I could be convinced otherwise.

One concern I have is how we test this type of change that's dependent on an external engine, but maybe we already do that for numba anyway??

Also, changes to the API like this are really a core-team decision - the steering committee doesn't necessarily need to get involved.

@datapythonista
Copy link
Member Author

Thanks @Dr-Irv for your feedback Writing a PDEP is not a problem, but personally I think it's not needed. This i not by any means a big change to pandas, it's just adding an optional keywork argument to some methods, with minimal code additions.

For context, this was implemented in a significantly more complex way to DataFrame.apply by Thomaa, and was merged by Matt without an issue, or discussion by anyome else in the team, and nobody cared. Which I think is totally fine, since it's quite a small change.

I personally don't understand the double standards here, and I don't think this work would get more attention than all the PRs we constantly merged, if it was not for Brock's concerns that this is "corrupt" since a company gave pandas funds, and this is one of the features they are interested on.

So, personally, I don't think we need a PDEP to discuss and approve the techincal details here, which are trivial amd has already been merged without much discussion in the past. And I don't think a discussion about morality and corruption with the whole pandas team will be helpful in any way.

So, I think just an opinion or recommendation from the steering committee would be more helpful.

@WillAyd
Copy link
Member

WillAyd commented Mar 19, 2025

PDEP-1 states:

Adding new methods or parameters to an existing method typically will not require a PDEP for non-core features.

So by that definition I don't think the proposal here warrants a PDEP

If the objection here is not about the technical aspects but on the nature of funding, then it would be helpful to understand why we think this is corrupt. I don't think that the fact that this is funded and partially serves the desires of the funder is enough to make that claim.

Marc already cited the Quansight work as an example, but many team members have been employed at Microsoft, Anaconda, Nvidia, Intel, etc... over the years. I don't recall any of those engagements having this level of scrutiny.

@mroeschke
Copy link
Member

I also do not believe that this decision requires a PDEP.

I understand Brock's perspective and do agree that the motivation for this feature is slightly different than team members who have been paid to maintain & develop pandas. The difference, as I and maybe Brock sees it, is that the development of this feature is directly self-serving to a technology created by the funding source, a for-profit company I assume, especially if it had been implemented like in #60622.

However, I believe Marc's implementation, #61032, struck a good balance of making this feature agnostic to Bodo by allowing any developer/organizaiton to accelerate UDF operations in pandas.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Mar 20, 2025

If we want to standardize using engine for these types of things (bodo, numba, or some future engine), and because it affects different parts of the pandas API, I think a PDEP might be warranted. But I could be convinced otherwise.

Based on comments above, I'm convinced otherwise that a PDEP isn't needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map
Projects
None yet
Development

No branches or pull requests

5 participants