170 changes: 97 additions & 73 deletions docs/source/contributor-guide/howtos.md
@@ -21,60 +21,86 @@

## How to update the version of Rust used in CI tests

- Make a PR to update the [rust-toolchain] file in the root of the repository:
Make a PR to update the [rust-toolchain] file in the root of the repository.

[rust-toolchain]: https://github.com/apache/datafusion/blob/main/rust-toolchain.toml
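
For reference, the file is a standard `rust-toolchain.toml`; the channel value below is only illustrative, set it to the release CI should use:

```toml
[toolchain]
# illustrative version; replace with the Rust release CI should use
channel = "1.85.0"
```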

## How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

- Add the actual implementation of the function to a new module file within:
- [here](https://github.com/apache/datafusion/tree/main/datafusion/functions-nested) for arrays, maps and structs functions
- [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/crypto) for crypto functions
- [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/datetime) for datetime functions
- [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/encoding) for encoding functions
- [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/math) for math functions
- [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/regex) for regex functions
- [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/string) for string functions
- [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/unicode) for unicode functions
- create a new module [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/) for other functions.
- New function modules - for example a `vector` module, should use a [rust feature](https://doc.rust-lang.org/cargo/reference/features.html) (for example `vector_expressions`) to allow DataFusion
users to enable or disable the new module as desired.
- The implementation of the function is done via implementing `ScalarUDFImpl` trait for the function struct.
- See the [advanced_udf.rs] example for an example implementation
- Add tests for the new function
- To connect the implementation of the function add to the mod.rs file:
- a `mod xyz;` where xyz is the new module file
- a call to `make_udf_function!(..);`
- an item in `export_functions!(..);`
- In [sqllogictest/test_files], add new `sqllogictest` integration tests where the function is called through SQL against well known data and returns the expected result.
- Documentation for `sqllogictest` [here](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md)
- Add SQL reference documentation [here](https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md)
- An example of this being done can be seen [here](https://github.com/apache/datafusion/pull/12775)
- Run `./dev/update_function_docs.sh` to update docs

[advanced_udf.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udaf.rs
[datafusion/expr/src]: https://github.com/apache/datafusion/tree/main/datafusion/expr/src
[sqllogictest/test_files]: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files

## How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

- Add the actual implementation of an `Accumulator` and `AggregateExpr`:
- In [datafusion/expr/src], add:
- a new variant to `AggregateFunction`
- a new entry to `FromStr` with the name of the function as called by SQL
- a new line in `return_type` with the expected return type of the function, given an incoming type
- a new line in `signature` with the signature of the function (number and types of its arguments)
- a new line in `create_aggregate_expr` mapping the built-in to the implementation
- tests to the function.
- In [sqllogictest/test_files], add new `sqllogictest` integration tests where the function is called through SQL against well known data and returns the expected result.
- Documentation for `sqllogictest` [here](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md)
- Add SQL reference documentation [here](https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/aggregate_functions.md)
- An example of this being done can be seen [here](https://github.com/apache/datafusion/pull/12775)
- Run `./dev/update_function_docs.sh` to update docs
## Adding new functions

**Implementation**

| Function type | Location to implement | Trait to implement | Macros to use | Example |
| ------------- | ------------------------- | ---------------------------------------------- | ------------------------------------------------ | -------------------- |
| Scalar | [functions][df-functions] | [`ScalarUDFImpl`] | `make_udf_function!()` and `export_functions!()` | [`advanced_udf.rs`] |
| Nested | [functions-nested] | [`ScalarUDFImpl`] | `make_udf_expr_and_func!()` | |
| Aggregate | [functions-aggregate] | [`AggregateUDFImpl`] and an [`Accumulator`] | `make_udaf_expr_and_func!()` | [`advanced_udaf.rs`] |
| Window | [functions-window] | [`WindowUDFImpl`] and a [`PartitionEvaluator`] | `define_udwf_and_expr!()` | [`advanced_udwf.rs`] |
| Table | [functions-table] | [`TableFunctionImpl`] and a [`TableProvider`] | `create_udtf_function!()` | [`simple_udtf.rs`] |

- The macros reduce boilerplate, such as ensuring a DataFrame API compatible expression function is also
  created (a minimal scalar example is sketched below)
- Ensure new functions are properly exported through the subproject's
  `mod.rs` or `lib.rs`
- Functions should preferably provide documentation via the `#[user_doc(...)]` attribute so that it
  can be included in the SQL reference documentation (see the Documentation section below)
- Scalar functions are further grouped into modules for families of functions (e.g. string, math, datetime).
  Functions should be added to the relevant module; if a new module needs to be created, a new [Rust feature]
  should also be added so that DataFusion users can conditionally compile the module as needed
- Aggregate functions can optionally implement a [`GroupsAccumulator`] for better performance
> **Review comment (Contributor):** would be nice to add what exactly needed to implement from GroupsAccumulator to achieve performance
>
> **Reply (Contributor Author):** I prefer to keep this to high level steps; ideally such details would be present in the GroupsAccumulator doc itself in my opinion

Spark-compatible functions are [located in a separate crate][df-spark] but otherwise follow the same steps, though all
function types (e.g. scalar, nested, aggregate) are grouped together in a single location.
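
As a concrete illustration, a minimal (hypothetical) scalar function might look roughly like the sketch below.
The struct name, function name, and helper details are assumptions for illustration only; trait method names such as
`invoke_with_args` have changed across DataFusion versions, and registration boilerplate is normally handled by the
macros listed above, so treat [`advanced_udf.rs`] and the existing functions in the [functions][df-functions] crate as
the authoritative reference.

```rust
use std::any::Any;
use std::sync::Arc;

use arrow::array::{Array, Float64Array};
use arrow::datatypes::DataType;
use datafusion_common::Result;
use datafusion_expr::{
    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
};

/// Hypothetical scalar function that adds 1.0 to a Float64 argument
#[derive(Debug)]
pub struct AddOne {
    signature: Signature,
}

impl AddOne {
    pub fn new() -> Self {
        Self {
            // Exactly one Float64 argument; the result is deterministic
            signature: Signature::exact(vec![DataType::Float64], Volatility::Immutable),
        }
    }
}

impl ScalarUDFImpl for AddOne {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn name(&self) -> &str {
        "add_one"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        Ok(DataType::Float64)
    }

    // Newer DataFusion versions use `invoke_with_args`; older releases used
    // `invoke`/`invoke_batch`. Check `advanced_udf.rs` on `main` for the current form.
    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        let arrays = ColumnarValue::values_to_arrays(&args.args)?;
        let input = arrays[0]
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("type enforced by the signature");
        // Add 1.0 to every non-null value; nulls propagate
        let result: Float64Array = input.iter().map(|v| v.map(|x| x + 1.0)).collect();
        Ok(ColumnarValue::Array(Arc::new(result)))
    }
}
```

The same overall shape (a struct holding a `Signature`, plus the trait implementation) applies to aggregate, window
and table functions, using the traits listed in the table above.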

[df-functions]: https://github.com/apache/datafusion/tree/main/datafusion/functions
[functions-nested]: https://github.com/apache/datafusion/tree/main/datafusion/functions-nested
[functions-aggregate]: https://github.com/apache/datafusion/tree/main/datafusion/functions-aggregate
[functions-window]: https://github.com/apache/datafusion/tree/main/datafusion/functions-window
[functions-table]: https://github.com/apache/datafusion/tree/main/datafusion/functions-table
[df-spark]: https://github.com/apache/datafusion/tree/main/datafusion/spark
[`scalarudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
[`aggregateudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html
[`accumulator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html
[`groupsaccumulator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html
[`windowudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.WindowUDFImpl.html
[`partitionevaluator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.PartitionEvaluator.html
[`tablefunctionimpl`]: https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableFunctionImpl.html
[`tableprovider`]: https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html
[`advanced_udf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udf.rs
[`advanced_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udaf.rs
[`advanced_udwf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udwf.rs
[`simple_udtf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/simple_udtf.rs
[rust feature]: https://doc.rust-lang.org/cargo/reference/features.html

**Testing**

Prefer adding `sqllogictest` integration tests where the function is called via SQL against
well-known data and returns an expected result. Check the existing [test files][slt-test-files] for
an appropriate file to add test cases to; otherwise create a new file. See the
[`sqllogictest` documentation][slt-readme] for details on how to construct these tests.
Ensure edge cases and `null` inputs are covered by these tests.
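
For example, a hypothetical `sqllogictest` case for the `add_one` sketch above could look like the following
(the file name and expected values are illustrative; follow the conventions in the existing `.slt` files):

```
# datafusion/sqllogictest/test_files/add_one.slt (hypothetical)

# `R` declares a floating point result column
query R
SELECT add_one(1.5);
----
2.5

# edge case: NULL input should propagate to a NULL result
query R
SELECT add_one(NULL);
----
NULL
```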

If a behaviour cannot be tested via `sqllogictest` (for example it exercises `simplify()`, it needs to be
tested in isolation from the optimizer, or the exact input is difficult to construct via `sqllogictest`),
tests can be added as Rust unit tests in the implementation module, though these should be
kept minimal where possible.

[slt-test-files]: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files
[slt-readme]: https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md

**Documentation**

Run the documentation update script `./dev/update_function_docs.sh`, which will update the relevant
markdown documents [here][fn-doc-home] (see the documents for [scalar][fn-doc-scalar],
[aggregate][fn-doc-aggregate] and [window][fn-doc-window] functions).
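
For example, from the repository root:

```bash
./dev/update_function_docs.sh
```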

- You _should not_ manually update the markdown documents after running the script, as those manual
  changes would be overwritten on the next run
- See the [GitHub issue] which introduced this behaviour

> **Review comment (Contributor):** In fact the CI will also complain :)

[fn-doc-home]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql
[fn-doc-scalar]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md
[fn-doc-aggregate]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/aggregate_functions.md
[fn-doc-window]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/window_functions.md
[github issue]: https://github.com/apache/datafusion/issues/12740

## How to display plans graphically

@@ -97,11 +123,13 @@ can be displayed. For example, the following command creates a
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
```
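
One way to produce `/tmp/plan.dot` is from a `DataFrame`'s logical plan; a minimal sketch (the table name, column,
and file paths are placeholders) might look like:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Placeholder data source; register whatever table you want to inspect
    ctx.register_csv("example", "example.csv", CsvReadOptions::new())
        .await?;
    let df = ctx.sql("SELECT a, count(*) FROM example GROUP BY a").await?;

    // Write the logical plan in Graphviz `dot` format, then render it with
    // `dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf` as shown above
    let dot = df.logical_plan().display_graphviz().to_string();
    std::fs::write("/tmp/plan.dot", dot)?;
    Ok(())
}
```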

## How to format `.md` document
## How to format `.md` documents

We are using `prettier` to format `.md` files.
We use [`prettier`] to format `.md` files.

You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary. Using `npx` required a working node environment. Upgrading to the latest prettier is recommended (by adding `--upgrade` to the `npm` command).
You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary.
Using `npx` requires a working node environment. Upgrading to the latest prettier is recommended (by adding
`--upgrade` to the `npm` command).

```bash
$ prettier --version
@@ -114,19 +142,19 @@ After you've confirmed your prettier version, you can format all the `.md` files
prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md
```

[`prettier`]: https://prettier.io/

## How to format `.toml` files

We use `taplo` to format `.toml` files.
We use [`taplo`] to format `.toml` files.

For Rust developers, you can install it via:
To install via cargo:

```sh
cargo install taplo-cli --locked
```

> Refer to the [Installation section][doc] on other ways to install it.
>
> [doc]: https://taplo.tamasfe.dev/cli/installation/binary.html
> Refer to the [taplo installation documentation][taplo-install] for other ways to install it.

```bash
$ taplo --version
@@ -139,28 +167,24 @@ After you've confirmed your `taplo` version, you can format all the `.toml` file
taplo fmt
```

[`taplo`]: https://taplo.tamasfe.dev/
[taplo-install]: https://taplo.tamasfe.dev/cli/installation/binary.html

## How to update protobuf/gen dependencies

The prost/tonic code can be generated by running `./regen.sh`, which in turn invokes the Rust binary located in `./gen`
For the `proto` and `proto-common` crates, the prost/tonic code is generated by running their respective `./regen.sh` scripts,
which in turn invoke the Rust binary located in `./gen`.

> **Review comment (Contributor):** would be nice to provide relative path
>
> **Reply (Contributor Author):** Updated the sh block below 👍

This is necessary after modifying the protobuf definitions or altering the dependencies of `./gen`, and requires a
valid installation of [protoc] (see [installation instructions] for details).

```bash
./regen.sh
# From repository root
# proto-common
./datafusion/proto-common/regen.sh
# proto
./datafusion/proto/regen.sh
```

[protoc]: https://github.com/protocolbuffers/protobuf#protocol-compiler-installation
[installation instructions]: https://datafusion.apache.org/contributor-guide/getting_started.html#protoc-installation

## How to add/edit documentation for UDFs

UDF documentation is generated from code (related [github issue]). To generate the markdown run `./update_function_docs.sh`.

This is necessary after adding a new UDF implementation, or modifying an existing implementation in a way that requires a documentation update.

```bash
./dev/update_function_docs.sh
```

[github issue]: https://github.com/apache/datafusion/issues/12740
> **Review comment (Contributor Author), on lines -156 to -166:** Decided to consolidate with the Adding new functions section above as it has same steps

8 changes: 4 additions & 4 deletions docs/source/library-user-guide/functions/adding-udfs.md
@@ -354,7 +354,7 @@ async fn main() {
}
```

## Adding a Async Scalar UDF
## Adding an Async Scalar UDF

An Async Scalar UDF allows you to implement user-defined functions that support
asynchronous execution, such as performing network or I/O operations within the
@@ -1257,7 +1257,7 @@ async fn main() -> Result<()> {
[`create_udaf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.create_udaf.html
[`advanced_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udaf.rs

## Adding a User-Defined Table Function
## Adding a Table UDF

A User-Defined Table Function (UDTF) is a function that takes parameters and returns a `TableProvider`.

@@ -1266,8 +1266,8 @@ This is a simple struct that holds a set of RecordBatches in memory and treats t
be replaced with your own struct that implements `TableProvider`.

While this is a simple example for illustrative purposes, UDTFs have a lot of potential use cases. And can be
particularly useful for reading data from external sources and interactive analysis. For example, see the [example][4]
for a working example that reads from a CSV file. As another example, you could use the built-in UDTF `parquet_metadata`
particularly useful for reading data from external sources and interactive analysis. See the [working example][simple_udtf.rs]
which reads from a CSV file. As another example, you could use the built-in UDTF `parquet_metadata`
in the CLI to read the metadata from a Parquet file.

```console