Replies: 1 comment 5 replies
-
@pedroerp Pedro, it would be nice to add links to papers / documentation that explains how this problem is being solved in other systems. |
Beta Was this translation helpful? Give feedback.
5 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Context
Many engines integrating with Velox have requirements to evaluate functions (UDFs) and expressions with some degree of isolation. The requirements range from:
Proposal - OffProcessExpressionEvalOperator
Even though dynamic linking could be used to address some of these requirements, some form of off-process execution (either local/sidecar, or remote) could be supported in Velox to provide flexibility to engine developers.
The proposal is to add a new Velox operator to allow expressions to be executed off-process. This operator will consume input data, batch it, then serialize and communicate with a separate process for expression evaluation. Once the call is complete, data will be deserialized and sent downstream to the next operator. The separate process could be running in the same host (sidecar) or in a remote environment, and will communicate via a transport protocol such as thrift or protobuf - the actual transport will be built in a pluggable manner.
The pluggable transport will take a serialized payload and a serialized expression tree, and return another serialized payload with the results. The remote process can run anything, from C++ code also built using Velox, to logic in different programming languages using other frameworks.
The proposed operator will work in the following manner:
2.a. UDTs will be serialized according to the physical type they use underneath. If they map to an opaque physical type, a custom serde function needs to be provided.
3.a. There will be a parameter to limit how many requests we can have in-flight. If we hit that threshold, we stop consuming input data and just wait until any of the ongoing futures finish. This should provide us with control flow and prevent us from overloading the sidecar process.
Other considerations:
Testing And Benchmarking
Other than regular unit tests, an end-to-end test suite using FuzzerConnector will be built. The test will generate random datasets and expressions and compare a regular expression eval with the new off-process path.
In a similar manner, benchmarks to evaluate the overhead incurred by the serialization should also be built.
Related Work
These are some systems that support UDFs. Java systems usually have different limitations considering the JVM makes the ABI compatibility issues somewhat easier to handle. So these are some of the C++ engine examples I could find:
SQL Server:
Supports UDFs through "language extensions" which are executed in a separate process via a "launchpad" - the component that controls the separate/sidecar process:
DB2:
Supports "external function" that can be written in multiple languages. I could not find the exact architecture, but the documentation suggests it runs on a separate process:
Vertica:
This one is interesting. It support external scripts (which are just spawned and executed externally), and C++ UDFs either in "fenced" or "unfenced" mode:
Big Query:
Supports regular SQL and Javascript UDFs, but also remote UDF which are executed in Google's serverless architecture and communicate with the engine via HTTP:
This recent paper from Microsoft also contains a performance analysis of different UDF deployment modes:
Redshift
Supports Python UDFs and external AWS lambda-based UDFs:
Snowflake
Has external functions and remote UDFs . These are for scalar functions. The system architecture with the Proxy service and API integration seems quite nuanced:
Postgres
Has very advanced extensibility options. There is support for scalars, aggregates, table functions, srfs any many options for their optimization also:
--
Looking for feedback!
Cc: @mbasmanova @oerling @xiaoxmeng @bikramSingh91 @Yuhta @kagamiori @spershin @majetideepak @aditi-pandit @frankobe
Beta Was this translation helpful? Give feedback.
All reactions