-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[POC] [Security Manager Replacement] Native Java Agent (dynamic code rewriting, must be low overhead) #16731
base: main
Are you sure you want to change the base?
Conversation
❌ Gradle check result for 6b73ddf: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
thanks @reta this is really interesting and such a quick progress. On a side note, it would be useful to add a small intro snippet how the |
Thanks @kumargu
Absolutely, I have updated the description (but will push it a bit once we get JDK-21 baseline with #16366, it would simplify a lot the APIs usage) |
9858717
to
ea045b0
Compare
libs/agent-sm/agent/build.gradle
Outdated
"Can-Retransform-Classes": "true", | ||
"Agent-Class": "org.opensearch.javaagent.Agent", | ||
"Premain-Class": "org.opensearch.javaagent.Agent", | ||
"Boot-Class-Path": 'byte-buddy-1.15.10.jar opensearch-agent-bootstrap-3.0.0-SNAPSHOT.jar' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opensearch-agent-bootstrap
is shared between the OpenSearch service and the agent (so the Policy
instance could be propagated)
❌ Gradle check result for ea045b0: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
This PR is stalled because it has been open for 30 days with no activity. |
❌ Gradle check result for 58a227c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 5e20fde: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 4688fd1: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for 930e6ef: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
@reta is it feasible for the agent to coexist with SM enabled in 3.0, meaning both SM and Agent will enforce socket restrictions? |
I was thinking we could bring in replacements of JSM in 3.0 while JSM remains enabled in 3.0 (because we'd be still on JDK-21 in 3.0). Having the alternatives coexist for sometime will give us confidence and enough community feedback before we decide to remove it in some 3.x or 4.0. |
I think we would only target a most critical APIs by Java Agent (we just cannot much it to SM), however we should be able to run Java Agent on JDK-21 at least. |
100% agree. Maybe just the Socket interceptor for now since we see the problems with defining the port ranges in the PR #17107 |
❌ Gradle check result for b5c9985: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
a6fce2a
to
e9d0966
Compare
❌ Gradle check result for a6fce2a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
||
if (args[0] instanceof InetSocketAddress address) { | ||
if (!AgentPolicy.isTrustedHost(address.getHostString())) { | ||
final String host = address.getHostString() + ":" + address.getPort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this host is allowed permission, should we add it in the cache of trustedHost? Maybe we need a cache of protectionDomain and trustedHost and we skip stack policy evaluation for items in cache.
cc @pranu2502
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this host is allowed permission, should be add it in the cache of trustedHost?
Not sure I got it: the AgentPolicy
caches the trusted hosts (it is a set) and agent does not evaluate any protection domains, or I am missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find where we are adding trusted hosts to the cache (the set). I could only find the reference of trusted hosts being added in the BootstrapForTesting
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is (probably) the only place it is being used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha.
Do you think that if we add host
to the trustedHosts
(set) which were allowed socket calls would help save the cost of below evaluation?
for (final ProtectionDomain domain : callers) {
if (!policy.implies(domain, permission)) {
throw new SecurityException("Denied access to: " + host + ", domain " + domain);
}
}
something like:
if (args[0] instanceof InetSocketAddress address) {
final String hostKey = address.getHostString() + ":" + address.getPort();
if (!AgentPolicy.isTrustedHost(hostKey)) {
final SocketPermission permission = new SocketPermission(hostKey, "connect,resolve");
for (final ProtectionDomain domain : callers) {
if (!policy.implies(domain, permission)) {
throw new SecurityException("Denied access to: " + hostKey + ", domain " + domain);
}
}
AgentPolicy.addTrustedHost(hostKey);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pranu2502 is still collecting perf numbers and will post it soon. But as early numbers are arriving: with http_logs
workload, we see a 5% increase in p99 indexing latency. Hence, I have been exploring perf optimisations. This one could be one of them (an another one was proposed earlier as a comment in this PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that if we add
host
to thetrustedHosts
(set) which were allowed socket calls would help save the cost of below evaluation?
We should not do that I believe: trustedHosts
is not associated with any protection domain. Whereas the context of where connect attempt is happening, matters a lot. Fe, the if any code flow successfully passes the security checks, it will implicitly permit the connection to the same host:port from anywhere - not good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
Quoting back my initial comment.
Maybe we need a cache of protectionDomain and trustedHost and we skip stack policy evaluation for items in cache.
So instead of a set
, we might use a map
of protection domain -> trustedHost and if the entry exists in the cache (map), we skip the policy evaluation. But, sure, we can come back to it when we get perf numbers.
❌ Gradle check result for dc2903f: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
{"run-benchmark-test": "id_3"} |
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/2686/ . Final results will be published once the job is completed. |
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/2686/
|
Benchmark Baseline Comparison ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/47/
|
{"run-benchmark-test": "id_4"} |
The Jenkins job url is https://build.ci.opensearch.org/job/benchmark-pull-request/2695/ . Final results will be published once the job is completed. |
Benchmark ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-pull-request/2695/
|
Benchmark Baseline Comparison ResultsBenchmark Results for Job: https://build.ci.opensearch.org/job/benchmark-compare/48/
|
…ing, must be low overhead) Signed-off-by: Andriy Redko <[email protected]> Signed-off-by: Andriy Redko <[email protected]>
❕ Gradle check result for e9d2d82: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
import org.opensearch.plugins.PluginInfo; | ||
import org.opensearch.plugins.PluginsService; | ||
import org.opensearch.secure_sm.SecureSM; | ||
import org.opensearch.secure_sm.policy.PolicyFile; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cannot use the PolicyFile implementation in JDK21? via --add-exports java.base/sun.security.provider=ALL-UNNAMED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cannot use the PolicyFile implementation in JDK21? via --add-exports java.base/sun.security.provider=ALL-UNNAMED
Yes, we can, but it is a temporary solution, I think the end game is to have OpenSearch 3 running on JDK-24. If we don't find the option to parse policy files in time for 3.x, we could:
- keep the
secure_sm
module but delete all the JDK code - refactor
PolicyFile
to inheritsun.security.provider.PolicyFile
:PolicyFile extends sun.security.provider.PolicyFile
- once we have our own policy parsing logic, we just change
PolicyFile
implementation
In this case, we cannot run on JDK-24 but at least we won't require any breaking changes to add the support a bit later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, we cannot run on JDK-24 but at least we won't require any breaking changes to add the support a bit later
@andrross @peterzhuamazon Is that acceptable for 3.0? (AFAIK we are shipping 3.0 with JDK-21)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that it is a temporary solution but I would vote for picking this temporary solution and get to the end state of the Java agent sooner. That will help us to find any breaking changes in the plugin/s.
We will continue to work on our own Policy parsing logic. A "fresh" logic may land next in 2-3 days weeks but it wont be necessarily tied to 3.0 and Java agent can be delivered without blockers or time-slips in 3.0. Whenever we have the policy parser ready that version we will swap the implementations and open gates for JDK-24.
Description
Explore the the native Java Agent (dynamic code rewriting, must be low overhead).
How does it work:
bootstrap
agent
bootstrap
module apply security policiesExample:
The sample
security.policy
(stays the same as before):The application (OpenSearch) is run with the agent:
The application (OpenSearch) is applies security policy to the agent:
Running with
24-ea+31-3600
:Related Issues
Closes #16633
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.