troubleshooting/debugging.md
@@ -838,13 +838,15 @@ Log delay can happen for the following reasons:
While the Fluent Bit maintainers are constantly working to improve its maximum performance, there are limitations. Carefully architecting your Fluent Bit deployment can ensure it scales to meet your required throughput.
If you are handling very high throughput of logs, consider the following:
1. Switch to a sidecar deployment model if you are using a daemon/DaemonSet deployment. In the sidecar model, there is a dedicated Fluent Bit container for each application container/pod/task, so as you scale out to more app containers, you automatically scale out to more Fluent Bit containers as well. Each app container will presumably process less work and thus produce fewer logs, decreasing the throughput that each Fluent Bit instance must handle.
2. Enable Workers. Workers is a feature for multi-threading in core outputs. Even enabling a single worker can help, as it is a dedicated thread for that output instance. All of the AWS core outputs support workers; see the configuration sketch after this list.
3. Use a log streaming model instead of tailing log files. The AWS team has repeatedly found that complaints of scaling problems are usually cases where the tail input is used. Tailing log files and watching for rotations is inherently costly and slower than a direct streaming model. [Amazon ECS FireLens](https://aws.amazon.com/blogs/containers/under-the-hood-firelens-for-amazon-ecs-tasks/) is one good example of a direct streaming model (which also uses sidecars for greater scalability): logs are sent from the container's stdout/stderr stream via the container runtime over a Unix socket to a Fluent Bit [forward](https://docs.fluentbit.io/manual/pipeline/inputs/forward) input. Another option is the [TCP input, which works with some loggers, including log4j](https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/ecs-log-collection#tutorial-3-using-log4j-with-tcp).
4. Increase the Grace period and the container stop timeout. By default, Fluent Bit has a 5-second shutdown grace period, while both EKS and ECS give containers 30 seconds by default to shut down gracefully. See the Guidance section [Set Grace to 30](#set-grace-to-30).
5. Make sure you are sending logs in-region (i.e. from a task/pod/cluster in us-east-1 to a log group in us-east-1). Sending logs out of region significantly reduces throughput.
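
To make the Workers and Grace recommendations above concrete, the following is a minimal sketch of a classic-mode Fluent Bit configuration that combines them with a streaming forward input and an in-region CloudWatch output. The port, log group name, stream prefix, and worker count are illustrative placeholders, not recommended values.

```
[SERVICE]
    # Extend the shutdown grace period so queued logs can be flushed (item 4)
    Grace 30

[INPUT]
    # Streaming model: receive logs over the forward protocol instead of tailing files (item 3)
    Name    forward
    Listen  0.0.0.0
    Port    24224

[OUTPUT]
    # A dedicated worker thread for this output instance (item 2)
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    my-app-logs
    log_stream_prefix app-
    auto_create_group On
    workers           1
```

Note that the output region and the log group are both in us-east-1, matching the in-region guidance in item 5.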
## Throttling, Log Duplication & Ordering
@@ -1071,6 +1073,15 @@ In Amazon EKS and Amazon ECS, the shutdown grace period for containers, the time
```
Grace 30
```
##### Increase the default container shutdown grace period
For both EKS and ECS, the container shutdown grace period is 30 seconds by default. Increasing this timeout gives Fluent Bit more time to send logs on shutdown if the Grace setting is also increased.
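
How you raise that timeout depends on the orchestrator. Below is a minimal sketch for EKS, where the pod-level `terminationGracePeriodSeconds` field controls the shutdown grace period; the pod name, images, and the 60-second value are illustrative placeholders, not recommendations.

```
# Sketch of a sidecar pod on EKS with a longer shutdown grace period.
# 60 is an illustrative value; keep it at or above the Fluent Bit Grace setting.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-fluent-bit            # hypothetical name
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: my-app:latest             # placeholder application image
    - name: fluent-bit
      image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
```

On ECS, the analogous setting is the `stopTimeout` field on the container definition in the task definition; on the EC2 launch type, the `ECS_CONTAINER_STOP_TIMEOUT` agent variable sets the default.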