|
50 | 50 | - [Fluent Bit Windows containers](#fluent-bit-windows-containers)
|
51 | 51 | - [Enabling debug mode for Fluent Bit Windows images](#enabling-debug-mode-for-fluent-bit-windows-images)
|
52 | 52 | - [Networking issue with Windows containers when using async DNS resolution by plugins](#networking-issue-with-windows-containers-when-using-async-dns-resolution-by-plugins)
|
| 53 | +- [Runbooks](#runbooks) |
| 54 | + - [FireLens Crash Report Runbook](#firelens-crash-report-runbook) |
| 55 | + - [1. Build and distribute a core dump S3 uploader image](#1-build-and-distribute-a-core-dump-s3-uploader-image) |
| 56 | + - [2. Setup your own repro attempt](#2-setup-your-own-repro-attempt) |
| 57 | + - [Template: Replicate FireLens case in Fargate](#template-replicate-firelens-case-in-fargate) |
53 | 58 | - [Testing](#testing)
|
54 | 59 | - [Simple TCP Logger Script](#simple-tcp-logger-script)
|
55 | 60 | - [Run Fluent Bit unit tests in a docker container](#run-fluent-bit-unit-tests-in-a-docker-container)
|
@@ -705,6 +710,96 @@ To work around this issue, we suggest using the following option so that system
|
705 | 710 | net.dns.mode LEGACY
|
706 | 711 | ```
|
707 | 712 |
|
| 713 | +### Runbooks |
| 714 | + |
| 715 | + |
| 716 | +#### FireLens Crash Report Runbook |
| 717 | + |
| 718 | +When you receive a SIGSEGV/crash report from a FireLens customer, perform the following steps. |
| 719 | + |
| 720 | +##### 1. Build and distribute a core dump S3 uploader image |
| 721 | + |
| 722 | +You need a customized image build for the specific version/case you are testing. Make sure the `ENV FLB_VERSION` is set to the right version in the `Dockerfile.debug-base` and make sure the `AWS_FLB_CHERRY_PICKS` file has the right contents for the release you are testing. |
| 723 | + |
| 724 | +Then simply run: |
| 725 | +``` |
| 726 | +make core |
| 727 | +``` |
| 728 | + |
| 729 | +Push this image to AWS (ideally public ECR) so that you and the customer can download it. |
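A typical push sequence to Amazon ECR Public looks like the following sketch. The registry alias and repository/tag names here are placeholders, not real ones, and the commands require valid AWS credentials:

```shell
# Authenticate Docker to ECR Public (auth for public.ecr.aws is always via us-east-1)
aws ecr-public get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin public.ecr.aws

# Tag and push the core dump uploader image (alias and repo name are placeholders)
docker tag core-dump-uploader:latest public.ecr.aws/my-alias/core-dump-uploader:latest
docker push public.ecr.aws/my-alias/core-dump-uploader:latest
```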
| 730 | + |
| 731 | +Send the customer a comment like the following: |
| 732 | + |
| 733 | +If you can deploy this image in an environment that easily reproduces the issue, we can obtain a "core dump" when Fluent Bit crashes. This image will run normally until Fluent Bit crashes, then it will run an AWS CLI command to upload a compressed "core dump" to an S3 bucket. You can then send that zip file to us, and we can use it to figure out what's going wrong. |
| 734 | + |
| 735 | +If you choose to deploy this, for the S3 upload on shutdown to work, you must: |
| 736 | + |
| 737 | +1. Set the following env vars: |
| 738 | + a. `S3_BUCKET` => an S3 bucket that your task can upload to. |
| 739 | + b. `S3_KEY_PREFIX` => the key prefix in S3 for the core dump; set it to something useful like the ticket ID or a human-readable string. It must be valid for an S3 key. |
| 740 | +2. You must then add the following S3 permissions to your task role so that the AWS CLI can upload to S3: |
| 741 | + |
| 742 | +``` |
| 743 | +{ |
| 744 | + "Version": "2012-10-17", |
| 745 | + "Statement": [ |
| 746 | + { |
| 747 | + "Resource": [ |
| 748 | + "arn:aws:s3:::YOUR_BUCKET_NAME", |
| 749 | + "arn:aws:s3:::YOUR_BUCKET_NAME/*" |
| 750 | + ], |
| 751 | + "Effect": "Allow", |
| 752 | + "Action": [ |
| 753 | + "s3:DeleteObject", |
| 754 | + "s3:GetBucketLocation", |
| 755 | + "s3:GetObject", |
| 756 | + "s3:ListBucket", |
| 757 | + "s3:PutObject" |
| 758 | + ] |
| 759 | + } |
| 760 | + ] |
| 761 | +} |
| 762 | +``` |
| 763 | + |
| 764 | +Make sure to edit the `Resource` section with your bucket name. |
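For reference, the environment variables from step 1 would appear in the log router's container definition roughly like this. The bucket and prefix values are examples only:

```json
{
  "name": "log_router",
  "environment": [
    { "name": "S3_BUCKET", "value": "my-coredump-bucket" },
    { "name": "S3_KEY_PREFIX", "value": "case-1234567890-core-dumps" }
  ]
}
```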
| 765 | + |
| 766 | + |
| 767 | +##### 2. Setup your own repro attempt |
| 768 | + |
| 769 | +There are two options for reproducing a crash: |
| 770 | +1. [Tutorial: Replicate an ECS FireLens Task Setup Locally](#tutorial-replicate-an-ecs-firelens-task-setup-locally) |
| 771 | +2. [Template: Replicate FireLens case in Fargate](#template-replicate-firelens-case-in-fargate) |
| 772 | + |
| 773 | +Replicating the case in Fargate is recommended, since you can easily scale up the repro to N instances. This makes it much more likely that you can actually reproduce the issue. |
| 774 | + |
| 775 | + |
| 776 | +##### Template: Replicate FireLens case in Fargate |
| 777 | + |
| 778 | +In [troubleshooting/tutorials/cloud-firelens-crash-repro-template](tutorials/cloud-firelens-crash-repro-template/), you will find a template for setting up a customer repro in Fargate. |
| 779 | + |
| 780 | +This can help you quickly set up a repro. Both the template and this guide still require careful thought to create a faithful replication of the customer's setup. |
| 781 | + |
| 782 | +Perform the following steps: |
| 783 | +1. Copy the template to a new working directory. |
| 784 | +2. Set up the custom logger. Note: this step is optional, and other tools like [firelens-datajet](https://github.com/aws/firelens-datajet) may be used instead. You need some simulated log producer. There is also a logger in [troubleshooting/tools/big-rand-logger](tools/big-rand-logger/) which uses openssl to emit random data to stdout with a configurable size and rate. |
| 785 | + 1. If the customer tails a file, add issue log content or simulated log content to `file1.log`. If they tail more than one file, add a `file2.log` and an entry in `logger.sh` for it. If they do not tail a file, remove the file entry in `logger.sh`. |
| 786 | + 2. If the customer has a TCP input, place issue log content or simulated log content in `tcp.log`. If there are multiple TCP inputs or no TCP inputs, customize `logger.sh` accordingly. |
| 787 | + 3. If the customer sends logs to stdout, then add issue log content or simulated log content to `stdout.log`. Customize the `logger.sh` if needed. |
| 788 | + 4. Build the logger image with `docker build -t {useful case name}-logger .` and then push it to ECR for use in your task definition. |
| 789 | +3. Build a custom Fluent Bit image |
| 790 | + 1. Place customer config content in `extra.conf` if they have custom Fluent Bit configuration. If they have a custom parser file, set it in `parser.conf`. If there is a `storage.path` set, make sure the path is not on a read-only volume/filesystem; if the `storage.path` cannot be created, Fluent Bit will fail to start. |
| 791 | + 2. You probably need to customize many things in the customer's custom config, `extra.conf`. Read through the config and ensure it could be used in your test account. Note any required env vars. |
| 792 | + 3. Make sure `auto_create_group` is set to `true` or `On` for all CloudWatch outputs. Customers often create groups in CFN, but for a repro, you need Fluent Bit to create them. |
| 793 | + 4. The logger script and example task definition assume that any log files are written to the `/tail` directory. Customize tail paths to `/tail/file1*`, `/tail/file2*`, etc. If you do not do this, you must customize the logger and task definition to use a different path for log files. |
| 794 | + 5. In the `Dockerfile`, set the base image to the core dump uploader image you built in step 1 of this Runbook. |
| 795 | +4. Customize the `task-def-template.json`. |
| 796 | + 1. Add your content for each of the `INSERT` statements. |
| 797 | + 2. Set any necessary env vars, and customize the logger env vars. `TCP_LOGGER_PORT` must be set to the port from the customer config. |
| 798 | + 3. You need to follow the same steps you gave the customer to run the S3 core uploader. Set `S3_BUCKET` and `S3_KEY_PREFIX`. Make sure your task role has all required permissions both for Fluent Bit to send logs and for the S3 core uploader to work. |
| 799 | +5. Run the task on Fargate. |
| 800 | + 1. Make sure your networking setup is correct so that the task can access the internet/AWS APIs. There are different ways of doing this. We recommend using the CFN in [aws-samples/ecs-refarch-cloudformation](https://github.com/aws-samples/ecs-refarch-cloudformation/blob/master/infrastructure/vpc.yaml) to create a VPC with private subnets that can access the internet over a NAT gateway. This way, your Fargate tasks do not need to be assigned a public IP. |
| 801 | +6. Make sure your repro actually works. Verify that the task ran successfully and that Fluent Bit is functioning normally and sending logs. Check that the task reaches RUNNING in ECS, then check the Fluent Bit log output in CloudWatch. |
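Once the task definition is registered, you can launch multiple copies to scale up the repro. A sketch with the AWS CLI follows; the cluster name, task definition name, subnet, and security group IDs are placeholders, and `--count` accepts at most 10 tasks per call:

```shell
aws ecs run-task \
    --cluster repro-cluster \
    --task-definition firelens-crash-repro \
    --launch-type FARGATE \
    --count 10 \
    --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=DISABLED}'
```

Repeat the call to run more than 10 tasks.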
| 802 | + |
708 | 803 | ### Testing
|
709 | 804 |
|
710 | 805 | #### Simple TCP Logger Script
|
@@ -918,16 +1013,16 @@ For reference, for this example, here is what the `fluent-bit.conf` should look
|
918 | 1013 |
|
919 | 1014 | ##### FireLens Customer Case Local Repro Template
|
920 | 1015 |
|
921 |
| -In [troubleshooting/tutorials/firelens-crash-repro-template](troubleshooting/tutorials/firelens-crash-repro-template) there are a set of files that you can use to quickly create the setup above. The template includes setup for outputting corefiles to a directory and optionally sending logs from stdout, file, and TCP loggers. Using the loggers is optional and we recommended considering [aws/firelens-datajet](https://github.com/aws/firelens-datajet) as another option to send log files. |
| 1016 | +In [troubleshooting/tutorials/local-firelens-crash-repro-template](tutorials/local-firelens-crash-repro-template) you will find a set of files that you can use to quickly create the setup above. The template includes setup for outputting corefiles to a directory and optionally sending logs from stdout, file, and TCP loggers. Using the loggers is optional, and we recommend considering [aws/firelens-datajet](https://github.com/aws/firelens-datajet) as another option to send log files. |
922 | 1017 |
|
923 | 1018 | To use the template:
|
924 | 1019 | 1. Build a core file debug build of the Fluent Bit version in the customer case.
|
925 | 1020 | 2. Clone/copy the [troubleshooting/tutorials/firelens-crash-repro-template](troubleshooting/tutorials/firelens-crash-repro-template) into a new project directory.
|
926 | 1021 | 3. Customize the `fluent-bit.conf` and the `extra.conf` with customer config file content. You may need to edit it to be convenient for your repro attempt. If there is a `storage.path`, set it to `/storage` which will be the storage sub-directory of your repro attempt. If there are log files read, customize the path to `/logfiles/app.log`, which will be the `logfiles` sub-directory containing the logging script.
|
927 | 1022 | 4. The provided `run-fluent-bit.txt` contains a starting point for constructing a docker run command for the repro.
|
928 |
| -5. The `logfiles` directory contains a script for appending to a log file every second from an example log file called `example.log`. Add customer custom log content that caused the issue to that file. Then use the instructions in `command.txt` to run the logger script. |
929 |
| -6. The `stdout-logger` sub-directory includes setup for a simple docker container that writes the `example.log` file to stdout every second. Fill `example.log` with customer log content. Then use the instructions in `command.txt` to run the logger container. |
930 |
| -7. The `tcp-logger` sub-directory includes setup for a simple script that writes the `example.log` file to a TCP port every second. Fill `example.log` with customer log content. Then use the instructions in `command.txt` to run the logger script. |
| 1023 | +5. The `logfiles` directory contains a script for appending to a log file every second from an example log file called `example.log`. Add the custom log content from the case that caused the issue to that file. Then use the instructions in `command.txt` to run the logger script. |
| 1024 | +6. The `stdout-logger` sub-directory includes setup for a simple docker container that writes the `example.log` file to stdout every second. Fill `example.log` with case log content. Then use the instructions in `command.txt` to run the logger container. |
| 1025 | +7. The `tcp-logger` sub-directory includes setup for a simple script that writes the `example.log` file to a TCP port every second. Fill `example.log` with case log content. Then use the instructions in `command.txt` to run the logger script. |
931 | 1026 |
|
932 | 1027 | ### FAQ
|
933 | 1028 |
|
|