| 1 | +--- |
| 2 | +title: Adding a New Thanos Component that Embeds Cortex Query Frontend |
| 3 | +type: proposal |
| 4 | +menu: proposals |
| 5 | +status: approved |
| 6 | +owner: bwplotka |
| 7 | +--- |
| 8 | + |
| 9 | +### Related Tickets |
| 10 | + |
| 11 | +* Response caching: https://github.com/thanos-io/thanos/issues/1651 |
| 12 | +* Moving query frontend to separate repo: https://github.com/cortexproject/Cortex/issues/1672 |
| 13 | +* Discussion about naming: https://cloud-native.slack.com/archives/CK5RSSC10/p1586939369171300 |
| 14 | + |
| 15 | +## Summary |
| 16 | + |
| 17 | +This proposal describes the addition of a new Thanos command (component) to `cmd/thanos` called `query-frontend`.
| 18 | +This component will directly import a pinned version of the Cortex [frontend package](https://github.com/cortexproject/Cortex/tree/4410bed704e7d8f63418b02b328ddb93d99fad0b/pkg/querier/frontend).
| 19 | +
| 20 | +We will go through the rationale and potential alternatives.
| 21 | + |
| 22 | +## Motivation |
| 23 | + |
| 24 | +[Cortex Frontend](https://www.youtube.com/watch?v=eyBbImSDOrI&t=2s) was introduced by Tom in August 2019. It was designed
| 25 | +to be deployed in front of the Prometheus Query API to provide:
| 26 | + |
| 27 | +* Query split by time. |
| 28 | +* Query step alignment. |
| 29 | +* Query retry logic.
| 30 | +* Query limit logic.
| 31 | +* Query response cache in memory, Memcached or Redis. |
| 32 | + |
| 33 | +Since the Cortex backend is very similar to Thanos, with exactly the same PromQL API and long-term storage capabilities, the caching
| 34 | +work done for Cortex fits Thanos well. Given our good collaboration in the past, it feels natural to reuse Cortex's code.
| 35 | +We even started a discussion about moving it to a separate repo, but there was little motivation to do so, since nothing prevents us from using
| 36 | +the Cortex one, and Cortex is happy to take generalized contributions.
| 37 | + |
| 38 | +In the end, we recommended running the Cortex query frontend in production on top of Thanos, and this works reasonably well, with some
| 39 | +problems in edge cases and with downsampled data, as mentioned [here](https://github.com/thanos-io/thanos/issues/1651).
| 40 | +
| 41 | +However, we recently realized that asking users to suddenly install a Cortex component on top of a Thanos system is extremely confusing:
| 42 | + |
| 43 | +* Cortex configures services in a totally different way: you choose which modules to run in a single YAML file. Thanos, in contrast,
| 44 | +uses flags and a separate subcommand for each component.
| 45 | +* Cortex configures memcached in a slightly different way, which is inconsistent with what we have in Thanos Store Gateway.
| 46 | +* There are many Cortex-specific configuration items that can confuse Thanos users and increase overall complexity.
| 47 | +* We have many ideas for improving the Cortex Query Frontend on top of Thanos, but adding Thanos-specific configuration options would increase
| 48 | +complexity on the Cortex side as well.
| 49 | +* Cortex has no good example or tutorial on how to use the frontend either. We have only the [Observatorium example](https://github.com/observatorium/configuration/blob/5129a8beb9507f29aec05566ca9a0f2ad82bbf76/environments/openshift/manifests/observatorium-template.yaml#L515).
| 50 | + |
| 51 | +All of this was causing confusion and questions like [this](https://cloud-native.slack.com/archives/CK5RSSC10/p1586504362400300?thread_ts=1586492170.387900&cid=CK5RSSC10).
| 52 | +
| 53 | +In the end, the Thanos and Cortex maintainers agreed that it would be best to create a new Thanos service called `query-frontend`.
| 54 | + |
| 55 | +## Use Cases |
| 56 | + |
| 57 | +* Users can cache responses for range queries.
| 58 | +* Users can split range queries.
| 59 | +* Users can rate limit and retry range queries.
| 60 | + |
| 61 | +## Goals of this design |
| 62 | + |
| 63 | +* Enable response caching that is easy to use for Thanos users.
| 64 | +* Keep it extensible and scalable for future improvements like advanced query planning, queuing, rate limiting, etc.
| 65 | +* Reuse as much as possible between projects and contribute improvements back.
| 66 | +* Use the same configuration patterns as the rest of the Thanos components.
| 67 | + |
| 68 | +## Non Goals |
| 69 | + |
| 70 | +* Create Thanos specific response caching from scratch. |
| 71 | + |
| 72 | +## Proposal |
| 73 | + |
| 74 | +The idea is to create a `thanos query-frontend` component that allows specifying the following options:
| 75 | +
| 76 | +* `--query-range.split-interval`, `time.Duration`
| 77 | +* `--query-range.max-retries-per-request`, `int`, default = `5`
| 78 | +* `--query-range.disable-step-align`, `bool`
| 79 | +* `--query-range.response-cache-ttl`, `time.Duration`
| 80 | +* `--query-range.response-cache-max-freshness`, `time.Duration`, default = `1m`
| 81 | +* `--query-range.response-cache-config(-file)`, `pathorcontent` + [CacheConfig](https://github.com/thanos-io/thanos/blob/55cb8ca38b3539381dc6a781e637df15c694e50a/pkg/store/cache/factory.go#L32)
| 82 | + |
| 83 | +We plan to have in-mem, fifo and memcached support for now. Cache config will be exactly the same as the one used for Store Gateway. |
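
As a rough illustration of what `--query-range.split-interval` and step alignment imply for a range query, here is a minimal Go sketch. It is not the Cortex implementation: `subRange`, `alignToStep` and `splitByInterval` are hypothetical names, timestamps are assumed to be Unix milliseconds, and the split interval is assumed to be a multiple of the step.

```go
package main

import "fmt"

// subRange is a hypothetical type: one piece of a range query, in Unix milliseconds.
type subRange struct{ Start, End int64 }

// alignToStep rounds start and end down to multiples of step.
func alignToStep(start, end, step int64) (int64, int64) {
	return start - start%step, end - end%step
}

// splitByInterval cuts [start, end] at multiples of interval so that each
// piece stays inside one interval bucket. Assumes step > 0 and interval > 0.
func splitByInterval(start, end, step, interval int64) []subRange {
	var out []subRange
	for s := start; s <= end; {
		boundary := (s/interval + 1) * interval // start of the next bucket
		e := boundary - step                    // last step before that boundary
		if e < s {
			e = s // guard for unaligned input so the loop always advances
		}
		if e > end {
			e = end
		}
		out = append(out, subRange{Start: s, End: e})
		s = e + step
	}
	return out
}

func main() {
	const minute, hour = int64(60000), int64(3600000)
	// Align an arbitrary request to the step, then split it by the hour.
	start, end := alignToStep(65000, 2*hour+5000, minute)
	for _, r := range splitByInterval(start, end, minute, hour) {
		fmt.Println(r.Start, r.End)
	}
}
```

Each sub-range can then be cached and fetched independently, which is why aligned, fixed boundaries matter for cache hit rate.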
| 84 | + |
| 85 | +This command will be the place for any query planning or queueing logic that we might want to add at some point. It will not be part of any gRPC API.
| 86 | + |
| 87 | +To make this happen we will propose a small refactor in Cortex code to avoid unnecessary package dependencies. |
| 88 | + |
| 89 | +### Alternatives |
| 90 | + |
| 91 | +#### Don't add anything, document Cortex query frontend and add examples of usage |
| 92 | + |
| 93 | +Unfortunately, we already tried this path without success. The reasons are described in [Motivation](202004_embedd_cortex_frontend.md#Motivation).
| 94 | + |
| 95 | +#### Add response caching to Querier itself, in the same process. |
| 96 | + |
| 97 | +Allowing caching directly in the Querier would definitely simplify deployment. However, this approach does not scale well.
| 98 | +
| 99 | +Furthermore, the frontend will eventually be responsible for more than just caching. It is meant to do query planning, such as splitting or even
| 100 | +advanced query parallelization (query sharding). This might mean future improvements in query scheduling, queuing and retrying.
| 101 | +As a result, at some point we would need the ability to scale the query part and the caching/query planner part completely separately.
| 102 | + |
| 103 | +Last but not least, splitting queries allows sub-requests to be performed in parallel. Only when the frontend runs as its own binary, separate from the Querier, can those sub-requests be load balanced.
| 104 | +
| 105 | +NOTE: We can still consider simple response caching inside the Querier if users request it.
| 106 | + |
| 107 | +#### Write response caching from scratch. |
| 108 | + |
| 109 | +This hardly needs explaining. Response caching has proven to be non-trivial. It's really amazing that we
| 110 | +have the opportunity to build on something that works, together with experts in the field like @tomwilkie and others from the Loki and Cortex teams.
| 111 | + |
| 112 | +Overall, [Reusing is caring](https://www.bwplotka.dev/2020/how-to-became-oss-maintainer/#5-want-more-help-give-back-help-others). |
| 113 | + |
| 114 | +## Work Plan |
| 115 | + |
| 116 | +1. Refactor [IndexCacheConfig](https://github.com/thanos-io/thanos/blob/55cb8ca38b3539381dc6a781e637df15c694e50a/pkg/store/cache/factory.go#L32) into a generic cache config so we can reuse it.
| 117 | +Make it implement the Cortex `cache.Cache` interface.
| 118 | +1. Add necessary changes to the Cortex frontend:
| 119 | + * Metric generalization (they are globals now). |
| 120 | + * Avoid unnecessary dependencies. |
| 121 | +1. Add `thanos query-frontend` subcommand. |
| 122 | +1. Add proper e2e test using cache. |
| 123 | +1. Document the new subcommand.
| 124 | +1. Add it to [kube-thanos](https://github.com/thanos-io/kube-thanos).
| 125 | + |
| 126 | +## Future Work |
| 127 | + |
| 128 | +Improvements to the Cortex query frontend, and thus to Thanos `query-frontend`, as described [here](https://github.com/thanos-io/thanos/issues/1651).