diff --git a/blog/2025-12-10_kv-caching-vllm-lmcache-ceph.md b/blog/2025-12-10_kv-caching-vllm-lmcache-ceph.md
new file mode 100644
index 0000000..c41ab43
--- /dev/null
+++ b/blog/2025-12-10_kv-caching-vllm-lmcache-ceph.md
@@ -0,0 +1,625 @@
---
title: "KV Caching with vLLM, LMCache, and Ceph"
description: "Exploring KV caching integration with vLLM, LMCache, and Ceph for optimized inference performance"
slug: kv-caching-vllm-lmcache-ceph
date: 2025-12-10T09:00

authors:
  - kylebader
  - tushargohad

tags: [blog, ceph, rgw, s3, kv-cache]
---

Inference accounts for [90% of the machine learning costs](https://www.sciencedirect.com/science/article/pii/S2210537923000124) for deployed AI systems, and it is no surprise that inference optimization is a burgeoning topic in the research community. [IDC estimates](https://info.idc.com/futurescape-generative-ai-2025-predictions.html) that global enterprises will invest $307 billion in AI solutions in 2025, and that number is expected to grow aggressively year-over-year.

## Understanding the workload

Unlike training, inference for autoregressive language models only involves the forward pass, which itself is broken up into two distinct phases: prefill and decode. Each phase has a unique workload profile: prefill tends to be compute bound, consuming every ounce of floating-point arithmetic capability the system can muster, while decode is principally limited by memory bandwidth.

The computational cost of attention grows quadratically with sequence length in both the prefill and decode phases. Prefill is easily parallelized across GPUs - all prompt tokens are known up front when a request arrives at the model API. The decode phase must compute multi-headed attention against the states of all previous tokens - including any prompt(s) and the response generated so far. This complicates the deployment of inference services where context lengths are growing rapidly to accommodate larger code bases, longer documents, and retrieval augmented generation. KV caching saves the computed key and value tensors that correspond to token sequences in a prompt, so that they can be retrieved when those sequences appear in a subsequent prompt, avoiding the cost of recomputation (GPU hours) and reducing the time between when the prompt was submitted as a request and the first response token (time-to-first-token, or TTFT).

## Cache blocks in vLLM and LMCache

vLLM takes a hierarchical approach to KV caching. First it checks for the existence of cache blocks in GPU memory; on a miss it progresses to CPU memory, and on another miss it tries to retrieve cache blocks over any configured KV connectors. LMCache works with vLLM over this KV connector interface - vLLM sends or requests cache blocks, and LMCache diligently stores or streams the cache blocks it locates. vLLM also introduced the technique of [PagedAttention](https://arxiv.org/pdf/2309.06180), which breaks up prompts into fixed-size token sequences referred to as blocks, 16 tokens by default. LMCache uses a larger 256-token block by default, presumably to reduce the overhead of managing references to many blocks and to better amortize the per-block transfer overhead. Storage folks, being unfamiliar with a token as a unit of measurement for space and IO, might naturally wonder what this translates to in terms of block sizes expressed in bytes. The bytes-per-token is model dependent: it is the product of the number of hidden layers, the number of key-value heads, the head dimension, and the data type size, doubled to account for both keys and values. For a model like Qwen3-32B, a default 256-token LMCache block works out to approximately 62.5 MiB. There is a convenient [KV Cache calculator](https://docs.lmcache.ai/getting_started/kv_cache_calculator.html) available on the documentation page for LMCache if you want to see how much KV space would be required for any given model or number of tokens.
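To sanity check that figure, the arithmetic fits in a few lines of shell. The layer, head, and data type values below are assumptions read from Qwen3-32B's published configuration, and the result lands in the same ballpark as the figure above (differences come down to rounding and unit conventions):

```
# Back-of-the-envelope KV sizing (assumed Qwen3-32B-like geometry, bf16 KV cache)
layers=64          # num_hidden_layers
kv_heads=8         # num_key_value_heads
head_dim=128       # dimension per attention head
dtype_bytes=2      # bf16
chunk_tokens=256   # LMCache default chunk size

# x2 because both keys and values are cached
bytes_per_token=$(( 2 * layers * kv_heads * head_dim * dtype_bytes ))
bytes_per_chunk=$(( bytes_per_token * chunk_tokens ))

echo "KV bytes per token:           $(( bytes_per_token / 1024 )) KiB"
echo "KV bytes per 256-token chunk: $(( bytes_per_chunk / 1024 / 1024 )) MiB"
```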
## Content-addressable KV storage

vLLM and LMCache both calculate a hash of the token sequence that represents a block and use that as the cache block identifier. This means that vLLM will pass the hashes of the cache blocks it is interested in over the KV connector interface, and LMCache will return a bitmask indicating which cache blocks it can provide. Under the covers, the LMCache S3 connector makes GetObjectAttributes calls with each block identifier (the hash of the token sequence), and for each block that exists it flips the corresponding bit in the mask. The elegance of this approach is that there is no cache block map that needs to be persisted, and no coordination is necessary when there are multiple instances of vLLM+LMCache running across different hosts. In fact, there is no requirement that the [LMCache controller](https://docs.lmcache.ai/kv_cache_management/index.html) be configured at all. This design also permits flexible eviction: a storage system could implement time-based expiration via Lifecycle configurations, and any deleted block simply registers as a miss. In the end you get fully elastic, content-addressable storage for KV cache blocks with flexible eviction. Anyone familiar with Ceph will truly appreciate the notion of computing the location of data rather than looking it up.
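To make the lookup pattern concrete, here is what the existence check reduces to when expressed against a plain S3 endpoint. This is an illustrative sketch with the AWS CLI, not the connector's actual code path (which batches GetObjectAttributes calls through the CRT client); the hash value is a placeholder, and the key layout simply mirrors the bucket and prefix used in the configurations later in this post:

```
# Hypothetical chunk hash; LMCache derives it from the token sequence itself
chunk_hash="d41d8cd98f00b204e9800998ecf8427e"

# Does a KV chunk for this token sequence already exist in the cache bucket?
if aws s3api head-object \
      --profile lmcache \
      --endpoint-url http://s3.cephlab.com \
      --bucket lmcache \
      --key "test/${chunk_hash}" >/dev/null 2>&1; then
  echo "hit:  stream the chunk back instead of recomputing prefill for it"
else
  echo "miss: vLLM recomputes the chunk and LMCache stores it afterwards"
fi
```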
## Retrieving cache blocks

We began exploring LMCache by testing its native S3 connector with Ceph, as it provides an accessible entry point for most existing environments. The other appeal of the native S3 connector in LMCache is that it leverages the AWS Common Runtime (CRT) library, which means that the connections in the client's connection pool will be multiplexed across the endpoints returned in the DNS response for the object store's FQDN. The downside is that the bindings in the AWS Common Runtime library for Python only support `recv_filepath` and `send_filepath`, which limits the ability of LMCache to stream the response body of a GetObject call directly into page-locked memory buffers allocated by the LocalCPUBackend. To work around this limitation the connector pre-allocates and mmaps files on a tmpfs mounted at /dev/shm (one per concurrent request); this way the CRT client can be handed the file descriptors of the memory-mapped files, and the connector can then memcpy from their corresponding buffers into the page-locked LocalCPUBackend buffers used for DMA transfers to the GPU. This is a clever way of working around most of the limitations of aws-crt-python, but getting to true zero-copy will require changes to the bindings.

After some preliminary testing with the native S3 connector, [LMCache PR #1939](https://github.com/LMCache/LMCache/pull/1939) caught our eye because it leverages the NVIDIA Inference Xfer Library (NIXL). This PR introduces the ability to read S3 data directly into page-locked NIXL buffers, bypassing the files on /dev/shm and the associated memory copy. It also introduces a presence cache to eliminate redundant GetObjectInfo requests, which are used to determine whether a cache block exists for a given sequence. We had already experimented with the NIXL obj plugin and ran some rudimentary nixlbench tests. What we found was that the NIXL obj plugin alone wanted a pre-allocated pool of object keys, and that it required either the LMCache coordinator or Dynamo KVBM to maintain device ID, offset, and length information for each cache block. Unlike other NIXL plugins, the obj plugin could only write a single cache block to each device ID (a 1:1 mapping with an object key), because object APIs like S3 do not support writes to arbitrary offsets. This is all addressed by PR #1939: instead of using a pool of object keys and tracking cache block metadata, it preserves the content-addressable approach of LMCache's native S3 connector. The only remaining downside with NIXL is that it uses S3Client instead of S3CrtClient, the latter of which supports multipathing across S3 endpoints.

## Hyperscale AI deployments

Drawing on over a decade of experience selecting hardware for Ceph storage systems, we had an idea of the sort of system we would want to build to maximize throughput, and we also took inspiration from choices made by major AI practitioners like Meta and OpenAI. Enter Meta's contribution to the Open Compute Project - the [Yosemite V3.5](https://www.opencompute.org/documents/yosemite-v3-5-platform-design-specification-v1-2-pdf) Sierra Point server platform. The YV3.5 cubby occupies 3 OU and can be populated with 6x Sierra Point blades. Unlike conventional enterprise blade systems, the YV3.5 platform does not have an integrated Ethernet switch; instead, each Sierra Point blade has an OCP 3.0 slot for direct-to-host network connectivity. We wanted a spiritual successor to YV3.5 and Sierra Point, one that reaped the advantages of cutting-edge processor designs and lithography. While surveying the server landscape across a whole host of OEMs, one system caught our attention: the Supermicro X14 2U 4-node GrandTwin Rear IO.

![](/img/blogs/kv-caching-ceph/smci-x14-grandtwin.png)

[Supermicro X14 2U 4-node GrandTwin Rear IO](https://www.supermicro.com/en/products/system/datasheet/sys-212gt-hnr)

Each node:
• 1x Intel Xeon 6 6740E 96C/96T, 205W
• 16x 16GB DDR5-6400
• 1x Broadcom 57608 2x200GbE
• 6x 2.5" Kioxia CM6-R, 7.68TB Gen4 NVMe SSD
• RAID1 2x 480GB NVMe (boot)

This system provides high-bandwidth all-flash object storage for the AI solution, running IBM Storage Ceph 8.1.

![](/img/blogs/kv-caching-ceph/smci-gaudi3.png)

[Supermicro Gaudi 3 AI Server SYS-822GA-NGR3](https://www.supermicro.com/en/products/system/datasheet/sys-822ga-ngr3)

• 2x Intel Xeon 6 6960P 72C/144T
• 24x 64GB DDR5-6400
• 8x Gaudi 3 HL-325L accelerators
• Up to 8x 2.5" Gen5 NVMe SSD
• Scale-up networking: 21x 200GbE Gaudi NICs
• 2x Broadcom 57608 1x400GbE

This system runs inference workloads with the combination of vLLM and LMCache, leveraging Gaudi 3 accelerators from Intel.
![](/img/blogs/kv-caching-ceph/smci-gpu-aplus.png)

[Supermicro GPU A+ Server AS-8125GS-TNMR2](https://www.supermicro.com/en/products/system/datasheet/as-8125gs-tnmr2)

• 1x AMD EPYC 9654 96C/192T
• 24x 96GB DDR5-4800
• 8x AMD MI300X accelerators
• Up to 8x 2.5" Gen5 NVMe SSD
• Scale-up networking: 4x400GbE
• Storage and GPU scale-out networking: 4x NVIDIA MT28908 ConnectX-6 200GbE

This system runs inference workloads with the combination of vLLM and LMCache, leveraging MI300X accelerators from AMD.

![](/img/blogs/kv-caching-ceph/smci-sw.png)

[SSE-T7132S - 400Gb Ethernet Switch](https://www.supermicro.com/en/products/accessories/Networking/SSE-T7132SR.php)

• 32x QSFP-DD 400GbE, or 64x QSFP56 / 128x QSFP28 with breakout cables
• 25.6Tb/s switching capacity
• SONiC OS
• RoCEv2/RDMA support with PFC

For simplicity we used a single fixed-port 400Gb switch for both the GPU-to-GPU and the storage fabric.

## Host configuration

* Performance profile set in BIOS
* Set the tuned profile to network-latency
```
tuned-adm profile network-latency
```
* All hosts were configured with bonding mode 802.3ad (LACP) and xmit_hash_policy=layer3+4

## Ceph configuration

### OSD service

```
---
service_type: osd
service_id: nvme
placement:
  hosts:
    - ceph-osd01
    - ceph-osd02
    - ceph-osd03
data_devices:
  paths:
    - /dev/disk/by-path/pci-0000:63:00.5-pci-10001:81:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:63:00.5-pci-10001:82:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:01:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:02:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:03:00.0-nvme-1
    - /dev/disk/by-path/pci-0000:89:00.5-pci-10002:04:00.0-nvme-1
```

### Pool configuration

We decided to pre-create the metadata and data pools for RGW before initializing the RGW service.

```
ceph osd pool set noautoscale
ceph osd pool create default.rgw.buckets.data 2048 2048 replicated
ceph osd pool create default.rgw.buckets.index 64 64 replicated
ceph osd pool create default.rgw.buckets.non-ec 64 64 replicated
ceph osd pool set default.rgw.buckets.data size 2
ceph osd pool set default.rgw.buckets.data min_size 1
ceph osd pool application enable default.rgw.buckets.data rgw
ceph osd pool application enable default.rgw.buckets.index rgw
ceph osd pool application enable default.rgw.buckets.non-ec rgw
```

### RGW service

This RGW service configuration creates 4x RGW instances on each of the 4 hosts, with an haproxy concentrator bound to the host IP address on port 80.

```
---
service_type: rgw
service_id: standard
service_name: rgw.standard
placement:
  count_per_host: 4
  label: rgw
networks:
  - 10.67.67.0/24
spec:
  rgw_exit_timeout_secs: 120
  rgw_frontend_port: 8080
  concentrator: haproxy
  concentrator_frontend_port: 80
  concentrator_monitor_port: 1967
  concentrator_monitor_user: admin
```

### Traffic management

Like many applications, LMCache expects a single S3 endpoint. To maximize bandwidth to the storage cluster, we decided to leverage HashiCorp Consul and CoreDNS to return multiple DNS records in response to queries for our chosen object store FQDN. As stated earlier, this works perfectly with AWS CRT libraries like the one used by LMCache's native S3 connector.
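Before pointing DNS at the concentrators, it is worth confirming that each host answers on port 80. A minimal, illustrative check; the host addresses are taken from the Consul configuration below, and an anonymous GET against RGW is expected to return HTTP 200:

```
for host in 172.19.65.41 172.19.65.42 172.19.65.43 172.19.65.44; do
  # Each haproxy concentrator should answer with RGW's anonymous ListBuckets response
  echo -n "${host}: "
  curl -s -o /dev/null -w '%{http_code}\n' "http://${host}:80"
done
```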
#### Consul

/etc/consul.d/consul.hcl
```
datacenter = "smci"
data_dir = "/opt/consul"
bind_addr = "172.19.65.41"
client_addr = "0.0.0.0"
retry_join = [
  "172.19.65.41",
  "172.19.65.42",
  "172.19.65.43",
  "172.19.65.44"
]
server = true
bootstrap_expect = 3

services = [
  {
    name = "s3"
    port = 8080
    check = {
      id = "tcp-check"
      name = "S3 TCP"
      tcp = "localhost:8080"
      interval = "10s"
      timeout = "2s"
    }
  },
  {
    name = "s3"
    port = 8081
    check = {
      id = "tcp-check"
      name = "S3 TCP"
      tcp = "localhost:8081"
      interval = "10s"
      timeout = "2s"
    }
  },
  {
    name = "s3"
    port = 8082
    check = {
      id = "tcp-check"
      name = "S3 TCP"
      tcp = "localhost:8082"
      interval = "10s"
      timeout = "2s"
    }
  },
  {
    name = "s3"
    port = 8083
    check = {
      id = "tcp-check"
      name = "S3 TCP"
      tcp = "localhost:8083"
      interval = "10s"
      timeout = "2s"
    }
  }
]
```

#### CoreDNS

/etc/coredns/Corefile
```
.:53 {
    log
    errors
    forward . 8.8.8.8
}

cephlab.com {
    file /etc/coredns/cephlab.com
    prometheus
    errors
    log
    debug
}

consul {
    forward . 172.19.65.41:8600 172.19.65.42:8600 172.19.65.43:8600 172.19.65.44:8600
    log
    errors
}

s3.cephlab.com {
    rewrite stop {
        name exact s3.cephlab.com s3.service.consul.
        answer name s3.service.consul. s3.cephlab.com.
    }
    rewrite stop {
        name regex (.*)\.s3\.cephlab\.com s3.service.consul.
        answer auto
    }
    forward . 172.19.65.41:8600 172.19.65.42:8600 172.19.65.43:8600 172.19.65.44:8600
    log
    errors
    debug
}

example.hosts s3.ecmp.cephlab.com {
    hosts {
        10.67.67.67 s3.ecmp.cephlab.com
        10.67.67.67 nixl.s3.ecmp.cephlab.com
        fallthrough
    }
    whoami
}
```

#### Testing DNS balancing

To validate that the HashiCorp Consul and CoreDNS based approach is functioning properly, we can test DNS resolution of the FQDN of our object endpoint. Note that we're seeing 4 records returned, which is exactly what we want.

```
[cephuser@ceph-osd01 ~]$ dig s3.cephlab.com

; <<>> DiG 9.16.23-RH <<>> s3.cephlab.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12051
;; flags: qr aa rd; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;s3.cephlab.com.            IN    A

;; ANSWER SECTION:
s3.cephlab.com.    0    IN    A    172.19.65.41
s3.cephlab.com.    0    IN    A    172.19.65.42
s3.cephlab.com.    0    IN    A    172.19.65.43
s3.cephlab.com.    0    IN    A    172.19.65.44

;; Query time: 1 msec
;; SERVER: 172.19.65.41#53(172.19.65.41)
;; WHEN: Tue Nov 04 12:33:03 PST 2025
;; MSG SIZE  rcvd: 163
```

## Baseline performance

To establish the baseline performance of the storage cluster before introducing vLLM and LMCache, we used [elbencho](https://github.com/breuner/elbencho) to generate load from the Gaudi3 GPU host and direct it towards the Ceph S3 endpoints. We used a 62MB block size to match the expected size of the KV cache blocks being persisted by LMCache. This shows that we're able to multiplex connections across the concentrator endpoints on each host and drive a considerable amount of S3 traffic from even a single host, topping out at nearly 60 GB/s.
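For reference, an elbencho run of this shape looks roughly like the sketch below. The flags, thread count, object count, bucket name, and credentials are illustrative rather than the exact invocation we used; the object size is matched to the KV block size discussed earlier:

```
# Write 62MB objects against the S3 endpoints; re-run with -r instead of -w to measure reads
elbencho --s3endpoints "http://s3.cephlab.com:80" \
         --s3key "xxx" --s3secret "yyy" \
         -w -t 64 -n 0 -N 128 -s 62m -b 62m \
         lmcache-bench
```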
![](/img/blogs/kv-caching-ceph/elbencho.png)

## vLLM

At the time of our testing, the vLLM production stack did not support our end-to-end workflows, so we created customized vLLM container images that incorporated an LMCache development release, including one that also incorporated the latest [vllm-gaudi](https://github.com/vllm-project/vllm-gaudi) development code.

AMD Container
* vLLM:
* LMCache:
* NIXL:

Gaudi Container
* vLLM:
* LMCache:
* NIXL:

Below you will find the configuration files and command line arguments we used to run vLLM and LMCache together.

.aws/credentials
```
[lmcache]
region = default
endpoint_url = http://s3.cephlab.com:80
aws_access_key_id = xxx
aws_secret_access_key = yyy
response_checksum_validation = when_required
preferred_transfer_client = crt
```

lmcache-ceph.yaml
```
chunk_size: 256
local_cpu: False
max_local_cpu_size: 100
remote_url: "s3://lmcache.s3.cephlab.com"
save_unfull_chunk: False
enable_async_loading: True
remote_serde: "naive"
blocking_timeout_secs: 100
extra_config:
  s3_max_io_concurrency: 1024
  s3_max_inflight_reqs: 1024
  s3_prefer_http2: False
  s3_region: "default"
  s3_enable_s3express: False
  save_chunk_meta: False
  s3_file_prefix: "test"
```

lmcache-nixl-ceph.yaml
```
chunk_size: 512
local_cpu: false
max_local_cpu_size: 50
remote_serde: "naive"
nixl_buffer_size: 1073741824
nixl_buffer_device: cpu
extra_config:
  enable_nixl_storage: true
  nixl_backend: OBJ
  nixl_pool_size: 512
  nixl_backend_params:
    endpoint_override: http://s3.cephlab.com
    access_key: CR98FOT054QZJ60NR7E3
    secret_key: 15CTFkiAdwPkkiSh4gOlQ5zF14KZ0uCnZloYVo3w
    scheme: http
    region: default
    req_checksum: required
    bucket: lmcache
```

lmcache-dram.yaml
```
chunk_size: 256
local_cpu: True
max_local_cpu_size: 50
save_unfull_chunk: False
enable_async_loading: True
remote_serde: "naive"
blocking_timeout_secs: 100
```

Starting vLLM
```
export LMCACHE_CONFIG_FILE="/root/lmcache-nixl-s3.yaml"
export LMCACHE_USE_EXPERIMENTAL=True
export PYTHONHASHSEED=67
export AWS_PROFILE='lmcache'
vllm serve Qwen/Qwen3-32B \
    --gpu-memory-utilization 0.55 \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --max-model-len 131072 \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both","kv_parallel_size":"16"}' \
    --tensor-parallel-size 2
```

For the Gaudi3 accelerator testing we set the following additional environment variables:

```
PT_HPU_GPU_MIGRATION=1
VLLM_USE_V1=1
VLLM_SKIP_WARMUP=True
VLLM_EXPONENTIAL_BUCKETING=False
```

## Benchmark

We wanted to characterize the reduction in time-to-first-token for a 100% cache hit rate from remote storage with Ceph across various context lengths, and chart it relative to computational prefill. For this we selected the LMCache [long_doc_qa.py](https://github.com/LMCache/LMCache/blob/dev/benchmarks/long_doc_qa/long_doc_qa.py) benchmark. We developed the following methodology for TTFT data collection:

1. Start vLLM
2. Run long_doc_qa.py and record TTFT for the warm-up round (computational prefill result)
3. Restart vLLM
4. Run long_doc_qa.py and record TTFT for the warm-up round (KV cache hit from remote storage result)
5. Stop vLLM
6. Remove cache blocks from remote storage (see the sketch below)
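Step 6 can be as simple as emptying the cache bucket between runs. A minimal sketch using the AWS CLI, assuming the bucket, endpoint, and profile from the configurations above:

```
# Drop all persisted KV chunks so the next context length starts cold
aws s3 rm s3://lmcache --recursive \
    --profile lmcache \
    --endpoint-url http://s3.cephlab.com
```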
By restarting vLLM in step 3 we ensure that the results are not skewed by KV caching in GPU HBM or CPU memory, and by stopping vLLM and removing cache blocks from remote storage we ensure that each subsequent context length does not benefit from remote storage KV caching left over from the previous context length. With this methodology all KV caches are cold at the beginning of each test, except for the remote storage KV cache, whose benefit is exactly what we measure in step 4.

long_doc_qa.py example command line
```
python3 ~/LMCache/benchmarks/long_doc_qa/long_doc_qa.py \
    --model Qwen/Qwen3-32B \
    --port 8000 \
    --num-documents 1 \
    --document-length ${len} \
    --output-len 100 \
    --repeat-count 1 \
    --repeat-mode interleave \
    --max-inflight-requests 1 \
    --output results/ttft_${len}.out
```

## Results

![](/img/blogs/kv-caching-ceph/amd-tp1-sweep-qwen.png)
![](/img/blogs/kv-caching-ceph/amd-tp-qwen.png)
![](/img/blogs/kv-caching-ceph/amd-tp-llama.png)

![](/img/blogs/kv-caching-ceph/gaudi3-tp2-sweep-qwen.png)
![](/img/blogs/kv-caching-ceph/gaudi3-tp2-sweep-llama.png)
![](/img/blogs/kv-caching-ceph/gaudi3-tp-qwen.png)
![](/img/blogs/kv-caching-ceph/gaudi3-tp-llama.png)

We measured a considerable reduction in TTFT with both Intel Gaudi3 and AMD MI300X accelerators, with up to a 23x reduction at the longest context length tested with tensor parallelism set to 1 with Llama3.3-70B-Instruct. This testing also illustrates how KV caching can reduce TTFT more than using tensor parallelism to spread prefill across multiple GPUs in a system, and that combining these techniques can deliver the lowest TTFT. It's also worth pointing out that in addition to reducing TTFT, prefix caching derives additional value by conserving GPU cycles for decode - potentially reducing time-per-output-token (TPOT).

## What's next?

We shared our results with the llm-d team at Red Hat and have started working with them to commoditize KV caching by establishing KV caching with Ceph as a [well-lit path](https://www.redhat.com/en/topics/ai/what-is-llm-d#what-are-well-lit-paths). We believe that our approach is perhaps the most accessible because it uses standard object protocols like S3 and standard TCP/IP networking, works with a variety of accelerators from different vendors, and because Ceph object storage is ubiquitously deployed in OpenShift clusters through OpenShift Data Foundation and IBM Fusion. Our next phase of testing will utilize llm-d, with the GPU hosts serving as worker nodes, and will explore more sophisticated scenarios like prefill/decode (PD) disaggregation and cache blending.

Finally, we'd like to thank Supermicro for providing the environment for these testing efforts. If you have any questions about data or AI workloads for Ceph, please [reach out](mailto:kbader@ibm.com).
+ diff --git a/blog/authors.yml b/blog/authors.yml index f3cb57d..6702f1f 100644 --- a/blog/authors.yml +++ b/blog/authors.yml @@ -94,4 +94,16 @@ kayyan: name: Kay Yan title: Principal Software Engineer, DaoCloud url: https://www.linkedin.com/in/yankay/ - image_url: /img/blogs/kayyan.jpg \ No newline at end of file + image_url: /img/blogs/kayyan.jpg + +kylebader: + name: Kyle Bader + title: Chief Architect, Data and AI, Ceph at IBM + url: https://www.linkedin.com/in/kyle-bader-5267a030/ + image_url: /img/blogs/kyle-bader.jpg + +tushargohad: + name: Tushar Gohad + title: Distinguished Engineer, Intel + url: https://www.linkedin.com/in/tushargohad/ + image_url: /img/blogs/tushar-gohad.jpg diff --git a/blog/tags.yml b/blog/tags.yml index db23144..03070ec 100644 --- a/blog/tags.yml +++ b/blog/tags.yml @@ -62,4 +62,24 @@ sig-benchmarking: releases: label: Releases permalink: /releases - description: llm-d release announcements \ No newline at end of file + description: llm-d release announcements + +ceph: + label: Ceph + permalink: /ceph + description: Ceph storage related content + +rgw: + label: RGW + permalink: /rgw + description: RADOS Gateway (RGW) content + +s3: + label: S3 + permalink: /s3 + description: S3 object storage content + +kv-cache: + label: KV Cache + permalink: /kv-cache + description: KV caching for LLM inference \ No newline at end of file diff --git a/static/img/blogs/kv-caching-ceph/amd-tp-llama.png b/static/img/blogs/kv-caching-ceph/amd-tp-llama.png new file mode 100644 index 0000000..7eb4f57 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/amd-tp-llama.png differ diff --git a/static/img/blogs/kv-caching-ceph/amd-tp-qwen.png b/static/img/blogs/kv-caching-ceph/amd-tp-qwen.png new file mode 100644 index 0000000..c416e34 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/amd-tp-qwen.png differ diff --git a/static/img/blogs/kv-caching-ceph/amd-tp1-sweep-qwen.png b/static/img/blogs/kv-caching-ceph/amd-tp1-sweep-qwen.png new file mode 100644 index 0000000..e65a20d Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/amd-tp1-sweep-qwen.png differ diff --git a/static/img/blogs/kv-caching-ceph/amd-tp1-sweep.png b/static/img/blogs/kv-caching-ceph/amd-tp1-sweep.png new file mode 100644 index 0000000..4d68d61 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/amd-tp1-sweep.png differ diff --git a/static/img/blogs/kv-caching-ceph/elbencho.png b/static/img/blogs/kv-caching-ceph/elbencho.png new file mode 100644 index 0000000..3619202 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/elbencho.png differ diff --git a/static/img/blogs/kv-caching-ceph/gaudi3-tp-llama.png b/static/img/blogs/kv-caching-ceph/gaudi3-tp-llama.png new file mode 100644 index 0000000..87704f1 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/gaudi3-tp-llama.png differ diff --git a/static/img/blogs/kv-caching-ceph/gaudi3-tp-qwen.png b/static/img/blogs/kv-caching-ceph/gaudi3-tp-qwen.png new file mode 100644 index 0000000..afc66bb Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/gaudi3-tp-qwen.png differ diff --git a/static/img/blogs/kv-caching-ceph/gaudi3-tp2-sweep-llama.png b/static/img/blogs/kv-caching-ceph/gaudi3-tp2-sweep-llama.png new file mode 100644 index 0000000..31be5a0 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/gaudi3-tp2-sweep-llama.png differ diff --git a/static/img/blogs/kv-caching-ceph/gaudi3-tp2-sweep-qwen.png b/static/img/blogs/kv-caching-ceph/gaudi3-tp2-sweep-qwen.png new file mode 100644 index 
0000000..2999782 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/gaudi3-tp2-sweep-qwen.png differ diff --git a/static/img/blogs/kv-caching-ceph/smci-gaudi3.png b/static/img/blogs/kv-caching-ceph/smci-gaudi3.png new file mode 100644 index 0000000..6015280 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/smci-gaudi3.png differ diff --git a/static/img/blogs/kv-caching-ceph/smci-gpu-aplus.png b/static/img/blogs/kv-caching-ceph/smci-gpu-aplus.png new file mode 100644 index 0000000..32c4612 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/smci-gpu-aplus.png differ diff --git a/static/img/blogs/kv-caching-ceph/smci-sw.png b/static/img/blogs/kv-caching-ceph/smci-sw.png new file mode 100644 index 0000000..42a9ab5 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/smci-sw.png differ diff --git a/static/img/blogs/kv-caching-ceph/smci-x14-grandtwin.png b/static/img/blogs/kv-caching-ceph/smci-x14-grandtwin.png new file mode 100644 index 0000000..4b17233 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/smci-x14-grandtwin.png differ diff --git a/static/img/blogs/kv-caching-ceph/title.png b/static/img/blogs/kv-caching-ceph/title.png new file mode 100644 index 0000000..f21b4a7 Binary files /dev/null and b/static/img/blogs/kv-caching-ceph/title.png differ diff --git a/static/img/blogs/kyle-bader.jpg b/static/img/blogs/kyle-bader.jpg new file mode 100644 index 0000000..c480d4a Binary files /dev/null and b/static/img/blogs/kyle-bader.jpg differ diff --git a/static/img/blogs/tushar-gohad.jpg b/static/img/blogs/tushar-gohad.jpg new file mode 100644 index 0000000..c199b95 Binary files /dev/null and b/static/img/blogs/tushar-gohad.jpg differ