feat(infra): scrape systemd journal logs via promtail #3542

Merged: manamana32321 merged 3 commits into main from feat/promtail-journal-scrape on Apr 27, 2026

manamana32321 (Member) commented Apr 26, 2026

Description

Adds a systemd-journal scrape job to promtail. It whitelists K8s- and security-related units (k3s, k3s-agent, containerd, sshd, sudo, systemd-logind, kernel transport) and mounts the host's /var/log/journal read-only. The memory limit is raised from 128Mi to 256Mi to cover journal cursor overhead.

Problem being solved: during the recent prod control-plane outage investigation, viewing node logs (k3s, kernel link flaps, and so on) required the kubectl debug node + chroot workaround every time. With the systemd journal centralized in Loki, the next incident can be triaged across all nodes with a single LogQL query in Grafana, without depending on SSH key distribution.

Security signals in the L4-cheap tier (sshd / sudo) are collected as well, at almost no extra cost.

Collection scope

| Tier | Units |
| --- | --- |
| L2 Platform | k3s.service, k3s-agent.service, containerd.service |
| L3 OS | kernel transport, systemd-logind.service |
| L4-cheap | sshd.service, sudo |

Out of scope for this PR (to follow in separate PRs): K8s API audit log, K8s Events (eventrouter), Falco.

How it works

How promtail reads the systemd journal and which labels it sends to Loki

Pipeline (3 stages):

[systemd-journald binary files /var/log/journal/<machine-id>/]
        │ sd-journal C lib (not a text tail; reads the binary index directly)
        ▼
[promtail journal source] ← source-side filtering via matches:
        │ metadata → internal labels (__journal__*)
        ▼
[relabel_configs] internal → user-visible label conversion
        │
        ▼
[Loki API push]

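Key points of the journal: block

  • path: /var/log/journal. The location mounted from the host; promtail reads the binary files directly through the sd-journal library.
  • max_age: 24h. Upper bound on catch-up when the positions cursor is missing or stale. It is irrelevant during normal operation (cursor alive) and matters after a node reboot, because positions live on tmpfs under /run/promtail/ and are lost.
  • labels: { job: systemd-journal }. A static label forced onto every entry; the entry point for LogQL {job="systemd-journal"}.
  • matches: source-side filter (non-matching entries are never read at all, which is cheaper than dropping them later). journalctl match syntax:
    • Repeating the same field = OR within that field (UNIT=A UNIT=B → A or B)
    • Different fields = AND
    • + = separates OR groups
    • Our match: (UNIT=k3s|k3s-agent|containerd|sshd|systemd-logind) OR TRANSPORT=kernel OR COMM=sudo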
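
The match rules listed above can be sketched in Python. This is an illustrative model of the journalctl semantics only (same field repeated = OR, different fields = AND, + separates OR groups), not promtail's implementation. The helper names `parse_matches` and `entry_matches` are hypothetical, and the whitelist string is an assumed, shortened spelling using the raw journald field names; the real string lives in values.yaml.

```python
from collections import defaultdict


def parse_matches(expr: str) -> list[dict[str, set[str]]]:
    """Split a journalctl-style match string into OR groups.

    Each group maps a field name to the set of accepted values.
    """
    groups: list[dict[str, set[str]]] = []
    current: dict[str, set[str]] = defaultdict(set)
    for token in expr.split():
        if token == "+":
            # "+" closes the current group and starts a new OR alternative.
            if current:
                groups.append(dict(current))
            current = defaultdict(set)
            continue
        field, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"invalid match term: {token!r}")
        current[field].add(value)
    if current:
        groups.append(dict(current))
    return groups


def entry_matches(entry: dict[str, str], groups: list[dict[str, set[str]]]) -> bool:
    """True if the entry satisfies at least one group.

    Within a group every field must match (AND), and a field matches when
    the entry's value is any one of the listed values (OR).
    """
    return any(
        all(entry.get(field) in values for field, values in group.items())
        for group in groups
    )


# Assumed, shortened spelling of the whitelist (illustration only).
WHITELIST = (
    "_SYSTEMD_UNIT=k3s.service _SYSTEMD_UNIT=sshd.service "
    "+ _TRANSPORT=kernel + _COMM=sudo"
)
groups = parse_matches(WHITELIST)
```

With this model, an entry from k3s.service, any kernel-transport entry, and any entry whose command is sudo all pass, while an entry from an unlisted unit is filtered out before it is read.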

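relabel_configs: label conversion

Internal labels (double-underscore prefix) are dropped before the push to Loki. Anything that should remain visible must be promoted via target_label.

| Internal label | User-visible label | Source |
| --- | --- | --- |
| __journal__systemd_unit (2 underscores) | unit | raw systemd field _SYSTEMD_UNIT |
| __journal__hostname (2) | hostname | raw _HOSTNAME |
| __journal__transport (2) | transport | raw _TRANSPORT |
| __journal_priority_keyword (1) | level | derived by promtail (PRIORITY 0-7 → emerg..debug) |

One underscore vs. two: raw systemd fields appear as __journal__ (double) because promtail prepends __journal_ to the lowercased field name and these fields already start with an underscore, whereas promtail-derived fields appear as __journal_ (single). This is only a naming convention, and it confused the code review bot once (1 false positive).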
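
A minimal Python sketch of the label flow described above: internal names are formed from the journal field, selected labels are promoted, and everything still prefixed with a double underscore is dropped. It models the behavior only and is not promtail code; `internal_labels`, `relabel`, and `PROMOTE` are hypothetical names, and the priority keyword is set by hand rather than derived.

```python
def internal_labels(fields: dict[str, str]) -> dict[str, str]:
    """Name internal labels as "__journal_" + lowercased journal field."""
    return {"__journal_" + name.lower(): value for name, value in fields.items()}


# Hypothetical stand-in for the relabel_configs rules: source label -> target_label.
PROMOTE = {
    "__journal__systemd_unit": "unit",
    "__journal__hostname": "hostname",
    "__journal__transport": "transport",
    "__journal_priority_keyword": "level",
}


def relabel(labels: dict[str, str]) -> dict[str, str]:
    """Copy promoted labels to their targets, then drop every "__" label."""
    out = dict(labels)
    for source, target in PROMOTE.items():
        if source in out:
            out[target] = out[source]
    return {k: v for k, v in out.items() if not k.startswith("__")}


entry_fields = {
    "_SYSTEMD_UNIT": "k3s.service",
    "_HOSTNAME": "skkuding-4f-2",
    "_TRANSPORT": "stdout",
    "_PID": "1234",
}
labels = internal_labels(entry_fields)
# promtail derives this one from PRIORITY; it is set by hand in this sketch.
labels["__journal_priority_keyword"] = "warning"
labels["job"] = "systemd-journal"
visible = relabel(labels)
print(visible)
```

Note that _PID becomes __journal__pid and is dropped because nothing promotes it, which is how high-cardinality fields stay out of the stream labels.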

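Label cardinality

Loki creates one stream per label combination. Our combination:

  • job 1 × hostname 5 × unit ~7 × transport 4 × level 8 = theoretical max of ~1,120 streams
  • In practice the combinations are sparse, so 200 to 400 streams are expected (the recommended limit for a single Loki tenant is 10,000 to 100,000)
  • High-cardinality fields such as _PID and _UID are not promoted to labels; full-text search (|=) is sufficient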
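
The upper bound above is a plain product of the per-label value counts. A short Python check, using the counts stated in the description (the unit count is approximate) and a made-up PID count to show why such fields are kept out of the label set:

```python
from math import prod

# Label value counts as stated in the PR description (unit is approximate).
label_values = {"job": 1, "hostname": 5, "unit": 7, "transport": 4, "level": 8}

max_streams = prod(label_values.values())
print(f"theoretical max streams: {max_streams}")

# Hypothetical: promoting _PID with 1,000 distinct values multiplies the bound.
ASSUMED_PID_VALUES = 1_000
with_pid = max_streams * ASSUMED_PID_VALUES
print(f"with _PID as a label: {with_pid}")
```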

LogQL usage examples

# k3s errors on a specific node
{job="systemd-journal", hostname="skkuding-4f-2", unit="k3s.service", level=~"err|warning"}

# kernel link flap (the pattern from the last control-plane incident)
{job="systemd-journal", transport="kernel"} |~ "eno|link|carrier"

# hourly k3s/agent error count across the 5 nodes (for a Grafana panel)
sum by (hostname) (
  count_over_time({job="systemd-journal", unit=~"k3s(-agent)?\\.service", level=~"err|warning"}[1h])
)
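
For running such a query outside Grafana, the following Python sketch builds a request URL for Loki's query_range HTTP endpoint. The base URL is a placeholder rather than this cluster's address, the helper name `query_range_url` is hypothetical, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

LOKI_BASE_URL = "http://loki.example:3100"  # placeholder, not the real address


def query_range_url(logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{params}"


logql = '{job="systemd-journal", transport="kernel"} |~ "eno|link|carrier"'
url = query_range_url(
    logql,
    start_ns=1_700_000_000_000_000_000,  # arbitrary example window (1 hour)
    end_ns=1_700_003_600_000_000_000,
)

# Decoding the query string again yields the original LogQL expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```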

Additional context

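Capacity estimate: results from sampling directly on prod nodes with kubectl debug node:

  • Healthy CP node (4f-1): ~72 MB/day (mostly k3s.service)
  • Healthy worker (4f-5): ~150 KB/day (re-measured value; after the iris CrashLoop patch, only 26 KB of k3s-agent logs were observed in a 6h window)
  • Cluster total: ~250 MB/day raw (the 3 CP nodes account for most of it)
  • With Loki's ~10:1 compression on minio, 90-day retention adds about 2.3 GB on minio
  • Current minio headroom: 3 TB+ (3% in use), so effectively no impact

The initial estimate was 700 MB/day, but after the iris CrashLoop patch the worker-node k3s-agent logs dropped roughly a thousandfold (previously ~171 MB/day → now ~100 KB/day). The figures above have been updated with actual measurements.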
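
The storage figures can be reproduced with a few lines of Python. This sketch assumes decimal units (1 GB = 1000 MB) and takes the ~10:1 compression ratio and 90-day retention from the description; `stored_gb` is a hypothetical helper name.

```python
RETENTION_DAYS = 90
COMPRESSION_RATIO = 10  # ~10:1, as assumed in the PR description


def stored_gb(raw_mb_per_day: float) -> float:
    """Compressed footprint over the retention window, in decimal GB."""
    return raw_mb_per_day * RETENTION_DAYS / COMPRESSION_RATIO / 1000


measured = stored_gb(250)  # measured cluster total, ~250 MB/day raw
initial = stored_gb(700)   # initial estimate, ~700 MB/day raw

# Worker-node k3s-agent reduction: ~171 MB/day down to ~100 KB/day.
worker_reduction = (171 * 1000) / 100

print(f"measured: {measured:.2f} GB, initial estimate: {initial:.1f} GB")
print(f"worker log reduction: ~{worker_reduction:.0f}x")
```

Under these assumptions the measured rate gives 2.25 GB, in line with the roughly 2.3 GB quoted above, and the initial 700 MB/day estimate gives 6.3 GB, which matches the "~6 GB" in the first commit message.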

No node-side work is needed: all 5 prod nodes already have the /var/log/journal/ directory, and journald runs in persistent mode under the default Storage=auto. No Ansible playbook changes.

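Review points

  • matches: whitelist syntax (journalctl style; + is OR across fields, repeating the same field is OR within that field)
  • Memory raised to 256Mi: is the buffer sufficient relative to journal volume?
  • promtail Helm chart 6.17.1 prints a deprecation warning. Migration to Grafana Alloy is separate work, and this PR makes no schema changes

Verification plan (on the stage cluster after merging to main):

  • Confirm that {job="systemd-journal"} returns results in stage Loki
  • Per-unit sanity check: {unit="k3s.service"}, {transport="kernel"}, {unit="sshd.service"}
  • promtail pod memory stays within 256 MiB
  • Check the increase in Loki ingest rate
  • No regression in the existing kubernetes-pods scrape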

Before submitting the PR, please make sure you do the following

  • Read the Contributing Guidelines
  • Read the Contributing Guidelines and follow the Commit Convention
  • Provide a description in this PR that addresses what the PR is solving, or reference the issue that it solves (e.g. fixes #123).
  • Ideally, include relevant tests that fail without this PR but pass with it. (N/A: this is an infrastructure configuration change that can only be verified at runtime; see the verification plan above.)

🤖 Generated with Claude Code

Add a journal scrape job to promtail with a whitelist of K8s and
security units (k3s, k3s-agent, containerd, sshd, systemd-logind,
sudo, kernel transport). Mounts host /var/log/journal as read-only.

Bumps memory limit 128Mi -> 256Mi for journal cursor overhead.

Estimated additional ingest: ~700 MB/day raw (~6 GB on minio over
90-day retention, well within capacity).

Enables LogQL queries like {job="systemd-journal", unit="k3s.service"}
for control plane debugging without SSH dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request enables Promtail to scrape systemd-journal logs by mounting the host journal directory and adding a dedicated scrape configuration for system services like k3s and containerd. It also increases the memory limit for the Promtail deployment. Feedback suggests increasing the memory request to match the limit for 'Guaranteed' QoS and extending the journal 'max_age' to 24h to prevent data loss during extended downtime.

Comment thread infra/k8s/monitoring/promtail/values.yaml
Comment thread infra/k8s/monitoring/promtail/values.yaml Outdated

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0e15e68e1c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread infra/k8s/monitoring/promtail/values.yaml

manamana32321 (Member, Author) commented

Code review

Found 1 issue:

  1. The PR mounts host /var/log/journal but does not mount host /etc/machine-id. systemd journal stores logs under /var/log/journal/<machine-id>/, and the journal reader uses the value of /etc/machine-id (as seen by the process) to pick the subdirectory. Inside a container that ID is the container's, not the host's, so the scraper opens a non-existent subdirectory and the new systemd-journal job silently ingests zero logs. Independently flagged inline by chatgpt-codex-connector. Fix: add a second hostPath volume + read-only mount for /etc/machine-id.

# Mount host journal directory so promtail can scrape systemd-journald
extraVolumes:
  - name: journal
    hostPath:
      path: /var/log/journal
extraVolumeMounts:
  - name: journal
    mountPath: /var/log/journal
    readOnly: true
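
The failure mode in this finding can be reproduced in miniature with a Python sketch that mimics the lookup as the review describes it: the journal subdirectory is chosen from the machine ID the process sees. Both IDs and the helper `journal_dir` are made up for illustration, and a temporary directory stands in for /var/log/journal.

```python
import tempfile
from pathlib import Path

HOST_ID = "0123456789abcdef0123456789abcdef"       # made-up host machine-id
CONTAINER_ID = "fedcba9876543210fedcba9876543210"  # made-up container machine-id


def journal_dir(journal_root: Path, machine_id: str) -> Path:
    """Directory a reader would open for the given machine-id."""
    return journal_root / machine_id.strip()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # The host journal mounted into the pod: only the host's subdirectory exists.
    (root / HOST_ID).mkdir()

    # Without the /etc/machine-id mount the process sees the container's ID.
    without_mount = journal_dir(root, CONTAINER_ID).is_dir()
    # With the host file mounted, the IDs agree and the directory is found.
    with_mount = journal_dir(root, HOST_ID).is_dir()

print(f"container machine-id finds journal: {without_mount}")
print(f"host machine-id finds journal: {with_mount}")
```

Because a missing subdirectory is not an error in this model, the first case simply yields nothing to read, which is consistent with the "zero logs, no error surfaced" symptom described here.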

Bot reviewers (gemini-code-assist, chatgpt-codex-connector) raised two additional points that did not meet this review's confidence threshold but are worth a glance: bumping requests.memory to 256Mi for Guaranteed QoS, and increasing max_age from 12h to 24h-48h to survive longer outages. The requests.memory line is unmodified by this PR; the max_age is a tunable trade-off.

🤖 Generated with Claude Code

- If this code review was useful, please react with thumbs-up. Otherwise, react with thumbs-down.

manamana32321 and others added 2 commits April 27, 2026 01:03
Without the host machine-id, the journal reader inside the container
uses the container's own /etc/machine-id and looks for journals under
/var/log/journal/<container-machine-id>/, which does not exist on the
host filesystem. Result: zero logs ingested, no error surfaced.

Independently flagged by codex and the in-house code review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

max_age caps how far back promtail catches up when the positions cursor
is unavailable. Positions live on tmpfs (/run/promtail/) so a node reboot
triggers catch-up from now-max_age. 12h misses logs from a weekday
outage that recovers the next morning; 24h covers that case while
keeping startup catch-up bounded.

Suggested by gemini-code-assist on PR review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
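
The 12h versus 24h trade-off in this commit message can be illustrated with a small Python sketch. It models the cursor as a timestamp for simplicity (the real cursor is an opaque journal position), and the outage timeline and the helper `catch_up_start` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def catch_up_start(
    now: datetime, max_age: timedelta, cursor: Optional[datetime]
) -> datetime:
    """Earliest entry timestamp read on startup.

    With a usable cursor, reading resumes there; without one (or when it is
    older than max_age), reading starts at now - max_age.
    """
    oldest_allowed = now - max_age
    if cursor is None or cursor < oldest_allowed:
        return oldest_allowed
    return cursor


# Hypothetical timeline: node goes down at 18:00, comes back at 09:00 next day.
outage_start = datetime(2026, 4, 27, 18, 0)
recovered = datetime(2026, 4, 28, 9, 0)

# Positions are on tmpfs, so after the reboot there is no cursor.
start_12h = catch_up_start(recovered, timedelta(hours=12), cursor=None)
start_24h = catch_up_start(recovered, timedelta(hours=24), cursor=None)

print(f"12h window starts at {start_12h}, covers outage start: {start_12h <= outage_start}")
print(f"24h window starts at {start_24h}, covers outage start: {start_24h <= outage_start}")
```

In this timeline the 12h window begins at 21:00, after the node went down at 18:00, so the entries leading up to the outage are skipped; the 24h window begins at 09:00 the previous day and includes them.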

sunghyun1000 (Member) left a comment

LGTM !

manamana32321 enabled auto-merge April 27, 2026 01:50
manamana32321 added this pull request to the merge queue Apr 27, 2026
Merged via the queue into main with commit badbbfe Apr 27, 2026
37 checks passed
manamana32321 deleted the feat/promtail-journal-scrape branch April 27, 2026 02:15