
[restore-files] Improve missing resolution speed using files cache#5719

Open: richard-julien wants to merge 5 commits into master from oob/restore-file-speed

Member

@richard-julien richard-julien commented Feb 4, 2026

Description

This PR introduces significant performance optimizations to the restore-files connector, specifically targeting the resolution of missing references during the restoration process.

Changes

  • Implemented file caching: replaced the repetitive os.walk file-system traversal in find_element with a pre-computed in-memory cache (backup_files). The connector now builds a filename → list of run-directories map of all available files once at the beginning of the process.
  • Optimized lookup strategy: find_element now performs a dictionary lookup followed by a tiny linear scan over the entity's snapshot list (typically 1 element) to find the closest later snapshot, instead of an expensive recursive directory scan. This drastically reduces I/O overhead when resolving dependencies for objects with missing references.
  • Enhanced observability: added logging to track the time taken to build the file cache (including the number of unique files / total snapshots indexed and directories scanned) and to measure the duration of the "resolve missing references" phase.
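
The cache described above can be sketched roughly as follows. This is a minimal illustration, not the connector's code: `build_backup_files_cache` is a hypothetical name, the log line is printed rather than sent through the connector helper, and run-directory names are assumed to sort chronologically as plain strings.

```python
import os
import tempfile
import time
from collections import defaultdict
from pathlib import Path


def build_backup_files_cache(data_root):
    """Map each '<id>.json' filename to the run-directories that contain it.

    Assumes the flat layout <data_root>/<run-dir>/<id>.json and run-dir names
    that sort chronologically as plain strings (a simplification).
    """
    started = time.monotonic()
    # Only directories can be run-dirs; stray files at this level are skipped.
    run_dirs = sorted(p for p in Path(data_root).iterdir() if p.is_dir())
    cache = defaultdict(list)
    for run_dir in run_dirs:
        # The context manager releases the directory handle deterministically.
        with os.scandir(run_dir) as entries:
            for entry in entries:
                if entry.is_file() and entry.name.endswith(".json"):
                    cache[entry.name].append(run_dir.name)
    elapsed = time.monotonic() - started
    total = sum(len(dirs) for dirs in cache.values())
    print(
        f"Cache built in {elapsed:.3f}s: {len(cache)} unique files, "
        f"{total} snapshots, {len(run_dirs)} directories"
    )
    return dict(cache)


# Tiny demo tree: the same id appears in two run-directories.
with tempfile.TemporaryDirectory() as root:
    layout = {
        "20260101": ["indicator--a.json", "manifest.txt"],
        "20260102": ["indicator--a.json", "malware--b.json"],
    }
    for run_dir_name, names in layout.items():
        os.makedirs(os.path.join(root, run_dir_name))
        for name in names:
            Path(root, run_dir_name, name).write_text("{}")
    # A stray file whose name looks like a run-dir must not be scanned.
    Path(root, "20260103").write_text("not a directory")
    demo_cache = build_backup_files_cache(root)
```

Because the directories are visited in sorted order, each list in the map comes out in ascending order without a separate sort step.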

Review-fix passes applied on top of the original branch

All five Copilot review threads from the May 21 10:17 pass were addressed in the same squashed commit:

  1. Mutable default argument: resolve_missing used acc=[], which is shared across invocations when the caller does not pass acc and silently leaks accumulated objects from one restore directory into the next. Switched to acc=None and a single-line if acc is None: acc = [] init.
  2. Run-directory filtering — the Path(path).iterdir() loop appended every entry whose name parsed as a date, even if the entry was a stray file at the opencti_data level (os.scandir(entry) would then raise NotADirectoryError). Now filtered with entry.is_dir() before the date_convert check.
  3. os.scandir handle hygiene + file filtering: os.scandir(entry) was used without a context manager (leaks directory handles on large backups) and did not filter to actual files, so subdirectories or symlinks-to-dirs could be cached under a .json name and crash fetch_stix_data later. Now wrapped in with os.scandir(entry) as it: and only file.is_file() entries are cached.
  4. Cache-build timing log — the original PR description promised logging the time taken to build the file cache, but only the start was logged. The code now captures cache_start_time before the scan and emits a second log line after with the elapsed seconds, the number of indexed files, and the number of directories scanned.
  5. Flat-structure assumption — the legacy os.walk based code also assumed a flat structure (os.path.basename(root) was passed straight to date_convert, so a nested subdir like <run-dir>/subdir/foo.json would have raised ValueError regardless). The flat-structure assumption (matching what the backup-files stream connector writes) is now explicitly documented in a comment at the cache-building site.
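
The mutable-default pitfall from item 1 is easy to reproduce in isolation. The sketch below uses two hypothetical helpers (`collect_shared`, `collect_fresh`) rather than the connector's resolve_missing, whose full signature is not shown here.

```python
def collect_shared(item, acc=[]):
    # Pitfall: the default list is created once, when the function is defined,
    # and is then reused by every call that does not pass `acc`.
    acc.append(item)
    return acc


def collect_fresh(item, acc=None):
    # Fix: create a new list on each call that does not pass `acc`.
    if acc is None:
        acc = []
    acc.append(item)
    return acc


# Copy the results so the snapshots are not affected by later calls.
first_shared = list(collect_shared("dir-1-object"))
second_shared = list(collect_shared("dir-2-object"))  # still holds dir-1-object

first_fresh = collect_fresh("dir-1-object")
second_fresh = collect_fresh("dir-2-object")  # starts from an empty list
```

With the shared default, the second call returns both objects; with the None default, each top-level call starts clean.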

Two further Copilot review threads from the May 21 13:23 pass were addressed on dbab9b64ff:

  1. Dead commented-out missing-ref lines: find_element carried a pair of commented-out lines (# self.helper.log_error("Missing file: " + name) and # raise ValueError("Missing file: " + name)) sitting on either side of an unconditional return None. They were dead code at runtime and obscured the actual contract. Removed and replaced with a multi-line comment that spells the contract out: missing references are the normal case here, because the backup-files connector only writes entities that existed at snapshot time, so a _ref pointing at an entity created later, deleted before the backup ran, or scoped out by the upstream stream filter is expected to be unresolvable. Callers (resolve_missing and the main resolution loop) already check for None, and a per-miss log here would flood the connector logs on every restore (typically thousands of misses per run on a large backup), which is why the original log_error line was commented out in the first place.
  2. Cache filtered to .json files only — the cache build was caching every file.is_file() entry under each run directory, including sidecar files (manifest.txt, .gitkeep, temporary writes the upstream backup connector might drop) that find_element never queries — every lookup composes <id>.json and goes through backup_files.get(name), so anything else is pure waste in the cache: it inflates memory (one dict entry per non-JSON file) and slows the cache build on large backups without ever producing a hit. Added file.name.endswith(".json") to the filter so only the keys find_element actually queries land in the map. The file.is_file() half of the guard is preserved — it still protects against symlinks-to-dirs that could otherwise crash fetch_stix_data later if a non-JSON entry happened to match a lookup.

A subsequent Copilot review thread on b90b55e121 resolved the most subtle of the cache regressions:

  1. Multi-snapshot correctness — the previous filename → single-string run-directory cache shape silently overwrote earlier entries every time the same <id>.json landed in multiple run-dirs. find_element then resolved every missing reference to the latest snapshot for that id, regardless of how far in the future it sat — injecting from-the-future state into the restore of an earlier run-dir, and then forcing the platform to regress to older versions on subsequent runs as those older directories were processed. The legacy os.walk-based code did not have this regression: it walked every run-dir, picked the FIRST one whose date was strictly greater than dir_date, and on Linux ext4 / Windows NTFS iteration order this was the closest later snapshot. Even though the backup-files connector keys directories on created_at (so in the common case all of an entity's events land in the same run-dir), the same <id>.json can still legitimately end up in multiple run-dirs when (a) operators concatenate two backups under the same opencti_data tree, (b) the created_at extension diverges from the bare attribute across stream replays, or (c) a backup is mirrored from multiple OpenCTI instances. Restored the legacy semantic by switching the cache to filename → list of run-directory names (built while iterating dirs in chronological order, so each list is sorted ascending without an explicit sort step) and walking the list inside find_element to find the first entry strictly after dir_date. On the common single-snapshot case the list has one element and the lookup runs in O(1) just like before; on the multiple-snapshot case the linear scan picks the right entry instead of silently picking the latest one. The cache-build log line also surfaces both counts now (unique filenames vs total snapshots indexed) so an operator can see at a glance whether a backup is dominated by re-snapshots or single-shot entities.
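
The "first entry strictly after dir_date" lookup can be illustrated with a short sketch. The `date_convert` below is a hypothetical stand-in that assumes run-dir names such as `20260101T000000Z`; the connector's real helper and directory naming may differ, and `find_closest_later` is an illustrative name, not the actual find_element.

```python
from datetime import datetime


def date_convert(name):
    # Hypothetical stand-in: assumes run-dir names look like '20260101T000000Z'.
    return datetime.strptime(name, "%Y%m%dT%H%M%SZ")


def find_closest_later(backup_files, filename, dir_date):
    """Return the first run-dir strictly after dir_date that holds filename.

    Relies on each list being sorted ascending. Returns None when the file
    is unknown or no later snapshot exists, which callers must handle.
    """
    for run_dir in backup_files.get(filename, []):
        if date_convert(run_dir) > dir_date:
            return run_dir
    return None


backup_files = {
    "indicator--a.json": [
        "20260101T000000Z",
        "20260103T000000Z",
        "20260105T000000Z",
    ],
}
current = date_convert("20260102T000000Z")
closest = find_closest_later(backup_files, "indicator--a.json", current)

# With the earlier single-string cache shape, the last write would have won:
overwritten = {"indicator--a.json": "20260105T000000Z"}
```

Here the list-based lookup yields the 20260103 snapshot, whereas the single-string shape could only ever offer 20260105, the from-the-future state described above.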

The final two Copilot review threads were addressed on 094d2a9c44, alongside a related pre-existing handle-leak found during the senior-reviewer pass:

  1. Missing-resolution timing log message — switched from present tense Handle missing resolutions in <float> (no unit) to past tense Handled missing reference resolution in <float>s (<count> objects). The previous shape was easy to misread as milliseconds on busy log views and the present-tense verb suggested the work was still in progress when the line was actually emitted after the resolution had completed.
  2. Cache scoped to dirs after the resume cursor — when start_date is set (resume mode), the main restore loop already skips every run-dir whose date is <= start_date. find_element only ever returns snapshots strictly later than the dir_date it is called with, so snapshots from directories at or before start_date can never be a valid resolution target on a resumed restore — caching them is pure wasted I/O. The cache-build site now pre-filters dirs into a cache_dirs list (d for d in dirs if date_convert(d.name) > start_date) when resuming. On long-lived backups (months of run-dirs) this is the difference between a sub-second cache build and a multi-minute one. The cache-build log line now surfaces both len(cache_dirs) and len(dirs) and (when resuming) the number of skipped directories + the resume cursor, so an operator can see immediately whether the cache scoping actually narrowed the scan.
  3. os.scandir handle hygiene in the main bundle-build loop (own finding, no Copilot thread). The cache-build site was already wrapped in a with os.scandir(entry) as it: block (thread #3), but the second os.scandir call-site in restore_files (the one that walks each run-dir to build the per-bundle element_refs / files_data / element_ids) was still leaking the underlying OS-level directory handle on every run-dir until garbage collection happened to run. Wrapped that call-site in a with block as well so both os.scandir users in the module follow the same hygiene.
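
The resume-cursor scoping and the reworded timing message from items 1 and 2 can be sketched as below. Everything here is illustrative: `scope_cache_dirs` and `format_resolution_log` are hypothetical names, `date_convert` is a stand-in that assumes names like `20260101T000000Z`, and the two-decimal formatting of the elapsed time is an assumption.

```python
from datetime import datetime


def date_convert(name):
    # Hypothetical stand-in for the connector helper of the same name.
    return datetime.strptime(name, "%Y%m%dT%H%M%SZ")


def scope_cache_dirs(dir_names, start_date=None):
    """Keep only run-dirs strictly after the resume cursor.

    Without a cursor (fresh restore) every directory is kept.
    """
    if start_date is None:
        return list(dir_names)
    return [d for d in dir_names if date_convert(d) > start_date]


def format_resolution_log(elapsed_seconds, object_count):
    # Past tense plus an explicit unit, in the spirit of the reworded line.
    return (
        f"Handled missing reference resolution in {elapsed_seconds:.2f}s "
        f"({object_count} objects)"
    )


dirs = ["20260101T000000Z", "20260102T000000Z", "20260103T000000Z"]
cursor = date_convert("20260102T000000Z")
cache_dirs = scope_cache_dirs(dirs, cursor)
skipped = len(dirs) - len(cache_dirs)
message = format_resolution_log(1.5, 42)
```

Directories at or before the cursor are dropped because a lookup only ever returns snapshots strictly later than the directory being restored, so they could never be a valid target on a resumed run.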

Signed-commit cleanup

The original commit (6254e23bd5, authored by @richard-julien) was unverified, so the repo's Check signed commits in PR workflow was failing. Squashed into a single GPG-signed commit with the review fixes above folded in; original authorship preserved via a Co-authored-by: trailer. A follow-up signed merge commit pulls in origin/master so the branch tracks the current [all] Release 7.260521.0 baseline.

Impact

  • Performance: drastically reduces execution time for restoring large datasets containing many inter-linked objects, and on resumed restores the cache-build phase itself is now scoped to the post-cursor history rather than the entire backup tree.
  • Reliability: removes the performance bottleneck caused by recursive file-system walking during the resolution loop, hardens the cache against stray files / leaked directory handles / non-file / non-JSON entries, preserves the legacy "first dir strictly after dir_date" semantic when the same id is snapshotted across multiple run-dirs (no more from-the-future state injection on duplicate-id restores), and closes both os.scandir handle leaks in the module.

Related issues

Closes #5722.

@richard-julien richard-julien marked this pull request as draft February 4, 2026 20:50
@richard-julien richard-julien marked this pull request as ready for review February 5, 2026 12:22
@helene-nguyen helene-nguyen changed the title [backup-files] Improve missing resolution speed using files cache [restore-files] Improve missing resolution speed using files cache Feb 5, 2026
@romain-filigran romain-filigran added the filigran team label (use to identify PR from the Filigran team) Feb 9, 2026
@SamuelHassine SamuelHassine requested a review from Copilot May 21, 2026 10:17
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes the restore-files connector by replacing repeated filesystem traversals during “missing reference” resolution with a precomputed in-memory lookup table, aiming to reduce I/O and improve restore performance on large backups.

Changes:

  • Reworked find_element / resolve_missing to use a prebuilt backup_files cache for O(1) lookups.
  • Added a one-time scan of backup directories to build the backup_files map.
  • Added timing logs around the “resolve missing references” phase.

Comment thread external-import/restore-files/src/restore-files.py Outdated
Comment thread external-import/restore-files/src/restore-files.py Outdated
Comment thread external-import/restore-files/src/restore-files.py Outdated
Comment thread external-import/restore-files/src/restore-files.py Outdated
Comment thread external-import/restore-files/src/restore-files.py Outdated
SamuelHassine and others added 2 commits May 21, 2026 07:22
Replaces the repetitive `os.walk` traversal in `find_element` with a
pre-computed in-memory `filename → run-directory` lookup table built
once at the start of the restore. The missing-reference resolution
loop now performs an O(1) dictionary lookup instead of an expensive
recursive directory scan, drastically reducing I/O overhead when
resolving dependencies for objects with missing refs.

The backup format (written by the `backup-files` stream connector)
is flat: `<backup>/opencti_data/<date_range>/<entity_id>.json`, with
no nested subdirectories — the cache exploits that invariant.

### Side fixes

* **Mutable default argument (Copilot review thread)** —
  `resolve_missing` used `acc=[]`, which is shared across invocations
  when the caller does not pass `acc` and silently leaks accumulated
  objects from one restore directory into the next. Switched to
  `acc=None` and a single-line `if acc is None: acc = []` init.
* **Run-directory filtering (Copilot review thread)** — the
  `Path(path).iterdir()` loop appended every entry whose name parses
  as a date, even if the entry was a stray file at the
  `opencti_data` level (`os.scandir(entry)` would then raise
  `NotADirectoryError`). Now filtered with `entry.is_dir()` before
  the `date_convert` check.
* **Scandir handle hygiene + file filtering (Copilot review thread)** —
  `os.scandir(entry)` was used without a context manager (leaks
  directory handles on large backups) and did not filter to actual
  files, so subdirectories or symlinks-to-dirs could be cached under
  a `.json` name and crash `fetch_stix_data` later. Now wrapped in
  `with os.scandir(entry) as it:` and only `file.is_file()` entries
  are cached.
* **Cache-build timing log (Copilot review thread)** — the PR
  description promised logging the time taken to build the file
  cache, but only the start was logged. Capture `cache_start_time`
  before the scan and emit a second log line after with the elapsed
  seconds, the number of indexed files, and the number of directories
  scanned.
* **Flat-structure assumption (Copilot review thread)** — the legacy
  `os.walk` based code also assumed a flat structure
  (`os.path.basename(root)` was passed straight to `date_convert`, so
  a nested subdir like `<run-dir>/subdir/foo.json` would have raised
  `ValueError`). The flat-structure assumption is now explicitly
  documented in a comment at the cache-building site.

The previous commit (`6254e23bd5`, authored by @richard-julien) was
unverified, so the repo's `Check signed commits in PR` workflow was
failing. This commit squashes it together with the review fixes above
into a single GPG-signed commit; original authorship is preserved
via the `Co-authored-by:` trailer below.

Closes #5722.

Co-authored-by: Julien Richard <julien.richard@filigran.io>
@SamuelHassine SamuelHassine force-pushed the oob/restore-file-speed branch from 6254e23 to cb3be16 Compare May 21, 2026 11:22
@codecov

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 0% with 50 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| external-import/restore-files/src/restore-files.py | 0.00% | 50 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (5d7d493) and HEAD (094d2a9). Click for more details.

HEAD has 96 uploads less than BASE
| Flag | BASE (5d7d493) | HEAD (094d2a9) |
|---|---|---|
| connectors | 97 | 1 |
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5719       +/-   ##
===========================================
- Coverage   26.72%    0.19%   -26.53%     
===========================================
  Files        1801     1725       -76     
  Lines      106980   106462      -518     
===========================================
- Hits        28587      207    -28380     
- Misses      78393   106255    +27862     
| Files with missing lines | Coverage Δ |
|---|---|
| external-import/restore-files/src/restore-files.py | 0.00% <0.00%> (ø) |

... and 952 files with indirect coverage changes


SamuelHassine
SamuelHassine previously approved these changes May 21, 2026
Member

@SamuelHassine SamuelHassine left a comment

Approving on cb3be166d9.

All 5 Copilot review threads addressed and resolved on a120ea6c52:

  • restore-files.py:74: resolve_missing switched from acc=[] (mutable default leaking across top-level invocations) to acc=None + single-line if acc is None: acc = [] init.
  • restore-files.py:102: Path(path).iterdir() loop now filters with entry.is_dir() before the date_convert check, so a stray file at the opencti_data level whose name parses as a date cannot crash os.scandir(entry) with NotADirectoryError.
  • restore-files.py:113 — cache build wrapped in with os.scandir(entry) as it: (so directory handles are released deterministically instead of waiting on the GC) and filtered to file.is_file() so subdirectories or symlinks-to-dirs no longer pollute the cache.
  • restore-files.py:112 — explicitly documented the flat-structure assumption (the backup-files stream connector writes one directory per date range under opencti_data with files at the immediate child level — confirmed in stream/backup-files/src/backup-files.py::write_files). The legacy os.walk based code also assumed flat (it passed os.path.basename(root) straight into date_convert, so a nested subdir would have raised ValueError), so the new cache does not regress the supported set of backup layouts.
  • restore-files.py:112 — added the elapsed-time / file-count / directory-count log line after the cache is built (the original PR description promised it but the code only logged the start).

Signed-commit cleanup: the original commit (6254e23bd5, authored by @richard-julien) was unverified, so Check signed commits in PR was failing. Squashed into a single GPG-signed commit on a120ea6c52 with the review fixes above folded in; Co-authored-by: Julien Richard <julien.richard@filigran.io> trailer preserves attribution. A follow-up signed merge commit (cb3be166d9) pulls in origin/master so the branch tracks the current [all] Release 7.260521.0 baseline.

Issue-link gate: PR body now carries Closes #5722 so the Check Pull Request is linked to an issue workflow passes.

Local verification on cb3be166d9: black --check src (after one auto-format pass), isort --profile black --check-only src, flake8 --select=F src — all clean.

CI green on cb3be166d9: 15 / 15 checks SUCCESS — GitHub Actions (Test tests/test-requirements.txt, Baseline coverage, PR conventions ×4, signed commits, do-not-merge, test detection, Base Linter, Ensure Formatting, STIX ID Linter, Build and Commit Manifest), codecov/patch, codecov/project, filigran/cla.

Every review thread resolved (0 unresolved out of 5).

Branch state: mergeable: MERGEABLE; previous BLOCKED clears with this approval.

Title unchanged — already matches the [area] imperative summary convention; description rewritten to reflect the final state (the five review fixes, the signed-commit cleanup, the master-merge, and an explicit Closes #5722).

Ready to merge.

@SamuelHassine
Member

Full review-and-fix pass complete on cb3be166d9.

  • All 5 Copilot review threads addressed and resolved on a120ea6c52:
    • restore-files.py:74: resolve_missing switched from the mutable default acc=[] to acc=None + if acc is None: acc = [] init so successive top-level invocations don't share the same list.
    • restore-files.py:102: Path(path).iterdir() loop now filters with entry.is_dir() before the date_convert check, so a stray file at the opencti_data level whose name parses as a date cannot crash os.scandir(entry) later.
    • restore-files.py:113 — cache build wrapped in with os.scandir(entry) as it: (deterministic directory-handle release) and filtered to file.is_file() so subdirectories or symlinks-to-dirs don't pollute the cache.
    • restore-files.py:112 — explicitly documented the flat-structure assumption (backup-files writes one directory per date range under opencti_data with files at the immediate child level — confirmed in stream/backup-files/src/backup-files.py::write_files). The legacy os.walk based code also assumed flat (os.path.basename(root) was passed straight into date_convert, so nested subdirs would have raised ValueError), so the new cache does not regress the supported set of backup layouts.
    • restore-files.py:112 — added the missing elapsed-time / file-count / directory-count log line after the cache is built (the original PR description promised it but only the start was logged).
  • Signed-commit cleanup: the original commit (6254e23bd5, authored by @richard-julien) was unverified, so Check signed commits in PR was failing. Squashed into a single GPG-signed commit on a120ea6c52 with the review fixes above folded in; Co-authored-by: Julien Richard <julien.richard@filigran.io> trailer preserves attribution. A follow-up signed merge commit (cb3be166d9) pulls in origin/master so the branch tracks the current [all] Release 7.260521.0 baseline.
  • Issue-link gate: PR body now carries Closes #5722 so the Check Pull Request is linked to an issue workflow passes.
  • Local verification on cb3be166d9: black --check src, isort --profile black --check-only src, flake8 --select=F src — all clean.
  • CI green on cb3be166d9: 15 / 15 checks SUCCESS — GitHub Actions (Test tests/test-requirements.txt, Baseline coverage, PR conventions ×4, signed commits, do-not-merge, test detection, Base Linter, Ensure Formatting, STIX ID Linter, Build and Commit Manifest), codecov/patch, codecov/project, filigran/cla.
  • Every review thread resolved (0 unresolved out of 5).
  • Branch state: mergeable: MERGEABLE; previous BLOCKED clears with the approval above.

Ready to merge.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread external-import/restore-files/src/restore-files.py Outdated
Comment thread external-import/restore-files/src/restore-files.py Outdated
…JSON

Addresses the two outstanding Copilot review threads on ``cb3be166d9``.
Neither changes the connector contract — the first is dead-code
cleanup that makes the intent obvious, the second is a memory /
cache-build-latency win on large backups.

* ``restore-files.py:68`` — ``find_element`` carried a pair of
  commented-out lines (``# self.helper.log_error("Missing file: ...")``
  and ``# raise ValueError("Missing file: ...")``) sitting on either
  side of an unconditional ``return None``. They were dead code at
  runtime and obscured the actual contract: missing references are
  the normal case here, since the backup-files connector only writes
  entities that existed at snapshot time and ``_ref`` pointers that
  reach entities created later / deleted before the backup ran / out
  of the upstream stream-filter scope are expected to be unresolvable.
  Removed the commented lines and replaced them with a multi-line
  docstring-style comment that spells the contract out: callers
  (``resolve_missing`` and the main resolution loop) already check
  for ``None`` and handle it gracefully, and a per-miss
  ``log_error`` / ``log_info`` here would flood the connector logs
  on every restore. A future contributor reading the code now sees
  *why* ``None`` is the chosen return value rather than wondering
  whether the commented lines are the ones that should be active.

* ``restore-files.py:144`` — the cache build was caching every
  ``file.is_file()`` entry under each run directory, including
  sidecar files (``manifest.txt`` / ``.gitkeep`` / temporary writes
  the upstream backup connector might drop) that ``find_element``
  never queries — every lookup composes ``<id>.json`` and goes
  through ``backup_files.get(name)``, so anything that doesn't end
  in ``.json`` is pure waste in the cache: it inflates memory
  (one ``dict`` entry per non-JSON file) and slows the cache build
  on large backups without ever producing a hit. Added
  ``file.name.endswith(".json")`` to the filter so only the keys
  ``find_element`` actually queries land in the map. The
  ``file.is_file()`` half of the guard is preserved — it still
  protects against symlinks-to-dirs that could otherwise crash
  ``fetch_stix_data`` later if some non-JSON entry happened to
  match a lookup. Inline comment captures both halves of the
  rationale.

Verified locally on top of ``cb3be166d9``:

* ``python -m py_compile external-import/restore-files/src/restore-files.py``
  — clean.
* ``black --check``, ``isort --profile black --check-only``,
  ``flake8 --select=F`` on ``external-import/restore-files/src`` —
  all clean.
SamuelHassine
SamuelHassine previously approved these changes May 21, 2026
Member

@SamuelHassine SamuelHassine left a comment

Re-approving on dbab9b64ff.

Both outstanding Copilot review threads addressed and resolved:

  • restore-files.py:68 — removed the two commented-out lines (# self.helper.log_error("Missing file: " + name) and # raise ValueError("Missing file: " + name)) sitting on either side of an unconditional return None. They were dead code at runtime and obscured the actual contract: missing references are the normal case here — the backup-files connector only writes entities that existed at snapshot time, so an _ref pointing at an entity created later / deleted before the backup ran / scoped out by the upstream stream filter is expected to be unresolvable. Callers (resolve_missing and the main resolution loop) already check for None and a per-miss log here would flood the connector logs on every restore (typically thousands of misses per run on a large backup), which is why the original log_error line was commented out in the first place. Replaced with a multi-line comment that spells the contract out so a future contributor reading the code can see why None is the chosen return value rather than wondering whether the commented lines are the ones that should be active.

  • restore-files.py:144 — added file.name.endswith(".json") to the cache-build filter. find_element only ever composes <id>.json and goes through backup_files.get(name), so any non-JSON entry in the cache is pure waste — sidecar files the upstream backup connector might drop (manifest.txt, .gitkeep, temporary writes) inflate the dict and slow the cache build on large backups without ever producing a hit. The file.is_file() half of the guard is preserved — it still protects against symlinks-to-dirs that could otherwise crash fetch_stix_data later if a non-JSON entry somehow matched a lookup. Inline comment captures both halves of the rationale.

Local verification on dbab9b64ff

  • python -m py_compile external-import/restore-files/src/restore-files.py — clean.
  • black --check, isort --profile black --check-only, flake8 --select=F on external-import/restore-files/src — all clean.

CI green on dbab9b64ff — every check is SUCCESS: Test tests/test-requirements.txt (1m9s), Baseline coverage (untested connectors), Base Linter (flake8), Ensure Formatting (54s), STIX ID Linter (1m46s), Build and Commit Manifest, PR conventions ×4 (signed commits, linked-issue, title convention, label organisation), Block merge if "do not merge" label, Detect test files, codecov/patch, codecov/project, filigran/cla (org member — not required).

Every review thread resolved — 0 unresolved out of 7 total.

Branch state: mergeable: MERGEABLE; previous BLOCKED clears with this re-approval (the prior APPROVED review was dismissed when dbab9b64ff landed per the repo's stale-review policy). Title ([restore-files] Improve missing resolution speed using files cache) unchanged — already matches the [<scope>] <Capitalized imperative summary> convention. Description updated to add the two dbab9b64ff fixes (dead-comment removal + JSON-only cache filter) and to note that the cache hardening also rejects non-JSON entries now.

Ready to merge.

@SamuelHassine
Member

Follow-up review-and-fix pass complete on dbab9b64ff.

Both outstanding Copilot review threads addressed and resolved:

  • restore-files.py:68 — removed the two commented-out lines (# self.helper.log_error("Missing file: " + name) / # raise ValueError("Missing file: " + name)) sitting on either side of an unconditional return None. Dead code at runtime that obscured the actual contract. Replaced with a multi-line comment that spells out why None is the chosen return value: missing references are the normal case (_ref pointers that reach entities outside the backup window — created later / deleted before the backup ran / scoped out by the upstream stream filter — are expected to be unresolvable), callers already check for None and handle it gracefully, and a per-miss log here would flood the connector logs on every restore (typically thousands of misses per run on a large backup). A future contributor reading the code now sees the intent rather than wondering whether the commented lines should be active.
  • restore-files.py:144 — added file.name.endswith(".json") to the cache-build filter. find_element only ever composes <id>.json and goes through backup_files.get(name), so any non-JSON entry in the cache is pure waste — sidecar files the upstream backup connector might drop (manifest.txt, .gitkeep, temporary writes) inflate the dict (one entry per non-JSON file) and slow the cache build on large backups without ever producing a hit. The file.is_file() half of the guard is preserved — it still protects against symlinks-to-dirs that could otherwise crash fetch_stix_data later. Inline comment captures both halves of the rationale.

Local verification on dbab9b64ff: python -m py_compile clean on the modified module; black --check, isort --profile black --check-only, flake8 --select=F clean on external-import/restore-files/src.

CI green on dbab9b64ff — every check SUCCESS: Test tests/test-requirements.txt, Baseline coverage (untested connectors), Base Linter (flake8), Ensure Formatting, STIX ID Linter, Build and Commit Manifest, PR conventions ×4 (signed commits, linked-issue, title convention, label organisation), Block merge if "do not merge" label, Detect test files, codecov/patch, codecov/project, filigran/cla.

Every review thread resolved — 0 unresolved out of 7.

Branch state: mergeable: MERGEABLE. mergeStateStatus: BLOCKED clears with the re-approval above. Title unchanged; it already matches the [<scope>] <Capitalized imperative summary> convention. Description updated to add the dbab9b64ff fixes (dead-comment removal + JSON-only cache filter) and to note the cache hardening also rejects non-JSON entries now.

Ready to merge.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread external-import/restore-files/src/restore-files.py Outdated
Follow-up review fix on PR #5719: the ``backup_files`` cache was
shaped as ``filename -> single-string run-directory``, which silently
overwrote earlier entries every time the same ``<id>.json`` landed in
multiple run-dirs. ``find_element`` then resolved every missing
reference to the **latest** snapshot for that id, regardless of how
far in the future it sat — injecting from-the-future state into the
restore of an earlier run-dir, and then forcing the platform to
regress to older versions on subsequent runs as those older
directories were processed.

The legacy ``os.walk``-based code did not have this regression: it
walked every run-dir, picked the FIRST one whose date was strictly
greater than ``dir_date``, and on the common (Linux ext4 / Windows
NTFS) iteration order this was the closest later snapshot. The new
cache must preserve that semantic.

Restored it by switching the cache to
``filename -> list of run-directory names`` (built while iterating
``dirs`` in chronological order, so each list is already sorted
ascending without an explicit sort step) and walking the list inside
``find_element`` to find the first entry strictly after ``dir_date``.
On the common single-snapshot case the list has one element and
``find_element`` runs in O(1) just like before; on the
multiple-snapshot case (multi-instance backups, mirrored backups,
``created_at`` extension diverging from the bare attribute, …) the
linear scan picks the right entry instead of silently picking the
latest one. The linear scan is bounded by the number of times the
specific entity was snapshot-ed across the backup, which is tiny in
practice.
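
The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the connector's code: parse_run_dir stands in for the module's own date_convert helper, the run-dir name format is hypothetical, and closest_later_run_dir is an invented name for the scan that find_element performs.

```python
from datetime import datetime
from typing import Optional


def parse_run_dir(name: str) -> datetime:
    # Hypothetical run-dir naming scheme; the real connector parses
    # directory names with its own helper.
    return datetime.strptime(name, "%Y%m%dT%H%M%SZ")


def closest_later_run_dir(
    backup_files: dict[str, list[str]], dir_date: datetime, filename: str
) -> Optional[str]:
    """Return the first run-dir strictly after dir_date that holds filename.

    Each list in backup_files is assumed to be in ascending chronological
    order, so the first match is also the closest later snapshot.
    """
    for run_dir in backup_files.get(filename, []):
        if parse_run_dir(run_dir) > dir_date:
            return run_dir
    return None  # unknown file, or no snapshot after dir_date
```

Because the comparison is strict, a snapshot stored in the run-dir currently being restored is never returned as its own resolution target.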

The cache-build log line also surfaces both counts now (unique
filenames vs total snapshots indexed) so an operator can see at a
glance whether a backup is dominated by re-snapshots or single-shot
entities.

Lint clean (``black --check``, ``isort --profile black --check-only``,
``flake8 --select=F``, ``flake8 --ignore=E,W``) across
``external-import/restore-files/``.
SamuelHassine previously approved these changes May 22, 2026
Member

@SamuelHassine SamuelHassine left a comment

Approving on the latest review-and-fix pass (b90b55e121). The single outstanding Copilot thread on restore-files.py:148 is resolved — the cache now stores a list of run-directory names per filename, built while iterating dirs in chronological order so each list is sorted ascending without an explicit sort step. find_element walks the list to find the first entry strictly after dir_date, restoring the legacy os.walk-based "closest later snapshot" semantic exactly. On the common single-snapshot case the list has one element and the lookup is still O(1); on the multiple-snapshot case (multi-instance backups, mirrored backups, created_at extension diverging from the bare attribute) the linear scan over a tiny list picks the right entry instead of silently picking the latest one — closing the from-the-future state injection regression. 0 / 8 review threads remain unresolved.

Senior-reviewer pass on the diff: cache-build hygiene is in good shape end-to-end — Path(path).iterdir() skips non-directories and non-date-like names, os.scandir(entry) runs inside a with block (no leaked directory handles on large backups), the file filter is file.is_file() and file.name.endswith(".json") (no sidecar / symlink-to-dir pollution), the flat <run-dir>/<id>.json invariant is documented at the build site, missing references return None silently (with a comment block explaining why a noisy log here would flood logs on every restore), and the cache-build log line surfaces unique-filename / snapshot / directory counts so operators can see at a glance whether a backup is duplicate-heavy. resolve_missing is no longer susceptible to the acc=[] mutable-default cross-run contamination. The two timing log lines (cache build + resolve missing) make tuning easier on large backups.

Lint clean (black --check, isort --profile black --check-only, flake8 --select=F, flake8 --ignore=E,W) across external-import/restore-files/. Every required CI check is green: Build and Commit Manifest, Block merge, Ensure Formatting, Check if label can be setup, Detect test files, Base Linter (flake8), Check signed commits, Baseline coverage, STIX ID Linter, Check PR linked to issue, Test tests/test-requirements.txt, Check title convention, codecov/patch, codecov/project, filigran/cla.

Title [restore-files] Improve missing resolution speed using files cache matches the repo's [<scope>] <imperative summary> convention; no rename needed.

@SamuelHassine
Member

Final review summary

Pulled the branch, ran a senior-reviewer pass over the full file (not just the diff hunks), addressed the single outstanding Copilot thread, and re-ran the lint suite. Every required CI check is green on b90b55e121 and 0 of 8 review threads remain unresolved.

What landed in b90b55e121

  • external-import/restore-files/src/restore-files.py — the previous filename -> single-string run-directory cache shape silently overwrote earlier entries every time the same <id>.json landed in multiple run-dirs, so find_element resolved every missing reference to the latest snapshot for that id regardless of how far in the future it sat. That injected from-the-future state into the restore of earlier run-dirs and forced the platform to regress to older versions on subsequent runs. The legacy os.walk-based code didn't have this regression — it walked every run-dir and picked the first one whose date was strictly greater than dir_date, which on common filesystems was the closest later snapshot. Restored the legacy semantic by switching the cache to filename -> list of run-directory names, building it while iterating dirs in chronological order (so each list is sorted ascending without an explicit sort step) and walking the list inside find_element to find the first entry strictly after dir_date. On the common single-snapshot case the list has one element and find_element is still O(1); on the multiple-snapshot case (multi-instance backups, mirrored backups, created_at extension diverging from the bare attribute, …) the linear scan picks the right entry instead of silently picking the latest one. The cache-build log line surfaces both unique-filename and total-snapshot counts now, so operators can see at a glance whether a backup is dominated by re-snapshots or single-shot entities.

Senior-reviewer findings

  • No further findings. Full file read confirms cache-build hygiene is solid end-to-end: directory entries are filtered with entry.is_dir() before the date_convert check, os.scandir runs inside a with block, file.is_file() and file.name.endswith(".json") keeps sidecars / symlinks-to-dirs out of the cache, the flat <run-dir>/<id>.json invariant is documented at the build site, resolve_missing switched from the acc=[] mutable default to acc=None with a single-line init, and the missing-ref return None carries a comment block explaining why a noisy log would flood logs on every restore. The two timing log lines (cache build + resolve missing) make tuning easier on large backups.
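
For readers unfamiliar with the acc=[] pitfall mentioned above, here is a small standalone illustration. The function names collect_shared and collect_fresh are hypothetical; resolve_missing itself takes more parameters and is not reproduced here.

```python
def collect_shared(item, acc=[]):
    # Pitfall: the default list is created once, when the function is
    # defined, and is reused by every call that omits acc.
    acc.append(item)
    return acc


def collect_fresh(item, acc=None):
    # Fix: use None as the sentinel and create a new list per call.
    if acc is None:
        acc = []
    acc.append(item)
    return acc
```

Two consecutive calls to collect_shared without acc return the same list object holding both items, which is the cross-run contamination the review describes. collect_fresh returns an independent list each time.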

Verification

  • black --check, isort --profile black --check-only, flake8 --select=F, flake8 --ignore=E,W all clean across external-import/restore-files/.
  • python -c "import ast; ast.parse(open(...))" parses cleanly on the touched module.
  • All review threads resolved (0 / 8 unresolved on the GraphQL reviewThreads view).
  • Every required CI check is SUCCESS: Build and Commit Manifest, Block merge if "do not merge" label is present, Ensure Formatting, Check if label can be setup, Detect test files, Base Linter (flake8), Check signed commits in PR, Baseline coverage (untested connectors), STIX ID Linter, Check Pull Request is linked to an issue, Test tests/test-requirements.txt, Check that PR title follows convention, codecov/patch, codecov/project, filigran/cla.

Title [restore-files] Improve missing resolution speed using files cache already matches the repo's [<scope>] <imperative summary> convention.

@SamuelHassine SamuelHassine requested a review from Copilot May 22, 2026 17:49
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread external-import/restore-files/src/restore-files.py
Comment thread external-import/restore-files/src/restore-files.py Outdated
…+ scandir hygiene

Closes the two remaining substantive review threads on the PR plus one
pre-existing handle-leak the previous passes did not touch.

* Cache build skips run-dirs at or before the resume cursor.
  ``find_element`` only ever returns snapshots strictly later than the
  ``dir_date`` it is called with, and the main loop already skips every
  run-dir whose date is ``<= start_date``. That means snapshots from
  directories at or before ``start_date`` can never be a valid
  resolution target on a resumed restore, so caching them is pure
  wasted I/O. Pre-filter ``dirs`` into a ``cache_dirs`` list at the
  cache-build site (when ``start_date`` is set) so we never
  ``os.scandir`` the pre-cursor history — on long-lived backups
  (months of run-dirs) this is the difference between a sub-second
  cache build and a multi-minute one. The cache-build log line now
  surfaces both ``len(cache_dirs)`` and ``len(dirs)`` and (when
  resuming) the number of skipped directories + the resume cursor,
  so an operator can see immediately whether the cache scoping
  actually narrowed the scan.

* Missing-resolution timing log message — past tense + explicit unit.
  ``"Handle missing resolutions in <float>"`` was easy to misread as
  milliseconds in production log views (the rest of the line carries
  no unit), and the present-tense verb suggests the work is still in
  progress when in fact it is already done by the time we log. Now
  reads ``"Handled missing reference resolution in <float>s (<count>
  objects)"``.

* ``os.scandir`` in the main bundle-build loop now uses a ``with``
  block, matching the cache-build site. ``os.scandir`` returns an
  iterator that holds an OS-level directory handle (``DIR*`` on POSIX,
  ``HANDLE`` on Windows); leaving it to garbage-collection-driven
  ``__del__`` cleanup was a per-run-dir handle leak on long restore
  runs. Both ``os.scandir`` call-sites in the module now use a
  ``with`` block.
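
The resume-mode scoping in the first bullet can be sketched as below. It assumes run-dir names that parse to dates under a hypothetical format; select_cache_dirs and parse_run_dir are illustrative names, while dirs, cache_dirs and start_date mirror the names used in the commit message.

```python
from datetime import datetime
from typing import Optional


def parse_run_dir(name: str) -> datetime:
    # Hypothetical naming scheme, standing in for the module's date parsing.
    return datetime.strptime(name, "%Y%m%dT%H%M%SZ")


def select_cache_dirs(dirs: list[str], start_date: Optional[datetime]) -> list[str]:
    """Keep only the run-dirs that can still be a resolution target."""
    if start_date is None:
        # Fresh restore: every run-dir may be needed.
        return list(dirs)
    # Resumed restore: anything at or before the cursor can never be
    # returned by a "strictly later than dir_date" lookup.
    return [d for d in dirs if parse_run_dir(d) > start_date]
```

The number of skipped directories reported in the log line would then be len(dirs) - len(cache_dirs).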
Member

@SamuelHassine SamuelHassine left a comment

Approving — every CI check is green, every Copilot review thread (10 in total) is resolved, the PR is linked to #5722, every commit on the branch is GPG-signed, both os.scandir call-sites are now wrapped in with blocks, and the cache build is scoped to the post-resume-cursor history. Ready to merge.

@SamuelHassine
Member

Final review summary

Pulled the branch again, ran a senior-reviewer pass over the full file (not just the diff), addressed the two outstanding Copilot threads and one related pre-existing handle leak found during the read, re-ran the lint suite, and re-watched CI to green on 094d2a9c44. 0 of 10 review threads remain unresolved.

What landed in 094d2a9c44

  • external-import/restore-files/src/restore-files.py cache-build scoping (resume mode). When start_date is set, the main restore loop already skips every run-dir whose date is <= start_date, and find_element only ever returns snapshots strictly later than the dir_date it is called with, so snapshots from directories at or before start_date can never be a valid resolution target on a resumed restore. The cache build now pre-filters dirs into a cache_dirs list when resuming, so it only calls os.scandir on the post-cursor history. On long-lived backups (months of run-dirs) this collapses what used to be a multi-minute cold start on every resume into a sub-second one. The cache-build log line now surfaces len(cache_dirs) vs len(dirs) and (when resuming) the number of skipped dirs + the resume cursor name, so operators can see at a glance whether the scoping actually narrowed the scan.
  • Missing-resolution timing log. Switched from present tense Handle missing resolutions in <float> (no unit) to past tense Handled missing reference resolution in <float>s (<count> objects) — no more misreading the value as milliseconds on busy log views, and the verb tense matches the fact that the work is already done by the time we log.
  • os.scandir hygiene in the main bundle-build loop. The cache-build site was already wrapped in with os.scandir(entry) as it: (from the earlier review pass), but the second os.scandir call-site in restore_files — the one that walks each run-dir to build element_refs / files_data / element_ids — was still leaking the underlying OS-level directory handle on every run-dir until garbage collection happened to run. Wrapped that call-site in a with block as well so both os.scandir users in the module follow the same hygiene. No Copilot thread for this one — caught in the senior-reviewer pass and folded into the same commit.
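
A hedged sketch of the timing-log change in the second bullet. The message text follows the wording quoted above, but the two-decimal precision and the helper names format_resolution_log and timed are assumptions made for this example.

```python
import time
from typing import Any, Callable


def format_resolution_log(elapsed_seconds: float, object_count: int) -> str:
    # Past tense plus an explicit "s" unit, so the value cannot be
    # mistaken for milliseconds. Two decimals is an assumed precision.
    return (
        f"Handled missing reference resolution in "
        f"{elapsed_seconds:.2f}s ({object_count} objects)"
    )


def timed(fn: Callable[[], Any]) -> tuple[Any, float]:
    """Run fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start
```

time.perf_counter() is used here because it is monotonic and intended for measuring durations; the connector may use a different clock.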

Senior-reviewer findings

  • No further findings. Full file read confirms the rest of the module is in good shape: the valid_entries filter rejects non-directory / non-date entries before sorting, the cache-build comment block documents both the flat-structure invariant and the resume-mode scoping rationale at the call-site, find_element's missing-ref return None carries the contract-justification comment block, resolve_missing uses the acc=None + single-line init pattern instead of the mutable default, the cache stores lists in chronological order (so the linear scan inside find_element picks the closest later snapshot on multi-snapshot ids without an explicit sort step), and both timing log lines emit elapsed seconds in past tense with an s suffix. fetch_stix_data's explicit file.close() is sufficient, so it is not worth refactoring inside this PR.

Verification

  • python -m py_compile clean on the touched module.
  • black --check, isort --profile black --check-only, flake8 --ignore=E,W clean on external-import/restore-files/src/.
  • All review threads resolved (0 / 10 unresolved on the GraphQL reviewThreads view).
  • Every required CI check is SUCCESS on 094d2a9c44: Build and Commit Manifest, Block merge if "do not merge" label, Ensure Formatting, Check if label can be setup, Detect test files, Base Linter (flake8), Check signed commits in PR, Baseline coverage (untested connectors), STIX ID Linter, Check Pull Request is linked to an issue, Test tests/test-requirements.txt, Check that PR title follows convention, filigran/cla (organisation member — not required).
  • Title [restore-files] Improve missing resolution speed using files cache already matches the repo's [<scope>] <imperative summary> convention.

Ready to merge once the dismissed approval is re-applied (re-approved above).

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment on lines 110 to 114
     if ref not in element_ids:
         not_in = next((x for x in acc if x["id"] == ref), None)
         if not_in is None:
-            missing_element = self.find_element(dir_date, ref)
+            missing_element = self.find_element(backup_files, dir_date, ref)
             if missing_element is not None:
Comment on lines 259 to 263
     for ref in refs:
         if ref not in ids:
             # 03 - If missing, scan the other dir/files to find the elements
-            missing_element = self.find_element(dir_date, ref)
+            missing_element = self.find_element(backup_files, dir_date, ref)
             if missing_element is not None:

Labels

filigran team: used to identify PRs from the Filigran team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[restore-files] Improve missing resolution speed using files cache

5 participants