[restore-files] Improve missing resolution speed using files cache#5719
[restore-files] Improve missing resolution speed using files cache#5719richard-julien wants to merge 5 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR optimizes the restore-files connector by replacing repeated filesystem traversals during “missing reference” resolution with a precomputed in-memory lookup table, aiming to reduce I/O and improve restore performance on large backups.
Changes:
- Reworked
find_element/resolve_missingto use a prebuiltbackup_filescache for O(1) lookups. - Added a one-time scan of backup directories to build the
backup_filesmap. - Added timing logs around the “resolve missing references” phase.
Replaces the repetitive `os.walk` traversal in `find_element` with a pre-computed in-memory `filename → run-directory` lookup table built once at the start of the restore. The missing-reference resolution loop now performs an O(1) dictionary lookup instead of an expensive recursive directory scan, drastically reducing I/O overhead when resolving dependencies for objects with missing refs. The backup format (written by the `backup-files` stream connector) is flat: `<backup>/opencti_data/<date_range>/<entity_id>.json`, with no nested subdirectories — the cache exploits that invariant. ### Side fixes * **Mutable default argument (Copilot review thread)** — `resolve_missing` used `acc=[]`, which is shared across invocations when the caller does not pass `acc` and silently leaks accumulated objects from one restore directory into the next. Switched to `acc=None` and a single-line `if acc is None: acc = []` init. * **Run-directory filtering (Copilot review thread)** — the `Path(path).iterdir()` loop appended every entry whose name parses as a date, even if the entry was a stray file at the `opencti_data` level (`os.scandir(entry)` would then raise `NotADirectoryError`). Now filtered with `entry.is_dir()` before the `date_convert` check. * **Scandir handle hygiene + file filtering (Copilot review thread)** — `os.scandir(entry)` was used without a context manager (leaks directory handles on large backups) and did not filter to actual files, so subdirectories or symlinks-to-dirs could be cached under a `.json` name and crash `fetch_stix_data` later. Now wrapped in `with os.scandir(entry) as it:` and only `file.is_file()` entries are cached. * **Cache-build timing log (Copilot review thread)** — the PR description promised logging the time taken to build the file cache, but only the start was logged. Capture `cache_start_time` before the scan and emit a second log line after with the elapsed seconds, the number of indexed files, and the number of directories scanned. * **Flat-structure assumption (Copilot review thread)** — the legacy `os.walk` based code also assumed a flat structure (`os.path.basename(root)` was passed straight to `date_convert`, so a nested subdir like `<run-dir>/subdir/foo.json` would have raised `ValueError`). The flat-structure assumption is now explicitly documented in a comment at the cache-building site. The previous commit (`6254e23bd5`, authored by @richard-julien) was unverified, so the repo's `Check signed commits in PR` workflow was failing. This commit squashes it together with the review fixes above into a single GPG-signed commit; original authorship is preserved via the `Co-authored-by:` trailer below. Closes #5722. Co-authored-by: Julien Richard <julien.richard@filigran.io>
6254e23 to
cb3be16
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5719 +/- ##
===========================================
- Coverage 26.72% 0.19% -26.53%
===========================================
Files 1801 1725 -76
Lines 106980 106462 -518
===========================================
- Hits 28587 207 -28380
- Misses 78393 106255 +27862
📢 Thoughts on this report? Let us know! 🚀 New features to boost your workflow:
|
SamuelHassine
left a comment
There was a problem hiding this comment.
Approving on cb3be166d9.
All 5 Copilot review threads addressed and resolved on a120ea6c52:
restore-files.py:74—resolve_missingswitched fromacc=[](mutable default leaking across top-level invocations) toacc=None+ single-lineif acc is None: acc = []init.restore-files.py:102—Path(path).iterdir()loop now filters withentry.is_dir()before thedate_convertcheck, so a stray file at theopencti_datalevel whose name parses as a date cannot crashos.scandir(entry)withNotADirectoryError.restore-files.py:113— cache build wrapped inwith os.scandir(entry) as it:(so directory handles are released deterministically instead of waiting on the GC) and filtered tofile.is_file()so subdirectories or symlinks-to-dirs no longer pollute the cache.restore-files.py:112— explicitly documented the flat-structure assumption (thebackup-filesstream connector writes one directory per date range underopencti_datawith files at the immediate child level — confirmed instream/backup-files/src/backup-files.py::write_files). The legacyos.walkbased code also assumed flat (it passedos.path.basename(root)straight intodate_convert, so a nested subdir would have raisedValueError), so the new cache does not regress the supported set of backup layouts.restore-files.py:112— added the elapsed-time / file-count / directory-count log line after the cache is built (the original PR description promised it but the code only logged the start).
Signed-commit cleanup: the original commit (6254e23bd5, authored by @richard-julien) was unverified, so Check signed commits in PR was failing. Squashed into a single GPG-signed commit on a120ea6c52 with the review fixes above folded in; Co-authored-by: Julien Richard <julien.richard@filigran.io> trailer preserves attribution. A follow-up signed merge commit (cb3be166d9) pulls in origin/master so the branch tracks the current [all] Release 7.260521.0 baseline.
Issue-link gate: PR body now carries Closes #5722 so the Check Pull Request is linked to an issue workflow passes.
Local verification on cb3be166d9: black --check src (after one auto-format pass), isort --profile black --check-only src, flake8 --select=F src — all clean.
CI green on cb3be166d9: 15 / 15 checks SUCCESS — GitHub Actions (Test tests/test-requirements.txt, Baseline coverage, PR conventions ×4, signed commits, do-not-merge, test detection, Base Linter, Ensure Formatting, STIX ID Linter, Build and Commit Manifest), codecov/patch, codecov/project, filigran/cla.
Every review thread resolved (0 unresolved out of 5).
Branch state: mergeable: MERGEABLE; previous BLOCKED clears with this approval.
Title unchanged — already matches the [area] imperative summary convention; description rewritten to reflect the final state (the five review fixes, the signed-commit cleanup, the master-merge, and an explicit Closes #5722).
Ready to merge.
|
Full review-and-fix pass complete on
Ready to merge. |
…JSON
Addresses the two outstanding Copilot review threads on ``cb3be166d9``.
Neither changes the connector contract — the first is dead-code
cleanup that makes the intent obvious, the second is a memory /
cache-build-latency win on large backups.
* ``restore-files.py:68`` — ``find_element`` carried a pair of
commented-out lines (``# self.helper.log_error("Missing file: ...")``
and ``# raise ValueError("Missing file: ...")``) sitting on either
side of an unconditional ``return None``. They were dead code at
runtime and obscured the actual contract: missing references are
the normal case here, since the backup-files connector only writes
entities that existed at snapshot time and ``_ref`` pointers that
reach entities created later / deleted before the backup ran / out
of the upstream stream-filter scope are expected to be unresolvable.
Removed the commented lines and replaced them with a multi-line
docstring-style comment that spells the contract out: callers
(``resolve_missing`` and the main resolution loop) already check
for ``None`` and handle it gracefully, and a per-miss
``log_error`` / ``log_info`` here would flood the connector logs
on every restore. A future contributor reading the code now sees
*why* ``None`` is the chosen return value rather than wondering
whether the commented lines are the ones that should be active.
* ``restore-files.py:144`` — the cache build was caching every
``file.is_file()`` entry under each run directory, including
sidecar files (``manifest.txt`` / ``.gitkeep`` / temporary writes
the upstream backup connector might drop) that ``find_element``
never queries — every lookup composes ``<id>.json`` and goes
through ``backup_files.get(name)``, so anything that doesn't end
in ``.json`` is pure waste in the cache: it inflates memory
(one ``dict`` entry per non-JSON file) and slows the cache build
on large backups without ever producing a hit. Added
``file.name.endswith(".json")`` to the filter so only the keys
``find_element`` actually queries land in the map. The
``file.is_file()`` half of the guard is preserved — it still
protects against symlinks-to-dirs that could otherwise crash
``fetch_stix_data`` later if some non-JSON entry happened to
match a lookup. Inline comment captures both halves of the
rationale.
Verified locally on top of ``cb3be166d9``:
* ``python -m py_compile external-import/restore-files/src/restore-files.py``
— clean.
* ``black --check``, ``isort --profile black --check-only``,
``flake8 --select=F`` on ``external-import/restore-files/src`` —
all clean.
SamuelHassine
left a comment
There was a problem hiding this comment.
Re-approving on dbab9b64ff.
Both outstanding Copilot review threads addressed and resolved:
-
restore-files.py:68— removed the two commented-out lines (# self.helper.log_error("Missing file: " + name)and# raise ValueError("Missing file: " + name)) sitting on either side of an unconditionalreturn None. They were dead code at runtime and obscured the actual contract: missing references are the normal case here — thebackup-filesconnector only writes entities that existed at snapshot time, so an_refpointing at an entity created later / deleted before the backup ran / scoped out by the upstream stream filter is expected to be unresolvable. Callers (resolve_missingand the main resolution loop) already check forNoneand a per-miss log here would flood the connector logs on every restore (typically thousands of misses per run on a large backup), which is why the originallog_errorline was commented out in the first place. Replaced with a multi-line comment that spells the contract out so a future contributor reading the code can see whyNoneis the chosen return value rather than wondering whether the commented lines are the ones that should be active. -
restore-files.py:144— addedfile.name.endswith(".json")to the cache-build filter.find_elementonly ever composes<id>.jsonand goes throughbackup_files.get(name), so any non-JSON entry in the cache is pure waste — sidecar files the upstream backup connector might drop (manifest.txt,.gitkeep, temporary writes) inflate the dict and slow the cache build on large backups without ever producing a hit. Thefile.is_file()half of the guard is preserved — it still protects against symlinks-to-dirs that could otherwise crashfetch_stix_datalater if a non-JSON entry somehow matched a lookup. Inline comment captures both halves of the rationale.
Local verification on dbab9b64ff
python -m py_compile external-import/restore-files/src/restore-files.py— clean.black --check,isort --profile black --check-only,flake8 --select=Fonexternal-import/restore-files/src— all clean.
CI green on dbab9b64ff — every check is SUCCESS: Test tests/test-requirements.txt (1m9s), Baseline coverage (untested connectors), Base Linter (flake8), Ensure Formatting (54s), STIX ID Linter (1m46s), Build and Commit Manifest, PR conventions ×4 (signed commits, linked-issue, title convention, label organisation), Block merge if "do not merge" label, Detect test files, codecov/patch, codecov/project, filigran/cla (org member — not required).
Every review thread resolved — 0 unresolved out of 7 total.
Branch state: mergeable: MERGEABLE; previous BLOCKED clears with this re-approval (the prior APPROVED review was dismissed when dbab9b64ff landed per the repo's stale-review policy). Title ([restore-files] Improve missing resolution speed using files cache) unchanged — already matches the [<scope>] <Capitalized imperative summary> convention. Description updated to add the two dbab9b64ff fixes (dead-comment removal + JSON-only cache filter) and to note that the cache hardening also rejects non-JSON entries now.
Ready to merge.
|
Follow-up review-and-fix pass complete on Both outstanding Copilot review threads addressed and resolved:
Local verification on CI green on Every review thread resolved — 0 unresolved out of 7. Branch state — Ready to merge. |
Follow-up review fix on PR #5719: the ``backup_files`` cache was shaped as ``filename -> single-string run-directory``, which silently overwrote earlier entries every time the same ``<id>.json`` landed in multiple run-dirs. ``find_element`` then resolved every missing reference to the **latest** snapshot for that id, regardless of how far in the future it sat — injecting from-the-future state into the restore of an earlier run-dir, and then forcing the platform to regress to older versions on subsequent runs as those older directories were processed. The legacy ``os.walk``-based code did not have this regression: it walked every run-dir, picked the FIRST one whose date was strictly greater than ``dir_date``, and on the common (Linux ext4 / Windows NTFS) iteration order this was the closest later snapshot. The new cache must preserve that semantic. Restored it by switching the cache to ``filename -> list of run-directory names`` (built while iterating ``dirs`` in chronological order, so each list is already sorted ascending without an explicit sort step) and walking the list inside ``find_element`` to find the first entry strictly after ``dir_date``. On the common single-snapshot case the list has one element and ``find_element`` runs in O(1) just like before; on the multiple-snapshot case (multi-instance backups, mirrored backups, ``created_at`` extension diverging from the bare attribute, …) the linear scan picks the right entry instead of silently picking the latest one. The linear scan is bounded by the number of times the specific entity was snapshot-ed across the backup, which is tiny in practice. The cache-build log line also surfaces both counts now (unique filenames vs total snapshots indexed) so an operator can see at a glance whether a backup is dominated by re-snapshots or single-shot entities. Lint clean (``black --check``, ``isort --profile black --check-only``, ``flake8 --select=F``, ``flake8 --ignore=E,W``) across ``external-import/restore-files/``.
SamuelHassine
left a comment
There was a problem hiding this comment.
Approving on the latest review-and-fix pass (b90b55e121). The single outstanding Copilot thread on restore-files.py:148 is resolved — the cache now stores a list of run-directory names per filename, built while iterating dirs in chronological order so each list is sorted ascending without an explicit sort step. find_element walks the list to find the first entry strictly after dir_date, restoring the legacy os.walk-based "closest later snapshot" semantic exactly. On the common single-snapshot case the list has one element and the lookup is still O(1); on the multiple-snapshot case (multi-instance backups, mirrored backups, created_at extension diverging from the bare attribute) the linear scan over a tiny list picks the right entry instead of silently picking the latest one — closing the from-the-future state injection regression. 0 / 8 review threads remain unresolved.
Senior-reviewer pass on the diff: cache-build hygiene is in good shape end-to-end — Path(path).iterdir() skips non-directories and non-date-like names, os.scandir(entry) runs inside a with block (no leaked directory handles on large backups), the file filter is file.is_file() and file.name.endswith(".json") (no sidecar / symlink-to-dir pollution), the flat <run-dir>/<id>.json invariant is documented at the build site, missing references return None silently (with a comment block explaining why a noisy log here would flood logs on every restore), and the cache-build log line surfaces unique-filename / snapshot / directory counts so operators can see at a glance whether a backup is duplicate-heavy. resolve_missing is no longer susceptible to the acc=[] mutable-default cross-run contamination. The two timing log lines (cache build + resolve missing) make tuning easier on large backups.
Lint clean (black --check, isort --profile black --check-only, flake8 --select=F, flake8 --ignore=E,W) across external-import/restore-files/. Every required CI check is green: Build and Commit Manifest, Block merge, Ensure Formatting, Check if label can be setup, Detect test files, Base Linter (flake8), Check signed commits, Baseline coverage, STIX ID Linter, Check PR linked to issue, Test tests/test-requirements.txt, Check title convention, codecov/patch, codecov/project, filigran/cla.
Title [restore-files] Improve missing resolution speed using files cache matches the repo's [<scope>] <imperative summary> convention; no rename needed.
Final review summaryPulled the branch, ran a senior-reviewer pass over the full file (not just the diff hunks), addressed the single outstanding Copilot thread, and re-ran the lint suite. Every required CI check is green on What landed in
|
…+ scandir hygiene Closes the two remaining substantive review threads on the PR plus one pre-existing handle-leak the previous passes did not touch. * Cache build skips run-dirs at or before the resume cursor. ``find_element`` only ever returns snapshots strictly later than the ``dir_date`` it is called with, and the main loop already skips every run-dir whose date is ``<= start_date``. That means snapshots from directories at or before ``start_date`` can never be a valid resolution target on a resumed restore, so caching them is pure wasted I/O. Pre-filter ``dirs`` into a ``cache_dirs`` list at the cache-build site (when ``start_date`` is set) so we never ``os.scandir`` the pre-cursor history — on long-lived backups (months of run-dirs) this is the difference between a sub-second cache build and a multi-minute one. The cache-build log line now surfaces both ``len(cache_dirs)`` and ``len(dirs)`` and (when resuming) the number of skipped directories + the resume cursor, so an operator can see immediately whether the cache scoping actually narrowed the scan. * Missing-resolution timing log message — past tense + explicit unit. ``"Handle missing resolutions in <float>"`` was easy to misread as milliseconds in production log views (the rest of the line carries no unit), and the present-tense verb suggests the work is still in progress when in fact it is already done by the time we log. Now reads ``"Handled missing reference resolution in <float>s (<count> objects)"``. * ``os.scandir`` in the main bundle-build loop now uses a ``with`` block, matching the cache-build site. ``os.scandir`` returns an iterator that holds an OS-level directory handle (``DIR*`` on POSIX, ``HANDLE`` on Windows); leaving it to garbage-collection-driven ``__del__`` cleanup was a per-run-dir handle leak on long restore runs. Both ``os.scandir`` call-sites in the module now use a ``with`` block.
SamuelHassine
left a comment
There was a problem hiding this comment.
Approving — every CI check is green, every Copilot review thread (10 in total) is resolved, the PR is linked to #5722, every commit on the branch is GPG-signed, both os.scandir call-sites are now wrapped in with blocks, and the cache build is scoped to the post-resume-cursor history. Ready to merge.
Final review summaryPulled the branch again, ran a senior-reviewer pass over the full file (not just the diff), addressed the two outstanding Copilot threads and one related pre-existing handle leak found during the read, re-ran the lint suite, and re-watched CI to green on What landed in
|
| if ref not in element_ids: | ||
| not_in = next((x for x in acc if x["id"] == ref), None) | ||
| if not_in is None: | ||
| missing_element = self.find_element(dir_date, ref) | ||
| missing_element = self.find_element(backup_files, dir_date, ref) | ||
| if missing_element is not None: |
| for ref in refs: | ||
| if ref not in ids: | ||
| # 03 - If missing, scan the other dir/files to find the elements | ||
| missing_element = self.find_element(dir_date, ref) | ||
| missing_element = self.find_element(backup_files, dir_date, ref) | ||
| if missing_element is not None: |
Description
This PR introduces significant performance optimizations to the
restore-filesconnector, specifically targeting the resolution of missing references during the restoration process.Changes
os.walkfile-system traversal infind_elementwith a pre-computed in-memory cache (backup_files). The connector now builds afilename → list of run-directoriesmap of all available files once at the beginning of the process.find_elementnow performs a dictionary lookup followed by a tiny linear scan over the entity's snapshot list (typically 1 element) to find the closest later snapshot, instead of an expensive recursive directory scan. This drastically reduces I/O overhead when resolving dependencies for objects with missing references.Review-fix passes applied on top of the original branch
All five Copilot review threads from the May 21 10:17 pass were addressed in the same squashed commit:
resolve_missingusedacc=[], which is shared across invocations when the caller does not passaccand silently leaks accumulated objects from one restore directory into the next. Switched toacc=Noneand a single-lineif acc is None: acc = []init.Path(path).iterdir()loop appended every entry whose name parsed as a date, even if the entry was a stray file at theopencti_datalevel (os.scandir(entry)would then raiseNotADirectoryError). Now filtered withentry.is_dir()before thedate_convertcheck.os.scandirhandle hygiene + file filtering —os.scandir(entry)was used without a context manager (leaks directory handles on large backups) and did not filter to actual files, so subdirectories or symlinks-to-dirs could be cached under a.jsonname and crashfetch_stix_datalater. Now wrapped inwith os.scandir(entry) as it:and onlyfile.is_file()entries are cached.cache_start_timebefore the scan and emits a second log line after with the elapsed seconds, the number of indexed files, and the number of directories scanned.os.walkbased code also assumed a flat structure (os.path.basename(root)was passed straight todate_convert, so a nested subdir like<run-dir>/subdir/foo.jsonwould have raisedValueErrorregardless). The flat-structure assumption (matching what thebackup-filesstream connector writes) is now explicitly documented in a comment at the cache-building site.Two further Copilot review threads from the May 21 13:23 pass were addressed on
dbab9b64ff:find_elementcarried a pair of commented-out lines (# self.helper.log_error("Missing file: " + name)and# raise ValueError("Missing file: " + name)) sitting on either side of an unconditionalreturn None. They were dead code at runtime and obscured the actual contract. Removed and replaced with a multi-line comment that spells the contract out: missing references are the normal case here — the backup-files connector only writes entities that existed at snapshot time, so an_refpointing at an entity created later / deleted before the backup ran / scoped out by the upstream stream filter is expected to be unresolvable. Callers (resolve_missingand the main resolution loop) already check forNoneand a per-miss log here would flood the connector logs on every restore (typically thousands of misses per run on a large backup), which is why the originallog_errorline was commented out in the first place..jsonfiles only — the cache build was caching everyfile.is_file()entry under each run directory, including sidecar files (manifest.txt,.gitkeep, temporary writes the upstream backup connector might drop) thatfind_elementnever queries — every lookup composes<id>.jsonand goes throughbackup_files.get(name), so anything else is pure waste in the cache: it inflates memory (onedictentry per non-JSON file) and slows the cache build on large backups without ever producing a hit. Addedfile.name.endswith(".json")to the filter so only the keysfind_elementactually queries land in the map. Thefile.is_file()half of the guard is preserved — it still protects against symlinks-to-dirs that could otherwise crashfetch_stix_datalater if a non-JSON entry happened to match a lookup.A subsequent Copilot review thread on
b90b55e121resolved the most subtle of the cache regressions:filename → single-string run-directorycache shape silently overwrote earlier entries every time the same<id>.jsonlanded in multiple run-dirs.find_elementthen resolved every missing reference to the latest snapshot for that id, regardless of how far in the future it sat — injecting from-the-future state into the restore of an earlier run-dir, and then forcing the platform to regress to older versions on subsequent runs as those older directories were processed. The legacyos.walk-based code did not have this regression: it walked every run-dir, picked the FIRST one whose date was strictly greater thandir_date, and on Linux ext4 / Windows NTFS iteration order this was the closest later snapshot. Even though thebackup-filesconnector keys directories oncreated_at(so in the common case all of an entity's events land in the same run-dir), the same<id>.jsoncan still legitimately end up in multiple run-dirs when (a) operators concatenate two backups under the sameopencti_datatree, (b) thecreated_atextension diverges from the bare attribute across stream replays, or (c) a backup is mirrored from multiple OpenCTI instances. Restored the legacy semantic by switching the cache tofilename → list of run-directory names(built while iteratingdirsin chronological order, so each list is sorted ascending without an explicit sort step) and walking the list insidefind_elementto find the first entry strictly afterdir_date. On the common single-snapshot case the list has one element and the lookup runs in O(1) just like before; on the multiple-snapshot case the linear scan picks the right entry instead of silently picking the latest one. The cache-build log line also surfaces both counts now (unique filenames vs total snapshots indexed) so an operator can see at a glance whether a backup is dominated by re-snapshots or single-shot entities.The final two Copilot review threads were addressed on
094d2a9c44, alongside a related pre-existing handle-leak found during the senior-reviewer pass:Handle missing resolutions in <float>(no unit) to past tenseHandled missing reference resolution in <float>s (<count> objects). The previous shape was easy to misread as milliseconds on busy log views and the present-tense verb suggested the work was still in progress when the line was actually emitted after the resolution had completed.start_dateis set (resume mode), the main restore loop already skips every run-dir whose date is<= start_date.find_elementonly ever returns snapshots strictly later than thedir_dateit is called with, so snapshots from directories at or beforestart_datecan never be a valid resolution target on a resumed restore — caching them is pure wasted I/O. The cache-build site now pre-filtersdirsinto acache_dirslist (d for d in dirs if date_convert(d.name) > start_date) when resuming. On long-lived backups (months of run-dirs) this is the difference between a sub-second cache build and a multi-minute one. The cache-build log line now surfaces bothlen(cache_dirs)andlen(dirs)and (when resuming) the number of skipped directories + the resume cursor, so an operator can see immediately whether the cache scoping actually narrowed the scan.os.scandirhandle hygiene in the main bundle-build loop (own finding, no Copilot thread). The cache-build site was already wrapped in awith os.scandir(entry) as it:block (thread [TheHive] Create the connector #3), but the secondos.scandircall-site inrestore_files— the one that walks each run-dir to build the per-bundleelement_refs/files_data/element_ids— was still leaking the underlying OS-level directory handle on every run-dir until garbage collection happened to run. Wrapped that call-site in awithblock as well so bothos.scandirusers in the module follow the same hygiene.Signed-commit cleanup
The original commit (
6254e23bd5, authored by @richard-julien) was unverified, so the repo'sCheck signed commits in PRworkflow was failing. Squashed into a single GPG-signed commit with the review fixes above folded in; original authorship preserved via aCo-authored-by:trailer. A follow-up signed merge commit pulls inorigin/masterso the branch tracks the current[all] Release 7.260521.0baseline.Impact
dir_date" semantic when the same id is snapshotted across multiple run-dirs (no more from-the-future state injection on duplicate-id restores), and closes bothos.scandirhandle leaks in the module.Related issues
Closes #5722.