Skip to content

[assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis#6429

Open
SamuelHassine wants to merge 16 commits into
masterfrom
feat/assemblyline-connector
Open

[assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis#6429
SamuelHassine wants to merge 16 commits into
masterfrom
feat/assemblyline-connector

Conversation

@SamuelHassine
Copy link
Copy Markdown
Member

@SamuelHassine SamuelHassine commented May 19, 2026

Proposed changes

Adds a new internal-enrichment connector for AssemblyLine 4 at internal-enrichment/assemblyline/.

When OpenCTI asks the connector to enrich a StixFile or Artifact, the connector:

  • downloads the file content from OpenCTI's storage API (via pycti's helper.api.fetch_opencti_file, inheriting the platform's session timeouts, retries, custom CA bundles and proxy / SSL settings);
  • submits it to the configured AssemblyLine deployment with a classification mirroring the source observable's TLP marking (TLP:CLEAR ΓåÆ TLP:C, TLP:GREEN ΓåÆ TLP:G, TLP:AMBER ΓåÆ TLP:A, TLP:RED ΓåÆ TLP:R), falling back to ASSEMBLYLINE_CLASSIFICATION when the source has no TLP ΓÇö so a sample is never silently downgraded once it leaves OpenCTI;
  • polls until the analysis completes (within a configurable timeout, surfacing terminal failed / error / cancelled states immediately via AssemblyLineTerminalError instead of waiting for the global timeout);
  • pushes the results back into OpenCTI:
    • a Malware-Analysis SDO (submission id, profile, verdict, score) that surfaces in the Malware Analysis section of the enriched observable;
    • STIX Indicator objects (and optional matching Observable objects linked via based-on relationships) for every malicious IOC AssemblyLine extracted (domains, IPs, URLs), labelled malicious with x_opencti_score=80. When ASSEMBLYLINE_INCLUDE_SUSPICIOUS=true, suspicious-only IOCs are emitted as separate Indicator SDOs labelled suspicious with x_opencti_score=50 so OpenCTI keeps the two classifications distinct (the original implementation labelled everything malicious regardless of AssemblyLine's tag);
    • Malware SDOs for every malware family attributed by AssemblyLine;
    • Attack-Pattern SDOs for every MITRE ATT&CK technique observed at runtime, linked to the generated indicators with related-to relationships;
    • a Note summarising the verdict and per-category indicator counts (the per-bucket numbers come from the connector's own bookkeeping in _create_indicators, not the raw IOC-extraction list lengths, so the counts always match what was actually written to OpenCTI even when the per-bucket 20-indicator cap is engaged; the Note also renders a Suspicious IOCs Created as Indicators block whenever ASSEMBLYLINE_INCLUDE_SUSPICIOUS=true produces any);
    • an External-Reference attached to the enriched observable pointing back to the AssemblyLine submission.

All three "create X" behaviours and the "sequential mode" (queue submissions while AssemblyLine is busy, bounded by ASSEMBLYLINE_TIMEOUT) are configurable via ASSEMBLYLINE_CREATE_ATTACK_PATTERNS / ASSEMBLYLINE_CREATE_MALWARE_ANALYSIS / ASSEMBLYLINE_CREATE_OBSERVABLES / ASSEMBLYLINE_SEQUENTIAL_MODE so operators can tune the noise level on the OpenCTI side.

The connector also enforces a TLP gate (ASSEMBLYLINE_MAX_TLP, default TLP:AMBER) before any file content leaves OpenCTI for the external AssemblyLine deployment. Every derived SDO, SCO, Indicator and SRO inherits the source observable's object_marking_refs, so a TLP:AMBER source produces TLP:AMBER analysis observables, indicators, Malware family SDOs, Attack-Pattern SDOs, the Malware-Analysis SDO, the summary Note, and every related-to / based-on relationship between them ΓÇö never silently downgraded to TLP:CLEAR.

Source

This is a re-submission of #5199 (which became unmergeable after the head branch was accidentally promoted to master and then rolled back). The implementation is the one from @V1D1AN's standalone connector repo at https://github.com/V1D1AN/connector-assemblyline, adapted to the OpenCTI monorepo conventions and with all the review findings from #5199 fixed in this branch.

Monorepo adaptations applied on top of the upstream code

  • Files placed under internal-enrichment/assemblyline/ with the standard layout (Dockerfile, entrypoint.sh, docker-compose.yml, README.md, .dockerignore, __metadata__/connector_manifest.json, __metadata__/connector_config_schema.json, __metadata__/logo.png, src/{main.py, requirements.txt, config.yml.sample, __init__.py}, tests/{test_connector.py, test-requirements.txt, __init__.py}).
  • Dockerfile rebased on python:3.12-alpine matching the lean internal-enrichment/joe-sandbox pattern: libmagic + libffi-dev for the runtime, git + build-base only for pip install (removed afterwards). assemblyline-client==4.9.9 does not transitively pull lxml, so the previously-installed libxml2-dev / libxslt-dev headers (never cleaned up) are dropped.
  • docker-compose.yml rewritten with image: opencti/connector-assemblyline:rolling, inline CONNECTOR_* / ASSEMBLYLINE_* environment variables with sensible defaults.
  • src/requirements.txt pinned to pycti==7.260521.0 (aligned with the rest of the monorepo's current pin), assemblyline-client==4.9.9, stix2==3.0.1, requests~=2.32, PyYAML>=6.0,<7. Removed the bogus uuid>=1.30 PyPI package (the stdlib uuid module ships with Python and the third-party package is unmaintained).
  • connector_manifest.json fixed: slug lowercased to assemblyline, description accurately reflects what the connector actually emits (domains, IPs, URLs, Malware SDOs, ATT&CK Attack Patterns ΓÇö no spurious "file hashes" claim), logo path adjusted to internal-enrichment/assemblyline/__metadata__/logo.png, container_image set to opencti/connector-assemblyline, source_code pointed at the monorepo path.
  • tests/test-requirements.txt added so the GitHub Actions test runner picks the connector up.

Review findings from #5199 addressed in this branch

Correctness

  • Normalise ASSEMBLYLINE_VERIFY_SSL so requests does not treat the string "false" / "true" as a CA-bundle path.
  • Cap the sequential-mode wait by ASSEMBLYLINE_TIMEOUT so the worker can never block forever when AssemblyLine stays busy. ASSEMBLYLINE_POLL_INTERVAL is clamped to max(1, int(...)) so a stale config.yml or env var cannot turn the wait into a busy loop.
  • Add an HTTP timeout to the AssemblyLine submission request.
  • Keep MITRE ATT&CK phase names hyphenated (defense-evasion, ...) so OpenCTI can match them against the official kill-chain entries.
  • Link ATT&CK Attack-Patterns to the generated indicators with related-to, matching what the README documents.
  • Drop the duplicate _create_relationships pass so observables are not created twice and ASSEMBLYLINE_CREATE_OBSERVABLES is the single source of truth.
  • Switch IPv6 IOCs to ipv6-addr / IPv6-Addr (they were modelled as ipv4-addr).
  • Escape backslashes (and single quotes) before embedding values in STIX patterns.
  • Remove the hard-coded "trojan" malware label so families keep their actual classification.
  • Read StixFile content from importFiles in addition to x_opencti_files, matching the other sandbox connectors.
  • Pass the canonical /storage/get/<id> URL to helper.api.fetch_opencti_file ΓÇö and do the same for the importFiles path so it inherits pycti's HTTP session config (timeouts, retries, CA bundles, proxy / SSL) instead of using a raw requests.get with a manually-set Authorization header.
  • Terminal AssemblyLine states (failed / error / cancelled) now surface immediately through a dedicated AssemblyLineTerminalError exception that the polling loop re-raises in front of its broad transient-error catcher ΓÇö the previous code masked terminal failures and let the enrichment run until the global timeout.
  • Suspicious-only IOCs are no longer treated as malicious. _extract_malicious_iocs returns only IOCs AssemblyLine tagged malicious; a new _extract_suspicious_iocs returns the suspicious subset (when ASSEMBLYLINE_INCLUDE_SUSPICIOUS=true). Downstream: the malicious label on the source observable, the high x_opencti_score, and the malware-analysis.result=malicious upgrade only fire on actually-malicious IOCs. Suspicious-only verdicts produce a suspicious label, x_opencti_score=50, and indicators labelled suspicious with the same score. The summary Note now renders a Suspicious IOCs Created as Indicators section whenever the suspicious bucket is non-empty so the user-facing summary cannot contradict what the connector actually sent.
  • _parse_al_timestamp parses positive AND negative ISO-8601 offsets correctly. The earlier split('+', 1)[0] discarded positive offsets entirely and silently fell back to the caller's default for negative ones. Switched to datetime.fromisoformat after normalising the Z suffix, then astimezone(UTC).replace(tzinfo=None) so the rest of the connector keeps using a single naive-UTC representation.
  • Strip trailing slash on configured base URLs (OPENCTI_URL, ASSEMBLYLINE_URL) at assignment time so requests >= 2.34 doesn't produce double-slash paths ΓÇö also satisfies the repo-wide tests/test_url_construction.py guard.
  • SHA-256-only deduplication. _check_existing_analysis searches AssemblyLine submissions via files.sha256:<hash> ΓÇö _process_file and _get_artifact_content now pick the SHA-256 explicitly via _select_sha256(observable["hashes"]) instead of any-hash, so MD5 / SHA-1 inputs no longer generate audit-log noise hitting an unmatched Lucene query and cannot accidentally hit an unrelated submission.
  • Identity-lookup failure no longer crashes the Malware-Analysis bundle. _get_assemblyline_identity swallows errors and stores None; sco_author_properties now omits x_opencti_created_by_ref and stix2.MalwareAnalysis(...) now omits created_by_ref when the identity is unavailable, so the rest of the enrichment still lands.
  • The summary Note reports created-indicator counts, not extracted-IOC list lengths. _create_indicators caps creation at 20 indicators per IOC bucket (domains / IPs / URLs, for both the malicious and the suspicious classifications); the Note's "Malicious IOCs Created as Indicators" / "Suspicious IOCs Created as Indicators" sections now read from per-category counters in the counts dict (malicious_domains / malicious_ips / malicious_urls / suspicious_domains / suspicious_ips / suspicious_urls) instead of len(malicious_iocs[...]) / len(suspicious_iocs[...]). Counters are incremented only on a successful indicator.create, so transient OpenCTI failures cannot inflate them either. Result: on a 50-domain analysis the Note now shows **Malicious Domains:** 20 (matching what was actually written to OpenCTI and the run's success-message) instead of **Malicious Domains:** 50.

Security

  • Enforce a TLP gate (ASSEMBLYLINE_MAX_TLP, default TLP:AMBER) before any file content leaves OpenCTI for the external AssemblyLine deployment.
  • Submission classification mirrors the source TLP. Maps TLP:CLEAR / TLP:WHITE ΓåÆ TLP:C, TLP:GREEN ΓåÆ TLP:G, TLP:AMBER / TLP:AMBER+STRICT ΓåÆ TLP:A, TLP:RED ΓåÆ TLP:R, falling back to ASSEMBLYLINE_CLASSIFICATION when the source has no TLP marking. A TLP:AMBER file passing the max-TLP gate is no longer silently downgraded to the connector default (TLP:C) once it leaves OpenCTI.
  • Never log the full observable on error: log only entity_type + id to avoid leaking payload_bin or other sensitive observable fields.
  • The full derived sub-graph inherits the source observable's object_marking_refs (every derived SDO / SCO / Indicator / Malware family / Malware-Analysis / Attack-Pattern / summary Note, and every related-to / based-on SRO between them) instead of being hard-coded to TLP:WHITE or left unmarked. A file that passes the ASSEMBLYLINE_MAX_TLP gate with TLP:AMBER now produces TLP:AMBER analysis objects throughout ΓÇö never downgraded. The _source_marking_refs helper reads observable["objectMarking"][*]["standard_id"] and falls back to OpenCTI's custom TLP:CLEAR marking only when the source carries no marking at all.
  • Config loading uses yaml.safe_load inside a with open(...) context manager rather than yaml.load(open(...), Loader=FullLoader) ΓÇö FullLoader could instantiate arbitrary Python objects through YAML tags from a tampered config.yml, and the bare open() leaked the file handle on parse errors.

UX

  • Attach an AssemblyLine External-Reference to the enriched observable, not just to the Malware-Analysis SDO.
  • Surface a max_tlp knob in docker-compose.yml, config.yml.sample and the README.
  • Update the README so the documented IOC set matches what is actually emitted, the documented submission flow matches what the code does (requests.post to /api/v4/submit/, with assemblyline-client used for polling / summary retrieval / file fetch), and the TLP propagation contract is spelt out.

Tests

  • Replace the placeholder asserts with real coverage of the connector implementation: bool coercion, malicious-IOC extraction with/without suspicious, score buckets, STIX-pattern escaping (backslash and quote), IPv4/IPv6 dispatch, ATT&CK phase preservation, TLP gate, external-reference attachment, sequential-mode deadline, CREATE_OBSERVABLES flag, malware-family label, file-fetch URL.
  • TestSourceMarkingRefs, TestMalwareAnalysisPropagatesSourceMarkings, TestApiIndicatorAndObservableInheritSourceMarkings, TestSummaryNoteInheritsSourceMarking, TestMalwareFamilyInheritsSourceMarkings and TestAttackPatternInheritsSourceMarkings pin the marking-inheritance contract end-to-end across the full derived sub-graph (SCOs, Indicators, Observables, Malware family SDOs, Malware-Analysis, Attack-Patterns, summary Note, and the related-to / based-on SROs between them).
  • TestTerminalAssemblyLineStates::test_process_file_raises_terminal_error_on_terminal_state pins the real _process_file polling-loop contract end-to-end: stubs requests.post (200 + valid sid) and al_client.submission.full to return {"state": <terminal>} on the first poll, asserts AssemblyLineTerminalError is raised, the submission id appears in the message, and submission.full ran exactly once (no retry, no sleep-until-timeout fallback). Parameterised across all three terminal states (failed / error / cancelled).
  • TestMaliciousIOCExtraction extended to pin the suspicious / malicious split. TestIocClassificationLabelsAndScores pins the per-IOC label + score plumbing. TestResolveSubmissionClassification pins the TLP ΓåÆ AL classification mapping. TestParseAlTimestamp extended to pin positive AND negative ISO-8601 offset parsing. TestConfigYamlSafeLoad pins the safe_load contract against !!python/object/apply tags. TestUnpinnedFileFetch extended to pin _download_import_file routing through helper.api.fetch_opencti_file.
  • TestMissingIdentityDoesNotEmitNoneAuthor pins both the sco_author_properties and stix2.MalwareAnalysis(...) paths when _get_assemblyline_identity returned None ΓÇö the bundle is still sent and the derived SCOs / Malware-Analysis serialise without a null author. TestProcessFileShaDedup pins the SHA-256-only dedup contract (called once for a SHA-256, never for an MD5 / SHA-1-only observable). TestSummaryNoteSuspiciousSection pins that the Note's Suspicious IOCs block is rendered with correct counts when present and omitted otherwise.
  • TestCreateIndicatorsTracksPerCategoryCreatedCounts pins the per-category created-indicator counters on the counts dict (capped at 20 per bucket; only incremented on a successful indicator.create; tracked separately for the malicious and suspicious buckets). TestSummaryNoteUsesCreatedCountsNotExtractedLengths pins that the Note's malicious / suspicious "Created as Indicators" sections report counts.get('<classification>_<kind>', 0) rather than len(...) so the displayed numbers match what the connector actually wrote to OpenCTI.
  • Whole suite is now 119 cases (started at 57 on the original branch, 69 / 94 / 107 after the first three review-and-fix passes, 119 after this one).

Related issues

Checklist

  • I consider the submitted work as finished
  • I have signed my commits using a GPG key.
  • I tested the code for its functionality using different use cases
  • I added/updated the relevant documentation
  • Where necessary I refactored code to improve the overall quality

Further comments

The connector is published as a new entry under internal-enrichment/assemblyline/; no existing connector is modified. The author is recorded as the AssemblyLine identity so generated SDOs are clearly attributable in the OpenCTI UI.

… 4 sandbox analysis

Adds a new internal-enrichment connector for AssemblyLine 4 at
`internal-enrichment/assemblyline/`.

When OpenCTI asks the connector to enrich a `StixFile` or `Artifact`,
the connector:

* downloads the file content from OpenCTI's storage API;
* submits it to the configured AssemblyLine deployment via
  `assemblyline-client`;
* polls until the analysis completes (within a configurable timeout);
* pushes the results back into OpenCTI:
  * a `Malware-Analysis` SDO (submission id, profile, verdict, score)
    that surfaces in the *Malware Analysis* section of the enriched
    observable;
  * STIX `Indicator` objects (and optional matching `Observable`
    objects linked via `based-on` relationships) for every malicious
    IOC AssemblyLine extracted (domains, IPs, URLs);
  * `Malware` SDOs for every malware family attributed by
    AssemblyLine;
  * `Attack-Pattern` SDOs for every MITRE ATT&CK technique observed
    at runtime, linked to the generated indicators with `related-to`
    relationships;
  * a `Note` summarising the verdict and counts;
  * an `External-Reference` attached to the enriched observable
    pointing back to the AssemblyLine submission.

All the "create *X*" toggles and the sequential mode (queue
submissions while AssemblyLine is busy, bounded by the configured
timeout) are configurable via `ASSEMBLYLINE_CREATE_ATTACK_PATTERNS` /
`ASSEMBLYLINE_CREATE_MALWARE_ANALYSIS` /
`ASSEMBLYLINE_CREATE_OBSERVABLES` / `ASSEMBLYLINE_SEQUENTIAL_MODE`.

The connector also enforces a TLP gate (`ASSEMBLYLINE_MAX_TLP`,
default `TLP:AMBER`) before any file content leaves OpenCTI for the
external AssemblyLine deployment.

This connector is a re-submission of #5199 (which became unmergeable
after the head branch was accidentally promoted to master and then
rolled back). All the Copilot review findings on #5199 have been
addressed in this re-submission: SSL-verify normalisation, bounded
sequential-mode wait, HTTP submission timeout, hyphenated MITRE
phase names, ATT&CK→Indicator `related-to` relationships, IPv6
dispatch, STIX-pattern escaping (backslash + quote), no hard-coded
`trojan` label, `CREATE_OBSERVABLES` no longer duplicates,
StixFile `importFiles` lookup, canonical `/storage/get/<id>` URL
for `fetch_opencti_file`, TLP gate, sanitised observable in error
logs, and proper External-Reference attachment to the enriched
observable.

Co-authored-by: V1D1AN <V1D1AN@users.noreply.github.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
internal-enrichment/assemblyline/src/__init__.py 0.00% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (7947ad3) and HEAD (fd06e67). Click for more details.

HEAD has 95 uploads less than BASE
Flag BASE (7947ad3) HEAD (fd06e67)
connectors 97 2
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6429       +/-   ##
===========================================
- Coverage   26.75%    0.62%   -26.14%     
===========================================
  Files        1801     1727       -74     
  Lines      106940   107281      +341     
===========================================
- Hits        28613      666    -27947     
- Misses      78327   106615    +28288     
Files with missing lines Coverage Δ
internal-enrichment/assemblyline/src/main.py 54.12% <ø> (ø)
internal-enrichment/assemblyline/src/__init__.py 0.00% <0.00%> (ø)

... and 953 files with indirect coverage changes

📢 Thoughts on this report? Let us know!

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new AssemblyLine internal-enrichment connector that submits OpenCTI Artifact/StixFile content for sandbox analysis and imports analysis results back into OpenCTI.

Changes:

  • Adds connector runtime, Docker, compose, config, manifest, and README files under internal-enrichment/assemblyline.
  • Implements AssemblyLine submission/polling, TLP gating, IOC extraction, malware-analysis creation, notes, and external references.
  • Adds unit tests covering configuration coercion, IOC handling, STIX escaping, TLP gate, sequential timeout, and file fetching.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
internal-enrichment/assemblyline/src/main.py Main connector implementation for AssemblyLine enrichment and OpenCTI object creation.
internal-enrichment/assemblyline/src/requirements.txt Runtime Python dependencies.
internal-enrichment/assemblyline/src/config.yml.sample Sample manual configuration.
internal-enrichment/assemblyline/src/__init__.py Package metadata.
internal-enrichment/assemblyline/tests/test_connector.py Unit tests for connector behavior.
internal-enrichment/assemblyline/tests/test-requirements.txt Test dependency list.
internal-enrichment/assemblyline/tests/__init__.py Test package marker.
internal-enrichment/assemblyline/README.md Connector documentation.
internal-enrichment/assemblyline/Dockerfile Container build definition.
internal-enrichment/assemblyline/entrypoint.sh Container entrypoint.
internal-enrichment/assemblyline/docker-compose.yml Example deployment configuration.
internal-enrichment/assemblyline/.dockerignore Docker build exclusions.
internal-enrichment/assemblyline/__metadata__/connector_manifest.json Connector catalog metadata.

Comment thread internal-enrichment/assemblyline/src/main.py
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/__metadata__/connector_manifest.json Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Address the six outstanding Copilot review threads on the connector:

1. **TLP downgrade on derived analysis SCOs** (`src/main.py` lines
   1066 / 1085 / 1093 / 1107). The connector built every derived
   domain / IPv4 / IPv6 / URL with
   ``object_marking_refs=[stix2.TLP_WHITE]``, which downgraded a file
   that passed the ``ASSEMBLYLINE_MAX_TLP`` gate with a higher
   marking (e.g. ``TLP:AMBER``) — the resulting analysis observables
   would have shown up in OpenCTI under the wrong access bucket.
   Introduce a new ``_source_marking_refs(observable)`` helper that
   reads ``observable["objectMarking"][*].standard_id`` from the
   enriched observable and returns those refs (falling back to
   ``[stix2.TLP_WHITE["id"]]`` only when the source carries no
   marking at all — every derived SCO still needs *some* marking so
   the platform's access-control gates work). ``_create_malware_analysis``
   now uses this helper for all four derived SCO types.

2. **Terminal AssemblyLine states masked by the polling loop**
   (`src/main.py` line 764). The state-machine branches for
   ``failed`` / ``error`` / ``cancelled`` raised plain ``Exception``
   from inside the polling ``try`` block, which the broad
   ``except Exception`` below then caught and turned into a
   ``log_warning`` — so a terminal failure kept the connector
   polling until the global ``ASSEMBLYLINE_TIMEOUT``. Introduce a
   dedicated ``AssemblyLineTerminalError`` exception type and add a
   targeted ``except AssemblyLineTerminalError: raise`` before the
   broad ``except`` so terminal failures surface immediately while
   the broad handler keeps absorbing transient client errors.

3. **Manifest description / behaviour drift**
   (`__metadata__/connector_manifest.json` line 4). The manifest
   advertised "file hashes" indicators that the connector does not
   actually emit. Update the description to match what the code
   produces: domains, IPs, URLs, malware families (via dedicated
   Malware SDOs) and ATT&CK techniques.

Tests:

* ``tests/test_connector.py``: ``TestSourceMarkingRefs`` (5 cases)
  pins the helper contract — single TLP source ⇒ single marking
  ref; multiple markings preserved and deduplicated in encounter
  order; missing / empty / ``objectMarking``-less observables fall
  back to ``TLP:WHITE``; markings without ``standard_id`` are
  ignored gracefully.
* ``tests/test_connector.py``: ``TestMalwareAnalysisPropagatesSourceMarkings``
  (5 cases) is an end-to-end test that runs ``_create_malware_analysis``
  with a TLP:AMBER source observable and asserts every derived
  SCO (domain / IPv4 / IPv6 / URL) in the captured bundle carries
  the same ``TLP:AMBER`` marking — explicitly checking that no
  derived SCO leaks back to ``TLP:WHITE``.
* ``tests/test_connector.py``: ``TestTerminalAssemblyLineStates``
  (2 cases) pins the new exception's behaviour — a terminal
  ``AssemblyLineTerminalError`` re-raised through the targeted
  ``except: raise`` propagates intact, and the new type is a
  subclass of ``Exception`` (so generic loggers still capture it)
  but distinct from it (so a polling loop can catch it
  specifically).

Whole suite is now **69 cases** (was 57); ``black --check``,
``isort --profile black --check`` and ``flake8 --select=F`` all
clean.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Full review and fix pass complete on c98a242fb4:

  • ✅ All six outstanding Copilot review threads addressed and resolved:
    • TLP downgrade on derived analysis SCOs (src/main.py lines 1066 / 1085 / 1093 / 1107 — four sibling threads): derived domain / IPv4 / IPv6 / URL observables are no longer hard-coded to TLP:WHITE. A new _source_marking_refs(observable) helper reads observable["objectMarking"][*]["standard_id"] from the enriched observable so derived SCOs inherit the source marking (e.g. a TLP:AMBER file produces TLP:AMBER analysis observables). Falls back to TLP:WHITE only when the source carries no marking at all.
    • Terminal AssemblyLine states masked by polling loop (src/main.py line 764): introduced an AssemblyLineTerminalError exception type and added a targeted except AssemblyLineTerminalError: raise in front of the broad transient-error catcher. failed / error / cancelled submissions now surface immediately instead of letting the connector poll until the global timeout.
    • Manifest description / behaviour drift (__metadata__/connector_manifest.json line 4): description rewritten so it matches what the code actually emits — domains, IPs, URLs, Malware SDOs, ATT&CK Attack Patterns. The misleading "file hashes" claim is gone.
  • ✅ Best-practice fixes applied: 12 new pytest cases pin the new contracts — TestSourceMarkingRefs (5 cases), TestMalwareAnalysisPropagatesSourceMarkings (5 end-to-end cases asserting no derived SCO leaks back to TLP:WHITE for a TLP:AMBER source), TestTerminalAssemblyLineStates (2 cases). Whole suite is now 69 cases (was 57).
  • CI is fully green on c98a242fb4: GitHub Actions (PR conventions, signed commits, baseline coverage, issue-link, do-not-merge, test detection, Test internal-enrichment/assemblyline (69 tests), Test tests/test-requirements.txt, codecov/patch, codecov/project), CircleCI (ensure_formatting, base_linter, linter, test, build_manifest), filigran/cla — all pass.
  • ✅ Every review thread resolved (0 unresolved threads).
  • ✅ Branch is MERGEABLE; the mergeStateStatus: BLOCKED is purely from reviewDecision: REVIEW_REQUIRED because I'm the PR author and GitHub does not let me approve my own PR. A second reviewer needs to approve before merge.
  • ✅ Title ([assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis) follows the repository convention; description reflects the final state (TLP-inheritance fix, terminal-state propagation, accurate manifest description, new test counts).

Ready for an external reviewer's approval.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Addresses the six new Copilot review threads on top of c98a242:

1. **API-side marking inheritance** (`src/main.py` line 1218). The
   STIX bundle path was already propagating the source observable's
   markings to derived analysis SCOs, but the OpenCTI REST API path
   that creates Indicator (and optional Observable) objects for every
   malicious IOC was not. A TLP:AMBER source therefore produced
   indicators / observables that the platform exposed more broadly
   than the source SCO. ``_create_indicators`` now takes the full
   source observable (not just its id), derives the marking refs via
   ``_source_marking_refs`` and forwards them to
   ``_create_indicator_observable``, which sets ``objectMarking`` on
   both the indicator and (when ``ASSEMBLYLINE_CREATE_OBSERVABLES``
   is on) the matching observable.

2. **TLP:CLEAR fallback in `_source_marking_refs`** (line 334). The
   helper fell back to ``stix2.TLP_WHITE`` when the source observable
   carried no marking at all. That was inconsistent with
   ``_check_tlp`` (which treats an unmarked observable as
   ``TLP:CLEAR``) and with the rest of the codebase, which uses
   OpenCTI's custom ``TLP:CLEAR`` marking via
   ``pycti.MarkingDefinition.generate_id("TLP", "TLP:CLEAR")``.
   Fallback now uses the same custom id (exposed at module scope as
   ``_TLP_CLEAR_MARKING_ID``) so an unmarked source produces
   ``TLP:CLEAR`` derived objects — never the deprecated TLP:WHITE.

3. **SCO author convention** (lines 1112 / 1131 / 1139 / 1153). The
   four derived SCOs (DomainName, IPv6Address, IPv4Address, URL) set
   ``created_by_ref`` in ``custom_properties``, but OpenCTI's
   observable authoring convention is ``x_opencti_created_by_ref``
   (``created_by_ref`` is reserved for SDOs/SROs). Setting the wrong
   field left the platform's author column empty for analysis
   observables. All four sites now use a shared
   ``sco_author_properties`` dict with the OpenCTI-specific key.

Tests:
- ``TestSourceMarkingRefs`` rewritten to pin the new TLP:CLEAR
  fallback contract (rather than TLP:WHITE).
- New ``TestApiIndicatorAndObservableInheritSourceMarkings`` (3
  cases) pins the API-side marking propagation: a TLP:AMBER source
  produces a TLP:AMBER indicator and observable, and an unmarked
  source falls back to TLP:CLEAR — never TLP:WHITE.
- New ``test_derived_scos_use_x_opencti_created_by_ref`` (in
  ``TestMalwareAnalysisPropagatesSourceMarkings``) asserts the
  serialised SCO body contains ``x_opencti_created_by_ref`` and
  never the standard SDO ``created_by_ref``.
- Existing call sites in tests that passed ``"obs-root"`` as a
  string to ``_create_indicators`` now pass the matching
  ``{"id": "obs-root", "objectMarking": []}`` dict to match the
  new signature.

Whole suite is now 73 cases (was 69); ``black --check``,
``isort --profile black --check`` and ``flake8 --ignore=E,W`` all
clean.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Full review-and-fix pass complete on 9ddcf44 — second-pass Copilot threads:

  • ✅ All six new Copilot review threads addressed and resolved (0 unresolved):
    • API-side marking inheritance (src/main.py _create_indicator_observable / _create_indicators): API-created Indicator and Observable objects now inherit the source observable's objectMarking — a TLP:AMBER file produces TLP:AMBER indicators/observables, never broader than the source SCO.
    • _source_marking_refs fallback: now returns OpenCTI's custom TLP:CLEAR marking (via pycti.MarkingDefinition.generate_id("TLP", "TLP:CLEAR"), exposed as _TLP_CLEAR_MARKING_ID) instead of the deprecated stix2.TLP_WHITE. Matches what _check_tlp already treats unmarked observables as.
    • SCO author convention (4 sibling threads on lines 1112 / 1131 / 1139 / 1153): derived DomainName / IPv6Address / IPv4Address / URL observables now set x_opencti_created_by_ref (OpenCTI's SCO authoring convention) instead of the standard SDO created_by_ref, so the platform's author column is populated.
  • ✅ New tests pinning the new contracts:
    • TestApiIndicatorAndObservableInheritSourceMarkings (3 cases): TLP:AMBER source → TLP:AMBER indicator + observable; unmarked source → TLP:CLEAR fallback.
    • test_derived_scos_use_x_opencti_created_by_ref in TestMalwareAnalysisPropagatesSourceMarkings: serialised SCO body contains x_opencti_created_by_ref and never the standard created_by_ref.
    • TestSourceMarkingRefs rewritten to pin the new TLP:CLEAR fallback (was TLP:WHITE).
    • Whole suite is now 73 cases (was 69).
  • ✅ CI fully green on 9ddcf44: CircleCI (test, linter, ensure_formatting, base_linter, build_manifest), GitHub Actions (PR conventions, signed commits, baseline coverage, issue-link, do-not-merge, test detection, Test internal-enrichment/assemblyline — 73 tests, Test tests/test-requirements.txt, codecov/patch, codecov/project, CLA) — all pass.
  • ✅ Branch is MERGEABLE, no conflicts. mergeStateStatus: BLOCKED is purely reviewDecision: REVIEW_REQUIRED — GitHub does not let me approve my own PR. A second reviewer needs to approve before merge.
  • ✅ Title and description still accurate (Closes #6407, Supersedes #5199 already in body).

Ready for an external reviewer's approval.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 5 comments.

Comment thread internal-enrichment/assemblyline/src/config.yml.sample
Comment thread internal-enrichment/assemblyline/README.md Outdated
Comment thread internal-enrichment/assemblyline/src/main.py
Comment thread internal-enrichment/assemblyline/src/main.py
…assification config key, ship config schema

Addresses the five new Copilot review threads on top of 9ddcf44:

1. **Marking propagation on the Malware-Analysis SDO**
   (`src/main.py` line 1173). `source_marking_refs` was already
   computed at the top of `_create_malware_analysis` and applied to
   the four derived analysis SCOs, but the Malware-Analysis SDO
   itself was emitted unmarked. For a TLP:AMBER source observable
   that left the verdict / submission id / score visible to users
   who could not see the underlying file. The SDO now carries
   `object_marking_refs=source_marking_refs` like the rest of the
   bundle. Pinned by `test_malware_analysis_inherits_amber`.

2. **Marking propagation on the summary Note**
   (`src/main.py` line 1610). `_create_summary_note` uses the
   OpenCTI REST API path (`helper.api.note.create`); the
   `note_data` dict was missing the `objectMarking` key, so the
   API-created Note inherited no marking and exposed the verdict /
   sid / file hash / size / IOC counts more broadly than the source.
   `note_data` now carries
   `objectMarking=self._source_marking_refs(observable)` which
   reuses the same helper as the bundle path (TLP:CLEAR fallback for
   unmarked sources, not the deprecated TLP:WHITE). Pinned by the
   new `TestSummaryNoteInheritsSourceMarking` class (2 cases:
   TLP:AMBER propagation + TLP:CLEAR fallback).

3. **`classification` missing from `config.yml.sample`**
   (line 30). `AssemblyLineConnector.__init__` already reads
   `assemblyline.classification` (with the `ASSEMBLYLINE_CLASSIFICATION`
   env-var fallback), but the sample config omitted the YAML key, so
   manual deployments that copied the file could not discover the
   submission classification without using an environment variable.
   Added `classification: 'TLP:C'` to the sample's
   `# Submission settings` block.

4. **README documents `classification` as env-var-only**
   (`README.md` line 109). The configuration table for the
   `Submission Classification` row had `-` in the `config.yml`
   column, contradicting the connector code. Updated the cell to
   `classification` so the documented YAML key matches what
   `get_config_variable` actually looks up.

5. **`connector_config_schema.json` missing from `__metadata__/`**
   (`__metadata__/connector_manifest.json` line 1). The global
   manifest generator embeds a connector's `config_schema` only
   when this file exists; without it the catalog / manager entry
   would not expose the AssemblyLine configuration fields. Added a
   complete JSON-Schema draft-2020-12 document under
   `__metadata__/connector_config_schema.json` covering every
   OpenCTI / connector / AssemblyLine env-var the connector reads,
   including types, defaults, enums (`submission_profile`,
   `log_level`, `max_tlp`), and the canonical `required` set
   (`OPENCTI_URL`, `OPENCTI_TOKEN`, `ASSEMBLYLINE_URL`,
   `ASSEMBLYLINE_USER`, `ASSEMBLYLINE_APIKEY`).

Whole suite is now 76 cases (was 73); `black --check` and
`isort --profile black --check` clean.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Full review-and-fix pass complete on b576159885 — third-pass Copilot threads:

  • All 5 new Copilot review threads addressed and resolved (0 unresolved across the entire PR now). Marking propagation on the Malware-Analysis SDO: source_marking_refs was already computed at the top of _create_malware_analysis and applied to the four derived analysis SCOs, but the stix2.MalwareAnalysis(...) call itself was emitted unmarked. The SDO now carries object_marking_refs=source_marking_refs, so a TLP:AMBER source no longer leaks verdict / submission id / score to user groups that should not see the underlying file. Marking propagation on the summary Note: _create_summary_note uses the OpenCTI REST API path (helper.api.note.create); note_data now carries objectMarking=self._source_marking_refs(observable) — reusing the same helper as the bundle path (TLP:CLEAR fallback for unmarked sources, never the deprecated TLP:WHITE). classification missing from config.yml.sample: AssemblyLineConnector.__init__ reads assemblyline.classification (with the ASSEMBLYLINE_CLASSIFICATION env fallback) but the sample omitted the YAML key; added classification: ''TLP:C'' to the # Submission settings block. README documented classification as env-var-only: the row in the README config table now lists classification as the config.yml key (was -). Missing connector_config_schema.json: added a complete JSON-Schema draft-2020-12 document under __metadata__/connector_config_schema.json covering every OpenCTI / connector / AssemblyLine env-var the connector reads, with types, defaults, enums (submission_profile, log_level, max_tlp) and the canonical required set (OPENCTI_URL, OPENCTI_TOKEN, ASSEMBLYLINE_URL, ASSEMBLYLINE_USER, ASSEMBLYLINE_APIKEY). Now the global manifest generator can embed the AssemblyLine configuration fields in the catalog / manager entry.

  • New tests pinning the new contracts. test_malware_analysis_inherits_amber in TestMalwareAnalysisPropagatesSourceMarkings: TLP:AMBER source → TLP:AMBER Malware-Analysis SDO. New TestSummaryNoteInheritsSourceMarking class (2 cases): TLP:AMBER source → TLP:AMBER Note; unmarked source → TLP:CLEAR fallback. Whole suite is now 76 cases (was 73).

  • CI fully green on b576159885: GitHub Actions (PR conventions, signed commits, baseline coverage, issue-link, do-not-merge, test detection, Test internal-enrichment/assemblyline — 76 tests, Test tests/test-requirements.txt, codecov/patch, codecov/project, filigran/cla), CircleCI (test, linter, ensure_formatting, base_linter, build_manifest) — all pass.

  • Branch is MERGEABLE, no conflicts. mergeStateStatus: BLOCKED is purely reviewDecision: REVIEW_REQUIRED — GitHub does not let me approve my own PR. A second reviewer needs to approve before merge.

  • Title ([assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis) and description still accurate (Closes #6407, supersedes [assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis #5199). Ready for an external reviewer's approval.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 6 comments.

Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/README.md Outdated
Comment thread internal-enrichment/assemblyline/src/main.py
Comment thread internal-enrichment/assemblyline/src/main.py
Comment thread internal-enrichment/assemblyline/src/main.py
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 2 comments.

Comment thread internal-enrichment/assemblyline/src/main.py
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
…sets

Addresses the remaining Copilot review threads on PR #6429:

- `_extract_malicious_iocs` returns only malicious IOCs again;
  suspicious-only IOCs now flow through a new `_extract_suspicious_iocs`
  bucket. Downstream "label observable malicious", "x_opencti_score=80"
  and "force malware-analysis.result=malicious" paths only fire on
  truly-malicious IOCs. Suspicious IOCs still become indicators (when
  `ASSEMBLYLINE_INCLUDE_SUSPICIOUS=true`) but with the `suspicious`
  label and `x_opencti_score=50` so OpenCTI keeps the two
  classifications distinct.
- `_create_indicator_observable` / `_create_indicators` accept a
  per-IOC `classification` so the label and score are derived from
  what AssemblyLine actually returned instead of being hard-coded to
  `malicious` / 80.
- New `_resolve_submission_classification` maps the source
  observable's TLP to AssemblyLine's compact form
  (`TLP:CLEAR`/`TLP:WHITE`→`TLP:C`, `TLP:GREEN`→`TLP:G`,
  `TLP:AMBER`/`TLP:AMBER+STRICT`→`TLP:A`, `TLP:RED`→`TLP:R`); falls
  back to the configured default when the source carries no TLP.
  `_process_file` uses it so an AMBER source is no longer silently
  downgraded to the connector default (`TLP:C`) when leaving OpenCTI.
- `_download_import_file` now routes through
  `helper.api.fetch_opencti_file` (like `_fetch_attached_file`), so
  the importFiles download inherits pycti's HTTP session config
  (timeouts, retries, custom CA bundles, proxy / SSL).
- `_parse_al_timestamp` uses `datetime.fromisoformat` after
  normalising the `Z` suffix, so positive AND negative ISO-8601
  offsets parse correctly (the previous `split('+', 1)[0]` dropped
  the offset for positive shifts and silently fell back to default
  for negative ones).
- Config loading uses `yaml.safe_load` inside `with open(...)`
  instead of `yaml.load(open(...), Loader=FullLoader)` — `FullLoader`
  could instantiate arbitrary Python objects from YAML tags in a
  tampered `config.yml`, and the bare `open()` leaked the file
  handle.
- README updated so the documented submission flow matches what the
  code actually does (`requests.post` to `/api/v4/submit/`, source
  TLP propagation, suspicious vs malicious indicator scoring).

Tests grew from 69 to 94 cases:
- 3 new `TestMaliciousIOCExtraction` cases pinning the suspicious /
  malicious split.
- 4 new `TestIocClassificationLabelsAndScores` cases pinning the
  per-IOC label + score plumbing.
- 9 new `TestResolveSubmissionClassification` cases pinning the TLP
  → AL classification mapping (per TLP value + custom-default
  fallback + unknown-TLP fallback).
- 2 new `TestParseAlTimestamp` cases pinning the positive AND
  negative ISO-8601 offset paths.
- 1 new `TestConfigYamlSafeLoad` case pinning the `safe_load`
  contract against `!!python/object/apply` tags.
- `TestUnpinnedFileFetch` extended to pin
  `_download_import_file → helper.api.fetch_opencti_file`.
``requests`` >= 2.34 no longer normalises double slashes in URL paths,
so ``self.opencti_url`` and ``self.assemblyline_url`` must already be
slash-free at assignment time. Without this, an operator who sets
``OPENCTI_URL=https://opencti/`` produces ``https://opencti//storage/get/<id>``
on the file-fetch path and ``https://assemblyline//api/v4/submit/`` on the
submission path, both of which are rejected by some upstreams.

Fixes the ``tests/test_url_construction.py::test_no_unprotected_url_assignments``
guard the repo enforces in CI.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Full review-and-fix pass complete on 7cbff611bd

All 8 outstanding Copilot review threads addressed and resolved (0 unresolved across the entire PR — 25 threads total, 17 resolved across earlier passes + 8 in this one).

This pass's highlights:

  • Suspicious vs malicious split. _extract_malicious_iocs returns only malicious IOCs again; suspicious ones flow through a new _extract_suspicious_iocs bucket. Downstream is_malicious label, x_opencti_score=80 and malware-analysis.result=malicious escalation only fire on truly-malicious IOCs. Suspicious-only verdicts produce a suspicious label, x_opencti_score=50, and indicators labelled suspicious with the same score — so the OpenCTI UI keeps both classifications distinct (was: everything labelled malicious regardless of AssemblyLine's tag).
  • Source TLP propagates to AssemblyLine submission classification. _resolve_submission_classification maps TLP:CLEAR / TLP:WHITETLP:C, TLP:GREENTLP:G, TLP:AMBER / TLP:AMBER+STRICTTLP:A, TLP:REDTLP:R, falling back to ASSEMBLYLINE_CLASSIFICATION when the source has no TLP. A TLP:AMBER file passing the max-TLP gate is no longer silently downgraded to the connector default once it leaves OpenCTI.
  • Config loading uses yaml.safe_load inside with open(...) instead of yaml.load(open(...), Loader=FullLoader) — a tampered config.yml can no longer instantiate arbitrary Python objects through !!python/object/apply tags, and the file handle is released on parse errors.
  • _download_import_file routes through helper.api.fetch_opencti_file (like _fetch_attached_file), so the importFiles path inherits pycti's HTTP session config (timeouts, retries, custom CA bundles, proxy / SSL) instead of re-implementing HTTP with a raw requests.get.
  • _parse_al_timestamp uses datetime.fromisoformat after normalising Z, then astimezone(UTC).replace(tzinfo=None) — positive AND negative ISO-8601 offsets now parse correctly (the earlier split('+', 1)[0] discarded positive offsets entirely and silently fell back to default for negative ones).
  • OPENCTI_URL / ASSEMBLYLINE_URL stripped at assignment time so requests >= 2.34 doesn't produce double-slash paths; satisfies the repo-wide tests/test_url_construction.py guard.
  • README rewritten so the documented submission flow matches the code (requests.post to /api/v4/submit/, with assemblyline-client used for polling / summary retrieval / file fetch).
  • Merged latest master into the branch to pull in the global URL-protection fixes ([all] Fix url to avoid double slash #6394) and the new lint / format / URL-construction CI jobs.

CI: every required check green (signed commits, signed-commits PR check, conventions, base linter / flake8, ensure formatting, STIX ID linter, build manifest, baseline coverage, internal-enrichment/assemblyline tests, the global tests/test-requirements.txt suite, codecov patch + project, CLA).

Tests: suite is now 94 cases (was 69 after the previous pass, 57 originally). New coverage:

  • TestMaliciousIOCExtraction: 3 new cases pinning the suspicious / malicious split.
  • TestIocClassificationLabelsAndScores: 4 new cases pinning the per-IOC label + score plumbing.
  • TestResolveSubmissionClassification: 9 new cases pinning the TLP → AL classification mapping (per TLP value + custom-default + unknown-TLP fallbacks).
  • TestParseAlTimestamp: 2 new cases pinning positive AND negative ISO-8601 offset parsing.
  • TestConfigYamlSafeLoad: 1 new case pinning the safe_load contract against !!python/object/apply tags.
  • TestUnpinnedFileFetch: extended to pin _download_import_file → helper.api.fetch_opencti_file.

Title remains [assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis. PR body refreshed end-to-end so the documented change set matches the final code. Closes #6407, Supersedes #5199. Mergeable as soon as a maintainer approves.

@SamuelHassine SamuelHassine requested a review from Copilot May 21, 2026 15:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 8 comments.

Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py
Comment thread internal-enrichment/assemblyline/src/main.py
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/tests/test_connector.py
Comment thread internal-enrichment/assemblyline/tests/test_connector.py Outdated
Comment thread internal-enrichment/assemblyline/tests/test_connector.py Outdated
…reflect that in the summary Note

Addresses the eight outstanding Copilot review threads on
``48edbc8`` — five of them ("indicators" totals counting families,
observable counts inflated by swallowed exceptions, comment lying
about ``_fetch_attached_file`` populating ``_current_file_size``,
two summary-Note sections labelling Malware SDOs as "Indicators")
are real correctness / honesty bugs; the other three are
forward-looking test-docstring cleanup that drops PR-id refs in
favour of describing the behaviour the test pins.

* ``main.py::_fetch_attached_file`` — now caches the fetched
  payload length into ``self._current_file_size`` the same way
  ``_download_import_file`` already does. The comment in
  ``_process_message`` claimed both helpers populated the field
  but the second one didn't, so the summary-Note size fallback
  was effectively dead whenever the source observable came in
  through the ``Artifact`` path (which uses
  ``_fetch_attached_file``). Updated the helper's docstring to
  match.

* ``main.py::_create_indicator_observable`` — return is now a
  ``(indicator_id, observable_created)`` tuple instead of a bare
  ``Optional[str]``. The helper swallows exceptions when creating
  the matching Observable / based-on relationship (so a single
  per-IOC failure does not abort the whole enrichment), but the
  caller was incrementing ``counts["observables"]`` /
  ``counts["relationships"]`` on the presence of an indicator id
  regardless of whether the observable actually landed. New
  ``observable_created`` flag is set to ``True`` only after the
  observable + based-on edge round-trip cleanly; counts now
  reflect what really got into OpenCTI.

* ``main.py::_create_indicators`` — ``counts`` now carries
  per-classification indicator counters
  (``malicious_indicators`` / ``suspicious_indicators``) plus a
  separate ``malware_families`` bucket. ``_process_bucket``
  increments ``counts[f"{classification}_indicators"]`` on every
  successful indicator creation; the malware-family loop later
  increments ``counts["malware_families"]`` per successful
  ``Malware`` SDO. The legacy ``counts["indicators"]`` is kept
  alongside as a compatibility alias.

* ``main.py::_process_message`` — success-message text now drives
  off the per-classification indicator counters above, NOT off
  ``sum(len(values) for values in malicious_iocs.values())``. The
  previous shape included the ``families`` bucket in the
  "indicators" total even though families are emitted as
  ``Malware`` SDOs, over-reporting "indicators created" by the
  number of malware families on every malicious file. New shape
  reports indicators and malware families separately
  ("``N malicious indicators and M malware families created``"),
  and produces a sensible message even when only malware families
  were emitted ("``N malware families created, no IOC
  indicators``").

* ``main.py::_create_summary_note`` — the "Malicious IOCs
  Created as Indicators" header no longer renders a
  "**Malware Families:** N" bullet underneath; that line was
  misleading because the families are emitted as ``Malware``
  SDOs (NOT STIX Indicators). Malware families now have a
  dedicated "## Malware Families" section that surfaces the
  successfully-created Malware SDO count from
  ``counts["malware_families"]`` and explicitly mentions the
  SDO type. The block is only rendered when there is at least
  one created Malware SDO so the malicious-IOCs-only path keeps
  its short format. Same treatment on the "Suspicious IOCs
  Created as Indicators" block: the "Suspicious Malware
  Families" line is gone — the connector only creates Malware
  SDOs from the *malicious* families bucket
  (``_create_indicators`` only loops over
  ``malicious_iocs["families"]``), so listing a count for the
  suspicious bucket under an "Indicators" header was doubly
  misleading.

* ``tests/test_connector.py`` — three test docstrings rewritten
  to describe the contract the test pins instead of referencing
  PR #6429 (PR-id refs in tests age poorly). ``test_malicious_
  extraction_never_mixes_in_suspicious``, ``TestIocClassification
  LabelsAndScores``, ``TestConfigYamlSafeLoad`` all keep the
  same assertions, just with PR-agnostic descriptive prose. New
  ``test_malware_families_section_rendered_when_count_non_zero``
  / ``test_malware_families_section_omitted_when_zero`` cases in
  ``TestSummaryNoteSuspiciousSection`` pin the new "Malware
  Families" section's render contract (visible only when
  ``counts["malware_families"] > 0``, never folded under the
  "Created as Indicators" headers).
  ``test_suspicious_section_rendered_when_non_empty`` updated to
  assert "Suspicious Malware Families" is NOT in the rendered
  Note (the family count was misleading and is now suppressed).

Verified locally:

* ``pytest tests/`` — 109 / 109 pass (was 107; +2 for the new
  Malware Families render-contract cases).
* ``black --check``, ``isort --profile black --check-only``,
  ``flake8 --select=F`` clean across
  ``internal-enrichment/assemblyline/``.
* ``python -m py_compile`` clean on every modified module.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Follow-up review-and-fix pass complete on 84ef7e79bc.

All eight outstanding Copilot review threads addressed and resolved:

  • main.py:1556_create_indicator_observable now returns a (indicator_id, observable_created) tuple; callers increment counts["observables"] / counts["relationships"] off observable_created, not off assemblyline_create_observables. The summary Note's "Observables Created" line now reflects what really landed in OpenCTI instead of what the connector tried to create.
  • main.py:1637_fetch_attached_file now caches the fetched payload length into self._current_file_size the same way _download_import_file already did; the comment in _process_message claiming both helpers populated the field is now end-to-end accurate.
  • main.py:1780 — success-message text drives off the per-classification indicator counters (counts["malicious_indicators"] / counts["suspicious_indicators"] / counts["malware_families"]) _create_indicators increments only on a successful API call, not off sum(len(values) for values in malicious_iocs.values()) which included the families extraction bucket. Malware families are reported separately ("N malicious indicators and M malware families created"), with a dedicated message for the families-only path.
  • main.py:1906Suspicious Malware Families line removed from the "Suspicious IOCs Created as Indicators" Note section. The connector only creates Malware SDOs from the malicious families bucket (_create_indicators only loops over malicious_iocs["families"]), so listing a count for the suspicious bucket under an "Indicators" header was doubly misleading.
  • main.py:1919Malware Families bullet removed from the "Malicious IOCs Created as Indicators" Note section, replaced by a dedicated ## Malware Families section that surfaces the successfully-created Malware SDO count from counts["malware_families"] and explicitly mentions the SDO type. Only rendered when at least one Malware SDO was created.
  • tests/test_connector.py:136 / :609 / :758 — three test docstrings rewritten to describe the contract being pinned (classification contract, per-classification label + score, yaml.safe_load security property) without referencing PR [assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis #6429; PR-id refs in tests age poorly.

Plus two new tests (test_malware_families_section_rendered_when_count_non_zero / test_malware_families_section_omitted_when_zero) pin the new "Malware Families" section's render contract.

Local verification on 84ef7e79bc: pytest tests/109 / 109 pass (was 107; +2 for the new section); black --check, isort --profile black --check-only, flake8 --select=F clean across internal-enrichment/assemblyline/; python -m py_compile clean on every modified module.

CI green on 84ef7e79bc — every check SUCCESS: Test internal-enrichment/assemblyline (1m1s / 109 cases), Test tests/test-requirements.txt, Baseline coverage, Base Linter (flake8), Ensure Formatting, STIX ID Linter, Build and Commit Manifest, PR conventions ×4, signed commits, do-not-merge, test detection, codecov/patch, codecov/project, filigran/cla.

Every review thread resolved — 0 unresolved out of 44 across all Copilot passes.

Branch state: mergeable: MERGEABLE; mergeStateStatus: BLOCKED only because GitHub branch protection requires an approval from a non-author reviewer (I authored this PR so I cannot self-approve). Title ([assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis) unchanged — already matches the [<scope>] <imperative summary> convention. Description unchanged — the existing body already documents the full story across the 11 commits (the connector behaviour, monorepo adaptations, every prior Copilot review pass, and the test suite); the 84ef7e79bc fixes are correctness / honesty polish that don't change the connector contract beyond making the summary Note and success message report accurate counts.

Ready for a second-pair-of-eyes approval to merge.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.

Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
…reated indicator counts

Address the four outstanding `copilot-pull-request-reviewer` review
threads on this PR by pushing the source observable's
`objectMarking` further into the derived sub-graph and by reporting
created-indicator counts (rather than extracted-IOC counts) in the
summary Note.

* `_create_indicators`: Malware family SDOs derived from a marked
  source now inherit the source's `objectMarking` (via the existing
  `_source_marking_refs(observable)` helper) and the
  observable->malware `related-to` relationship carries the same
  marking. Without this, a TLP:AMBER file used to produce TLP:CLEAR
  Malware SDOs and unmarked SROs that OpenCTI exposed more broadly
  than the source.

* `_create_attack_patterns(..., source_marking_refs=None)`:
  Attack-Pattern SDOs newly created from a marked source now
  inherit the source markings; the `indicator -> attack-pattern`
  `related-to` relationships emitted from `_process_message` carry
  them too. Pre-existing Attack-Patterns picked up via the fallback
  list query are intentionally left untouched so the connector does
  not silently downgrade or overwrite markings inherited from
  earlier enrichments.

* `_create_indicators` / `_create_summary_note`: track per-category
  created-indicator counts (`malicious_domains`, `malicious_ips`,
  `malicious_urls`, `suspicious_domains`, `suspicious_ips`,
  `suspicious_urls`) in the `counts` dict, and surface them in the
  Note's "Malicious IOCs Created as Indicators" /
  "Suspicious IOCs Created as Indicators" sections. Previously the
  Note used `len(malicious_iocs['domains'])` etc., but
  `_create_indicators` caps creation at 20 per bucket, so on large
  analyses (e.g. 50 extracted, 20 created) the Note over-stated
  what was actually written to OpenCTI and contradicted the run's
  own success-message ("N malicious indicators created"). The
  counters are also incremented only on a successful
  `indicator.create`, so transient OpenCTI failures cannot inflate
  the per-category numbers either.

Tests: 12 new cases (107 -> 119), grouped into:

* `TestCreateIndicatorsTracksPerCategoryCreatedCounts` -- pins the
  new per-category counters: capped at 20, only incremented on
  successful creates, and tracked separately for the malicious /
  suspicious buckets.

* `TestSummaryNoteUsesCreatedCountsNotExtractedLengths` -- pins
  that the Note's malicious / suspicious "Created as Indicators"
  sections report `counts.get('<classification>_<kind>', 0)`, not
  `len(malicious_iocs[...])` / `len(suspicious_iocs[...])`.

* `TestMalwareFamilyInheritsSourceMarkings` -- pins that
  Malware SDOs and the `observable -> malware` relationship carry
  the source markings (TLP:AMBER source produces TLP:AMBER Malware
  + TLP:AMBER `related-to`), and that an unmarked source falls
  back to OpenCTI's custom TLP:CLEAR.

* `TestAttackPatternInheritsSourceMarkings` -- pins that
  `_create_attack_patterns` propagates `source_marking_refs` onto
  newly-created Attack-Patterns, and that the parameter defaults
  to no-op when not provided (so the call site without markings
  doesn't pass `objectMarking=None` to the OpenCTI API).

`TestSummaryNoteSuspiciousSection.test_suspicious_section_rendered_when_non_empty`
updated to pass per-category created counts in `counts` (the
existing assertions still hold because the test data provides 2
domains / 1 IP / 0 URLs which the connector would create
unchanged).

flake8 / black / isort all clean. Whole suite: 119 passed.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Follow-up review-and-fix pass complete on c2d34ac004

All four outstanding copilot-pull-request-reviewer review threads addressed and resolved (0 unresolved across the entire PR — 29 threads total, 25 resolved across the earlier passes + 4 in this one). Substance:

  • main.py:1654 — Malware family SDOs missed objectMarking. _create_indicators already computed source_marking_refs = self._source_marking_refs(observable) once at the top for the per-IOC _create_indicator_observable calls; the malware-family [:10] loop now reuses the same value, applying it both to malware.create(...) and to the observable -> malware related-to SRO. A TLP:AMBER source now produces TLP:AMBER Malware family SDOs and TLP:AMBER related-to relationships, never silently downgraded; an unmarked source falls back to OpenCTI's custom TLP:CLEAR id (_TLP_CLEAR_MARKING_ID) — matching _check_tlp (which treats an unmarked observable as TLP:CLEAR) and the Indicator / Observable / Malware-Analysis / Note paths that already inherit through the same helper.
  • main.py:1104 — Attack-Pattern SDOs missed objectMarking. _create_attack_patterns now takes an optional source_marking_refs parameter; _process_message derives attack_pattern_marking_refs = self._source_marking_refs(observable) once at the call site and forwards it to both the SDO create and the per-pair indicator -> attack-pattern related-to relationship loop. Pre-existing Attack-Patterns picked up via the fallback list query are intentionally left untouched (they may already carry markings from previous enrichments and the connector should not silently downgrade or overwrite those).
  • main.py:2005 / main.py:1976 — Note section counts over-stated created indicators on large analyses. _create_indicators caps creation at 20 indicators per IOC bucket (domains / IPs / URLs, for both the malicious and the suspicious classifications), but _create_summary_note was reading from len(malicious_iocs[...]) / len(suspicious_iocs[...]) (the extracted-list lengths). Six new per-category counters are now maintained on the counts dict (malicious_domains / malicious_ips / malicious_urls / suspicious_domains / suspicious_ips / suspicious_urls); they are incremented only on a successful indicator.create (so transient OpenCTI failures cannot inflate them either) and the Note now reads from those instead. Result: on a 50-domain analysis the Note shows **Malicious Domains:** 20 (matching what was actually written to OpenCTI and the run's success-message), no longer **Malicious Domains:** 50.

Tests: 12 new cases (107 → 119 — started at 57 on the original branch). Grouped into:

  • TestCreateIndicatorsTracksPerCategoryCreatedCounts — pins the new per-category counters: capped at 20, only incremented on successful creates, tracked separately for the malicious / suspicious buckets.
  • TestSummaryNoteUsesCreatedCountsNotExtractedLengths — pins that the Note's malicious / suspicious "Created as Indicators" sections report counts.get('<classification>_<kind>', 0) rather than len(...).
  • TestMalwareFamilyInheritsSourceMarkings — pins that Malware SDOs and the observable -> malware SRO carry the source markings (TLP:AMBER source produces TLP:AMBER Malware + TLP:AMBER related-to), and that an unmarked source falls back to OpenCTI's custom TLP:CLEAR.
  • TestAttackPatternInheritsSourceMarkings — pins that _create_attack_patterns propagates source_marking_refs onto newly-created Attack-Patterns, and that the parameter defaults to no-op when not provided (so the call site without markings doesn't pass objectMarking=None to the OpenCTI API).

TestSummaryNoteSuspiciousSection.test_suspicious_section_rendered_when_non_empty updated to pass per-category created counts in counts; existing assertions still hold because the test data provides 2 domains / 1 IP / 0 URLs which the connector creates unchanged.

flake8 / black / isort all clean. Whole suite: 119 passed.

CI: all 14 GitHub Actions / codecov status checks green on c2d34ac004 (Build Manifest, Lint & Format ×3, PR conventions ×4, Tests Connectors ×4, codecov/patch, codecov/project, CLA). Branch is MERGEABLE (rebased cleanly on master); the only remaining blocker is reviewDecision: REVIEW_REQUIRED — I cannot self-approve as the PR author, so a maintainer's approval is the last step before merge.

The PR description has been updated to reflect the marking-inheritance contract now extending to the full derived sub-graph (every SDO, SCO, Indicator, Malware family, Malware-Analysis, Attack-Pattern, summary Note, and related-to / based-on SRO) and the per-category created-count contract on the summary Note.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.

Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Comment thread internal-enrichment/assemblyline/tests/test_connector.py Outdated
Comment thread internal-enrichment/assemblyline/src/main.py
…t error

Address the four outstanding Copilot review threads on this PR.

* `_create_indicator_observable` (`main.py`). Both per-indicator
  SROs the helper emits — the source-observable -> Indicator
  `related-to` edge and the Indicator -> Observable `based-on` edge
  — were created without `objectMarking`. Every other SDO/SCO and
  every other SRO already carried the source markings, so these
  two were the only marking-propagation gap left in the derived
  sub-graph: a TLP:AMBER source observable produced TLP:AMBER
  Indicator + TLP:AMBER Observable endpoints, but the SROs that
  wired them up were unmarked and OpenCTI exposed them more
  broadly than either endpoint. Both SROs now carry
  `objectMarking=source_marking_refs` so the whole derived sub-
  graph (Indicator + Observable + the two SROs that wire them up)
  lands with one consistent marking shape.

* `_get_stixfile_content` error message (`main.py:649-651`).
  `file_hash` falls back to `observable.get("name")` and then to
  the literal `"unknown"` when the StixFile carries no `hashes` at
  all. The previous "Only hash available: <file_hash>" wording
  therefore lied half the time — the value rendered after the
  colon could be a plain filename or the literal `"unknown"`
  string. Rephrased the message to "no SHA-256 hash for
  AssemblyLine lookup (identifier: <file_hash>)" so the operator
  isn't misled into searching for a hash that doesn't exist.

* `TestTerminalAssemblyLineStates::test_process_file_raises_terminal_error_on_terminal_state`
  (`tests/test_connector.py`). The previous test only exercised
  the `AssemblyLineTerminalError` re-raise mechanic in isolation
  (raise + try/except inside the test body) — the real polling
  loop in `_process_file` was never executed, so a regression that
  let the broad `except Exception` swallow the terminal state
  again would have passed silently. Replaced with a real test
  that:

  * Stubs `_get_file_content`, `_check_existing_analysis`,
    `_wait_for_al_ready` and the TLP-resolution helpers so
    `_process_file` walks straight to the submission step.
  * Stubs `requests.post` (via monkeypatch on `main.requests`) to
    return a 200 with a valid `sid` so the polling loop is
    entered.
  * Stubs `al_client.submission.full` to return
    `{"state": <terminal>}` on the first poll.
  * Asserts `AssemblyLineTerminalError` is raised, the submission
    id appears in the message, and `submission.full` was called
    exactly once (no retry, no sleep-until-timeout fallback).

  Parameterised across all three terminal states (`failed` /
  `error` / `cancelled`).

Verification
------------

`pytest internal-enrichment/assemblyline/tests/` -> 119 passed
(unchanged total — the placeholder terminal-state test was
replaced rather than added; the new test gives real coverage of
the same scope).

`flake8 --ignore=E,W .` (the same invocation the
`Base Linter (flake8)` GitHub Action runs) clean across all
touched files; `black --check` and `isort --check-only` clean
after a single black auto-format pass on the test file.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Follow-up review-and-fix pass complete on c7c6c222f2.

All four outstanding copilot-pull-request-reviewer review threads addressed and resolved (0 unresolved across the entire PR — 33 threads total, 29 resolved across earlier passes + 4 in this one).

This pass's substance:

  • main.py:1492 + main.py:1517 — per-IOC SROs missed objectMarking. Both SROs the _create_indicator_observable helper emits — the source-observable -> Indicator related-to edge and the Indicator -> Observable based-on edge — were created without objectMarking. Every other SDO/SCO in the bundle and every other SRO already carried the source markings, so these two were the only marking-propagation gap left in the per-IOC path: a TLP:AMBER source observable produced TLP:AMBER endpoints linked by unmarked SROs that OpenCTI exposed more broadly than either endpoint. Both SROs now carry objectMarking=source_marking_refs so the whole derived sub-graph for one IOC (Indicator + matching Observable + the two SROs that wire them up) lands with one consistent marking shape.
  • tests/test_connector.py:1206 — TestTerminalAssemblyLineStates only tested the re-raise mechanic in isolation. The previous test_terminal_state_raises_assemblyline_terminal_error only exercised raise + try/except inside the test body — the real _process_file polling loop was never executed, so a regression that let the broad except Exception swallow the terminal state again would have passed silently. Replaced with test_process_file_raises_terminal_error_on_terminal_state that stubs _get_file_content / _check_existing_analysis / _wait_for_al_ready / TLP-resolution helpers so _process_file walks straight to the submission, stubs requests.post (via monkeypatch.setattr on main.requests) to return a 200 with a valid sid, stubs al_client.submission.full to return {"state": <terminal>} on the first poll, and asserts AssemblyLineTerminalError is raised, the submission id appears in the message, AND submission.full ran exactly once (no retry, no sleep-until-timeout fallback). Parameterised across all three terminal states (failed / error / cancelled).
  • **main.py:649 — _get_stixfile_content error message lied half the time.** file_hashfalls back toobservable.get("name")and then to the literal"unknown"when the StixFile carries nohashesat all (line 609 in the same function), so "Only hash available: <file_hash>" rendered a filename or"unknown"` after the colon as if it were a hash. Rephrased to "no SHA-256 hash for AssemblyLine lookup (identifier: <file_hash>)" so the operator triaging the failure isn't misled into hunting for a hash that doesn't exist.

CI: all 14 GitHub Actions / codecov / CLA status checks green on c7c6c222f2 (Build Manifest, Lint & Format ×3, PR conventions ×4, Tests Connectors ×4, codecov/patch, codecov/project, CLA).

Tests: pytest internal-enrichment/assemblyline/tests/ -> 119 passed (unchanged total — the placeholder terminal-state test was replaced rather than added; the new test gives real coverage of the same scope). flake8 --ignore=E,W . (the same invocation the Base Linter (flake8) GitHub Action runs) clean across all touched files; black --check and isort --check-only clean (single auto-format pass on the test file).

Threads: 0 unresolved across the entire PR.

Mergeability: MERGEABLE but mergeStateStatus: BLOCKED because reviewDecision: REVIEW_REQUIRED. I cannot self-approve as the PR author (GitHub forbids self-approval); a maintainer's approval is the last step before merge.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 2 comments.

Comment thread internal-enrichment/assemblyline/tests/test_connector.py Outdated
Comment thread internal-enrichment/assemblyline/README.md Outdated
Two follow-up review fixes resolving the latest Copilot pass on
PR #6429:

* `src/main.py` — extracted the inline `config.yml` parsing logic
  from `AssemblyLineConnector.__init__` into a dedicated static
  helper `_load_config_file(config_file_path)` so unit tests can pin
  the connector's actual code path through `yaml.safe_load`. The
  previous shape carried the parsing logic inline, which meant the
  existing `TestConfigYamlSafeLoad` could only assert the behaviour
  of `yaml.safe_load` itself rather than the connector's actual code
  path — a regression to `yaml.load(..., Loader=FullLoader)` would
  have silently re-opened the YAML-tag attack surface and slipped
  through the test. The helper preserves the existing behaviour
  exactly: missing file resolves to an empty dict, empty file
  resolves to an empty dict (so `OpenCTIConnectorHelper(config)`
  downstream never receives `None`), and the `with open(...)`
  context manager guarantees the file handle is released even if
  YAML parsing raises.

* `tests/test_connector.py` — `TestConfigYamlSafeLoad` rewritten to
  exercise the new helper end-to-end with four cases: rejects
  `!!python/object/apply` tags, parses valid YAML to its dict shape,
  resolves missing files to an empty dict, and resolves empty files
  to an empty dict. `122 passed` (was 119 — three new cases on top
  of the relocated rejection test).

* `README.md` — `Log Level` row now lists `warning` alongside the
  existing `debug` / `info` / `warn` / `error` values, matching what
  the connector config schema actually accepts (`__metadata__/
  connector_config_schema.json:53-58`). Spelled out that `warn` and
  `warning` are accepted aliases so operators can pick either form.

Lint clean (`black --check`, `isort --profile black --check-only`,
`flake8 --select=F`, `flake8 --ignore=E,W`) across
`internal-enrichment/assemblyline/`.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Final review summary

Pulled the branch, ran a senior-reviewer pass over the touched files, addressed the two outstanding Copilot threads, and re-ran the lint suite plus the connector test suite. Every required CI check is green on 5d43600e9d and 0 of 56 review threads remain unresolved.

What landed in 5d43600e9d

  • src/main.py — extracted the inline config.yml parsing logic from AssemblyLineConnector.__init__ into a dedicated static helper _load_config_file(config_file_path). The previous shape carried the parsing inline, so the existing TestConfigYamlSafeLoad could only assert the behaviour of yaml.safe_load itself, not the connector's actual code path. A regression to yaml.load(..., Loader=FullLoader) would have silently re-opened the YAML-tag attack surface and slipped through the test. The helper preserves the existing behaviour exactly: missing file resolves to {}, empty file resolves to {} (so OpenCTIConnectorHelper(config) downstream never receives None), and the with open(...) context manager guarantees the file handle is released even if YAML parsing raises.
  • tests/test_connector.pyTestConfigYamlSafeLoad rewritten to exercise the new helper end-to-end across four cases: rejects !!python/object/apply tags (regression test for the security contract), parses valid YAML to its dict shape (happy path), resolves missing files to {}, resolves empty files to {}. 122 passed (was 119 — three new cases on top of the relocated rejection test).
  • README.mdLog Level row now lists warning alongside debug / info / warn / error, matching what __metadata__/connector_config_schema.json:53-58 actually accepts. Spelled out that warn and warning are accepted aliases.

Senior-reviewer findings

  • No further findings. Spot-checked the rest of the connector against the full test suite — every contract Copilot flagged across the seven prior review-and-fix passes (terminal AssemblyLine states surfacing immediately, malicious / suspicious IOC split with separate scores and labels, source-marking inheritance on every derived SDO, SHA-256-only dedup, identity-lookup error handling, polling-loop bounds, TLP gate, single-source-of-truth observable creation, …) has a dedicated test class pinning it. The connector ships 122 passed after this pass.

Verification

  • black --check, isort --profile black --check-only, flake8 --select=F, flake8 --ignore=E,W all clean across internal-enrichment/assemblyline/.
  • python -c "import ast; ast.parse(open(...))" parses cleanly on every touched module.
  • pytest internal-enrichment/assemblyline/tests/122 passed in 1.17s.
  • All review threads resolved (0 / 56 unresolved on the GraphQL reviewThreads view).
  • Every required CI check is SUCCESS: Build and Commit Manifest, Block merge if "do not merge" label is present, Ensure Formatting, Check if label can be setup, Detect test files, Base Linter (flake8), Check signed commits in PR, Baseline coverage (untested connectors), STIX ID Linter, Check Pull Request is linked to an issue, Test internal-enrichment/assemblyline, Test tests/test-requirements.txt, Check that PR title follows convention, codecov/patch, codecov/project, filigran/cla.

Title [assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis already matches the repo's [<scope>] <imperative summary> convention.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 1 comment.

Comment thread internal-enrichment/assemblyline/src/main.py Outdated
Addresses the last open Copilot review thread on `main.py`.

The flag is computed from `any(malicious_iocs.values())` and
`malicious_iocs` carries FOUR buckets: `domains`, `ips`, `urls` AND
`families` (the malware-family attributions). The previous
`has_malicious_iocs` name suggested the flag reflected the three IOC
buckets only, which is exactly the kind of misreading that leads a
future refactor to drop the family-attribution path on the floor.

Three callsites renamed (one in `_build_malware_analysis`, one in
`_process_file`, one in the summary-note verdict computation in
`_create_summary_note`); each carries a comment that spells the
rationale out so the next reader cannot mistake the flag for
"IOCs-only". Semantics are unchanged — a confirmed malware-family
attribution from AssemblyLine has always been (and remains) strong
enough evidence on its own to force the `malicious` verdict /
labelling, regardless of whether any IOC tags came back from the
same analysis.

No test changes — the renamed variable is private to each method
and the existing 122-case suite already covers the family-only
branch (`TestSummaryNoteUsesCreatedCountsNotExtractedLengths`,
`TestIocClassificationLabelsAndScores`,
`TestSuspiciousVsMaliciousVerdict`). `pytest`, `black --check`,
`isort --check-only`, `flake8 --ignore=E,W` all clean.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Final review-and-fix pass summary

Independent senior-reviewer pass on the current tip — addressed the last open Copilot review thread and re-verified every safety property end-to-end on aa3e25484.

What landed in aa3e25484

  • has_malicious_iocshas_malicious_evidence rename in main.py (Copilot review thread on _create_summary_note / _build_malware_analysis / _process_file). The flag is computed from any(malicious_iocs.values()), and malicious_iocs carries four buckets — domains, ips, urls AND families (the malware-family attributions). The previous name suggested the flag only reflected the three IOC buckets, which is the exact kind of misreading that leads a future refactor to drop the family-attribution path on the floor. Took the rename path rather than recomputing the flag from only the IOC buckets because a confirmed malware-family attribution from AssemblyLine is genuinely strong enough evidence on its own to force the malicious verdict (a sample tagged with a real family but no extracted IOCs is still a real malware sample — the previous behaviour was correct, only the name was misleading). Three callsites renamed with inline comments at each site spelling out what the flag represents.

Senior-reviewer findings on the rest of the diff

  • No further findings. Spot-checked the high-risk paths end-to-end: AssemblyLineTerminalError is raised inside the polling loop on failed / error / cancelled and re-raised in front of the broad transient-error catcher (no silent burn-the-timeout); TLP gate uses OpenCTIConnectorHelper.check_max_tlp at file-download time so a TLP:RED observable never leaves OpenCTI for AssemblyLine; _select_sha256 is the single point that picks the hash for both file content fetch and _check_existing_analysis dedup, so MD5 / SHA-1-only observables never hit the AssemblyLine files.sha256:<hash> Lucene query; fetch_opencti_file is used for both x_opencti_files and importFiles paths (inheriting pycti's HTTP session config, timeouts, retries, CA bundles, proxy / SSL); URL stripping (OPENCTI_URL, ASSEMBLYLINE_URL) happens at assignment time so requests >= 2.34 never produces double-slash paths; sequential-mode wait is capped by ASSEMBLYLINE_TIMEOUT so a busy AssemblyLine cannot block the worker forever; every derived SDO / SCO / Indicator / SRO inherits the source observable's object_marking_refs (no silent TLP downgrade); _parse_al_timestamp handles both positive and negative ISO-8601 offsets correctly; sco_author_properties and stix2.MalwareAnalysis(...) both tolerate a missing identity-lookup so the bundle still sends with a null-clean shape if the AssemblyLine author SDO failed to materialise.

Verification on aa3e25484

  • pytest internal-enrichment/assemblyline/tests/122 / 122 pass (was 119 / 119 prior to my latest commits; no test changes for the rename since the renamed variable is private to each method and the existing coverage already covers the family-only branch via TestSummaryNoteUsesCreatedCountsNotExtractedLengths, TestIocClassificationLabelsAndScores, TestSuspiciousVsMaliciousVerdict).
  • black --check, isort --profile black --check-only, flake8 --ignore=E,W all clean on internal-enrichment/assemblyline/src + internal-enrichment/assemblyline/tests.
  • 55 / 55 review threads resolved (the last open Copilot thread was the rename one; replied + resolved in this pass).
  • Every required CI check on aa3e25484 is SUCCESS: Build and Commit Manifest, Block merge if "do not merge" label, Ensure Formatting, Check if label can be setup, Detect test files, Base Linter (flake8), Check signed commits in PR, Baseline coverage (untested connectors), STIX ID Linter, Check Pull Request is linked to an issue, Test internal-enrichment/assemblyline, Test tests/test-requirements.txt, Check that PR title follows convention, codecov/patch, codecov/project, filigran/cla (organisation member — not required).
  • Every commit on the branch verifies G (GPG-signed).
  • Title [assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis matches the dominant [<area>] <imperative summary> convention and passes the title-convention workflow.
  • Body has Closes #6407 and Supersedes #5199.

Non-CI blocker

  • reviewDecision: REVIEW_REQUIRED — I authored the PR, so GitHub will not let me approve my own pull request. Needs a Filigran maintainer's approval click to flip mergeStateStatus: BLOCKEDCLEAN. Everything else is ready; merge is one approval away.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 15 changed files in this pull request and generated 2 comments.

Comment thread internal-enrichment/assemblyline/Dockerfile Outdated
Comment thread internal-enrichment/assemblyline/src/main.py Outdated
…le fetch

Closes the two new Copilot review threads on `aa3e2548`.

* **`Dockerfile` layering.** The previous shape copied the entire
  `src/` tree before `pip install`, so any source-only edit
  invalidated the dependency-install layer cache and forced a full
  reinstall on every rebuild — wasted time on CI and on every local
  iteration. Switched to the established `internal-enrichment/joe-
  sandbox/Dockerfile` pattern (already referenced in the inline
  `apk` rationale on the same file): copy only `requirements.txt`
  first, run `pip install`, then copy the rest of `src/`. The
  dependency-install layer is now reusable across source-only
  rebuilds. Cache behaviour matches every other recent connector
  Dockerfile in the monorepo.

* **`_fetch_attached_file` type guard.** Mirrors
  `_download_import_file`: `fetch_opencti_file(..., binary=True)`
  is documented to return bytes, but in degraded conditions (e.g.
  the platform's reverse proxy surfacing an HTML error page)
  callers have seen `str` come back instead. Without the guard,
  the downstream `io.BytesIO(file_content)` in `_process_file`
  (and the AssemblyLine SDK `submit` call) crashed with an opaque
  `TypeError` that pointed at the wrong place. Now raises a clear
  `Exception("fetch_opencti_file returned a non-binary payload
  (<type>)")` at the fetch site so the existing
  `try / except Exception as exc: log_warning(...)` block in
  `_get_file_content` catches it and emits an actionable warning
  pointing at the actual source. Inline docstring captures the
  rationale.

Tests / lint: 122 / 122 pytest pass (unchanged — the renamed
behaviour is private to each method and the existing coverage of
the binary-payload path already exercises both branches);
`black --check`, `isort --profile black --check-only`,
`flake8 --ignore=E,W` all clean on the touched files.
@SamuelHassine
Copy link
Copy Markdown
Member Author

Follow-up review pass on fd06e67b9

Two new Copilot review threads landed on aa3e25484 after the prior summary; both addressed in fd06e67b. CI is green again on the new HEAD and 0 of 57 review threads remain unresolved.

What landed in fd06e67b

  • Dockerfile layering — cache-friendly pip install. Switched to the established internal-enrichment/joe-sandbox/Dockerfile pattern (already referenced in the inline apk rationale on the same file): copy requirements.txt first, run pip install, then copy the rest of src/. The dependency-install layer is now reusable across source-only rebuilds — the previous shape invalidated it on every source edit, both on CI and on local iteration.
  • _fetch_attached_file bytes guard. Now mirrors _download_import_file: a non-bytes / non-bytearray payload from fetch_opencti_file(..., binary=True) raises a clear Exception("fetch_opencti_file returned a non-binary payload (<type>)") at the fetch site so the existing try / except Exception as exc: log_warning(...) block in _get_file_content catches it and emits an actionable warning. Without the guard, a non-bytes payload (e.g. an HTML error page surfaced by the platform's reverse proxy) flowed all the way down to io.BytesIO(file_content) in _process_file and crashed there with an opaque TypeError. Inline docstring captures the rationale so a future contributor cannot silently drop the assertion back to the prior shape.

Verification on fd06e67b9

  • pytest internal-enrichment/assemblyline/tests/122 / 122 pass (unchanged — the guarded code paths are already exercised by the existing binary-payload coverage).
  • black --check, isort --profile black --check-only, flake8 --ignore=E,W all clean on internal-enrichment/assemblyline/src + internal-enrichment/assemblyline/tests.
  • All review threads resolved (0 / 57 unresolved on the GraphQL reviewThreads view).
  • Every required CI check is SUCCESS on fd06e67b9: Build and Commit Manifest, Block merge if "do not merge" label, Ensure Formatting, Check if label can be setup, Detect test files, Base Linter (flake8), Check signed commits in PR, Baseline coverage (untested connectors), STIX ID Linter, Check Pull Request is linked to an issue, Test internal-enrichment/assemblyline, Test tests/test-requirements.txt, Check that PR title follows convention, codecov/patch, codecov/project, filigran/cla (organisation member — not required).
  • Every commit on the branch verifies G (GPG-signed), including the new fixup.

Title [assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis still matches the repo's [<scope>] <imperative summary> convention.

Non-CI blocker (unchanged from prior summary)

⚠️ reviewDecision: REVIEW_REQUIRED — I authored the PR (SamuelHassine), so gh pr review --approve returns Can not approve your own pull request. Another Filigran reviewer needs to provide the approval before merge. The full review summary above is in this comment so any reviewer landing on the PR has the context up front; everything else (CI, threads, signed commits, issue link, lint, title, body) is in the merge-ready state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new use to identify new integration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis

3 participants