[assemblyline] Add new internal-enrichment connector for AssemblyLine 4 sandbox analysis#6429
SamuelHassine wants to merge 16 commits into
Adds a new internal-enrichment connector for AssemblyLine 4 at
`internal-enrichment/assemblyline/`.
When OpenCTI asks the connector to enrich a `StixFile` or `Artifact`,
the connector:
* downloads the file content from OpenCTI's storage API;
* submits it to the configured AssemblyLine deployment via
`assemblyline-client`;
* polls until the analysis completes (within a configurable timeout);
* pushes the results back into OpenCTI:
  * a `Malware-Analysis` SDO (submission id, profile, verdict, score)
    that surfaces in the *Malware Analysis* section of the enriched
    observable;
  * STIX `Indicator` objects (and optional matching `Observable`
    objects linked via `based-on` relationships) for every malicious
    IOC AssemblyLine extracted (domains, IPs, URLs);
  * `Malware` SDOs for every malware family attributed by
    AssemblyLine;
  * `Attack-Pattern` SDOs for every MITRE ATT&CK technique observed
    at runtime, linked to the generated indicators with `related-to`
    relationships;
  * a `Note` summarising the verdict and counts;
  * an `External-Reference` attached to the enriched observable
    pointing back to the AssemblyLine submission.
All the "create *X*" toggles and the sequential mode (queue
submissions while AssemblyLine is busy, bounded by the configured
timeout) are configurable via `ASSEMBLYLINE_CREATE_ATTACK_PATTERNS` /
`ASSEMBLYLINE_CREATE_MALWARE_ANALYSIS` /
`ASSEMBLYLINE_CREATE_OBSERVABLES` / `ASSEMBLYLINE_SEQUENTIAL_MODE`.
The connector also enforces a TLP gate (`ASSEMBLYLINE_MAX_TLP`,
default `TLP:AMBER`) before any file content leaves OpenCTI for the
external AssemblyLine deployment.
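The TLP gate can be pictured as an ordered comparison. The following is a minimal sketch and not the connector's code: `TLP_ORDER` and `tlp_allows` are hypothetical names, the relative ranking of `TLP:AMBER+STRICT` is an assumption, and treating an unmarked observable as `TLP:CLEAR` follows what a later commit in this PR says `_check_tlp` does.

```python
# Hypothetical ranking, from least to most restrictive (assumption).
TLP_ORDER = {
    "TLP:CLEAR": 0,
    "TLP:WHITE": 0,
    "TLP:GREEN": 1,
    "TLP:AMBER": 2,
    "TLP:AMBER+STRICT": 3,
    "TLP:RED": 4,
}


def tlp_allows(marking_names, max_tlp="TLP:AMBER"):
    """Return True when no TLP marking on the observable exceeds max_tlp."""
    limit = TLP_ORDER[max_tlp.upper()]
    # Ignore non-TLP markings; only known TLP names take part in the check.
    levels = [
        TLP_ORDER[name.upper()]
        for name in marking_names
        if name.upper() in TLP_ORDER
    ]
    # An observable without any TLP marking is treated as TLP:CLEAR (level 0).
    return max(levels, default=0) <= limit
```

With the default `TLP:AMBER` ceiling, a `TLP:GREEN` or unmarked file would be submitted, while a `TLP:RED` file would be refused before any content leaves OpenCTI.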
This connector is a re-submission of #5199 (which became unmergeable
after the head branch was accidentally promoted to master and then
rolled back). All the Copilot review findings on #5199 have been
addressed in this re-submission: SSL-verify normalisation, bounded
sequential-mode wait, HTTP submission timeout, hyphenated MITRE
phase names, ATT&CK→Indicator `related-to` relationships, IPv6
dispatch, STIX-pattern escaping (backslash + quote), no hard-coded
`trojan` label, `CREATE_OBSERVABLES` no longer duplicates,
StixFile `importFiles` lookup, canonical `/storage/get/<id>` URL
for `fetch_opencti_file`, TLP gate, sanitised observable in error
logs, and proper External-Reference attachment to the enriched
observable.
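As an illustration of the "STIX-pattern escaping (backslash + quote)" item, here is a small sketch under stated assumptions: `escape_stix_string` and `url_pattern` are hypothetical helper names, not functions taken from the connector. STIX pattern string literals are single-quoted, so a backslash and a single quote inside the value each need a leading backslash.

```python
def escape_stix_string(value: str) -> str:
    """Escape a value for use inside a single-quoted STIX pattern literal."""
    # Backslashes go first; otherwise the backslash added in front of a
    # quote would itself be doubled by the second replacement.
    return value.replace("\\", "\\\\").replace("'", "\\'")


def url_pattern(url: str) -> str:
    """Build a STIX comparison expression for a URL value."""
    return f"[url:value = '{escape_stix_string(url)}']"
```

Escaping in this order keeps a value such as `http://x/'` from terminating the literal early and producing an invalid pattern.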
Co-authored-by: V1D1AN <V1D1AN@users.noreply.github.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6429 +/- ##
===========================================
- Coverage 26.75% 0.62% -26.14%
===========================================
Files 1801 1727 -74
Lines 106940 107281 +341
===========================================
- Hits 28613 666 -27947
- Misses 78327 106615 +28288
Pull request overview
Adds a new AssemblyLine internal-enrichment connector that submits OpenCTI Artifact/StixFile content for sandbox analysis and imports analysis results back into OpenCTI.
Changes:
- Adds connector runtime, Docker, compose, config, manifest, and README files under `internal-enrichment/assemblyline`.
- Implements AssemblyLine submission/polling, TLP gating, IOC extraction, malware-analysis creation, notes, and external references.
- Adds unit tests covering configuration coercion, IOC handling, STIX escaping, TLP gate, sequential timeout, and file fetching.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| internal-enrichment/assemblyline/src/main.py | Main connector implementation for AssemblyLine enrichment and OpenCTI object creation. |
| internal-enrichment/assemblyline/src/requirements.txt | Runtime Python dependencies. |
| internal-enrichment/assemblyline/src/config.yml.sample | Sample manual configuration. |
| internal-enrichment/assemblyline/src/__init__.py | Package metadata. |
| internal-enrichment/assemblyline/tests/test_connector.py | Unit tests for connector behavior. |
| internal-enrichment/assemblyline/tests/test-requirements.txt | Test dependency list. |
| internal-enrichment/assemblyline/tests/__init__.py | Test package marker. |
| internal-enrichment/assemblyline/README.md | Connector documentation. |
| internal-enrichment/assemblyline/Dockerfile | Container build definition. |
| internal-enrichment/assemblyline/entrypoint.sh | Container entrypoint. |
| internal-enrichment/assemblyline/docker-compose.yml | Example deployment configuration. |
| internal-enrichment/assemblyline/.dockerignore | Docker build exclusions. |
| internal-enrichment/assemblyline/__metadata__/connector_manifest.json | Connector catalog metadata. |
Address the six outstanding Copilot review threads on the connector:

1. **TLP downgrade on derived analysis SCOs** (`src/main.py` lines 1066 / 1085 / 1093 / 1107). The connector built every derived domain / IPv4 / IPv6 / URL with ``object_marking_refs=[stix2.TLP_WHITE]``, which downgraded a file that passed the ``ASSEMBLYLINE_MAX_TLP`` gate with a higher marking (e.g. ``TLP:AMBER``): the resulting analysis observables would have shown up in OpenCTI under the wrong access bucket. Introduce a new ``_source_marking_refs(observable)`` helper that reads ``observable["objectMarking"][*].standard_id`` from the enriched observable and returns those refs (falling back to ``[stix2.TLP_WHITE["id"]]`` only when the source carries no marking at all, since every derived SCO still needs *some* marking so the platform's access-control gates work). ``_create_malware_analysis`` now uses this helper for all four derived SCO types.
2. **Terminal AssemblyLine states masked by the polling loop** (`src/main.py` line 764). The state-machine branches for ``failed`` / ``error`` / ``cancelled`` raised plain ``Exception`` from inside the polling ``try`` block, which the broad ``except Exception`` below then caught and turned into a ``log_warning``, so a terminal failure kept the connector polling until the global ``ASSEMBLYLINE_TIMEOUT``. Introduce a dedicated ``AssemblyLineTerminalError`` exception type and add a targeted ``except AssemblyLineTerminalError: raise`` before the broad ``except`` so terminal failures surface immediately while the broad handler keeps absorbing transient client errors.
3. **Manifest description / behaviour drift** (`__metadata__/connector_manifest.json` line 4). The manifest advertised "file hashes" indicators that the connector does not actually emit. Update the description to match what the code produces: domains, IPs, URLs, malware families (via dedicated Malware SDOs) and ATT&CK techniques.

Tests:

* ``tests/test_connector.py``: ``TestSourceMarkingRefs`` (5 cases) pins the helper contract: single TLP source ⇒ single marking ref; multiple markings preserved and deduplicated in encounter order; missing / empty / ``objectMarking``-less observables fall back to ``TLP:WHITE``; markings without ``standard_id`` are ignored gracefully.
* ``tests/test_connector.py``: ``TestMalwareAnalysisPropagatesSourceMarkings`` (5 cases) is an end-to-end test that runs ``_create_malware_analysis`` with a TLP:AMBER source observable and asserts every derived SCO (domain / IPv4 / IPv6 / URL) in the captured bundle carries the same ``TLP:AMBER`` marking, explicitly checking that no derived SCO leaks back to ``TLP:WHITE``.
* ``tests/test_connector.py``: ``TestTerminalAssemblyLineStates`` (2 cases) pins the new exception's behaviour: a terminal ``AssemblyLineTerminalError`` re-raised through the targeted ``except: raise`` propagates intact, and the new type is a subclass of ``Exception`` (so generic loggers still capture it) but distinct from it (so a polling loop can catch it specifically).

Whole suite is now **69 cases** (was 57); ``black --check``, ``isort --profile black --check`` and ``flake8 --select=F`` all clean.
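The targeted re-raise described in item 2 can be sketched as follows. This is an illustration under stated assumptions, not the connector's polling loop: `wait_for_completion`, `fetch_state` and the `"completed"` state name are hypothetical, and only the three terminal state names and the `AssemblyLineTerminalError` type come from the commit message.

```python
import time


class AssemblyLineTerminalError(Exception):
    """Raised when a submission reaches a state it cannot recover from."""


TERMINAL_STATES = {"failed", "error", "cancelled"}


def wait_for_completion(fetch_state, sid, timeout=60.0, interval=0.0, log=print):
    """Poll fetch_state(sid) until completion, a terminal state or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            state = fetch_state(sid)
            if state == "completed":  # assumed name of the success state
                return state
            if state in TERMINAL_STATES:
                raise AssemblyLineTerminalError(
                    f"submission {sid} ended in state {state!r}"
                )
        except AssemblyLineTerminalError:
            # Must come before the broad handler, otherwise the terminal
            # failure is logged as transient and polling continues.
            raise
        except Exception as exc:
            log(f"transient error while polling {sid}: {exc}")
        time.sleep(interval)
    raise TimeoutError(f"submission {sid} did not complete within {timeout}s")
```

Because Python evaluates `except` clauses in order, placing the specific clause first is what lets the broad handler keep absorbing transient client errors without masking terminal states.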
Full review and fix pass complete on
Ready for an external reviewer's approval.
Addresses the six new Copilot review threads on top of c98a242:

1. **API-side marking inheritance** (`src/main.py` line 1218). The STIX bundle path was already propagating the source observable's markings to derived analysis SCOs, but the OpenCTI REST API path that creates Indicator (and optional Observable) objects for every malicious IOC was not. A TLP:AMBER source therefore produced indicators / observables that the platform exposed more broadly than the source SCO. ``_create_indicators`` now takes the full source observable (not just its id), derives the marking refs via ``_source_marking_refs`` and forwards them to ``_create_indicator_observable``, which sets ``objectMarking`` on both the indicator and (when ``ASSEMBLYLINE_CREATE_OBSERVABLES`` is on) the matching observable.
2. **TLP:CLEAR fallback in `_source_marking_refs`** (line 334). The helper fell back to ``stix2.TLP_WHITE`` when the source observable carried no marking at all. That was inconsistent with ``_check_tlp`` (which treats an unmarked observable as ``TLP:CLEAR``) and with the rest of the codebase, which uses OpenCTI's custom ``TLP:CLEAR`` marking via ``pycti.MarkingDefinition.generate_id("TLP", "TLP:CLEAR")``. Fallback now uses the same custom id (exposed at module scope as ``_TLP_CLEAR_MARKING_ID``) so an unmarked source produces ``TLP:CLEAR`` derived objects, never the deprecated TLP:WHITE.
3. **SCO author convention** (lines 1112 / 1131 / 1139 / 1153). The four derived SCOs (DomainName, IPv6Address, IPv4Address, URL) set ``created_by_ref`` in ``custom_properties``, but OpenCTI's observable authoring convention is ``x_opencti_created_by_ref`` (``created_by_ref`` is reserved for SDOs/SROs). Setting the wrong field left the platform's author column empty for analysis observables. All four sites now use a shared ``sco_author_properties`` dict with the OpenCTI-specific key.

Tests:

- ``TestSourceMarkingRefs`` rewritten to pin the new TLP:CLEAR fallback contract (rather than TLP:WHITE).
- New ``TestApiIndicatorAndObservableInheritSourceMarkings`` (3 cases) pins the API-side marking propagation: a TLP:AMBER source produces a TLP:AMBER indicator and observable, and an unmarked source falls back to TLP:CLEAR, never TLP:WHITE.
- New ``test_derived_scos_use_x_opencti_created_by_ref`` (in ``TestMalwareAnalysisPropagatesSourceMarkings``) asserts the serialised SCO body contains ``x_opencti_created_by_ref`` and never the standard SDO ``created_by_ref``.
- Existing call sites in tests that passed ``"obs-root"`` as a string to ``_create_indicators`` now pass the matching ``{"id": "obs-root", "objectMarking": []}`` dict to match the new signature.

Whole suite is now 73 cases (was 69); ``black --check``, ``isort --profile black --check`` and ``flake8 --ignore=E,W`` all clean.
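The marking-inheritance contract pinned by these tests can be approximated with the sketch below. It is not the connector's `_source_marking_refs`: the function takes the fallback id as a parameter, and `PLACEHOLDER_TLP_CLEAR_ID` is an invented stand-in for the id the commit message says is produced by `pycti.MarkingDefinition.generate_id("TLP", "TLP:CLEAR")`.

```python
# Invented placeholder; the real id comes from pycti (see lead-in).
PLACEHOLDER_TLP_CLEAR_ID = "marking-definition--00000000-0000-4000-8000-000000000000"


def source_marking_refs(observable, fallback_id=PLACEHOLDER_TLP_CLEAR_ID):
    """Collect the source observable's marking ids, deduplicated in order."""
    refs = []
    for marking in (observable or {}).get("objectMarking") or []:
        ref = marking.get("standard_id")
        # Skip markings without a standard_id and avoid duplicates.
        if ref and ref not in refs:
            refs.append(ref)
    # Derived objects always get a marking: fall back when none was found.
    return refs or [fallback_id]
```

Returning a list in every case lets callers pass the result straight into `objectMarking` / `object_marking_refs` without a separate "unmarked" branch.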
Full review-and-fix pass complete on 9ddcf44, second-pass Copilot threads:
Ready for an external reviewer's approval.
…assification config key, ship config schema

Addresses the five new Copilot review threads on top of 9ddcf44:

1. **Marking propagation on the Malware-Analysis SDO** (`src/main.py` line 1173). `source_marking_refs` was already computed at the top of `_create_malware_analysis` and applied to the four derived analysis SCOs, but the Malware-Analysis SDO itself was emitted unmarked. For a TLP:AMBER source observable that left the verdict / submission id / score visible to users who could not see the underlying file. The SDO now carries `object_marking_refs=source_marking_refs` like the rest of the bundle. Pinned by `test_malware_analysis_inherits_amber`.
2. **Marking propagation on the summary Note** (`src/main.py` line 1610). `_create_summary_note` uses the OpenCTI REST API path (`helper.api.note.create`); the `note_data` dict was missing the `objectMarking` key, so the API-created Note inherited no marking and exposed the verdict / sid / file hash / size / IOC counts more broadly than the source. `note_data` now carries `objectMarking=self._source_marking_refs(observable)` which reuses the same helper as the bundle path (TLP:CLEAR fallback for unmarked sources, not the deprecated TLP:WHITE). Pinned by the new `TestSummaryNoteInheritsSourceMarking` class (2 cases: TLP:AMBER propagation + TLP:CLEAR fallback).
3. **`classification` missing from `config.yml.sample`** (line 30). `AssemblyLineConnector.__init__` already reads `assemblyline.classification` (with the `ASSEMBLYLINE_CLASSIFICATION` env-var fallback), but the sample config omitted the YAML key, so manual deployments that copied the file could not discover the submission classification without using an environment variable. Added `classification: 'TLP:C'` to the sample's `# Submission settings` block.
4. **README documents `classification` as env-var-only** (`README.md` line 109). The configuration table for the `Submission Classification` row had `-` in the `config.yml` column, contradicting the connector code. Updated the cell to `classification` so the documented YAML key matches what `get_config_variable` actually looks up.
5. **`connector_config_schema.json` missing from `__metadata__/`** (`__metadata__/connector_manifest.json` line 1). The global manifest generator embeds a connector's `config_schema` only when this file exists; without it the catalog / manager entry would not expose the AssemblyLine configuration fields. Added a complete JSON-Schema draft-2020-12 document under `__metadata__/connector_config_schema.json` covering every OpenCTI / connector / AssemblyLine env-var the connector reads, including types, defaults, enums (`submission_profile`, `log_level`, `max_tlp`), and the canonical `required` set (`OPENCTI_URL`, `OPENCTI_TOKEN`, `ASSEMBLYLINE_URL`, `ASSEMBLYLINE_USER`, `ASSEMBLYLINE_APIKEY`).

Whole suite is now 76 cases (was 73); `black --check` and `isort --profile black --check` clean.
Full review-and-fix pass complete on
…sets

Addresses the remaining Copilot review threads on PR #6429:

- `_extract_malicious_iocs` returns only malicious IOCs again; suspicious-only IOCs now flow through a new `_extract_suspicious_iocs` bucket. Downstream "label observable malicious", "x_opencti_score=80" and "force malware-analysis.result=malicious" paths only fire on truly-malicious IOCs. Suspicious IOCs still become indicators (when `ASSEMBLYLINE_INCLUDE_SUSPICIOUS=true`) but with the `suspicious` label and `x_opencti_score=50` so OpenCTI keeps the two classifications distinct.
- `_create_indicator_observable` / `_create_indicators` accept a per-IOC `classification` so the label and score are derived from what AssemblyLine actually returned instead of being hard-coded to `malicious` / 80.
- New `_resolve_submission_classification` maps the source observable's TLP to AssemblyLine's compact form (`TLP:CLEAR`/`TLP:WHITE`→`TLP:C`, `TLP:GREEN`→`TLP:G`, `TLP:AMBER`/`TLP:AMBER+STRICT`→`TLP:A`, `TLP:RED`→`TLP:R`); falls back to the configured default when the source carries no TLP. `_process_file` uses it so an AMBER source is no longer silently downgraded to the connector default (`TLP:C`) when leaving OpenCTI.
- `_download_import_file` now routes through `helper.api.fetch_opencti_file` (like `_fetch_attached_file`), so the importFiles download inherits pycti's HTTP session config (timeouts, retries, custom CA bundles, proxy / SSL).
- `_parse_al_timestamp` uses `datetime.fromisoformat` after normalising the `Z` suffix, so positive AND negative ISO-8601 offsets parse correctly (the previous `split('+', 1)[0]` dropped the offset for positive shifts and silently fell back to default for negative ones).
- Config loading uses `yaml.safe_load` inside `with open(...)` instead of `yaml.load(open(...), Loader=FullLoader)`: `FullLoader` could instantiate arbitrary Python objects from YAML tags in a tampered `config.yml`, and the bare `open()` leaked the file handle.
- README updated so the documented submission flow matches what the code actually does (`requests.post` to `/api/v4/submit/`, source TLP propagation, suspicious vs malicious indicator scoring).

Tests grew from 69 to 94 cases:

- 3 new `TestMaliciousIOCExtraction` cases pinning the suspicious / malicious split.
- 4 new `TestIocClassificationLabelsAndScores` cases pinning the per-IOC label + score plumbing.
- 9 new `TestResolveSubmissionClassification` cases pinning the TLP → AL classification mapping (per TLP value + custom-default fallback + unknown-TLP fallback).
- 2 new `TestParseAlTimestamp` cases pinning the positive AND negative ISO-8601 offset paths.
- 1 new `TestConfigYamlSafeLoad` case pinning the `safe_load` contract against `!!python/object/apply` tags.
- `TestUnpinnedFileFetch` extended to pin `_download_import_file → helper.api.fetch_opencti_file`.
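Two of the helpers described in this commit message lend themselves to a short sketch. The code below is an approximation under stated assumptions rather than the connector's implementation: the function names drop the leading underscore, the signatures are guesses, and only the TLP mapping table and the "normalise `Z`, then `datetime.fromisoformat`" approach are taken from the text.

```python
from datetime import datetime

# Mapping as listed in the commit message.
AL_CLASSIFICATION = {
    "TLP:CLEAR": "TLP:C",
    "TLP:WHITE": "TLP:C",
    "TLP:GREEN": "TLP:G",
    "TLP:AMBER": "TLP:A",
    "TLP:AMBER+STRICT": "TLP:A",
    "TLP:RED": "TLP:R",
}


def resolve_submission_classification(source_tlp, default="TLP:C"):
    """Map an OpenCTI TLP name to AssemblyLine's compact classification."""
    if not source_tlp:
        return default  # source carries no TLP
    return AL_CLASSIFICATION.get(source_tlp.upper(), default)


def parse_al_timestamp(value, default=None):
    """Parse an ISO-8601 timestamp, keeping positive and negative offsets."""
    if not value:
        return default
    text = value.strip()
    if text.endswith("Z"):
        # Older fromisoformat versions do not accept a bare 'Z' suffix.
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return default
```

Parsing the whole string instead of splitting on `+` is what keeps the offset attached for both positive and negative shifts.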
``requests`` >= 2.34 no longer normalises double slashes in URL paths, so ``self.opencti_url`` and ``self.assemblyline_url`` must already be slash-free at assignment time. Without this, an operator who sets ``OPENCTI_URL=https://opencti/`` produces ``https://opencti//storage/get/<id>`` on the file-fetch path and ``https://assemblyline//api/v4/submit/`` on the submission path, both of which are rejected by some upstreams. Fixes the ``tests/test_url_construction.py::test_no_unprotected_url_assignments`` guard the repo enforces in CI.
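A minimal sketch of the normalisation this commit describes, assuming hypothetical helper names (`normalise_base_url`, `join_url`) that are not taken from the connector: the base URL is stripped of trailing slashes once, at assignment time, so every later concatenation yields a single separator.

```python
def normalise_base_url(url: str) -> str:
    """Drop surrounding whitespace and trailing slashes from a base URL."""
    return url.strip().rstrip("/")


def join_url(base: str, path: str) -> str:
    """Join a base URL and a path with exactly one slash between them."""
    return f"{normalise_base_url(base)}/{path.lstrip('/')}"
```

For example, `join_url("https://opencti/", "/storage/get/abc")` gives `https://opencti/storage/get/abc` rather than a path with a double slash.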
Full review-and-fix pass complete on
…reflect that in the summary Note
Addresses the eight outstanding Copilot review threads on
``48edbc8`` — five of them ("indicators" totals counting families,
observable counts inflated by swallowed exceptions, comment lying
about ``_fetch_attached_file`` populating ``_current_file_size``,
two summary-Note sections labelling Malware SDOs as "Indicators")
are real correctness / honesty bugs; the other three are
forward-looking test-docstring cleanup that drops PR-id refs in
favour of describing the behaviour the test pins.
* ``main.py::_fetch_attached_file`` — now caches the fetched
payload length into ``self._current_file_size`` the same way
``_download_import_file`` already does. The comment in
``_process_message`` claimed both helpers populated the field
but the second one didn't, so the summary-Note size fallback
was effectively dead whenever the source observable came in
through the ``Artifact`` path (which uses
``_fetch_attached_file``). Updated the helper's docstring to
match.
* ``main.py::_create_indicator_observable`` — return is now a
``(indicator_id, observable_created)`` tuple instead of a bare
``Optional[str]``. The helper swallows exceptions when creating
the matching Observable / based-on relationship (so a single
per-IOC failure does not abort the whole enrichment), but the
caller was incrementing ``counts["observables"]`` /
``counts["relationships"]`` on the presence of an indicator id
regardless of whether the observable actually landed. New
``observable_created`` flag is set to ``True`` only after the
observable + based-on edge round-trip cleanly; counts now
reflect what really got into OpenCTI.
* ``main.py::_create_indicators`` — ``counts`` now carries
per-classification indicator counters
(``malicious_indicators`` / ``suspicious_indicators``) plus a
separate ``malware_families`` bucket. ``_process_bucket``
increments ``counts[f"{classification}_indicators"]`` on every
successful indicator creation; the malware-family loop later
increments ``counts["malware_families"]`` per successful
``Malware`` SDO. The legacy ``counts["indicators"]`` is kept
alongside as a compatibility alias.
* ``main.py::_process_message`` — success-message text now drives
off the per-classification indicator counters above, NOT off
``sum(len(values) for values in malicious_iocs.values())``. The
previous shape included the ``families`` bucket in the
"indicators" total even though families are emitted as
``Malware`` SDOs, over-reporting "indicators created" by the
number of malware families on every malicious file. New shape
reports indicators and malware families separately
("``N malicious indicators and M malware families created``"),
and produces a sensible message even when only malware families
were emitted ("``N malware families created, no IOC
indicators``").
* ``main.py::_create_summary_note`` — the "Malicious IOCs
Created as Indicators" header no longer renders a
"**Malware Families:** N" bullet underneath; that line was
misleading because the families are emitted as ``Malware``
SDOs (NOT STIX Indicators). Malware families now have a
dedicated "## Malware Families" section that surfaces the
successfully-created Malware SDO count from
``counts["malware_families"]`` and explicitly mentions the
SDO type. The block is only rendered when there is at least
one created Malware SDO so the malicious-IOCs-only path keeps
its short format. Same treatment on the "Suspicious IOCs
Created as Indicators" block: the "Suspicious Malware
Families" line is gone — the connector only creates Malware
SDOs from the *malicious* families bucket
(``_create_indicators`` only loops over
``malicious_iocs["families"]``), so listing a count for the
suspicious bucket under an "Indicators" header was doubly
misleading.
* ``tests/test_connector.py`` — three test docstrings rewritten
to describe the contract the test pins instead of referencing
PR #6429 (PR-id refs in tests age poorly).
``test_malicious_extraction_never_mixes_in_suspicious``,
``TestIocClassificationLabelsAndScores`` and
``TestConfigYamlSafeLoad`` all keep the
same assertions, just with PR-agnostic descriptive prose. New
``test_malware_families_section_rendered_when_count_non_zero``
/ ``test_malware_families_section_omitted_when_zero`` cases in
``TestSummaryNoteSuspiciousSection`` pin the new "Malware
Families" section's render contract (visible only when
``counts["malware_families"] > 0``, never folded under the
"Created as Indicators" headers).
``test_suspicious_section_rendered_when_non_empty`` updated to
assert "Suspicious Malware Families" is NOT in the rendered
Note (the family count was misleading and is now suppressed).
Verified locally:
* ``pytest tests/`` — 109 / 109 pass (was 107; +2 for the new
Malware Families render-contract cases).
* ``black --check``, ``isort --profile black --check-only``,
``flake8 --select=F`` clean across
``internal-enrichment/assemblyline/``.
* ``python -m py_compile`` clean on every modified module.
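The counting change can be illustrated with a small sketch. It assumes hypothetical callables (`create_indicator`, `create_observable`) in place of the OpenCTI API calls and a simplified `tally` helper; only the `(indicator_id, observable_created)` return shape and the counter names come from the commit message.

```python
from typing import Callable, Iterable, Optional, Tuple


def create_indicator_observable(
    create_indicator: Callable[[str], str],
    create_observable: Callable[[str, str], None],
    ioc: str,
) -> Tuple[Optional[str], bool]:
    """Return (indicator_id, observable_created) for one IOC."""
    try:
        indicator_id = create_indicator(ioc)
    except Exception:
        return None, False
    observable_created = False
    try:
        create_observable(ioc, indicator_id)
        observable_created = True  # set only after a clean round-trip
    except Exception:
        pass  # a single per-IOC failure must not abort the enrichment
    return indicator_id, observable_created


def tally(results: Iterable[Tuple[Optional[str], bool]], classification: str) -> dict:
    """Count what was really created, not what was attempted."""
    key = f"{classification}_indicators"
    counts = {"indicators": 0, key: 0, "observables": 0}
    for indicator_id, observable_created in results:
        if indicator_id:
            counts["indicators"] += 1
            counts[key] += 1
        if observable_created:
            counts["observables"] += 1
    return counts
```

Separating the two flags means a swallowed observable failure lowers `counts["observables"]` without hiding the indicator that did get created.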
Follow-up review-and-fix pass complete on
All eight outstanding Copilot review threads addressed and resolved:
Plus two new tests (
Local verification on
CI green on
Every review thread resolved: 0 unresolved out of 44 across all Copilot passes.
Branch state:
Ready for a second-pair-of-eyes approval to merge.
…reated indicator counts
Address the four outstanding `copilot-pull-request-reviewer` review
threads on this PR by pushing the source observable's
`objectMarking` further into the derived sub-graph and by reporting
created-indicator counts (rather than extracted-IOC counts) in the
summary Note.
* `_create_indicators`: Malware family SDOs derived from a marked
source now inherit the source's `objectMarking` (via the existing
`_source_marking_refs(observable)` helper) and the
observable->malware `related-to` relationship carries the same
marking. Without this, a TLP:AMBER file used to produce TLP:CLEAR
Malware SDOs and unmarked SROs that OpenCTI exposed more broadly
than the source.
* `_create_attack_patterns(..., source_marking_refs=None)`:
Attack-Pattern SDOs newly created from a marked source now
inherit the source markings; the `indicator -> attack-pattern`
`related-to` relationships emitted from `_process_message` carry
them too. Pre-existing Attack-Patterns picked up via the fallback
list query are intentionally left untouched so the connector does
not silently downgrade or overwrite markings inherited from
earlier enrichments.
* `_create_indicators` / `_create_summary_note`: track per-category
created-indicator counts (`malicious_domains`, `malicious_ips`,
`malicious_urls`, `suspicious_domains`, `suspicious_ips`,
`suspicious_urls`) in the `counts` dict, and surface them in the
Note's "Malicious IOCs Created as Indicators" /
"Suspicious IOCs Created as Indicators" sections. Previously the
Note used `len(malicious_iocs['domains'])` etc., but
`_create_indicators` caps creation at 20 per bucket, so on large
analyses (e.g. 50 extracted, 20 created) the Note over-stated
what was actually written to OpenCTI and contradicted the run's
own success-message ("N malicious indicators created"). The
counters are also incremented only on a successful
`indicator.create`, so transient OpenCTI failures cannot inflate
the per-category numbers either.
Tests: 12 new cases (107 -> 119), grouped into:
* `TestCreateIndicatorsTracksPerCategoryCreatedCounts` -- pins the
new per-category counters: capped at 20, only incremented on
successful creates, and tracked separately for the malicious /
suspicious buckets.
* `TestSummaryNoteUsesCreatedCountsNotExtractedLengths` -- pins
that the Note's malicious / suspicious "Created as Indicators"
sections report `counts.get('<classification>_<kind>', 0)`, not
`len(malicious_iocs[...])` / `len(suspicious_iocs[...])`.
* `TestMalwareFamilyInheritsSourceMarkings` -- pins that
Malware SDOs and the `observable -> malware` relationship carry
the source markings (TLP:AMBER source produces TLP:AMBER Malware
+ TLP:AMBER `related-to`), and that an unmarked source falls
back to OpenCTI's custom TLP:CLEAR.
* `TestAttackPatternInheritsSourceMarkings` -- pins that
`_create_attack_patterns` propagates `source_marking_refs` onto
newly-created Attack-Patterns, and that the parameter defaults
to no-op when not provided (so the call site without markings
doesn't pass `objectMarking=None` to the OpenCTI API).
`TestSummaryNoteSuspiciousSection.test_suspicious_section_rendered_when_non_empty`
updated to pass per-category created counts in `counts` (the
existing assertions still hold because the test data provides 2
domains / 1 IP / 0 URLs which the connector would create
unchanged).
flake8 / black / isort all clean. Whole suite: 119 passed.
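To make the "extracted versus created" distinction concrete, here is a sketch under stated assumptions: `create_capped` and its `create` callback are hypothetical, and only the cap of 20 per bucket, the `<classification>_<kind>` counter names and the "increment only on success" rule come from the commit message.

```python
MAX_PER_BUCKET = 20  # cap per bucket stated in the commit message


def create_capped(iocs_by_kind, create, classification, counts=None):
    """Create at most MAX_PER_BUCKET indicators per kind and count successes."""
    counts = {} if counts is None else counts
    for kind, values in iocs_by_kind.items():
        key = f"{classification}_{kind}"
        counts.setdefault(key, 0)
        for value in list(values)[:MAX_PER_BUCKET]:
            try:
                create(kind, value)
            except Exception:
                continue  # a failed create must not inflate the counter
            counts[key] += 1
    return counts
```

With 50 extracted domains and one failed create among the first 20, the counter ends at 19, which is the number a summary built on `counts` would report instead of 50.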
Follow-up review-and-fix pass complete on
…t error
Address the four outstanding Copilot review threads on this PR.
* `_create_indicator_observable` (`main.py`). Both per-indicator
SROs the helper emits — the source-observable -> Indicator
`related-to` edge and the Indicator -> Observable `based-on` edge
— were created without `objectMarking`. Every other SDO/SCO and
every other SRO already carried the source markings, so these
two were the only marking-propagation gap left in the derived
sub-graph: a TLP:AMBER source observable produced TLP:AMBER
Indicator + TLP:AMBER Observable endpoints, but the SROs that
wired them up were unmarked and OpenCTI exposed them more
broadly than either endpoint. Both SROs now carry
`objectMarking=source_marking_refs` so the whole derived
sub-graph (Indicator + Observable + the two SROs that wire them up)
lands with one consistent marking shape.
* `_get_stixfile_content` error message (`main.py:649-651`).
`file_hash` falls back to `observable.get("name")` and then to
the literal `"unknown"` when the StixFile carries no `hashes` at
all. The previous "Only hash available: <file_hash>" wording
therefore lied half the time — the value rendered after the
colon could be a plain filename or the literal `"unknown"`
string. Rephrased the message to "no SHA-256 hash for
AssemblyLine lookup (identifier: <file_hash>)" so the operator
isn't misled into searching for a hash that doesn't exist.
* `TestTerminalAssemblyLineStates::test_process_file_raises_terminal_error_on_terminal_state`
(`tests/test_connector.py`). The previous test only exercised
the `AssemblyLineTerminalError` re-raise mechanic in isolation
(raise + try/except inside the test body) — the real polling
loop in `_process_file` was never executed, so a regression that
let the broad `except Exception` swallow the terminal state
again would have passed silently. Replaced with a real test
that:
* Stubs `_get_file_content`, `_check_existing_analysis`,
`_wait_for_al_ready` and the TLP-resolution helpers so
`_process_file` walks straight to the submission step.
* Stubs `requests.post` (via monkeypatch on `main.requests`) to
return a 200 with a valid `sid` so the polling loop is
entered.
* Stubs `al_client.submission.full` to return
`{"state": <terminal>}` on the first poll.
* Asserts `AssemblyLineTerminalError` is raised, the submission
id appears in the message, and `submission.full` was called
exactly once (no retry, no sleep-until-timeout fallback).
Parameterised across all three terminal states (`failed` /
`error` / `cancelled`).
Verification
------------
`pytest internal-enrichment/assemblyline/tests/` -> 119 passed
(unchanged total — the placeholder terminal-state test was
replaced rather than added; the new test gives real coverage of
the same scope).
`flake8 --ignore=E,W .` (the same invocation the
`Base Linter (flake8)` GitHub Action runs) clean across all
touched files; `black --check` and `isort --check-only` clean
after a single black auto-format pass on the test file.
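The shape of the replaced terminal-state test can be sketched with `unittest.mock`. This is an illustration only: `poll` is a hypothetical stand-in for the polling loop inside `_process_file`, the `"completed"` state name is assumed, and the real test stubs several connector helpers that are omitted here.

```python
from unittest.mock import MagicMock


class AssemblyLineTerminalError(Exception):
    """Stand-in for the connector's terminal-state exception."""


def poll(client, sid, max_polls=5):
    """Hypothetical stand-in for the polling loop in _process_file."""
    for _ in range(max_polls):
        try:
            state = client.submission.full(sid).get("state")
            if state in ("failed", "error", "cancelled"):
                raise AssemblyLineTerminalError(f"{sid} ended in {state}")
            if state == "completed":  # assumed success state name
                return state
        except AssemblyLineTerminalError:
            raise
        except Exception:
            continue  # transient error: try again
    raise TimeoutError(sid)


def check_terminal_state(state):
    """Assert a terminal state surfaces on the first poll, without retries."""
    client = MagicMock()
    client.submission.full.return_value = {"state": state}
    try:
        poll(client, "sid-123")
    except AssemblyLineTerminalError as exc:
        assert "sid-123" in str(exc)
    else:
        raise AssertionError("terminal state was swallowed")
    client.submission.full.assert_called_once_with("sid-123")
    return True


for terminal in ("failed", "error", "cancelled"):
    check_terminal_state(terminal)
```

Asserting the call count is the part that catches a regression: if the broad handler swallowed the terminal error again, `submission.full` would be called more than once.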
Follow-up review-and-fix pass complete on
All four outstanding
This pass's substance:
CI: all 14 GitHub Actions / codecov / CLA status checks green on
Tests:
Threads: 0 unresolved across the entire PR.
Mergeability:
Two follow-up review fixes resolving the latest Copilot pass on PR #6429:

* `src/main.py`: extracted the inline `config.yml` parsing logic from `AssemblyLineConnector.__init__` into a dedicated static helper `_load_config_file(config_file_path)` so unit tests can pin the connector's actual code path through `yaml.safe_load`. The previous shape carried the parsing logic inline, which meant the existing `TestConfigYamlSafeLoad` could only assert the behaviour of `yaml.safe_load` itself rather than the connector's actual code path; a regression to `yaml.load(..., Loader=FullLoader)` would have silently re-opened the YAML-tag attack surface and slipped through the test. The helper preserves the existing behaviour exactly: missing file resolves to an empty dict, empty file resolves to an empty dict (so `OpenCTIConnectorHelper(config)` downstream never receives `None`), and the `with open(...)` context manager guarantees the file handle is released even if YAML parsing raises.
* `tests/test_connector.py`: `TestConfigYamlSafeLoad` rewritten to exercise the new helper end-to-end with four cases: rejects `!!python/object/apply` tags, parses valid YAML to its dict shape, resolves missing files to an empty dict, and resolves empty files to an empty dict. `122 passed` (was 119: three new cases on top of the relocated rejection test).
* `README.md`: `Log Level` row now lists `warning` alongside the existing `debug` / `info` / `warn` / `error` values, matching what the connector config schema actually accepts (`__metadata__/connector_config_schema.json:53-58`). Spelled out that `warn` and `warning` are accepted aliases so operators can pick either form.

Lint clean (`black --check`, `isort --profile black --check-only`, `flake8 --select=F`, `flake8 --ignore=E,W`) across `internal-enrichment/assemblyline/`.
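The helper contract described above can be sketched as follows. This is an approximation that assumes PyYAML is installed and uses a module-level function named `load_config_file`; the connector's actual `_load_config_file` is a static method whose body is not shown in this PR page.

```python
import os

import yaml  # PyYAML, assumed to be available


def load_config_file(config_file_path):
    """Load config.yml with the safe loader; never return None."""
    if not os.path.isfile(config_file_path):
        return {}  # missing file resolves to an empty dict
    # The context manager releases the handle even if parsing raises.
    with open(config_file_path, encoding="utf-8") as handle:
        # safe_load refuses python-specific tags such as
        # !!python/object/apply; an empty file yields None, hence "or {}".
        return yaml.safe_load(handle) or {}
```

Because the parsing now lives in one function, a test can feed it a tampered file and assert that a `yaml.YAMLError` is raised instead of arbitrary code being executed.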
Final review summary
Pulled the branch, ran a senior-reviewer pass over the touched files, addressed the two outstanding Copilot threads, and re-ran the lint suite plus the connector test suite. Every required CI check is green on
What landed in
Addresses the last open Copilot review thread on `main.py`.

The flag is computed from `any(malicious_iocs.values())` and `malicious_iocs` carries FOUR buckets: `domains`, `ips`, `urls` AND `families` (the malware-family attributions). The previous `has_malicious_iocs` name suggested the flag reflected the three IOC buckets only, which is exactly the kind of misreading that leads a future refactor to drop the family-attribution path on the floor.

Three callsites renamed (one in `_build_malware_analysis`, one in `_process_file`, one in the summary-note verdict computation in `_create_summary_note`); each carries a comment that spells the rationale out so the next reader cannot mistake the flag for "IOCs-only".

Semantics are unchanged: a confirmed malware-family attribution from AssemblyLine has always been (and remains) strong enough evidence on its own to force the `malicious` verdict / labelling, regardless of whether any IOC tags came back from the same analysis.

No test changes: the renamed variable is private to each method and the existing 122-case suite already covers the family-only branch (`TestSummaryNoteUsesCreatedCountsNotExtractedLengths`, `TestIocClassificationLabelsAndScores`, `TestSuspiciousVsMaliciousVerdict`). `pytest`, `black --check`, `isort --check-only`, `flake8 --ignore=E,W` all clean.
Final review-and-fix pass summary
Independent senior-reviewer pass on the current tip: addressed the last open Copilot review thread and re-verified every safety property end-to-end on
What landed in
…le fetch
Closes the two new Copilot review threads on `aa3e2548`.
* **`Dockerfile` layering.** The previous shape copied the entire
`src/` tree before `pip install`, so any source-only edit
invalidated the dependency-install layer cache and forced a full
reinstall on every rebuild — wasted time on CI and on every local
iteration. Switched to the established
`internal-enrichment/joe-sandbox/Dockerfile` pattern (already referenced in the inline
`apk` rationale on the same file): copy only `requirements.txt`
first, run `pip install`, then copy the rest of `src/`. The
dependency-install layer is now reusable across source-only
rebuilds. Cache behaviour matches every other recent connector
Dockerfile in the monorepo.
* **`_fetch_attached_file` type guard.** Mirrors
`_download_import_file`: `fetch_opencti_file(..., binary=True)`
is documented to return bytes, but in degraded conditions (e.g.
the platform's reverse proxy surfacing an HTML error page)
callers have seen `str` come back instead. Without the guard,
the downstream `io.BytesIO(file_content)` in `_process_file`
(and the AssemblyLine SDK `submit` call) crashed with an opaque
`TypeError` that pointed at the wrong place. Now raises a clear
`Exception("fetch_opencti_file returned a non-binary payload
(<type>)")` at the fetch site so the existing
`try / except Exception as exc: log_warning(...)` block in
`_get_file_content` catches it and emits an actionable warning
pointing at the actual source. Inline docstring captures the
rationale.
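The type guard described in the second bullet can be sketched as follows. The helper names are hypothetical, and rendering `<type>` as the Python type name is an assumption of this sketch; only the message prefix comes from the commit text.

```python
import io


def ensure_binary_payload(file_content: object) -> bytes:
    """Fail at the fetch site when the payload is not bytes-like."""
    if not isinstance(file_content, (bytes, bytearray)):
        # Assumption: the "<type>" placeholder is the Python type name.
        raise Exception(
            "fetch_opencti_file returned a non-binary payload "
            f"({type(file_content).__name__})"
        )
    return bytes(file_content)


def open_as_stream(file_content: object) -> io.BytesIO:
    """Wrap a validated payload; io.BytesIO alone would raise TypeError on str."""
    return io.BytesIO(ensure_binary_payload(file_content))


try:
    # Simulates a reverse proxy returning an HTML error page as text.
    open_as_stream("<html>502 Bad Gateway</html>")
except Exception as exc:
    print(f"warning: {exc}")
```

Raising a plain `Exception` keeps the guard compatible with a broad `except Exception` handler at the call site, so the failure is logged where the bad payload originated.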
Tests / lint: 122 / 122 pytest pass (unchanged — the renamed
behaviour is private to each method and the existing coverage of
the binary-payload path already exercises both branches);
`black --check`, `isort --profile black --check-only`,
`flake8 --ignore=E,W` all clean on the touched files.
Follow-up review pass on
Proposed changes
Adds a new internal-enrichment connector for AssemblyLine 4 at `internal-enrichment/assemblyline/`.
When OpenCTI asks the connector to enrich a `StixFile` or `Artifact`, the connector:
* downloads the file content from OpenCTI (via pycti's `helper.api.fetch_opencti_file`, inheriting the platform's session timeouts, retries, custom CA bundles and proxy / SSL settings);
* submits it to the configured AssemblyLine deployment, mapping the source TLP to an AssemblyLine classification (`TLP:CLEAR` → `TLP:C`, `TLP:GREEN` → `TLP:G`, `TLP:AMBER` → `TLP:A`, `TLP:RED` → `TLP:R`), falling back to `ASSEMBLYLINE_CLASSIFICATION` when the source has no TLP, so a sample is never silently downgraded once it leaves OpenCTI;
* polls until the analysis completes (surfacing `failed` / `error` / `cancelled` states immediately via `AssemblyLineTerminalError` instead of waiting for the global timeout);
* pushes the results back into OpenCTI:
  * a `Malware-Analysis` SDO (submission id, profile, verdict, score) that surfaces in the Malware Analysis section of the enriched observable;
  * STIX `Indicator` objects (and optional matching `Observable` objects linked via `based-on` relationships) for every malicious IOC AssemblyLine extracted (domains, IPs, URLs), labelled `malicious` with `x_opencti_score=80`. When `ASSEMBLYLINE_INCLUDE_SUSPICIOUS=true`, suspicious-only IOCs are emitted as separate `Indicator` SDOs labelled `suspicious` with `x_opencti_score=50` so OpenCTI keeps the two classifications distinct (the original implementation labelled everything `malicious` regardless of AssemblyLine's tag);
  * `Malware` SDOs for every malware family attributed by AssemblyLine;
  * `Attack-Pattern` SDOs for every MITRE ATT&CK technique observed at runtime, linked to the generated indicators with `related-to` relationships;
  * a `Note` summarising the verdict and per-category indicator counts (the per-bucket numbers come from the connector's own bookkeeping in `_create_indicators`, not the raw IOC-extraction list lengths, so the counts always match what was actually written to OpenCTI even when the per-bucket 20-indicator cap is engaged; the Note also renders a `Suspicious IOCs Created as Indicators` block whenever `ASSEMBLYLINE_INCLUDE_SUSPICIOUS=true` produces any);
  * an `External-Reference` attached to the enriched observable pointing back to the AssemblyLine submission.
All three "create X" behaviours and the "sequential mode" (queue submissions while AssemblyLine is busy, bounded by `ASSEMBLYLINE_TIMEOUT`) are configurable via `ASSEMBLYLINE_CREATE_ATTACK_PATTERNS` / `ASSEMBLYLINE_CREATE_MALWARE_ANALYSIS` / `ASSEMBLYLINE_CREATE_OBSERVABLES` / `ASSEMBLYLINE_SEQUENTIAL_MODE` so operators can tune the noise level on the OpenCTI side.
The connector also enforces a TLP gate (`ASSEMBLYLINE_MAX_TLP`, default `TLP:AMBER`) before any file content leaves OpenCTI for the external AssemblyLine deployment. Every derived SDO, SCO, Indicator and SRO inherits the source observable's `object_marking_refs`, so a `TLP:AMBER` source produces `TLP:AMBER` analysis observables, indicators, Malware family SDOs, Attack-Pattern SDOs, the Malware-Analysis SDO, the summary Note, and every `related-to` / `based-on` relationship between them, never silently downgraded to `TLP:CLEAR`.
Source
This is a re-submission of #5199 (which became unmergeable after the head branch was accidentally promoted to master and then rolled back). The implementation is the one from @V1D1AN's standalone connector repo at https://github.com/V1D1AN/connector-assemblyline, adapted to the OpenCTI monorepo conventions and with all the review findings from #5199 fixed in this branch.
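As a rough illustration of the TLP to AssemblyLine classification mapping summarised under Proposed changes, the sketch below uses the pairs listed in this PR (including the `TLP:WHITE` and `TLP:AMBER+STRICT` aliases from the Security notes). The function name and signature are hypothetical, and picking the most restrictive marking when several are present is an assumption of this sketch, not documented connector behaviour.

```python
from typing import Iterable

# Pairs taken from the PR description.
TLP_TO_AL = {
    "TLP:CLEAR": "TLP:C",
    "TLP:WHITE": "TLP:C",
    "TLP:GREEN": "TLP:G",
    "TLP:AMBER": "TLP:A",
    "TLP:AMBER+STRICT": "TLP:A",
    "TLP:RED": "TLP:R",
}

# Least to most restrictive AssemblyLine classification.
_RESTRICTIVENESS = ["TLP:C", "TLP:G", "TLP:A", "TLP:R"]


def resolve_submission_classification(
    marking_definitions: Iterable[str],
    default_classification: str,
) -> str:
    """Map source TLP markings to an AL classification, else use the default."""
    mapped = [
        TLP_TO_AL[name.upper()]
        for name in marking_definitions
        if name.upper() in TLP_TO_AL
    ]
    if not mapped:
        # No TLP on the source: fall back to the configured default
        # (ASSEMBLYLINE_CLASSIFICATION in the connector).
        return default_classification
    # Assumption: with several TLP markings, keep the most restrictive one.
    return max(mapped, key=_RESTRICTIVENESS.index)


print(resolve_submission_classification(["TLP:AMBER"], "TLP:C"))  # TLP:A
print(resolve_submission_classification([], "TLP:C"))             # TLP:C
```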
Monorepo adaptations applied on top of the upstream code
* […] `internal-enrichment/assemblyline/` with the standard layout (`Dockerfile`, `entrypoint.sh`, `docker-compose.yml`, `README.md`, `.dockerignore`, `__metadata__/connector_manifest.json`, `__metadata__/connector_config_schema.json`, `__metadata__/logo.png`, `src/{main.py, requirements.txt, config.yml.sample, __init__.py}`, `tests/{test_connector.py, test-requirements.txt, __init__.py}`).
* `Dockerfile` rebased on `python:3.12-alpine` matching the lean `internal-enrichment/joe-sandbox` pattern: `libmagic` + `libffi-dev` for the runtime, `git` + `build-base` only for `pip install` (removed afterwards). `assemblyline-client==4.9.9` does not transitively pull `lxml`, so the previously-installed `libxml2-dev` / `libxslt-dev` headers (never cleaned up) are dropped.
* `docker-compose.yml` rewritten with `image: opencti/connector-assemblyline:rolling`, inline `CONNECTOR_*` / `ASSEMBLYLINE_*` environment variables with sensible defaults.
* `src/requirements.txt` pinned to `pycti==7.260521.0` (aligned with the rest of the monorepo's current pin), `assemblyline-client==4.9.9`, `stix2==3.0.1`, `requests~=2.32`, `PyYAML>=6.0,<7`. Removed the bogus `uuid>=1.30` PyPI package (the stdlib `uuid` module ships with Python and the third-party package is unmaintained).
* `connector_manifest.json` fixed: `slug` lowercased to `assemblyline`, `description` accurately reflects what the connector actually emits (domains, IPs, URLs, Malware SDOs, ATT&CK Attack Patterns; no spurious "file hashes" claim), `logo` path adjusted to `internal-enrichment/assemblyline/__metadata__/logo.png`, `container_image` set to `opencti/connector-assemblyline`, `source_code` pointed at the monorepo path.
* `tests/test-requirements.txt` added so the GitHub Actions test runner picks the connector up.
Review findings from #5199 addressed in this branch
Correctness
* […] `ASSEMBLYLINE_VERIFY_SSL` so `requests` does not treat the string `"false"` / `"true"` as a CA-bundle path.
* […] `ASSEMBLYLINE_TIMEOUT` so the worker can never block forever when AssemblyLine stays busy. `ASSEMBLYLINE_POLL_INTERVAL` is clamped to `max(1, int(...))` so a stale `config.yml` or env var cannot turn the wait into a busy loop.
* […] `defense-evasion`, ...) so OpenCTI can match them against the official kill-chain entries.
* […] `related-to`, matching what the README documents.
* […] `_create_relationships` pass so observables are not created twice and `ASSEMBLYLINE_CREATE_OBSERVABLES` is the single source of truth.
* […] `ipv6-addr` / `IPv6-Addr` (they were modelled as `ipv4-addr`).
* […] `"trojan"` malware label so families keep their actual classification.
* […] `importFiles` in addition to `x_opencti_files`, matching the other sandbox connectors.
* […] `/storage/get/<id>` URL to `helper.api.fetch_opencti_file`, and do the same for the `importFiles` path so it inherits pycti's HTTP session config (timeouts, retries, CA bundles, proxy / SSL) instead of using a raw `requests.get` with a manually-set `Authorization` header.
* […] `failed` / `error` / `cancelled`) now surface immediately through a dedicated `AssemblyLineTerminalError` exception that the polling loop re-raises in front of its broad transient-error catcher; the previous code masked terminal failures and let the enrichment run until the global timeout.
* `_extract_malicious_iocs` returns only IOCs AssemblyLine tagged `malicious`; a new `_extract_suspicious_iocs` returns the suspicious subset (when `ASSEMBLYLINE_INCLUDE_SUSPICIOUS=true`). Downstream: the `malicious` label on the source observable, the high `x_opencti_score`, and the `malware-analysis.result=malicious` upgrade only fire on actually-malicious IOCs. Suspicious-only verdicts produce a `suspicious` label, `x_opencti_score=50`, and indicators labelled `suspicious` with the same score. The summary Note now renders a `Suspicious IOCs Created as Indicators` section whenever the suspicious bucket is non-empty so the user-facing summary cannot contradict what the connector actually sent.
* `_parse_al_timestamp` parses positive AND negative ISO-8601 offsets correctly. The earlier `split('+', 1)[0]` discarded positive offsets entirely and silently fell back to the caller's default for negative ones. Switched to `datetime.fromisoformat` after normalising the `Z` suffix, then `astimezone(UTC).replace(tzinfo=None)` so the rest of the connector keeps using a single naive-UTC representation.
* […] `OPENCTI_URL`, `ASSEMBLYLINE_URL`) at assignment time so `requests` >= 2.34 doesn't produce double-slash paths; this also satisfies the repo-wide `tests/test_url_construction.py` guard.
* `_check_existing_analysis` searches AssemblyLine submissions via `files.sha256:<hash>`; `_process_file` and `_get_artifact_content` now pick the SHA-256 explicitly via `_select_sha256(observable["hashes"])` instead of any-hash, so MD5 / SHA-1 inputs no longer generate audit-log noise hitting an unmatched Lucene query and cannot accidentally hit an unrelated submission.
* `_get_assemblyline_identity` swallows errors and stores `None`; `sco_author_properties` now omits `x_opencti_created_by_ref` and `stix2.MalwareAnalysis(...)` now omits `created_by_ref` when the identity is unavailable, so the rest of the enrichment still lands.
* `_create_indicators` caps creation at 20 indicators per IOC bucket (domains / IPs / URLs, for both the malicious and the suspicious classifications); the Note's "Malicious IOCs Created as Indicators" / "Suspicious IOCs Created as Indicators" sections now read from per-category counters in the `counts` dict (`malicious_domains` / `malicious_ips` / `malicious_urls` / `suspicious_domains` / `suspicious_ips` / `suspicious_urls`) instead of `len(malicious_iocs[...])` / `len(suspicious_iocs[...])`. Counters are incremented only on a successful `indicator.create`, so transient OpenCTI failures cannot inflate them either. Result: on a 50-domain analysis the Note now shows `**Malicious Domains:** 20` (matching what was actually written to OpenCTI and the run's success-message) instead of `**Malicious Domains:** 50`.
Security
* TLP gate (`ASSEMBLYLINE_MAX_TLP`, default `TLP:AMBER`) before any file content leaves OpenCTI for the external AssemblyLine deployment.
* […] `TLP:CLEAR` / `TLP:WHITE` → `TLP:C`, `TLP:GREEN` → `TLP:G`, `TLP:AMBER` / `TLP:AMBER+STRICT` → `TLP:A`, `TLP:RED` → `TLP:R`, falling back to `ASSEMBLYLINE_CLASSIFICATION` when the source has no TLP marking. A `TLP:AMBER` file passing the max-TLP gate is no longer silently downgraded to the connector default (`TLP:C`) once it leaves OpenCTI.
* […] `entity_type` + `id` to avoid leaking `payload_bin` or other sensitive observable fields.
* […] `object_marking_refs` (every derived SDO / SCO / Indicator / Malware family / Malware-Analysis / Attack-Pattern / summary Note, and every `related-to` / `based-on` SRO between them) instead of being hard-coded to `TLP:WHITE` or left unmarked. A file that passes the `ASSEMBLYLINE_MAX_TLP` gate with `TLP:AMBER` now produces `TLP:AMBER` analysis objects throughout, never downgraded. The `_source_marking_refs` helper reads `observable["objectMarking"][*]["standard_id"]` and falls back to OpenCTI's custom `TLP:CLEAR` marking only when the source carries no marking at all.
* […] `yaml.safe_load` inside a `with open(...)` context manager rather than `yaml.load(open(...), Loader=FullLoader)`: `FullLoader` could instantiate arbitrary Python objects through YAML tags from a tampered `config.yml`, and the bare `open()` leaked the file handle on parse errors.
UX
* […] `External-Reference` to the enriched observable, not just to the Malware-Analysis SDO.
* […] `max_tlp` knob in `docker-compose.yml`, `config.yml.sample` and the README.
* […] `requests.post` to `/api/v4/submit/`, with `assemblyline-client` used for polling / summary retrieval / file fetch), and the TLP propagation contract is spelt out.
Tests
* […] `CREATE_OBSERVABLES` flag, malware-family label, file-fetch URL.
* `TestSourceMarkingRefs`, `TestMalwareAnalysisPropagatesSourceMarkings`, `TestApiIndicatorAndObservableInheritSourceMarkings`, `TestSummaryNoteInheritsSourceMarking`, `TestMalwareFamilyInheritsSourceMarkings` and `TestAttackPatternInheritsSourceMarkings` pin the marking-inheritance contract end-to-end across the full derived sub-graph (SCOs, Indicators, Observables, Malware family SDOs, Malware-Analysis, Attack-Patterns, summary Note, and the `related-to` / `based-on` SROs between them).
* `TestTerminalAssemblyLineStates::test_process_file_raises_terminal_error_on_terminal_state` pins the real `_process_file` polling-loop contract end-to-end: stubs `requests.post` (200 + valid `sid`) and `al_client.submission.full` to return `{"state": <terminal>}` on the first poll, asserts `AssemblyLineTerminalError` is raised, the submission id appears in the message, and `submission.full` ran exactly once (no retry, no sleep-until-timeout fallback). Parameterised across all three terminal states (`failed` / `error` / `cancelled`).
* `TestMaliciousIOCExtraction` extended to pin the suspicious / malicious split.
* `TestIocClassificationLabelsAndScores` pins the per-IOC label + score plumbing.
* `TestResolveSubmissionClassification` pins the TLP → AL classification mapping.
* `TestParseAlTimestamp` extended to pin positive AND negative ISO-8601 offset parsing.
* `TestConfigYamlSafeLoad` pins the `safe_load` contract against `!!python/object/apply` tags.
* `TestUnpinnedFileFetch` extended to pin `_download_import_file` routing through `helper.api.fetch_opencti_file`.
* `TestMissingIdentityDoesNotEmitNoneAuthor` pins both the `sco_author_properties` and `stix2.MalwareAnalysis(...)` paths when `_get_assemblyline_identity` returned `None`: the bundle is still sent and the derived SCOs / Malware-Analysis serialise without a `null` author.
* `TestProcessFileShaDedup` pins the SHA-256-only dedup contract (called once for a SHA-256, never for an MD5 / SHA-1-only observable).
* `TestSummaryNoteSuspiciousSection` pins that the Note's `Suspicious IOCs` block is rendered with correct counts when present and omitted otherwise.
* `TestCreateIndicatorsTracksPerCategoryCreatedCounts` pins the per-category created-indicator counters on the `counts` dict (capped at 20 per bucket; only incremented on a successful `indicator.create`; tracked separately for the malicious and suspicious buckets).
* `TestSummaryNoteUsesCreatedCountsNotExtractedLengths` pins that the Note's malicious / suspicious "Created as Indicators" sections report `counts.get('<classification>_<kind>', 0)` rather than `len(...)` so the displayed numbers match what the connector actually wrote to OpenCTI.
Related issues
Checklist
Further comments
The connector is published as a new entry under `internal-enrichment/assemblyline/`; no existing connector is modified. The author is recorded as the AssemblyLine identity so generated SDOs are clearly attributable in the OpenCTI UI.
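For reviewers who want to see the timestamp fix from the Correctness list in isolation, the sequence it describes (normalise the `Z` suffix, call `datetime.fromisoformat`, convert to UTC, drop the tzinfo) can be sketched as below. The function name and the `default` parameter are hypothetical, and treating an offset-less input as already UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_al_timestamp(
    value: Optional[str],
    default: Optional[datetime] = None,
) -> Optional[datetime]:
    """Parse an ISO-8601 timestamp into a naive UTC datetime."""
    if not value:
        return default
    normalised = value.strip()
    # Older Python versions reject a trailing "Z", so spell the offset out.
    if normalised.endswith("Z"):
        normalised = normalised[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(normalised)
    except ValueError:
        return default
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is already UTC.
        return parsed
    # Works for positive and negative offsets alike.
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


print(parse_al_timestamp("2024-05-01T12:00:00+02:00"))  # 2024-05-01 10:00:00
print(parse_al_timestamp("2024-05-01T12:00:00-05:00"))  # 2024-05-01 17:00:00
```

Splitting on `+` cannot handle a negative offset because the `-` also appears in the date part, which is why delegating to `fromisoformat` is the simpler route.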