
Commit `1003ce1` (parent `a2949af`)

# Remove Starter Digital Fingerprinting (DFP) (nv-morpheus#1903)

- Remove all references to the Starter DFP in docs
- Remove classes which only exist for the Starter DFP
- Remove tests and associated test data for the Starter DFP
- Remove the Starter DFP from the CLI

Closes nv-morpheus#1715
Closes nv-morpheus#1713
Closes nv-morpheus#1641

## By Submitting this PR I confirm:

- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
- Eli Fajardo (https://github.com/efajardo-nv)

Approvers:
- David Gardner (https://github.com/dagardner-nv)

URL: nv-morpheus#1903

**40 files changed** (+33 −2683 lines)

### ci/release/update-version.sh (−1)

```diff
@@ -91,7 +91,6 @@ sed_runner "s/v${CURRENT_FULL_VERSION}-runtime/v${NEXT_FULL_VERSION}-runtime/g"
     examples/digital_fingerprinting/production/docker-compose.yml \
     examples/digital_fingerprinting/production/Dockerfile
 sed_runner "s/v${CURRENT_FULL_VERSION}-runtime/v${NEXT_FULL_VERSION}-runtime/g" examples/digital_fingerprinting/production/Dockerfile
-sed_runner "s|blob/branch-${CURRENT_SHORT_TAG}|blob/branch-${NEXT_SHORT_TAG}|g" examples/digital_fingerprinting/starter/README.md

 # examples/developer_guide
 sed_runner 's/'"VERSION ${CURRENT_FULL_VERSION}.*"'/'"VERSION ${NEXT_FULL_VERSION}"'/g' \
```

### docs/source/basics/overview.rst (+1 −4)

```diff
@@ -27,7 +27,7 @@ The Morpheus CLI is built on the Click Python package which allows for nested co
 together. At a high level, the CLI is broken up into two main sections:

 * ``run``
-   * For running AE, FIL, NLP or OTHER pipelines.
+   * For running FIL, NLP or OTHER pipelines.
 * ``tools``
    * Tools/Utilities to help set up, configure and run pipelines and external resources.

@@ -58,16 +58,13 @@ run:
   --help  Show this message and exit.

 Commands:
-  pipeline-ae     Run the inference pipeline with an AutoEncoder model
   pipeline-fil    Run the inference pipeline with a FIL model
   pipeline-nlp    Run the inference pipeline with a NLP model
   pipeline-other  Run a custom inference pipeline without a specific model type


 Currently, Morpheus pipeline can be operated in four different modes.

-* ``pipeline-ae``
-   * This pipeline mode is used to run training/inference on the AutoEncoder model.
 * ``pipeline-fil``
    * This pipeline mode is used to run inference on FIL (Forest Inference Library) models such as XGBoost, RandomForestClassifier, etc.
 * ``pipeline-nlp``
```
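The hunks above mirror how the CLI is organized: a top-level `run` group with one subgroup per pipeline mode, each exposing chainable stage commands. A minimal sketch of that nesting with the Click package the overview mentions; the group and command names here are illustrative stand-ins, not the actual Morpheus entry points:

```python
import click
from click.testing import CliRunner


@click.group()
def cli():
    """Top-level CLI, standing in for `morpheus`."""


@cli.group()
def run():
    """For running FIL, NLP or OTHER pipelines."""


# chain=True lets several stage subcommands be given in one invocation,
# which is the shape of `morpheus run pipeline-fil ... preprocess inf-triton ...`.
@run.group(chain=True)
def pipeline_fil():
    """Run the inference pipeline with a FIL model."""


@pipeline_fil.command("inf-triton")
def inf_triton():
    click.echo("inference stage added")


# Drive the nested command path the same way the shell would.
result = CliRunner().invoke(cli, ["run", "pipeline-fil", "inf-triton"])
print(result.output)
```

Click derives the command name `pipeline-fil` from the function name automatically, which is also why the same stage registered under different pipeline groups can expose different options, as the getting-started note in this commit points out.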

### docs/source/cloud_deployment_guide.md (+3 −43)

````diff
@@ -32,7 +32,6 @@ limitations under the License.
 - [Verify Model Deployment](#verify-model-deployment)
 - [Create Kafka Topics](#create-kafka-topics)
 - [Example Workflows](#example-workflows)
-  - [Run AutoEncoder Digital Fingerprinting Pipeline](#run-autoencoder-digital-fingerprinting-pipeline)
   - [Run NLP Phishing Detection Pipeline](#run-nlp-phishing-detection-pipeline)
   - [Run NLP Sensitive Information Detection Pipeline](#run-nlp-sensitive-information-detection-pipeline)
   - [Run FIL Anomalous Behavior Profiling Pipeline](#run-fil-anomalous-behavior-profiling-pipeline)
@@ -383,10 +382,9 @@ kubectl -n $NAMESPACE exec deploy/broker -c broker -- kafka-topics.sh \

 This section describes example workflows to run on Morpheus. Four sample pipelines are provided.

-1. AutoEncoder pipeline performing Digital Fingerprinting (DFP).
-2. NLP pipeline performing Phishing Detection (PD).
-3. NLP pipeline performing Sensitive Information Detection (SID).
-4. FIL pipeline performing Anomalous Behavior Profiling (ABP).
+1. NLP pipeline performing Phishing Detection (PD).
+2. NLP pipeline performing Sensitive Information Detection (SID).
+3. FIL pipeline performing Anomalous Behavior Profiling (ABP).

 Multiple command options are given for each pipeline, with varying data input/output methods, ranging from local files to Kafka Topics.

@@ -424,44 +422,6 @@ helm install --set ngc.apiKey="$API_KEY" \
     morpheus-sdk-client
 ```

-
-### Run AutoEncoder Digital Fingerprinting Pipeline
-The following AutoEncoder pipeline example shows how to train and validate the AutoEncoder model and write the inference results to a specified location. Digital fingerprinting has also been referred to as **HAMMAH (Human as Machine <> Machine as Human)**.
-These use cases are currently implemented to detect user behavior changes that indicate a change from a human to a machine or a machine to a human, thus leaving a "digital fingerprint." The model is an ensemble of an autoencoder and fast Fourier transform reconstruction.
-
-Inference and training based on a user ID (`user123`). The model is trained once and inference is conducted on the supplied input entries in the example pipeline below. The `--train_data_glob` parameter must be removed for continuous training.
-
-```bash
-helm install --set ngc.apiKey="$API_KEY" \
-    --set sdk.args="morpheus --log_level=DEBUG run \
-      --edge_buffer_size=4 \
-      --pipeline_batch_size=1024 \
-      --model_max_batch_size=1024 \
-      pipeline-ae \
-      --columns_file=data/columns_ae_cloudtrail.txt \
-      --userid_filter=user123 \
-      --feature_scaler=standard \
-      --userid_column_name=userIdentitysessionContextsessionIssueruserName \
-      --timestamp_column_name=event_dt \
-      from-cloudtrail --input_glob=/common/models/datasets/validation-data/dfp-cloudtrail-*-input.csv \
-      --max_files=200 \
-      train-ae --train_data_glob=/common/models/datasets/training-data/dfp-cloudtrail-*.csv \
-      --source_stage_class=morpheus.stages.input.cloud_trail_source_stage.CloudTrailSourceStage \
-      --seed 42 \
-      preprocess \
-      inf-pytorch \
-      add-scores \
-      timeseries --resolution=1m --zscore_threshold=8.0 --hot_start \
-      monitor --description 'Inference Rate' --smoothing=0.001 --unit inf \
-      serialize \
-      to-file --filename=/common/data/<YOUR_OUTPUT_DIR>/cloudtrail-dfp-detections.csv --overwrite" \
-    --namespace $NAMESPACE \
-    <YOUR_RELEASE_NAME> \
-    morpheus-sdk-client
-```
-
-For more information on the Digital Fingerprint use cases, refer to the starter example and a more production-ready example that can be found in the `examples` source directory.
-
 ### Run NLP Phishing Detection Pipeline

 The following Phishing Detection pipeline examples use a pre-trained NLP model to analyze emails (body) and determine phishing or benign. Here is the sample data as shown below is used to pass as an input to the pipeline.
````

### docs/source/developer_guide/contributing.md (+1 −1)

```diff
@@ -375,7 +375,7 @@ Launching a full production Kafka cluster is outside the scope of this project;

 ### Pipeline Validation

-To verify that all pipelines are working correctly, validation scripts have been added at `${MORPHEUS_ROOT}/scripts/validation`. There are scripts for each of the main workflows: Anomalous Behavior Profiling (ABP), Humans-as-Machines-Machines-as-Humans (HAMMAH), Phishing Detection (Phishing), and Sensitive Information Detection (SID).
+To verify that all pipelines are working correctly, validation scripts have been added at `${MORPHEUS_ROOT}/scripts/validation`. There are scripts for each of the main workflows: Anomalous Behavior Profiling (ABP), Phishing Detection (Phishing), and Sensitive Information Detection (SID).

 To run all of the validation workflow scripts, use the following commands:
```

### docs/source/developer_guide/guides/5_digital_fingerprinting.md (+12 −43)

```diff
@@ -23,7 +23,7 @@ Every account, user, service, and machine has a digital fingerprint that represe
 To construct this digital fingerprint, we will be training unsupervised behavioral models at various granularities, including a generic model for all users in the organization along with fine-grained models for each user to monitor their behavior. These models are continuously updated and retrained over time​, and alerts are triggered when deviations from normality occur for any user​.

 ## Training Sources
-The data we will want to use for the training and inference will be any sensitive system that the user interacts with, such as VPN, authentication and cloud services. The digital fingerprinting example (`examples/digital_fingerprinting/README.md`) included in Morpheus ingests logs from [AWS CloudTrail](https://docs.aws.amazon.com/cloudtrail/index.html), [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/reports-monitoring/concept-sign-ins), and [Duo Authentication](https://duo.com/docs/adminapi).
+The data we will want to use for the training and inference will be any sensitive system that the user interacts with, such as VPN, authentication and cloud services. The digital fingerprinting example (`examples/digital_fingerprinting/README.md`) included in Morpheus ingests logs from [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/reports-monitoring/concept-sign-ins), and [Duo Authentication](https://duo.com/docs/adminapi).

 The location of these logs could be either local to the machine running Morpheus, a shared file system like NFS, or on a remote store such as [Amazon S3](https://aws.amazon.com/s3/).
@@ -44,54 +44,23 @@ Adding a new source for the DFP pipeline requires defining five critical pieces:
 1. A [`DataFrameInputSchema`](6_digital_fingerprinting_reference.md#dataframe-input-schema-dataframeinputschema) for the [`DFPFileToDataFrameStage`](6_digital_fingerprinting_reference.md#file-to-dataframe-stage-dfpfiletodataframestage) stage.
 1. A [`DataFrameInputSchema`](6_digital_fingerprinting_reference.md#dataframe-input-schema-dataframeinputschema) for the [`DFPPreprocessingStage`](6_digital_fingerprinting_reference.md#preprocessing-stage-dfppreprocessingstage).

-## DFP Examples
-The DFP workflow is provided as two separate examples: a simple, "starter" pipeline for new users and a complex, "production" pipeline for full scale deployments. While these two examples both perform the same general tasks, they do so in very different ways. The following is a breakdown of the differences between the two examples.
-
-### The "Starter" Example
-
-This example is designed to simplify the number of stages and components and provide a fully contained workflow in a single pipeline.
-
-Key Differences:
-* A single pipeline which performs both training and inference
-* Requires no external services
-* Can be run from the Morpheus CLI
-
-This example is described in more detail in `examples/digital_fingerprinting/starter/README.md`.
-
-### The "Production" Example
+## Production Deployment Example

 This example is designed to illustrate a full-scale, production-ready, DFP deployment in Morpheus. It contains all of the necessary components (such as a model store), to allow multiple Morpheus pipelines to communicate at a scale that can handle the workload of an entire company.

-Key Differences:
+Key Features:
 * Multiple pipelines are specialized to perform either training or inference
-* Requires setting up a model store to allow the training and inference pipelines to communicate
+* Uses a model store to allow the training and inference pipelines to communicate
 * Organized into a docker-compose deployment for easy startup
 * Contains a Jupyter notebook service to ease development and debugging
 * Can be deployed to Kubernetes using provided Helm charts
 * Uses many customized stages to maximize performance.

 This example is described in `examples/digital_fingerprinting/production/README.md` as well as the rest of this document.

-### DFP Features
+## DFP Features

-#### AWS CloudTrail
-| Feature | Description |
-| ------- | ----------- |
-| `userIdentityaccessKeyId` | for example, `ACPOSBUM5JG5BOW7B2TR`, `ABTHWOIIC0L5POZJM2FF`, `AYI2CM8JC3NCFM4VMMB4` |
-| `userAgent` | for example, `Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 10.0; Trident/5.1)`, `Mozilla/5.0 (Linux; Android 4.3.1) AppleWebKit/536.1 (KHTML, like Gecko) Chrome/62.0.822.0 Safari/536.1`, `Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10 7_0; rv:1.9.4.20) Gecko/2012-06-10 12:09:43 Firefox/3.8` |
-| `userIdentitysessionContextsessionIssueruserName` | for example, `role-g` |
-| `sourceIPAddress` | for example, `208.49.113.40`, `123.79.131.26`, `128.170.173.123` |
-| `userIdentityaccountId` | for example, `Account-123456789` |
-| `errorMessage` | for example, `The input fails to satisfy the constraints specified by an AWS service.`, `The specified subnet cannot be found in the VPN with which the Client VPN endpoint is associated.`, `Your account is currently blocked. Contact [email protected] if you have questions.` |
-| `userIdentitytype` | for example, `FederatedUser` |
-| `eventName` | for example, `GetSendQuota`, `ListTagsForResource`, `DescribeManagedPrefixLists` |
-| `userIdentityprincipalId` | for example, `39c71b3a-ad54-4c28-916b-3da010b92564`, `0baf594e-28c1-46cf-b261-f60b4c4790d1`, `7f8a985f-df3b-4c5c-92c0-e8bffd68abbf` |
-| `errorCode` | for example, success, `MissingAction`, `ValidationError` |
-| `eventSource` | for example, `lopez-byrd.info`, `robinson.com`, `lin.com` |
-| `userIdentityarn` | for example, `arn:aws:4a40df8e-c56a-4e6c-acff-f24eebbc4512`, `arn:aws:573fd2d9-4345-487a-9673-87de888e4e10`, `arn:aws:c8c23266-13bb-4d89-bce9-a6eef8989214` |
-| `apiVersion` | for example, `1984-11-26`, `1990-05-27`, `2001-06-09` |
-
-#### Azure Active Directory
+### Azure Active Directory
 | Feature | Description |
 | ------- | ----------- |
 | `appDisplayName` | for example, `Windows sign in`, `MS Teams`, `Office 365`|
@@ -104,14 +73,14 @@ This example is described in `examples/digital_fingerprinting/production/README.
 | `location.countryOrRegion` | country or region name​ |
 | `location.city` | city name |

-##### Derived Features
+#### Derived Features
 | Feature | Description |
 | ------- | ----------- |
 | `logcount` | tracks the number of logs generated by a user within that day (increments with every log)​ |
 | `locincrement` | increments every time we observe a new city (`location.city`) in a user's logs within that day​ |
 | `appincrement` | increments every time we observe a new app (`appDisplayName`) in a user's logs within that day​ |

-#### Duo Authentication
+### Duo Authentication
 | Feature | Description |
 | ------- | ----------- |
 | `auth_device.name` | phone number​ |
@@ -121,7 +90,7 @@ This example is described in `examples/digital_fingerprinting/production/README.
 | `reason` | reason for the results, for example, `User Cancelled`, `User Approved`, `User Mistake`, `No Response`|
 | `access_device.location.city` | city name |

-##### Derived Features
+#### Derived Features
 | Feature | Description |
 | ------- | ----------- |
 | `logcount` | tracks the number of logs generated by a user within that day (increments with every log)​ |
@@ -133,16 +102,16 @@ DFP in Morpheus is accomplished via two independent pipelines: training and infe

 ![High Level Architecture](img/dfp_high_level_arch.png)

-#### Training Pipeline
+### Training Pipeline
 * Trains user models and uploads to the model store​
 * Capable of training individual user models or a fallback generic model for all users​

-#### Inference Pipeline
+### Inference Pipeline
 * Downloads user models from the model store​
 * Generates anomaly scores per log​
 * Sends detected anomalies to monitoring services

-#### Monitoring
+### Monitoring
 * Detected anomalies are published to an S3 bucket, directory or a Kafka topic.
 * Output can be integrated with a monitoring tool.
```
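The Azure AD and Duo hunks in this guide both keep the same per-user, per-day running counters (`logcount`, `locincrement`, `appincrement`). A minimal pandas sketch of that logic, using invented sample rows and an assumed `userPrincipalName` user column alongside the documented Azure AD fields; the production DFP stages compute these features in their own preprocessing code, not with this snippet:

```python
import pandas as pd

# Invented sample of Azure AD sign-in logs for a single user.
logs = pd.DataFrame({
    "userPrincipalName": ["alice"] * 4,
    "time": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 10:00",
        "2024-01-01 11:00", "2024-01-02 09:00",
    ]),
    "location.city": ["Paris", "Paris", "London", "Paris"],
    "appDisplayName": ["Office 365", "MS Teams", "MS Teams", "Office 365"],
}).sort_values("time")

day = logs["time"].dt.date
grouped = logs.groupby(["userPrincipalName", day])

# logcount: running number of logs for that user within the day.
logs["logcount"] = grouped.cumcount() + 1


def running_nunique(values: pd.Series) -> pd.Series:
    """Count of distinct values seen so far within the group."""
    seen = set()
    counts = []
    for v in values:
        seen.add(v)
        counts.append(len(seen))
    return pd.Series(counts, index=values.index)


# locincrement / appincrement: increment only when a new city/app appears.
logs["locincrement"] = grouped["location.city"].transform(running_nunique)
logs["appincrement"] = grouped["appDisplayName"].transform(running_nunique)

print(logs[["time", "logcount", "locincrement", "appincrement"]])
```

Note the day boundary: the counters are scoped to a (user, calendar day) pair, so on the second day every counter starts over, matching the "within that day" wording in the feature tables.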

### docs/source/extra_info/known_issues.md (−1)

```diff
@@ -17,7 +17,6 @@ limitations under the License.

 # Known Issues

-- TrainAEStage fails with a Segmentation fault ([#1641](https://github.com/nv-morpheus/Morpheus/issues/1641))
 - `vdb_upload` example pipeline triggers an internal error in Triton ([#1649](https://github.com/nv-morpheus/Morpheus/issues/1649))

 Refer to [open issues in the Morpheus project](https://github.com/nv-morpheus/Morpheus/issues)
```

### docs/source/getting_started.md (−30)

````diff
@@ -375,36 +375,6 @@ Commands:
   trigger    Buffer data until the previous stage has completed.
   validate   Validate pipeline output for testing.
 ```
-
-And for the AE pipeline:
-
-```
-$ morpheus run pipeline-ae --help
-Usage: morpheus run pipeline-ae [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...
-
-<Help Paragraph Omitted>
-
-Commands:
-  add-class        Add detected classifications to each message.
-  add-scores       Add probability scores to each message.
-  buffer           (Deprecated) Buffer results.
-  delay            (Deprecated) Delay results for a certain duration.
-  filter           Filter message by a classification threshold.
-  from-azure       Source stage is used to load Azure Active Directory messages.
-  from-cloudtrail  Load messages from a CloudTrail directory.
-  from-duo         Source stage is used to load Duo Authentication messages.
-  inf-pytorch      Perform inference with PyTorch.
-  inf-triton       Perform inference with Triton Inference Server.
-  monitor          Display throughput numbers at a specific point in the pipeline.
-  preprocess       Prepare Autoencoder input DataFrames for inference.
-  serialize        Includes & excludes columns from messages.
-  timeseries       Perform time series anomaly detection and add prediction.
-  to-file          Write all messages to a file.
-  to-kafka         Write all messages to a Kafka cluster.
-  train-ae         Train an Autoencoder model on incoming data.
-  trigger          Buffer data until the previous stage has completed.
-  validate         Validate pipeline output for testing.
-```
 Note: The available commands for different types of pipelines are not the same. This means that the same stage, when used in different pipelines, may have different options. Check the CLI help for the most up-to-date information during development.

 ## Next Steps
````
