# Machine Learning on Domain Generation Algorithm (DGA)

Several blog posts describe how to create and use supervised DGA ML models to enrich data within the stack.
* Part 1: [Machine learning in cybersecurity: Training supervised models to detect DGA activity](https://www.elastic.co/blog/machine-learning-in-cybersecurity-training-supervised-models-to-detect-dga-activity)
* Part 2: [Machine learning in cybersecurity: Detecting DGA activity in network data](https://www.elastic.co/blog/machine-learning-in-cybersecurity-detecting-dga-activity-in-network-data)

You can also find supplementary material and examples [here](https://github.com/elastic/examples/tree/master/Machine%20Learning/DGA%20Detection)

For questions, please reach out to the ML team in the #machine-learning channel of the
[Elastic public Slack workspace](https://www.elastic.co/blog/join-our-elastic-stack-workspace-on-slack)

They can also be reached by using the `stack-machine-learning` tag in the [discuss forums](https://discuss.elastic.co/tags/c/elastic-stack/stack-machine-learning)

*Note: in order to use these ML features, you must have a platinum or higher [subscription](https://www.elastic.co/subscriptions)*
*Note: the ML features are considered experimental in Kibana as well as this rules CLI*

## Releases

Models and dependencies will be [released](https://github.com/elastic/detection-rules/releases) as `ML-DGA-YYYYMMDD-N`.
This tag name is what you pass to the CLI command.

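Because the tag is passed verbatim to the CLI, it can help to sanity-check it first. The following is a minimal sketch, assuming the `ML-DGA-YYYYMMDD-N` layout described in this document; `parse_model_tag` is a hypothetical helper and not part of `detection_rules`.

```python
import re
from datetime import datetime

# Assumed tag layout: ML-DGA-<8-digit date>-<release counter>
MODEL_TAG_RE = re.compile(r"ML-DGA-([0-9]{8})-([0-9]+)")


def parse_model_tag(tag):
    """Return (release_date, counter) for a well-formed tag, else None."""
    match = MODEL_TAG_RE.fullmatch(tag)
    if match is None:
        return None
    try:
        release_date = datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date
        return None
    return release_date, int(match.group(2))
```

A `None` result suggests the tag is mistyped before any download is attempted.
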
## Uploading a model and dependencies using the CLI

### Usage

```console
python -m detection_rules es experimental setup-dga-model -h

Elasticsearch client:
Options:
  -et, --timeout INTEGER    Timeout for elasticsearch client
  -ep, --es-password TEXT
  -eu, --es-user TEXT
  --cloud-id TEXT
  --elasticsearch-url TEXT


* experimental commands are use at your own risk and may change without warning *

Usage: detection_rules es experimental setup-dga-model [OPTIONS]

  Upload ML DGA model and dependencies and enrich DNS data.

Options:
  -t, --model-tag TEXT       Release tag for model files staged in detection-
                             rules (required to download files)
  -r, --repo TEXT            GitHub repository hosting the model file releases
                             (owner/repo)
  -d, --model-dir DIRECTORY  Directory containing local model files
  --overwrite                Overwrite all files if already in the stack
  -h, --help                 Show this message and exit.

```

### Detailed steps

#### 1. Upload and setup the model file and dependencies

Run `python -m detection_rules es <args_or_config> experimental setup-dga-model -t <release-tag>`

*If updating to a new model, you should first uninstall any existing models using `remove-dga-model`*

You can also upload local files using the `-d` option, so long as the file names match the
expected filename patterns.

#### 2. Update packetbeat configuration

You will need to update your `packetbeat.yml` config file to point to the enrichment pipeline.

Under `Elasticsearch Output` add the following:

```yaml
output.elasticsearch:
  hosts: ["your-hostname:your-port"]
  pipeline: dns_enrich_pipeline
```

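Before pointing packetbeat at the pipeline, it may be worth confirming that it is installed and runs. One option is the Elasticsearch simulate pipeline API (`POST /_ingest/pipeline/<id>/_simulate`). The sketch below only builds the request path and body; sending it is left to whichever client you use. `build_simulate_request` is a hypothetical helper, and the single-field sample document is an assumption: the pipeline may expect additional fields.

```python
import json


def build_simulate_request(domain, pipeline="dns_enrich_pipeline"):
    """Build the path and JSON body for a simulate-pipeline call."""
    path = f"/_ingest/pipeline/{pipeline}/_simulate"
    body = {
        "docs": [
            # Minimal sample document carrying only the DNS field (assumption)
            {"_source": {"dns": {"question": {"registered_domain": domain}}}}
        ]
    }
    return path, json.dumps(body)
```
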
#### 3. Refresh your packetbeat index

You can optionally choose to refresh your packetbeat index mapping within Kibana:
* navigate to `Stack Management > (Kibana) Index Patterns`
* select the applicable packetbeat index
* click `refresh field list`

#### 4. Verify enrichment fields

Any packetbeat documents with the field `dns.question.registered_domain` should now contain the enrichment fields under
`ml_is_dga.*`.

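This check can also be scripted against a document's `_source`. The following is a minimal sketch; `has_dga_enrichment` is a hypothetical helper, and since the individual subfields under `ml_is_dga.*` are not listed here, it only tests that the object (or any flattened `ml_is_dga.` key) is present.

```python
def has_dga_enrichment(source):
    """Check a document's _source for the DNS field and the ml_is_dga object."""

    def lookup(doc, dotted):
        # Accept a flattened dotted key first, then walk nested objects
        if dotted in doc:
            return doc[dotted]
        current = doc
        for part in dotted.split("."):
            if not isinstance(current, dict) or part not in current:
                return None
            current = current[part]
        return current

    has_domain = lookup(source, "dns.question.registered_domain") is not None
    has_enrichment = any(
        key == "ml_is_dga" or key.startswith("ml_is_dga.") for key in source
    )
    return has_domain and has_enrichment
```
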
## Experimental DGA ML Jobs and Rules

Once packetbeat data is being enriched, there are some rules and ML jobs which can leverage the enriched fields.
The experimental rules and jobs will be staged separately from the model bundle under the [releases](https://github.com/elastic/detection-rules/releases)
as `ML-experimental-detections-YYYYMMDD-N`.

Note that if a rule is of `type = "machine_learning"`, then it may depend on uploading and running a machine
learning job first. If this is the case, it will likely be annotated within the `note` field of the rule.

#### Uploading rules

You can then individually upload these rules using the [kibana upload-rule](../CLI.md#uploading-rules-to-kibana) command.

#### Uploading ML Jobs

Unzip released jobs and then run `python -m detection_rules es <args> experimental upload-ml-job <ml_job.json>`

To delete a job, run `python -m detection_rules es <args> experimental delete-ml-job <job-name> <job-type>`

Take note of any errors, as some jobs may depend on each other, which may require stopping and/or removing
referenced jobs first.

## For Maintainers

### Validating release bundles and releasing

Release assets are expected to be in certain formats with specific naming patterns and JSON structures.

#### Filename patterns

DGA model file names should match the following patterns:

```json
{
  "model": "dga_*_model.json",
  "dga_ngrams_create": "dga_*_ngrams_create.json",
  "dga_ngrams_transform_delete": "dga_*_ngrams_transform_delete.json",
  "dns_enrich_pipeline": "dga_*_ingest_pipeline1.json",
  "dns_dga_inference_enrich_pipeline": "dga_*_ingest_pipeline2.json"
}
```

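A quick pre-check of a bundle against these patterns can be sketched with shell-style globbing. This approximates the check rather than reproducing the CLI's own validation, and `missing_model_files` is a hypothetical helper.

```python
from fnmatch import fnmatch

# Patterns copied from the JSON mapping above
EXPECTED_PATTERNS = {
    "model": "dga_*_model.json",
    "dga_ngrams_create": "dga_*_ngrams_create.json",
    "dga_ngrams_transform_delete": "dga_*_ngrams_transform_delete.json",
    "dns_enrich_pipeline": "dga_*_ingest_pipeline1.json",
    "dns_dga_inference_enrich_pipeline": "dga_*_ingest_pipeline2.json",
}


def missing_model_files(filenames):
    """Return the sorted keys whose pattern matches none of the file names."""
    names = list(filenames)
    return sorted(
        key
        for key, pattern in EXPECTED_PATTERNS.items()
        if not any(fnmatch(name, pattern) for name in names)
    )
```

For a local directory, `missing_model_files(os.listdir(model_dir))` returns an empty list when every expected file is present.
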
Experimental detections do not have to match a specific naming pattern but should be in the following file formats:
* rules: toml
* jobs: json

#### Uniqueness

The model file name and hash should be unique, or else validation will raise a warning. This makes it possible to
distinguish bundles and to look up information about a bundle in the manifest by its unique name.

The zipped release asset, the release name, and the tag name all share the same name, which should use the following format:
* Model releases: `ML-DGA-YYYYMMDD-1`
* Detection releases: `ML-experimental-detections-YYYYMMDD-1`

The trailing digit should be incremented for each release.

Rule and job names should also be unique.

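Picking the next tag can be sketched as follows. This assumes the trailing counter restarts at 1 for each date, which this document does not state explicitly; `next_release_tag` is a hypothetical helper.

```python
import re


def next_release_tag(existing_tags, prefix, date_str):
    """Return the next unused tag for the given prefix and YYYYMMDD date."""
    pattern = re.compile(rf"{re.escape(prefix)}-{re.escape(date_str)}-([0-9]+)")
    # Collect counters already used for this prefix and date
    used = [
        int(match.group(1))
        for match in (pattern.fullmatch(tag) for tag in existing_tags)
        if match
    ]
    return f"{prefix}-{date_str}-{max(used, default=0) + 1}"
```
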
#### Rule and job structure

Rule files are only checked for being valid TOML, nothing more. Consult existing production rules and schemas for API
expectations.

Job files are checked for being valid JSON and for containing the following top-level fields:
* name - job name
* type - job type
* body - the actual ML job data. The contents are not validated

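The job check can be approximated as below. This is a minimal sketch, assuming job files are JSON as listed under the file formats earlier; `missing_job_fields` is a hypothetical helper and not the CLI's own validator.

```python
import json

REQUIRED_JOB_FIELDS = ("name", "type", "body")


def missing_job_fields(job_text):
    """Parse a job file's text as JSON and list absent top-level fields."""
    job = json.loads(job_text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(job, dict):
        raise ValueError("job file must contain a JSON object")
    # The contents of "body" are deliberately not inspected
    return [field for field in REQUIRED_JOB_FIELDS if field not in job]
```
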
#### Validation

All of these checks are automated and can be run with:
* `python -m detection_rules dev gh-release validate-ml-dga-asset` for model bundles
* `python -m detection_rules dev gh-release validate-ml-detections-asset` for rule/job bundles

Pay attention to the output to determine any necessary changes. The checks may not be all-inclusive, so always test on a
live stack, even when validation passes, before saving to a GitHub release.

#### Releasing

Install dependencies with `pip install -r requirements-dev.txt`

A release can be created via the CLI using `python -m detection_rules dev gh-release create-ml`

* you can only authenticate with a GitHub token
* the base directory name and release name must match
* you must have write permissions to the repo to create a release
* validation also runs as part of this command, with a prompt to proceed
* upon completion, a manifest is uploaded as an asset to the GitHub release

To test, you can fork the repo and use `--repo <your-fork>` to validate that a release is working as expected.