# GreySec MAL — Technical Runbook
**Product:** GreySec Malware Analysis Lab
**Version:** 1.0
**Updated:** 2026-05-07
**Parent:** `~/greysec/tools/malware-analysis-pipeline/kanban.md`
---
## Architecture
```
Docker Host (Linux)
172.28.0.1
┌───────────────[Docker Bridge: 172.28.0.0/24]──────────────┐
│                                                           │
LitterBox API                                        Windows 11 VM
:1337                                                172.28.0.10
(upload portal)                                             │
(orchestration)                                      Fibratus (kernel)
(result storage)                                            │
                                                     Whiskers (:8080)
RabbitMQ                                                    │
:5672                                                       │
│                                                    RedEdr reporting
variant_event_consumer                                      │
│                                                           │
Supabase                                                    │
(results DB)                                                │
│                                                    [SHARE MOUNT]
Web Dashboard                                        C:\analysis\
```
**Key address:** LitterBox API = `http://172.28.0.1:1337`
**Key address:** Whiskers (inside VM) = `http://172.28.0.10:8080`
---
## Component Inventory
| Component | Location | Port | Purpose | Credentials |
|-----------|----------|------|---------|-------------|
| LitterBox API | `~/greysec/tools/LitterBox/` | 1337 | Upload portal + orchestration | None (local) |
| RabbitMQ | Docker container | 5672 | Event queue | guest/guest (local) |
| variant_event_consumer | `~/bin/greysec/variant_event_consumer.py` | — | Parse events → Supabase | Via env |
| fibratus_rabbitmq_bridge | `~/bin/greysec/fibratus_rabbitmq_bridge.py` | — | Bridge Fibratus to RabbitMQ | Via env |
| Whiskers | Inside Windows VM | 8080 | EDR REST API | None |
| Fibratus | Inside Windows VM | — | Kernel event capture | — |
| RedEdr | Inside Windows VM | — | EDR reporting (RedEdr.exe) | — |
| Supabase | Cloud (or local) | 3000 | Results database | greysec-dev-key-2026 |
| pre-flight-vm-check.sh | `~/bin/greysec/pre-flight-vm-check.sh` | — | VM health check script | — |
---
## Prerequisites
Before running the pipeline:
1. Docker daemon running on Linux host
2. Windows 11 VM running at 172.28.0.10
3. Kali container reachable from host
4. Supabase accessible at localhost:3000 (or cloud)
5. MacBook Ollama reachable at 100.127.137.64 (for AI augmentations)
---
## Pre-Flight Checklist
Run before every session:
```bash
# 1. Check Docker containers
docker ps --format "table {{.Names}}\t{{.Status}}" | grep -E "litterbox|rabbitmq|fibratus"
# 2. Check VM is running
ping -c 2 172.28.0.10
# 3. Check Whiskers is up
curl -s http://172.28.0.10:8080/health
# Expected: {"status":"ok"} or similar
# 4. Check RabbitMQ is up (management API on 15672; AMQP itself is on 5672)
curl -s -u guest:guest http://localhost:15672/api/overview | jq '.queue_totals'
# Expected: an object with message counts, e.g. {"messages": N, "messages_ready": N, ...}
# 5. Check Supabase reachable
curl -s http://localhost:3000/health | jq '.status'
# Expected: "ready"
# 6. Check share mount from VM side (AFTER FIX — currently broken)
# From inside VM:
# curl -F "file=@test.exe" http://172.28.0.1:1337/upload
```
If any check fails, resolve before uploading payloads.
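The HTTP checks above can also be scripted so that a single failure blocks the session. The following is a minimal sketch, assuming both health endpoints answer with HTTP 200 when healthy; `run_preflight`, `http_ok`, and the `CHECKS` mapping are hypothetical names, not part of the pipeline.

```python
import urllib.request

# URLs taken from the checklist above; treating HTTP 200 as "healthy" is an assumption.
CHECKS = {
    "whiskers": "http://172.28.0.10:8080/health",
    "supabase": "http://localhost:3000/health",
}

def http_ok(url, timeout=5):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors, timeouts, and HTTP error statuses
        return False

def run_preflight(checks=CHECKS, probe=http_ok):
    """Run every check and return the names of the ones that failed."""
    return [name for name, url in checks.items() if not probe(url)]
```

Calling `run_preflight()` returns an empty list when every endpoint responds; any name in the list is a component to fix before uploading payloads.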
---
## Startup Sequence
Run in order. Wait for each to be healthy before moving to the next.
```bash
# 1. Start Docker stack
cd ~/greysec/tools/LitterBox
docker-compose up -d
sleep 30  # give the containers time to initialize
# 2. Verify containers are up
docker ps | grep -E "litterbox|rabbitmq"
# 3. Start variant_event_consumer
cd ~/greysec/tools/LitterBox
python3 ~/bin/greysec/variant_event_consumer.py &
# Or use supervisor/systemd if running as service
# 4. Verify VM is running
ping -c 1 172.28.0.10
# 5. Start Whiskers (manual until Task 3 is done; repeat after every VM reboot)
# From inside the VM, or remotely via PAExec:
# PAExec \\172.28.0.10 -u administrator -p [password] "C:\path\to\whiskers.exe"
# 6. Verify Whiskers is responding
curl -s http://172.28.0.10:8080/health
# 7. Verify Fibratus is running inside VM
# On VM: sc query fibratus
# Should show RUNNING
# 8. Verify RabbitMQ connection from consumer
curl -s -u guest:guest http://localhost:15672/api/overview | jq '.message_stats'
```
---
## Shutdown Sequence
```bash
# 1. Stop uploading new payloads (drain queue)
# Check RabbitMQ for pending messages
curl -s -u guest:guest http://localhost:15672/api/queues | jq '.[] | select(.messages > 0)'
# 2. Stop variant_event_consumer
pkill -f variant_event_consumer
# 3. Stop Whiskers (if Task 3 not done — manual)
# On VM: taskkill /IM whiskers.exe /F
# 4. Stop Docker containers
cd ~/greysec/tools/LitterBox
docker-compose down
# 5. Shutdown VM
# virsh shutdown greysec-win11
# or from inside VM: shutdown /s /t 0
```
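Step 1 depends on knowing when the queues are actually empty. As a sketch, the same `/api/queues` response used above can be filtered in Python; `pending_queues` and `fetch_queues` are hypothetical helper names, and the guest credentials are the local defaults listed in the component inventory.

```python
import base64
import json
import urllib.request

def pending_queues(queues):
    """Return {queue name: message count} for queues that still hold messages."""
    return {q["name"]: q["messages"] for q in queues if q.get("messages", 0) > 0}

def fetch_queues(url="http://localhost:15672/api/queues", user="guest", password="guest"):
    """Fetch the queue list from the RabbitMQ management API using basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)
```

It is safe to stop the consumer once `pending_queues(fetch_queues())` returns an empty dict.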
---
## Payload Upload Procedure
### Via CLI (current method)
```bash
# Upload a payload
curl -X POST http://172.28.0.1:1337/upload \
  -F "file=@ransomware_sim_v1.py" \
  -F "timeout=30" \
  -F 'metadata={"name":"ransomware_sim_v1","category":"test","submitted_by":"operator"}'
# Check job status
curl http://172.28.0.1:1337/jobs/[JOB_ID]
# Get results
curl http://172.28.0.1:1337/results/[JOB_ID]
```
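Checking job status by hand gets tedious for long-running payloads, so the status call can be wrapped in a polling loop. This is a sketch only: the runbook does not document the response schema, so the `status` field and its terminal values `done` and `failed` are assumptions, and `wait_for_job` and `fetch_job` are hypothetical names.

```python
import json
import time
import urllib.request

BASE_URL = "http://172.28.0.1:1337"  # LitterBox API address from this runbook

def fetch_job(job_id):
    """GET /jobs/<job_id> and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)

def wait_for_job(job_id, fetch=fetch_job, interval=5.0, max_polls=60, sleep=time.sleep):
    """Poll until the job reports a terminal status or max_polls is exhausted.

    ASSUMPTION: the response is a JSON object with a "status" field whose
    terminal values are "done" or "failed"; adjust to the real schema.
    """
    for _ in range(max_polls):
        job = fetch(job_id)
        if job.get("status") in ("done", "failed"):
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running LitterBox instance.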
### Via Client Portal (future — Task 7)
```bash
# Authenticated upload (future)
curl -X POST https://[client-portal-host]/upload \
-H "Authorization: Bearer [API_KEY]" \
-F "file=@payload.exe" \
-F "timeout=60"
# Returns job_id for polling
```
---
## Reading the Results
### Detection Score (0-100)
The primary deliverable metric.
| Score | Interpretation | Action |
|-------|---------------|--------|
| 0-20 | Clean — no suspicious syscalls | Deployable in most environments |
| 21-40 | Low — minor suspicious activity | Review behavioral summary before deployment |
| 41-60 | Medium — multiple suspicious syscalls | Modify payload or test in isolated environment |
| 61-80 | High — significant EDR coverage | Likely to be blocked by most EDR products |
| 81-100 | Critical — extensive offensive tooling | Not recommended for production use |
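For scripted reporting, the bands in the table can be expressed as a small lookup. A minimal sketch; `score_band` is a hypothetical helper name, and the boundaries are copied directly from the table.

```python
# Bands copied from the table above: (inclusive upper bound, label).
BANDS = [(20, "Clean"), (40, "Low"), (60, "Medium"), (80, "High"), (100, "Critical")]

def score_band(score):
    """Return the interpretation label for a Detection Score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for upper, label in BANDS:
        if score <= upper:
            return label
```

For example, `score_band(75)` returns `"High"`.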
### MITRE ATT&CK Kill Chain
The sequence of ATT&CK tactics and techniques the payload used.
**Format:**
```
[1] T1059.001 — PowerShell: one-liner downloader
[2] T1055 — Process Injection: VirtualAllocEx + WriteProcessMemory
[3] T1055 — Process Injection: CreateRemoteThread
[4] T1105 — Ingress Tool Transfer: URLDownloadToFile
```
**What to look for:**
- Technique count > 3: sophisticated payload
- T1055 (Process Injection): likely evasion attempt
- T1105 (Ingress Tool Transfer): network indicators
- T1486 (Data Encrypted for Impact): ransomware behavior
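The review points above can be applied mechanically to a kill chain in the format shown. The sketch below assumes one entry per line and counts unique technique IDs for the "more than 3" rule (the list does not say whether repeats count); `review_kill_chain` is a hypothetical helper name.

```python
import re

# One kill-chain line: "[step] Txxxx <separator> description".
LINE_RE = re.compile(r"^\[(\d+)\]\s+(T\d{4}(?:\.\d{3})?)\s+\S+\s+(.*)$")

# Technique IDs and notes taken from the "What to look for" list above.
FLAGS = {
    "T1055": "likely evasion attempt",
    "T1105": "network indicators",
    "T1486": "ransomware behavior",
}

def review_kill_chain(text):
    """Return (unique technique IDs in order of appearance, list of review notes)."""
    techniques = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(2) not in techniques:
            techniques.append(match.group(2))
    notes = [f"{tid}: {FLAGS[tid]}" for tid in techniques if tid in FLAGS]
    if len(techniques) > 3:
        notes.append("more than 3 techniques: sophisticated payload")
    return techniques, notes
```

Lines that do not match the expected format are skipped rather than raising, so a partially garbled report still yields notes for the entries that parse.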
### Behavioral Summary
Text summary of what the payload did:
- File operations (created/modified/deleted)
- Network operations (outbound connections, DNS queries)
- Process operations (spawned children, injected into processes)
- Registry operations (modified keys)
---
## Troubleshooting Guide
### Problem: Payload never starts processing
**Symptoms:** Upload returns 200 OK but no job in queue.
**Diagnosis:**
1. Check share mount is reachable from VM (see Issue 1)
2. Check `curl -v http://172.28.0.1:1337/jobs` — does job appear?
3. Check LitterBox logs: `docker logs litterbox-api`
**Fix order:** Verify share mount → verify upload endpoint → check LitterBox logs
---
### Problem: Payload killed at exactly 5 seconds
**Symptoms:** All payloads die at 5 seconds, regardless of timeout setting.
**Diagnosis:** This is Issue 4. Check `manager.py` lines 418-419.
```bash
grep -n "init_wait_time" ~/greysec/tools/LitterBox/app/analyzers/manager.py
# Should show hardcoded value = 5
```
**Fix:** Change the code to read `init_wait_time` from `config.yaml` instead of using the hardcoded value of 5.
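One way to structure that fix is to resolve the value from the parsed configuration and keep 5 only as a fallback. This is a sketch, not the actual `manager.py` code: it assumes `init_wait_time` is a top-level key in `config.yaml` and that the file has already been parsed into a dict. `resolve_init_wait` is a hypothetical helper name.

```python
DEFAULT_INIT_WAIT = 5  # the currently hardcoded value, kept only as a fallback

def resolve_init_wait(config, default=DEFAULT_INIT_WAIT):
    """Return init_wait_time from an already-parsed config mapping.

    ASSUMPTION: the key sits at the top level of config.yaml; the real
    file may nest it, so adjust the lookup accordingly.
    """
    value = config.get("init_wait_time", default)
    # Reject booleans, strings, and non-positive numbers instead of silently using them.
    if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
        raise ValueError(f"init_wait_time must be a positive number, got {value!r}")
    return value
```

Failing loudly on a malformed value avoids a silent fallback to 5 seconds, which would reproduce the symptom this section describes.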
---
### Problem: Whiskers endpoint returns 502 or timeout
**Symptoms:** `curl http://172.28.0.10:8080/api/alerts/fibratus/since` fails.
**Diagnosis:** Whiskers process died (Issue 3 — no keepalive).
**Fix (immediate):** PAExec back into VM and restart Whiskers.
**Fix (permanent):** Task 3 — install as Windows service.
---
### Problem: RedEdr report is empty despite real syscalls
**Symptoms:** Whiskers returns events but RedEdr shows nothing.
**Diagnosis:** This is Issue 2. Fibratus sees events but they don't reach the final report.
**Fix:** Trace the event path:
1. Inside VM: run `fibratus dump` — are events being captured by Fibratus?
2. `curl http://172.28.0.10:8080/api/alerts/fibratus/since` — does Whiskers see them?
3. Check `variant_event_consumer` logs — is it receiving from RabbitMQ?
4. Check Supabase `malware_analyses` table — are events stored?
Find the break point and fix at that layer.
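Once each hop has been checked, locating the break point is a first-failure search along the event path. A small sketch, assuming the operator records per stage whether events were observed; `first_break` and the stage names in `STAGES` are illustrative labels for the four steps above.

```python
# Stages in the order events travel, as listed in the steps above.
STAGES = ["fibratus", "whiskers", "variant_event_consumer", "supabase"]

def first_break(seen):
    """Given {stage: bool} (were events observed at this stage?), return the
    first stage with no events, or None if every stage saw them."""
    for stage in STAGES:
        if not seen.get(stage, False):
            return stage
    return None
```

For example, `first_break({"fibratus": True, "whiskers": True})` returns `"variant_event_consumer"`, which points at the RabbitMQ hop as the layer to fix.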
---
### Problem: RabbitMQ queue not draining
**Symptoms:** `curl -u guest:guest http://localhost:15672/api/queues` shows messages accumulating.
**Diagnosis:** `variant_event_consumer` is not running or is crashing on messages.
**Fix:**
```bash
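# Check whether the consumer is running
pgrep -af variant_event_consumer
# Restart it in the foreground with unbuffered output so crashes are visible
# (python3 -v only traces imports; it does not enable verbose logging)
python3 -u ~/bin/greysec/variant_event_consumer.py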
```
---
### Problem: VM unreachable at 172.28.0.10
**Symptoms:** `ping 172.28.0.10` fails.
**Diagnosis:** VM is down or Docker bridge network changed.
**Fix:**
```bash
# Check VM status
virsh list --all
# Restart VM
virsh start greysec-win11
# Verify the Docker bridge subnet (expected: one network with 172.28.0.0/24)
docker network ls -q | xargs docker network inspect | jq -r '.[] | "\(.Name)\t\(.IPAM.Config[0].Subnet)"'
```
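If the bridge subnet has changed, the VM address may no longer fall inside it. The check can be sketched with the standard `ipaddress` module; `vm_in_subnet` is a hypothetical helper name, and the defaults are the addresses used throughout this runbook.

```python
import ipaddress

def vm_in_subnet(vm_ip="172.28.0.10", subnet="172.28.0.0/24"):
    """Return True if the VM address falls inside the given bridge subnet."""
    return ipaddress.ip_address(vm_ip) in ipaddress.ip_network(subnet)
```

Pass the subnet reported by `docker network inspect` as `subnet`; a `False` result means the bridge was recreated on a different range and the VM's network configuration needs to be updated.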
---
## Escalation Path
**If you encounter any of these, ping @Adam immediately:**
1. VM will not start or boots to BSOD
2. Docker stack fails to start after host reboot
3. Supabase is unreachable and not recoverable within 5 minutes
4. MacBook Ollama needs to be re-authenticated (token expired)
5. Any of the 4 critical bugs cannot be resolved within 2 hours of focused work
**Before escalating:**
- Document what you tried
- Note exact error messages
- Note which component is failing (ping the exact hop)
**Format for escalation:**
```
[@Adam] [COMPONENT] is broken: [ONE-LINE DESCRIPTION]
What I tried: [SHORT LIST]
Error: [EXACT ERROR]
Last working: [WHEN — if known]
```
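To keep escalations consistent, the template can be filled in programmatically. A minimal sketch; `format_escalation` is a hypothetical helper name, and the field order follows the template above.

```python
def format_escalation(component, description, tried, error, last_working="unknown"):
    """Build an escalation message following the template above."""
    return "\n".join([
        f"[@Adam] {component} is broken: {description}",
        f"What I tried: {', '.join(tried)}",
        f"Error: {error}",
        f"Last working: {last_working}",
    ])
```

For example, `format_escalation("Whiskers", "502 on /health", ["restarted via PAExec"], "connection refused")` yields a four-line message ready to paste.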
---
## Appendix: Test Payloads
| Name | Path | Purpose | Expected Behavior |
|------|------|---------|-------------------|
| ransomware_sim_v1.py | `~/greysec/engagements/litterbox-fibratus-deploy/payloads/` | Detection Score test | 60-80 score, multiple ATT&CK techniques |
| ransomware_sim_v2.c | `~/greysec/engagements/litterbox-fibratus-deploy/payloads/` | Extended run test | Run > 5 seconds, capture output |
| ransomware_sim_v3.c | `~/greysec/engagements/litterbox-fibratus-deploy/payloads/` | Extended run test | Same as v2 |
| calc.exe | Windows system binary | Clean baseline test | Score < 20 |
| notepad.exe | Windows system binary | Clean baseline test | Score < 20 |