# GreySec MAL — Master Kanban
**Product:** GreySec Malware Analysis Lab
**Type:** Internal Build Project
**Status:** BUILDING
**Updated:** 2026-05-07
**Parent debrief:** `~/greysec/ops/debriefs/malware-lab-2026-05-07.md`
---
## Background
GreySec MAL is a self-hosted malware analysis sandbox for red team operators. It takes a binary payload, detonates it in an isolated Windows 11 VM instrumented with EDR (Fibratus + Whiskers + RedEdr), captures behavioral events via RabbitMQ, and produces a client-facing analysis report with a Detection Score (0-100) and MITRE ATT&CK kill chain map.
**Architecture:**
```
Payload Upload → LitterBox (:1337) → SMB Share Mount → Windows VM (:1337)
    → Fibratus (kernel events)
    → Whiskers (REST API :8080)
    → RedEdr (EDR reporting)
    → RabbitMQ (event queue)
    → variant_event_consumer (Python)
    → Supabase (structured data)
    → Detection Score + MITRE ATT&CK Report
```
**Current status:** ARCHITECTURE VERIFIED. 4 bugs (3 critical, 1 degraded) block end-to-end operation. Fix order is strict.
---
## Pipeline Definition
**What the product IS:**
Drop a binary. Get a Detection Score + MITRE ATT&CK kill chain. Client data never leaves your infrastructure.
**What the client receives:**
- Detection Score (0-100) — how likely this payload is to be flagged by EDR
- MITRE ATT&CK kill chain map — which tactics and techniques the payload uses
- Behavioral analysis summary — what the payload actually did (file ops, network ops, process ops)
- Raw event log (optional) — full Fibratus event stream for manual review
**Target buyer:**
- Red team operators testing C2 payloads before deployment
- MSSPs running adversary simulation for clients
- Security teams with HIPAA/BAA obligations that prevent cloud malware analysis
- Law firms and financial institutions with strict client confidentiality requirements
**SLA (target):**
- Analysis turnaround: < 5 minutes for typical payloads (< 10MB)
- Report available: via web dashboard or API
- Uptime: 99% (target, TBD with Adam)
---
## Current State
### What Works
- v1 Python payload: ran for 16 seconds, generated real EDR events, Fibratus saw them, Whiskers returned them via `/api/alerts/fibratus/since` — core event path verified
- RabbitMQ → variant_event_consumer → Supabase: working
- Docker-compose stack: LitterBox, RabbitMQ, Fibratus bridge, consumer all start cleanly
- Pre-flight check script exists at `~/bin/greysec/pre-flight-vm-check.sh` (not yet run in a session)
### What Is Broken
| # | Bug | Severity | Fix Time | Cascade |
|---|-----|----------|----------|---------|
| 1 | VM share mount `\\172.28.0.1\share` unreachable from Windows VM — payloads may not reach analysis dir | CRITICAL | 30 min | Blocks all testing |
| 2 | RedEdr returns zero events despite Fibratus seeing real syscalls — event data doesn't reach final report | CRITICAL | 30-60 min | Blocks EDR validation |
| 3 | Whiskers has no Windows service wrapper — dies when parent process exits, requires manual PAExec restart | CRITICAL | 1 hour | Blocks reliability |
| 4 | `manager.py` lines 418-419 hardcode `init_wait_time = 5` regardless of config, so payloads are killed at 5s | DEGRADED | 30 min | Blocks extended runs |
**Fix order:** 1 → 2 → 3 → 4. Issue 4 is blocked by Issue 1 (can't test 4 until share mount works).
---
## BOARD
### BACKLOG
- [ ] Build Detection Score algorithm (0-100 from Fibratus event frequency + severity + MITRE technique count)
- [ ] Build web dashboard for results (currently Supabase only — no client-facing UI)
- [ ] Build client upload portal (currently manual `curl` to localhost:1337)
- [ ] Build MITRE ATT&CK kill chain mapper (Fibratus events → ATT&CK tactic/technique IDs)
- [ ] Write `greysec-malware-pipeline` skill (standalone — not yet created)
- [ ] Add payload hardening guidance output (what to change in the binary to lower Detection Score)
- [ ] Set up TLS for LitterBox API (currently plain HTTP — fine for internal, not for client-facing portal)
- [ ] Build multi-user access control (when portal is client-facing, need auth)
- [ ] Benchmark performance: typical payload analysis time, max payload size, concurrent analysis capacity
### IN PROGRESS
_(empty — no work currently active)_
### VALIDATING
_(empty)_
### DONE
- [x] Architecture design (RabbitMQ + Fibratus + Whiskers + Supabase)
- [x] Docker-compose stack (LitterBox + RabbitMQ + bridges)
- [x] v1 Python payload proves end-to-end event path
- [x] Pre-flight VM check script written (`~/bin/greysec/pre-flight-vm-check.sh`)
- [x] Supabase schema for analysis results
### BLOCKED
- [ ] **ISSUE 1: VM share mount** — Cannot test payloads until SMB share is reachable from inside VM
- [ ] **ISSUE 2: RedEdr zero events** — Cannot validate EDR reporting until share mount works
---
## Technical Fix Tasks
### Task 1: Fix VM Share Mount (CRITICAL — do first)
**What:** `\\172.28.0.1\share` (SMB) not reachable from inside Windows VM at 172.28.0.10
**Suspected root cause:** The Docker bridge network (172.28.0.0/24) may not be attached to the VM's network interface, or SMB port 445 may be blocked by Windows Firewall. Neither has been confirmed yet.
**Fix approach A:** Verify Docker bridge attachment and open Windows Firewall for SMB.
**Fix approach B (preferred):** Replace SMB mount with HTTP upload endpoint inside VM — more reliable across Docker bridge, no firewall holes.
**Files to touch:**
- `~/greysec/tools/LitterBox/docker-compose.yml` (change mount mechanism)
- May need new endpoint in `~/greysec/tools/LitterBox/app/analyzers/payload_receiver.py`
**Who:** qwen2.5-coder:14b
**Time:** ~30 minutes
**Verification:** From inside VM: `curl -F "file=@test.exe" http://172.28.0.1:PORT/upload` returns 200
**Acceptance criteria:**
- VM can reach LitterBox upload endpoint
- Payload file appears in VM analysis directory
- LitterBox begins processing within 10 seconds of upload
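Before choosing between approach A and approach B, it may help to confirm whether the SMB port is reachable at all from inside the VM. The following is a minimal sketch, assuming Python is available inside the VM; `port_reachable` is a hypothetical helper, not part of LitterBox, and it only tests that a TCP connection can be opened.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical diagnostic helper. It proves network reachability only,
    not that the SMB share itself is exported or that credentials work.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example call from inside the VM (address from this task, 445 is the SMB port):
#   port_reachable("172.28.0.1", 445)
```

A `False` result is consistent with both suspected causes, so it does not distinguish them. A `True` result would rule out the bridge attachment and the firewall, and would suggest looking at the share configuration instead.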
---
### Task 2: Fix RedEdr Zero Events (CRITICAL — do second)
**What:** Fibratus sees real syscalls. Whiskers `/api/alerts/fibratus/since` returns events. But RedEdr report shows nothing.
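**Root cause:** Not yet isolated. The expected event path is: Fibratus writes to the Windows Application Event Log → Whiskers reads it via `wevtutil` → Whiskers publishes over HTTP → the consumer receives. Fibratus and the Whiskers endpoint are confirmed working, so the break is somewhere between Whiskers and the final report.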
**Fix approach:**
1. Check Fibratus filter rules — are they capturing the right event types?
2. Check Whiskers polling interval — is it fast enough?
3. Check `variant_event_consumer.py` — is it parsing Whiskers output correctly?
4. Run a known-syscall payload and trace events at each hop
**Files to touch:**
- `~/bin/greysec/fibratus_rabbitmq_bridge.py`
- `~/bin/greysec/variant_event_consumer.py`
- Fibratus config `~/greysec/tools/fibratus/config.yaml`
**Who:** qwen2.5-coder:14b
**Time:** ~30-60 minutes (diagnosis + fix)
**Verification:** Run the `ransomware_sim_v1.py` payload → confirm events in RedEdr report, not just Whiskers endpoint
**Acceptance criteria:**
- Payload makes real OpenProcess/CreateFile syscalls
- Fibratus events appear in Whiskers `/api/alerts/fibratus/since` output
- Events are parsed and stored in Supabase
- RedEdr-format report shows the events with correct timestamps
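Step 4 of the fix approach (trace events at each hop) reduces to finding the first hop that sees nothing while an upstream hop saw something. A minimal sketch follows, assuming per-hop event counts are collected by hand or by separate scripts; the hop names and the `first_broken_hop` helper are hypothetical labels for illustration, not identifiers from the codebase.

```python
from typing import Optional

# Hypothetical hop labels, ordered along the event path described in this task.
HOPS = ["fibratus", "whiskers", "consumer", "supabase", "rededr_report"]


def first_broken_hop(counts: dict) -> Optional[str]:
    """Return the first hop with zero events downstream of a hop that saw events.

    `counts` maps hop label -> number of events observed for one test run.
    Returns None if no drop-to-zero is found (including the all-zero case,
    where the payload itself probably produced nothing).
    """
    upstream_seen = False
    for hop in HOPS:
        seen = counts.get(hop, 0)
        if upstream_seen and seen == 0:
            return hop
        upstream_seen = upstream_seen or seen > 0
    return None
```

For the symptom described here (events visible in Whiskers, nothing in the report), the function would return whichever of `consumer`, `supabase`, or `rededr_report` is the first to show a zero count, which is the hop to inspect first.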
---
### Task 3: Install Whiskers as Windows Service (CRITICAL — do third)
**What:** Whiskers dies when PAExec parent exits. No persistence across VM restart or process crash.
**Fix:** Install Whiskers as a Windows service using `nssm` (Non-Sucking Service Manager) or `instsrv`.
**Files to touch:**
- VM-side setup: install nssm, run `nssm install Whiskers "C:\path\to\whiskers.exe" "--port 8080"`
**Who:** qwen2.5-coder:14b
**Time:** ~1 hour
**Verification:** Reboot VM → wait 5 minutes → confirm Whiskers still reachable at `http://172.28.0.10:8080/api/alerts/fibratus/since`
**Acceptance criteria:**
- Whiskers survives VM reboot without manual intervention
- Whiskers survives its own parent process exiting
- Health check `curl http://172.28.0.10:8080/health` returns 200
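The reboot verification can be automated by polling the health endpoint until it answers or a deadline passes. This is a sketch only: `http_ok` and `wait_until_healthy` are hypothetical helpers, and the 5-second polling interval is an assumption. The `/health` URL is the one named in the acceptance criteria.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example after rebooting the VM (5-minute window from the verification step):
#   wait_until_healthy(lambda: http_ok("http://172.28.0.10:8080/health"), timeout_s=300)
```

Passing the check as a callable keeps the polling loop testable without a running VM.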
---
### Task 4: Fix manager.py Timeout Handler (DEGRADED — do fourth)
**What:** `~/greysec/tools/LitterBox/app/analyzers/manager.py` lines 418-419 hardcode `init_wait_time = 5` in the `"terminated after"` error handler, overriding `config.yaml`.
**Fix:** Change `init_wait_time = 5` to `init_wait_time = config.get('wait_time', 15)` or similar.
**Files to touch:**
- `~/greysec/tools/LitterBox/app/analyzers/manager.py` (lines ~418-419)
**Who:** qwen2.5-coder:14b
**Time:** ~30 minutes
**Verification:** Set `wait_time: 30` in config.yaml → run a 20-second payload → confirm it runs for 20+ seconds, not 5
**Acceptance criteria:**
- Config value respected, not hardcoded fallback
- C payloads (v2, v3) that need > 5 seconds run to completion
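The proposed fix can be isolated in a small helper so the error handler never carries its own literal. A minimal sketch, assuming the loaded config is a flat dict with a `wait_time` key as in the verification step; `resolve_wait_time` is a hypothetical name, and the real config structure in `manager.py` may be nested differently.

```python
DEFAULT_WAIT_TIME = 15  # fallback value suggested in this task


def resolve_wait_time(config: dict, default: int = DEFAULT_WAIT_TIME) -> int:
    """Return the configured wait time in seconds, or `default` if unset.

    Raises ValueError on non-numeric or non-positive values rather than
    silently falling back, so a bad config is noticed immediately.
    """
    value = config.get("wait_time", default)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
        raise ValueError(f"wait_time must be a positive number, got {value!r}")
    return int(value)
```

With `wait_time: 30` in the config this returns 30; with the key absent it returns 15, never the old hardcoded 5.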
---
## Product Build Tasks
### Task 5: Detection Score Algorithm
**What:** The primary client deliverable. A score from 0-100 that rates how likely this payload is to be detected by EDR.
**Approach:** Combine:
- Event count: how many syscalls per minute
- Event severity: which syscalls (OpenProcess = medium, VirtualAlloc + WriteProcessMemory = high)
- MITRE technique count: how many distinct ATT&CK techniques used
- Network indicators: outbound connections = higher score
- Process injection indicators: highest score
**Output:** JSON field in Supabase + dashboard display
**Formula (target):** `score = min(100, (event_count * 0.1) + (technique_count * 15) + (severity_multiplier * 20) + (network_indicator * 25))`
**Who:** qwen2.5-coder:14b or glm-5.1:cloud for algorithm design
**Time:** ~2 hours
**Verification:** Run 3 known-clean files (e.g. calc.exe, notepad.exe) → each scores < 20. Run ransomware_sim payload → score > 60.
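The target formula can be written as a small pure function. This is a sketch under stated assumptions: the parameter names mirror the formula, `severity_multiplier` and `network_indicator` are treated as already-computed numbers (their derivation is not defined in this document), and the lower clamp at 0 is an addition that the formula does not include.

```python
def detection_score(
    event_count: float,
    technique_count: int,
    severity_multiplier: float,
    network_indicator: float,
) -> float:
    """Compute the 0-100 Detection Score from the target formula.

    Weights (0.1, 15, 20, 25) are copied from the formula in this task.
    """
    raw = (
        event_count * 0.1
        + technique_count * 15
        + severity_multiplier * 20
        + network_indicator * 25
    )
    # Upper clamp is from the formula; lower clamp is an added safeguard.
    return max(0.0, min(100.0, raw))
```

With these weights, a payload with no technique, severity, or network contribution stays under 20 only while `event_count` is below 200. That threshold is worth checking against the actual event volume of calc.exe and notepad.exe before the weights are fixed.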
---
### Task 6: Web Dashboard
**What:** Client-facing results dashboard. Currently Supabase only — no UI.
**Stack:** TBD (recommended: Flask or FastAPI with HTMX for simplicity, or integrate into the existing GreySec dashboard)
**Pages:**
- Upload page: drag-and-drop binary, job ID returned
- Results page: Detection Score, MITRE kill chain visualization, behavioral summary
- History: past analyses for the client's org
**Who:** qwen2.5-coder:14b (or Adam if design decision needed)
**Time:** ~4 hours
**Dependencies:** Tasks 1, 2, and 5 must be complete first
---
### Task 7: Client Upload Portal
**What:** Authenticated API endpoint for clients to submit binaries. Currently manual `curl` to localhost.
**Features:**
- API key auth per client org
- File type validation (.exe, .dll, .bin, .ps1, .py)
- Max file size: 50MB
- Sandbox: each org gets isolated analysis environment (future scope — V1 is shared infra)
**Files to touch:**
- `~/greysec/tools/LitterBox/app/analyzers/payload_receiver.py` (new endpoints)
- `~/greysec/tools/LitterBox/Config/config.yaml` (API key config)
**Who:** qwen2.5-coder:14b
**Time:** ~2 hours
**Dependencies:** Task 1 (share mount fix) must be complete
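The file checks and API key comparison listed under Features could look roughly like the following. This is a sketch, not the `payload_receiver.py` implementation: `validate_upload` and `api_key_matches` are hypothetical helpers, 50MB is interpreted as 50 MiB, and key storage and per-org lookup are left out.

```python
import hmac
from pathlib import PurePath

# Extension list and size limit are taken from the Features list in this task.
ALLOWED_EXTENSIONS = {".exe", ".dll", ".bin", ".ps1", ".py"}
MAX_SIZE_BYTES = 50 * 1024 * 1024  # assumption: "50MB" read as 50 MiB


def validate_upload(filename: str, size_bytes: int) -> list:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    ext = PurePath(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        errors.append(f"file type {ext or '(none)'} is not allowed")
    if size_bytes <= 0:
        errors.append("file is empty")
    elif size_bytes > MAX_SIZE_BYTES:
        errors.append(f"file exceeds {MAX_SIZE_BYTES} bytes")
    return errors


def api_key_matches(presented: str, expected: str) -> bool:
    """Compare API keys in constant time to avoid leaking timing information."""
    return hmac.compare_digest(presented.encode(), expected.encode())
```

Extension checks only filter by name. A renamed file passes them, so they are a convenience for clients rather than a security boundary; the sandbox remains the isolation layer.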
---
### Task 8: MITRE ATT&CK Kill Chain Mapper
**What:** Map Fibratus syscall events to MITRE ATT&CK tactic and technique IDs automatically.
**Approach:** Build a mapping table:
- `NtOpenProcess` → T1055 (Process Injection), as a precursor when followed by remote memory writes
- `NtCreateFile` on sensitive paths → T1005 (Data from Local System)
- `VirtualAllocEx` + `WriteProcessMemory` → T1055 (Process Injection)
- `CreateRemoteThread` → T1055 (Process Injection)
- `RegSetValue` → T1112 (Modify Registry)
- `URLDownloadToFile` → T1105 (Ingress Tool Transfer)
**Output:** Kill chain visualization (text or SVG) showing sequence of ATT&CK techniques used
**Files to touch:** `~/bin/greysec/variant_event_consumer.py` (add mapping logic)
**Who:** qwen2.5-coder:14b
**Time:** ~2 hours (building the mapping table is the work)
**Dependencies:** Task 2 (RedEdr events must flow)
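A first version of the mapping table can be a plain dictionary plus one path check. The sketch below makes several assumptions: events are dicts with `name` and optional `path` keys (the real consumer's event shape is not shown in this document), the sensitive path prefixes are placeholders, and each injection-related call maps to T1055 individually rather than requiring the `VirtualAllocEx` + `WriteProcessMemory` combination.

```python
# Hypothetical placeholder prefixes; the real "sensitive paths" list is undefined.
SENSITIVE_PREFIXES = ("c:\\users\\", "c:\\windows\\system32\\config\\")

# API/syscall name -> ATT&CK technique IDs.
TECHNIQUE_MAP = {
    "NtOpenProcess": ["T1055"],       # weak signal on its own
    "VirtualAllocEx": ["T1055"],      # simplification: not checked as a pair
    "WriteProcessMemory": ["T1055"],
    "CreateRemoteThread": ["T1055"],
    "RegSetValue": ["T1112"],
    "URLDownloadToFile": ["T1105"],
}


def map_events(events: list) -> list:
    """Return distinct technique IDs in order of first appearance."""
    seen = []
    for event in events:
        name = event.get("name", "")
        techniques = list(TECHNIQUE_MAP.get(name, []))
        path = str(event.get("path", "")).lower()
        # NtCreateFile only counts when it touches a sensitive path.
        if name == "NtCreateFile" and path.startswith(SENSITIVE_PREFIXES):
            techniques.append("T1005")
        for technique in techniques:
            if technique not in seen:
                seen.append(technique)
    return seen
```

Preserving first-appearance order gives the kill chain sequence for free, and `len(map_events(...))` is the distinct technique count that the Detection Score needs.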
---
## Definition of Done
GreySec MAL is operational when:
1. All 4 bugs (3 critical, 1 degraded) are fixed and verified
2. A known-malicious payload (ransomware_sim_v1.py) produces a Detection Score > 60
3. MITRE ATT&CK kill chain shows at least 3 techniques for that payload
4. A known-clean payload (notepad.exe) produces a Detection Score < 20
5. Analysis turnaround is < 5 minutes for a 1MB binary
6. Client upload portal accepts a binary via API and returns a job ID
7. Results are accessible via web dashboard within 5 minutes of upload
8. Skill file `greysec-malware-pipeline` exists and documents the full operational procedure
9. Time tracking is hooked into the pipeline (AI minutes logged to TIME-LOG)
10. gbrain logging is hooked into the pipeline (findings logged post-analysis)
---
## DEBT (Action Items from This Kanban)
| Action Item | Priority | Status | Notes |
|------------|----------|--------|-------|
| Fix VM share mount (Task 1) | CRITICAL | open | Do first — blocks all testing |
| Fix RedEdr zero events (Task 2) | CRITICAL | open | Do second — blocks reporting |
| Install Whiskers as Windows service (Task 3) | CRITICAL | open | Do third — blocks reliability |
| Fix manager.py timeout (Task 4) | DEGRADED | open | Do fourth |
| Build Detection Score algorithm (Task 5) | HIGH | open | Primary deliverable metric |
| Build web dashboard (Task 6) | HIGH | open | Client-facing UI |
| Build client upload portal (Task 7) | HIGH | open | API for clients |
| Build MITRE ATT&CK mapper (Task 8) | HIGH | open | Kill chain output |
| Write greysec-malware-pipeline skill | MEDIUM | open | Docs |
| Add TIME-LOG hook | MEDIUM | open | Cost tracking |
| Add gbrain logging hook | MEDIUM | open | Knowledge capture |