feat: add optional CSV export for findings and warnings

stacknil · stacknil · commit 1e1684cf7f4d · 2026-03-24T12:32:25.000+08:00
diff --git a/README.md b/README.md
@@ -1,213 +1 @@
-# LogLens
-
-[![CI](https://github.com/stacknil/LogLens/actions/workflows/ci.yml/badge.svg)](https://github.com/stacknil/LogLens/actions/workflows/ci.yml)
-[![CodeQL](https://github.com/stacknil/LogLens/actions/workflows/codeql.yml/badge.svg)](https://github.com/stacknil/LogLens/actions/workflows/codeql.yml)
-
-C++20 defensive log analysis CLI for Linux authentication logs, with parser coverage telemetry, configurable detection rules, CI, and CodeQL.
-
-It parses `auth.log` / `secure`-style syslog input and `journalctl --output=short-full`-style input, normalizes authentication evidence, applies configurable rule-based detections, and emits deterministic Markdown and JSON reports.
-
-## Project Status
-
-LogLens is an MVP / early release. The repository is stable enough for public review, local experimentation, and extension, but the parser and detection coverage are intentionally narrow.
-
-## Why This Project Exists
-
-Many small security tools can detect a handful of known log patterns. Fewer tools make their parsing limits visible.
-
-LogLens is built around three ideas:
-
-- detection engineering over offensive functionality
-- parser observability over silent failure
-- repository discipline over throwaway scripts
-
-The project reports suspicious login activity while also surfacing parser coverage, unknown-line buckets, CI status, and code scanning hygiene.
-
-## Scope
-
-LogLens is a defensive, public-safe repository.
-It is intended for log parsing, detection experiments, and engineering practice.
-It does not provide exploitation, persistence, credential attack automation, or live offensive capability.
-
-## Repository Checks
-
-LogLens includes two minimal GitHub Actions workflows:
-
-- `CI` builds and tests the project on `ubuntu-latest` and `windows-latest`
-- `CodeQL` runs GitHub code scanning for C/C++ on pushes, pull requests, and a weekly schedule
-
-Both workflows are intended to stay stable enough to require on pull requests to `main`. Release-facing documentation is split across [`CHANGELOG.md`](./CHANGELOG.md), [`docs/release-process.md`](./docs/release-process.md), [`docs/release-v0.1.0.md`](./docs/release-v0.1.0.md), and the repository's GitHub release notes. The repository hardening note is in [`docs/repo-hardening.md`](./docs/repo-hardening.md), and vulnerability reporting guidance is in [`SECURITY.md`](./SECURITY.md).
-
-## Threat Model
-
-LogLens is designed for offline review of `auth.log` and `secure` style text logs collected from systems you own or administer. The MVP focuses on common, high-signal patterns that often appear during credential guessing, username enumeration, or bursty privileged command use.
-
-The current tool helps answer:
-
-- Is one source IP generating repeated SSH failures in a short window?
-- Is one source IP trying several usernames in a short window?
-- Is one account running sudo unusually often in a short window?
-
-It does not attempt to replace a SIEM, correlate across hosts, enrich IPs, or decide whether a finding is malicious on its own.
-
-## Detections
-
-LogLens currently detects:
-
-- Repeated SSH failed password attempts from the same IP within 10 minutes
-- One IP trying multiple usernames within 15 minutes
-- Bursty sudo activity from the same user within 5 minutes
-
-LogLens currently parses and reports these additional auth patterns beyond the core detector inputs:
-
-- `Accepted publickey` SSH successes
-- `Failed publickey` SSH failures, which count toward SSH brute-force detection by default
-- `pam_unix(...:auth): authentication failure`
-- `pam_unix(...:session): session opened`
-- selected `pam_faillock(...:auth)` failure variants
-- selected `pam_sss(...:auth)` failure variants
-
-LogLens also tracks parser coverage telemetry for unsupported or malformed lines, including:
-
-- `total_lines`
-- `parsed_lines`
-- `unparsed_lines`
-- `parse_success_rate`
-- `top_unknown_patterns`
-
-LogLens does not currently detect:
-
-- Lateral movement
-- MFA abuse
-- SSH key misuse
-- Many PAM-specific failures beyond the parsed `pam_unix`, `pam_faillock`, and `pam_sss` sample patterns
-- Cross-file or cross-host correlation
-
-## Build
-
-```bash
-cmake -S . -B build
-cmake --build build
-ctest --test-dir build --output-on-failure
-```
-
-For fresh-machine setup and repeatable local presets, see [`docs/dev-setup.md`](./docs/dev-setup.md).
-
-## Run
-
-```bash
-./build/loglens --mode syslog --year 2026 ./assets/sample_auth.log ./out
-./build/loglens --mode journalctl-short-full ./assets/sample_journalctl_short_full.log ./out-journal
-./build/loglens --config ./assets/sample_config.json ./assets/sample_auth.log ./out-config
-```
-
-The CLI writes:
-
-- `report.md`
-- `report.json`
-
-into the output directory you provide. If you omit the output directory, the files are written into the current working directory.
-
-When an input spans multiple hostnames, both reports add compact host-level summaries without changing detector thresholds or introducing cross-host correlation logic.
-
-## Sample Output
-
-For sanitized sample input, see [`assets/sample_auth.log`](./assets/sample_auth.log) and [`assets/sample_journalctl_short_full.log`](./assets/sample_journalctl_short_full.log).
-
-`report.md` summary excerpt:
-
-```markdown
-## Summary
-- Input mode: syslog_legacy
-- Parsed events: 14
-- Findings: 3
-- Parser warnings: 2
-```
-
-`report.json` summary excerpt:
-
-```json
-{
-  "input_mode": "syslog_legacy",
-  "parsed_event_count": 14,
-  "finding_count": 3,
-  "warning_count": 2
-}
-```
-
-The config file schema is intentionally small and strict:
-
-```json
-{
-  "input_mode": "syslog_legacy",
-  "timestamp": {
-    "assume_year": 2026
-  },
-  "brute_force": { "threshold": 5, "window_minutes": 10 },
-  "multi_user_probing": { "threshold": 3, "window_minutes": 15 },
-  "sudo_burst": { "threshold": 3, "window_minutes": 5 },
-  "auth_signal_mappings": {
-    "ssh_failed_password": {
-      "counts_as_attempt_evidence": true,
-      "counts_as_terminal_auth_failure": true
-    },
-    "ssh_invalid_user": {
-      "counts_as_attempt_evidence": true,
-      "counts_as_terminal_auth_failure": true
-    },
-    "ssh_failed_publickey": {
-      "counts_as_attempt_evidence": true,
-      "counts_as_terminal_auth_failure": true
-    },
-    "pam_auth_failure": {
-      "counts_as_attempt_evidence": true,
-      "counts_as_terminal_auth_failure": false
-    }
-  }
-}
-```
-
-This mapping lets LogLens normalize parsed events into detection signals before applying brute-force or multi-user rules. By default, `pam_auth_failure` is treated as lower-confidence attempt evidence and does not count as a terminal authentication failure unless the config explicitly upgrades it.
-
-Timestamp handling is now explicit:
-
-- `--mode syslog` or `input_mode: syslog_legacy` requires `--year` or `timestamp.assume_year`
-- `--mode journalctl-short-full` or `input_mode: journalctl_short_full` parses the embedded year and timezone and ignores `assume_year`
-
-## Example Input
-
-```text
-Mar 10 08:11:22 example-host sshd[1234]: Failed password for invalid user admin from 203.0.113.10 port 51022 ssh2
-Mar 10 08:12:10 example-host sshd[1235]: Accepted password for alice from 203.0.113.20 port 51111 ssh2
-Mar 10 08:15:00 example-host sudo:    alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/usr/bin/systemctl restart ssh
-Mar 10 08:27:10 example-host sshd[1243]: Failed publickey for invalid user svc-backup from 203.0.113.40 port 51240 ssh2
-Mar 10 08:28:33 example-host pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.41  user=alice
-Mar 10 08:29:50 example-host pam_unix(sudo:session): session opened for user root by alice(uid=0)
-Mar 10 08:30:12 example-host sshd[1244]: Connection closed by authenticating user alice 203.0.113.50 port 51290 [preauth]
-Mar 10 08:31:18 example-host sshd[1245]: Timeout, client not responding from 203.0.113.51 port 51291
-```
-
-`journalctl --output short-full` style example:
-
-```text
-Tue 2026-03-10 08:11:22 UTC example-host sshd[2234]: Failed password for invalid user admin from 203.0.113.10 port 51022 ssh2
-Tue 2026-03-10 08:13:10 UTC example-host sshd[2236]: Failed password for test from 203.0.113.10 port 51040 ssh
-Tue 2026-03-10 08:18:05 UTC example-host sshd[2238]: Failed publickey for invalid user deploy from 203.0.113.10 port 51060 ssh2
-Tue 2026-03-10 08:31:18 UTC example-host sshd[2245]: Connection closed by authenticating user alice 203.0.113.51 port 51291 [preauth]
-```
-
-## Known Limitations
-
-- `syslog_legacy` requires an explicit year; LogLens does not guess one implicitly.
-- `journalctl_short_full` currently supports `UTC`, `GMT`, `Z`, and numeric timezone offsets, not arbitrary timezone abbreviations.
-- Parser coverage is still selective: it covers common `sshd`, `sudo`, `pam_unix`, and selected `pam_faillock` / `pam_sss` variants rather than broad Linux auth-family support.
-- Unsupported lines are surfaced as parser telemetry and warnings, not as detector findings.
-- `pam_unix` auth failures remain lower-confidence by default unless signal mappings explicitly upgrade them.
-- Detector configuration uses a fixed `config.json` schema rather than partial overrides or alternate config formats.
-- Findings are rule-based triage aids, not incident verdicts or attribution.
-
-## Future Roadmap
-
-- Additional auth patterns and PAM coverage
-- Optional CSV export
-- Larger sanitized test corpus
+# LogLens
+
+[![CI](https://github.com/stacknil/LogLens/actions/workflows/ci.yml/badge.svg)](https://github.com/stacknil/LogLens/actions/workflows/ci.yml)
+[![CodeQL](https://github.com/stacknil/LogLens/actions/workflows/codeql.yml/badge.svg)](https://github.com/stacknil/LogLens/actions/workflows/codeql.yml)
+
+C++20 defensive log analysis CLI for Linux authentication logs, with parser coverage telemetry, configurable detection rules, CI, and CodeQL.
+
+It parses `auth.log` / `secure`-style syslog input and `journalctl --output=short-full`-style input, normalizes authentication evidence, applies configurable rule-based detections, and emits deterministic Markdown and JSON reports, with optional CSV exports for findings and warnings.
+
+## Project Status
+
+LogLens is an MVP / early release. The repository is stable enough for public review, local experimentation, and extension, but the parser and detection coverage are intentionally narrow.
+
+## Why This Project Exists
+
+Many small security tools can detect a handful of known log patterns. Fewer tools make their parsing limits visible.
+
+LogLens is built around three ideas:
+
+- detection engineering over offensive functionality
+- parser observability over silent failure
+- repository discipline over throwaway scripts
+
+The project reports suspicious login activity while also surfacing parser coverage, unknown-line buckets, CI status, and code scanning hygiene.
+
+## Scope
+
+LogLens is a defensive, public-safe repository.
+It is intended for log parsing, detection experiments, and engineering practice.
+It does not provide exploitation, persistence, credential attack automation, or live offensive capability.
+
+## Repository Checks
+
+LogLens includes two minimal GitHub Actions workflows:
+
+- `CI` builds and tests the project on `ubuntu-latest` and `windows-latest`
+- `CodeQL` runs GitHub code scanning for C/C++ on pushes, pull requests, and a weekly schedule
+
+Both workflows are intended to stay stable enough to require on pull requests to `main`. Release-facing documentation is split across [`CHANGELOG.md`](./CHANGELOG.md), [`docs/release-process.md`](./docs/release-process.md), [`docs/release-v0.1.0.md`](./docs/release-v0.1.0.md), and the repository's GitHub release notes. The repository hardening note is in [`docs/repo-hardening.md`](./docs/repo-hardening.md), and vulnerability reporting guidance is in [`SECURITY.md`](./SECURITY.md).
+
+## Threat Model
+
+LogLens is designed for offline review of `auth.log` and `secure` style text logs collected from systems you own or administer. The MVP focuses on common, high-signal patterns that often appear during credential guessing, username enumeration, or bursty privileged command use.
+
+The current tool helps answer:
+
+- Is one source IP generating repeated SSH failures in a short window?
+- Is one source IP trying several usernames in a short window?
+- Is one account running sudo unusually often in a short window?
+
+It does not attempt to replace a SIEM, correlate across hosts, enrich IPs, or decide whether a finding is malicious on its own.
+
+## Detections
+
+LogLens currently detects:
+
+- Repeated SSH failed password attempts from the same IP within 10 minutes
+- One IP trying multiple usernames within 15 minutes
+- Bursty sudo activity from the same user within 5 minutes
+
+LogLens currently parses and reports these additional auth patterns beyond the core detector inputs:
+
+- `Accepted publickey` SSH successes
+- `Failed publickey` SSH failures, which count toward SSH brute-force detection by default
+- `pam_unix(...:auth): authentication failure`
+- `pam_unix(...:session): session opened`
+- selected `pam_faillock(...:auth)` failure variants
+- selected `pam_sss(...:auth)` failure variants
+
+LogLens also tracks parser coverage telemetry for unsupported or malformed lines, including:
+
+- `total_lines`
+- `parsed_lines`
+- `unparsed_lines`
+- `parse_success_rate`
+- `top_unknown_patterns`
+
+LogLens does not currently detect:
+
+- Lateral movement
+- MFA abuse
+- SSH key misuse
+- Many PAM-specific failures beyond the parsed `pam_unix`, `pam_faillock`, and `pam_sss` sample patterns
+- Cross-file or cross-host correlation
+
+## Build
+
+```bash
+cmake -S . -B build
+cmake --build build
+ctest --test-dir build --output-on-failure
+```
+
+For fresh-machine setup and repeatable local presets, see [`docs/dev-setup.md`](./docs/dev-setup.md).
+
+## Run
+
+```bash
+./build/loglens --mode syslog --year 2026 ./assets/sample_auth.log ./out
+./build/loglens --mode journalctl-short-full ./assets/sample_journalctl_short_full.log ./out-journal
+./build/loglens --config ./assets/sample_config.json ./assets/sample_auth.log ./out-config
+./build/loglens --mode syslog --year 2026 --csv ./assets/sample_auth.log ./out-csv
+```
+
+The CLI writes:
+
+- `report.md`
+- `report.json`
+
+into the output directory you provide. If you omit the output directory, the files are written into the current working directory.
+
+When you add `--csv`, LogLens also writes:
+
+- `findings.csv`
+- `warnings.csv`
+
+The CSV schema is intentionally small and stable:
+
+- `findings.csv`: `rule`, `subject_kind`, `subject`, `event_count`, `window_start`, `window_end`, `usernames`, `summary`
+- `warnings.csv`: `kind`, `message`
+
+When an input spans multiple hostnames, both reports add compact host-level summaries without changing detector thresholds or introducing cross-host correlation logic.
+
+## Sample Output
+
+For sanitized sample input, see [`assets/sample_auth.log`](./assets/sample_auth.log) and [`assets/sample_journalctl_short_full.log`](./assets/sample_journalctl_short_full.log).
+
+`report.md` summary excerpt:
+
+```markdown
+## Summary
+- Input mode: syslog_legacy
+- Parsed events: 14
+- Findings: 3
+- Parser warnings: 2
+```
+
+`report.json` summary excerpt:
+
+```json
+{
+  "input_mode": "syslog_legacy",
+  "parsed_event_count": 14,
+  "finding_count": 3,
+  "warning_count": 2
+}
+```
+
+The config file schema is intentionally small and strict:
+
+```json
+{
+  "input_mode": "syslog_legacy",
+  "timestamp": {
+    "assume_year": 2026
+  },
+  "brute_force": { "threshold": 5, "window_minutes": 10 },
+  "multi_user_probing": { "threshold": 3, "window_minutes": 15 },
+  "sudo_burst": { "threshold": 3, "window_minutes": 5 },
+  "auth_signal_mappings": {
+    "ssh_failed_password": {
+      "counts_as_attempt_evidence": true,
+      "counts_as_terminal_auth_failure": true
+    },
+    "ssh_invalid_user": {
+      "counts_as_attempt_evidence": true,
+      "counts_as_terminal_auth_failure": true
+    },
+    "ssh_failed_publickey": {
+      "counts_as_attempt_evidence": true,
+      "counts_as_terminal_auth_failure": true
+    },
+    "pam_auth_failure": {
+      "counts_as_attempt_evidence": true,
+      "counts_as_terminal_auth_failure": false
+    }
+  }
+}
+```
+
+This mapping lets LogLens normalize parsed events into detection signals before applying brute-force or multi-user rules. By default, `pam_auth_failure` is treated as lower-confidence attempt evidence and does not count as a terminal authentication failure unless the config explicitly upgrades it.
+
+Timestamp handling is now explicit:
+
+- `--mode syslog` or `input_mode: syslog_legacy` requires `--year` or `timestamp.assume_year`
+- `--mode journalctl-short-full` or `input_mode: journalctl_short_full` parses the embedded year and timezone and ignores `assume_year`
+
+## Example Input
+
+```text
+Mar 10 08:11:22 example-host sshd[1234]: Failed password for invalid user admin from 203.0.113.10 port 51022 ssh2
+Mar 10 08:12:10 example-host sshd[1235]: Accepted password for alice from 203.0.113.20 port 51111 ssh2
+Mar 10 08:15:00 example-host sudo:    alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/usr/bin/systemctl restart ssh
+Mar 10 08:27:10 example-host sshd[1243]: Failed publickey for invalid user svc-backup from 203.0.113.40 port 51240 ssh2
+Mar 10 08:28:33 example-host pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.41  user=alice
+Mar 10 08:29:50 example-host pam_unix(sudo:session): session opened for user root by alice(uid=0)
+Mar 10 08:30:12 example-host sshd[1244]: Connection closed by authenticating user alice 203.0.113.50 port 51290 [preauth]
+Mar 10 08:31:18 example-host sshd[1245]: Timeout, client not responding from 203.0.113.51 port 51291
+```
+
+`journalctl --output short-full` style example:
+
+```text
+Tue 2026-03-10 08:11:22 UTC example-host sshd[2234]: Failed password for invalid user admin from 203.0.113.10 port 51022 ssh2
+Tue 2026-03-10 08:13:10 UTC example-host sshd[2236]: Failed password for test from 203.0.113.10 port 51040 ssh
+Tue 2026-03-10 08:18:05 UTC example-host sshd[2238]: Failed publickey for invalid user deploy from 203.0.113.10 port 51060 ssh2
+Tue 2026-03-10 08:31:18 UTC example-host sshd[2245]: Connection closed by authenticating user alice 203.0.113.51 port 51291 [preauth]
+```
+
+## Known Limitations
+
+- `syslog_legacy` requires an explicit year; LogLens does not guess one implicitly.
+- `journalctl_short_full` currently supports `UTC`, `GMT`, `Z`, and numeric timezone offsets, not arbitrary timezone abbreviations.
+- Parser coverage is still selective: it covers common `sshd`, `sudo`, `pam_unix`, and selected `pam_faillock` / `pam_sss` variants rather than broad Linux auth-family support.
+- Unsupported lines are surfaced as parser telemetry and warnings, not as detector findings.
+- `pam_unix` auth failures remain lower-confidence by default unless signal mappings explicitly upgrade them.
+- Detector configuration uses a fixed `config.json` schema rather than partial overrides or alternate config formats.
+- Findings are rule-based triage aids, not incident verdicts or attribution.
+
+## Future Roadmap
+
+- Additional auth patterns and PAM coverage
+- Larger sanitized test corpus
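
The `findings.csv` column list documented in this commit can be illustrated with a small C++ sketch. This is not taken from the LogLens source: the `FindingRow` struct and the `csv_escape`, `findings_header`, and `to_csv_line` functions are hypothetical names, and the quoting follows the common RFC 4180 convention, which is only an assumption about how LogLens escapes fields. The README does not say how multiple usernames are joined in the `usernames` column, so the sketch treats it as one opaque string.

```cpp
#include <string>
#include <string_view>

// Hypothetical row type; the member names mirror the findings.csv columns
// listed in the README, but the type itself is an illustration only.
struct FindingRow {
    std::string rule;
    std::string subject_kind;
    std::string subject;
    int event_count = 0;
    std::string window_start;
    std::string window_end;
    std::string usernames;  // separator inside this field is not specified by the README
    std::string summary;
};

// Quote a field when it contains a comma, double quote, CR, or LF,
// doubling any embedded double quotes (RFC 4180 style, assumed here).
std::string csv_escape(std::string_view field) {
    if (field.find_first_of(",\"\r\n") == std::string_view::npos) {
        return std::string(field);
    }
    std::string out = "\"";
    for (char c : field) {
        if (c == '"') {
            out += '"';
        }
        out += c;
    }
    out += '"';
    return out;
}

// Header line in the column order given by the README.
std::string findings_header() {
    return "rule,subject_kind,subject,event_count,"
           "window_start,window_end,usernames,summary";
}

// Serialize one row in the same column order, without a trailing newline.
std::string to_csv_line(const FindingRow& row) {
    std::string line;
    line += csv_escape(row.rule);
    line += ',';
    line += csv_escape(row.subject_kind);
    line += ',';
    line += csv_escape(row.subject);
    line += ',';
    line += std::to_string(row.event_count);
    line += ',';
    line += csv_escape(row.window_start);
    line += ',';
    line += csv_escape(row.window_end);
    line += ',';
    line += csv_escape(row.usernames);
    line += ',';
    line += csv_escape(row.summary);
    return line;
}
```

A caller would write `findings_header()` once and then one `to_csv_line(row)` per finding. Quoting only the fields that need it keeps simple rows readable while still protecting free-text columns such as `summary` from breaking the column count.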