Awareness Software Group LLC
Threat intelligence datasets

Immutable Threat Intelligence Snapshots

Versioned, audit-ready datasets designed for AI evaluation, regression testing, and security operations. Each snapshot is a frozen slice of reality — so if your model regresses, you can prove it.

Immutable monthly snapshots • JSONL + Parquet + HTML reports • CISA KEV + RSS + Web sources • NVD enrichment + confidence scoring • Interactive visualizations included
Built for: ML teams • Detection engineering • Vulnerability management • Evaluation & audit

Key selling point

"This snapshot will never change." Buyers can reproduce results, compare models over time, and separate data drift from model regressions.

Delivery: private S3 (signed links), or your preferred secure method • License: internal use / no redistribution

Monthly Snapshot (Combined)

Immutable release containing all CISA KEV entries (as of snapshot date) plus validated non-KEV signals published during the month. Includes HTML reports, interactive visualizations, and a sample folder for evaluation.

Best seller • Audit-ready • Training + eval

Typical pricing: $500–$5,000 / snapshot (starter buyers)

Includes: JSONL, Parquet, JSON, HTML reports, visualizations, manifest, and sample folder (25-50 representative records)

KEV Ground Truth

Complete CISA Known Exploited Vulnerabilities catalog with normalization, NVD enrichment (CVSS scores, severity), and confidence scoring. All KEV entries as of the snapshot generation date.

High precision • Ground truth labels • Low noise

Typical pricing: bundled or $250–$2,000 one-time

Note: KEV entries are included in monthly snapshots based on their status as of snapshot date, regardless of original publication date.

Emerging Signals Feed (Non‑KEV)

Rolling 30-day window of newly published non‑KEV CISA advisories from RSS feeds and web scraping. Excludes KEV entries (which appear only in full snapshots). May be quiet during low-activity periods.

Rolling 30d • Strict validation • Operational

Typical pricing: $50–$500 / month (subscription)

Sources: CISA RSS feeds + web-scraped advisory pages

What you're really buying

Not just data — reproducibility, trust, and time savings. Each snapshot is a release artifact you can reference forever. That enables objective model comparisons, audit trails, and defensible decisions.

Key Differentiators

  • Processed & Ready: Normalized, enriched, and scored—not raw feeds requiring hours of preprocessing
  • Immutable: Snapshots never change, enabling reproducible ML experiments and audit trails
  • Explainable: Confidence scores include explainable factors (KEV presence, CVE count, NVD enrichment)
  • Documented: HTML reports, interactive visualizations, and complete schema documentation included
  • Sample Included: Every snapshot includes a sample folder (25-50 records) for evaluation before purchase
  • Authoritative Sources: Signals come only from CISA government sources, with NVD used solely for enrichment; no noise, no speculation, no padding

How it works

  • Ingest: CISA KEV JSON feed, CISA RSS advisories, and web-scraped advisory pages. All from authoritative government sources.
  • Normalize: CVE extraction, HTML cleaning, deduplication by CVE overlap, canonical field names, stable schemas.
  • Enrich: NVD API integration (cached) adds CVSS scores, severity ratings, CWE classifications, and detailed vulnerability metadata.
  • Score: Explainable confidence scoring (v2) with factors like KEV presence, CVE count, NVD data availability. Confidence = trust/reliability, not risk.
  • Release: Monthly immutable snapshots with JSONL, Parquet, JSON, HTML reports, interactive visualizations, manifest, and sample folder.
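
As an illustration of the Normalize step, the sketch below extracts CVE IDs from free text and drops records that share a CVE with one already kept. It is a minimal approximation, not the production pipeline: the record shape (`id`, `cves`), the function names, and the "first record wins" rule are all assumptions made for the example.

```python
import re

# CVE IDs look like CVE-<year>-<4+ digits>; match case-insensitively.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cves(text):
    """Return the sorted, upper-cased, de-duplicated CVE IDs found in text."""
    return sorted({match.upper() for match in CVE_PATTERN.findall(text)})


def dedupe_by_cve_overlap(records):
    """Keep the first record for any CVE; drop later records that overlap.

    Hypothetical rule: a record is a duplicate if it shares at least one
    CVE ID with a record that was already kept. Records with no CVEs are
    always kept, since there is nothing to compare them on.
    """
    kept = []
    seen = set()
    for record in records:
        cves = set(record["cves"])
        if cves and cves & seen:
            continue  # overlaps an earlier record
        seen |= cves
        kept.append(record)
    return kept
```

A real pipeline would likely merge overlapping records rather than discard the later one; the greedy version is shown only because it is easy to reason about.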

Confidence scoring (what 100% means)

100% confidence indicates confirmed exploitation (KEV-backed ground truth). It does not mean highest impact, prevalence, or likelihood in a specific environment.

Confidence = "Is this real?"
Severity (CVSS) = "How bad could it be?"
Risk = "How likely are we impacted?"
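
To make the "explainable factors" idea concrete, here is a hedged sketch of a scorer that returns both a score and the factors behind it. Only the rule that KEV presence means 100% comes from the description above; the field names (`in_kev`, `cves`, `nvd`) and the point values for non-KEV records are invented for illustration and are not the actual v2 scoring weights.

```python
def score_confidence(record):
    """Return (score_percent, factors) for a record.

    KEV-backed records score 100, as described above. All other
    weights here are hypothetical placeholders.
    """
    if record.get("in_kev"):
        return 100, {"kev_presence": 100}

    factors = {
        "baseline": 20,                                   # came from a CISA source
        "cve_count": 30 if record.get("cves") else 0,     # at least one CVE extracted
        "nvd_enriched": 30 if record.get("nvd") else 0,   # NVD metadata available
    }
    return sum(factors.values()), factors


score, why = score_confidence({"cves": ["CVE-2024-1234"], "nvd": {"cvss": 9.8}})
print(score, why)
```

Note that the CVSS value is carried along but does not feed the score, which mirrors the distinction between confidence and severity.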

Snapshot package format

Each snapshot is delivered as a versioned folder with immutable artifacts:

snapshots/YYYY_MM/
  cisa_snapshot_combined.jsonl
  cisa_snapshot_combined.parquet
  cisa_snapshot_combined.json
  cisa_snapshot_combined.html
  cisa_snapshot_combined_visualizations.html
  manifest.json
  sample/
    sample_snapshot_YYYY_MM.jsonl
    sample_snapshot_YYYY_MM.parquet
    sample_snapshot_YYYY_MM.json
    sample_snapshot_YYYY_MM.html
    sample_snapshot_YYYY_MM_visualizations.html
    sample_manifest.json
    manifest.json (full)
    README_SAMPLE.md

The manifest includes record counts, coverage window, KEV status timestamp, and generation metadata. HTML reports provide interactive exploration. The sample folder contains 25-50 representative records for evaluation.
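
A buyer might sanity-check a delivered snapshot by comparing the manifest's record count with the JSONL file. The sketch below assumes the manifest stores that count under a key named `record_count`; the real key name is not stated here, so it is exposed as a parameter. File names follow the package layout above.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def check_record_count(snapshot_dir, count_key="record_count"):
    """Return (matches, actual_count) for a snapshot folder.

    `count_key` is an assumed manifest field name; adjust it to
    whatever the delivered manifest.json actually uses.
    """
    snapshot_dir = Path(snapshot_dir)
    manifest = json.loads(
        (snapshot_dir / "manifest.json").read_text(encoding="utf-8")
    )
    records = load_jsonl(snapshot_dir / "cisa_snapshot_combined.jsonl")
    return manifest[count_key] == len(records), len(records)
```

Usage would look like `ok, n = check_record_count("snapshots/2025_01")`, where the folder name stands in for a real `YYYY_MM` release.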

Delivery & access

  • Direct: private S3 (signed download links) or secure file share.
  • Enterprise option: customer-managed bucket delivery, or scheduled drops.
  • License: internal use, no redistribution (see license page).
  • Evaluation: Each snapshot includes a sample folder with 25-50 representative records for inspection before purchase.

What you get: normalized, enriched, scored, and documented datasets, with HTML reports and visualizations, ready for immediate use rather than raw data that still needs preprocessing.

FAQ

Why can the emerging signals feed be empty?

Because we don't pad the feed with UI pages, social links, or speculative content. Quiet periods are normal for authoritative sources like CISA and are a sign of signal integrity. We prioritize quality over quantity.

What's the difference between this and free CISA feeds?

Free CISA feeds are raw, unprocessed, and constantly changing. Our snapshots are normalized (consistent schema, cleaned HTML, extracted CVEs), enriched (NVD CVSS scores, severity ratings), scored (explainable confidence factors), immutable (never change after release), and documented (HTML reports, visualizations, manifests). You're buying processed, reproducible data ready for ML/AI pipelines, not raw feeds.

Is one snapshot "enough data" for AI?

Snapshots aren't for pretraining from scratch. They're for supervision, evaluation, continual updates, and audit-grade regression testing. Snapshots compound into a trusted corpus over time.

Do snapshots change after release?

No. Snapshots are immutable. If corrections are needed, a new version (e.g., v2) may be issued while preserving prior versions for reproducibility.
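
Teams that want to prove on their own side that the artifact they evaluated against is unchanged could record a checksum at download time and compare it later. This is a generic sketch using SHA-256; whether the manifest itself carries checksums is not stated here, so the example assumes the buyer stores the digest themselves.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Compare a file's current digest with one recorded earlier."""
    return sha256_of_file(path) == recorded_digest.lower()
```

Storing the digest next to each experiment's results gives an audit trail that ties a model score to one exact snapshot file.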

Can we share the dataset with our customers?

Typically no. The license is internal-use and prohibits redistribution of raw records. Sharing derived outputs (models, analyses, reports) is allowed if it does not enable reconstruction of the dataset. For redistribution rights, contact us for custom licensing terms.

What's included in the HTML reports and visualizations?

Each snapshot includes interactive HTML reports with searchable tables, charts showing source distribution, severity breakdown, CVSS scores, confidence scores, timeline analysis, and complete schema documentation. The visualizations help you understand the dataset structure and data quality before writing code.

How are snapshots generated?

Snapshots are generated monthly, typically within the first few days after the month ends to ensure completeness. Each snapshot includes all KEV entries as of the generation date (regardless of when they were originally published) plus all non-KEV signals published during that calendar month. The snapshot is then frozen and never modified.
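
The inclusion rule described above can be sketched as a small predicate. The field names (`is_kev`, `date_added`, `published`) are assumptions for illustration and may not match the delivered schema.

```python
from datetime import date


def in_snapshot(record, year, month, generated_on):
    """Decide whether a record belongs in the snapshot for (year, month).

    KEV entries: included if they were in the catalog by the generation
    date, regardless of original publication date.
    Non-KEV signals: included only if published in that calendar month.
    """
    if record["is_kev"]:
        return record["date_added"] <= generated_on
    published = record["published"]
    return (published.year, published.month) == (year, month)


# Example: the January 2025 snapshot, generated on 3 February 2025.
old_kev = {"is_kev": True, "date_added": date(2021, 11, 3)}
late_signal = {"is_kev": False, "published": date(2025, 2, 1)}
print(in_snapshot(old_kev, 2025, 1, date(2025, 2, 3)))      # True
print(in_snapshot(late_signal, 2025, 1, date(2025, 2, 3)))  # False
```

The asymmetry is deliberate: KEV membership is a status as of a point in time, while non-KEV signals are bounded by the coverage window.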

Request access / pricing

Tell us what you're building and we'll recommend the right SKU (snapshot vs feed) and provide a sample.

Email: awarenesssoftwaregroup@gmail.com

Preferred info to include: use case (ML eval, detection engineering, vulnerability management), desired cadence (monthly snapshots vs. rolling feed), team size, and whether you need Parquet format.