Lecture 6 – Infrastructure-as-Code Security: Scanning Your Cloud Before It Burns


Slide 1 – The S3 Bucket That Cost $190M

  • March 22-23, 2019: a former AWS employee exploits a misconfigured WAF in Capital One's infrastructure to mount an SSRF attack against the EC2 metadata service; Capital One confirms the intrusion on July 19, 2019
  • The IAM role attached to the WAF has wildcard s3:Get* and s3:List* permissions across 700+ buckets
  • Attacker exfiltrates 106 million records: names, addresses, credit scores, 140,000 Social Security numbers
  • Class-action settlement: ~$190 million, on top of an $80 million regulatory penalty
  • The vulnerable WAF, the over-privileged IAM role, and the exposed metadata endpoint were all declared in Terraform and never scanned
  • By 2020 Capital One had Checkov in their pipeline. One year and $190M too late

Think: Lecture 5 covered SAST scanning your application code. What scans your infrastructure code before terraform apply lights it on fire?


Slide 2 – Learning Outcomes

| # | Outcome |
|---|---|
| 1 | Define Infrastructure-as-Code and explain why it created a new class of vulnerability |
| 2 | Recognize the top IaC misconfiguration categories (CIS / NIST mapping) |
| 3 | Run Checkov against Terraform + Pulumi and read its JSON output |
| 4 | Run KICS against an Ansible playbook and triage the Rego-based findings |
| 5 | Explain why tfsec was retired and where IaC scanning lives in Trivy today |
| 6 | Write a custom Checkov policy in YAML for a project-specific rule |

Slide 3 – Where Lecture 6 Sits

graph LR
    L4["L4 CI/CD<br/>Pipelines"] --> L6
    L5["L5 SAST/DAST<br/>App code"] --> L6
    L6["L6 IaC scan<br/>Infra code (here)"] --> L7["L7 Container<br/>Image scan"]
    L6 -.feeds.-> L9["L9 Policy-as-Code<br/>at admission"]

    style L6 fill:#FF9800,color:#fff

  • Building on L4 (CI/CD): IaC scanning runs as a pipeline stage, with the same gates and new file types
  • Building on L5 (SAST): same idea (analyze static text), but the language is HCL/YAML/Python-Pulumi and the bugs are misconfigurations, not memory corruption
  • Setting up L9: Conftest/Rego in Lecture 9 reuses the policy-as-code idea you meet here

Slide 4 – What Is Infrastructure-as-Code?

"Treat infrastructure the same way you treat application code: version it, test it, review it, deploy it from a pipeline." - Kief Morris, Infrastructure as Code (O'Reilly, 2nd ed., 2020)

| Tool | Language | Model | Origin |
|---|---|---|---|
| Terraform / OpenTofu | HCL | Declarative, state-driven | HashiCorp 2014; OpenTofu fork September 2023 |
| Pulumi | Python/TS/Go/.NET | Declarative via real code | Joe Duffy + team, 2017 |
| Ansible | YAML | Imperative push (SSH) | Michael DeHaan, 2012; acquired by Red Hat 2015 |
| CloudFormation | YAML/JSON | AWS-native declarative | AWS, 2011 |
| Helm | Templated YAML | K8s package manager | 2015 (Deis); CNCF Graduated 2020 |

  • All five are static text files. Every one of them can be scanned before it ships.

Slide 5 – Why IaC Creates a New Bug Class

flowchart LR
    Dev[Developer] -->|git push| Repo[Git repo]
    Repo -->|terraform apply| Cloud[Cloud provider]
    Cloud -->|provisions| Resource[S3 bucket, IAM role, SG...]

    DevMistake[Typo: '0.0.0.0/0'] -.-> Repo
    Resource -.-> Internet[World-readable]

    style DevMistake fill:#F44336,color:#fff
    style Internet fill:#F44336,color:#fff

  • Mistakes that used to be one engineer × one resource are now one git push × N replicas
  • IBM's Cost of a Data Breach reports consistently list cloud misconfiguration among the most common initial attack vectors, next to stolen credentials and phishing
  • The whole point of IaC scanning: catch the typo before terraform apply does it to 200 buckets

Think: Lecture 5's SAST checks application code. IaC scanning checks infrastructure code. Same shift-left philosophy; different file type.


Slide 6 – The Misconfiguration Top Hits

These are the categories every scanner ships rules for. Memorize them; they make the news.

| Category | Typical mistake | Mitigation |
|---|---|---|
| Public network exposure | cidr_blocks = ["0.0.0.0/0"] on SSH/RDP | Restrict CIDR or use bastion/SSM |
| Hard-coded secrets | password = "admin123" in HCL | Vault / cloud secret manager (links to L3) |
| Public storage | S3 bucket without an aws_s3_bucket_public_access_block | Default-deny ACL + bucket policy |
| Over-privileged IAM | "Action": "*" / "Resource": "*" | Least privilege + permission boundaries |
| Unencrypted at rest | EBS without encrypted = true, RDS without storage_encrypted = true, S3 without server-side encryption | Encrypt-by-default + customer-managed keys |
| No logging | CloudTrail/VPC flow logs disabled | Centralized log destination + retention SLA |
| Cross-account trust | Principal = "*" in resource policy | Specific account IDs only |
| Old TLS | min_tls_version = "1.0" | TLS 1.2+ enforced |

  • These map directly to the CIS Benchmarks (Center for Internet Security) and NIST 800-53 controls; scanners ship them under stable rule IDs, such as Checkov's CKV_AWS_19
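
To make two of the rows above concrete, here is a minimal Python sketch of what a scanner rule does under the hood: it walks parsed resources and flags risky attribute values. The resource dictionaries, the DEMO_1 and DEMO_2 rule IDs, and the function names are all invented for this illustration; real scanners parse HCL into a much richer model.

```python
# Toy illustration only: resource shapes and DEMO_* rule IDs are invented for this sketch.
SENSITIVE_PORTS = {22, 3389}  # SSH and RDP


def open_admin_ingress(resource):
    """Flag security-group rules that expose SSH/RDP to the whole internet."""
    for rule in resource.get("ingress", []):
        ports = range(rule["from_port"], rule["to_port"] + 1)
        world_open = "0.0.0.0/0" in rule.get("cidr_blocks", [])
        if world_open and any(p in ports for p in SENSITIVE_PORTS):
            return True
    return False


def wildcard_iam(resource):
    """Flag IAM statements that allow every action on every resource."""
    for stmt in resource.get("statements", []):
        if (stmt.get("Effect") == "Allow"
                and stmt.get("Action") == "*"
                and stmt.get("Resource") == "*"):
            return True
    return False


# rule ID -> (resource type it applies to, check function)
RULES = {
    "DEMO_1": ("aws_security_group", open_admin_ingress),
    "DEMO_2": ("aws_iam_policy", wildcard_iam),
}


def scan(resources):
    """Return (rule_id, resource_name) for every rule that fires."""
    findings = []
    for res in resources:
        for rule_id, (rtype, check) in RULES.items():
            if res["type"] == rtype and check(res):
                findings.append((rule_id, res["name"]))
    return findings


resources = [
    {"type": "aws_security_group", "name": "bastion",
     "ingress": [{"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}]},
    {"type": "aws_security_group", "name": "web",
     "ingress": [{"from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]}]},
    {"type": "aws_iam_policy", "name": "admin",
     "statements": [{"Effect": "Allow", "Action": "*", "Resource": "*"}]},
]

for rule_id, name in scan(resources):
    print(f"{rule_id} FAILED for {name}")
```

Note that the public HTTPS rule on "web" passes: the rule is about admin ports, not about 0.0.0.0/0 as such. Most tuning work in real rule sets is about that kind of scoping.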

Slide 7 – The Scanner Field Today

graph TB
    subgraph "Active in 2026"
        CK[Checkov<br/>Bridgecrew/Palo Alto<br/>3.x, 2,500+ rules]
        KI[KICS<br/>Checkmarx<br/>2,400+ Rego queries]
        TV[Trivy IaC mode<br/>Aqua<br/>via 'trivy config']
        TR[Terrascan<br/>Tenable]
    end
    subgraph "Retired"
        TS[tfsec<br/>archived, merged into Trivy<br/>Feb 2023]
    end

    style CK fill:#4CAF50,color:#fff
    style KI fill:#4CAF50,color:#fff
    style TV fill:#4CAF50,color:#fff
    style TS fill:#9E9E9E,color:#fff

  • tfsec is dead. Aqua consolidated tfsec into Trivy in February 2023. The last release, v1.28.14, was a dependency-CVE fix only. New scans go through trivy config <path>: same rule heritage, broader format coverage
  • This course pins Checkov 3.x for Task 1 (Terraform + Pulumi) and KICS for Task 2 (Ansible): both are free and open source, and together they represent the two dominant rule-language families

Slide 8 – Checkov in 5 Minutes

  • Built by Bridgecrew (acquired by Palo Alto Networks, March 2021); open-sourced 2019
  • Written in Python (pip install checkov); ships rules in YAML + Python
  • Latest major: Checkov 3.x (2026), with 2,500+ built-in policies and 800+ graph-based checks
  • Scans: Terraform, OpenTofu, CloudFormation, Kubernetes, Helm, Dockerfile, GitHub Actions, ARM, Bicep, OpenAPI, Pulumi, Ansible (basic)

# Quick start used in the lab
pip install checkov
# One report file per requested format is written into the ./results folder
checkov -d ./terraform/ --output cli --output json --output-file-path results

| Flag | Meaning |
|---|---|
| --output cli | Human-readable, colored summary |
| --output json | Machine-readable, importable to DefectDojo (L10) |
| --output sarif | GitHub Code Scanning format |
| --skip-check CKV_AWS_19 | Skip a rule (justify in PR description) |

  • Each finding ships with fix guidance: Checkov is one of the few scanners that points you at a remediation line, not just a problem
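
Reading the JSON report is mostly a matter of walking one nested structure. The sketch below assumes the layout Checkov's JSON reports commonly use (a results object holding failed_checks, each with check_id, resource, file_path and file_line_range); the sample report is invented, and the field names should be verified against a report produced by your own Checkov version.

```python
import json

# Invented sample that mimics the layout of a Checkov JSON report; verify the
# field names against a report produced by your own Checkov version.
SAMPLE_REPORT = """
{
  "check_type": "terraform",
  "results": {
    "failed_checks": [
      {"check_id": "CKV_AWS_18",
       "check_name": "Ensure the S3 bucket has access logging enabled",
       "resource": "aws_s3_bucket.user_uploads",
       "file_path": "/modules/storage/main.tf",
       "file_line_range": [14, 22]}
    ],
    "passed_checks": [],
    "skipped_checks": []
  },
  "summary": {"passed": 0, "failed": 1, "skipped": 0}
}
"""


def load_reports(text):
    """Accept either a single report object or a list of them (one per framework)."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def failed_checks(reports):
    """Yield one flat dict per failed check across all reports."""
    for report in reports:
        for check in report.get("results", {}).get("failed_checks", []):
            start, end = check["file_line_range"]
            yield {
                "rule": check["check_id"],
                "resource": check["resource"],
                "location": f'{check["file_path"]}:{start}-{end}',
            }


findings = list(failed_checks(load_reports(SAMPLE_REPORT)))
for f in findings:
    print(f'{f["rule"]}  {f["resource"]}  {f["location"]}')
```

Flattening to rule, resource and location first keeps the later triage steps independent of the report format.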

Slide 9 – A Checkov Finding Read Aloud

Check: CKV_AWS_18: "Ensure the S3 bucket has access logging enabled"
        FAILED for resource: aws_s3_bucket.user_uploads
        File: /modules/storage/main.tf:14-22
        Guide: https://docs.bridgecrew.io/docs/s3_13-enable-logging

| Element | Meaning |
|---|---|
| CKV_AWS_18 | Stable rule ID; use it for suppress lists |
| FAILED | One of PASSED / FAILED / SKIPPED |
| Resource | The HCL block that triggered |
| File + line | Exact remediation location |
| Guide | Bridgecrew's narrative explanation |

  • Critical reading skill: when reviewing a Checkov report, sort by rule ID frequency first. One missing default in a module replicates as 30 findings; fix the module, fix all 30
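
The "sort by rule ID frequency" habit can be sketched in a few lines of Python. The findings below are invented, and the (rule, resource, file) tuple shape is just a convenient assumption for the example.

```python
from collections import Counter

# Invented findings in a (rule, resource, file) shape.
findings = [
    ("CKV_AWS_18", "module.storage.aws_s3_bucket.logs", "/modules/storage/main.tf"),
    ("CKV_AWS_18", "module.storage.aws_s3_bucket.uploads", "/modules/storage/main.tf"),
    ("CKV_AWS_18", "module.storage.aws_s3_bucket.exports", "/modules/storage/main.tf"),
    ("CKV_AWS_19", "aws_s3_bucket.legacy", "/main.tf"),
]

# Count how often each (rule, file) pair fails: a high count on one module file
# usually means one missing default, not many independent mistakes.
by_rule_and_file = Counter((rule, path) for rule, _resource, path in findings)

for (rule, path), count in by_rule_and_file.most_common():
    print(f"{count:>3}  {rule}  {path}")
```

The top row of that listing is where to start: one change in /modules/storage/main.tf would clear three of the four findings here.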

Slide 10 – KICS for Multi-Language Estates

  • Built by Checkmarx, open-sourced November 2020; written in Go
  • Rules in Rego (same language as OPA, so directly relevant to Lecture 9)
  • Latest stable: 2.x (last release March 2025), with 2,400+ Rego queries
  • Scans: Terraform, K8s, Ansible, Docker/Compose, CloudFormation, OpenAPI, Helm, Bicep, Pulumi, Crossplane, GitHub Workflows, gRPC

# Used for Task 2 (Ansible) in the lab
docker run -v "$PWD:/path" checkmarx/kics:latest \
  scan -p /path/ansible/ -o /path/results --report-formats json,sarif

  • Checkov vs KICS: when to use which?
    • Checkov has deeper Terraform-specific checks (graph relationships across resources)
    • KICS has wider language coverage (Ansible, Helm templates, OpenAPI) and a uniform Rego rule format: once you can write a custom rule for one input type, the same format carries over to the others
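
For Task 2 triage, a short Python sketch can flatten a KICS JSON report into rows sorted worst-first. The sample below is invented and assumes the report exposes a queries list whose entries carry severity, query_name and a files list with file_name and line; check those names against your own results file before relying on them.

```python
import json

# Invented sample shaped like a KICS JSON report; check names against your own report.
SAMPLE = """
{
  "queries": [
    {"query_name": "Example high-severity query", "severity": "HIGH", "platform": "Ansible",
     "files": [{"file_name": "ansible/deploy.yml", "line": 12},
               {"file_name": "ansible/deploy.yml", "line": 40}]},
    {"query_name": "Example low-severity query", "severity": "LOW", "platform": "Ansible",
     "files": [{"file_name": "ansible/site.yml", "line": 3}]}
  ]
}
"""

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW", "INFO"]
RANK = {name: i for i, name in enumerate(SEVERITY_ORDER)}


def triage(report):
    """Flatten queries into (severity, query, file, line) rows, worst first."""
    rows = [
        (q["severity"], q["query_name"], f["file_name"], f["line"])
        for q in report.get("queries", [])
        for f in q.get("files", [])
    ]
    # Unknown severities sort last instead of raising.
    return sorted(rows, key=lambda r: (RANK.get(r[0], len(RANK)), r[2], r[3]))


rows = triage(json.loads(SAMPLE))
for severity, query, path, line in rows:
    print(f"{severity:<8} {path}:{line}  {query}")
```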

Slide 11 – Policy-as-Code: A First Look

flowchart LR
    R[Policy in Rego/YAML/Python] --> S[Scanner]
    F[Your IaC file] --> S
    S --> O[Allowed / Denied + reason]

    style R fill:#9C27B0,color:#fff
    style S fill:#FF9800,color:#fff

  • Policy-as-Code = your security rules live in version control, are reviewed in PRs, and execute deterministically in CI
  • Both Checkov and KICS implement this; the same idea powers Conftest (Lecture 9) and Gatekeeper (admission control)
  • Bonus task in Lab 6 asks you to write a Checkov custom policy, your first Policy-as-Code rule
  • This lecture introduces PaC for IaC; Lecture 9 expands it to runtime admission control. Don't try to do both in your head
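
The flow in the diagram (policy plus IaC file in, verdict plus reason out) can be sketched as a tiny evaluator. The policy format, the example_load_balancer resource type and the DEMO_TLS_1 ID are hypothetical; this is neither Checkov YAML nor Rego, only an illustration of rules stored as data and evaluated deterministically.

```python
# Hypothetical policy format for illustration; this is neither Checkov YAML nor Rego.
POLICY = {
    "id": "DEMO_TLS_1",
    "resource_type": "example_load_balancer",
    "attribute": "min_tls_version",
    "operator": "in",
    "value": ["1.2", "1.3"],
    "reason": "TLS 1.2+ must be enforced",
}

OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "in": lambda actual, expected: actual in expected,
}


def evaluate(policy, resource):
    """Return (allowed, reason). Resources of other types are out of scope and allowed."""
    if resource["type"] != policy["resource_type"]:
        return True, "not applicable"
    actual = resource.get("config", {}).get(policy["attribute"])
    ok = OPERATORS[policy["operator"]](actual, policy["value"])
    if ok:
        return True, "ok"
    return False, f'{policy["id"]}: {policy["reason"]} (got {actual!r})'


compliant = {"type": "example_load_balancer", "name": "edge",
             "config": {"min_tls_version": "1.2"}}
legacy = {"type": "example_load_balancer", "name": "old",
          "config": {"min_tls_version": "1.0"}}

for res in (compliant, legacy):
    allowed, reason = evaluate(POLICY, res)
    print(res["name"], "ALLOWED" if allowed else "DENIED", reason)
```

Because the policy is plain data, it can be diffed and reviewed in a PR like any other change, which is the point of the first bullet above.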

Slide 12 – A Custom Checkov Policy in YAML

metadata:
  id: "CKV2_CUSTOM_1"
  name: "Ensure S3 buckets have lifecycle policy"
  category: "BACKUP_AND_RECOVERY"
  severity: "MEDIUM"
definition:
  and:
    - cond_type: "filter"
      attribute: "resource_type"
      value: ["aws_s3_bucket"]
      operator: "within"
    - cond_type: "connection"
      # resource_types is the resource being checked; connected_resource_types is what must reference it
      resource_types: ["aws_s3_bucket"]
      connected_resource_types: ["aws_s3_bucket_lifecycle_configuration"]
      operator: "exists"

| Section | What it does |
|---|---|
| metadata.id | CKV2_* prefix for graph (cross-resource) rules; CKV_* for single-resource |
| category | Used for Checkov's default policy groups |
| definition.and | All conditions must hold; supports or, not |
| cond_type: connection | "Is this resource referenced by another?" (the graph engine) |

  • This is exactly the bonus task in the lab. Read it twice; we'll write one together in office hours
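
What a "connection exists" condition computes can be sketched without Checkov at all. The toy model below is an assumption-laden illustration, not Checkov's graph engine: each resource carries an invented references list, and a bucket passes only if some lifecycle configuration points at it.

```python
# Toy model of a "connection exists" condition; resource names and the
# references field are invented for this sketch.
resources = [
    {"type": "aws_s3_bucket", "name": "logs", "references": []},
    {"type": "aws_s3_bucket", "name": "uploads", "references": []},
    {"type": "aws_s3_bucket_lifecycle_configuration", "name": "logs_lifecycle",
     "references": ["aws_s3_bucket.logs"]},
]


def address(resource):
    """Terraform-style address, e.g. aws_s3_bucket.logs."""
    return f'{resource["type"]}.{resource["name"]}'


def buckets_without_lifecycle(resources):
    """Buckets that no lifecycle configuration points at, i.e. where the policy fails."""
    covered = {
        ref
        for r in resources
        if r["type"] == "aws_s3_bucket_lifecycle_configuration"
        for ref in r["references"]
    }
    return [
        address(r) for r in resources
        if r["type"] == "aws_s3_bucket" and address(r) not in covered
    ]


print(buckets_without_lifecycle(resources))
```

A single-resource rule could never catch this, because nothing inside the uploads bucket block is wrong; the problem is the absence of a second resource.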

Slide 13 – Wiring Scanners Into CI (Building on L4)

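# .github/workflows/iac-scan.yml: extends what you built in L4
name: IaC Scan
on: [pull_request]
permissions:
  contents: read
  security-events: write   # needed by upload-sarif
jobs:
  checkov:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@b4ffde6...
      - uses: bridgecrewio/checkov-action@v12      # pin to a commit SHA in real life
        with:
          directory: terraform/
          framework: terraform
          output_format: sarif
          output_file_path: results/
      - uses: github/codeql-action/upload-sarif@v3
        if: always()       # still upload findings when the Checkov step fails the job
        with:
          # With a folder as output path, Checkov names the file results_sarif.sarif
          sarif_file: results/results_sarif.sarif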

  • Three layers, same pipeline:
    1. PR scan (this job): fail the PR on HIGH+
    2. Nightly full scan on main: catches new rules added to the scanner
    3. Drift detection (Lecture 9 covers this): compares declared state to actual cloud
  • continue-on-error: true is a smell. Recall from Lecture 4: if you can't fail the build, you're not gating, you're decorating

Slide 14 – Where Pulumi Differs

  • Pulumi programs are real code (Python/TS/Go/.NET) that emits a declarative state graph
  • Static analyzers can scan two layers:
    1. The source code (looks like normal Python, so SAST tools see it as Python)
    2. The rendered state (pulumi preview --json), which is what will actually be created
  • Checkov scans the rendered state (pulumi preview JSON), not your TypeScript directly, which is exactly right, because IaC misconfigs live in the resource graph, not the loop that built it
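
To see why the rendered layer is the one to scan, here is a scanner-independent Python sketch that inspects a preview document. The JSON below is invented; the steps, op, urn, newState and inputs fields are assumptions about what pulumi preview --json emits and may differ by Pulumi version. The list comprehension stands in for a for-loop in a Pulumi program.

```python
import json

# Invented excerpt shaped like `pulumi preview --json` output; treat "steps",
# "newState" and "inputs" as assumptions and compare with your own preview.
PREVIEW = json.dumps({
    "steps": [
        {"op": "create",
         "urn": f"urn:pulumi:dev::demo::aws:s3/bucket:Bucket::bucket-{i}",
         "newState": {"type": "aws:s3/bucket:Bucket",
                      "inputs": {"acl": "public-read" if i == 2 else "private"}}}
        for i in range(3)  # stands in for a for-loop in the Pulumi program
    ]
})


def public_buckets(preview_text):
    """Return URNs of buckets the plan would create or change with a public ACL."""
    flagged = []
    for step in json.loads(preview_text).get("steps", []):
        if step.get("op") not in {"create", "update", "replace"}:
            continue
        state = step.get("newState") or {}
        if state.get("type") != "aws:s3/bucket:Bucket":
            continue
        if state.get("inputs", {}).get("acl", "private").startswith("public"):
            flagged.append(step["urn"])
    return flagged


print(public_buckets(PREVIEW))
```

In the source program the public ACL is one conditional inside a loop; in the rendered state it is a concrete bucket with a name, which is what a reviewer needs to see.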

"Pulumi's superpower is that you write infrastructure in your favorite language. Pulumi's superpower is also that you can write a for loop that provisions 500 buckets." - paraphrasing the Pulumi team at KubeCon 2023


Slide 15 – Case Study: Tesla's Exposed Kubernetes Dashboard (2018)

  • February 20, 2018: RedLock researchers report a Kubernetes admin console on Tesla's AWS, internet-exposed, no authentication
  • Attackers had been using it to mine Monero, throttling CPU usage to stay under the radar
  • Root cause: the cluster's dashboard was reachable from the internet without a password, and a pod behind it held AWS credentials. EKS was not generally available until June 2018, so this was a self-managed cluster
  • The IaC analogue today is an EKS cluster declared with endpoint_public_access = true and open to 0.0.0.0/0; Checkov flags that as CKV_AWS_38
  • Tesla's response was fast; the deeper lesson is how easy this is to ship. EKS still enables the public endpoint unless you turn it off, which is why the rule exists

Slide 16 – Case Study: Imperva (2019)

  • August 2019: Imperva discloses a breach of Cloud WAF customer data; its October 2019 postmortem traces it to an October 2018 intrusion
  • During a migration to AWS, a database snapshot is created for testing
  • An internal compute instance left reachable from the internet holds an AWS API key; the attacker steals the key, reads the snapshot, and exfiltrates customer emails + hashed passwords
  • Two families of IaC rules target this chain:
    • Public-exposure checks on instances and security groups (for example CKV_AWS_24, SSH open to 0.0.0.0/0) flag hosts that should not be internet-reachable
    • Hard-coded credential checks (for example CKV_AWS_41, AWS keys in a provider block) flag secrets that belong in a secret manager
  • Imperva is a security company. No one is immune to misconfiguration. This is precisely why the scanner runs in CI, not in someone's head

Slide 17 – Triage: 1,000 Findings on Day One

The first scan of a real codebase will find hundreds of issues. A rule of thumb for rolling out the program (matches Lecture 5's SAST triage):

| Phase | What you do | Timeline |
|---|---|---|
| 0. Baseline | Scan, count by severity, don't fix yet | Day 1 |
| 1. Triage | Sort by rule ID frequency; group by module | Week 1 |
| 2. Module fixes | Fix the top 5 modules → kills 60-80% of findings | Weeks 2-3 |
| 3. Gate | Add Checkov to PR; fail on HIGH+ new findings only | Week 4 |
| 4. Burndown | Suppress existing findings with explicit expiry; track in DefectDojo (L10) | Ongoing |

  • Don't try to fix everything in week one. A blocked CI on day 2 makes the security team an obstacle, not a partner
  • The gate-on-new pattern (also called "diff scanning" or "delta gating") is how mature programs avoid backlog bankruptcy. Same discipline you saw in SAST (Lecture 5)
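
The gate-on-new pattern boils down to a set difference between a stored baseline and the current scan. The sketch below is one possible shape, under stated assumptions: findings are plain dicts, the fingerprint deliberately ignores line numbers, and the severity values are invented for the example (your scanner may need extra configuration to report severities at all).

```python
def fingerprint(finding):
    """Identity of a finding that survives line-number churn (an assumption of this sketch)."""
    return (finding["rule"], finding["resource"], finding["file"])


def new_findings(baseline, current):
    """Findings in the current scan whose fingerprint is absent from the baseline."""
    known = {fingerprint(f) for f in baseline}
    return [f for f in current if fingerprint(f) not in known]


def gate(baseline, current, blocking=("HIGH", "CRITICAL")):
    """Return the exit code a CI step would use: 1 if any new blocking finding exists."""
    blockers = [f for f in new_findings(baseline, current) if f["severity"] in blocking]
    return 1 if blockers else 0


# Invented data: one accepted legacy finding, one finding introduced by the PR.
baseline = [
    {"rule": "CKV_AWS_18", "resource": "aws_s3_bucket.logs",
     "file": "/main.tf", "severity": "LOW"},
]
current = baseline + [
    {"rule": "CKV_AWS_24", "resource": "aws_security_group.bastion",
     "file": "/net.tf", "severity": "HIGH"},
]

print("new:", [f["rule"] for f in new_findings(baseline, current)])
print("exit code:", gate(baseline, current))
```

Keeping line numbers out of the fingerprint is a design choice: otherwise an unrelated edit higher up in the file would make every old finding look new and block the PR.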

Slide 18 – Sharing IaC Across Teams: Modules + Policy

  • Module ownership pattern: platform team ships hardened modules (e.g. s3_secure) that wrap raw providers with safe defaults; application teams consume modules, not raw resources
  • Pre-commit hook (extending L3): run checkov -d . --quiet before commit; same scanner, earlier
  • Drift detection is the topic of L9: a scanner only checks what you declared; cloud changes can still happen out-of-band (root console, the mythical Tuesday 4pm hotfix)

Think: A scanner can tell you your IaC looks safe. Can it tell you your cloud is safe? (Trick: only if it also reads live cloud state, which cloud-posture tools such as Prowler do, each with different trade-offs.)


Slide 19 – What's Next + Lab Preview

  • Lab 6 (this week):
    • Task 1 (6 pts): Checkov on a Terraform + Pulumi sample with planted misconfigs
    • Task 2 (4 pts): KICS on an Ansible playbook; compare ruleset coverage to Checkov
    • Bonus (2 pts): Write a custom Checkov policy, your first PaC rule
  • Lecture 7 (next week): Container & Kubernetes Security. Trivy on the Juice Shop image, Pod Security Standards, baseline K8s hardening. The next layer of the stack
  • You're now scanning code (L5) and infra (L6); next we scan the artifact itself

Slide 20 – Resources & Takeaways

Books:

| Book | Why |
|---|---|
| Terraform: Up & Running, Yevgeniy Brikman (O'Reilly, 3rd ed. 2022) | Ch. 8 "Production-Grade Terraform Code" covers module testing + policy |
| Infrastructure as Code, Kief Morris (O'Reilly, 2nd ed. 2020) | Ch. 7 "Configuration Registries" + ch. 11 "Testing Infrastructure" for broad framing |
| Securing DevOps, Julien Vehent (Manning, 2018) | Ch. 3 "Hardening AWS" maps cloud misconfigs to specific scanner rules |
| Pulumi: Continuous Deployment in the Cloud, Will Boyd (Pulumi, free e-book) | Best free intro to scanning Pulumi state graphs |

Talks & specs:

  • "Securing Infrastructure as Code", Barak Schoster (Bridgecrew/Checkov), Black Hat 2020
  • "From tfsec to Trivy: Consolidating IaC Scanning", Aqua team, KubeCon NA 2023
  • CIS Benchmarks: the source of most rules
  • Checkov rule index: every CKV_AWS_* with description
  • KICS query catalogue: all 2,400+ Rego queries

Takeaways:

| # | Insight |
|---|---|
| 1 | IaC turned single-host typos into 200-replica disasters. Scanning is the cheapest insurance. |
| 2 | Misconfiguration is a leading cloud breach cause, and among the most automatable to prevent. |
| 3 | Use Checkov for Terraform-heavy estates; KICS for multi-language. Trivy now covers the tfsec heritage. |
| 4 | Fix at the module level, not at the resource level: one bug fix can close 30 findings. |
| 5 | Day-one full scan is a learning exercise. Gate on new is the operational pattern. |
| 6 | Custom policies turn your team's tribal knowledge into a CI-enforced rule. Write the bonus-task policy seriously; it's how programs scale. |

"The cloud is just someone else's computer, and now you're declaring it as text. Read your declarations before AWS does." - paraphrased from too many KubeCon hallway tracks to count