S3 — Simple Storage Service

What Is It?

S3 is AWS's object store with virtually unlimited capacity. Think of it as a giant key-value store where the key is the "path" (e.g., images/user123/avatar.jpg) and the value is the file bytes. It is object storage, not a file system: the slashes in a key are just characters, and there are no real directories.


Core Concepts

Storage Classes — Choose Based on Access Pattern

| Class | Use Case | Retrieval | Min Duration | Cost |
| --- | --- | --- | --- | --- |
| Standard | Frequently accessed data | Instant | None | Highest |
| Standard-IA | Infrequent access, rapid retrieval | Instant | 30 days | Lower storage |
| One Zone-IA | IA data, tolerate AZ loss | Instant | 30 days | Cheaper |
| Intelligent-Tiering | Unknown access pattern | Instant | None | Per-monitoring fee |
| Glacier Instant | Archives, once a quarter | Instant | 90 days | Very low |
| Glacier Flexible | Archives, 1-5 min to 12 hrs retrieval | Minutes-hours | 90 days | Lower |
| Glacier Deep Archive | 7-10 year compliance archives | 12-48 hours | 180 days | Cheapest |

Real-World: A media company stores raw video uploads in Standard, transcoded copies in Standard-IA after 30 days, and original masters in Glacier Deep Archive after 1 year. S3 Lifecycle policies automate this.


Lifecycle Policies — Automate Tiering

<LifecycleConfiguration>
  <Rule>
    <ID>ArchiveOldLogs</ID>
    <Status>Enabled</Status>
    <Filter><Prefix>logs/</Prefix></Filter>
    <Transition>
      <Days>30</Days>
      <StorageClass>STANDARD_IA</StorageClass>
    </Transition>
    <Transition>
      <Days>90</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <Expiration>
      <Days>365</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
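
The same rule can also be applied from Python. This is a minimal sketch, assuming boto3's `put_bucket_lifecycle_configuration` call; the helper names `build_archive_rule` and `apply_lifecycle` are illustrative and not part of any library. The client is passed in so the rule can be inspected without touching AWS.

```python
def build_archive_rule(prefix="logs/"):
    """Return the ArchiveOldLogs rule as a dict in the shape boto3 expects."""
    return {
        "ID": "ArchiveOldLogs",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
    }


def apply_lifecycle(s3_client, bucket, rules):
    """Replace the bucket's entire lifecycle configuration with `rules`."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )
```

Usage would look like `apply_lifecycle(boto3.client("s3"), "my-log-bucket", [build_archive_rule()])`, where `my-log-bucket` is a placeholder. The call overwrites any existing lifecycle rules on the bucket, so include every rule you want to keep.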

Versioning

With versioning enabled, every PUT creates a new version. DELETE only adds a "delete marker" on top; the earlier versions remain and can be restored.

Real-World: A document management system uses S3 versioning so users can restore previous versions of contracts.

bucket/contract.pdf  → version 1 (original)
bucket/contract.pdf  → version 2 (modified)  ← current
bucket/contract.pdf  → delete marker          ← "deleted" but v1, v2 still there

MFA Delete: Requires MFA to permanently delete versions. Use for compliance buckets.
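
Restoring a "deleted" object means removing its delete marker. The sketch below assumes boto3's `list_object_versions` and `delete_object` calls; the helper name `undelete` is hypothetical, and pagination of the version listing is ignored for brevity.

```python
def undelete(s3_client, bucket, key):
    """Remove the current delete marker for `key`, if there is one.

    Returns the version ID of the removed marker, or None when the
    object is not currently hidden behind a delete marker.
    """
    resp = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in resp.get("DeleteMarkers", []):
        # Prefix matching can return other keys, so compare exactly.
        if marker["Key"] == key and marker["IsLatest"]:
            # Deleting a specific version ID removes the marker itself,
            # which makes the previous version current again.
            s3_client.delete_object(
                Bucket=bucket, Key=key, VersionId=marker["VersionId"]
            )
            return marker["VersionId"]
    return None
```

Because the delete targets a specific version ID, it is a permanent delete of the marker, which is the kind of operation MFA Delete would gate.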


Encryption

Four Encryption Options

| Method | Key Management | Performance | Use Case |
| --- | --- | --- | --- |
| SSE-S3 | AWS manages everything | Fast | Default, no compliance requirements |
| SSE-KMS | KMS manages keys, you control | API call per encrypt/decrypt | Audit trail needed, HIPAA/PCI |
| SSE-C | YOU provide key per request | Client sends key in header | Regulatory: you must hold key |
| Client-Side | You encrypt before upload | Client CPU overhead | Maximum control |

Enforcing Encryption via Bucket Policy

{
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:::my-bucket/*",
  "Condition": {
    "StringNotEquals": {
      "s3:x-amz-server-side-encryption": "aws:kms"
    }
  }
}

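Any PutObject request that does not ask for SSE-KMS is denied, including requests that omit the x-amz-server-side-encryption header entirely. This enforces encryption at the policy level, but it also means clients cannot rely on the bucket's default encryption alone; they must send the header.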
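
On the client side, an upload to such a bucket has to request SSE-KMS explicitly. A small sketch, assuming boto3's `put_object` parameters `ServerSideEncryption` and `SSEKMSKeyId`; the helper name `kms_put_kwargs` is made up for illustration.

```python
def kms_put_kwargs(bucket, key, body, kms_key_id=None):
    """Build put_object arguments that request SSE-KMS encryption."""
    kwargs = {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # Sent as the x-amz-server-side-encryption header.
        "ServerSideEncryption": "aws:kms",
    }
    if kms_key_id is not None:
        # Optional: a specific KMS key; without it S3 uses the
        # AWS managed key for S3 in the account.
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs
```

Usage: `s3.put_object(**kms_put_kwargs("my-bucket", "report.csv", b"..."))`, with `s3` being a boto3 S3 client.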

Enforce HTTPS Only

{
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
  "Condition": {
    "Bool": {"aws:SecureTransport": "false"}
  }
}

Access Control — Who Can Access What?

Decision Flow

Is there an explicit DENY? → YES → DENY (end)
↓ NO
Is it a cross-account request?
  → Bucket policy must allow + IAM must allow
Is it same-account request?
  → Bucket policy allows OR IAM allows (either is enough)
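
The decision flow can be written as a small function. This is a deliberately simplified model of only the three checks listed here; real evaluation also involves things this sketch ignores, such as Block Public Access, ACLs, permission boundaries, and organization-level policies. The function name and parameters are illustrative.

```python
def s3_access_allowed(explicit_deny, cross_account,
                      bucket_policy_allows, iam_allows):
    """Simplified S3 authorization decision for a single request."""
    if explicit_deny:
        # An explicit Deny in any policy always wins.
        return False
    if cross_account:
        # Both sides must agree: the bucket owner and the caller's account.
        return bucket_policy_allows and iam_allows
    # Same account: an Allow from either policy is enough.
    return bucket_policy_allows or iam_allows
```

For example, a same-account role with an IAM allow and no bucket policy statement gets access, while the same setup across accounts does not.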

Bucket Policy vs ACL vs IAM

  • Bucket Policy: Resource-based, JSON, grants access to accounts/users/public
  • ACL: Legacy, grant to canonical user IDs, avoid for new setups
  • IAM Policy: Identity-based, controls what your users/roles can do to S3

Best Practice: Use bucket policies for cross-account and public access. Use IAM policies for your own users.


Presigned URLs

Problem: Your app needs to let users upload a profile picture directly to S3 without going through your server.

Solution: Generate a presigned URL server-side, give it to the client.

import boto3

s3 = boto3.client('s3')

user_id = 'user123'  # example value; use the authenticated user's ID

# Generate presigned upload URL (PUT)
url = s3.generate_presigned_url(
    'put_object',
    Params={
        'Bucket': 'user-uploads',
        'Key': f'profiles/{user_id}/avatar.jpg',
        'ContentType': 'image/jpeg'
    },
    ExpiresIn=3600  # 1 hour
)
# Send URL to client; they PUT directly to S3 and must send
# the same Content-Type header that was signed here

Key Facts:

  • Inherits the permissions of the IAM entity that generated it
  • Max expiry: 7 days with SigV4 (604800 seconds); a URL signed with temporary credentials stops working when those credentials expire
  • Great for: downloads of private files, direct-to-S3 uploads
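
The download case follows the same pattern. A minimal sketch, assuming boto3's `generate_presigned_url` with the `get_object` operation; the helper name `presigned_download_url` and the 15-minute default are choices made for this example.

```python
MAX_SIGV4_EXPIRY = 604800  # 7 days, in seconds


def presigned_download_url(s3_client, bucket, key, expires_in=900):
    """Return a time-limited GET URL for a private object."""
    if not 0 < expires_in <= MAX_SIGV4_EXPIRY:
        raise ValueError(
            f"expires_in must be between 1 and {MAX_SIGV4_EXPIRY} seconds"
        )
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

Keeping the expiry short limits the damage if a URL leaks, since anyone holding the URL can use it until it expires.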

S3 Events

Trigger Lambda/SQS/SNS when objects are created/deleted.

Real-World: Image upload triggers Lambda to resize and create thumbnails.

User uploads image → S3 → S3 Event Notification → Lambda
  → Resize image → store thumbnail back in S3

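Limitation: S3 event notifications are delivered at least once, not exactly once, so a handler can see the same event twice and should be idempotent. Use SQS as a buffer between S3 and the consumer for reliable processing and retries.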

S3 EventBridge integration: For more advanced filtering and routing, send S3 events to EventBridge → route to multiple targets.
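
For the thumbnail flow, the Lambda side might start by pulling bucket and key out of the notification. This sketch assumes the standard S3 notification shape delivered directly to Lambda (`Records[].s3.bucket.name`, `Records[].s3.object.key`, with keys URL-encoded). The `thumbnails/` output prefix is a hypothetical convention; skipping it keeps the function from re-triggering on its own output when thumbnails land in the same bucket.

```python
from urllib.parse import unquote_plus

THUMBNAIL_PREFIX = "thumbnails/"  # hypothetical output prefix


def extract_uploads(event):
    """Return (bucket, key) pairs for newly created source objects."""
    uploads = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # ignore deletes and other event types
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (spaces as '+', etc.).
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(THUMBNAIL_PREFIX):
            continue  # our own output; avoid a recursive trigger
        uploads.append((bucket, key))
    return uploads


def handler(event, context):
    """Lambda entry point; the resize step itself is left out."""
    for bucket, key in extract_uploads(event):
        print(f"would resize s3://{bucket}/{key}")
```

Since the same notification can arrive more than once, writing the thumbnail to a key derived from the source key keeps repeated runs harmless.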


Multipart Upload

Recommended for objects larger than 100 MB; required above 5 GB, which is the limit for a single PUT:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# upload_file switches to multipart automatically above multipart_threshold
config = TransferConfig(multipart_threshold=100 * 1024 * 1024)  # 100 MB
s3.upload_file('large-file.zip', 'bucket', 'large-file.zip', Config=config)

Lifecycle rule to clean up incomplete uploads:

<AbortIncompleteMultipartUpload>
  <DaysAfterInitiation>7</DaysAfterInitiation>
</AbortIncompleteMultipartUpload>
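
When tuning multipart uploads by hand, the part size has to respect two S3 limits: parts are at least 5 MiB (except the last one), and an upload has at most 10,000 parts. The helper below is an illustrative sketch of that arithmetic; the function name and the 8 MiB preferred size are assumptions of this example.

```python
import math

MIN_PART_SIZE = 5 * 1024 ** 2   # 5 MiB minimum for all parts but the last
MAX_PARTS = 10_000              # maximum parts per multipart upload


def choose_part_size(object_size, preferred=8 * 1024 ** 2):
    """Return (part_size, part_count) in bytes for an object of this size."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size when the preferred size would need too many parts.
    part_size = max(preferred, MIN_PART_SIZE,
                    math.ceil(object_size / MAX_PARTS))
    return part_size, math.ceil(object_size / part_size)
```

A 1 GiB object keeps the preferred 8 MiB parts, while a 100 GiB object gets larger parts so the count stays within 10,000. The resulting size could be passed to `TransferConfig` as `multipart_chunksize`.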

S3 Transfer Acceleration

Speeds up long-distance transfers by routing them through the nearest CloudFront edge location and then over the AWS backbone network to the bucket.

Real-World: Users in Australia uploading to a US-East bucket — enable Transfer Acceleration, uploads route through Sydney edge → faster.

URL format: bucket.s3-accelerate.amazonaws.com


CORS Configuration

Problem: Your JavaScript app at app.example.com calls S3 to load images. Browser blocks it with CORS error.

[{
  "AllowedHeaders": ["*"],
  "AllowedMethods": ["GET", "PUT"],
  "AllowedOrigins": ["https://app.example.com"],
  "ExposeHeaders": ["ETag"],
  "MaxAgeSeconds": 3000
}]
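
The same CORS rule can be applied programmatically. A minimal sketch, assuming boto3's `put_bucket_cors` call, which takes the rules wrapped in a `CORSRules` list; the helper name `apply_cors` is illustrative.

```python
def apply_cors(s3_client, bucket, origin):
    """Allow browser GET and PUT requests to the bucket from one origin."""
    rules = [{
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "PUT"],
        "AllowedOrigins": [origin],
        "ExposeHeaders": ["ETag"],
        "MaxAgeSeconds": 3000,
    }]
    s3_client.put_bucket_cors(
        Bucket=bucket,
        CORSConfiguration={"CORSRules": rules},
    )
    return rules
```

Usage: `apply_cors(boto3.client("s3"), "my-bucket", "https://app.example.com")`, where `my-bucket` is a placeholder. Listing the exact origin rather than `*` keeps other sites from reading the bucket through a visitor's browser.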

Replication (CRR & SRR)

| Type | Cross-Region | Same-Region | Use Case |
| --- | --- | --- | --- |
| CRR | Yes | No | Disaster recovery, latency |
| SRR | No | Yes | Log aggregation, compliance copy |

Requirements: Versioning must be enabled on both the source and destination buckets.

Key: Replication is asynchronous — not instant. Existing objects NOT replicated automatically (use S3 Batch Operations).


S3 Object Lock & WORM

For regulatory compliance (SEC Rule 17a-4, FINRA):

  • Governance Mode: Users with special IAM permission can override
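  • Compliance Mode: No one (not even root) can delete/modify before retention expires

from datetime import datetime, timezone

import boto3

s3 = boto3.client('s3')

# The bucket must have Object Lock enabled for this call to succeed
s3.put_object_retention(
    Bucket='compliance-bucket',
    Key='audit-log-2024.gz',
    Retention={
        'Mode': 'COMPLIANCE',
        'RetainUntilDate': datetime(2031, 1, 1, tzinfo=timezone.utc)
    }
)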

Performance Optimization

S3 Request Rate Limits

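  • At least 3,500 PUT/COPY/POST/DELETE requests per second per prefix
  • At least 5,500 GET/HEAD requests per second per prefix

Real-World Problem: Your app was uploading everything under the uploads/ prefix and hitting rate limits (503 Slow Down responses). Fix: Spread keys across prefixes such as uploads/a/, uploads/b/, uploads/c/; the limits apply per prefix, so three prefixes give roughly 3x the throughput.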
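
One way to spread keys is to derive a shard prefix from a hash of the key, so placement is deterministic and roughly even. This is a sketch of that idea only; the function name, the `uploads` base, and the choice of 16 shards are assumptions of this example.

```python
import hashlib


def sharded_key(key, shards=16, base="uploads"):
    """Prepend a stable, hash-derived shard prefix to an object key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % shards
    # Two hex digits keep prefixes a fixed width, e.g. uploads/0a/...
    return f"{base}/{shard:02x}/{key}"
```

The trade-off is that related objects no longer share one prefix, so listing "everything for user123" means computing the shard for each key or keeping an index elsewhere.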

S3 Select

Instead of downloading a 1GB CSV to find 10 rows, query inside S3:

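response = s3.select_object_content(
    Bucket='data-lake',
    Key='sales-2024.csv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object WHERE region = 'us-east'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
    OutputSerialization={'CSV': {}}
)

# The result is an event stream; matching rows arrive in 'Records' events
for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'), end='')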

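Only the matching rows leave S3, which cuts data transfer and the time a Lambda function spends downloading and parsing. Note that AWS closed S3 Select to new customers in July 2024; existing users can continue to use it.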


Good Practices

| Practice | Reason |
| --- | --- |
| Block Public Access at account level | Prevents accidental data exposure |
| Enable S3 Access Logs | Audit trail for compliance |
| Enable versioning on critical buckets | Accidental delete protection |
| Use Lifecycle policies | Automatic cost optimization |
| Use VPC Endpoints for S3 | Traffic stays in AWS network, no NAT gateway costs |
| Enforce SSE-KMS via bucket policy | Compliance + audit trail in CloudTrail |
| Use presigned URLs for user uploads | Never expose AWS credentials to clients |
| Enable MFA Delete on sensitive buckets | Prevent accidental/malicious permanent deletion |

Bad Practices

| Anti-Pattern | Impact | Fix |
| --- | --- | --- |
| "Principal": "*" with no conditions | Public bucket = data breach | Add conditions or use presigned URLs |
| Storing credentials in S3 public bucket | Game over: credentials stolen | Use Secrets Manager |
| Not enabling versioning on app assets | Can't recover from accidental delete | Enable versioning + lifecycle |
| Using ACLs for access control | ACLs are legacy, hard to audit | Use bucket policies + IAM |
| One prefix for all uploads | Rate limit bottleneck | Distribute across multiple prefixes |
| Not cleaning up incomplete multipart uploads | Stealth storage cost | Add lifecycle rule to abort after N days |

Exam Tips

  1. S3 has been strongly consistent since Dec 2020: reads and LISTs reflect the latest write after any PUT (new object or overwrite) or DELETE. Older exam questions may still describe eventual consistency; the answer is "strong consistency" now.
  2. Bucket names are globally unique across all AWS accounts.
  3. S3 does NOT support append operations — you must rewrite the whole object.
  4. ACLs are legacy — exam will push you toward bucket policies.
  5. Cross-account S3 access: BOTH the bucket policy AND the IAM policy must allow it.
  6. Glacier retrieval types: Expedited (1-5 min), Standard (3-5 hours), Bulk (5-12 hours).
  7. Server Access Logging vs CloudTrail: Server Access Logs = per-request records for the bucket, delivered on a best-effort basis. CloudTrail = management events by default + object-level data events (if enabled).
  8. Static website hosting: CORS on S3 is required when browser JS served from a different origin calls the bucket directly.

Common Exam Scenarios

Q: Cheapest way to store backups accessed once a year? → S3 Glacier Deep Archive

Q: S3 objects deleted accidentally, how to protect? → Enable versioning + MFA Delete

Q: Lambda processes S3 uploads but sometimes misses events? → Add SQS between S3 and Lambda as event buffer (Dead Letter Queue on SQS)

Q: How to allow a different AWS account to access your S3 bucket? → Add bucket policy with Principal: {"AWS": "arn:aws:iam::OTHER_ACCOUNT:root"}, and have the other account grant its own users or roles the matching S3 permissions in IAM

Q: Developer wants to enforce all uploads use KMS encryption? → Bucket policy with Deny if s3:x-amz-server-side-encryption != aws:kms