AWS S3 & Python Data Processing Lab

A hands-on lab for researchers: store data in Amazon S3 and process it with Python in a web-based VS Code environment.

Region: Singapore (ap-southeast-1)
Time: ~45–60 minutes

Project structure

.
├── LAB.md                 # This lab (instructions)
├── requirements.txt       # Python dependencies (boto3, pandas)
├── process_data.py        # Script: read from S3, process, write to S3
└── sample-data/
    ├── genomics_sample.csv
    └── sensor_readings.json

Objectives

  • Create an S3 bucket in the Singapore region via the AWS Management Console.
  • Upload sample research datasets (CSV and JSON) to S3.
  • Run a Python script that reads from S3, computes statistics, filters data, converts formats, and writes results back to S3.

Prerequisites

  • AWS account with console access.
  • Web-based VS Code environment with Python 3 and terminal access.
  • Basic familiarity with Python and CSV/JSON.

Part 1: Create an S3 Bucket (AWS Management Console)

  1. Sign in to AWS
    Go to https://console.aws.amazon.com and sign in.

  2. Open S3
    In the search bar, type S3 and open S3 (or go to Services → Storage → S3).

  3. Create bucket
    Click Create bucket.

  4. Bucket settings

    • Bucket name: Choose a globally unique name (e.g. research-data-yourname-2025).
      S3 bucket names must be unique across all AWS accounts and regions.
    • Region: Select Asia Pacific (Singapore) ap-southeast-1.
    • Block Public Access: Leave the default (Block all public access) unless your use case requires public access.
    • Bucket Versioning: Optional; you can leave it set to Disable for this lab.
    • Default encryption: Optional; Server-side encryption (SSE-S3) is a good choice for research data.
  5. Create
    Click Create bucket at the bottom.

  6. Note your bucket name
    You will need it for uploading data and for the Python script.
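If you prefer working from a terminal, the same bucket can be created with boto3. A minimal sketch (the bucket name is a placeholder; replace it with your own):

import boto3

# Outside us-east-1, S3 requires an explicit LocationConstraint.
s3 = boto3.client("s3", region_name="ap-southeast-1")
s3.create_bucket(
    Bucket="research-data-yourname-2025",  # must be globally unique
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-1"},
)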


Part 2: Upload Sample Datasets to S3

You will upload two sample datasets into your bucket: a genomics-style CSV and a sensor JSON file. Use the files provided in this lab's sample-data/ folder:

File                              Description
sample-data/genomics_sample.csv   Sample IDs, gene names, expression values, condition (control/treatment)
sample-data/sensor_readings.json  Sensor IDs, timestamps, numeric values, units (temperature, pH)

Upload steps (AWS Console)

  1. In the S3 console, click your bucket name.
  2. Click Upload.
  3. Add files:
    • sample-data/genomics_sample.csv → upload as genomics_sample.csv (or keep the path; see the prefix note in step 4 below)
    • sample-data/sensor_readings.json → upload as sensor_readings.json
  4. Under Destination, leave the prefix empty so files land in the bucket root. The script expects keys genomics_sample.csv and sensor_readings.json. If you use a prefix (e.g. raw/), set GENOMICS_KEY and SENSOR_KEY in process_data.py accordingly (e.g. raw/genomics_sample.csv).
  5. Click Upload, then Done.

Check: In the bucket, you should see your CSV and JSON files (e.g. in the root or under raw/).
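The console steps above are all you need, but the same upload also works from Python with boto3. A minimal sketch (assumes you run it from the lab folder and upload to the bucket root):

import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")
bucket = "research-data-yourname-2025"  # your bucket name

# Keys match what process_data.py expects by default (bucket root).
s3.upload_file("sample-data/genomics_sample.csv", bucket, "genomics_sample.csv")
s3.upload_file("sample-data/sensor_readings.json", bucket, "sensor_readings.json")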


Part 3: Python Environment and Dependencies

In your VS Code environment, open a terminal and install dependencies:

pip install -r requirements.txt
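For reference, requirements.txt needs only the two libraries named in the project structure above, at minimum:

boto3
pandas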

Ensure you have AWS credentials configured (e.g. environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SESSION_TOKEN, or an AWS CLI profile). The script will use the default credential chain (environment variables or ~/.aws/credentials).
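To confirm that your credentials resolve before running the lab script, you can ask AWS STS who you are; this quick check uses the same default credential chain:

import boto3

# Fails with a credentials error if nothing in the chain resolves.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])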


Part 4: Configure and Run the Processing Script

  1. Set your bucket name
    Open process_data.py and set the variable BUCKET_NAME to your bucket name (e.g. research-data-yourname-2025).

  2. Set region
    Ensure the script uses region ap-southeast-1 (Singapore) for S3; the sample script does this by default.

  3. Run the script

    python process_data.py
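For orientation, the configuration block at the top of process_data.py looks roughly like this (BUCKET_NAME, GENOMICS_KEY, and SENSOR_KEY are the names used in this lab; the region variable name is illustrative):

BUCKET_NAME = "research-data-yourname-2025"  # your bucket from Part 1
REGION = "ap-southeast-1"                    # Singapore
GENOMICS_KEY = "genomics_sample.csv"         # adjust if you uploaded under a prefix, e.g. raw/...
SENSOR_KEY = "sensor_readings.json"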

What the script does

  • Reads from your S3 bucket:
    • genomics_sample.csv (expression data) and sensor_readings.json (sensor readings).
  • Processes data:
    • Statistics: mean, median, min, max, count for genomics expression and for sensor value.
    • Filtering: genomics rows with expression >= 10; sensor records with value >= 22.
    • Format conversion: filtered genomics written as CSV and JSON; filtered sensor as JSON and CSV.
  • Writes results to the processed/ folder in the same bucket:
    • processed/genomics_stats.json, processed/genomics_filtered.csv, processed/genomics_filtered.json
    • processed/sensor_stats.json, processed/sensor_filtered.json, processed/sensor_filtered.csv
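A condensed sketch of that read-process-write flow for the genomics file, using boto3 and pandas. This is not the actual script; the expression column name and the processed/ keys follow the descriptions above:

import io
import json

import boto3
import pandas as pd

s3 = boto3.client("s3", region_name="ap-southeast-1")
bucket = "research-data-yourname-2025"  # your bucket name

# Read the CSV object from S3 into a DataFrame.
obj = s3.get_object(Bucket=bucket, Key="genomics_sample.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Statistics on the expression column (column name is an assumption).
col = df["expression"]
stats = {"mean": col.mean(), "median": col.median(),
         "min": col.min(), "max": col.max(), "count": int(col.count())}

# Filter, then write both outputs back under processed/.
filtered = df[col >= 10]
s3.put_object(Bucket=bucket, Key="processed/genomics_stats.json",
              Body=json.dumps(stats, default=float).encode("utf-8"))
s3.put_object(Bucket=bucket, Key="processed/genomics_filtered.csv",
              Body=filtered.to_csv(index=False).encode("utf-8"))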

After a successful run, refresh your bucket in the S3 console and open the processed/ folder to see the generated files.


Part 5: Verify Results in S3

  1. In the S3 console, open your bucket.
  2. Open the processed/ folder (or whatever output prefix you configured).
  3. Download and open the generated files (e.g. summary JSON, processed CSV/JSON) and confirm they match what the script describes (statistics, filtered records, format conversion).
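You can also verify the outputs from the terminal; a minimal boto3 listing sketch (assumes the default processed/ prefix):

import boto3

s3 = boto3.client("s3", region_name="ap-southeast-1")
resp = s3.list_objects_v2(Bucket="research-data-yourname-2025", Prefix="processed/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])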

Extensions and Troubleshooting

  • Permissions: The IAM user/role used must have s3:GetObject and s3:PutObject (and s3:ListBucket if the script lists objects) on your bucket.
  • Region: Bucket and script must use the same region (e.g. ap-southeast-1).
  • Paths: If you uploaded files under a prefix (e.g. raw/), set the script’s input key names to match (e.g. raw/genomics_sample.csv).
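If you are unsure which region a bucket lives in, you can check from Python; note that buckets in us-east-1 report a LocationConstraint of None:

import boto3

resp = boto3.client("s3").get_bucket_location(Bucket="research-data-yourname-2025")
print(resp["LocationConstraint"])  # expect 'ap-southeast-1' for this lab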

Summary

Step  Action
1     Create an S3 bucket in Singapore (ap-southeast-1) via the AWS Console.
2     Upload genomics_sample.csv and sensor_readings.json to the bucket.
3     Install dependencies (pip install -r requirements.txt) and configure AWS credentials.
4     Set BUCKET_NAME in process_data.py and run python process_data.py.
5     Check the processed/ folder in S3 for the statistics and filtered files.

Lab designed for researchers learning AWS S3 and Python data processing.