76 changes: 76 additions & 0 deletions docs/project_description_explanation.md
@@ -0,0 +1,76 @@
# Explanation: `project_description.py`

<!-- toc -->

- [Introduction and motivation](#introduction-and-motivation)
- [Core Concepts](#core-concepts)
- [How It Works](#how-it-works)
- [Design Rationale](#design-rationale)
- [Trade-offs and Alternatives](#trade-offs-and-alternatives)

<!-- tocstop -->

## Introduction and motivation

- This tool automates the generation of academic project descriptions by
integrating Google Sheets input with the OpenAI API.
- It addresses the need for scalable, consistent, and high-quality project
documentation based on dynamic student or faculty input.
- It is intended to streamline and automate project generation and
documentation.

## Core Concepts

- **Google Sheets Integration:** Uses Google Sheets as the dynamic data source
for project names and difficulty levels.
- **Prompt Engineering:** A pre-defined prompt template guides GPT to produce
structured project descriptions.
- **Markdown Generation:** Outputs the generated content into a formatted
Markdown file for easy distribution.
- **Helper Modules:** External utility modules (`hgoogle_drive_api`, `hopenai`,
`hio`) abstract authentication, I/O, and API interaction.

## How It Works

- The script follows this control flow:

```markdown
[Google Sheet URL] → _read_google_sheet() → [DataFrame of projects] →
create_markdown_file() → loop over rows → _generate_project_description() →
[GPT-generated text] → [Markdown output]
```

- Key Functions:
  - `_read_google_sheet(url, secret_path)`: Authenticates with the service
    account key and returns the spreadsheet as a pandas DataFrame.
  - `_generate_project_description(project_name, difficulty)`: Sends the input
    to the GPT-4o-mini model and returns the generated text.
  - `create_markdown_file(df, markdown_path, max_projects)`: Iterates over the
    DataFrame (limited to the first `max_projects` rows when set), generates a
    description for each row, and writes the result to a Markdown file.
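The control flow above can be sketched offline as follows. This is a minimal illustration, not the script itself: `build_markdown` and `fake_describe` are hypothetical names, a list of dicts stands in for the DataFrame rows, and the GPT call is replaced by a stub so the sketch runs without credentials.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for one DataFrame row read from the sheet.
Row = Dict[str, str]


def build_markdown(
    rows: List[Row],
    describe: Callable[[str, str], str],
    max_projects: Optional[int] = None,
) -> str:
    """Build the Markdown text for the first `max_projects` rows (all if None)."""
    selected = rows if max_projects is None else rows[:max_projects]
    parts = ["# MSML610 Projects\n\n"]
    for row in selected:
        # One generated description per row, placed under a level-2 heading.
        description = describe(row["Tool"], row["Difficulty"])
        parts.append(f"## {row['Tool']}\n{description}\n\n")
    return "".join(parts)


def fake_describe(tool: str, difficulty: str) -> str:
    # Stub replacing the GPT call; it only echoes its inputs.
    return f"Technology: {tool}, Difficulty: {difficulty}"


rows = [
    {"Tool": "AWS Glue", "Difficulty": "1"},
    {"Tool": "Kafka", "Difficulty": "3"},
]
markdown = build_markdown(rows, fake_describe, max_projects=1)
print(markdown)
```

Passing the generator in as a parameter keeps the reading, generating, and writing steps separable, which is the same separation the script aims for.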

## Design Rationale

- **Automation Focus:** Built to minimize manual work for faculty managing large
project datasets.
- **Modular Helpers:** Offloading I/O and API logic to separate modules makes
this script easier to maintain or port.
- **GPT as Content Generator:** Using GPT-4o-mini allows flexibility and
high-quality text output with minimal prompt tuning.

## Trade-offs and Alternatives

- **Current Approach:**
  - Advantages:
    - Automated, reproducible, and scalable.
    - Maintains separation of logic (reading input, generating content, writing
      file).
  - Drawbacks:
    - Dependent on OpenAI and Google APIs (connectivity and API keys required).
    - Limited error handling and logging for individual failures.

- **Alternative Approach:**
  - Using a GUI-based application or Jupyter notebook for manual review and
    editing.
  - Advantages:
    - Allows user customization and validation at each step.
  - Drawbacks:
    - Slower and less scalable; not suitable for batch generation.
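The limited per-row error handling listed among the drawbacks of the current approach could be mitigated with a small retry wrapper. The sketch below is one possible shape, not part of the script: `describe_with_retry`, `max_attempts`, and `backoff_sec` are hypothetical names, and the flaky function simulates an API failure.

```python
import logging
import time
from typing import Callable, Optional

_LOG = logging.getLogger(__name__)


def describe_with_retry(
    describe: Callable[[str, str], str],
    tool: str,
    difficulty: str,
    *,
    max_attempts: int = 3,
    backoff_sec: float = 0.0,
) -> Optional[str]:
    """Call `describe`, retrying on failure; return None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return describe(tool, difficulty)
        except Exception as exc:  # Broad on purpose: log any failure and retry.
            _LOG.warning(
                "Attempt %d/%d failed for '%s': %s",
                attempt,
                max_attempts,
                tool,
                exc,
            )
            if attempt < max_attempts:
                # Wait longer after each failed attempt.
                time.sleep(backoff_sec * attempt)
    return None


calls = {"count": 0}


def flaky_describe(tool: str, difficulty: str) -> str:
    # Simulated generator that fails on the first call only.
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("simulated API failure")
    return f"Brief for {tool} (difficulty {difficulty})"


result = describe_with_retry(flaky_describe, "AWS Glue", "1")
print(result)
```

Returning `None` instead of raising lets a batch run skip a failed row and continue with the rest.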
80 changes: 80 additions & 0 deletions docs/project_description_how_to_guide.md
@@ -0,0 +1,80 @@
# How To Guide: `project_description.py`

<!-- toc -->

- [What It Does](#what-it-does)
- [Assumptions / Requirements](#assumptions--requirements)
- [Instructions](#instructions)
  * [Step 1: Fetch Input](#step-1-fetch-input)
  * [Step 2: Script Execution](#step-2-script-execution)
  * [Step 3: Review Output](#step-3-review-output)
- [Troubleshooting](#troubleshooting)

<!-- tocstop -->

## What It Does

- Automates the process of generating academic project descriptions by:
  - Reading project data from a Google Sheet.
  - Using OpenAI's API to auto-generate detailed project descriptions.
  - Saving the final output in a formatted Markdown file for distribution.

## Assumptions / Requirements

- Google Cloud service key file ready to use
- Docker running
- Valid OpenAI API key for model access
- Project-specific helper modules must be available:
  - `helpers.hdbg`
  - `helpers.hgoogle_drive_api`
  - `helpers.hio`
  - `helpers.hopenai`
  - `helpers.hparser`

## Instructions

### Step 1: Fetch Input

Ensure the Google Sheet is publicly accessible or shared with the configured
service account.

For instructions on how to configure the Google Sheets API, follow this link:
[https://github.com/causify-ai/helpers/blob/c50fddfdffccdccb1b2d963b729ab9674d8fda8f/docs/tools/notebooks/all.gsheet_into_pandas.how_to_guide.md](https://github.com/causify-ai/helpers/blob/c50fddfdffccdccb1b2d963b729ab9674d8fda8f/docs/tools/notebooks/all.gsheet_into_pandas.how_to_guide.md)

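The Google Sheet should contain the following columns, which the script reads by
name:

- `Tool`: the project (technology) name
- `Difficulty`: the difficulty level (1 to 3)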
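Because the script looks up the `Tool` and `Difficulty` columns by name, a quick header check before running can catch a mislabeled sheet early. The helper below is a hypothetical sketch (`missing_columns` is not part of the script); it takes any iterable of column names, for example `df.columns` from the DataFrame the script reads.

```python
from typing import Iterable, List

# Column names the script reads from each row.
REQUIRED_COLUMNS = ("Tool", "Difficulty")


def missing_columns(columns: Iterable[str]) -> List[str]:
    """Return the required column names that are absent (exact match)."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


print(missing_columns(["Tool", "Difficulty", "Notes"]))  # []
print(missing_columns(["Project name", "Difficulty"]))  # ['Tool']
```

The match is deliberately exact, since a header such as `Project name` or one with stray whitespace would also fail the by-name lookup in the script.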

### Step 2: Script Execution

- Run the script directly using Python
- This will:
  - Authenticate and read the Google Sheet
  - Generate a project description using OpenAI for each row
  - Save the first N projects (or all of them if `--max_projects` is not set)
    to the file given by `--markdown_path` (default:
    `./projects/MSML610_Projects.md`)

Code to run script:

```bash
python <file_path>/project_description.py --sheet_url <sheet_url> --secret_path <file_path> --openai_key <key> --markdown_path <file_path> -v INFO
```

To use a different Google Sheet, pass its URL with `--sheet_url` or edit
`DEFAULT_SHEET_URL` inside the script.

### Step 3: Review Output

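- The Markdown file is written to the path given by `--markdown_path` (default:
  `./projects/MSML610_Projects.md`, resolved relative to the current working
  directory).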

## Troubleshooting

- Issue: `google.auth.exceptions.DefaultCredentialsError`
  - Cause: Google service key not found at the expected path.
  - Fix: Place the correct `google_secret.json` file in `/app/DATA605/`, or pass
    its location with `--secret_path`.

- Issue: Empty or incomplete output file
  - Cause: API failure or invalid sheet format.
  - Fix: Check the logs, verify that the OpenAI and Google API calls are
    working, and ensure the data in the Google Sheet is structured correctly.
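Both issues above can often be spotted before any API call is made. The following is a hypothetical pre-flight check, not part of the script: `preflight_problems` is an illustrative name, and `OPENAI_API_KEY` is assumed to be the environment variable consulted for the OpenAI key (it is the name the OpenAI Python client reads by default; the script itself does not name it).

```python
import os
import pathlib
from typing import List, Mapping


def preflight_problems(secret_path: str, env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems found before running the script."""
    problems = []
    # The Google service key must exist at the given path.
    if not pathlib.Path(secret_path).expanduser().is_file():
        problems.append(f"Google service key not found: {secret_path}")
    # Assumption: the OpenAI key is supplied through OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


# Check the script's default secret path against the current environment.
for problem in preflight_problems("/app/DATA605/google_secret.json", os.environ):
    print(problem)
```

An empty result means both prerequisites were found; it does not guarantee that the credentials are valid.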
@@ -0,0 +1,217 @@
#!/usr/bin/env python
"""
Generate project descriptions from a Google Sheet and save them to a Markdown
file.

> project_description.py \
    --sheet_url "https://docs.google.com/spreadsheets/d/1abc...gid=0" \
    --markdown_path ./projects/MSML610_Projects.md \
    --max_projects 3 \
    -v INFO

Import as:

import DATA605.project_description as dprodesc
"""

import argparse
import logging
import pathlib
import time
from typing import Any, Optional

import pandas as pd

import helpers_root.helpers.hdbg as hdbg
import helpers_root.helpers.hgoogle_drive_api as hgofiapi
import helpers_root.helpers.hio as hio
import helpers_root.helpers.hopenai as hopenai
import helpers_root.helpers.hparser as hparser

_LOG = logging.getLogger(__name__)

# Set constants.
# Keep the condition True to use the actual spreadsheet link; set it to False
# to use a fake URL for testing purposes.
if True:
    DEFAULT_SHEET_URL = (
        "https://docs.google.com/"
        "spreadsheets/d/"
        "1Ez5uRvOgvDMkFc9c6mI21kscTKnpiCSh4UkUh_ifLIw/"
        "edit?gid=0#gid=0"
    )
else:
    fake_url = "https://docs.google.com/fake-sheet-url"
    DEFAULT_SHEET_URL = fake_url
GLOBAL_PROMPT = """Act as a data science professor.
I will give you a tool (XYZ) and difficulty level (1–3).
Write a short bullet-point project brief on how XYZ can be
used for real-time Bitcoin data ingestion in Python.
Include:

- Title
- Difficulty (1 means easy, should take around 7 days to develop, 2 is medium difficulty, should take around 10 days to complete, 3 is hard,should take 14 days to complete)
- Tech Description
- Project Idea
- Python libs
- Is it Free?
- Relevant tool(XYZ) related Resource Links

Avoid long texts or steps
"""
EXAMPLE = """Example:
Title: Ingest bitcoin prices using AWS Glue (AWS Glue is technology XYZ)
Difficulty: 1
Description
AWS Glue is a fully managed extract, transform, and load (ETL) service...
Useful resources: AWS Glue Docs
Is it free?: Free tier available with limits
Python libraries: boto3, PySpark
"""
DEFAULT_MARKDOWN_PATH = "./projects/MSML610_Projects.md"
# The maximum number of projects.
# Set the value to None to disable the limit.
DEFAULT_MAX_PROJECTS = None


def _read_google_sheet(url: str, secret_path: str) -> pd.DataFrame:
    """
    Read the Google Sheet and return the data as a pandas DataFrame.

    :param url: the URL of the Google Sheet to read
    :param secret_path: path to google_secret.json
    :return: the data
    """
    _LOG.info("Reading Google Sheet: %s", url)
    _LOG.info("Using credentials from: %s", secret_path)
    credentials = hgofiapi.get_credentials(service_key_path=secret_path)
    df = hgofiapi.read_google_file(url, credentials=credentials)
    return df


def _generate_project_description(project_name: str, difficulty: str) -> Any:
    """
    Generate a project description.

    :param project_name: the name of the project
    :param difficulty: the difficulty level of the project
    :return: the project description
    """
    if False:
        # Potential (v3) prompt if needed to use.
        # Change False to True to use it.
        # Will use more tokens, but might help produce a better result.
        prompt = (
            "Write a professional and detailed project description "
            f"for a data project titled '{project_name}'. "
            f"Indicate the difficulty level as '{difficulty}', and include objectives, "
            "technologies used, and expected outcomes."
        )
    elif False:
        # v1 (Original) prompt.
        # Change False to True to use it.
        prompt = (
            f"Generate a project description for '{project_name}', "
            f"with difficulty level '{difficulty}'."
        )
    else:
        # v2: Added by Aayush as an improvement to optimize tokens
        # while conveying the same information.
        # Short, to the point and concise. Saves the most tokens while
        # achieving similar results.
        prompt = f"Technology: {project_name}\nDifficulty: {difficulty}"
    project_desc = hopenai.get_completion(
        prompt,
        system_prompt=GLOBAL_PROMPT,
        model="gpt-4o-mini",
        cache_mode="FALLBACK",
        temperature=0.3,
        max_tokens=400,
        print_cost=True,
    )
    return project_desc


def create_markdown_file(
    df: pd.DataFrame,
    markdown_path: str,
    max_projects: Optional[int],
    *,
    sleep_sec: float = 1.5,
) -> None:
    """
    Create a markdown file with the project descriptions using helpers.hio.

    :param df: the dataframe containing the project descriptions
    :param markdown_path: the path to the markdown file
    :param max_projects: limit to the rows processed
    :param sleep_sec: amount of time to sleep between rows
    """
    content = "# MSML610 Projects\n\n"
    # Generate the project descriptions.
    # Limit the number of projects.
    rows = df.head(max_projects) if max_projects is not None else df
    for _, row in rows.iterrows():
        project_name = row["Tool"]
        difficulty = row["Difficulty"]
        description = _generate_project_description(project_name, difficulty)
        # Add the project description to the markdown file.
        content += f"## {project_name}\n"
        content += f"{description}\n\n"
        # Wait for a while before triggering another request.
        time.sleep(sleep_sec)
    # Write the markdown file.
    hio.to_file(markdown_path, content)


def _parse() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument(
        "--sheet_url", default=DEFAULT_SHEET_URL, help="Google Sheet URL"
    )
    parser.add_argument(
        "--secret_path",
        default="/app/DATA605/google_secret.json",
        help="Path to Google service-account JSON.",
    )
    parser.add_argument(
        "--markdown_path",
        default=DEFAULT_MARKDOWN_PATH,
        help="Output Markdown file",
    )
    parser.add_argument(
        "--max_projects",
        type=int,
        default=DEFAULT_MAX_PROJECTS,
        help="Limit rows processed (None = all).",
    )
    parser.add_argument(
        "--openai_key",
        type=str,
        default=None,
        help="OpenAI API key (will override env var)",
    )
    hparser.add_verbosity_arg(parser)
    return parser


def _main(parser: argparse.ArgumentParser) -> None:
    args = parser.parse_args()
    hdbg.init_logger(verbosity=args.log_level, use_exec_path=True)
    # Apply `--openai_key`, which was previously parsed but never used.
    # Assumption: the OpenAI client reads the key from `OPENAI_API_KEY`.
    if args.openai_key is not None:
        import os

        os.environ["OPENAI_API_KEY"] = args.openai_key
    # Expand user/relative paths to absolute ones early to avoid surprises.
    secret_path = str(pathlib.Path(args.secret_path).expanduser().resolve())
    markdown_path = str(pathlib.Path(args.markdown_path).expanduser().resolve())
    _LOG.info("Reading sheet %s", args.sheet_url)
    sheet_df = _read_google_sheet(args.sheet_url, secret_path)
    _LOG.info("Generating Markdown → %s", markdown_path)
    create_markdown_file(
        sheet_df,
        markdown_path,
        args.max_projects,
    )
    _LOG.info("Done: %s", markdown_path)


if __name__ == "__main__":
    _main(_parse())