runpod
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 53 additions & 44 deletions
diff --git a/docs/Flash_Deploy_Guide.md b/docs/Flash_Deploy_Guide.md
Lines changed: 2 additions & 2 deletions
diff --git a/docs/LoadBalancer_Runtime_Architecture.md b/docs/LoadBalancer_Runtime_Architecture.md
Lines changed: 2 additions & 2 deletions
diff --git a/docs/Using_Remote_With_LoadBalancer.md b/docs/Using_Remote_With_LoadBalancer.md
Lines changed: 3 additions & 3 deletions
diff --git a/src/runpod_flash/cli/commands/_run_server_helpers.py b/src/runpod_flash/cli/commands/_run_server_helpers.py
Lines changed: 23 additions & 7 deletions
@@ -35,7 +35,7 @@ Get your API key from: https://docs.runpod.io/get-started/api-keys
 - Integration tests that interact with Runpod API
 
 **When is the API key NOT needed?**
-- Local development with `flash run` (local server only)
+- Local development with `flash dev` (local server only)
 - `flash init` command (project scaffolding)
 - Unit tests (mocked API calls)
 - Code formatting, linting, type checking
 
@@ -1,55 +1,46 @@
 # Flash
 
-Flash is a Python SDK for developing cloud-native AI apps where you define everything—hardware, remote functions, and dependencies—using local code.
+Flash is a Python SDK for developing cloud-native AI apps where you define everything -- hardware, remote functions, and dependencies -- using local code.
 
 ```python
 import asyncio
 from runpod_flash import Endpoint, GpuType
 
-# Mark the function below for remote execution
-@Endpoint(name="hello-gpu", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, dependencies=["torch"]) 
-async def hello(): # This function runs on Runpod
+@Endpoint(name="hello-gpu", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, dependencies=["torch"])
+async def hello():
     import torch
     gpu_name = torch.cuda.get_device_name(0)
     print(f"Hello from your GPU! ({gpu_name})")
     return {"gpu": gpu_name}
 
 asyncio.run(hello())
-print("Done!") # This runs locally
+print("Done!")
 ```
 
-Write `@Endpoint` decorated Python functions on your local machine. Run them, and Flash automatically handles GPU/CPU provisioning and worker scaling on [Runpod Serverless](https://docs.runpod.io/serverless/overview).
+Write `@Endpoint` decorated Python functions on your local machine. Deploy them with `flash deploy`, then call them by running the same script. Flash handles GPU/CPU provisioning and worker scaling on [RunPod Serverless](https://docs.runpod.io/serverless/overview).
 
 ## Setup
 
 ### Install Flash
 
-Install Flash using `pip` or `uv`:
-
 ```bash
-# Install with pip
 pip install runpod-flash
-
-# Or uv
+# or
 uv add runpod-flash
 ```
 
-Flash requires [Python 3.10+](https://www.python.org/downloads/), and is currently available for macOS and Linux. Windows support is in development.
+Flash requires [Python 3.10+](https://www.python.org/downloads/) on macOS or Linux. Windows support is in development.
 
 ### Authentication
 
-Before you can use Flash, you need to authenticate with your Runpod account:
-
 ```bash
 flash login
 ```
 
-This saves your API key securely and allows you to use the Flash CLI and run `@Endpoint` functions.
+This saves your API key and allows you to use the Flash CLI and call `@Endpoint` functions.
 
 ### Coding agent integration (optional)
 
-Install the Flash skill package for AI coding agents like Claude Code, Cline, and Cursor:
-
 ```bash
 npx skills add runpod/skills
 ```
@@ -71,18 +62,12 @@ from runpod_flash import Endpoint, GpuType
     dependencies=["numpy", "torch"]
 )
 def gpu_matrix_multiply(size):
-    # IMPORTANT: Import packages INSIDE the function
     import numpy as np
     import torch
 
-    # Get GPU name
     device_name = torch.cuda.get_device_name(0)
-
-    # Create random matrices
     A = np.random.rand(size, size)
     B = np.random.rand(size, size)
-
-    # Multiply matrices
     C = np.dot(A, B)
 
     return {
@@ -91,33 +76,61 @@ def gpu_matrix_multiply(size):
         "gpu": device_name
     }
 
-# Call the function
 async def main():
-    print("Running matrix multiplication on Runpod GPU...")
+    print("Running matrix multiplication on RunPod GPU...")
     result = await gpu_matrix_multiply(1000)
-
-    print(f"\n✓ Matrix size: {result['matrix_size']}x{result['matrix_size']}")
-    print(f"✓ Result mean: {result['result_mean']:.4f}")
-    print(f"✓ GPU used: {result['gpu']}")
+    print(f"Matrix size: {result['matrix_size']}x{result['matrix_size']}")
+    print(f"Result mean: {result['result_mean']:.4f}")
+    print(f"GPU used: {result['gpu']}")
 
 if __name__ == "__main__":
     asyncio.run(main())
 ```
 
-Run it:
+Deploy, then run:
 
 ```bash
+flash deploy
 python gpu_demo.py
 ```
 
-First run takes 30-60 seconds (provisioning). Subsequent runs take 2-3 seconds.
+## How it works
+
+Flash has two modes: **deploy** and **dev**.
+
+### Deploy and run (`flash deploy` + `python script.py`)
+
+Deploy packages your code and provisions endpoints on RunPod. After deploying, run your script directly and Flash routes calls to your deployed endpoints via implicit resolution:
+
+```bash
+flash deploy                 # build, upload, provision endpoints
+python gpu_demo.py           # calls deployed endpoints automatically
+```
+
+Flash resolves endpoints by matching the app name (defaults to the current directory name) and environment (defaults to `production`). Configure with env vars or `.env`:
+
+```bash
+FLASH_APP=my-project         # defaults to current directory name
+FLASH_ENV=staging            # defaults to "production"
+```
+
+### Dev mode (`flash dev`)
+
+For local development and testing, `flash dev` starts a hybrid dev server that runs your FastAPI app locally while provisioning live ephemeral workers on RunPod:
+
+```bash
+flash dev                    # starts local server + provisions workers
+flash dev --port 3000        # custom port
+flash dev --auto-provision   # provision all endpoints at startup
+```
 
 ## What Flash does
 
-- **Remote execution**: `@Endpoint` functions run on Runpod Serverless GPUs/CPUs
-- **Auto-scaling**: Workers scale from 0 to N based on demand
-- **Dependency management**: Packages install automatically on remote workers
-- **Two patterns**: Queue-based (`@Endpoint`) for batch work, load-balanced (`Endpoint()` + routes) for REST APIs
+- **Remote execution**: `@Endpoint` functions run on RunPod Serverless GPUs/CPUs
+- **Implicit endpoint resolution**: `python script.py` routes to deployed endpoints automatically
+- **Auto-scaling**: workers scale from 0 to N based on demand
+- **Dependency management**: packages install automatically on remote workers
+- **Two patterns**: queue-based (`@Endpoint`) for batch work, load-balanced (`Endpoint()` + routes) for REST APIs
 - **Concurrency control**: `max_concurrency` lets each worker process multiple jobs simultaneously
 
 ## Documentation
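
Note on the hunk above: the new README text says endpoints are resolved by app name (defaulting to the current directory name) and environment (defaulting to `production`), both overridable through `FLASH_APP` and `FLASH_ENV`. A minimal sketch of that defaulting rule follows. It is an illustration only, not Flash's actual resolver: `resolve_target` is a hypothetical helper, and it reads plain environment variables without parsing a `.env` file.

```python
import os
from pathlib import Path


def resolve_target(environ=None, cwd=None):
    """Hypothetical helper: return the (app, env) pair using the README defaults."""
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else Path(cwd)
    # FLASH_APP falls back to the name of the current directory
    app = environ.get("FLASH_APP") or cwd.name
    # FLASH_ENV falls back to "production"
    env = environ.get("FLASH_ENV") or "production"
    return app, env


if __name__ == "__main__":
    # Uses the real environment and working directory
    print(resolve_target())
```

Passing `environ` and `cwd` explicitly keeps the sketch testable without touching the real environment. How Flash itself treats empty values or `.env` precedence is not stated in the diff.
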
@@ -126,47 +139,43 @@ Full documentation: **[docs.runpod.io/flash](https://docs.runpod.io/flash)**
 
 - [Quickstart](https://docs.runpod.io/flash/quickstart) - First GPU workload in 5 minutes
 - [Create endpoints](https://docs.runpod.io/flash/endpoint-functions) - Queue-based, load-balancing, and custom Docker endpoints
-- [CLI reference](https://docs.runpod.io/flash/cli/overview) - `flash run`, `flash deploy`, `flash build`
+- [CLI reference](https://docs.runpod.io/flash/cli/overview) - `flash dev`, `flash deploy`, `flash build`
 - [Configuration](https://docs.runpod.io/flash/configuration/parameters) - All endpoint parameters
 
 ## Flash apps
 
-When you're ready to move beyond scripts and build a production-ready API, you can create a [Flash app](https://docs.runpod.io/flash/apps/overview) (a collection of interconnected endpoints with diverse hardware configurations) and deploy it to Runpod.
+When you're ready to move beyond scripts and build a production-ready API, you can create a [Flash app](https://docs.runpod.io/flash/apps/overview) (a collection of interconnected endpoints with diverse hardware configurations) and deploy it to RunPod.
 
 [Follow this tutorial to build your first Flash app](https://docs.runpod.io/flash/apps/build-app).
 
 ## Flash CLI
 
-The Flash CLI provides a set of commands for managing your Flash apps and endpoints.
-
 ```bash
 flash --help
 ```
 
 [Learn more about the Flash CLI](https://docs.runpod.io/flash/cli/overview).
 
-
 ## Examples
 
 Browse working examples: **[github.com/runpod/flash-examples](https://github.com/runpod/flash-examples)**
 
 ## Requirements
 
-- Python 3.12
+- Python 3.10-3.12
 - macOS or Linux (Windows support in development)
-- A [Runpod account](https://runpod.io/console) (email must be verified) with an API key
+- A [RunPod account](https://runpod.io/console) (email must be verified) with an API key
 
 ## Contributing
 
 We welcome contributions! See [RELEASE_SYSTEM.md](RELEASE_SYSTEM.md) for development workflow.
 
 ```bash
-# Clone and install
 git clone https://github.com/runpod/flash.git
 cd flash
 pip install -e ".[dev]"
 
-# Use conventional commits
+# use conventional commits
 git commit -m "feat: add new feature"
 git commit -m "fix: resolve issue"
 ```
 
@@ -21,7 +21,7 @@ cd my-project
 flash login
 
 # test locally
-flash run
+flash dev
 
 # deploy
 flash deploy --env production
@@ -115,7 +115,7 @@ async def process(data: dict) -> dict:
 ### 2. Test Locally
 
 ```bash
-flash run
+flash dev
 ```
 
 This starts a local dev server at `http://localhost:8888` with auto-reload:
 
@@ -108,9 +108,9 @@ https://api.runpod.ai/v2/{endpoint-id}/runsync
 
 ### /execute Endpoint
 
-The `/execute` endpoint accepts and runs arbitrary Python code. It exists **only during local development** (`flash run`).
+The `/execute` endpoint accepts and runs arbitrary Python code. It exists **only during local development** (`flash dev`).
 
-**In local development (`flash run`):**
+**In local development (`flash dev`):**
 - `/execute` is available for Flash's remote code execution protocol
 - Code originates from your own `Endpoint`-decorated functions
 - Safe because only you can run code locally
 
@@ -124,10 +124,10 @@ async def health():
 
 ## Local Development
 
-Run locally with `flash run`:
+Run locally with `flash dev`:
 
 ```bash
-flash run
+flash dev
 # starts a local dev server at http://localhost:8888
 # all routes are auto-discovered and registered
 ```
@@ -230,7 +230,7 @@ health = await ep.get("/health")
 
 1. **Group related routes** on the same `Endpoint` instance
 2. **Use descriptive paths** like `/api/users/{user_id}` not `/api/u`
-3. **Test locally with `flash run`** before deploying
+3. **Test locally with `flash dev`** before deploying
 4. **Handle errors gracefully** with meaningful error messages
 5. **Use CPU endpoints for I/O-bound work** to save costs
 6. **Set appropriate `workers` scaling** based on expected traffic
 
@@ -78,7 +78,12 @@ async def call_with_body(func, body):
     model_fields_set) to match RunPod platform behavior.  Plain dict
     bodies bypass this check since they originate from LB local routes
     where zero-param functions legitimately receive empty input.
+
+    Remote execution errors (timeouts, worker failures) are caught and
+    returned as JSON responses instead of raising through FastAPI.
     """
+    from fastapi.responses import JSONResponse
+
     if hasattr(body, "model_fields_set") and not body.model_fields_set:
         raise HTTPException(
             status_code=422,
@@ -88,11 +93,22 @@ async def call_with_body(func, body):
                 'optional parameters, e.g. {"input": {"param_name": null}}.'
             ),
         )
-    if hasattr(body, "model_dump"):
-        return await func(**body.model_dump())
-    raw = body.get("input", body) if isinstance(body, dict) else body
-    kwargs = _map_body_to_params(func, raw)
-    return await func(**kwargs)
+    try:
+        if hasattr(body, "model_dump"):
+            return await func(**body.model_dump())
+        raw = body.get("input", body) if isinstance(body, dict) else body
+        kwargs = _map_body_to_params(func, raw)
+        return await func(**kwargs)
+    except Exception as exc:
+        msg = str(exc)
+        # strip the "Remote execution failed: " wrapper if present
+        prefix = "Remote execution failed: "
+        if msg.startswith(prefix):
+            msg = msg[len(prefix) :]
+        return JSONResponse(
+            status_code=500,
+            content={"error": msg},
+        )
 
 
 def to_dict(body) -> dict:
@@ -138,13 +154,13 @@ async def lb_execute(resource_config, func, body: dict):
         if routing and routing.get("method")
         else func.__name__
     )
-    log.info(f"[REMOTE] {resource_config} | {route_label}")
+    log.debug(f"{resource_config} | {route_label}")
 
     try:
         result = await stub(
             func, dependencies, system_dependencies, accelerate_downloads, **kwargs
         )
-        log.info(f"[REMOTE] {resource_config} | Execution complete")
+        log.debug(f"{resource_config} | execution complete")
         return result
     except TimeoutError as e:
         raise HTTPException(status_code=504, detail=str(e))