Commit 641bacd

Remove Project (InftyAI#175)
* remove project
* fix
* add labels
* optimize pages
* update the dropdown component
* fix
* optimize
* optimize
* add labels
* optimize
* support multi-filters
* add colors
* optimize color
* optimize font
* update artifacts pages
* update artifacts pages
* use the same artifact display component
* optimize the Dashboard
* Change the execution result name
* add new card
* update the card
* add component
* update readme
* fix lint
* fix test
* fix test
* change the icon of workflow
* remove result.json if not exists

---------

Signed-off-by: kerthcet <kerthcet@gmail.com>
1 parent 2350715 commit 641bacd

74 files changed

Lines changed: 2845 additions & 4121 deletions

README.md

Lines changed: 13 additions & 9 deletions
@@ -18,9 +18,8 @@ Open, modular framework to build GenAI applications.

 ## Concepts

-- **Team**: A Team is the highest-level organizational unit in AlphaTrion. It represents a group of users collaborating on projects and experiments.
-- **Project**: A Project is a namespace-level abstraction that contains multiple experiments. It helps organize experiments related to a specific goal or topic.
-- **Experiment**: An Experiment is a logic-level abstraction for organizing and managing a series of related runs. It allows users to group runs that share a common purpose or configuration.
+- **Team**: A Team is the highest-level organizational unit in AlphaTrion. It represents a group of users collaborating on experiments.
+- **Experiment**: An Experiment is a logic-level abstraction for organizing and managing a series of related runs. It allows users to group runs that share a common purpose or configuration. Experiments can be organized using labels.
 - **Run**: A Run is a real execution instance of an experiment. It represents the actual execution of the code with the specified configuration and hyperparameters defined in the experiment.

 ## Quick Start
@@ -70,7 +69,7 @@ Below is a simple example with two approaches demonstrating how to create an exp

 ```python
 import alphatrion as alpha
-from alphatrion import experiment, project
+from alphatrion import experiment

 # Use the user ID generated from the `alphatrion init` command.
 alpha.init(user_id=<user_id>)
@@ -79,17 +78,16 @@ async def your_task():
     # Run your code here then log metrics.
     await alpha.log_metrics({"accuracy": 0.95})

-async with project.Project.setup(name="my_project"):
-    async with experiment.CraftExperiment.start(name="my_experiment") as exp:
-        task = exp.run(your_task) # use lambda or partial if you need to pass arguments to your_task
-        await task.wait()
+async with experiment.CraftExperiment.start(name="my_experiment") as exp:
+    task = exp.run(your_task) # use lambda or partial if you need to pass arguments to your_task
+    await task.wait()
 ```
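The inline comment about `lambda` or `partial` can be illustrated without AlphaTrion itself. The sketch below assumes, as that comment implies, that the runner invokes the callable with no arguments; `run` is a hypothetical stand-in for `exp.run`, not the library's implementation.

```python
import asyncio
from functools import partial


async def your_task(lr, epochs=1):
    # Stand-in for real work; it simply reports the arguments it received.
    await asyncio.sleep(0)
    return {"lr": lr, "epochs": epochs}


async def run(task):
    # Hypothetical runner: calls the task with no arguments (an assumption
    # about how exp.run behaves, based on the README comment).
    return await task()


async def main():
    # partial binds the arguments up front and yields a zero-argument callable.
    via_partial = await run(partial(your_task, 0.01, epochs=3))
    # A lambda defers the call and achieves the same thing.
    via_lambda = await run(lambda: your_task(0.1))
    return via_partial, via_lambda
```

Either form hands the runner a zero-argument callable; `partial` tends to read better when the bound values are fixed ahead of time.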

 ### View Dashboard

 ![dashboard](./site/images/dashboard.png)

-The dashboard provides a web interface to explore projects, experiments, runs, and metrics through an intuitive UI.
+The dashboard provides a web interface to explore experiments, runs, and metrics through an intuitive UI.

 #### Launch Dashboard

@@ -111,6 +109,12 @@ The dashboard will automatically open in your browser at `http://127.0.0.1:5173`
 - [Dashboard CLI Guide](./docs/dashboard/dashboard-cli.md) - Using the dashboard CLI command
 - [Dashboard Architecture](./docs/dashboard/dashboard-architecture.md) - Technical architecture and deployment patterns

+### Tracing
+
+AlphaTrion automatically captures tracing data for all runs, including spans for each run and associated metadata. You can query this data to analyze model performance, latency, and token usage.
+
+![tracing](./site/images/trace.png)
+
 ### Cleanup

 ```bash

alphatrion/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -1,10 +1,10 @@
-from alphatrion.log.log import log_artifact, log_execution, log_metrics, log_params
+from alphatrion.log.log import log_artifact, log_metrics, log_params, log_result
 from alphatrion.runtime.runtime import init

 __all__ = [
     "init",
     "log_artifact",
     "log_params",
     "log_metrics",
-    "log_execution",
+    "log_result",
 ]

alphatrion/artifact/artifact.py

Lines changed: 0 additions & 1 deletion
@@ -75,7 +75,6 @@ def list_versions(self, repo_name: str) -> list[str]:
                 or "does not exist" in error_msg
             ):
                 # Return empty list if repository doesn't exist yet
-                # This is expected for projects without artifacts
                 return []
             # Re-raise other errors
             raise RuntimeError(f"Failed to list artifacts versions: {e}") from e

alphatrion/experiment/base.py

Lines changed: 67 additions & 11 deletions
@@ -1,14 +1,21 @@
+import asyncio
+import contextlib
 import enum
+import os
+import shutil
+import signal
 import uuid
 from abc import ABC, abstractmethod
 from collections.abc import Callable
 from datetime import UTC, datetime

 from pydantic import BaseModel, Field, model_validator

+from alphatrion import envs
 from alphatrion.run.run import Run
 from alphatrion.runtime.contextvars import current_exp_id
 from alphatrion.runtime.runtime import global_runtime
+from alphatrion.snapshot.snapshot import team_path
 from alphatrion.storage.sql_models import FINISHED_STATUS, Status
 from alphatrion.types import CallableEntry
 from alphatrion.utils import context
@@ -65,7 +72,6 @@ class ExperimentConfig(BaseModel):
     max_execution_seconds: int = Field(
         default=-1,
         description="Maximum execution seconds for the Experiment. \
-        Experiment timeout will override project timeout if both are set. \
         Default is -1 (no limit).",
     )
     early_stopping_runs: int = Field(
@@ -152,11 +158,16 @@ class Experiment(ABC):
         "_total_runs_counter",
         # The end status, None, Err or Cancelled.
         "_end_status",
+        "_stopped",
+        "_signal_task",
     )

     def __init__(self, config: ExperimentConfig | None = None):
         self._config = config or ExperimentConfig()
+
         self._runtime = global_runtime()
+        self._runtime.current_experiment = self
+
         self._construct_meta()
         self._runs = dict[uuid.UUID, Run]()
         self._early_stopping_counter = 0
@@ -165,35 +176,59 @@ def __init__(self, config: ExperimentConfig | None = None):
         # if experiment starts to wait, it will auto stop when the runs
         # are all finished.
         self._start_waiting = False
+        self._end_status = None
+        self._stopped = asyncio.Event()
+        self._signal_task: asyncio.Task | None = None

     async def __aenter__(self):
+        self._signal_task = self._start_signal_handlers()
         return self

     async def __aexit__(self, exc_type, exc_val, exc_tb):
         self.done()
+        self._end_status = None
+
+        if self._signal_task:
+            # Already done, will not update the status again.
+            self._signal_task.cancel()
+
+            with contextlib.suppress(asyncio.CancelledError):
+                await self._signal_task
+
         if self._token:
             current_exp_id.reset(self._token)
+            self._runtime.current_experiment = None

     def _start(
         self,
         name: str,
         description: str | None = None,
+        labels: str | None = None,
         meta: dict | None = None,
         params: dict | None = None,
     ):
-        proj = self._runtime.current_proj
-        exp_obj = self._runtime.metadb.get_exp_by_name(name=name, project_id=proj.id)
+        exp_obj = self._runtime.metadb.get_exp_by_name(
+            name=name, team_id=self._runtime.team_id
+        )

-        # FIXME: what if the existing Experiment is completed, will lead to confusion?
-        if exp_obj:
+        # Just in case of kubernetes pod restarts, we want to make sure the experiment
+        # can be resumed if it is not completed, instead of creating a new experiment
+        # with the same name. If the experiment is already completed, we raise an error
+        # to avoid confusion.
+        if exp_obj and exp_obj.status != Status.COMPLETED:
             self._id = exp_obj.uuid
+        elif exp_obj and exp_obj.status == Status.COMPLETED:
+            raise RuntimeError(
+                f"Experiment with name '{name}' already exists and is completed. \
+                Please choose a different name or delete the existing experiment."
+            )
         else:
             self._id = self._runtime._metadb.create_experiment(
                 name=name,
                 team_id=self._runtime._team_id,
                 user_id=self._runtime._user_id,
-                project_id=proj.id,
                 description=description,
+                labels=labels,
                 meta=meta,
                 params=params,
                 status=Status.RUNNING,
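In isolation, the resume-or-reject rule introduced in this hunk might look like the sketch below. The in-memory `store` dict and `resolve_experiment_id` are hypothetical stand-ins for the metadata database and `_start`, and the `Status` enum here is a two-member simplification; only the branching mirrors the diff.

```python
import enum
import uuid


class Status(enum.Enum):
    # Simplified stand-in for the status enum in sql_models.
    RUNNING = "running"
    COMPLETED = "completed"


def resolve_experiment_id(store, name):
    """Return the ID to use for `name`.

    `store` maps experiment name -> (uuid, Status) and plays the role of the
    metadata database in this sketch.
    """
    existing = store.get(name)
    if existing is None:
        # No experiment with this name yet: create one in RUNNING state.
        new_id = uuid.uuid4()
        store[name] = (new_id, Status.RUNNING)
        return new_id

    exp_id, status = existing
    if status is Status.COMPLETED:
        # A finished experiment is never silently reused.
        raise RuntimeError(
            f"Experiment with name '{name}' already exists and is completed."
        )
    # Unfinished experiment (e.g. after a pod restart): resume it.
    return exp_id
```

Calling the function twice with the same name returns the same ID until the stored status becomes `COMPLETED`, after which it raises.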
@@ -204,10 +239,7 @@ def _start(
                 timeout=self._timeout(),
             )

-        # We don't reset the Experiment id context var,
-        # because each experiment runs in its own context.
         self._token = current_exp_id.set(self._id)
-        proj.register_experiment(id=self.id, instance=self)

     @property
     def id(self) -> uuid.UUID:
@@ -335,6 +367,7 @@ def is_done(self) -> bool:
     # TODO: Should we distinguish done and cancel?
     def done(self):
         self._cancel()
+        self._cleanup()

     def done_with_err(self):
         self._end_status = "Err"
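`done()` now also calls `_cleanup()`, which removes the team folder unless cleanup is switched off through an environment variable. A minimal sketch of that gate follows, assuming the variable is literally named `AUTO_CLEANUP` (the diff only shows the constant `envs.AUTO_CLEANUP`, not its value); `cleanup` and `demo` are hypothetical helpers, not part of the library.

```python
import os
import shutil
import tempfile

# Assumed name of the environment variable; the real value lives in alphatrion.envs.
AUTO_CLEANUP_ENV = "AUTO_CLEANUP"


def cleanup(path, env=None):
    """Remove `path` unless cleanup is disabled. Returns True if removal ran."""
    env = os.environ if env is None else env
    enabled = env.get(AUTO_CLEANUP_ENV, "true").lower() == "true"
    if os.path.exists(path) and enabled:
        # ignore_errors keeps shutdown from failing on a half-deleted tree.
        shutil.rmtree(path, ignore_errors=True)
        return True
    return False


def demo():
    workdir = tempfile.mkdtemp()
    kept = cleanup(workdir, env={AUTO_CLEANUP_ENV: "false"})  # gate closed
    removed = cleanup(workdir, env={})  # default is enabled
    return kept, removed, os.path.exists(workdir)
```

Defaulting to `"true"` means cleanup is opt-out: anyone who wants to inspect the local files after a run has to disable it explicitly.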
@@ -370,8 +403,6 @@ def _stop(self):
             experiment_id=self._id, status=status, duration=duration
         )

-        self._runtime.current_proj.unregister_experiment(self.id)
-
     def _get_obj(self):
         return self._runtime._metadb.get_experiment(experiment_id=self.id)

@@ -425,3 +456,28 @@ def start(
         params: dict | None = None,
     ) -> "Experiment":
         raise NotImplementedError
+
+    def _cleanup(self):
+        # remove the whole folder once the experiment is done.
+        if (
+            os.path.exists(team_path())
+            and os.getenv(envs.AUTO_CLEANUP, "true").lower() == "true"
+        ):
+            shutil.rmtree(team_path(), ignore_errors=True)
+
+    def _start_signal_handlers(self):
+        loop = asyncio.get_running_loop()
+
+        # Handle SIGINT and SIGTERM to allow graceful shutdown.
+        # Make sure to call done() on receiving the signal.
+        for sig in (signal.SIGINT, signal.SIGTERM):
+            loop.add_signal_handler(sig, self._on_signal)
+
+        return asyncio.create_task(self._wait_for_stop())
+
+    def _on_signal(self):
+        self._stopped.set()
+
+    async def _wait_for_stop(self):
+        await self._stopped.wait()
+        self.done_with_cancel()
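The signal handling added here follows a common asyncio pattern: the handler only sets an `asyncio.Event`, and a watcher task reacts to it. The sketch below reproduces that pattern in a self-contained form. `GracefulScope` and `demo` are hypothetical names, the `"Cancelled"` string stands in for whatever `done_with_cancel()` records, and handler removal on exit is an addition of this sketch that the diff does not show.

```python
import asyncio
import contextlib
import signal

SIGNALS = (signal.SIGINT, signal.SIGTERM)
# add_signal_handler is Unix-only and needs the main thread; this sketch
# simply skips installation where it is unavailable.
UNSUPPORTED = (NotImplementedError, RuntimeError, ValueError)


class GracefulScope:
    """Hypothetical stand-in for the experiment's async context manager."""

    def __init__(self):
        self.end_status = None
        self._stopped = asyncio.Event()
        self._signal_task = None

    async def __aenter__(self):
        loop = asyncio.get_running_loop()
        for sig in SIGNALS:
            with contextlib.suppress(*UNSUPPORTED):
                # The handler only flips the event; the real work happens in a task.
                loop.add_signal_handler(sig, self._stopped.set)
        self._signal_task = asyncio.create_task(self._wait_for_stop())
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        # Stop the watcher; cancelling an already finished task is a no-op.
        self._signal_task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await self._signal_task
        loop = asyncio.get_running_loop()
        for sig in SIGNALS:
            with contextlib.suppress(*UNSUPPORTED):
                loop.remove_signal_handler(sig)

    async def _wait_for_stop(self):
        await self._stopped.wait()
        self.end_status = "Cancelled"


async def demo(simulate_signal):
    async with GracefulScope() as scope:
        if simulate_signal:
            scope._stopped.set()  # what the signal handler would do
        await asyncio.sleep(0)  # yield so the watcher task can run
    return scope.end_status
```

Keeping the handler itself trivial avoids doing real work inside the signal callback; everything that touches state runs in the watcher task.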

alphatrion/experiment/craft_experiment.py

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,7 @@ def start(
         cls,
         name: str,
         description: str | None = None,
+        labels: str | None = None,
         meta: dict | None = None,
         params: dict | None = None,
         config: base.ExperimentConfig | None = None,
@@ -27,6 +28,7 @@ def start(
         exp._start(
             name=name,
             description=description,
+            labels=labels,
             meta=meta,
             params=params,
         )

alphatrion/log/log.py

Lines changed: 17 additions & 25 deletions
@@ -18,8 +18,8 @@

 async def log_artifact(
     paths: str | list[str],
+    repo_name: str,
     version: str | None = None,
-    repo_name: str | None = None,
     pre_save_hook: Callable | None = None,
 ) -> str:
     """
@@ -35,7 +35,7 @@ async def log_artifact(
         If want to save something, make sure it's under the paths.

     :return: the path of the logged artifact in the format of
-        {team_id}/{project_id}:{version}
+        {team_id}/{repo_name}:{version}
     """

     if not paths:
@@ -57,15 +57,9 @@ async def log_artifact(
         else:
             raise ValueError("pre_save_hook must be a callable function")

-    # We use project ID as the repo name rather than the project name,
-    # because project name is not unique and might change over time.
-    proj = runtime.current_proj
-    if proj is None:
-        raise RuntimeError("No running project found in the current context.")
-
     loop = asyncio.get_running_loop()
     return await loop.run_in_executor(
-        None, runtime._artifact.push, repo_name or str(proj.id), paths, version
+        None, runtime._artifact.push, repo_name, paths, version
     )


@@ -97,13 +91,12 @@ async def log_metrics(metrics: dict[str, float]) -> bool:
         raise RuntimeError("log_metrics must be called inside a Run.")

     runtime = global_runtime()
-    proj = runtime.current_proj

     exp_id = current_exp_id.get()
     if exp_id is None:
         raise RuntimeError("log_metrics must be called inside a Experiment.")

-    exp = proj.get_experiment(id=exp_id)
+    exp = runtime.current_experiment
     if exp is None:
         raise RuntimeError(f"Experiment {exp_id} not found in the database.")

@@ -117,7 +110,6 @@ async def log_metrics(metrics: dict[str, float]) -> bool:
             key=key,
             value=value,
             team_id=runtime._team_id,
-            project_id=proj.id,
             experiment_id=exp_id,
             run_id=run_id,
         )
@@ -138,7 +130,7 @@ async def log_metrics(metrics: dict[str, float]) -> bool:
     # TODO: refactor this with an event driven mechanism later.
     if should_checkpoint:
         path = await log_artifact(
-            repo_name=f"{str(proj.id)}/ckpt",
+            repo_name="ckpt",
             # If not provided, will use the default checkpoint path.
             paths=exp.config().checkpoint.path or checkpoint_path(),
             pre_save_hook=exp.config().checkpoint.pre_save_hook,
@@ -154,23 +146,23 @@ async def log_metrics(metrics: dict[str, float]) -> bool:
     return is_best_metric


-# log_execution is used to log the record of a run/experiment/project,
+# log_result is used to log the result of a run/experiment,
 # including both input and output, e.g. you want to save the code snippet.
 # It will be stored in the object storage as a JSON file if object storage
 # is enabled or locally otherwise.
-async def log_execution(
+async def log_result(
     output: dict[str, Any],
     input: dict[str, Any] | None = None,
     phase: str = "success",
     kind: ExecutionKind = ExecutionKind.RUN,
 ):
-    execution = None
+    result = None

     if kind == ExecutionKind.RUN:
-        execution = build_run_execution(output=output, input=input, phase=phase)
+        result = build_run_execution(output=output, input=input, phase=phase)
     else:
         raise NotImplementedError(
-            f"Logging record of kind {execution.kind} is not implemented yet."
+            f"Logging record of kind {result.kind} is not implemented yet."
         )

     # Can I get the file size to store in the database?
@@ -179,28 +171,28 @@ async def log_execution(
     if os.path.exists(path) is False:
         os.makedirs(path, exist_ok=True)

-    # Will eventually be cleanup on Project done() if AUTO_CLEANUP is enabled.
+    # Will eventually be cleanup on Experiment done() if AUTO_CLEANUP is enabled.
     # Considering the record file is small, we just save it locally first.
     # If this changes in the future, we should delete them after uploading.
-    with open(os.path.join(path, "execution.json"), "w") as f:
-        f.write(execution.model_dump_json())
+    with open(os.path.join(path, "result.json"), "w") as f:
+        f.write(result.model_dump_json())

-    file_size = os.path.getsize(os.path.join(path, "execution.json"))
+    file_size = os.path.getsize(os.path.join(path, "result.json"))
     runtime = global_runtime()

     # If not enabled, only save to local disk.
     if runtime.artifact_storage_enabled():
         path = await log_artifact(
-            paths=os.path.join(path, "execution.json"),
-            repo_name=f"{str(runtime.current_proj.id)}/execution",
+            paths=os.path.join(path, "result.json"),
+            repo_name="execution",
         )
         runtime.metadb.update_run(
             run_id=current_run_id.get(),
             meta={
                 EXECUTION_RESULT: {
                     "path": path,
                     "size": file_size,
-                    "file_name": "execution.json",
+                    "file_name": "result.json",
                 }
             },
         )
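The local half of `log_result` (serialize, write `result.json`, record path and size) can be sketched with the standard library alone. This is only an illustration: the real code serializes a pydantic model via `model_dump_json()`, whereas the sketch uses `json.dump` on a plain dict, and `write_result` and `demo` are hypothetical names.

```python
import json
import os
import tempfile


def write_result(directory, output, inputs=None, phase="success"):
    """Write result.json under `directory` and return metadata about the file."""
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, "result.json")
    with open(file_path, "w") as f:
        # Plain dict in place of the pydantic model used by the library.
        json.dump({"input": inputs, "output": output, "phase": phase}, f)
    return {
        "path": file_path,
        "size": os.path.getsize(file_path),
        "file_name": "result.json",
    }


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        meta = write_result(os.path.join(tmp, "run-1"), {"answer": 42})
        with open(meta["path"]) as f:
            return meta["file_name"], meta["size"] > 0, json.load(f)
```

The returned dict has the same three keys the diff stores under `EXECUTION_RESULT`; uploading the file and updating the run record are left out.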
