docs: add project onboarding guide and reindexing scripts #78
Conversation
Pull request overview
Adds an onboarding guide for bringing new projects into the v1→v2 sync pipeline and introduces a set of ad-hoc Python reindexing/audit scripts under scripts/reindexing/ to help replay data into NATS KV (and, in one case, optionally delete stale OpenSearch docs).
Changes:
- Add `docs/onboarding-new-project.md` documenting allowlisting, replaying project entries, verifying mappings, and suggested reindex order.
- Introduce multiple Python scripts for reindexing resources via DynamoDB/OpenSearch → NATS KV and for auditing OpenSearch vs NATS KV.
- Add basic Python packaging/dependency artifacts for the scripts (`requirements.txt`, `pyproject.toml`, `uv.lock`) plus a short scripts README.
Reviewed changes
Copilot reviewed 17 out of 22 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| scripts/reindexing/reindex.py | DynamoDB→NATS KV reindex/check runner across configured tables |
| scripts/reindexing/reindex_votes.py | Reindex votes/vote_responses discovered from OpenSearch into NATS KV |
| scripts/reindexing/reindex_past_meetings.py | Reindex past meetings discovered via OpenSearch scroll into NATS KV |
| scripts/reindexing/reindex_groupsio.py | Reindex Groups.io service/subgroups/members via DynamoDB lookups into NATS KV |
| scripts/reindexing/reindex_committees.py | Reindex committees/members via project-service API into NATS KV |
| scripts/reindexing/reindex_committees_v2.py | “v2” committee reindex that deletes+recreates KV entries and deletes mappings |
| scripts/reindexing/audit_opensearch.py | Audit OpenSearch documents vs NATS KV presence; optionally delete “stale” docs |
| scripts/reindexing/requirements.txt | pip requirements for the scripts |
| scripts/reindexing/pyproject.toml | uv/packaging metadata for the scripts directory |
| scripts/reindexing/uv.lock | uv lockfile for the scripts directory |
| scripts/reindexing/README.md | Setup notes for running the reindexing scripts |
| docs/onboarding-new-project.md | New-project onboarding runbook (allowlist, mappings, replay, reindex order) |
```toml
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = []
```
pyproject.toml declares no dependencies, but the scripts import boto3/httpx/nats/opensearchpy. As written, uv sync will not install the required packages (and the generated uv.lock only contains the local project). Add the runtime dependencies here (or adjust the README to not recommend uv sync).
```diff
-dependencies = []
+dependencies = [
+    "boto3",
+    "httpx",
+    "nats-py",
+    "opensearch-py",
+]
```
````markdown
## Setup

```sh
uv sync
```

Or with pip:

```sh
pip install -r requirements.txt
```
````
The setup instructions recommend uv sync, but this directory’s pyproject.toml currently has an empty dependencies list, so uv sync won’t install the packages these scripts need. Either list the dependencies in pyproject.toml (preferred) or change this section to use uv pip install -r requirements.txt / pip install -r requirements.txt only.
```python
import json
import sys
from typing import Dict, List, Set, Tuple, Optional
from dataclasses import dataclass

import boto3
from boto3.dynamodb.conditions import Key
from nats.aio.client import Client as NATS
from nats.js.kv import KeyValue
```
There are unused imports here (json, Tuple, and KeyValue) which will be flagged by ruff/linters and add noise. Remove the unused imports or use them if needed.
```diff
-import json
 import sys
-from typing import Dict, List, Set, Tuple, Optional
+from typing import Dict, List, Set, Optional
 from dataclasses import dataclass

 import boto3
 from boto3.dynamodb.conditions import Key
 from nats.aio.client import Client as NATS
-from nats.js.kv import KeyValue
```
```python
if config.parent_key_index is None:
    parent_keys_list = list(parent_keys)
    # Batch get items in chunks of 100 (DynamoDB limit)
    for i in range(0, len(parent_keys_list), 100):
        batch = parent_keys_list[i:i + 100]
        keys = [{config.primary_key: pk} for pk in batch]

        response = self.dynamodb.batch_get_item(
            RequestItems={
                config.name: {
                    'Keys': keys
                }
            }
        )
        items.extend(response.get('Responses', {}).get(config.name, []))

        # Handle unprocessed keys
        while response.get('UnprocessedKeys'):
            response = self.dynamodb.batch_get_item(
                RequestItems=response['UnprocessedKeys']
            )
```
boto3.resource('dynamodb') does not expose batch_get_item; that API is on the DynamoDB client (boto3.client('dynamodb')) or self.dynamodb.meta.client.batch_get_item. As-is, this will raise an AttributeError when the parent_key_index is None path is hit.
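A minimal sketch of the chunking and `UnprocessedKeys` retry loop, with the AWS call injected so the logic is testable. `batch_get_all` and its parameters are hypothetical names; in practice `batch_get` would be `boto3.resource("dynamodb").meta.client.batch_get_item` (the resource object itself has no `batch_get_item`), and with the low-level client the key values would additionally need DynamoDB type wrapping (e.g. `{"S": pk}` or `boto3.dynamodb.types.TypeSerializer`):

```python
def batch_get_all(batch_get, table_name, primary_key, parent_keys, chunk_size=100):
    """Fetch every key in chunks of `chunk_size`, draining UnprocessedKeys.

    `batch_get` is the low-level client call, e.g.
    boto3.resource("dynamodb").meta.client.batch_get_item.
    """
    items = []
    for i in range(0, len(parent_keys), chunk_size):
        keys = [{primary_key: pk} for pk in parent_keys[i:i + chunk_size]]
        request = {table_name: {"Keys": keys}}
        while request:
            response = batch_get(RequestItems=request)
            items.extend(response.get("Responses", {}).get(table_name, []))
            # DynamoDB may return a partial batch; retry only the remainder
            request = response.get("UnprocessedKeys") or None
    return items
```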
```python
try:
    # Try to get the entry from KV bucket
    entry = await self.kv.get(kv_key)

    if entry is None:
        print(f" ✗ Missing: {kv_key}")
        self.stats.add_missing(table_name, primary_key_value)
        return False

    # Entry exists - trigger reindex if not in dry-run mode
    if not self.dry_run:
        # Update with the same value to trigger reindex
        await self.kv.put(kv_key, entry.value)
        self.stats.add_reindexed(table_name)
        print(f" ✓ Reindexed: {kv_key}")

    return True

except Exception as e:
    print(f" ✗ Error checking {kv_key}: {e}")
    self.stats.add_error(table_name)
    return False
```
nats-py KV get() raises nats.js.errors.KeyNotFoundError when a key is missing (it typically won’t return None). With the current logic, missing keys will be counted as generic errors instead of "missing". Catch KeyNotFoundError explicitly and treat it as a missing entry.
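A minimal sketch of the suggested handling, assuming nats-py's `nats.js.errors.KeyNotFoundError` (with a stub fallback so the sketch runs standalone). `check_and_reindex` and the dict-based stats are hypothetical simplifications of the script's class:

```python
import asyncio

try:
    from nats.js.errors import KeyNotFoundError  # real exception when nats-py is installed
except ImportError:  # fallback stub so this sketch runs without nats-py
    class KeyNotFoundError(Exception):
        pass

async def check_and_reindex(kv, kv_key, stats, dry_run=True):
    """Re-put an existing KV entry to trigger reindexing; count missing keys separately."""
    try:
        entry = await kv.get(kv_key)
    except KeyNotFoundError:
        # nats-py raises for absent keys rather than returning None
        stats["missing"] += 1
        return False
    except Exception as exc:
        # only genuine failures land here now
        stats["errors"] += 1
        print(f"ERROR {kv_key}: {exc}")
        return False
    if not dry_run:
        await kv.put(kv_key, entry.value)
        stats["reindexed"] += 1
    return True
```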
```python
try:
    entry = await self.kv.get(kv_key)
    if entry is None:
        print(f" MISSING: {kv_key}")
        s.missing += 1
        s.missing_keys.append(item_id)
        return

    if not self.dry_run:
        await self.kv.put(kv_key, entry.value)
        s.reindexed += 1
        print(f" reindexed: {kv_key}")
    else:
        print(f" found: {kv_key}")

except Exception as e:
    print(f" ERROR {kv_key}: {e}")
    s.missing += 1
    s.errors += 1
```
Similar to the other scripts: kv.get() will raise KeyNotFoundError when the key is missing (not return None), and the except path increments missing for all exceptions. Catch KeyNotFoundError as missing and count other exceptions as errors only so the summary is accurate.
```python
#API_BASE = "https://api-gw.platform.linuxfoundation.org/project-service/v2"
API_BASE = "https://api-gw.dev.platform.linuxfoundation.org/project-service/v2"
KV_BUCKET = "v1-objects"
COMMITTEE_KV_PREFIX = "platform-collaboration__c"
MEMBER_KV_PREFIX = "platform-community__c"
PAGE_SIZE = 100
```
API_BASE is hardcoded to the dev project-service gateway URL. That’s easy to run accidentally against the wrong environment. Consider making this configurable via a CLI flag and/or env var (defaulting to production, or at least requiring explicit selection).
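One possible shape for that, defaulting to production. The env var names (`PROJECT_SERVICE_API_BASE`, `LFX_ENV`) are assumptions for illustration, not names the scripts currently use:

```python
import os

PROD_API_BASE = "https://api-gw.platform.linuxfoundation.org/project-service/v2"
DEV_API_BASE = "https://api-gw.dev.platform.linuxfoundation.org/project-service/v2"

def resolve_api_base() -> str:
    """Pick the gateway URL: an explicit PROJECT_SERVICE_API_BASE wins,
    then LFX_ENV=dev selects the dev gateway, otherwise production."""
    explicit = os.environ.get("PROJECT_SERVICE_API_BASE")
    if explicit:
        return explicit
    if os.environ.get("LFX_ENV") == "dev":
        return DEV_API_BASE
    return PROD_API_BASE
```

The same selection could be exposed as a `--api-base` CLI flag that overrides both variables.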
```python
try:
    entry = await self.kv.get(kv_key)
    if entry is None:
        print(f" MISSING: {kv_key}")
        s.missing += 1
        s.missing_keys.append(item_id)
        return

    if not self.dry_run:
        await self.kv.put(kv_key, entry.value)
        s.reindexed += 1
        print(f" reindexed: {kv_key}")
    else:
        print(f" found: {kv_key}")

except Exception as e:
    print(f" ERROR {kv_key}: {e}")
    s = self._stats(kv_prefix)
    s.missing += 1
    s.errors += 1
```
kv.get() missing-key behavior: nats-py raises KeyNotFoundError, so the entry is None branch may never run. Also, counting all exceptions as missing will over-report missing keys. Handle KeyNotFoundError separately and increment errors for other exceptions only.
```python
resp = await self._os(
    self.os.search,
    index=self.os_index,
    body=query,
    scroll=SCROLL_TTL,
    size=SCROLL_SIZE,
    _source=["id", "object_type"],
)
scroll_id = resp.get("_scroll_id")

while True:
    hits = resp["hits"]["hits"]
    if not hits:
        break
    docs.extend(hits)
    if len(hits) < SCROLL_SIZE:
        break
    resp = await self._os(self.os.scroll, scroll_id=scroll_id, scroll=SCROLL_TTL)
    scroll_id = resp.get("_scroll_id")

if scroll_id:
    try:
        await self._os(self.os.clear_scroll, scroll_id=scroll_id)
    except Exception:
        pass

return docs


async def audit_type(self, object_type: str, kv_bucket: str) -> Stats:
    s = Stats()
    print(f"\n--- {object_type} → KV bucket: {kv_bucket} ---")

    print(f" Fetching all '{object_type}' documents from OpenSearch index '{self.os_index}'...")
    docs = await self.scroll_opensearch(object_type)
    s.total = len(docs)
    print(f" Found {s.total} document(s) in OpenSearch")

    if s.total == 0:
        return s

    kv = await self._get_kv(kv_bucket)

    for doc in docs:
        doc_id = doc["_id"]  # OpenSearch document _id
        # Also check the `id` field in _source for the uuid
        source_id = doc.get("_source", {}).get("object_id") or doc_id

        # Strip to bare UUID (last segment if namespaced)
        uuid = source_id.split(":")[-1] if ":" in source_id else source_id

        try:
            entry = await kv.get(uuid)
            if entry is None:
                raise NotFoundError(404, "key not found", {})
            s.in_nats += 1
        except Exception:
            s.missing_in_nats += 1
            s.missing_ids.append(doc_id)

            if self.dry_run:
                print(f" [DRY RUN] would delete: {doc_id} (uuid={uuid})")
            else:
                try:
                    await self._os(self.os.delete, index=self.os_index, id=doc_id)
                    s.deleted += 1
                    print(f" deleted: {doc_id} (uuid={uuid})")
                except Exception as e:
                    s.errors += 1
                    print(f" ERROR deleting {doc_id}: {e}")
```
This uses _source=["id", "object_type"] but then reads object_id from _source. As written, source_id will almost always fall back to doc_id, so the derived uuid may be wrong. In --delete mode this can incorrectly classify documents as stale and delete them. Use the actual field you request (likely _source["id"]) and only delete when the KV lookup definitively indicates missing (e.g., KeyNotFoundError).
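A sketch of the uuid derivation fixed to read the `id` field that was actually requested, falling back to the document `_id` (`doc_uuid` is a hypothetical helper name). Deletion would then be gated on an explicit `KeyNotFoundError` from the KV lookup rather than on any exception:

```python
def doc_uuid(doc: dict) -> str:
    """Derive the KV lookup key from an OpenSearch hit.

    Reads the `id` field requested via _source=["id", "object_type"]
    (not the absent `object_id`), falls back to the document _id,
    and strips any namespace prefix to the bare UUID.
    """
    source_id = doc.get("_source", {}).get("id") or doc["_id"]
    return source_id.rsplit(":", 1)[-1]
```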
```python
try:
    entry = await self.kv.get(kv_key)
    if entry is None:
        print(f" MISSING: {kv_key}")
        self.stats.missing += 1
        self.stats.missing_keys.append(meeting_and_occurrence_id)
        return

    if not self.dry_run:
        await self.kv.put(kv_key, entry.value)
        self.stats.reindexed += 1
        print(f" reindexed: {kv_key}")
    else:
        print(f" found: {kv_key}")

except Exception as e:
    print(f" ERROR {kv_key}: {e}")
    self.stats.missing += 1
    self.stats.errors += 1
```
Same KV-missing handling issue as the other scripts: kv.get() will generally raise KeyNotFoundError rather than returning None, and the except branch increments missing for all exceptions. Catch KeyNotFoundError as missing and count other exceptions as errors only so the report is accurate.
Force-pushed from a177e0e to 53a0177
LFXV2-1371

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jordan Evans <jevans@linuxfoundation.org>
- Add runtime deps to pyproject.toml so uv sync works
- Use meta.client.batch_get_item on dynamodb resource
- Catch KeyNotFoundError explicitly; count other exceptions as errors, not missing entries
- Replace offset pagination with scroll API in reindex_votes to avoid 10k result window limit
- Fix _source field name object_id -> id in audit script
- Make API_BASE configurable via env var, defaulting to production URL
- Fix docstring table names in reindex_groupsio
- Regenerate uv.lock with new dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Issue: LFXV2-1371

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
Issue: LFXV2-1371 Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
LFXV2-1371