docs: add project onboarding guide and reindexing scripts #78

Open
jordane wants to merge 5 commits into main from jme/LFXV2-1371

@jordane jordane commented Apr 1, 2026

@jordane jordane requested review from a team and emsearcy as code owners April 1, 2026 22:38
Copilot AI review requested due to automatic review settings April 1, 2026 22:38

Copilot AI left a comment

Pull request overview

Adds an onboarding guide for bringing new projects into the v1→v2 sync pipeline and introduces a set of ad-hoc Python reindexing/audit scripts under scripts/reindexing/ to help replay data into NATS KV (and, in one case, optionally delete stale OpenSearch docs).

Changes:

  • Add docs/onboarding-new-project.md documenting allowlisting, replaying project entries, verifying mappings, and suggested reindex order.
  • Introduce multiple Python scripts for reindexing resources via DynamoDB/OpenSearch → NATS KV and for auditing OpenSearch vs NATS KV.
  • Add basic Python packaging/dependency artifacts for the scripts (requirements.txt, pyproject.toml, uv.lock) plus a short scripts README.

Reviewed changes

Copilot reviewed 17 out of 22 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| scripts/reindexing/reindex.py | DynamoDB→NATS KV reindex/check runner across configured tables |
| scripts/reindexing/reindex_votes.py | Reindex votes/vote_responses discovered from OpenSearch into NATS KV |
| scripts/reindexing/reindex_past_meetings.py | Reindex past meetings discovered via OpenSearch scroll into NATS KV |
| scripts/reindexing/reindex_groupsio.py | Reindex Groups.io service/subgroups/members via DynamoDB lookups into NATS KV |
| scripts/reindexing/reindex_committees.py | Reindex committees/members via project-service API into NATS KV |
| scripts/reindexing/reindex_committees_v2.py | “v2” committee reindex that deletes+recreates KV entries and deletes mappings |
| scripts/reindexing/audit_opensearch.py | Audit OpenSearch documents vs NATS KV presence; optionally delete “stale” docs |
| scripts/reindexing/requirements.txt | pip requirements for the scripts |
| scripts/reindexing/pyproject.toml | uv/packaging metadata for the scripts directory |
| scripts/reindexing/uv.lock | uv lockfile for the scripts directory |
| scripts/reindexing/README.md | Setup notes for running the reindexing scripts |
| docs/onboarding-new-project.md | New-project onboarding runbook (allowlist, mappings, replay, reindex order) |


Comment thread scripts/reindexing/pyproject.toml Outdated
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = []

Copilot AI Apr 1, 2026

pyproject.toml declares no dependencies, but the scripts import boto3/httpx/nats/opensearchpy. As written, uv sync will not install the required packages (and the generated uv.lock only contains the local project). Add the runtime dependencies here (or adjust the README to not recommend uv sync).

Suggested change
- dependencies = []
+ dependencies = [
+     "boto3",
+     "httpx",
+     "nats-py",
+     "opensearch-py",
+ ]

Copilot uses AI. Check for mistakes.
Comment on lines +11 to +21
## Setup

```sh
uv sync
```

Or with pip:

```sh
pip install -r requirements.txt
```

Copilot AI Apr 1, 2026

The setup instructions recommend uv sync, but this directory’s pyproject.toml currently has an empty dependencies list, so uv sync won’t install the packages these scripts need. Either list the dependencies in pyproject.toml (preferred) or change this section to use uv pip install -r requirements.txt / pip install -r requirements.txt only.

Comment thread scripts/reindexing/reindex.py Outdated
Comment on lines +13 to +21
import json
import sys
from typing import Dict, List, Set, Tuple, Optional
from dataclasses import dataclass

import boto3
from boto3.dynamodb.conditions import Key
from nats.aio.client import Client as NATS
from nats.js.kv import KeyValue

Copilot AI Apr 1, 2026

There are unused imports here (json, Tuple, and KeyValue) which will be flagged by ruff/linters and add noise. Remove the unused imports or use them if needed.

Suggested change
- import json
- import sys
- from typing import Dict, List, Set, Tuple, Optional
- from dataclasses import dataclass
- import boto3
- from boto3.dynamodb.conditions import Key
- from nats.aio.client import Client as NATS
- from nats.js.kv import KeyValue
+ import sys
+ from typing import Dict, List, Set, Optional
+ from dataclasses import dataclass
+ import boto3
+ from boto3.dynamodb.conditions import Key
+ from nats.aio.client import Client as NATS

Comment on lines +238 to +258
if config.parent_key_index is None:
    parent_keys_list = list(parent_keys)
    # Batch get items in chunks of 100 (DynamoDB limit)
    for i in range(0, len(parent_keys_list), 100):
        batch = parent_keys_list[i:i + 100]
        keys = [{config.primary_key: pk} for pk in batch]

        response = self.dynamodb.batch_get_item(
            RequestItems={
                config.name: {
                    'Keys': keys
                }
            }
        )
        items.extend(response.get('Responses', {}).get(config.name, []))

        # Handle unprocessed keys
        while response.get('UnprocessedKeys'):
            response = self.dynamodb.batch_get_item(
                RequestItems=response['UnprocessedKeys']
            )

Copilot AI Apr 1, 2026

boto3.resource('dynamodb') does not expose batch_get_item; that API is on the DynamoDB client (boto3.client('dynamodb')) or self.dynamodb.meta.client.batch_get_item. As-is, this will raise an AttributeError when the parent_key_index is None path is hit.
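
A minimal sketch of the batch read this comment describes, assuming the caller passes in a low-level DynamoDB client (for example `self.dynamodb.meta.client`). The helper name `batch_get_all` and its `chunk_size` parameter are hypothetical and not part of the PR; only `batch_get_item`, `Responses`, and `UnprocessedKeys` come from the DynamoDB API.

```python
from typing import Any, Dict, List


def batch_get_all(client: Any, table_name: str, keys: List[Dict[str, Any]],
                  chunk_size: int = 100) -> List[Dict[str, Any]]:
    """Fetch all keys with batch_get_item, re-requesting UnprocessedKeys.

    `client` is assumed to be a low-level DynamoDB client, not the
    resource object, because the resource has no batch_get_item method.
    """
    items: List[Dict[str, Any]] = []
    for start in range(0, len(keys), chunk_size):
        # One request per chunk of at most `chunk_size` keys.
        request = {table_name: {"Keys": keys[start:start + chunk_size]}}
        # Repeat until DynamoDB reports nothing left unprocessed.
        while request:
            response = client.batch_get_item(RequestItems=request)
            items.extend(response.get("Responses", {}).get(table_name, []))
            request = response.get("UnprocessedKeys") or {}
    return items
```

With a real table this might be called as `batch_get_all(boto3.resource("dynamodb").meta.client, config.name, keys)`. The sketch retries immediately; a production script would probably add a backoff between retries.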

Comment on lines +286 to +307
try:
    # Try to get the entry from KV bucket
    entry = await self.kv.get(kv_key)

    if entry is None:
        print(f" ✗ Missing: {kv_key}")
        self.stats.add_missing(table_name, primary_key_value)
        return False

    # Entry exists - trigger reindex if not in dry-run mode
    if not self.dry_run:
        # Update with the same value to trigger reindex
        await self.kv.put(kv_key, entry.value)
        self.stats.add_reindexed(table_name)
        print(f" ✓ Reindexed: {kv_key}")

    return True

except Exception as e:
    print(f" ✗ Error checking {kv_key}: {e}")
    self.stats.add_error(table_name)
    return False

Copilot AI Apr 1, 2026

nats-py KV get() raises nats.js.errors.KeyNotFoundError when a key is missing (it typically won’t return None). With the current logic, missing keys will be counted as generic errors instead of "missing". Catch KeyNotFoundError explicitly and treat it as a missing entry.
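
One way to separate the two outcomes, sketched under assumptions: `ReindexStats` and `check_and_reindex` are hypothetical names, and the not-found exception class is passed in as `missing_exc` so the example runs without nats-py installed. In the real scripts that argument would presumably be `nats.js.errors.KeyNotFoundError`.

```python
from dataclasses import dataclass
from typing import Any, Type


@dataclass
class ReindexStats:
    """Hypothetical counters; the PR's scripts keep their own stats objects."""
    reindexed: int = 0
    missing: int = 0
    errors: int = 0


async def check_and_reindex(kv: Any, kv_key: str, stats: ReindexStats,
                            missing_exc: Type[BaseException],
                            dry_run: bool = True) -> bool:
    """Return True if the key exists; count missing and errors separately."""
    try:
        entry = await kv.get(kv_key)
        if entry is None:
            # Defensive: treat a None result the same as a not-found error.
            stats.missing += 1
            return False
        if not dry_run:
            # Re-put the same value to trigger a reindex downstream.
            await kv.put(kv_key, entry.value)
            stats.reindexed += 1
        return True
    except missing_exc:
        # The key is definitively absent from the bucket.
        stats.missing += 1
        return False
    except Exception:
        # Timeouts, connection failures, etc. are errors, not missing keys.
        stats.errors += 1
        return False
```

Ordering the `except missing_exc` clause before the generic `except Exception` is what keeps the summary accurate: only a definitive not-found increments `missing`.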

Comment on lines +91 to +110
try:
    entry = await self.kv.get(kv_key)
    if entry is None:
        print(f" MISSING: {kv_key}")
        s.missing += 1
        s.missing_keys.append(item_id)
        return

    if not self.dry_run:
        await self.kv.put(kv_key, entry.value)
        s.reindexed += 1
        print(f" reindexed: {kv_key}")
    else:
        print(f" found: {kv_key}")

except Exception as e:
    print(f" ERROR {kv_key}: {e}")
    s.missing += 1
    s.errors += 1

Copilot AI Apr 1, 2026

Similar to the other scripts: kv.get() will raise KeyNotFoundError when the key is missing (not return None), and the except path increments missing for all exceptions. Catch KeyNotFoundError as missing and count other exceptions as errors only so the summary is accurate.

Comment on lines +26 to +31
#API_BASE = "https://api-gw.platform.linuxfoundation.org/project-service/v2"
API_BASE = "https://api-gw.dev.platform.linuxfoundation.org/project-service/v2"
KV_BUCKET = "v1-objects"
COMMITTEE_KV_PREFIX = "platform-collaboration__c"
MEMBER_KV_PREFIX = "platform-community__c"
PAGE_SIZE = 100

Copilot AI Apr 1, 2026

API_BASE is hardcoded to the dev project-service gateway URL. That’s easy to run accidentally against the wrong environment. Consider making this configurable via a CLI flag and/or env var (defaulting to production, or at least requiring explicit selection).
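
A sketch of what that configuration could look like, assuming a hypothetical environment variable `PROJECT_SERVICE_API_BASE` and hypothetical `--api-base` / `--env` flags (none of these names appear in the PR). The two URLs are the ones from the snippet above; precedence here is flag, then environment variable, then the production default.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence

ENV_VAR = "PROJECT_SERVICE_API_BASE"  # hypothetical variable name
API_BASES = {
    "prod": "https://api-gw.platform.linuxfoundation.org/project-service/v2",
    "dev": "https://api-gw.dev.platform.linuxfoundation.org/project-service/v2",
}


def resolve_api_base(argv: Optional[Sequence[str]] = None,
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the API base from --api-base, then the env var, then --env."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Committee reindex (sketch)")
    # An explicit URL wins; its default is taken from the environment.
    parser.add_argument("--api-base", default=env.get(ENV_VAR),
                        help=f"Full project-service base URL (or set {ENV_VAR})")
    # Named presets, defaulting to production.
    parser.add_argument("--env", choices=sorted(API_BASES), default="prod",
                        help="Named environment used when no URL is given")
    args = parser.parse_args(argv)
    return (args.api_base or API_BASES[args.env]).rstrip("/")
```

Printing the resolved URL at startup would make it harder to run against the wrong environment unnoticed.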

Comment on lines +106 to +126
try:
    entry = await self.kv.get(kv_key)
    if entry is None:
        print(f" MISSING: {kv_key}")
        s.missing += 1
        s.missing_keys.append(item_id)
        return

    if not self.dry_run:
        await self.kv.put(kv_key, entry.value)
        s.reindexed += 1
        print(f" reindexed: {kv_key}")
    else:
        print(f" found: {kv_key}")

except Exception as e:
    print(f" ERROR {kv_key}: {e}")
    s = self._stats(kv_prefix)
    s.missing += 1
    s.errors += 1

Copilot AI Apr 1, 2026

kv.get() missing-key behavior: nats-py raises KeyNotFoundError, so the entry is None branch may never run. Also, counting all exceptions as missing will over-report missing keys. Handle KeyNotFoundError separately and increment errors for other exceptions only.

Comment on lines +122 to +190
        resp = await self._os(
            self.os.search,
            index=self.os_index,
            body=query,
            scroll=SCROLL_TTL,
            size=SCROLL_SIZE,
            _source=["id", "object_type"],
        )
        scroll_id = resp.get("_scroll_id")

        while True:
            hits = resp["hits"]["hits"]
            if not hits:
                break
            docs.extend(hits)
            if len(hits) < SCROLL_SIZE:
                break
            resp = await self._os(self.os.scroll, scroll_id=scroll_id, scroll=SCROLL_TTL)
            scroll_id = resp.get("_scroll_id")

        if scroll_id:
            try:
                await self._os(self.os.clear_scroll, scroll_id=scroll_id)
            except Exception:
                pass

        return docs

    async def audit_type(self, object_type: str, kv_bucket: str) -> Stats:
        s = Stats()
        print(f"\n--- {object_type} → KV bucket: {kv_bucket} ---")

        print(f" Fetching all '{object_type}' documents from OpenSearch index '{self.os_index}'...")
        docs = await self.scroll_opensearch(object_type)
        s.total = len(docs)
        print(f" Found {s.total} document(s) in OpenSearch")

        if s.total == 0:
            return s

        kv = await self._get_kv(kv_bucket)

        for doc in docs:
            doc_id = doc["_id"] # OpenSearch document _id
            # Also check the `id` field in _source for the uuid
            source_id = doc.get("_source", {}).get("object_id") or doc_id

            # Strip to bare UUID (last segment if namespaced)
            uuid = source_id.split(":")[-1] if ":" in source_id else source_id

            try:
                entry = await kv.get(uuid)
                if entry is None:
                    raise NotFoundError(404, "key not found", {})
                s.in_nats += 1
            except Exception:
                s.missing_in_nats += 1
                s.missing_ids.append(doc_id)

                if self.dry_run:
                    print(f" [DRY RUN] would delete: {doc_id} (uuid={uuid})")
                else:
                    try:
                        await self._os(self.os.delete, index=self.os_index, id=doc_id)
                        s.deleted += 1
                        print(f" deleted: {doc_id} (uuid={uuid})")
                    except Exception as e:
                        s.errors += 1
                        print(f" ERROR deleting {doc_id}: {e}")

Copilot AI Apr 1, 2026

This uses _source=["id", "object_type"] but then reads object_id from _source. As written, source_id will almost always fall back to doc_id, so the derived uuid may be wrong. In --delete mode this can incorrectly classify documents as stale and delete them. Use the actual field you request (likely _source["id"]) and only delete when the KV lookup definitively indicates missing (e.g., KeyNotFoundError).
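
The two fixes suggested here could be sketched roughly as follows. `uuid_from_hit` and `is_definitely_missing` are hypothetical helper names, and `missing_exc` stands in for the not-found exception class (presumably `nats.js.errors.KeyNotFoundError` in the real script) so the example does not depend on nats-py.

```python
from typing import Any, Dict, Optional, Type


def uuid_from_hit(hit: Dict[str, Any]) -> str:
    """Derive the KV key from the `id` field that the search requested.

    Falls back to the OpenSearch `_id` only when `_source.id` is absent,
    then keeps the last colon-separated segment (the bare UUID).
    """
    source_id = hit.get("_source", {}).get("id") or hit["_id"]
    return source_id.split(":")[-1]


async def is_definitely_missing(kv: Any, key: str,
                                missing_exc: Type[BaseException]
                                ) -> Optional[bool]:
    """True: KV confirms the key is absent. False: the key is present.

    None: the lookup itself failed, so the caller should NOT delete.
    """
    try:
        entry = await kv.get(key)
    except missing_exc:
        return True
    except Exception:
        # Unknown state (timeout, connection error): never treat as stale.
        return None
    return entry is None
```

A caller would then delete the OpenSearch document only when the result `is True`, and count a `None` result as an error rather than as a stale document.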

Comment on lines +137 to +156
try:
    entry = await self.kv.get(kv_key)
    if entry is None:
        print(f" MISSING: {kv_key}")
        self.stats.missing += 1
        self.stats.missing_keys.append(meeting_and_occurrence_id)
        return

    if not self.dry_run:
        await self.kv.put(kv_key, entry.value)
        self.stats.reindexed += 1
        print(f" reindexed: {kv_key}")
    else:
        print(f" found: {kv_key}")

except Exception as e:
    print(f" ERROR {kv_key}: {e}")
    self.stats.missing += 1
    self.stats.errors += 1

Copilot AI Apr 1, 2026

Same KV-missing handling issue as the other scripts: kv.get() will generally raise KeyNotFoundError rather than returning None, and the except branch increments missing for all exceptions. Catch KeyNotFoundError as missing and count other exceptions as errors only so the report is accurate.

@jordane jordane force-pushed the jme/LFXV2-1371 branch 3 times, most recently from a177e0e to 53a0177 on April 1, 2026 23:03
LFXV2-1371

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jordan Evans <jevans@linuxfoundation.org>
bramwelt previously approved these changes Apr 6, 2026
bramwelt and others added 2 commits April 6, 2026 15:58
- Add runtime deps to pyproject.toml so uv sync works
- Use meta.client.batch_get_item on dynamodb resource
- Catch KeyNotFoundError explicitly; count other
  exceptions as errors, not missing entries
- Replace offset pagination with scroll API in
  reindex_votes to avoid 10k result window limit
- Fix _source field name object_id -> id in audit script
- Make API_BASE configurable via env var, defaulting
  to production URL
- Fix docstring table names in reindex_groupsio
- Regenerate uv.lock with new dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Issue: LFXV2-1371
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
Issue: LFXV2-1371
Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>