docs: add project onboarding guide and reindexing scripts #78
Conversation
Pull request overview
Adds an onboarding guide for bringing new projects into the v1→v2 sync pipeline and introduces a set of ad-hoc Python reindexing/audit scripts under scripts/reindexing/ to help replay data into NATS KV (and, in one case, optionally delete stale OpenSearch docs).
Changes:
- Add `docs/onboarding-new-project.md` documenting allowlisting, replaying project entries, verifying mappings, and suggested reindex order.
- Introduce multiple Python scripts for reindexing resources via DynamoDB/OpenSearch → NATS KV and for auditing OpenSearch vs NATS KV.
- Add basic Python packaging/dependency artifacts for the scripts (`requirements.txt`, `pyproject.toml`, `uv.lock`) plus a short scripts README.
Reviewed changes
Copilot reviewed 17 out of 22 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| scripts/reindexing/reindex.py | DynamoDB→NATS KV reindex/check runner across configured tables |
| scripts/reindexing/reindex_votes.py | Reindex votes/vote_responses discovered from OpenSearch into NATS KV |
| scripts/reindexing/reindex_past_meetings.py | Reindex past meetings discovered via OpenSearch scroll into NATS KV |
| scripts/reindexing/reindex_groupsio.py | Reindex Groups.io service/subgroups/members via DynamoDB lookups into NATS KV |
| scripts/reindexing/reindex_committees.py | Reindex committees/members via project-service API into NATS KV |
| scripts/reindexing/reindex_committees_v2.py | “v2” committee reindex that deletes+recreates KV entries and deletes mappings |
| scripts/reindexing/audit_opensearch.py | Audit OpenSearch documents vs NATS KV presence; optionally delete “stale” docs |
| scripts/reindexing/requirements.txt | pip requirements for the scripts |
| scripts/reindexing/pyproject.toml | uv/packaging metadata for the scripts directory |
| scripts/reindexing/uv.lock | uv lockfile for the scripts directory |
| scripts/reindexing/README.md | Setup notes for running the reindexing scripts |
| docs/onboarding-new-project.md | New-project onboarding runbook (allowlist, mappings, replay, reindex order) |
```toml
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = []
```
pyproject.toml declares no dependencies, but the scripts import boto3/httpx/nats/opensearchpy. As written, uv sync will not install the required packages (and the generated uv.lock only contains the local project). Add the runtime dependencies here (or adjust the README to not recommend uv sync).
```diff
-dependencies = []
+dependencies = [
+    "boto3",
+    "httpx",
+    "nats-py",
+    "opensearch-py",
+]
```
````markdown
## Setup

```sh
uv sync
```

Or with pip:

```sh
pip install -r requirements.txt
```
````
The setup instructions recommend uv sync, but this directory’s pyproject.toml currently has an empty dependencies list, so uv sync won’t install the packages these scripts need. Either list the dependencies in pyproject.toml (preferred) or change this section to use uv pip install -r requirements.txt / pip install -r requirements.txt only.
```python
import json
import sys
from typing import Dict, List, Set, Tuple, Optional
from dataclasses import dataclass

import boto3
from boto3.dynamodb.conditions import Key
from nats.aio.client import Client as NATS
from nats.js.kv import KeyValue
```
There are unused imports here (json, Tuple, and KeyValue) which will be flagged by ruff/linters and add noise. Remove the unused imports or use them if needed.
```diff
-import json
 import sys
-from typing import Dict, List, Set, Tuple, Optional
+from typing import Dict, List, Set, Optional
 from dataclasses import dataclass

 import boto3
 from boto3.dynamodb.conditions import Key
 from nats.aio.client import Client as NATS
-from nats.js.kv import KeyValue
```
```python
if config.parent_key_index is None:
    parent_keys_list = list(parent_keys)
    # Batch get items in chunks of 100 (DynamoDB limit)
    for i in range(0, len(parent_keys_list), 100):
        batch = parent_keys_list[i:i + 100]
        keys = [{config.primary_key: pk} for pk in batch]

        response = self.dynamodb.batch_get_item(
            RequestItems={
                config.name: {
                    'Keys': keys
                }
            }
        )
        items.extend(response.get('Responses', {}).get(config.name, []))

        # Handle unprocessed keys
        while response.get('UnprocessedKeys'):
            response = self.dynamodb.batch_get_item(
                RequestItems=response['UnprocessedKeys']
            )
```
boto3.resource('dynamodb') does not expose batch_get_item; that API is on the DynamoDB client (boto3.client('dynamodb')) or self.dynamodb.meta.client.batch_get_item. As-is, this will raise an AttributeError when the parent_key_index is None path is hit.
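A minimal sketch of the chunking and `UnprocessedKeys` retry loop, with the AWS call injected so the logic is testable. `batch_get_all` and its parameters are hypothetical names; in practice `batch_get` would be `boto3.resource("dynamodb").meta.client.batch_get_item` (the resource object itself has no `batch_get_item`), and with the low-level client the key values would additionally need DynamoDB type wrapping (e.g. `{"S": pk}` or `boto3.dynamodb.types.TypeSerializer`):

```python
def batch_get_all(batch_get, table_name, primary_key, parent_keys, chunk_size=100):
    """Fetch every key in chunks of `chunk_size`, draining UnprocessedKeys.

    `batch_get` is the low-level client call, e.g.
    boto3.resource("dynamodb").meta.client.batch_get_item.
    """
    items = []
    for i in range(0, len(parent_keys), chunk_size):
        keys = [{primary_key: pk} for pk in parent_keys[i:i + chunk_size]]
        request = {table_name: {"Keys": keys}}
        while request:
            response = batch_get(RequestItems=request)
            items.extend(response.get("Responses", {}).get(table_name, []))
            # DynamoDB may return a partial batch; retry only the remainder
            request = response.get("UnprocessedKeys") or None
    return items
```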
```python
try:
    # Try to get the entry from KV bucket
    entry = await self.kv.get(kv_key)

    if entry is None:
        print(f" ✗ Missing: {kv_key}")
        self.stats.add_missing(table_name, primary_key_value)
        return False

    # Entry exists - trigger reindex if not in dry-run mode
    if not self.dry_run:
        # Update with the same value to trigger reindex
        await self.kv.put(kv_key, entry.value)
        self.stats.add_reindexed(table_name)
        print(f" ✓ Reindexed: {kv_key}")

    return True

except Exception as e:
    print(f" ✗ Error checking {kv_key}: {e}")
    self.stats.add_error(table_name)
    return False
```
nats-py KV get() raises nats.js.errors.KeyNotFoundError when a key is missing (it typically won’t return None). With the current logic, missing keys will be counted as generic errors instead of "missing". Catch KeyNotFoundError explicitly and treat it as a missing entry.
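A minimal sketch of the suggested handling, assuming nats-py's `nats.js.errors.KeyNotFoundError` (with a stub fallback so the sketch runs standalone). `check_and_reindex` and the dict-based stats are hypothetical simplifications of the script's class:

```python
import asyncio

try:
    from nats.js.errors import KeyNotFoundError  # real exception when nats-py is installed
except ImportError:  # fallback stub so this sketch runs without nats-py
    class KeyNotFoundError(Exception):
        pass

async def check_and_reindex(kv, kv_key, stats, dry_run=True):
    """Re-put an existing KV entry to trigger reindexing; count missing keys separately."""
    try:
        entry = await kv.get(kv_key)
    except KeyNotFoundError:
        # nats-py raises for absent keys rather than returning None
        stats["missing"] += 1
        return False
    except Exception as exc:
        # only genuine failures land here now
        stats["errors"] += 1
        print(f"ERROR {kv_key}: {exc}")
        return False
    if not dry_run:
        await kv.put(kv_key, entry.value)
        stats["reindexed"] += 1
    return True
```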
```python
try:
    entry = await self.kv.get(kv_key)
    if entry is None:
        print(f" MISSING: {kv_key}")
        s.missing += 1
        s.missing_keys.append(item_id)
        return

    if not self.dry_run:
        await self.kv.put(kv_key, entry.value)
        s.reindexed += 1
        print(f" reindexed: {kv_key}")
    else:
        print(f" found: {kv_key}")

except Exception as e:
    print(f" ERROR {kv_key}: {e}")
    s.missing += 1
    s.errors += 1
```
Similar to the other scripts: kv.get() will raise KeyNotFoundError when the key is missing (not return None), and the except path increments missing for all exceptions. Catch KeyNotFoundError as missing and count other exceptions as errors only so the summary is accurate.
```python
#API_BASE = "https://api-gw.platform.linuxfoundation.org/project-service/v2"
API_BASE = "https://api-gw.dev.platform.linuxfoundation.org/project-service/v2"
KV_BUCKET = "v1-objects"
COMMITTEE_KV_PREFIX = "platform-collaboration__c"
MEMBER_KV_PREFIX = "platform-community__c"
PAGE_SIZE = 100
```
API_BASE is hardcoded to the dev project-service gateway URL. That’s easy to run accidentally against the wrong environment. Consider making this configurable via a CLI flag and/or env var (defaulting to production, or at least requiring explicit selection).
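One possible shape for that, defaulting to production. The env var names (`PROJECT_SERVICE_API_BASE`, `LFX_ENV`) are assumptions for illustration, not names the scripts currently use:

```python
import os

PROD_API_BASE = "https://api-gw.platform.linuxfoundation.org/project-service/v2"
DEV_API_BASE = "https://api-gw.dev.platform.linuxfoundation.org/project-service/v2"

def resolve_api_base() -> str:
    """Pick the gateway URL: an explicit PROJECT_SERVICE_API_BASE wins,
    then LFX_ENV=dev selects the dev gateway, otherwise production."""
    explicit = os.environ.get("PROJECT_SERVICE_API_BASE")
    if explicit:
        return explicit
    if os.environ.get("LFX_ENV") == "dev":
        return DEV_API_BASE
    return PROD_API_BASE
```

The same selection could be exposed as a `--api-base` CLI flag that overrides both variables.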
```python
try:
    entry = await self.kv.get(kv_key)
    if entry is None:
        print(f" MISSING: {kv_key}")
        s.missing += 1
        s.missing_keys.append(item_id)
        return

    if not self.dry_run:
        await self.kv.put(kv_key, entry.value)
        s.reindexed += 1
        print(f" reindexed: {kv_key}")
    else:
        print(f" found: {kv_key}")

except Exception as e:
    print(f" ERROR {kv_key}: {e}")
    s = self._stats(kv_prefix)
    s.missing += 1
    s.errors += 1
```
kv.get() missing-key behavior: nats-py raises KeyNotFoundError, so the entry is None branch may never run. Also, counting all exceptions as missing will over-report missing keys. Handle KeyNotFoundError separately and increment errors for other exceptions only.
```python
resp = await self._os(
    self.os.search,
    index=self.os_index,
    body=query,
    scroll=SCROLL_TTL,
    size=SCROLL_SIZE,
    _source=["id", "object_type"],
)
scroll_id = resp.get("_scroll_id")

while True:
    hits = resp["hits"]["hits"]
    if not hits:
        break
    docs.extend(hits)
    if len(hits) < SCROLL_SIZE:
        break
    resp = await self._os(self.os.scroll, scroll_id=scroll_id, scroll=SCROLL_TTL)
    scroll_id = resp.get("_scroll_id")

if scroll_id:
    try:
        await self._os(self.os.clear_scroll, scroll_id=scroll_id)
    except Exception:
        pass

return docs


async def audit_type(self, object_type: str, kv_bucket: str) -> Stats:
    s = Stats()
    print(f"\n--- {object_type} → KV bucket: {kv_bucket} ---")

    print(f" Fetching all '{object_type}' documents from OpenSearch index '{self.os_index}'...")
    docs = await self.scroll_opensearch(object_type)
    s.total = len(docs)
    print(f" Found {s.total} document(s) in OpenSearch")

    if s.total == 0:
        return s

    kv = await self._get_kv(kv_bucket)

    for doc in docs:
        doc_id = doc["_id"]  # OpenSearch document _id
        # Also check the `id` field in _source for the uuid
        source_id = doc.get("_source", {}).get("object_id") or doc_id

        # Strip to bare UUID (last segment if namespaced)
        uuid = source_id.split(":")[-1] if ":" in source_id else source_id

        try:
            entry = await kv.get(uuid)
            if entry is None:
                raise NotFoundError(404, "key not found", {})
            s.in_nats += 1
        except Exception:
            s.missing_in_nats += 1
            s.missing_ids.append(doc_id)

            if self.dry_run:
                print(f" [DRY RUN] would delete: {doc_id} (uuid={uuid})")
            else:
                try:
                    await self._os(self.os.delete, index=self.os_index, id=doc_id)
                    s.deleted += 1
                    print(f" deleted: {doc_id} (uuid={uuid})")
                except Exception as e:
                    s.errors += 1
                    print(f" ERROR deleting {doc_id}: {e}")
```
This uses _source=["id", "object_type"] but then reads object_id from _source. As written, source_id will almost always fall back to doc_id, so the derived uuid may be wrong. In --delete mode this can incorrectly classify documents as stale and delete them. Use the actual field you request (likely _source["id"]) and only delete when the KV lookup definitively indicates missing (e.g., KeyNotFoundError).
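A sketch of the uuid derivation fixed to read the `id` field that was actually requested, falling back to the document `_id` (`doc_uuid` is a hypothetical helper name). Deletion would then be gated on an explicit `KeyNotFoundError` from the KV lookup rather than on any exception:

```python
def doc_uuid(doc: dict) -> str:
    """Derive the KV lookup key from an OpenSearch hit.

    Reads the `id` field requested via _source=["id", "object_type"]
    (not the absent `object_id`), falls back to the document _id,
    and strips any namespace prefix to the bare UUID.
    """
    source_id = doc.get("_source", {}).get("id") or doc["_id"]
    return source_id.rsplit(":", 1)[-1]
```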
```python
try:
    entry = await self.kv.get(kv_key)
    if entry is None:
        print(f" MISSING: {kv_key}")
        self.stats.missing += 1
        self.stats.missing_keys.append(meeting_and_occurrence_id)
        return

    if not self.dry_run:
        await self.kv.put(kv_key, entry.value)
        self.stats.reindexed += 1
        print(f" reindexed: {kv_key}")
    else:
        print(f" found: {kv_key}")

except Exception as e:
    print(f" ERROR {kv_key}: {e}")
    self.stats.missing += 1
    self.stats.errors += 1
```
Same KV-missing handling issue as the other scripts: kv.get() will generally raise KeyNotFoundError rather than returning None, and the except branch increments missing for all exceptions. Catch KeyNotFoundError as missing and count other exceptions as errors only so the report is accurate.
Force-pushed from a177e0e to 53a0177
LFXV2-1371

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jordan Evans <jevans@linuxfoundation.org>
- Add runtime deps to pyproject.toml so uv sync works
- Use meta.client.batch_get_item on dynamodb resource
- Catch KeyNotFoundError explicitly; count other exceptions as errors, not missing entries
- Replace offset pagination with scroll API in reindex_votes to avoid 10k result window limit
- Fix _source field name object_id -> id in audit script
- Make API_BASE configurable via env var, defaulting to production URL
- Fix docstring table names in reindex_groupsio
- Regenerate uv.lock with new dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Issue: LFXV2-1371

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
Issue: LFXV2-1371 Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
LFXV2-1371