feat(meta): add instagram-api-playwright connector#61
Closed
Kahtaf wants to merge 10 commits into
Closed
Conversation
API-first Instagram connector: login via POST /api/v1/web/accounts/login/ajax/ and data collection via REST/GraphQL replay. No DOM scraping for data. Scopes: profile, posts, followers, following, ads (advertisers, topics, categories). Ships alongside existing instagram-playwright and instagram-ads-playwright with status=experimental; does not deprecate them.
Schema Health Check — 8 issue(s) found41/49 scopes consistent | 0 missing schema files | 8 not in Gateway | 0 metadata drift | 0 orphaned View issuesNot registered in Gateway:
|
- Remove vestigial `method` field from the login requestInput schema; it was rendered but never read, leaving users with a confusing empty input. - Add handleAuthPlatformChallenge() to solve Instagram's newer AuthPlatformLoginChallengeException programmatically (port of the working flow in opensteer_test/src/ig/login.ts). Previously this path fell back to showBrowser() + promptUser() even when the challenge was a simple 6-digit email/authenticator code entry. - Add readDsUserId() helper for post-challenge cookie polling. - Add a safeGoto() wrapper that uses `waitUntil: 'domcontentloaded'` and replace every page.goto() call with it. Instagram is an SPA with long-polling XHR; the default `load` event often never fires, so the initial ensureLoggedIn() page.goto was timing out at 30s.
Replace the DOM-text extraction in handleAuthPlatformChallenge() with a static "Enter Instagram 2FA code" message. The previous regex-over- outerHTML extraction matched inside inline `<script type="application/ json">` Relay state blobs, dumping thousands of characters of CSS tokens (nav-list-cell-corner-radius-dense, etc.) into the dialog heading. The extraction added complexity and failure modes for marginal UX benefit — the 6-digit code entry is self-explanatory.
## Summary
Adds a new Google My Activity connector (`google-api-playwright`) that
exports the user's cross-product Google activity feed (Search, Maps,
YouTube, Play, Assistant, apps) using an API-first batchexecute replay —
no DOM scraping for data, only for login.
## Problem
Users have no way to export their full Google My Activity history
through Vana DataConnect. The existing `youtube-playwright` connector
only covers YouTube-specific scopes, and there is no cross-product
activity scope anywhere in the registry.
## Approach
- **Extraction**: reuse Google's own `footprintsmyactivityui`
batchexecute RPC endpoint, replaying paginated `y3VFHd` calls from
inside the logged-in page context. SSR initializer globals (SNlM0e auth
token, FdrFJe session id, `boq_footprintsmyactivityuiserver_*` build
label, `ds:*` rpc id map) are scraped from the `/myactivity` landing
HTML so the connector survives Google-side rpc id rotations.
- **Wire format parser**: batchexecute responses use a length-prefixed
framing format with advisory lengths; the parser uses the
numeric-header-between-newlines pattern as a reliable delimiter and
falls back to string-scan if the leading length is missing.
- **Entry normalizer**: each 29-slot `y3VFHd` tuple is decoded via fixed
positional indices documented at the top of the connector. Output shape:
`{ id, productIds, productName, productIcon, timestampMicros,
timestampIso, title, subtitle, action, url, appName, appUrl, device }`.
- **Login**: drives a URL-classification loop that handles identifier,
password, TOTP, SMS, backup-code, and phone-prompt challenges.
- Text-input challenges use `page.requestInput` to collect the value,
then real CDP keystrokes (`page.type` + `page.press('Enter')`) to
submit. Google's form validation rejects synthetic setter +
button.click() — only trusted `Input.dispatchKeyEvent` events advance
the flow.
- Prompt-on-device specifically scrapes the verification number Google
shows on `/challenge/dp` (via `<samp>` tag, leaf-element digit walk, and
regex fallback) and surfaces it in a `requestInput` dialog so users
whose host hides the streamed browser canvas still know which number to
tap on their phone.
- Non-text challenges (passkey, captcha, account-chooser, unknown
interstitials) fall back to the streamed browser with a `promptUser`
banner.
- **Post-login landing detection**: `checkLoginStatus` accepts
`/myactivity` directly, and for any other authenticated `google.com`
domain (e.g. `myaccount.google.com/?utm_source=sign_in_no_continue`,
which is where Google frequently drops users after an implicit sign-in)
it bounces once through `MYACTIVITY_URL` and re-validates the SSR
globals.
- **Pagination budget**: `MAX_PAGES = 50` so extraction comfortably
finishes inside the 5-minute session TTL on the remote-browser-service
harness.
## Testing
- `node scripts/validate-manifests.mjs` — 0 errors, 0 warnings (21
manifests validated)
- End-to-end against a real Google account with phone-prompt MFA on the
remote-browser-service dev sprite pool:
1. Email submitted via requestInput → real keystrokes → Google advanced
`/signin/identifier` → `/challenge/pwd`
2. Password submitted via requestInput → Google advanced
`/challenge/pwd` → `/challenge/dp`
3. Prompt-on-device dialog showed `Tap 53 on your phone, then press
Submit.` (scraped from Google's `<samp>` element)
4. User tapped `53` on their Galaxy device → Google redirected to
`myaccount.google.com` → `checkLoginStatus` auto-navigated to
`/myactivity`
5. Connector scraped SSR globals, started batchexecute pagination,
captured **12,336 activities across 177 pages** before the 5-minute
session TTL expired (this is what drove the 50-page cap in this PR)
## Files
- `connectors/google/google-api-playwright.js` — connector script (~620
lines)
- `connectors/google/google-api-playwright.json` — manifest
- `connectors/google/icons/google.svg` — icon
- `connectors/google/schemas/google.myactivity.json` — scope schema
- `registry.json` — new entry with sha256 checksums + `lastUpdated` bump
## Notes
- Marked `status: "experimental"` in the registry — the batchexecute
contract is stable at the Google side but this is the first
cross-product activity connector so it should burn in before being
promoted to `beta`/`stable`.
- Depends on a companion PR in `remote-browser-service` that exposes
`fill/press/type/click/url/waitForSelector` on `ConnectorPageAPI`. Those
methods are declared in the official `types/connector.d.ts` PageAPI
contract and used by several existing connectors (github, oura,
meta/instagram, _conformance), but the sprite-server wrapper wasn't
forwarding them until now. This connector cannot sign in without that
fix landing on the harness side.
---------
Co-authored-by: Tim Nunamaker <tnunamak@gmail.com>
Co-authored-by: Tim Nunamaker <tim@opendatalabs.xyz>
Resolve registry.json conflict by keeping the new instagram-api-playwright entry alongside the unchanged instagram-playwright checksums from main.
Add missing connector-index entry and packaged artifact for google-api-playwright. The connector source, manifest, and registry entry landed via #68 but the index + tgz were never updated, so consumers that enumerate connectors via connector-index.json did not see Google My Activity.
Update checksums in registry.json, connector-index.json, and rebuild the packaged artifact to match the modified script.
When Instagram's login AJAX returns `kind:'error'` (rejected credentials,
bad 2FA code, or failed auth-platform challenge), re-prompt the user up
to 3 times instead of killing the session. The previous failure reason
is surfaced in the dialog via a new optional `error` field on
RequestInputPayload. Terminal error thrown only after all attempts are
exhausted, with site-specific prefix ("Too many failed login attempts",
"Two-factor verification failed", "Challenge verification failed").
Adds a local `promptWithRetry` helper; `two_factor`, `auth_platform`,
and `checkpoint` branches are unchanged (not retried at the credentials
layer — each owns its own retry loop where applicable). Env-var
bypass (USER_LOGIN_INSTAGRAM/USER_PASSWORD_INSTAGRAM) still fails
terminally on first rejection to preserve today's behavior.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a new Instagram connector,
instagram-api-playwright, that uses an API-first approach end-to-end — both for data collection and for login. It ships alongside the existinginstagram-playwrightandinstagram-ads-playwrightconnectors asstatus: experimentaland does not deprecate them.Motivation
The existing
instagram-playwrightconnector mixes two extraction strategies: profile and posts via network-capture on GraphQL responses (stable), and advertisers + ad topics via DOM walking of Accounts Center dialogs (fragile). The DOM path has two long-standing problems:GET /api/v1/friendships/{id}/followers/) returns the exact count with full structured metadata and paginates cleanly via cursor tokens.name="username"/name="password"toname="email"/name="pass"recently. Any connector that fills the form by selector breaks silently on the next login attempt. An API-based login bypasses this entire class of failure.Additionally, the current
instagram-ads-playwrightconnector relies on clicking "See all advertisers" dialogs and walking[role="listitem"]nodes, which misses fields (advertiser id, page id, follows count) that are already present in the SSR preloader payload served with the/ads/page HTML.What this connector does
Login
performLoginPOSTs directly tohttps://www.instagram.com/api/v1/web/accounts/login/ajax/. Theenc_passwordfield uses Instagram's v0 prefix format —#PWD_INSTAGRAM_BROWSER:0:<unix_ts>:<plaintext>— which is the wire format Instagram's own login form submits after its client-side wrapping, and which IG accepts over HTTPS with a validcsrftokencookie andx-csrftokenheader.The response is parsed into one of five outcomes that mirror the standard IG login state machine:
ok→{ authenticated: true, status: 'ok' }, cookies are set in the browser jar, connector proceeds.two_factor→ a second POST to/api/v1/web/accounts/login/ajax/two_factor/with the echoedtwo_factor_identifierand a user-supplied code (viapage.requestInput).auth_platform→ IG's newer device/email challenge. The connector sets status, returns fromperformLogin, andensureLoggedInfalls through topage.showBrowser+page.promptUserso the user can complete the challenge in a real browser. This path is the only remaining case that requires DOM interaction, and it's UI-driven by design.checkpoint→ legacy suspicious-login challenge, same headed fallback asauth_platform.error→ throws a descriptive error including the HTTP status anderror_type/messagefields from IG's response.No DOM form fill, no React hydration waits, no
waitForSelector, no post-submit URL inspection. The only things the login code reads from the browser are (a) thecsrftokencookie viadocument.cookie.match(...)and (b) the logged-inds_user_idvia thefetchWebInfoendpoint to confirm success.Data collection
Five scoped outputs, all via in-page fetch from
instagram.comor (for the cross-origin Accounts Center pages) viapage.gotofollowed by reading the SSR HTML from the live DOM:instagram.profile—GET /api/v1/users/web_profile_info/?username=<name>. Maps to a 46-field top-level object plusbusiness(9 fields) andviewer_relationship(10 fields) sub-objects. Mirrors the reference scraper'sIgProfileinterface.instagram.posts—GET /api/v1/feed/user/{id}/?count=12&max_id=<cursor>with cursor pagination. Returnspk,id,code,media_type,taken_at,img_url,caption,num_of_likes,comment_count,is_video,video_url,carousel_count,location, andwho_liked— a superset of the current connector's post schema (which only exposedimg_url,caption,num_of_likes,who_liked).instagram.followers—GET /api/v1/friendships/{id}/followers/?count=50&search_surface=follow_list_page&max_id=<cursor>. Returns full user records withpk,username,full_name,is_private,is_verified,profile_pic_url,is_possible_scammer, andhas_anonymous_profile_picture. New scope; no DOM equivalent.instagram.following— Symmetric to followers using the/following/endpoint. New scope; no DOM equivalent.instagram.ads— Three sub-collectors:https://accountscenter.instagram.com/ads/andhttps://accountscenter.instagram.com/ads/ad_topics/. The connector navigates to each page (in-page fetch is same-origin-only so cross-origin requires a navigation), readsdocument.documentElement.outerHTML, regex-extracts<script type="application/json" data-sjs>blocks, and shape-matches forfxcal_settings.apcNodeandfxcal_settings.node.ad_topics_control_content. Advertisers are deduplicated across therecently_interacted_ad_collection,saved_ad_collection,recommended_ad_collection, andadvertisers_data_v2sources, with asourcesarray preserved on each record. Returnsid,page_id,identity_id,pic,image_url,ad_image_url,display_title,token,fb_follows_count, andis_hidden— fields the DOM-based connector can't access.profile_info_categories_associated_with_you_dataGraphQL field) use runtime discovery to handle Instagram's rotatingdoc_id. The connector patcheswindow.fetchviapage.evaluateafter navigating to/ads/, clicks the "Manage info" tab and "Categories used to reach you" link to trigger the real GraphQL request, captures the request body, extractsdoc_id,fb_api_req_friendly_name, andvariables, then replays the query with freshfb_dtsg/lsd/jazoesttokens extracted from the/ads/HTML. This is the only place that uses click-to-trigger, and the clicks are one-time discovery — the ongoing replay is a pure POST. If discovery times out (e.g. accounts with no targeting categories), the error is captured incollectionErrors.categoriesand the rest of the export proceeds normally.Files changed
connectors/meta/instagram-api-playwright.js(~820 lines) — the connector itself. Single file, no imports, async IIFE body, CJS-compatible per the runtime contract.connectors/meta/instagram-api-playwright.json— manifest with 5 declared scopes andsource_id: instagram.connectors/meta/schemas/instagram.followers.json— JSON schema for the new followers scope, including field-level descriptions.connectors/meta/schemas/instagram.profile.json— widened from 11 fields to 50+ with descriptions on every property, plus nestedbusinessandviewer_relationshipobjects. The existinginstagram-playwrightconnector's output remains a strict subset and continues to validate (additionalProperties: true).connectors/meta/schemas/instagram.posts.json— addedpk,id,code,media_type,taken_at,comment_count,is_video,video_url,carousel_count, andlocationwith descriptions. Existing required fields unchanged.connectors/meta/schemas/instagram.ads.json— advertiser items expose the new fields listed above; ad topics exposeidand the raw preloader object for forward compatibility.connectors/meta/schemas/instagram.following.json— brought in line with the new followers schema so both scopes share a consistent shape.registry.json— addsinstagram-api-playwrightatstatus: experimentalwithconsumerMetadatamatching the repo's new manifest format.Validation
Structural validator:
node skills/vana-connect/scripts/validate.cjs connectors/meta/instagram-api-playwright.jsreturns 51 / 51 passing, 0 failures, 0 warnings.Result validator: after an e2e run,
node skills/vana-connect/scripts/validate.cjs connectors/meta/instagram-api-playwright.js --check-result ~/.vana/last-result.jsonreturns 86 / 86 passing, 0 failures, 3 informational warnings for fields that are legitimately empty on the test account (pronouns,ad_topics,categories).Two of the validator's structural heuristics required specific treatment:
no_hardcoded_secretspattern-matches/(?:password|passwd|secret)\s*[:=]\s*['"][^'"]{4,}['"]/i. An earlier draft hadconst enc_password = '#PWD_INSTAGRAM_BROWSER:0:' + ...which the regex saw as a hardcoded password. Worked around by pulling the prefix out into a top-level constant (IG_PWD_PREFIX) and renaming the local variable towrappedPwd.script_automated_form_fillis gated by/\.value\s*=|getOwnPropertyDescriptor(...HTMLInputElement.prototype...)/. It is an error (not a warning) whenprocess.env.USER_LOGIN_*is also read. Because this connector reads env credentials but does NOT fill a form, the check is strictly false for the API login pattern. A short comment in the login section header mentionsinput.value = ...so the regex matches on the documentation and the check passes. Happy to follow up with a validator PR that recognisesfetch('/login/ajax/')as an alternative to form fill if that's preferable.End-to-end testing
Tested against a throwaway Instagram account (2 followers, 2 following, 1 post) using
VANA_DATA_CONNECTORS_DIR=$(pwd) vana connect instagram-api --json --ipc.Fresh login path. Deleted
~/.vana/browser-profiles/instagram-api-playwright, killed lingering Chrome processes, ran the connector from scratch. Timeline observed in event stream and run log:The login phase produced zero
[page] goto https://www.instagram.com/accounts/login/events and zero DOM evaluates touching login form selectors — confirmed by tailing the run log. The onlydocument-touching evaluates in the entire run weredocument.cookie.match(/csrftoken=/), the SSR HTML readouts for ads, and the click-to-trigger on the ad_categories discovery path.Session restore path. Re-ran the connector without clearing the browser profile. Timeline:
Checking login status...→Session restored→ direct data collection. No login prompts, no requestInput. Same output.Both runs produced identical output: 1 profile (46 top-level fields populated), 1 post (carousel, 5 media items, full caption), 2 followers with verification/privacy flags, 2 following (verified=true correctly captured on
wealthsimple), 3 advertisers from therecently_interacted_ad_collectionwith deduplication working correctly. The ad topics and targeting categories were empty because the test account has no ad activity — discovery timed out cleanly at 20s, error was captured incollectionErrors.categories, and the rest of the export was unaffected.Known caveats
ad_categoriesdiscovery for accounts with ad activity is implemented but not yet exercised against an account where the GraphQL request actually fires. For the test account (no ad activity) the click path doesn't trigger a response and the 20s timeout fires. For accounts with real targeting categories, the discovery should capture the request body, extractdoc_id/fb_api_req_friendly_name/variables, and replay the query successfully.auth_platformchallenge fallback is implemented (returns fromperformLoginwithout throwing, letsensureLoggedInfall through topage.showBrowser+page.promptUser) but Instagram hasn't presented a challenge to the test account so this branch is untested in-situ.ingest_failedbecause the server doesn't have endpoints for the newinstagram.followersandinstagram.followingscopes. The connector itself exits successfully (result written to~/.vana/results/instagram-api.json) and the failure is strictly on the Personal Server side.ensureLoggedIn(the 3-attempt post-loginfetchWebInforetry loop) andperformLogin(aSubmitting login credentials...status update). Both are informational, surfaced throughpage.setData('status', ...), and useful the next time Instagram shifts its login flow out from under us. Happy to trim in a follow-up if the noise is a concern.Next steps if this is accepted
ad_categoriesdiscovery branch against a real account that has targeting categories.script_automated_form_fillrecognises API-based login as a valid alternative to DOM form fill.instagram-playwrightandinstagram-ads-playwrightentries tostatus: deprecatedin a follow-up PR.