**docs/getting-started/sync-data.mdx** (4 additions, 3 deletions)
@@ -360,11 +360,12 @@ curl --location --request POST 'localhost:3476/v1/tenants/{tenant_id}/data/write
</Tabs>
<Note>
The `relation` field in the subject object controls what the subject refers to:

- **Empty (`""`)** — the subject is a direct, individual user (e.g. `user:123`).
- **`"..."`** — used when the subject type is not the `user` entity. It represents the entity itself without a specific sub-relation.
- **A named relation (e.g. `"member"`)** — the subject is a userset: everyone who holds that relation on the subject entity. For example, `document:1#maintainer@organization:2#member` means all members of organization 2 are maintainers of document 1. This is how Permify supports group-based permissions without enumerating individual users.
</Note>
### Organization Members Are Maintainers Of A Specific Doc
**docs/operations/cache.mdx** (34 additions)
@@ -50,6 +50,27 @@ The cache library used is: https://github.com/dgraph-io/ristretto
Note: Another advantage of the MVCC pattern is the ability to store data historically. However, it has the downside of accumulating too many relationships over time. For this, we have developed a garbage collector that deletes old data at an interval you specify.
### Cache Sizing & Eviction (Snap Tokens)
There is **no separate, dedicated cache for snap tokens**; the snap token is simply part of the permission cache key. This means the snap-token "cache" is sized entirely through the same `permission.cache` settings shown above:
| Config key | Purpose |
|---|---|
| `service.permission.cache.max_cost` | Maximum memory budget for the permission cache (e.g. `10MiB`, `256MiB`). This is the effective size limit that gates how many snap-token-keyed entries can reside in memory at once. |
| `service.permission.cache.number_of_counters` | Number of TinyLFU admission counters. A good rule of thumb is ~10× the expected number of unique cached items. |
No TTL is configured by default. Eviction is driven purely by **memory pressure** against `max_cost`, using Ristretto's TinyLFU admission policy combined with a SampledLFU eviction policy. Entries are evicted when new items need space and the budget is exhausted — not after a fixed time window.
<Note>
If you observe high cache miss rates after a schema version change, this is expected: the `schema_version` component of the key changes, making all prior entries stale. Size your `max_cost` to hold a comfortable working set for the most recent schema version in use.
</Note>
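
To make these sizing knobs concrete, here is a minimal Go sketch (not Permify's actual code) of how `number_of_counters` and `max_cost` map onto Ristretto's configuration, and why entries disappear under memory pressure rather than after a TTL. The key string is purely illustrative; Permify's real key layout is internal.

```go
package main

import (
	"fmt"

	"github.com/dgraph-io/ristretto"
)

func main() {
	// number_of_counters -> NumCounters, max_cost -> MaxCost (bytes here).
	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 10_000_000, // ~10x the expected number of unique cached entries
		MaxCost:     256 << 20,  // 256MiB total budget; no TTL is involved
		BufferItems: 64,         // Ristretto's recommended Get-buffer size
	})
	if err != nil {
		panic(err)
	}

	// Illustrative key only: the snap token and schema version are part of the
	// key, so a new snapshot or schema version produces fresh keys, and the old
	// entries are later evicted under memory pressure.
	key := "t1|<schema_version>|<snap_token>|check:document:1#view@user:45"
	cache.Set(key, true, 1) // cost of 1 for the demo; real entries are weighted by size
	cache.Wait()            // Set is applied asynchronously

	if allowed, ok := cache.Get(key); ok {
		fmt.Println("cached decision:", allowed)
	}
}
```

The same mechanics explain the note above: a schema version change does not invalidate anything explicitly; superseded keys simply stop being requested and eventually lose the cost competition.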
73
+
53
74
## Distributed Cache
Permify provides a distributed cache across availability zones (within an AWS region) via **Consistent Hashing**, which it uses across its distributed instances to make more efficient use of their individual caches.
@@ -75,6 +96,19 @@ You can learn more about consistent hashing from the following blog post: [Intro
appropriately).
</Note>
### Scaling Events — Adding or Removing Pods
When you scale out (add pods) or scale in (remove pods) in Kubernetes, here is what happens at the cache level:
**Key rebalancing is partial, not global.** The consistent hash ring updates and only the key ranges that mapped to the affected pod(s) need to move. The rest of the ring — and its cached entries — is undisturbed.
**Each pod's cache is local and in-memory.** Permify uses Ristretto as a process-local cache; there is no shared cache layer. This has two practical consequences:
- **Scale-out (new pod joins):** The new pod starts with a cold cache. For the key range now routed to it, requests will miss the cache and fall through to the database until the cache warms up. Expect a temporary increase in database load and response latency immediately after a pod is added.
- **Scale-in (pod removed):** All entries cached in that pod's memory are lost. The key range is reassigned to a remaining pod, which will experience cold-cache behaviour for those keys until they warm up.
A **brief hit-rate drop is normal** during any scale event. The warm-up period depends on your `max_cost` budget and request rate — under typical read-heavy workloads this resolves within minutes.
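
The partial-rebalancing claim is easy to verify with a toy hash ring. The sketch below is not Permify's implementation (real rings also use virtual nodes and different hashing); it only demonstrates that adding one pod remaps a fraction of the keys rather than all of them.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashKey maps a string onto the ring.
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ring is a minimal consistent-hash ring with one point per node
// (production implementations add virtual nodes for smoother balance).
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func newRing(nodes ...string) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		p := hashKey(n)
		r.points = append(r.points, p)
		r.owner[p] = n
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the node owning the first ring point at or after the key's hash.
func (r *ring) lookup(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	before := newRing("pod-a", "pod-b", "pod-c")
	after := newRing("pod-a", "pod-b", "pod-c", "pod-d") // scale-out: one pod added

	moved := 0
	const total = 10000
	for i := 0; i < total; i++ {
		key := fmt.Sprintf("check:%d", i)
		if before.lookup(key) != after.lookup(key) {
			moved++
		}
	}
	fmt.Printf("%d of %d keys remapped after adding a pod\n", moved, total)
}
```

With only a handful of nodes and a single point per node the moved fraction is lumpy, but the point stands: it is well below 100%, so most cached entries stay where they are.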
This page covers operational guidance for running the Watch API at scale. For setup and API details, see the [Watch API reference](/api-reference/watch/watch-changes).
## Performance
Each active Watch stream opens a long-lived connection with a continuous polling loop against the database. At high connection counts, this can result in significant CPU and I/O usage.
### Mitigation Strategies
**1. Fan-in / fan-out architecture**
Instead of each application service or pod opening its own Watch stream:
- Run a **small number of dedicated Watch consumers** (e.g. 2–4 Permify pods receiving Watch streams).
- Distribute permission-change events internally via a pub/sub system (Kafka, Redis Pub/Sub, NATS, etc.) to the rest of your fleet.
This limits the number of concurrent Watch connections to a fixed, controlled count regardless of how many application pods you run.
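
As a rough illustration of the fan-in / fan-out idea, the Go sketch below stands in for the real components: a single placeholder event source plays the role of the dedicated Watch consumer, and in-process channels play the role of the Kafka/NATS/Redis Pub/Sub subscribers.

```go
package main

import (
	"fmt"
	"sync"
)

// changeEvent is a simplified stand-in for a Permify Watch change notification.
type changeEvent struct {
	SnapToken string
	Tuple     string
}

func main() {
	stream := make(chan changeEvent)           // fan-in: fed by the single Watch consumer
	subscribers := make([]chan changeEvent, 3) // fan-out: one per downstream service

	var wg sync.WaitGroup
	for i := range subscribers {
		subscribers[i] = make(chan changeEvent, 16)
		wg.Add(1)
		go func(id int, in <-chan changeEvent) {
			defer wg.Done()
			for ev := range in {
				fmt.Printf("subscriber %d: %s (snap %s)\n", id, ev.Tuple, ev.SnapToken)
			}
		}(i, subscribers[i])
	}

	// Broadcast loop: every event read from the single stream is re-published to all subscribers.
	go func() {
		for ev := range stream {
			for _, sub := range subscribers {
				sub <- ev
			}
		}
		for _, sub := range subscribers {
			close(sub)
		}
	}()

	// Placeholder for the real Watch stream: emit two fake events and stop.
	stream <- changeEvent{SnapToken: "token-1", Tuple: "document:1#viewer@user:42"}
	stream <- changeEvent{SnapToken: "token-2", Tuple: "document:1#viewer@user:43"}
	close(stream)
	wg.Wait()
}
```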
**2. Control mass reconnections**
After a Permify restart or rolling deployment, all Watch clients reconnect simultaneously. Implement:
- **Exponential backoff** — double the wait time after each failed attempt.
- **Jitter** — add a random offset to the backoff to spread reconnects over time.
- **Connection budgets** — limit the maximum reconnect rate per client.
**3. Separate Watch and Check deployments**
Run Watch-heavy workloads on a **dedicated Permify deployment** with its own Horizontal Pod Autoscaler (HPA), separate from the fleet serving Check, LookupEntity, and other read APIs. This prevents Watch load from affecting Check API capacity and vice versa.
### Tuning `watch_buffer_size`
The `database.watch_buffer_size` config key (default: `100`) controls how many pending change events can be queued per Watch stream before back-pressure is applied. If your write rate is high and consumers are slow, increasing this value reduces the risk of events being dropped. See [Database Configurations](/setting-up/configuration#database--database-configurations) for details.
## Stream Disconnection & Reconnection
Watch streams are **pod-specific** and are not handed off when a Permify instance terminates. If a pod running an active Watch stream shuts down (scale-in, rolling restart, node eviction):
- The gRPC stream is terminated.
- Clients **must reconnect** and open a new Watch stream, ideally passing their last received `snap_token` so they can resume from where they left off without replaying the full history.
**Best practices:**
- Store the last received `snap_token` durably (e.g. in Redis or your application database) so reconnects are resumable without data loss.
- Implement **exponential backoff with jitter** on reconnect to avoid a wave of simultaneous reconnections after a rolling deployment or pod restart (a minimal reconnect sketch follows this list).
- Apply a **connection budget** per client to cap the maximum reconnect rate.
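
A minimal sketch of these practices, assuming a hypothetical `openWatchStream` helper in place of the real Permify client call and an in-memory variable in place of durable `snap_token` storage:

```go
package main

import (
	"context"
	"errors"
	"log"
	"math/rand"
	"time"
)

const (
	baseBackoff = 500 * time.Millisecond
	maxBackoff  = 30 * time.Second // acts as a crude per-client connection budget
)

// openWatchStream is a placeholder: open a Watch stream resuming from snapToken
// and block until it drops, returning the last snap token that was received.
func openWatchStream(ctx context.Context, snapToken string) (lastToken string, err error) {
	return snapToken, errors.New("not implemented: call the Permify Watch API here")
}

func watchWithReconnect(ctx context.Context, snapToken string) {
	backoff := baseBackoff
	for {
		last, err := openWatchStream(ctx, snapToken)
		if err == nil {
			snapToken = last      // persist this durably (Redis/DB) before reusing it
			backoff = baseBackoff // healthy stream: reset the backoff window
			continue
		}
		// Full jitter: sleep a random duration in [0, backoff) so that many clients
		// restarting at once do not reconnect in lock-step.
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		log.Printf("watch stream dropped (%v); reconnecting in %s", err, sleep)
		select {
		case <-ctx.Done():
			return
		case <-time.After(sleep):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	watchWithReconnect(ctx, "") // empty token = start from the current snapshot
}
```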
**docs/permify-overview/faqs.mdx** (35 additions)
@@ -80,6 +80,41 @@ We recommend applying the following pattern to safely handle schema changes:
- Centrally check and approve every change before deploying it via a CI pipeline that utilizes the **Write Schema API**. We recommend adding our [schema validator](https://github.com/Permify/permify-validate-action) to the pipeline to ensure that any changes are automatically validated.
- After successful deployment, you can use the newly created schema on further API calls by either specifying its schema ID or omitting the schema ID, in which case the latest schema is retrieved automatically.
#### Changing an Attribute's Type (e.g. `integer` → `boolean`)
Adding a new attribute or relation to your schema is safe — existing data is unaffected. The risk begins when you **change the type of an existing attribute**.
Existing attribute records stored in the database are **not automatically migrated**. If you:
1. Define an attribute as `integer`, insert data, then
2. Change the schema so that the same attribute becomes `boolean`,

the following can happen:
- **New writes** may fail with a type-mismatch error.
- **Existing incompatible records** may cause check errors or result in an implicit deny, because the stored value cannot be evaluated against the new type expectation.
- There is **no automatic data transformation** — Permify does not back-fill or convert stored attribute values when you change the schema.
**Required steps before switching to the new schema version in production:**
1. **Backfill migration** — rewrite all existing attribute records for the affected attribute to values valid under the new type (a rough sketch follows this list).
2. **Gradual rollout** — deploy the new schema to a staging environment first, verify that all checks pass with the migrated data, then promote to production.
3. Only then update clients/services to use the new schema version.
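
The backfill step is necessarily application-specific, but its shape is roughly the following. This is only a sketch: `readIntegerAttributes` and `writeBooleanAttribute` are hypothetical helpers standing in for your own paging reads and Data Write calls, and the integer-to-boolean conversion rule is an example you would replace with your real business rule.

```go
package main

import "fmt"

// attributeRecord is a simplified view of a stored attribute row.
type attributeRecord struct {
	Entity string // e.g. "repository:7"
	Value  int64  // the old integer value
}

// readIntegerAttributes is a hypothetical helper that pages through the
// existing integer-typed records for one attribute.
func readIntegerAttributes(attribute string) []attributeRecord {
	return []attributeRecord{{Entity: "repository:7", Value: 3}}
}

// writeBooleanAttribute is a hypothetical helper that writes the converted
// value back (in practice, a Data Write call under the new schema version).
func writeBooleanAttribute(entity, attribute string, value bool) error {
	fmt.Printf("backfill %s: %s = %v\n", entity, attribute, value)
	return nil
}

func main() {
	const attribute = "has_maintainers" // hypothetical attribute name
	for _, rec := range readIntegerAttributes(attribute) {
		converted := rec.Value > 0 // example conversion rule: non-zero becomes true
		if err := writeBooleanAttribute(rec.Entity, attribute, converted); err != nil {
			panic(err)
		}
	}
}
```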
<Note>
The same caution applies to `rule` definitions inside your schema. Both entity definitions and rules are stored as a single versioned schema document in the `schema_definitions` table — changing a rule's logic or its referenced attribute types follows the same migration discipline described above.
</Note>
### How Are Rules and Entity Definitions Stored in the Database?
Both `entity` blocks and `rule` blocks from your schema are stored together in the `schema_definitions` table — there is no separate physical table for rules. Each row holds a versioned, serialised copy of the full schema document.
The distinction between an entity and a rule exists only at read time:
- `ReadEntityDefinition(...)` — parses the stored document and returns the entity block.
- `ReadRuleDefinition(...)` — parses the same stored document and returns the rule block.
Shared storage, different interpretation.
### What is the Preferred Deployment Pattern For Permify?
| Endpoint | Description |
|---|---|
| `GET /debug/pprof/goroutine` | List all goroutines and their stack traces — useful for detecting goroutine leaks |
| `GET /debug/pprof/` | Index of all available profiles |
You can analyse captured profiles locally using Go's built-in tooling:
```bash
# Download and open a 30-second CPU profile in the interactive UI
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
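# Inspect live heap allocations from the same endpoint (assumes pprof is enabled on :6060 as above)
go tool pprof http://localhost:6060/debug/pprof/heap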
```
**When to enable the profiler:**
- Investigating a CPU spike or unexpectedly high latency in Permify.
- During load or capacity testing to identify bottlenecks before production.
- When suspecting a performance regression after a version upgrade.
**Best practice:** Enable the profiler **temporarily** when needed, then disable it again. Keeping it permanently open in production is not recommended — it exposes an unauthenticated HTTP endpoint and adds a small constant overhead.
<Warning>
The pprof endpoint has no built-in authentication. Restrict network access to it (e.g. via a sidecar, internal network policy, or firewall rule) so it is not reachable from the public internet.
</Warning>
**docs/setting-up/installation/kubernetes.mdx** (11 additions, 1 deletion)
@@ -168,4 +168,14 @@ Let’s apply service.yaml to our nodes.
```
kubectl apply -f service.yaml
```
Last but not least, we can check our pods & nodes, and we can start using the container with the load balancer.
## Operational Notes for Scaling
### Cache Warm-up After Scale Events
Permify's permission cache is **pod-local**. When a pod is added or removed, the key ranges reassigned to the affected pod start cold. Expect a temporary increase in database read load until the cache warms up. See [Cache Mechanisms — Scaling Events](/operations/cache#scaling-events--adding-or-removing-pods) for a detailed explanation.
### Watch API — gRPC Stream Reconnection
Watch streams are pod-specific and are not handed off when a pod terminates. Clients must reconnect after a scale-in or rolling restart event. See [Watch — Operations](/operations/watch#stream-disconnection--reconnection) for client best practices.