**docs/getting-started/sync-data.mdx** (4 additions, 3 deletions)
@@ -360,11 +360,12 @@ curl --location --request POST 'localhost:3476/v1/tenants/{tenant_id}/data/write
</Tabs>
<Note>
The `relation` field in the subject object controls what the subject refers to:

- **Empty (`""`)** — the subject is a direct, individual user (e.g. `user:123`).
- **`"..."`** — used when the subject type is not the `user` entity. It represents the entity itself without a specific sub-relation.
- **A named relation (e.g. `"member"`)** — the subject is a userset: everyone who holds that relation on the subject entity. For example, `document:1#maintainer@organization:2#member` means all members of organization 2 are maintainers of document 1. This is how Permify supports group-based permissions without enumerating individual users.
</Note>
### Organization Members Are Maintainers Of A Specific Doc
**docs/operations/cache.mdx** (34 additions)
@@ -50,6 +50,27 @@ The cache library used is: https://github.com/dgraph-io/ristretto
Note: Another advantage of the MVCC pattern is the ability to store data historically. However, it has the downside of accumulating too many relationships over time. For this, we have developed a garbage collector that deletes old data at an interval you specify.
### Cache Sizing & Eviction (Snap Tokens)
There is **no separate, dedicated cache for snap tokens**; the snap token is simply part of the permission cache key. This means the snap-token "cache" is sized entirely through the same `permission.cache` settings shown above:
| Config key | Purpose |
|---|---|
| `service.permission.cache.max_cost` | Maximum memory budget for the permission cache (e.g. `10MiB`, `256MiB`). This is the effective size limit that gates how many snap-token-keyed entries can reside in memory at once. |
| `service.permission.cache.number_of_counters` | Number of TinyLFU admission counters. A good rule of thumb is ~10× the expected number of unique cached items. |
No TTL is configured by default. Eviction is driven purely by **memory pressure** against `max_cost`, using Ristretto's TinyLFU admission policy combined with a SampledLFU eviction policy. Entries are evicted when new items need space and the budget is exhausted — not after a fixed time window.
<Note>
If you observe high cache miss rates after a schema version change, this is expected: the `schema_version` component of the key changes, making all prior entries stale. Size your `max_cost` to hold a comfortable working set for the most recent schema version in use.
</Note>
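
To make these sizing knobs concrete, here is a minimal Go sketch (not Permify's actual code) of how `number_of_counters` and `max_cost` map onto Ristretto's configuration, and why entries disappear under memory pressure rather than after a TTL. The key string is purely illustrative; Permify's real key layout is internal.

```go
package main

import (
	"fmt"

	"github.com/dgraph-io/ristretto"
)

func main() {
	// number_of_counters -> NumCounters, max_cost -> MaxCost (bytes here).
	cache, err := ristretto.NewCache(&ristretto.Config{
		NumCounters: 10_000_000, // ~10x the expected number of unique cached entries
		MaxCost:     256 << 20,  // 256MiB total budget; no TTL is involved
		BufferItems: 64,         // Ristretto's recommended Get-buffer size
	})
	if err != nil {
		panic(err)
	}

	// Illustrative key only: the snap token and schema version are part of the
	// key, so a new snapshot or schema version produces fresh keys, and the old
	// entries are later evicted under memory pressure.
	key := "t1|<schema_version>|<snap_token>|check:document:1#view@user:45"
	cache.Set(key, true, 1) // cost of 1 for the demo; real entries are weighted by size
	cache.Wait()            // Set is applied asynchronously

	if allowed, ok := cache.Get(key); ok {
		fmt.Println("cached decision:", allowed)
	}
}
```

The same mechanics explain the note above: a schema version change does not invalidate anything explicitly; superseded keys simply stop being requested and eventually lose the cost competition.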
73
+
53
74
## Distributed Cache
Permify provides a distributed cache across availability zones (within an AWS region) via **Consistent Hashing**, which it uses across its distributed instances to make more efficient use of their individual caches.
@@ -75,6 +96,19 @@ You can learn more about consistent hashing from the following blog post: [Intro
appropriately).
</Note>
### Scaling Events — Adding or Removing Pods
When you scale out (add pods) or scale in (remove pods) in Kubernetes, here is what happens at the cache level:
**Key rebalancing is partial, not global.** The consistent hash ring updates and only the key ranges that mapped to the affected pod(s) need to move. The rest of the ring — and its cached entries — is undisturbed.
**Each pod's cache is local and in-memory.** Permify uses Ristretto as a process-local cache; there is no shared cache layer. This has two practical consequences:
- **Scale-out (new pod joins):** The new pod starts with a cold cache. For the key range now routed to it, requests will miss the cache and fall through to the database until the cache warms up. Expect a temporary increase in database load and response latency immediately after a pod is added.
- **Scale-in (pod removed):** All entries cached in that pod's memory are lost. The key range is reassigned to a remaining pod, which will experience cold-cache behaviour for those keys until they warm up.
A **brief hit-rate drop is normal** during any scale event. The warm-up period depends on your `max_cost` budget and request rate — under typical read-heavy workloads this resolves within minutes.
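
The partial-rebalancing claim is easy to verify with a toy hash ring. The sketch below is not Permify's implementation (real rings also use virtual nodes and different hashing); it only demonstrates that adding one pod remaps a fraction of the keys rather than all of them.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashKey maps a string onto the ring.
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ring is a minimal consistent-hash ring with one point per node
// (production implementations add virtual nodes for smoother balance).
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func newRing(nodes ...string) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		p := hashKey(n)
		r.points = append(r.points, p)
		r.owner[p] = n
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the node owning the first ring point at or after the key's hash.
func (r *ring) lookup(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	before := newRing("pod-a", "pod-b", "pod-c")
	after := newRing("pod-a", "pod-b", "pod-c", "pod-d") // scale-out: one pod added

	moved := 0
	const total = 10000
	for i := 0; i < total; i++ {
		key := fmt.Sprintf("check:%d", i)
		if before.lookup(key) != after.lookup(key) {
			moved++
		}
	}
	fmt.Printf("%d of %d keys remapped after adding a pod\n", moved, total)
}
```

With only a handful of nodes and a single point per node the moved fraction is lumpy, but the point stands: it is well below 100%, so most cached entries stay where they are.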
This page covers operational guidance for running the Watch API at scale. For setup and API details, see the [Watch API reference](/api-reference/watch/watch-changes).
## Performance
Each active Watch stream opens a long-lived connection with a continuous polling loop against the database. At high connection counts, this can result in significant CPU and I/O usage.
### Mitigation Strategies
**1. Fan-in / fan-out architecture**
Instead of each application service or pod opening its own Watch stream:
- Run a **small number of dedicated Watch consumers** (e.g. 2–4 Permify pods receiving Watch streams).
- Distribute permission-change events internally via a pub/sub system (Kafka, Redis Pub/Sub, NATS, etc.) to the rest of your fleet.
This limits the number of concurrent Watch connections to a fixed, controlled count regardless of how many application pods you run.
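
As a rough illustration of the fan-in / fan-out idea, the Go sketch below stands in for the real components: a single placeholder event source plays the role of the dedicated Watch consumer, and in-process channels play the role of the Kafka/NATS/Redis Pub/Sub subscribers.

```go
package main

import (
	"fmt"
	"sync"
)

// changeEvent is a simplified stand-in for a Permify Watch change notification.
type changeEvent struct {
	SnapToken string
	Tuple     string
}

func main() {
	stream := make(chan changeEvent)           // fan-in: fed by the single Watch consumer
	subscribers := make([]chan changeEvent, 3) // fan-out: one per downstream service

	var wg sync.WaitGroup
	for i := range subscribers {
		subscribers[i] = make(chan changeEvent, 16)
		wg.Add(1)
		go func(id int, in <-chan changeEvent) {
			defer wg.Done()
			for ev := range in {
				fmt.Printf("subscriber %d: %s (snap %s)\n", id, ev.Tuple, ev.SnapToken)
			}
		}(i, subscribers[i])
	}

	// Broadcast loop: every event read from the single stream is re-published to all subscribers.
	go func() {
		for ev := range stream {
			for _, sub := range subscribers {
				sub <- ev
			}
		}
		for _, sub := range subscribers {
			close(sub)
		}
	}()

	// Placeholder for the real Watch stream: emit two fake events and stop.
	stream <- changeEvent{SnapToken: "token-1", Tuple: "document:1#viewer@user:42"}
	stream <- changeEvent{SnapToken: "token-2", Tuple: "document:1#viewer@user:43"}
	close(stream)
	wg.Wait()
}
```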
**2. Control mass reconnections**
After a Permify restart or rolling deployment, all Watch clients reconnect simultaneously. Implement:
- **Exponential backoff** — double the wait time after each failed attempt.
- **Jitter** — add a random offset to the backoff to spread reconnects over time.
- **Connection budgets** — limit the maximum reconnect rate per client.
**3. Separate Watch and Check deployments**
Run Watch-heavy workloads on a **dedicated Permify deployment** with its own Horizontal Pod Autoscaler (HPA), separate from the fleet serving Check, LookupEntity, and other read APIs. This prevents Watch load from affecting Check API capacity and vice versa.
### Tuning `watch_buffer_size`
The `database.watch_buffer_size` config key (default: `100`) controls how many pending change events can be queued per Watch stream before back-pressure is applied. If your write rate is high and consumers are slow, increasing this value reduces the risk of events being dropped. See [Database Configurations](/setting-up/configuration#database--database-configurations) for details.
## Stream Disconnection & Reconnection
Watch streams are **pod-specific** and are not handed off when a Permify instance terminates. If a pod running an active Watch stream shuts down (scale-in, rolling restart, node eviction):
- The gRPC stream is terminated.
- Clients **must reconnect** and open a new Watch stream, ideally passing their last received `snap_token` so they can resume from where they left off without replaying the full history.
**Best practices:**
- Store the last received `snap_token` durably (e.g. in Redis or your application database) so reconnects are resumable without data loss.
- Implement **exponential backoff with jitter** on reconnect to avoid a wave of simultaneous reconnections after a rolling deployment or pod restart (a minimal reconnect sketch follows this list).
- Apply a **connection budget** per client to cap the maximum reconnect rate.
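
A minimal sketch of these practices, assuming a hypothetical `openWatchStream` helper in place of the real Permify client call and an in-memory variable in place of durable `snap_token` storage:

```go
package main

import (
	"context"
	"errors"
	"log"
	"math/rand"
	"time"
)

const (
	baseBackoff = 500 * time.Millisecond
	maxBackoff  = 30 * time.Second // acts as a crude per-client connection budget
)

// openWatchStream is a placeholder: open a Watch stream resuming from snapToken
// and block until it drops, returning the last snap token that was received.
func openWatchStream(ctx context.Context, snapToken string) (lastToken string, err error) {
	return snapToken, errors.New("not implemented: call the Permify Watch API here")
}

func watchWithReconnect(ctx context.Context, snapToken string) {
	backoff := baseBackoff
	for {
		last, err := openWatchStream(ctx, snapToken)
		if err == nil {
			snapToken = last      // persist this durably (Redis/DB) before reusing it
			backoff = baseBackoff // healthy stream: reset the backoff window
			continue
		}
		// Full jitter: sleep a random duration in [0, backoff) so that many clients
		// restarting at once do not reconnect in lock-step.
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		log.Printf("watch stream dropped (%v); reconnecting in %s", err, sleep)
		select {
		case <-ctx.Done():
			return
		case <-time.After(sleep):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	watchWithReconnect(ctx, "") // empty token = start from the current snapshot
}
```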
**docs/permify-overview/faqs.mdx** (35 additions)
@@ -80,6 +80,41 @@ We recommend applying the following pattern to safely handle schema changes:
- Centrally check and approve every change before deploying it via a CI pipeline that utilizes the **Write Schema API**. We recommend adding our [schema validator](https://github.com/Permify/permify-validate-action) to the pipeline to ensure that any changes are automatically validated.
- After successful deployment, you can use the newly created schema on further API calls by either specifying its schema ID or omitting the schema ID, in which case the latest schema is retrieved automatically.
#### Changing an Attribute's Type (e.g. `integer` → `boolean`)
Adding a new attribute or relation to your schema is safe — existing data is unaffected. The risk begins when you **change the type of an existing attribute**.
Existing attribute records stored in the database are **not automatically migrated**. If you:
1. Define an attribute as `integer`, insert data, then
2. Change the schema so that the same attribute becomes `boolean`,

the following can happen:
- **New writes** may fail with a type-mismatch error.
- **Existing incompatible records** may cause check errors or result in an implicit deny, because the stored value cannot be evaluated against the new type expectation.
- There is **no automatic data transformation** — Permify does not back-fill or convert stored attribute values when you change the schema.
**Required steps before switching to the new schema version in production:**
1. **Backfill migration** — rewrite all existing attribute records for the affected attribute to values valid under the new type (a rough sketch follows this list).
2. **Gradual rollout** — deploy the new schema to a staging environment first, verify that all checks pass with the migrated data, then promote to production.
3. Only then update clients/services to use the new schema version.
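
The backfill step is necessarily application-specific, but its shape is roughly the following. This is only a sketch: `readIntegerAttributes` and `writeBooleanAttribute` are hypothetical helpers standing in for your own paging reads and Data Write calls, and the integer-to-boolean conversion rule is an example you would replace with your real business rule.

```go
package main

import "fmt"

// attributeRecord is a simplified view of a stored attribute row.
type attributeRecord struct {
	Entity string // e.g. "repository:7"
	Value  int64  // the old integer value
}

// readIntegerAttributes is a hypothetical helper that pages through the
// existing integer-typed records for one attribute.
func readIntegerAttributes(attribute string) []attributeRecord {
	return []attributeRecord{{Entity: "repository:7", Value: 3}}
}

// writeBooleanAttribute is a hypothetical helper that writes the converted
// value back (in practice, a Data Write call under the new schema version).
func writeBooleanAttribute(entity, attribute string, value bool) error {
	fmt.Printf("backfill %s: %s = %v\n", entity, attribute, value)
	return nil
}

func main() {
	const attribute = "has_maintainers" // hypothetical attribute name
	for _, rec := range readIntegerAttributes(attribute) {
		converted := rec.Value > 0 // example conversion rule: non-zero becomes true
		if err := writeBooleanAttribute(rec.Entity, attribute, converted); err != nil {
			panic(err)
		}
	}
}
```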
<Note>
The same caution applies to `rule` definitions inside your schema. Both entity definitions and rules are stored as a single versioned schema document in the `schema_definitions` table — changing a rule's logic or its referenced attribute types follows the same migration discipline described above.
</Note>
### How Are Rules and Entity Definitions Stored in the Database?
Both `entity` blocks and `rule` blocks from your schema are stored together in the `schema_definitions` table — there is no separate physical table for rules. Each row holds a versioned, serialised copy of the full schema document.
The distinction between an entity and a rule exists only at read time:
- `ReadEntityDefinition(...)` — parses the stored document and returns the entity block.
- `ReadRuleDefinition(...)` — parses the same stored document and returns the rule block.
Shared storage, different interpretation.
### What is the Preferred Deployment Pattern For Permify?
| Endpoint | Description |
|---|---|
| `GET /debug/pprof/goroutine` | List all goroutines and their stack traces — useful for detecting goroutine leaks |
| `GET /debug/pprof/` | Index of all available profiles |
You can analyse captured profiles locally using Go's built-in tooling:
```bash
# Download and open a 30-second CPU profile in the interactive UI
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
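# Inspect live heap allocations from the same endpoint (assumes pprof is enabled on :6060 as above)
go tool pprof http://localhost:6060/debug/pprof/heap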
```
**When to enable the profiler:**
- Investigating a CPU spike or unexpectedly high latency in Permify.
- During load or capacity testing to identify bottlenecks before production.
- When suspecting a performance regression after a version upgrade.
**Best practice:** Enable the profiler **temporarily** when needed, then disable it again. Keeping it permanently open in production is not recommended — it exposes an unauthenticated HTTP endpoint and adds a small constant overhead.
<Warning>
The pprof endpoint has no built-in authentication. Restrict network access to it (e.g. via a sidecar, internal network policy, or firewall rule) so it is not reachable from the public internet.
</Warning>
**docs/setting-up/installation/kubernetes.mdx** (11 additions, 1 deletion)
@@ -168,4 +168,14 @@ Let’s apply service.yaml to our nodes.
```
kubectl apply -f service.yaml
```
Last but not least, we can check our pods & nodes, and we can start using the container with the load balancer.
## Operational Notes for Scaling
### Cache Warm-up After Scale Events
Permify's permission cache is **pod-local**. When a pod is added or removed, the key ranges reassigned to the affected pod start cold. Expect a temporary increase in database read load until the cache warms up. See [Cache Mechanisms — Scaling Events](/operations/cache#scaling-events--adding-or-removing-pods) for a detailed explanation.
### Watch API — gRPC Stream Reconnection
Watch streams are pod-specific and are not handed off when a pod terminates. Clients must reconnect after a scale-in or rolling restart event. See [Watch — Operations](/operations/watch#stream-disconnection--reconnection) for client best practices.