|
1 | 1 | # Internals |
2 | 2 |
|
3 | | -This document details information about how Compass interfaces with elasticsearch. It is meant to give an overview of how some concepts work internally, to help streamline understanding of how things work under the hood. |
4 | | - |
5 | | -## Index Setup |
6 | | - |
7 | | -There is a migration command in compass to setup all storages. The indices are configured with a camel case tokenizer, to support proper lexing of some resources that use camel case in their nomenclature \(protobuf names for instance\). Given below is a sample of the index settings that are used: |
8 | | - |
9 | | -```javascript |
10 | | -// PUT http://${ES_HOST}/{index} |
11 | | -{ |
12 | | - "mappings": {}, // used for boost |
13 | | - "aliases": { // all indices are aliased to the "universe" index |
14 | | - "universe": {} |
15 | | - }, |
16 | | - "settings": { // configuration for handling camel case text |
17 | | - "analysis": { |
18 | | - "analyzer": { |
19 | | - "default": { |
20 | | - "type": "pattern", |
21 | | - "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])" |
22 | | - } |
23 | | - } |
24 | | - } |
25 | | - } |
26 | | - } |
27 | | -``` |
| 3 | +This document details how Compass works under the hood. It covers the search architecture, storage internals, and multi-tenancy model. |
28 | 4 |
|
29 | | -One shared index is created for all services and tenants but each request(read/write) is routed to a unique shard for each tenant. Compass categorize tenants into two tires, `shared` and `dedicated`. For shared tenants, all the requests will be routed by namespace id over a single shard in an index. For dedicated tenants, each tenant will have its own index. Note, a single index will have N number of `types` same as the number of `Services` supported in Compass. This design will ensure, all the document insert/query requests are only confined to a single shard(in case of shared) or a single index(in case of dedicated). |
30 | | -Details on why we did this is available at [issue #208](https://github.com/raystack/compass/issues/208). |
| 5 | +## Search Architecture |
31 | 6 |
|
32 | | -## Postgres |
| 7 | +All search in Compass is Postgres-native, combining keyword, fuzzy, and semantic strategies with no external search engine dependencies. |
33 | 8 |
|
34 | | -To enforce multi-tenant restrictions at the database level, [Row Level Security](https://www.postgresql.org/docs/current/ddl-rowsecurity.html) is used. RLS requires Postgres users used for application database connection not to be a table owner or a superuser else all RLS are bypassed by default. That means a Postgres user that is migrating the application and a user that is used to serve the app should both be different. |
| 9 | +### Postgres-Native Search |
35 | 10 |
|
36 | | -To create a postgres user |
| 11 | +#### Full-Text Search (tsvector) |
37 | 12 |
|
38 | | -```sql |
39 | | -CREATE USER "compass_user" WITH PASSWORD 'compass'; |
40 | | -GRANT CONNECT ON DATABASE "compass" TO "compass_user"; |
41 | | -GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO "compass_user"; |
42 | | -GRANT ALL ON ALL SEQUENCES IN SCHEMA public TO "compass_user"; |
43 | | -GRANT ALL ON ALL FUNCTIONS IN SCHEMA public TO "compass_user"; |
| 13 | +Entities are indexed using PostgreSQL's built-in full-text search. A `search_vector` generated column is maintained on the entities table with weighted fields: |
44 | 14 |
|
45 | | -ALTER DEFAULT PRIVILEGES IN SCHEMA "public" GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES |
46 | | -ON TABLES TO "compass_user"; |
47 | | -ALTER DEFAULT PRIVILEGES IN SCHEMA "public" GRANT USAGE ON SEQUENCES TO "compass_user"; |
48 | | -ALTER DEFAULT PRIVILEGES IN SCHEMA "public" GRANT EXECUTE ON FUNCTIONS TO "compass_user"; |
49 | | -``` |
| 15 | +- **Weight A:** URN and name (highest relevance) |
| 16 | +- **Weight B:** Description |
| 17 | +- **Weight C:** Source and service metadata |
50 | 18 |
|
51 | | -A middleware looks for `x-namespace` header to extract tenant id if not found falls back to `default` namespace. |
52 | | -Same could be passed in a `jwt token` of Authentication Bearer with `namespace_id` as a claim. |
| 19 | +GIN indexes on the search vector enable fast full-text queries. |
53 | 20 |
|
54 | | -## Search |
| 21 | +#### Fuzzy Matching (pg_trgm) |
55 | 22 |
|
56 | | -We use elasticsearch's `multi_match` search for running our queries. Depending on whether there are additional filter's specified during search, we augment the query with a custom script query that filter's the result set. |
| 23 | +Trigram indexes powered by the `pg_trgm` extension support typo-tolerant and partial matching. This handles cases where users misspell entity names or search with partial terms. |
57 | 24 |
|
58 | | -The script filter is designed to match a document if: |
| 25 | +#### Semantic Search (pgvector) |
59 | 26 |
|
60 | | -- the document contains the filter key and it's value matches the filter value OR |
61 | | -- the document doesn't contain the filter key at all |
| 27 | +Vector embeddings are stored in a chunks table and indexed for cosine similarity search using pgvector. When an entity is created or updated, its semantic content (description, properties, labels) is embedded and stored. Semantic search finds conceptually related entities even when the exact terms don't overlap. |
62 | 28 |
|
63 | | -To demonstrate, the following API call: |
| 29 | +#### Hybrid Ranking |
64 | 30 |
|
65 | | -```text |
66 | | -$ curl http://localhost:8080/v1beta1/search?text=log&filter[landscape]=id |
67 | | -``` |
| 31 | +Results from keyword and semantic search are combined using Reciprocal Rank Fusion (RRF). This produces a single ranked list that balances keyword precision with semantic recall. |
68 | 32 |
|
69 | | -is internally translated to the following elasticsearch query |
70 | | - |
71 | | -```javascript |
72 | | -{ |
73 | | - "query": { |
74 | | - "bool": { |
75 | | - "must": { |
76 | | - "multi_match": { |
77 | | - "query": "log" |
78 | | - } |
79 | | - }, |
80 | | - "filter": [{ |
81 | | - "script": { |
82 | | - "script": { |
83 | | - "source": "doc.containsKey(\"landscape.keyword\") == false || doc[\"landscape.keyword\"].value == \"id\"" |
84 | | - } |
85 | | - } |
86 | | - }] |
87 | | - } |
88 | | - } |
89 | | -} |
90 | | -``` |
| 33 | +## Entity Storage |
91 | 34 |
|
92 | | -Compass also supports filter with fuzzy match with `query` query params. The script query is designed to match a document if: |
| 35 | +### Temporal Model |
93 | 36 |
|
94 | | -- the document contains the filter key and it's value is fuzzily matches the `query` value |
| 37 | +Entities in Compass are temporal. Each entity version carries `valid_from` and `valid_to` timestamps, allowing Compass to track how entities and their properties evolve over time. This supports queries like "what did this entity look like last week" and "what changed in the last 24 hours." |
95 | 38 |
|
96 | | -```text |
97 | | -$ curl http://localhost:8080/v1beta1/search?text=log&filter[landscape]=id |
98 | | -``` |
| 39 | +### Graph Edges |
| 40 | + |
| 41 | +Relationships between entities are stored as typed, directed edges. Each edge has a type (lineage, ownership, documentation, etc.) and optional properties. Edges are also temporal, capturing when relationships were established and when they ended. |
99 | 42 |
|
100 | | -is internally translated to the following elasticsearch query |
101 | | - |
102 | | -```javascript |
103 | | -{ |
104 | | - "query":{ |
105 | | - "bool":{ |
106 | | - "filter":{ |
107 | | - "match":{ |
108 | | - "description":{ |
109 | | - "fuzziness":"AUTO", |
110 | | - "query":"test" |
111 | | - } |
112 | | - } |
113 | | - }, |
114 | | - "should":{ |
115 | | - "bool":{ |
116 | | - "should":[ |
117 | | - { |
118 | | - "multi_match":{ |
119 | | - "fields":[ |
120 | | - "urn^10", |
121 | | - "name^5" |
122 | | - ], |
123 | | - "query":"log" |
124 | | - } |
125 | | - }, |
126 | | - { |
127 | | - "multi_match":{ |
128 | | - "fields":[ |
129 | | - "urn^10", |
130 | | - "name^5" |
131 | | - ], |
132 | | - "fuzziness":"AUTO", |
133 | | - "query":"log" |
134 | | - } |
135 | | - }, |
136 | | - { |
137 | | - "multi_match":{ |
138 | | - "fields":[ |
139 | | - |
140 | | - ], |
141 | | - "fuzziness":"AUTO", |
142 | | - "query":"log" |
143 | | - } |
144 | | - } |
145 | | - ] |
146 | | - } |
147 | | - } |
148 | | - } |
149 | | - }, |
150 | | - "min_score":0.01 |
151 | | -} |
| 43 | +Graph traversal uses recursive Common Table Expressions (CTEs) in PostgreSQL, enabling multi-hop queries without external graph database dependencies. |
| 44 | + |
| 45 | +## PostgreSQL Multi-Tenancy |
| 46 | + |
| 47 | +To enforce multi-tenant restrictions at the database level, Compass uses [Row Level Security](https://www.postgresql.org/docs/current/ddl-rowsecurity.html) (RLS). Postgres bypasses RLS policies by default for superusers and table owners, so the user the application connects as must be neither. In practice, the Postgres user that runs migrations and the user that serves the app should be different. |
| 48 | + |
| 49 | +To create the Postgres user that serves the app: |
| 50 | + |
| 51 | +```sql |
| 52 | +CREATE USER "compass_user" WITH PASSWORD 'compass'; |
| 53 | +GRANT CONNECT ON DATABASE "compass" TO "compass_user"; |
| 54 | +GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO "compass_user"; |
| 55 | +GRANT ALL ON ALL SEQUENCES IN SCHEMA public TO "compass_user"; |
| 56 | +GRANT ALL ON ALL FUNCTIONS IN SCHEMA public TO "compass_user"; |
| 57 | + |
| 58 | +ALTER DEFAULT PRIVILEGES IN SCHEMA "public" GRANT SELECT, INSERT, UPDATE, DELETE, REFERENCES |
| 59 | +ON TABLES TO "compass_user"; |
| 60 | +ALTER DEFAULT PRIVILEGES IN SCHEMA "public" GRANT USAGE ON SEQUENCES TO "compass_user"; |
| 61 | +ALTER DEFAULT PRIVILEGES IN SCHEMA "public" GRANT EXECUTE ON FUNCTIONS TO "compass_user"; |
152 | 62 | ``` |
| 63 | + |
| 64 | +A middleware looks for the `x-namespace` header to extract the tenant ID. If the header is not found, it falls back to the `default` namespace. The tenant ID can also be passed as a `namespace_id` claim in a JWT bearer token. |
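
The namespace lookup described in the last added paragraph can be sketched as follows. This is a minimal illustration, not Compass's actual middleware: the function names are hypothetical, the precedence (header first, then JWT claim, then `default`) is an assumption the text does not spell out, and the token is assumed to arrive in the standard `Authorization: Bearer` header. The sketch decodes the JWT payload without verifying its signature, which a real service must not do.

```python
import base64
import json

DEFAULT_NAMESPACE = "default"


def _claims_from_bearer(auth_header):
    """Return the JWT payload claims from a 'Bearer <token>' header value.

    Illustration only: the signature is NOT verified here.
    """
    scheme, _, token = auth_header.partition(" ")
    if scheme.lower() != "bearer":
        return {}
    parts = token.strip().split(".")
    if len(parts) != 3:  # a JWT is header.payload.signature
        return {}
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return {}
    return claims if isinstance(claims, dict) else {}


def resolve_namespace(headers):
    """Pick the tenant namespace for a request from its headers."""
    # HTTP header names are case-insensitive, so normalize the keys.
    normalized = {name.lower(): value for name, value in headers.items()}

    # 1. Explicit header (assumed to take precedence).
    explicit = normalized.get("x-namespace", "").strip()
    if explicit:
        return explicit

    # 2. `namespace_id` claim in the bearer token.
    claims = _claims_from_bearer(normalized.get("authorization", ""))
    claim = claims.get("namespace_id")
    if isinstance(claim, str) and claim:
        return claim

    # 3. Fall back to the default namespace.
    return DEFAULT_NAMESPACE
```

Calling `resolve_namespace({})` returns `"default"`, and a request carrying only `X-Namespace: acme` resolves to `"acme"`. Whatever value is resolved would then need to be made visible to the RLS policies for that connection; how Compass does this is not covered in this diff.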