
Commit ef754c7

Merge pull request #447 from dahlia/main
Move remote reply scraping to a worker
2 parents 83b62e2 + d4d4a39 commit ef754c7

25 files changed

Lines changed: 15810 additions & 78 deletions

AGENTS.md

Lines changed: 20 additions & 15 deletions
@@ -382,21 +382,26 @@ STORAGE_URL_BASE=https://your-bucket.s3.amazonaws.com
382382

383383
### Optional variables
384384

385-
| Variable | Default | Description |
386-
| ------------------------------ | ------- | ---------------------------------------- |
387-
| `PORT` | 3000 | Server port |
388-
| `BIND` | - | Bind address |
389-
| `NODE_TYPE` | all | Node type: `all`, `web`, or `worker` |
390-
| `BEHIND_PROXY` | false | Trust proxy headers |
391-
| `LOG_LEVEL` | info | Logging level |
392-
| `LOG_QUERY` | false | Log database queries |
393-
| `LOG_FILE` | - | JSON log file path |
394-
| `SENTRY_DSN` | - | Sentry error tracking |
395-
| `HOME_URL` | - | Home page redirect URL |
396-
| `ALLOW_PRIVATE_ADDRESS` | false | Disable SSRF protection |
397-
| `REMOTE_ACTOR_FETCH_POSTS` | 10 | Posts to fetch from remote actors |
398-
| `REMOTE_ACTOR_STALENESS_DAYS` | 7 | Days before remote actor data is stale |
399-
| `REFRESH_ACTORS_ON_INTERACTION`| false | Refresh actors on all activity types |
385+
| Variable | Default | Description |
386+
| ---------------------------------------- | ------- | ---------------------------------------- |
387+
| `PORT` | 3000 | Server port |
388+
| `BIND` | - | Bind address |
389+
| `NODE_TYPE` | all | Node type: `all`, `web`, or `worker` |
390+
| `BEHIND_PROXY` | false | Trust proxy headers |
391+
| `LOG_LEVEL` | info | Logging level |
392+
| `LOG_QUERY` | false | Log database queries |
393+
| `LOG_FILE` | - | JSON log file path |
394+
| `SENTRY_DSN` | - | Sentry error tracking |
395+
| `HOME_URL` | - | Home page redirect URL |
396+
| `ALLOW_PRIVATE_ADDRESS` | false | Disable SSRF protection |
397+
| `REMOTE_ACTOR_FETCH_POSTS` | 10 | Posts to fetch from remote actors |
398+
| `REMOTE_ACTOR_STALENESS_DAYS` | 7 | Days before remote actor data is stale |
399+
| `REFRESH_ACTORS_ON_INTERACTION` | false | Refresh actors on all activity types |
400+
| `REMOTE_REPLIES_SCRAPE_DEPTH` | 2 | Reply scraping depth for remote posts |
401+
| `REMOTE_REPLIES_SCRAPE_MAX_ITEMS` | 100 | Replies to process per scraping job |
402+
| `REMOTE_REPLIES_SCRAPE_INTERVAL_SECONDS` | 5 | Delay between scrape requests per origin |
403+
| `REMOTE_REPLIES_SCRAPE_BACKOFF_SECONDS` | 300 | Backoff for 429 without `Retry-After` |
404+
| `REMOTE_REPLIES_SCRAPE_COOLDOWN_SECONDS` | 300 | Completed scrape deduplication window |
400405

401406

402407
Adding new environment variables

CHANGES.md

Lines changed: 16 additions & 0 deletions
@@ -38,6 +38,20 @@ To be released.
3838
- Preview card scraping, media processing, and authentication helpers now
3939
load on demand instead of eagerly during route registration.
4040

41+
- Moved remote replies scraping from synchronous post ingestion to a
42+
rate-limited background worker. Remote posts now enqueue reply collection
43+
scraping jobs instead of fetching nested replies inline, which prevents
44+
slow or very large remote reply collections from delaying federation
45+
processing. [[#445], [#447]]
46+
47+
- Added per-origin throttling and `429 Too Many Requests` backoff for
48+
remote replies scraping.
49+
- Added bounded scraping controls:
50+
`REMOTE_REPLIES_SCRAPE_DEPTH`, `REMOTE_REPLIES_SCRAPE_MAX_ITEMS`,
51+
`REMOTE_REPLIES_SCRAPE_INTERVAL_SECONDS`,
52+
`REMOTE_REPLIES_SCRAPE_BACKOFF_SECONDS`, and
53+
`REMOTE_REPLIES_SCRAPE_COOLDOWN_SECONDS`.
54+
4155
- Added automatic refresh of stale remote actor profiles. When receiving
4256
activities like `Announce` or `Create(Note)`, Hollo now checks if the
4357
actor's cached data is stale and asynchronously refreshes their profile
@@ -137,6 +151,8 @@ To be released.
137151
[#425]: https://github.com/fedify-dev/hollo/issues/425
138152
[#435]: https://github.com/fedify-dev/hollo/issues/435
139153
[#436]: https://github.com/fedify-dev/hollo/pull/436
154+
[#445]: https://github.com/fedify-dev/hollo/issues/445
155+
[#447]: https://github.com/fedify-dev/hollo/pull/447
140156
[Fedify debugger]: https://fedify.dev/manual/debug
141157

142158
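The per-origin throttling and `429 Too Many Requests` backoff described in the changelog entry above could be sketched roughly as follows. This is a minimal illustration, not Hollo's actual implementation: `OriginThrottle`, `parseRetryAfter`, and the option names are hypothetical, and the current time is passed in explicitly so the logic is easy to test.

```typescript
// Hypothetical sketch; names and structure are assumptions, not Hollo's code.
interface ThrottleOptions {
  intervalSeconds: number; // cf. REMOTE_REPLIES_SCRAPE_INTERVAL_SECONDS
  backoffSeconds: number; // cf. REMOTE_REPLIES_SCRAPE_BACKOFF_SECONDS
}

/** Parse Retry-After (delta-seconds or HTTP-date) into seconds, if valid. */
function parseRetryAfter(value: string | null, now: number): number | undefined {
  if (value == null) return undefined;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed);
  const date = Date.parse(trimmed);
  if (Number.isNaN(date)) return undefined;
  return Math.max(0, Math.ceil((date - now) / 1000));
}

class OriginThrottle {
  // Earliest time (ms since epoch) at which each origin may be contacted again.
  private nextAllowed = new Map<string, number>();
  private options: ThrottleOptions;

  constructor(options: ThrottleOptions) {
    this.options = options;
  }

  /** Milliseconds to wait before the next request to this URL's origin. */
  delayFor(url: string, now: number): number {
    const origin = new URL(url).origin;
    return Math.max(0, (this.nextAllowed.get(origin) ?? 0) - now);
  }

  /** Record a request so the next one to the same origin honors the interval. */
  recordRequest(url: string, now: number): void {
    const origin = new URL(url).origin;
    const next = now + this.options.intervalSeconds * 1000;
    // Never shorten a longer backoff that is already in effect.
    this.nextAllowed.set(origin, Math.max(next, this.nextAllowed.get(origin) ?? 0));
  }

  /** Record a 429 response: prefer Retry-After, else the fallback backoff. */
  recordTooManyRequests(url: string, retryAfter: string | null, now: number): void {
    const origin = new URL(url).origin;
    const seconds = parseRetryAfter(retryAfter, now) ?? this.options.backoffSeconds;
    this.nextAllowed.set(origin, now + seconds * 1000);
  }
}
```

Keying the state by `URL.origin` means two posts on the same server share one delay, while requests to unrelated servers are not held back.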

bin/server.ts

Lines changed: 7 additions & 1 deletion
@@ -71,14 +71,17 @@ if (NODE_TYPE === "worker" || NODE_TYPE === "all") {
7171
{ federation },
7272
{ startImportWorker, stopImportWorker },
7373
{ startCleanupWorker, stopCleanupWorker },
74+
{ startRemoteReplyScrapeWorker, stopRemoteReplyScrapeWorker },
7475
] = await Promise.all([
7576
import("../src/federation"),
7677
import("../src/import/worker"),
7778
import("../src/cleanup/worker"),
79+
import("../src/federation/replies-worker"),
7880
]);
7981
stopWorkers = () => {
8082
stopImportWorker();
8183
stopCleanupWorker();
84+
stopRemoteReplyScrapeWorker();
8285
};
8386

8487
// Start the Fedify message queue
@@ -93,8 +96,11 @@ if (NODE_TYPE === "worker" || NODE_TYPE === "all") {
9396
// Start the workers for background job processing
9497
startImportWorker();
9598
startCleanupWorker();
99+
startRemoteReplyScrapeWorker();
96100

97-
console.log("Worker started (Fedify queue + Import worker + Cleanup worker)");
101+
console.log(
102+
"Worker started (Fedify queue + Import worker + Cleanup worker + Remote reply scrape worker)",
103+
);
98104
}
99105

100106
// Graceful shutdown handling
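The diff above only shows that the new worker exposes a `start`/`stop` pair like the import and cleanup workers; the contents of `src/federation/replies-worker` are not shown here. As a rough illustration of what such a pair might wrap, here is a minimal polling-loop sketch. `createPollingWorker`, `processNext`, and `pollMs` are hypothetical names, and the real worker may be structured quite differently.

```typescript
// Hypothetical sketch of a start/stop worker pair; not Hollo's implementation.
type Job = () => Promise<void>;

function createPollingWorker(processNext: Job, pollMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let running = false;

  // Process one job, then schedule the next tick unless stop() was called.
  async function tick(): Promise<void> {
    try {
      await processNext();
    } catch (error) {
      // A failing job should not kill the loop.
      console.error("Job failed:", error);
    }
    if (running) timer = setTimeout(tick, pollMs);
  }

  return {
    /** Begin polling; calling start() twice has no extra effect. */
    start(): void {
      if (running) return;
      running = true;
      timer = setTimeout(tick, 0);
    },
    /** Stop scheduling further ticks; an in-flight job is allowed to finish. */
    stop(): void {
      running = false;
      if (timer !== undefined) clearTimeout(timer);
      timer = undefined;
    },
    isRunning: (): boolean => running,
  };
}
```

Clearing the pending timer in `stop()` is what would let a graceful shutdown handler exit promptly instead of waiting for the next poll.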

docs/src/content/docs/install/env.mdx

Lines changed: 38 additions & 2 deletions
@@ -46,9 +46,11 @@ e.g., `UTC`, `America/New_York`, `Asia/Tokyo`.
4646

4747
Controls which components run in this process. Valid values are:
4848

49-
- `all` (default): Run web server, Fedify message queue, import worker and cleanup worker
49+
- `all` (default): Run web server, Fedify message queue, import worker,
50+
cleanup worker and remote replies scrape worker
5051
- `web`: Run only the web server (HTTP API)
51-
- `worker`: Run only workers (Fedify message queue + import worker + cleanup worker)
52+
- `worker`: Run only workers (Fedify message queue + import worker +
53+
cleanup worker + remote replies scrape worker)
5254

5355
This allows separating the web server from background workers for better
5456
scalability. When running high-traffic instances with many followers,
@@ -107,6 +109,40 @@ encountered first time.
107109

108110
`10` by default.
109111

112+
### `REMOTE_REPLIES_SCRAPE_DEPTH` <Badge text="Optional" />
113+
114+
The number of remote reply levels to scrape in background worker jobs.
115+
Set this to `0` to disable remote replies scraping.
116+
117+
`2` by default.
118+
119+
### `REMOTE_REPLIES_SCRAPE_MAX_ITEMS` <Badge text="Optional" />
120+
121+
The maximum number of reply items to persist from a single remote replies
122+
scraping job.
123+
124+
`100` by default.
125+
126+
### `REMOTE_REPLIES_SCRAPE_INTERVAL_SECONDS` <Badge text="Optional" />
127+
128+
The minimum delay between remote replies scraping requests to the same origin.
129+
130+
`5` by default.
131+
132+
### `REMOTE_REPLIES_SCRAPE_BACKOFF_SECONDS` <Badge text="Optional" />
133+
134+
The fallback delay before retrying a remote replies scraping job after an HTTP
135+
429 response when the remote server does not provide `Retry-After`.
136+
137+
`300` by default.
138+
139+
### `REMOTE_REPLIES_SCRAPE_COOLDOWN_SECONDS` <Badge text="Optional" />
140+
141+
The time window during which completed remote replies scraping jobs suppress
142+
duplicate jobs for the same replies collection.
143+
144+
`300` by default.
145+
110146
### `REMOTE_ACTOR_STALENESS_DAYS` <Badge text="Optional" />
111147

112148
The number of days after which a remote actor's cached data is considered stale.
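The five `REMOTE_REPLIES_SCRAPE_*` variables documented above, with their defaults, could be read into a typed configuration object along these lines. This is a sketch under assumptions: `loadRemoteRepliesScrapeConfig` and `readNonNegativeInt` are hypothetical helpers, and rejecting invalid values with an error is a choice made for this example, since the source does not say how Hollo treats them.

```typescript
// Hypothetical sketch; helper names and validation rules are assumptions.
interface RemoteRepliesScrapeConfig {
  depth: number;
  maxItems: number;
  intervalSeconds: number;
  backoffSeconds: number;
  cooldownSeconds: number;
}

type Env = Record<string, string | undefined>;

/** Read a non-negative integer from env, falling back when unset or empty. */
function readNonNegativeInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw == null || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

/** Build the config using the documented defaults (2, 100, 5, 300, 300). */
function loadRemoteRepliesScrapeConfig(env: Env): RemoteRepliesScrapeConfig {
  return {
    depth: readNonNegativeInt(env, "REMOTE_REPLIES_SCRAPE_DEPTH", 2),
    maxItems: readNonNegativeInt(env, "REMOTE_REPLIES_SCRAPE_MAX_ITEMS", 100),
    intervalSeconds: readNonNegativeInt(env, "REMOTE_REPLIES_SCRAPE_INTERVAL_SECONDS", 5),
    backoffSeconds: readNonNegativeInt(env, "REMOTE_REPLIES_SCRAPE_BACKOFF_SECONDS", 300),
    cooldownSeconds: readNonNegativeInt(env, "REMOTE_REPLIES_SCRAPE_COOLDOWN_SECONDS", 300),
  };
}

/** A depth of 0 disables remote replies scraping, per the documentation. */
function isScrapingEnabled(config: RemoteRepliesScrapeConfig): boolean {
  return config.depth > 0;
}
```

In a Node.js process the `env` argument would typically be `process.env`; taking it as a parameter keeps the function easy to exercise with plain objects.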

docs/src/content/docs/install/workers.mdx

Lines changed: 11 additions & 10 deletions
@@ -30,21 +30,22 @@ Worker separation is beneficial when:
3030
How it works
3131
------------
3232

33-
Hollo has three main components:
33+
Hollo has five main components:
3434

3535
1. *Web server*: Handles HTTP requests (API, web UI)
3636
2. *Fedify message queue*: Processes ActivityPub inbox/outbox messages
3737
3. *Import worker*: Handles background data import jobs
3838
4. *Cleanup worker*: Handles background cleanup jobs
39+
5. *Remote replies scrape worker*: Fetches remote replies collections slowly
3940

40-
By default (`NODE_TYPE=all`), all three run in a single process. You can
41+
By default (`NODE_TYPE=all`), all five run in a single process. You can
4142
separate them using the `NODE_TYPE` environment variable:
4243

43-
| `NODE_TYPE` | Web server | Fedify queue | Import worker | Cleanup worker |
44-
| ----------- | ---------- | ------------ | ------------- | -------------- |
45-
| `all` (default) | ✅ | ✅ | ✅ | ✅ |
46-
| `web` | ✅ | ❌ | ❌ | ❌ |
47-
| `worker` | ❌ | ✅ | ✅ | ✅ |
44+
| `NODE_TYPE` | Web server | Fedify queue | Import worker | Cleanup worker | Remote replies scrape worker |
45+
| ----------- | ---------- | ------------ | ------------- | -------------- | ---------------------------- |
46+
| `all` (default) | ✅ | ✅ | ✅ | ✅ | ✅ |
47+
| `web` | ✅ | ❌ | ❌ | ❌ | ❌ |
48+
| `worker` | ❌ | ✅ | ✅ | ✅ | ✅ |
4849

4950
All nodes share the same PostgreSQL database, which acts as the message queue
5051
backend using `LISTEN`/`NOTIFY` for real-time message delivery.
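The mapping from `NODE_TYPE` to running components described above can be written down as a small function. This is only an illustrative sketch: `parseNodeType` and `componentsFor` are hypothetical names, and throwing on an unknown value is an assumption of this example. The two conditions mirror the `NODE_TYPE === "worker" || NODE_TYPE === "all"` check visible in `bin/server.ts`.

```typescript
// Hypothetical sketch of the NODE_TYPE-to-component mapping; not Hollo's code.
type NodeType = "all" | "web" | "worker";

type Component =
  | "web server"
  | "Fedify message queue"
  | "import worker"
  | "cleanup worker"
  | "remote replies scrape worker";

const BACKGROUND_COMPONENTS: Component[] = [
  "Fedify message queue",
  "import worker",
  "cleanup worker",
  "remote replies scrape worker",
];

/** Interpret the raw environment value; unset or empty means `all`. */
function parseNodeType(value: string | undefined): NodeType {
  if (value == null || value === "") return "all";
  if (value === "all" || value === "web" || value === "worker") return value;
  throw new Error(`Invalid NODE_TYPE: ${value}`);
}

/** List the components a process of the given type would run. */
function componentsFor(nodeType: NodeType): Component[] {
  const components: Component[] = [];
  if (nodeType === "web" || nodeType === "all") components.push("web server");
  if (nodeType === "worker" || nodeType === "all") {
    components.push(...BACKGROUND_COMPONENTS);
  }
  return components;
}
```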
@@ -253,11 +254,11 @@ sudo journalctl -u hollo-worker -f
253254
When a worker node starts, you should see:
254255

255256
```
256-
Worker started (Fedify queue + Import worker + Cleanup worker)
257+
Worker started (Fedify queue + Import worker + Cleanup worker + Remote reply scrape worker)
257258
```
258259

259-
Watch for messages about processing activities and import and cleanup jobs to
260-
confirm the worker is functioning correctly.
260+
Watch for messages about processing activities, imports, cleanups, and remote
261+
replies scraping to confirm the worker is functioning correctly.
261262

262263

263264
Troubleshooting

docs/src/content/docs/ja/install/env.mdx

Lines changed: 37 additions & 2 deletions
@@ -47,9 +47,11 @@ openssl rand -hex 32
4747

4848
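Controls which components run in this process. Valid values are:
4949

50-
- `all` (default): Run the web server, Fedify message queue, and import worker
50+
- `all` (default): Run the web server, Fedify message queue,
51+
import worker, cleanup worker, and remote replies scrape worker
5152
- `web`: Run only the web server (HTTP API)
52-
- `worker`: Run only workers (Fedify message queue + import worker)
53+
- `worker`: Run only workers (Fedify message queue + import worker +
54+
cleanup worker + remote replies scrape worker)
5355

5456
This allows separating the web server from background workers for better scalability.
5557
When running a high-traffic instance with many followers, separating workers improves performance.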
@@ -106,6 +108,39 @@ HolloがL7ロードバランサーの後ろにある場合(通常はそうす
106108

107109
`10` by default.
108110

111+
### `REMOTE_REPLIES_SCRAPE_DEPTH` <Badge text="Optional" />
112+
113+
The number of remote reply levels to fetch in background worker jobs.
114+
Set this to `0` to disable remote replies fetching.
115+
116+
`2` by default.
117+
118+
### `REMOTE_REPLIES_SCRAPE_MAX_ITEMS` <Badge text="Optional" />
119+
120+
The maximum number of reply items to save from a single remote replies fetching job.
121+
122+
`100` by default.
123+
124+
### `REMOTE_REPLIES_SCRAPE_INTERVAL_SECONDS` <Badge text="Optional" />
125+
126+
The minimum number of seconds to wait between remote replies fetching requests to the same origin.
127+
128+
`5` by default.
129+
130+
### `REMOTE_REPLIES_SCRAPE_BACKOFF_SECONDS` <Badge text="Optional" />
131+
132+
The number of seconds to wait before retrying a remote replies fetching job
133+
when the remote server returns HTTP 429 without `Retry-After`.
134+
135+
`300` by default.
136+
137+
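### `REMOTE_REPLIES_SCRAPE_COOLDOWN_SECONDS` <Badge text="Optional" />
138+
139+
The length of time during which a completed remote replies fetching job
140+
suppresses duplicate jobs for the same replies collection.
141+
142+
`300` by default.
143+
109144
### `REMOTE_ACTOR_STALENESS_DAYS` <Badge text="Optional" />
110145

111146
The number of days until a remote actor's cached data is considered stale.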

docs/src/content/docs/ja/install/workers.mdx

Lines changed: 12 additions & 9 deletions
@@ -30,20 +30,22 @@ import { Aside, Code, Tabs, TabItem } from '@astrojs/starlight/components';
3030
How it works
3131
------------
3232

33-
Hollo consists of three main components:
33+
Hollo consists of five main components:
3434

3535
1. *Web server*: Handles HTTP requests (API, web UI)
3636
2. *Fedify message queue*: Processes ActivityPub inbox/outbox messages
3737
3. *Import worker*: Handles background data import jobs
38+
4. *Cleanup worker*: Handles background cleanup jobs
39+
5. *Remote replies scrape worker*: Fetches remote replies collections slowly
3840

39-
By default (`NODE_TYPE=all`), all three run in a single process.
41+
By default (`NODE_TYPE=all`), all five run in a single process.
4042
You can separate them using the `NODE_TYPE` environment variable:
4143

42-
| `NODE_TYPE` | Web server | Fedify queue | Import worker |
43-
| ----------- | ---------- | ------------ | ------------- |
44-
| `all` (default) | ✅ | ✅ | ✅ |
45-
| `web` | ✅ | ❌ | ❌ |
46-
| `worker` | ❌ | ✅ | ✅ |
44+
| `NODE_TYPE` | Web server | Fedify queue | Import worker | Cleanup worker | Replies worker |
45+
| ----------- | ---------- | ------------ | ------------- | -------------- | -------------- |
46+
| `all` (default) | ✅ | ✅ | ✅ | ✅ | ✅ |
47+
| `web` | ✅ | ❌ | ❌ | ❌ | ❌ |
48+
| `worker` | ❌ | ✅ | ✅ | ✅ | ✅ |
4749

4850
All nodes share the same PostgreSQL database and,
4951
using `LISTEN`/`NOTIFY` for real-time message delivery,
@@ -251,11 +253,12 @@ sudo journalctl -u hollo-worker -f
251253
When a worker node starts, you should see:
252254

253255
```
254-
Worker started (Fedify queue + Import worker)
256+
Worker started (Fedify queue + Import worker + Cleanup worker + Remote reply scrape worker)
255257
```
256258

257259
To confirm the worker is functioning correctly,
258-
watch for messages about activity processing and import jobs.
260+
watch for messages about activity processing, imports, cleanups, and remote
261+
replies fetching.
259262

260263

261264
Troubleshooting

docs/src/content/docs/ko/install/env.mdx

Lines changed: 37 additions & 2 deletions
@@ -46,9 +46,11 @@ openssl rand -hex 32
4646

4747
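Controls which components run in this process. Valid values are:
4848

49-
- `all` (default): Run the web server, Fedify message queue, and import worker
49+
- `all` (default): Run the web server, Fedify message queue, import worker, cleanup worker,
50+
and remote replies scrape worker
5051
- `web`: Run only the web server (HTTP API)
51-
- `worker`: Run only workers (Fedify message queue + import worker)
52+
- `worker`: Run only workers (Fedify message queue + import worker + cleanup worker +
53+
remote replies scrape worker)
5254

5355
This allows separating the web server from background workers for better scalability.
5456
When running a high-traffic instance with many followers, separating workers can improve performance.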
@@ -104,6 +106,39 @@ Hollo가 L7 로드 밸런서 뒤에 위치할 경우 (일반적으로 그래야
104106

105107
`10` by default.
106108

109+
### `REMOTE_REPLIES_SCRAPE_DEPTH` <Badge text="Optional" />
110+
111+
The number of remote reply levels to fetch in background worker jobs.
112+
Set this to `0` to disable remote replies fetching.
113+
114+
`2` by default.
115+
116+
### `REMOTE_REPLIES_SCRAPE_MAX_ITEMS` <Badge text="Optional" />
117+
118+
The maximum number of reply items to save from a single remote replies fetching job.
119+
120+
`100` by default.
121+
122+
### `REMOTE_REPLIES_SCRAPE_INTERVAL_SECONDS` <Badge text="Optional" />
123+
124+
The minimum wait time, in seconds, between remote replies fetching requests to the same origin.
125+
126+
`5` by default.
127+
128+
### `REMOTE_REPLIES_SCRAPE_BACKOFF_SECONDS` <Badge text="Optional" />
129+
130+
The time, in seconds, to wait before retrying a remote replies fetching
131+
job when the remote server returns HTTP 429 without `Retry-After`.
132+
133+
`300` by default.
134+
135+
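### `REMOTE_REPLIES_SCRAPE_COOLDOWN_SECONDS` <Badge text="Optional" />
136+
137+
The time, in seconds, during which a completed remote replies fetching job
138+
suppresses duplicate jobs for the same replies collection.
139+
140+
`300` by default.
141+
107142
### `REMOTE_ACTOR_STALENESS_DAYS` <Badge text="Optional" />
108143

109144
The number of days until a remote actor's cached data is considered stale.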

docs/src/content/docs/ko/install/workers.mdx

Lines changed: 12 additions & 9 deletions
@@ -30,20 +30,22 @@ import { Aside, Code, Tabs, TabItem } from '@astrojs/starlight/components';
3030
How it works
3131
---------
3232

33-
Hollo consists of three main components:
33+
Hollo consists of five main components:
3434

3535
1. *Web server*: Handles HTTP requests (API, web UI)
3636
2. *Fedify message queue*: Processes ActivityPub inbox/outbox messages
3737
3. *Import worker*: Handles background data import jobs
38+
4. *Cleanup worker*: Handles background cleanup jobs
39+
5. *Remote replies scrape worker*: Fetches remote replies collections slowly
3840

39-
By default (`NODE_TYPE=all`), all three run in a single process.
41+
By default (`NODE_TYPE=all`), all five run in a single process.
4042
You can separate them using the `NODE_TYPE` environment variable:
4143

42-
| `NODE_TYPE` | Web server | Fedify queue | Import worker |
43-
| ----------- | ---------- | ------------ | ------------- |
44-
| `all` (default) | ✅ | ✅ | ✅ |
45-
| `web` | ✅ | ❌ | ❌ |
46-
| `worker` | ❌ | ✅ | ✅ |
44+
| `NODE_TYPE` | Web server | Fedify queue | Import worker | Cleanup worker | Replies worker |
45+
| ----------- | ---------- | ------------ | ------------- | -------------- | -------------- |
46+
| `all` (default) | ✅ | ✅ | ✅ | ✅ | ✅ |
47+
| `web` | ✅ | ❌ | ❌ | ❌ | ❌ |
48+
| `worker` | ❌ | ✅ | ✅ | ✅ | ✅ |
4749

4850
All nodes share the same PostgreSQL database,
4951
which acts as the message queue backend using `LISTEN`/`NOTIFY` for real-time message delivery.
@@ -250,10 +252,11 @@ sudo journalctl -u hollo-worker -f
250252
When a worker node starts, you should see:
251253

252254
```
253-
Worker started (Fedify queue + Import worker)
255+
Worker started (Fedify queue + Import worker + Cleanup worker + Remote reply scrape worker)
254256
```
255257

256-
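To confirm the worker is functioning correctly, watch for messages about activity processing and import jobs.
258+
To confirm the worker is functioning correctly, watch for messages about activity processing, imports, cleanups,
259+
and remote replies fetching.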
257260

258261

259262
Troubleshooting
