ArchiveBox can be configured using the archivebox config command, by modifying the ArchiveBox.conf file in the data folder, or via environment variables. All three methods work equivalently, including when running under Docker.
Some equivalent examples of setting a configuration option:
archivebox config --set CHROME_BINARY=google-chrome-stable
# OR
echo "CHROME_BINARY=google-chrome-stable" >> ArchiveBox.conf
# OR
env CHROME_BINARY=google-chrome-stable archivebox add ~/Downloads/bookmarks_export.html
Environment variables take precedence over the config file, which is useful if you only want to apply an option temporarily during a single run. For more examples see Usage: Configuration...
Available Configuration Options:
- General Settings: Archiving process, output format, and timing.
- Server Settings: Web UI, authentication, and reverse proxy options.
- Storage Settings: File layout, permissions, and temp directories.
- Search Settings: Full-text search backend configuration.
- Shell Options: Format & behavior of CLI output.
- Plugin Settings: Per-plugin configuration options.
In case this document is ever out of date, check the source code for config definitions: archivebox/config/common.py ➡️
General options around the archiving process, output format, and timing.
Possible Values: [True]/False
Toggle whether to recheck old links when adding new ones, or to leave old incomplete links alone and only archive the new links.
By default, ArchiveBox will only archive new links on each import. If you want it to go back through all links in the index and download any missing files on every run, set this to False.
Note: Regardless of how this is set, ArchiveBox will never re-download sites that have already succeeded previously. When this is False it only attempts to fill in missing archive extractor outputs for previously-seen pages; it does not re-archive pages that have already been successfully archived.
Possible Values: [False]/True
When set to True, ArchiveBox will re-archive URLs even if they have already been successfully archived before, overwriting any existing output.
Possible Values: [60]/120/...
Maximum allowed download time per archive method for each link in seconds. If you have a slow network connection or are seeing frequent timeout errors, you can raise this value.
Note: Do not set this to anything less than 5 seconds as it will cause Chrome to hang indefinitely and many sites to fail completely.
Possible Values: [50]/100/...
Maximum number of times ArchiveBox will attempt to archive a URL before giving up. Useful for handling transient failures.
Possible Values: [1440,2000]/1024,768/...
Default screenshot/PDF resolution in pixels width,height. Used as the fallback for SCREENSHOT_RESOLUTION, PDF_RESOLUTION, and CHROME_RESOLUTION.
Possible Values: [True]/False
Whether to enforce HTTPS certificate and HSTS chain of trust when archiving sites. Set this to False if you want to archive pages even if they have expired or invalid certificates. Be aware that when False you cannot guarantee that you have not been man-in-the-middle'd while archiving content, so the content cannot be verified to be what's on the original site.
Possible Values: [Mozilla/5.0 ... ArchiveBox/{VERSION} ...]/"Mozilla/5.0 ..."/...
The default user agent string used during archiving. Individual extractors (wget, Chrome, curl, etc.) can override this with their own *_USER_AGENT settings, or fall back to this value.
Possible Values: [None]//path/to/cookies.txt/...
Cookies file to pass to wget, curl, yt-dlp, and other extractors that don't use Chrome (with its CHROME_USER_DATA_DIR) for authentication. To capture sites that require a logged-in user, point this option at a Netscape-format cookies.txt file containing all the cookies you want to use during archiving.
You can generate this cookies.txt file by using a number of different browser extensions that can export your cookies in this format, or by using wget on the command line with --save-cookies + --user=... --password=....
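As a hedged sketch of the wget approach mentioned above (the URL, credentials, and paths here are placeholders, not real endpoints):

```shell
# Hypothetical example: log in to a site with wget and save the resulting
# session cookies in Netscape cookies.txt format.
wget --save-cookies=cookies.txt --keep-session-cookies \
     --user=archiver@example.com --password='burner-password' \
     https://example.com/login -O /dev/null

# Then point ArchiveBox at the saved cookies file:
archivebox config --set COOKIES_FILE=/path/to/cookies.txt
```

The --keep-session-cookies flag matters here: without it, wget discards session-only cookies, which are usually exactly the ones that carry the login.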
Alternatively, you can create a persona and import cookies directly from your browser profile:
archivebox persona create --import=chrome personal
Warning
Make sure you use separate burner credentials dedicated to archiving, e.g. don't re-use your normal daily Facebook/Instagram/Youtube/etc. account cookies as server responses often contain your name/email/PII, session tokens, etc. which then get preserved in your snapshots!
Related options:
CHROME_USER_DATA_DIR, DEFAULT_PERSONA
Possible Values: [Default]/personal/work/...
The persona profile to use by default when archiving. Personas allow you to have separate sets of cookies, Chrome profiles, and user agent strings for different archiving contexts.
Possible Values: [\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$]/.+\.exe$/...
A regular expression used to exclude matching URLs from archiving.
Related options:
URL_ALLOWLIST, SAVE_ALLOWLIST, SAVE_DENYLIST
Possible Values: [None]/^http(s)?:\/\/(.+)?example\.com\/?.*$/...
A regular expression; any URL that does not match the pattern is excluded from archiving. Useful for keeping recursive crawling within a single domain.
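To illustrate how these patterns behave, a quick sketch using Python's re module with the default denylist pattern and the example allowlist pattern shown above (the URLs are made up):

```python
import re

# Example URL_ALLOWLIST pattern from above: only URLs on example.com
# (or its subdomains) pass the filter.
allowlist = re.compile(r"^http(s)?:\/\/(.+)?example\.com\/?.*$")

# Default URL_DENYLIST pattern: skips common static-asset URLs.
denylist = re.compile(
    r"\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$"
)

print(bool(allowlist.match("https://blog.example.com/post/1")))   # True
print(bool(allowlist.match("https://other.org/page")))            # False
print(bool(denylist.search("https://example.com/static/app.js"))) # True
print(bool(denylist.search("https://example.com/article.html")))  # False
```

Note the asymmetry: an allowlist is anchored with ^...$ and must match the whole URL, while the denylist only needs to match the tail of the URL (the file extension plus an optional query string).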
Possible Values: [{}]/{".*example\\.com.*": ["screenshot", "pdf"]}/...
A JSON dictionary mapping URL regex patterns to lists of archive methods. Only the specified methods will be used for URLs matching each pattern.
Possible Values: [{}]/{".*\\.pdf$": ["screenshot", "dom"]}/...
A JSON dictionary mapping URL regex patterns to lists of archive methods to skip.
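Putting the two together, a hypothetical sketch (the patterns and method names mirror the examples above; quoting is needed so the shell passes the JSON through intact):

```shell
# Only run screenshot+pdf on example.com URLs...
archivebox config --set SAVE_ALLOWLIST='{".*example\\.com.*": ["screenshot", "pdf"]}'

# ...and skip screenshot/dom on direct links to PDF files.
archivebox config --set SAVE_DENYLIST='{".*\\.pdf$": ["screenshot", "dom"]}'
```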
Possible Values: [[,]]/[,;]/...
Regex pattern used to split tag strings into individual tags.
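A small sketch of how this split pattern behaves, using Python's re.split with the default [,] value and the alternative [,;] value shown above:

```python
import re

tags = "news,tech;python"

# Default pattern: only commas separate tags, so ';' stays inside a tag.
print(re.split(r"[,]", tags))   # ['news', 'tech;python']

# Alternative pattern: both ',' and ';' act as separators.
print(re.split(r"[,;]", tags))  # ['news', 'tech', 'python']
```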
Options for the web UI, authentication, and reverse proxy configuration.
Possible Values: [None]/"admin"/...
Only used on first run / initial setup in Docker. ArchiveBox will create an admin user with the specified username and password when these options are found in the environment.
More info:
Possible Values: [True]/False
Configure whether or not login is required to use each area of ArchiveBox.
archivebox config --set PUBLIC_INDEX=True # allow viewing snapshots list without login
archivebox config --set PUBLIC_SNAPSHOTS=True # allow viewing snapshot content without login
archivebox config --set PUBLIC_ADD_VIEW=False # require login to submit new URLs
Possible Values: auto-generated random string
Django's secret key for cryptographic signing (sessions, CSRF tokens, etc.). Automatically generated on first run.
Possible Values: [127.0.0.1:8000]/0.0.0.0:8000/...
Address and port for the ArchiveBox web server to listen on.
Possible Values: [archivebox.localhost:8000]/archive.example.com:443/...
The public hostname and port that ArchiveBox is accessible at.
Possible Values: [*]/archive.example.com,localhost/...
Comma-separated list of allowed HTTP Host header values. Set this to your domain name(s) in production.
Possible Values: [http://admin.archivebox.localhost:8000]/https://archive.example.com/...
Comma-separated list of trusted origins for CSRF validation. Must include the scheme (http/https).
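A hypothetical production setup for a server reachable at archive.example.com over HTTPS (option names here follow Django's ALLOWED_HOSTS / CSRF_TRUSTED_ORIGINS conventions described above):

```shell
# Restrict the accepted Host header to your real domain...
archivebox config --set ALLOWED_HOSTS=archive.example.com

# ...and trust it for CSRF validation (scheme is required):
archivebox config --set CSRF_TRUSTED_ORIGINS=https://archive.example.com
```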
Possible Values: [""]//admin//...
Base URL path for the Django admin interface.
Possible Values: [""]//archive//...
Base URL path for serving archived content.
Possible Values: [40]/100/...
Maximum number of Snapshots to show per page on Snapshot list pages.
Possible Values: [True]/False
Whether to show inline previews of the original URL on snapshot detail pages.
Possible Values: [Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.]/...
Text to display in the footer of the archive index.
Possible Values: [data/custom_templates]//path/to/custom_templates/...
Path to a directory containing custom html/css/images for overriding the default UI styling.
Possible Values: [Remote-User]/X-Remote-User/...
HTTP header containing user name from authenticated proxy.
Related options:
REVERSE_PROXY_WHITELIST, LOGOUT_REDIRECT_URL
Possible Values: [<empty string>]/172.16.0.0/16/...
Comma-separated list of IP CIDR ranges that are allowed to use reverse proxy authentication.
Possible Values: [/]/https://example.com/some/other/app/...
URL to redirect users back to on logout when using reverse proxy authentication.
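A sketch of wiring these three options together behind an authenticating proxy (the header name and SSO URL are illustrative, and the user-header option name is assumed):

```shell
# Trust a username header set by the proxy, but only from the proxy's subnet:
archivebox config --set REVERSE_PROXY_USER_HEADER=X-Remote-User
archivebox config --set REVERSE_PROXY_WHITELIST=172.16.0.0/16

# Send users back to the SSO provider's logout page on logout:
archivebox config --set LOGOUT_REDIRECT_URL=https://sso.example.com/logout
```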
Options for LDAP/Active Directory authentication. Requires pip install archivebox[ldap].
Possible Values: [False]/True
Whether to use an external LDAP server for authentication.
pip install archivebox[ldap]
Then set these configuration values:
LDAP_ENABLED: True
LDAP_SERVER_URI: "ldap://ldap.example.com:3389"
LDAP_BIND_DN: "ou=archivebox,ou=services,dc=ldap.example.com"
LDAP_BIND_PASSWORD: "secret-bind-user-password"
LDAP_USER_BASE: "ou=users,ou=archivebox,ou=services,dc=ldap.example.com"
LDAP_USER_FILTER: "(uid=%(user)s)"
LDAP_USERNAME_ATTR: "username"
LDAP_FIRSTNAME_ATTR: "givenName"
LDAP_LASTNAME_ATTR: "sn"
LDAP_EMAIL_ATTR: "mail"
LDAP_CREATE_SUPERUSER: False
More info:
- https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Authentication
- https://github.com/django-auth-ldap/django-auth-ldap#example-configuration
Default: [None]
LDAP server URI (e.g. ldap://ldap.example.com:389).
Default: [None]
DN to bind for searching.
Default: [None]
Password for bind DN.
Default: [None]
Base DN for user searches.
Default: [(uid=%(user)s)]
LDAP search filter for users.
Default: [username]
LDAP attribute for username.
Default: [givenName]
LDAP attribute for first name.
Default: [sn]
LDAP attribute for last name.
Default: [mail]
LDAP attribute for email.
Default: [False]
Auto-create superuser accounts for LDAP users.
Options for file layout, permissions, and temp/lib directories.
Possible Values: [644]/755/...
Permissions to set output files to.
Related options:
PUID / PGID
Possible Values: [911]/1000/...
User and Group ID that the data directory should be owned by.
Note: Only applicable for Docker users; settable via environment variables only.
Learn more:
- https://docs.linuxserver.io/general/understanding-puid-and-pgid/
- https://github.com/ArchiveBox/ArchiveBox/wiki/Troubleshooting#docker-permissions-issues
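For example, a hedged sketch of running the container so archived files end up owned by your host user (image name and port follow the standard Docker setup; PUID/PGID must be passed as environment variables):

```shell
# Make the container read/write ./data as your current host user:
docker run -d \
  -e PUID=$(id -u) -e PGID=$(id -g) \
  -v "$PWD/data":/data \
  -p 8000:8000 \
  archivebox/archivebox
```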
Possible Values: [windows]/unix/ascii/...
Restrict output filenames to be compatible with the given filesystem type.
Possible Values: [True]/False
Whether to use atomic writes when saving files.
Possible Values: [data/tmp/<machine_id>]//tmp/archivebox/abc5d851/...
Path for temporary files, unix sockets, and supervisor config. Must be a local, fast, short-path directory.
Possible Values: [data/lib/<arch>-<os>]//usr/local/share/archivebox/abc5/...
Path for installed binary dependencies.
Possible Values: [LIB_DIR/bin]
Path where installed binaries are symlinked for easy PATH management.
Options for full-text search backend configuration.
Possible Values: [True]/False
Enable the search indexing backend.
Possible Values: [True]/False
Enable the search querying backend.
Possible Values: [ripgrep]/sqlite/sonic
Which search backend engine to use. ripgrep (default) requires no setup. sqlite uses FTS5. sonic requires a running Sonic instance.
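A hypothetical example of switching backends (the option name and the index-rebuild flag here are assumptions, not confirmed by this page):

```shell
# Switch to the SQLite FTS5 backend (no external service needed)...
archivebox config --set SEARCH_BACKEND_ENGINE=sqlite

# ...then rebuild the full-text index for existing snapshots:
archivebox update --index-only
```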
Possible Values: [True]/False
Whether to strip HTML tags before indexing content for search.
Options around the format of the CLI output.
Possible Values: [False]/True
Enable debug mode. Automatically set to True if --debug is passed on the command line.
Possible Values: auto-detected
Whether stdout is a TTY (interactive terminal).
Possible Values: [True]/False
Colorize console output. Defaults to True if stdin is a TTY.
Possible Values: [True]/False
Show real-time progress bar in console output. Defaults to True if stdin is a TTY.
Possible Values: [False]/True
Whether ArchiveBox is running inside a Docker container.
Possible Values: [False]/True
Whether ArchiveBox is running inside QEMU emulation.
ArchiveBox uses a plugin system where each extractor defines its own configuration via config.json files. All plugin config options can be set the same way as core options — via environment variables, ArchiveBox.conf, or archivebox config --set.
archivebox config # see all available config options
archivebox config --set SCREENSHOT_TIMEOUT=120 # set a plugin option
For the full list of plugins and their config schemas, see the abx-plugins repository.
Default: [True]
Enable title extraction
Default: [30] (falls back to TIMEOUT)
Timeout for title extraction in seconds
Default: [True]
Enable favicon downloading
Default: [30] (falls back to TIMEOUT)
Timeout for favicon fetch in seconds
Default: [""] (falls back to USER_AGENT)
User agent string
Default: [see defaults]
Default wget arguments
Default: [[]]
Extra arguments to append to wget command
Default: [wget]
Path to wget binary
Default: [True] (falls back to CHECK_SSL_VALIDITY)
Whether to verify SSL certificates
Default: [""] (falls back to COOKIES_FILE)
Path to cookies file
Default: [True]
Enable wget archiving
Default: [60] (falls back to TIMEOUT)
Timeout for wget in seconds
Default: [""] (falls back to USER_AGENT)
User agent string for wget
Default: [True]
Save WARC archive file
Default: [True]
Enable screenshot capture
Default: [1440,2000] (falls back to RESOLUTION)
Screenshot resolution (width,height)
Default: [60] (falls back to TIMEOUT)
Timeout for screenshot capture in seconds
Default: [True]
Enable PDF generation
Default: [1440,2000] (falls back to RESOLUTION)
PDF page resolution (width,height)
Default: [60] (falls back to TIMEOUT)
Timeout for PDF generation in seconds
Default: [True]
Enable DOM capture
Default: [60] (falls back to TIMEOUT)
Timeout for DOM capture in seconds
Default: [['--browser-headless']]
Default single-file arguments
Default: [[]]
Extra arguments to append to single-file command
Default: [single-file]
Path to single-file binary
Default: [True] (falls back to CHECK_SSL_VALIDITY)
Whether to verify SSL certificates
Default: [[]] (falls back to CHROME_ARGS)
Chrome command-line arguments for SingleFile
Default: [""] (falls back to COOKIES_FILE)
Path to cookies file
Default: [True]
Enable SingleFile archiving
Default: [60] (falls back to TIMEOUT)
Timeout for SingleFile in seconds
Default: [""] (falls back to USER_AGENT)
User agent string
Default: [[]]
Default Readability arguments
Default: [[]]
Extra arguments to append to Readability command
Default: [readability-extractor]
Path to readability-extractor binary
Default: [True]
Enable Readability text extraction
Default: [30] (falls back to TIMEOUT)
Timeout for Readability in seconds
Default: [[]]
Default Mercury parser arguments
Default: [[]]
Extra arguments to append to Mercury parser command
Default: [postlight-parser]
Path to Mercury/Postlight parser binary
Default: [True]
Enable Mercury text extraction
Default: [30] (falls back to TIMEOUT)
Timeout for Mercury in seconds
Default: [[]]
Default Defuddle arguments
Default: [[]]
Extra arguments to append to Defuddle command
Default: [defuddle]
Path to defuddle binary
Default: [True]
Enable Defuddle text extraction
Default: [30] (falls back to TIMEOUT)
Timeout for Defuddle in seconds
Default: [True]
Enable HTML to text conversion
Default: [30] (falls back to TIMEOUT)
Timeout for HTML to text conversion in seconds
Default: [trafilatura]
Path to trafilatura binary
Default: [True]
Enable Trafilatura extraction
Default: [False]
Write CSV output (content.csv)
Default: [True]
Write HTML output (content.html)
Default: [False]
Write JSON output (content.json)
Default: [True]
Write markdown output (content.md)
Default: [True]
Write plain text output (content.txt)
Default: [False]
Write XML output (content.xml)
Default: [False]
Write XML TEI output (content.xmltei)
Default: [30] (falls back to TIMEOUT)
Timeout for Trafilatura in seconds
Default: [['clone', '--depth=1', '--recursive']]
Default git arguments
Default: [[]]
Extra arguments to append to git command
Default: [git]
Path to git binary
Default: [see defaults]
Comma-separated list of domains to treat as git repositories
Default: [True]
Enable git repository cloning
Default: [120] (falls back to TIMEOUT)
Timeout for git operations in seconds
Default: [see defaults]
Default yt-dlp arguments
Default: [[]]
Extra arguments to append to yt-dlp command
Default: [yt-dlp]
Path to yt-dlp binary
Default: [True] (falls back to CHECK_SSL_VALIDITY)
Whether to verify SSL certificates
Default: [""] (falls back to COOKIES_FILE)
Path to cookies file
Default: [True]
Enable video/audio downloading with yt-dlp
Default: [750m]
Maximum file size for yt-dlp downloads
Default: [3600] (falls back to TIMEOUT)
Timeout for yt-dlp downloads in seconds
Default: [['--write-metadata', '--write-info-json']]
Default gallery-dl arguments
Default: [[]]
Extra arguments to append to gallery-dl command
Default: [gallery-dl]
Path to gallery-dl binary
Default: [True] (falls back to CHECK_SSL_VALIDITY)
Whether to verify SSL certificates
Default: [""] (falls back to COOKIES_FILE)
Path to cookies file
Default: [True]
Enable gallery downloading with gallery-dl
Default: [3600] (falls back to TIMEOUT)
Timeout for gallery downloads in seconds
Default: [[]]
Default forum-dl arguments
Default: [[]]
Extra arguments to append to forum-dl command
Default: [forum-dl]
Path to forum-dl binary
Default: [True]
Enable forum downloading with forum-dl
Default: [jsonl]
Output format for forum downloads
Default: [3600] (falls back to TIMEOUT)
Timeout for forum downloads in seconds
Default: [['fetch']]
Default papers-dl arguments
Default: [[]]
Extra arguments to append to papers-dl command
Default: [papers-dl]
Path to papers-dl binary
Default: [True]
Enable paper downloading with papers-dl
Default: [300] (falls back to TIMEOUT)
Timeout for paper downloads in seconds
Default: [True]
Submit URLs to archive.org Wayback Machine
Default: [60] (falls back to TIMEOUT)
Timeout for archive.org submission in seconds
Default: [""] (falls back to USER_AGENT)
User agent string
Default: [see defaults]
Default Chrome command-line arguments (static flags only, dynamic args like --user-data-dir are added at runtime)
Default: [[]]
Extra arguments to append to Chrome command (for user customization)
Default: [chromium]
Path to Chromium binary
Default: [True] (falls back to CHECK_SSL_VALIDITY)
Whether to verify SSL certificates (disable for self-signed certs)
Default: [0]
Extra delay in seconds after page load completes before archiving (useful for JS-heavy SPAs)
Default: [True]
Enable Chromium browser integration for archiving
Default: [True]
Run Chrome in headless mode
Default: [60] (falls back to CHROME_TIMEOUT)
Timeout for page navigation/load in seconds
Default: [1440,2000] (falls back to RESOLUTION)
Browser viewport resolution (width,height)
Default: [True]
Enable Chrome sandbox (disable in Docker with --no-sandbox)
Default: [60] (falls back to TIMEOUT)
Timeout for Chrome operations in seconds
Default: [""] (falls back to USER_AGENT)
User agent string for Chrome
Default: [""]
Path to Chrome user data directory for persistent sessions (derived from ACTIVE_PERSONA if not set)
Default: [networkidle2]
Page load completion condition (domcontentloaded, load, networkidle0, networkidle2)
Default: [True]
Enable DNS traffic recording during page load
Default: [30] (falls back to TIMEOUT)
Timeout for DNS recording in seconds
Default: [True]
Enable SSL certificate capture
Default: [30] (falls back to TIMEOUT)
Timeout for SSL capture in seconds
Default: [True]
Enable HTTP headers capture
Default: [30] (falls back to TIMEOUT)
Timeout for headers capture in seconds
Default: [True]
Enable redirect chain capture
Default: [30] (falls back to TIMEOUT)
Timeout for redirect capture in seconds
Default: [True]
Enable HTTP response capture
Default: [30] (falls back to TIMEOUT)
Timeout for response capture in seconds
Default: [True]
Enable console log capture
Default: [30] (falls back to TIMEOUT)
Timeout for console log capture in seconds
Default: [True]
Enable accessibility tree capture
Default: [30] (falls back to TIMEOUT)
Timeout for accessibility capture in seconds
Default: [True]
Enable SEO metadata capture
Default: [30] (falls back to TIMEOUT)
Timeout for SEO capture in seconds
Default: [True]
Enable merkle tree hash generation
Default: [30] (falls back to TIMEOUT)
Timeout for merkle tree generation in seconds
Default: [True]
Enable static file detection
Default: [30] (falls back to TIMEOUT)
Timeout for static file detection in seconds
Default: [True]
Enable uBlock Origin browser extension for ad blocking
Default: [True]
Enable I Still Don't Care About Cookies browser extension
Default: [""]
2captcha API key for CAPTCHA solving service (get from https://2captcha.com)
Default: [False]
Automatically submit forms after CAPTCHA is solved
Default: [True]
Enable 2captcha browser extension for automatic CAPTCHA solving
Default: [3]
Number of times to retry CAPTCHA solving on error
Default: [5]
Delay in seconds between CAPTCHA solving retries
Default: [60] (falls back to TIMEOUT)
Timeout for CAPTCHA solving in seconds
Default: [True]
Enable automatic modal and dialog closing
Default: [500]
How often to check for CSS modals (ms)
Default: [1250]
Delay before auto-closing dialogs (ms)
Default: [True]
Enable infinite scroll page expansion
Default: [True]
Expand <details> elements and click 'load more' buttons for comments
Default: [16000]
Minimum page height to scroll to in pixels
Default: [2000]
Delay between scrolls in milliseconds
Default: [1600]
Distance to scroll per step in pixels
Default: [10]
Maximum number of scroll steps
Default: [120] (falls back to TIMEOUT)
Maximum timeout for scrolling in seconds
Default: [True]
Enable DOM outlinks parsing from archived pages
Default: [30] (falls back to TIMEOUT)
Timeout for DOM outlinks parsing in seconds
Default: [True]
Enable HTML URL parsing
Default: [True]
Enable JSON Lines URL parsing
Default: [True]
Enable Netscape bookmarks HTML URL parsing
Default: [True]
Enable plain text URL parsing
Default: [True]
Enable RSS/Atom feed URL parsing
Default: [""]
Anthropic API key for Claude Code authentication
Default: [claude]
Path to Claude Code CLI binary
Default: [False]
Enable Claude Code AI agent integration. Controls whether the claudecode plugin participates in crawl-time extraction; child plugins still need the claudecode plugin installed and a working Claude binary.
Default: [10]
Maximum number of agentic turns per invocation
Default: [sonnet]
Claude model to use (e.g. sonnet, opus, haiku)
Default: [120] (falls back to TIMEOUT)
Timeout for Claude Code operations in seconds
Default: [False]
Enable Claude for Chrome browser extension for AI-driven page interaction
Default: [15]
Maximum number of agentic loop iterations (screenshots + actions) per page
Default: [sonnet]
Claude model to use (e.g. sonnet, opus, haiku). Availability depends on your plan.
Default: [see defaults]
Prompt for Claude to execute on the page. Claude can click buttons, fill forms, download files, and interact with any page element.
Default: [120] (falls back to TIMEOUT)
Timeout for Claude for Chrome operations in seconds
Default: [False]
Enable Claude Code AI extraction
Default: [10] (falls back to CLAUDECODE_MAX_TURNS)
Maximum number of agentic turns for extraction
Default: [sonnet] (falls back to CLAUDECODE_MODEL)
Claude model to use for extraction (e.g. sonnet, opus, haiku)
Default: [see defaults]
Custom prompt for Claude Code extraction. Use this to define what Claude should extract or generate from the snapshot.
Default: [120] (falls back to CLAUDECODE_TIMEOUT)
Timeout for Claude Code extraction in seconds
Default: [False]
Enable Claude Code AI cleanup of snapshot files
Default: [15] (falls back to CLAUDECODE_MAX_TURNS)
Maximum number of agentic turns for cleanup
Default: [sonnet] (falls back to CLAUDECODE_MODEL)
Claude model to use for cleanup (e.g. sonnet, opus, haiku)
Default: [see defaults]
Custom prompt for Claude Code cleanup. Defines what Claude should clean up and how to determine which duplicates to keep.
Default: [120] (falls back to CLAUDECODE_TIMEOUT)
Timeout for Claude Code cleanup in seconds
Default: [['--files-with-matches', '--no-messages', '--ignore-case']]
Default ripgrep arguments
Default: [[]]
Extra arguments to append to ripgrep command
Default: [rg]
Path to ripgrep binary
Default: [90] (falls back to TIMEOUT)
Search timeout in seconds
Default: [snapshots]
Sonic bucket name
Default: [archivebox]
Sonic collection name
Default: [127.0.0.1]
Sonic server hostname
Default: [SecretPassword]
Sonic server password
Default: [1491]
Sonic server port
Default: [search.sqlite3]
SQLite FTS database filename
Default: [True]
Use separate database file for FTS index
Default: [porter unicode61 remove_diacritics 2]
FTS5 tokenizer configuration
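To see what this tokenizer configuration does, a small self-contained sketch using Python's built-in sqlite3 module (the table name and rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same tokenizer configuration as the default above: Porter stemming on
# top of the unicode61 tokenizer, with full diacritic removal.
conn.execute("""
    CREATE VIRTUAL TABLE snaps USING fts5(
        url, content,
        tokenize = 'porter unicode61 remove_diacritics 2'
    )
""")
conn.execute("INSERT INTO snaps VALUES (?, ?)",
             ("https://example.com", "Archiving and archived pages"))

# Porter stemming reduces 'archive', 'archiving', and 'archived' to the
# same stem, so a search for 'archive' finds this row.
rows = conn.execute("SELECT url FROM snaps WHERE snaps MATCH 'archive'").fetchall()
print(rows)  # [('https://example.com',)]
```

The remove_diacritics 2 argument additionally folds accented characters (e.g. "café" matches "cafe"), which requires a reasonably recent SQLite build.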

