diff --git a/docs/api-reference.md b/docs/api-reference.md
new file mode 100644
index 0000000..35a38ca
--- /dev/null
+++ b/docs/api-reference.md
@@ -0,0 +1,501 @@
+# YouTube Transcript API Reference
+
+This document provides a comprehensive reference for the `youtube-transcript-api` library.
+
+# Core API
+
+## Class `YouTubeTranscriptApi`
+### `__init__(proxy_config: Optional[ProxyConfig] = None, http_client: Optional[Session] = None)`
+
+Note on thread-safety: As this class will initialize a `requests.Session`
+object, it is not thread-safe. Make sure to initialize an instance of
+`YouTubeTranscriptApi` per thread, if used in a multi-threading scenario!
+
+:param proxy_config: an optional ProxyConfig object, defining proxies used for
+  all network requests. This can be used to work around your IP being blocked
+  by YouTube, as described in the "Working around IP bans" section of the
+  README
+  (https://github.com/jdepoix/youtube-transcript-api?tab=readme-ov-file#working-around-ip-bans-requestblocked-or-ipblocked-exception)
+:param http_client: You can optionally pass in a requests.Session object, if you
+  want to manually share cookies between different instances of
+  `YouTubeTranscriptApi`, overwrite defaults, specify SSL certificates, etc.
+
+### `fetch(video_id: str, languages: Iterable[str] = ('en',), preserve_formatting: bool = False)` -> `FetchedTranscript`
+
+Retrieves the transcript for a single video. This is just a shortcut for
+calling:
+`YouTubeTranscriptApi().list(video_id).find_transcript(languages).fetch(preserve_formatting=preserve_formatting)`
+
+:param video_id: the ID of the video you want to retrieve the transcript for.
+  Make sure that this is the actual ID, NOT the full URL to the video!
+:param languages: A list of language codes in a descending priority. For
+  example, if this is set to ["de", "en"] it will first try to fetch the
+  German transcript (de) and then fetch the English transcript (en) if
+  it fails to do so.
+  This defaults to ["en"].
+:param preserve_formatting: whether to keep select HTML text formatting
+
+### `list(video_id: str)` -> `TranscriptList`
+
+Retrieves the list of transcripts which are available for a given video. It
+returns a `TranscriptList` object which is iterable and provides methods to
+filter the list of transcripts for specific languages. While iterating over
+the `TranscriptList` the individual transcripts are represented by
+`Transcript` objects, which provide metadata and can either be fetched by
+calling `transcript.fetch()` or translated by calling
+`transcript.translate('en')`.
+
+:param video_id: the ID of the video you want to retrieve the transcript for.
+  Make sure that this is the actual ID, NOT the full URL to the video!
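Since `video_id` must be the bare ID rather than a URL, callers that start from a URL need to extract the ID first. The helper below is a minimal sketch using only the standard library. `extract_video_id` is a hypothetical name that is not part of `youtube-transcript-api`, and it assumes only the common `watch?v=<id>` and `youtu.be/<id>` URL shapes.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url_or_id: str) -> Optional[str]:
    """Hypothetical helper (not part of the library): return the bare video ID."""
    parsed = urlparse(url_or_id)
    if not parsed.scheme:
        # No scheme, so assume the caller already passed a bare ID.
        return url_or_id or None
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Regular watch links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    # Unknown host: refuse to guess.
    return None


print(extract_video_id("https://www.youtube.com/watch?v=abc123&t=10"))
```

The returned value can then be passed as `video_id` to `fetch` or `list`. A `None` result signals that the input did not match either assumed URL shape.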
+
+#### Examples
+```python
+from youtube_transcript_api import YouTubeTranscriptApi
+
+ytt_api = YouTubeTranscriptApi()
+
+# retrieve the available transcripts
+transcript_list = ytt_api.list('video_id')
+
+# iterate over all available transcripts
+for transcript in transcript_list:
+    # the Transcript object provides metadata properties
+    print(
+        transcript.video_id,
+        transcript.language,
+        transcript.language_code,
+        # whether it has been manually created or generated by YouTube
+        transcript.is_generated,
+        # a list of languages the transcript can be translated to
+        transcript.translation_languages,
+    )
+
+    # fetch the actual transcript data
+    print(transcript.fetch())
+
+    # translating the transcript will return another transcript object
+    print(transcript.translate('en').fetch())
+
+# you can also directly filter for the language you are looking for, using the transcript list
+transcript = transcript_list.find_transcript(['de', 'en'])
+
+# or just filter for manually created transcripts
+transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
+
+# or automatically generated ones
+transcript = transcript_list.find_generated_transcript(['de', 'en'])
+```
+
+# Transcripts & Models
+
+## Class `FetchedTranscriptSnippet`
+### Fields
+- `text`: `str`
+- `start`: `float` - The timestamp at which this transcript snippet appears on screen in seconds.
+- `duration`: `float` - The duration of the snippet in seconds. Be aware that this is not the duration of the transcribed speech, but how long the snippet stays on screen. Therefore, there can be overlaps between snippets!
+## Class `FetchedTranscript`
+
+Represents a fetched transcript. This object is iterable, which allows you to iterate over the transcript snippets.
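Iterating over a `FetchedTranscript` yields snippets, and because each snippet's `duration` is on-screen time rather than speech time, consecutive snippets can overlap. The sketch below illustrates this with a stand-in `Snippet` dataclass that mirrors the three documented fields. It is a hypothetical type for illustration, not the library's `FetchedTranscriptSnippet`, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Snippet:
    """Stand-in with the same fields as FetchedTranscriptSnippet (assumption)."""
    text: str
    start: float
    duration: float


def find_overlaps(snippets: List[Snippet]) -> List[Tuple[int, float]]:
    """Return (index, overlap_seconds) for each snippet still on screen when the next one starts."""
    overlaps = []
    for i in range(len(snippets) - 1):
        end = snippets[i].start + snippets[i].duration
        overlap = end - snippets[i + 1].start
        if overlap > 0:
            overlaps.append((i, overlap))
    return overlaps


sample = [
    Snippet("Hey there", 0.0, 1.5),
    Snippet("how are you", 1.0, 2.0),  # starts before the first snippet disappears
    Snippet("today", 3.0, 1.0),        # starts exactly when the second one ends
]
print(find_overlaps(sample))
```

Code that needs non-overlapping cues could clip each snippet's end to the next snippet's `start`. Whether that is desirable depends on the target format.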
+
+### Fields
+- `snippets`: `List[FetchedTranscriptSnippet]`
+- `video_id`: `str`
+- `language`: `str`
+- `language_code`: `str`
+- `is_generated`: `bool`
+### `__iter__()` -> `Iterator[FetchedTranscriptSnippet]`
+### `__getitem__(index)` -> `FetchedTranscriptSnippet`
+### `__len__()` -> `int`
+### `to_raw_data()` -> `List[Dict]`
+## Class `Transcript`
+### `__init__(http_client: Session, video_id: str, url: str, language: str, language_code: str, is_generated: bool, translation_languages: List[_TranslationLanguage])`
+
+You probably don't want to initialize this directly. Usually you'll access Transcript objects using a TranscriptList.
+
+### `fetch(preserve_formatting: bool = False)` -> `FetchedTranscript`
+
+Loads the actual transcript data.
+
+:param preserve_formatting: whether to keep select HTML text formatting
+
+### `is_translatable()` -> `bool`
+### `translate(language_code: str)` -> `"Transcript"`
+## Class `TranscriptList`
+
+This object represents a list of transcripts. It can be iterated over to list all transcripts which are available for a given YouTube video. Also, it provides functionality to search for a transcript in a given language.
+
+### `__init__(video_id: str, manually_created_transcripts: Dict[str, Transcript], generated_transcripts: Dict[str, Transcript], translation_languages: List[_TranslationLanguage])`
+
+The constructor is only for internal use. Use the static `build` method instead.
+
+:param video_id: the id of the video this TranscriptList is for
+:param manually_created_transcripts: dict mapping language codes to the manually created transcripts
+:param generated_transcripts: dict mapping language codes to the generated transcripts
+:param translation_languages: list of languages the translatable transcripts can be translated to
+
+### `build(http_client: Session, video_id: str, captions_json: Dict)` -> `"TranscriptList"`
+
+Factory method for TranscriptList.
+
+:param http_client: HTTP client which is used to make the HTTP calls that retrieve transcripts
+:param video_id: the id of the video this TranscriptList is for
+:param captions_json: the JSON parsed from the YouTube page's static HTML
+:return: the created TranscriptList
+
+### `__iter__()` -> `Iterator[Transcript]`
+### `find_transcript(language_codes: Iterable[str])` -> `Transcript`
+
+Finds a transcript for a given language code. Manually created transcripts are returned first and only if none are found, generated transcripts are used. If you only want generated transcripts use `find_generated_transcript` instead.
+
+:param language_codes: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] it will first try to fetch the German transcript (de) and then fetch the English transcript (en) if it fails to do so.
+:return: the found Transcript
+
+### `find_generated_transcript(language_codes: Iterable[str])` -> `Transcript`
+
+Finds an automatically generated transcript for a given language code.
+
+:param language_codes: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] it will first try to fetch the German transcript (de) and then fetch the English transcript (en) if it fails to do so.
+:return: the found Transcript
+
+### `find_manually_created_transcript(language_codes: Iterable[str])` -> `Transcript`
+
+Finds a manually created transcript for a given language code.
+
+:param language_codes: A list of language codes in a descending priority. For example, if this is set to ['de', 'en'] it will first try to fetch the German transcript (de) and then fetch the English transcript (en) if it fails to do so.
+:return: the found Transcript
+
+## Class `TranscriptListFetcher`
+### `__init__(http_client: Session, proxy_config: Optional[ProxyConfig])`
+### `fetch(video_id: str)` -> `TranscriptList`
+
+# Formatters
+
+Formatters allow you to convert transcripts into various formats like JSON, Text, SRT, etc.
+
+## Class `Formatter`
+
+Formatter should be used as an abstract base class.
+
+Formatter classes should inherit from this class and implement
+their own `format_transcript()` and `format_transcripts()` methods, which
+should return a string. A transcript is passed in as a `FetchedTranscript` object.
+
+### `format_transcript(transcript: FetchedTranscript, **kwargs)` -> `str`
+### `format_transcripts(transcripts: List[FetchedTranscript], **kwargs)`
+### Usage Examples
+```python
+from typing import List
+
+from youtube_transcript_api import FetchedTranscript
+from youtube_transcript_api.formatters import Formatter
+
+
+class MyCustomFormatter(Formatter):
+    def format_transcript(self, transcript: FetchedTranscript, **kwargs) -> str:
+        # Do your custom work in here, but return a string.
+        return 'your processed output data as a string.'
+
+    def format_transcripts(self, transcripts: List[FetchedTranscript], **kwargs) -> str:
+        # Do your custom work in here to format a list of transcripts, but return a string.
+        return 'your processed output data as a string.'
+```
+## Class `PrettyPrintFormatter` (Inherits from `Formatter`)
+### `format_transcript(transcript: FetchedTranscript, **kwargs)` -> `str`
+
+Pretty prints a transcript.
+
+:param transcript:
+:return: A pretty printed string representation of the transcript.
+
+### `format_transcripts(transcripts: List[FetchedTranscript], **kwargs)` -> `str`
+
+Pretty prints a list of transcripts.
+
+:param transcripts:
+:return: A pretty printed string representation of the transcripts.
+
+## Class `JSONFormatter` (Inherits from `Formatter`)
+### `format_transcript(transcript: FetchedTranscript, **kwargs)` -> `str`
+
+Converts a transcript into a JSON string.
+
+:param transcript:
+:return: A JSON string representation of the transcript.
+
+### `format_transcripts(transcripts: List[FetchedTranscript], **kwargs)` -> `str`
+
+Converts a list of transcripts into a JSON string.
+
+:param transcripts:
+:return: A JSON string representation of the transcripts.
+
+### Usage Examples
+```python
+from youtube_transcript_api.formatters import JSONFormatter
+
+formatter = JSONFormatter()
+
+# .format_transcript(transcript) turns the transcript into a JSON string.
+json_formatted = formatter.format_transcript(transcript)
+```
+```python
+# additional keyword arguments such as indent are accepted via **kwargs
+json_formatted = JSONFormatter().format_transcript(transcript, indent=2)
+```
+## Class `TextFormatter` (Inherits from `Formatter`)
+### `format_transcript(transcript: FetchedTranscript, **kwargs)` -> `str`
+
+Converts a transcript into plain text with no timestamps.
+
+:param transcript:
+:return: all transcript text lines separated by newline breaks.
+
+### `format_transcripts(transcripts: List[FetchedTranscript], **kwargs)` -> `str`
+
+Converts a list of transcripts into plain text with no timestamps.
+
+:param transcripts:
+:return: all transcript text lines separated by newline breaks.
+
+## Class `SRTFormatter` (Inherits from `_TextBasedFormatter`)
+### `format_transcript(transcript: FetchedTranscript, **kwargs)` -> `str`
+
+A basic implementation of WEBVTT/SRT formatting.
+
+:param transcript:
+:reference:
+https://www.w3.org/TR/webvtt1/#introduction-caption
+https://www.3playmedia.com/blog/create-srt-file/
+
+## Class `WebVTTFormatter` (Inherits from `_TextBasedFormatter`)
+### `format_transcript(transcript: FetchedTranscript, **kwargs)` -> `str`
+
+A basic implementation of WEBVTT/SRT formatting.
+
+:param transcript:
+:reference:
+https://www.w3.org/TR/webvtt1/#introduction-caption
+https://www.3playmedia.com/blog/create-srt-file/
+
+## Class `FormatterLoader`
+### `load(formatter_type: str = "pretty")` -> `Formatter`
+
+Loads the Formatter for the given formatter type.
+
+:param formatter_type:
+:return: Formatter object
+
+### Usage Examples
+```python
+from youtube_transcript_api.formatters import FormatterLoader
+
+loader = FormatterLoader()
+formatter = loader.load("json")
+```
+
+# Proxy Configuration
+
+Proxy configuration classes for working around IP blocks.
+
+## Class `InvalidProxyConfig` (Inherits from `Exception`)
+## Class `RequestsProxyConfigDict` (Inherits from `TypedDict`)
+
+This type represents the Dict that is used by the requests library to configure
+the proxies used. More information on this can be found in the official requests
+documentation: https://requests.readthedocs.io/en/latest/user/advanced/#proxies
+
+### Fields
+- `http`: `str`
+- `https`: `str`
+## Class `ProxyConfig` (Inherits from `ABC`)
+
+The base class for all proxy configs. Anything can be a proxy config, as long as
+it can be turned into a `RequestsProxyConfigDict` by calling `to_requests_dict`.
+
+### `to_requests_dict()` -> `RequestsProxyConfigDict`
+
+Turns this proxy config into the Dict that is expected by the requests library.
+More information on this can be found in the official requests documentation:
+https://requests.readthedocs.io/en/latest/user/advanced/#proxies
+
+### `prevent_keeping_connections_alive()` -> `bool`
+
+If you are using rotating proxies, it can be useful to prevent the HTTP
+client from keeping TCP connections alive, as your IP won't be rotated on
+every request, if your connection stays open.
+
+### `retries_when_blocked()` -> `int`
+
+Defines how many times we should retry if a request is blocked. When using
+rotating residential proxies with a large IP pool it can make sense to retry a
+couple of times when a blocked IP is encountered, since a retry will trigger
+an IP rotation and the next IP might not be blocked.
+
+## Class `GenericProxyConfig` (Inherits from `ProxyConfig`)
+
+This proxy config can be used to set up any generic HTTP/HTTPS/SOCKS proxy.
+As the requests library is used under the hood, you can follow the requests
+documentation to get more detailed information on how to set up proxies:
+https://requests.readthedocs.io/en/latest/user/advanced/#proxies
+
+If only an HTTP or an HTTPS proxy is provided, it will be used for both types of
+connections. However, you will have to provide at least one of the two.
+
+### `__init__(http_url: Optional[str] = None, https_url: Optional[str] = None)`
+
+:param http_url: the proxy URL used for HTTP requests. Defaults to `https_url`
+  if None.
+:param https_url: the proxy URL used for HTTPS requests. Defaults to `http_url`
+  if None.
+
+### `to_requests_dict()` -> `RequestsProxyConfigDict`
+## Class `WebshareProxyConfig` (Inherits from `GenericProxyConfig`)
+
+Webshare is a provider offering rotating residential proxies, which is the
+most reliable way to work around being blocked by YouTube.
+
+If you don't have a Webshare account yet, you will have to create one
+at https://www.webshare.io/?referral_code=w0xno53eb50g and purchase a "Residential"
+proxy package that suits your workload, to be able to use this proxy config (make
+sure NOT to purchase "Proxy Server" or "Static Residential"!).
+
+Once you have created an account you only need the "Proxy Username" and
+"Proxy Password" that you can find in your Webshare settings
+at https://dashboard.webshare.io/proxy/settings to set up this config class, which
+will take care of setting up your proxies as needed, by defaulting to rotating
+proxies.
+
+Note that referral links are used here and any purchases made through these links
+will support this Open Source project, which is very much appreciated! :)
+However, you can of course integrate your own proxy solution by using the
+`GenericProxyConfig` class, if that's what you prefer.
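For readers who bring their own proxy through `GenericProxyConfig`, the documented fallback rule (a single proxy URL is used for both HTTP and HTTPS, and at least one must be given) can be sketched as a small function. `build_requests_proxies` is a hypothetical stand-in written for illustration, not the library's implementation, and the proxy URL in the usage line is a placeholder.

```python
from typing import Dict, Optional


def build_requests_proxies(
    http_url: Optional[str] = None, https_url: Optional[str] = None
) -> Dict[str, str]:
    """Hypothetical sketch of the documented fallback, not the library's code."""
    if not http_url and not https_url:
        # The reference states that at least one of the two must be provided.
        raise ValueError("provide at least one of http_url or https_url")
    return {
        # Each entry falls back to the other URL when its own is missing.
        "http": http_url or https_url,
        "https": https_url or http_url,
    }


print(build_requests_proxies(http_url="http://user:pass@proxy.example:8080"))
```

The resulting dict has the `http` and `https` keys listed for `RequestsProxyConfigDict`, which is the shape the requests library expects for its `proxies` argument.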
+
+### Fields
+- `DEFAULT_DOMAIN_NAME`
+- `DEFAULT_PORT`
+### `__init__(proxy_username: str, proxy_password: str, filter_ip_locations: Optional[List[str]] = None, retries_when_blocked: int = 10, domain_name: str = DEFAULT_DOMAIN_NAME, proxy_port: int = DEFAULT_PORT)`
+
+Once you have created a Webshare account at
+https://www.webshare.io/?referral_code=w0xno53eb50g and purchased a
+"Residential" package (make sure NOT to purchase "Proxy Server" or
+"Static Residential"!), this config class allows you to easily use it,
+by defaulting to the most reliable proxy settings (rotating residential
+proxies).
+
+:param proxy_username: "Proxy Username" found at
+  https://dashboard.webshare.io/proxy/settings
+:param proxy_password: "Proxy Password" found at
+  https://dashboard.webshare.io/proxy/settings
+:param filter_ip_locations: If you want to limit the pool of IPs that you will
+  be rotating through to those located in specific countries, you can provide
+  a list of location codes here. By choosing locations that are close to the
+  machine that is running this code, you can reduce latency. Also, this can
+  be used to work around location-based restrictions.
+  You can find the full list of available locations (and how many IPs are
+  available in each location) at
+  https://www.webshare.io/features/proxy-locations?referral_code=w0xno53eb50g
+:param retries_when_blocked: Define how many times we should retry if a request
+  is blocked. When using rotating residential proxies with a large IP pool it
+  makes sense to retry a couple of times when a blocked IP is encountered,
+  since a retry will trigger an IP rotation and the next IP might not be
+  blocked. Defaults to 10.
+
+### `url()` -> `str`
+### `http_url()` -> `str`
+### `https_url()` -> `str`
+### `prevent_keeping_connections_alive()` -> `bool`
+### `retries_when_blocked()` -> `int`
+
+# Error Handling
+
+The library uses a hierarchy of custom exceptions to represent different failure modes.
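Because the exceptions in this section form a class hierarchy, an `except` clause for a base class also handles its subclasses, so more specific handlers need to come first. The sketch below demonstrates that ordering with locally defined stand-in classes that reuse three names from this reference (`CouldNotRetrieveTranscript`, `RequestBlocked`, `IpBlocked`). They are not imported from the library, and `describe_failure` is a hypothetical helper.

```python
class CouldNotRetrieveTranscript(Exception):
    """Local stand-in (assumption): mirrors the documented base class name."""


class RequestBlocked(CouldNotRetrieveTranscript):
    """Local stand-in for the documented RequestBlocked."""


class IpBlocked(RequestBlocked):
    """Local stand-in for the documented IpBlocked."""


def describe_failure(error: Exception) -> str:
    """Hypothetical helper: map an exception to a coarse handling decision."""
    try:
        raise error
    except RequestBlocked:
        # Also catches IpBlocked, since it inherits from RequestBlocked.
        return "blocked: consider a proxy config"
    except CouldNotRetrieveTranscript:
        # Catches every other retrieval failure in the hierarchy.
        return "could not retrieve transcript"


print(describe_failure(IpBlocked()))
print(describe_failure(CouldNotRetrieveTranscript()))
```

Swapping the two `except` clauses would route blocked requests into the generic branch, which is why the narrower class is listed first.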
+
+## Exception Hierarchy
+
+```
+YouTubeTranscriptApiException
+├── CookieError
+│   ├── CookiePathInvalid
+│   └── CookieInvalid
+└── CouldNotRetrieveTranscript
+    ├── YouTubeDataUnparsable
+    ├── YouTubeRequestFailed
+    ├── VideoUnplayable
+    ├── VideoUnavailable
+    ├── InvalidVideoId
+    ├── RequestBlocked
+    │   └── IpBlocked
+    ├── TranscriptsDisabled
+    ├── AgeRestricted
+    ├── NotTranslatable
+    ├── TranslationLanguageNotAvailable
+    ├── FailedToCreateConsentCookie
+    ├── NoTranscriptFound
+    └── PoTokenRequired
+```
+
+## Class `YouTubeTranscriptApiException` (Inherits from `Exception`)
+## Class `CookieError` (Inherits from `YouTubeTranscriptApiException`)
+## Class `CookiePathInvalid` (Inherits from `CookieError`)
+### `__init__(cookie_path: Path)`
+## Class `CookieInvalid` (Inherits from `CookieError`)
+### `__init__(cookie_path: Path)`
+## Class `CouldNotRetrieveTranscript` (Inherits from `YouTubeTranscriptApiException`)
+
+Raised if a transcript could not be retrieved.
+
+### `__init__(video_id: str)`
+### `cause()` -> `str`
+## Class `YouTubeDataUnparsable` (Inherits from `CouldNotRetrieveTranscript`)
+## Class `YouTubeRequestFailed` (Inherits from `CouldNotRetrieveTranscript`)
+### `__init__(video_id: str, http_error: HTTPError)`
+### `cause()` -> `str`
+## Class `VideoUnplayable` (Inherits from `CouldNotRetrieveTranscript`)
+### `__init__(video_id: str, reason: Optional[str], sub_reasons: List[str])`
+### `cause()`
+## Class `VideoUnavailable` (Inherits from `CouldNotRetrieveTranscript`)
+## Class `InvalidVideoId` (Inherits from `CouldNotRetrieveTranscript`)
+### Usage Examples
+```python
+# pass the bare video ID, not the full URL to the video
+YouTubeTranscriptApi().fetch("1234")
+```
+## Class `RequestBlocked` (Inherits from `CouldNotRetrieveTranscript`)
+### `__init__(video_id: str)`
+### `with_proxy_config(proxy_config: Optional[ProxyConfig])` -> `"RequestBlocked"`
+### `cause()` -> `str`
+## Class `IpBlocked` (Inherits from `RequestBlocked`)
+## Class `TranscriptsDisabled` (Inherits from `CouldNotRetrieveTranscript`)
+## Class `AgeRestricted` (Inherits from `CouldNotRetrieveTranscript`)
+## Class `NotTranslatable` (Inherits from `CouldNotRetrieveTranscript`)
+## Class `TranslationLanguageNotAvailable` (Inherits from `CouldNotRetrieveTranscript`)
+## Class `FailedToCreateConsentCookie` (Inherits from `CouldNotRetrieveTranscript`)
+## Class `NoTranscriptFound` (Inherits from `CouldNotRetrieveTranscript`)
+### `__init__(video_id: str, requested_language_codes: Iterable[str], transcript_data: "TranscriptList")`
+### `cause()` -> `str`
+## Class `PoTokenRequired` (Inherits from `CouldNotRetrieveTranscript`)
\ No newline at end of file