Script that feeds the gurubase.io RAG AI tool with YouTube videos #414
amilcarlucas wants to merge 1 commit
Pull Request Overview
This PR adds two new crawling scripts to feed the gurubase.io RAG AI tool with URLs from ArduPilot's YouTube channel and documentation pages, alongside updates to dependency configuration in pyproject.toml and the linting workflow.
- Added script for extracting YouTube video URLs.
- Added script for crawling ArduPilot documentation URLs with duplicate removal logic.
- Updated project configuration files to include required dependencies for the new scripts.
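The duplicate removal mentioned for the documentation crawler could look roughly like the sketch below. It is an illustration only, not the code from the PR: the helper names `normalize_url` and `dedupe_urls` are hypothetical, and the normalization rules (lowercase scheme and host, drop the fragment, strip a trailing slash) are assumptions that may not suit every site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical spelling so trivially different URLs compare equal."""
    parts = urlsplit(url.strip())
    # Assumption: a trailing slash does not change which page is served.
    path = parts.path.rstrip("/") or "/"
    # Scheme and host are case-insensitive; the fragment is dropped entirely.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_urls(urls):
    """Drop duplicates while keeping the first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Example input; the URLs are placeholders for crawled pages.
crawled = [
    "https://ardupilot.org/copter/",
    "https://ArduPilot.org/copter#intro",
    "https://ardupilot.org/plane",
]
print(dedupe_urls(crawled))
```

Keeping insertion order makes the output stable between runs, which helps when the resulting list is diffed or uploaded incrementally.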
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| scripts/crawl_ardupilot_youtube_channel.py | New script to crawl the ArduPilot YouTube channel for video URLs. |
| scripts/crawl_ardupilot_wiki.py | New script to crawl the ArduPilot documentation pages with URL deduplication. |
| pyproject.toml | Added extra dependencies under "scripts" to support crawling functionality. |
| .github/workflows/pylint.yml | Updated dependency installation command to include the new "scripts" extras. |
Files not reviewed (1)
- scripts/crawl_github_ardupilot.sh: Language not supported
    if proxies:
        HTTPProxyAuth(USERNAME, PASSWORD)

    # Setup Firefox in headless mode
    options = Options()
    options.add_argument("--headless")
The call to HTTPProxyAuth here is not assigned to any variable or integrated with the webdriver or session, which may make proxy authentication ineffective. Consider reviewing and properly integrating proxy authentication if proxies are in use.
Suggested change:

    - if proxies:
    -     HTTPProxyAuth(USERNAME, PASSWORD)
    - # Setup Firefox in headless mode
    - options = Options()
    - options.add_argument("--headless")
    + auth = None
    + if proxies:
    +     auth = HTTPProxyAuth(USERNAME, PASSWORD)
    +     from selenium.webdriver.common.proxy import Proxy, ProxyType
    +     proxy = Proxy()
    +     proxy.http_proxy = proxies.get("http")
    +     proxy.ssl_proxy = proxies.get("https")
    +     proxy.proxy_type = ProxyType.MANUAL
    + # Setup Firefox in headless mode
    + options = Options()
    + options.add_argument("--headless")
    + if proxies:
    +     options.proxy = proxy
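As background for this comment: `HTTPProxyAuth` comes from the `requests` library and only takes effect when the object is passed to a `requests` call, so a bare instantiation configures nothing for Selenium. One way to keep the two concerns apart is to split the proxy URL into an address and optional credentials before handing each part to whatever consumes it. The sketch below uses only the standard library; `split_proxy_url` is a hypothetical helper and the proxy URL is a placeholder, neither is taken from the PR.

```python
from urllib.parse import urlsplit


def split_proxy_url(proxy_url):
    """Split a proxy URL into its address and optional credentials."""
    parts = urlsplit(proxy_url)
    if parts.port is None:
        address = parts.hostname
    else:
        address = f"{parts.hostname}:{parts.port}"
    return {
        "address": address,          # "host:port", without scheme or credentials
        "username": parts.username,  # None when the URL carries no credentials
        "password": parts.password,
    }


# Placeholder proxy URL, for illustration only.
settings = split_proxy_url("http://user:secret@proxy.example.com:3128")
print(settings["address"])
```

With the parts separated, the address can go to the browser's proxy settings and the credentials to whichever mechanism the chosen driver supports, instead of relying on an object that is created and then discarded.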
☂️ Python Coverage
Overall Coverage
New Files: No new covered files...
Modified Files: No covered modified files...
Test Results: 2 files, 2 suites, 1m 55s ⏱️. Results for commit 98ec2d7.
Pull Request Overview
This pull request introduces a new script to scrape video URLs from the ArduPilot YouTube channel using Selenium and updates the project dependencies to include selenium.
- Added the Python script (crawl_ardupilot_youtube_channel.py) that automates browser operations, handles proxies, and implements scrolling and logging for collecting video URLs.
- Updated pyproject.toml to include selenium in the scripts section for dependency management.
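The scrolling logic mentioned above is commonly written as a loop that scrolls to the bottom and stops once the page height no longer grows. The sketch below shows that general pattern and is not the code from the PR: `scroll_to_bottom`, `pause_seconds`, and `max_rounds` are hypothetical names, and the only thing assumed about `driver` is that it has Selenium's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, pause_seconds=2.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of rounds used."""
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(pause_seconds)  # give lazily loaded items time to render
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return round_number  # nothing new was loaded, so stop
        last_height = new_height
    return max_rounds  # gave up; the page may still have more content
```

The `max_rounds` cap keeps the loop from running forever on a feed that never stops loading, and returning the round count gives the caller something to log.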
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/crawl_ardupilot_youtube_channel.py | Implements YouTube video URL scraping with Selenium using headless mode, proxy handling, and scrolling logic. |
| pyproject.toml | Adds selenium as a dependency under the scripts section. |
    proxies = get_env_proxies()
    if proxies:
        HTTPProxyAuth(USERNAME, PASSWORD)
The instantiation of HTTPProxyAuth is not assigned to any variable nor passed to the WebDriver. Ensure that proxy authentication is properly configured for the driver if required.
    if len(video_links) >= 80:  # gurubase has a limit of 100 videos
        break

    if len(video_links) >= 80:  # gurubase has a limit of 100 videos
The condition restricts video_links to 80, but the comment indicates a limit of 100. Please update the condition or comment to align the intended logic.
Suggested change:

    - if len(video_links) >= 80:  # gurubase has a limit of 100 videos
    -     break
    - if len(video_links) >= 80:  # gurubase has a limit of 100 videos
    + if len(video_links) >= 100:  # gurubase has a limit of 100 videos
    +     break
    + if len(video_links) >= 100:  # gurubase has a limit of 100 videos
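If the gap between 80 and 100 was meant as headroom rather than a typo, naming both numbers would make that intent explicit and keep the two checks from drifting apart. The sketch below is one possible shape, not the PR's code: the constant and function names are hypothetical, the limit of 100 is taken from the code comment, and the margin of 20 is an assumption that simply reproduces the original threshold of 80.

```python
GURUBASE_VIDEO_LIMIT = 100  # limit quoted in the script's own comment
SAFETY_MARGIN = 20          # assumed headroom; 100 - 20 gives the original 80


def cap_video_links(video_links, limit=GURUBASE_VIDEO_LIMIT, margin=SAFETY_MARGIN):
    """Return at most (limit - margin) links, keeping their original order."""
    max_links = max(limit - margin, 0)
    return list(video_links)[:max_links]


# Placeholder links, for illustration only.
links = [f"https://www.youtube.com/watch?v=video{i}" for i in range(150)]
print(len(cap_video_links(links)))
```

Setting `margin=0` gives the behaviour of the suggested change, so either reading of the original intent is a single-value edit.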
gurubase does this already, no need for this code |
This pull request introduces a script to scrape video URLs from the ArduPilot YouTube channel using Selenium and updates the project dependencies to include the required library. Below are the key changes:
New Script Addition:
- crawl_ardupilot_youtube_channel.py: scrapes video URLs from the ArduPilot YouTube channel. The script uses Selenium for browser automation, handles proxy settings, and includes robust logging. It is designed to work in headless mode and implements several fallback strategies for finding video links.
Dependency Update:
- pyproject.toml: includes selenium in the scripts section, ensuring the required library is available for the new script.