Describe the Bug
When crawling websites that contain PDF files, the crawler includes the raw PDF contents in both the HTML and markdown output fields. This causes performance issues by:
- Significantly increasing response time when retrieving results
- Taking up unnecessary storage space in the results database
- Potentially making the results harder to parse and use effectively
To Reproduce
- Run the crawl command on a target website containing PDF files (e.g. https://becu.org/)
- Retrieve the returned results and inspect the HTML and markdown fields (e.g. via http://{HOST URL}/v1/crawl/{CRAWL ID})
- Notice that PDF contents are dumped as raw text into these fields
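One way to confirm the behavior described above is to check the retrieved fields for the PDF magic header, since PDF files always begin with `%PDF-`. A minimal sketch (all function and field names here are hypothetical; it only assumes the crawl results are JSON objects with `html` and `markdown` fields, as returned by the endpoint above):

```python
# Hypothetical helper names; assumes each crawl result is a dict with
# "html" and "markdown" fields, as returned by GET /v1/crawl/{CRAWL ID}.

PDF_SIGNATURE = "%PDF-"

def contains_raw_pdf(field_value: str) -> bool:
    """Return True if a result field appears to contain raw PDF bytes
    dumped as text (PDF files begin with the '%PDF-' magic header)."""
    if not field_value:
        return False
    # Only scan the start of the field; the header appears at the top of a dump.
    return PDF_SIGNATURE in field_value[:1024]

def flag_pdf_results(results: list[dict]) -> list[dict]:
    """Return only the crawl results whose html or markdown fields
    look like raw PDF content."""
    return [
        r for r in results
        if contains_raw_pdf(r.get("html", ""))
        or contains_raw_pdf(r.get("markdown", ""))
    ]
```

Running this over the returned pages for https://becu.org/ should flag the affected entries, which makes it easy to quantify how much of the response is raw PDF text.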
Screenshots

