Describe the Bug
When crawling websites that contain PDF files, the crawler includes the raw PDF contents in both the HTML and markdown output fields. This causes performance issues by:
- Significantly increasing response time when retrieving results
- Taking up unnecessary storage space in the results database
- Potentially making the results harder to parse and use effectively
To Reproduce
- Run the crawl command on a target website containing PDF files (e.g. https://becu.org/)
- Retrieve the returned results and inspect the HTML and markdown fields (e.g. via http://{HOST URL}/v1/crawl/{CRAWL ID})
- Notice that PDF contents are dumped as raw text into these fields
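One way to confirm the behavior described above is to check the retrieved fields for the PDF magic header, since PDF files always begin with `%PDF-`. A minimal sketch (all function and field names here are hypothetical; it only assumes the crawl results are JSON objects with `html` and `markdown` fields, as returned by the endpoint above):

```python
# Hypothetical helper names; assumes each crawl result is a dict with
# "html" and "markdown" fields, as returned by GET /v1/crawl/{CRAWL ID}.

PDF_SIGNATURE = "%PDF-"

def contains_raw_pdf(field_value: str) -> bool:
    """Return True if a result field appears to contain raw PDF bytes
    dumped as text (PDF files begin with the '%PDF-' magic header)."""
    if not field_value:
        return False
    # Only scan the start of the field; the header appears at the top of a dump.
    return PDF_SIGNATURE in field_value[:1024]

def flag_pdf_results(results: list[dict]) -> list[dict]:
    """Return only the crawl results whose html or markdown fields
    look like raw PDF content."""
    return [
        r for r in results
        if contains_raw_pdf(r.get("html", ""))
        or contains_raw_pdf(r.get("markdown", ""))
    ]
```

Running this over the returned pages for https://becu.org/ should flag the affected entries, which makes it easy to quantify how much of the response is raw PDF text.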
Screenshots

