Commit ca541e6 (parent 07e5119): Create git-scraping.md
Adding explanation of overall approach explaining fix to #77

# Approach

## File Content

- Download links for text-based files such as CSV, JSON, Markdown, RSS, Atom, XML, etc. are easy to scrape with a bash one-liner
- It is easier to work with an API call, an RSS feed, or a static file than with HTML pages, which often have a complicated structure or obfuscate data
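As a sketch of that bash one-liner: fetch the text-based file and canonicalise it before saving. The URL below is a local `file://` stand-in so the example runs as written; substitute the real download link in practice.

```shell
# Stand-in for a real text-based endpoint: a local JSON file.
printf '{"items":[1,2,3],"updated":"2024-01-01"}' > /tmp/source.json

# Fetch and pretty-print in one pipeline; swap the file:// URL for a real one.
curl -sL "file:///tmp/source.json" | python3 -m json.tool > items.json
```

Pretty-printing puts each value on its own line, so later `git diff`s on the committed file stay small and readable.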
## HTML pages

- Some web pages are full of adtech and CDN tags that change constantly, independent of the content. Fetching a full page's HTML therefore generates false positives in the git diff
- Converting the HTML to Markdown with [Markdownify](https://pypi.org/project/markdownify/) is one option for keeping a simplified version of the content instead of the complete markup
## 💡 Tips specific to Canada.ca

- Watch specifically for the CSRF token and the Akamai Boomerang
- Example:
  - CSRF: `<meta name="_csrf_token" content="ImYyZDZhMjBiZWEwNzFkYWZkMGU5ODViYjMwMjIzOGIzOTRhZjA2OGIi.Z41oGw.hx-r3ZcKQW_DoydDr1GHWIVNRJY" />`
  - Boomerang: `<script>!function(a){var e="https://s.go-mpulse.net/boomerang/",t="addEventListener";...` (this will be a block of script)
![image](https://github.com/user-attachments/assets/f1e563ec-f4c3-4cfc-b698-056a0311d347)
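A crude way to handle these volatile bits is to filter them out before committing, so the git diff reflects content changes only. The sample page and the `sed` patterns below are assumptions based on the examples above; real pages may need broader filters.

```shell
# Sample page with the two volatile elements plus real content (assumed shape).
cat > page.html <<'EOF'
<meta name="_csrf_token" content="ImYyZDZh...example..." />
<script>!function(a){var e="https://s.go-mpulse.net/boomerang/"}</script>
<p>Actual page content</p>
EOF

# Drop the CSRF meta tag and any line referencing the Boomerang loader.
sed -e 's/<meta name="_csrf_token"[^>]*>//' \
    -e '/go-mpulse\.net/d' \
    page.html > page.clean.html
```

Committing `page.clean.html` instead of the raw fetch keeps the per-request noise out of the history.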
## Process

### Diagram
```mermaid
flowchart TD
    A[Start: Identify Target Data] --> B[Write Scraping Script]
    B --> C[Run Script to Collect Data]
    C --> D[Store Data in Git Repository]
    D --> E[Commit Changes to Track Updates]
    E --> F{New Data Available?}
    F -->|Yes| C
    F -->|No| G[End: Monitor or Publish Data]
```
### Description

- **Start**: The process begins with identifying the target data (e.g., websites, APIs, or files).
- **Write Scraping Script**: A script is created to automate data collection.
- **Run Script**: The script is executed to fetch and collect data.
- **Store Data in Git**: The scraped data is stored in a Git repository.
- **Commit Changes**: Data updates are committed to maintain version history.
- **Decision Point**: Checks whether new data is available.
  - If "Yes," the script runs again to fetch the new data.
  - If "No," the process ends, and the data is monitored or published for use.
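One iteration of this loop can be sketched as a small shell function; the name `scrape_once` and its arguments are placeholders, and scheduling the call (cron, CI job) is what closes the loop in the diagram.

```shell
# One pass of the process: fetch, then commit only if something changed.
scrape_once() {
  local url="$1" file="$2"
  curl -sL "$url" -o "$file"                 # Run Script to Collect Data
  if [ -n "$(git status --porcelain -- "$file")" ]; then
    git add "$file"                          # Store Data in Git Repository
    git commit -q -m "Update ${file}: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  else
    echo "No new data for ${file}"           # End: Monitor or Publish Data
  fi
}
```

Run from inside the repository, e.g. `scrape_once "https://example.com/data.csv" data.csv`; when the fetch returns identical contents, no commit is created, so the history only records real changes.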
