Scraping is just downloading a webpage and getting the wanted information from it.
7
-
As a start you can scrape the README.md
7
+
As a start, you can scrape the README.md
8
8
9
9
10
-
I'll use khttp for the kotlin implementation because of the ease of use, if you want something company-tier I'd recommend OkHttp.
10
+
I'll use khttp for the Kotlin implementation because of the ease of use, if you want something company-tier I'd recommend OkHttp.
11
11
12
12
**Update**: I have made an okhttp wrapper **for android apps**, check out [NiceHttp](https://github.com/Blatzar/NiceHttp)
13
13
@@ -47,7 +47,7 @@ fun main() {
47
47
```
48
48
49
49
50
-
# **2. Getting the github project description**
50
+
# **2. Getting the GitHub project description**
51
51
Scraping is all about getting what you want in a good format you can use to automate stuff.
52
52
53
53
Start by opening up the developer tools, using
@@ -60,32 +60,32 @@ or
60
60
61
61
or
62
62
63
-
Rightclick and press *Inspect*
63
+
Right-click and press *Inspect*
64
64
65
-
In here you can look at all the network requests the browser is making and much more, but the important part currently is the HTML displayed. You need to find the HTML responsible for showing the project description, but how?
65
+
Here you can look at all the network requests the browser is making and much more, but the important part currently is the HTML displayed. You need to find the HTML responsible for showing the project description, but how?
66
66
67
67
Either click the small mouse in the top left of the developer tools or press
68
68
69
69
<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>
70
70
71
-
This makes your mouse highlight any element you hover over. Press the description to highlight up the element responsible for showing it.
71
+
This makes your mouse highlight any element you hover over. Press the description to highlight the element responsible for showing it.
72
72
73
73
Your HTML will now be focused on something like:
74
74
75
75
76
-
```html
76
+
```HTML
77
77
<p class="f4 mt-3">
78
78
Work in progress tutorial for scraping streaming sites
79
79
</p>
80
80
```
81
81
82
-
Now there's multiple ways to get the text, but the 2 methods I always use is Regex and CSS selectors. Regex is basically a ctrl+f on steroids, you can search for anything. CSS selectors is a way to parse the HTML like a browser and select an element in it.
82
+
Now there are multiple ways to get the text, but the 2 methods I always use are Regex and CSS selectors. Regex is a ctrl+f on steroids, you can search for anything. CSS selectors are a way to parse the HTML like a browser and select an element in it.
83
83
84
84
## CSS Selectors
85
85
86
86
The element is a paragraph tag, i.e. `<p>`, which can be found using the CSS selector: "p".
87
87
88
-
classes helps to narrow down the CSS selector search, in this case: `class="f4 mt-3"`
88
+
classes help to narrow down the CSS selector search, in this case: `class="f4 mt-3"`
89
89
90
90
This can be represented with
91
91
```css
@@ -104,13 +104,230 @@ This prints:
104
104
NodeList [p.f4.mt-3]
105
105
```
106
106
107
-
### **NOTE**: You may not get the same results when scraping from command line, classes and elements are sometimes created by javascript on the site.
107
+
### **NOTE**: You may not get the same results when scraping from the command line, because classes and elements are sometimes created by JavaScript on the site.
108
108
109
109
110
110
**Python**
111
111
112
+
```Python
113
+
import requests
114
+
from bs4 import BeautifulSoup # Full documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
val url = "https://recloudstream.github.io/devs/scraping/"
261
+
val response = khttp.get(url)
262
+
println(response.text)
263
+
}
264
+
```
265
+
266
+
267
+
# **2. Getting the GitHub project description**
268
+
Scraping is all about getting what you want in a good format you can use to automate stuff.
269
+
270
+
Start by opening up the developer tools, using
271
+
272
+
<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>I</kbd>
273
+
274
+
or
275
+
276
+
<kbd>F12</kbd>
277
+
278
+
or
279
+
280
+
Right-click and press *Inspect*
281
+
282
+
Here you can look at all the network requests the browser is making and much more, but the important part currently is the HTML displayed. You need to find the HTML responsible for showing the project description, but how?
283
+
284
+
Either click the small mouse in the top left of the developer tools or press
285
+
286
+
<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>
287
+
288
+
This makes your mouse highlight any element you hover over. Press the description to highlight the element responsible for showing it.
289
+
290
+
Your HTML will now be focused on something like:
291
+
292
+
293
+
```HTML
294
+
<p class="f4 mt-3">
295
+
Work in progress tutorial for scraping streaming sites
296
+
</p>
297
+
```
298
+
299
+
Now there are multiple ways to get the text, but the 2 methods I always use are Regex and CSS selectors. Regex is a ctrl+f on steroids, you can search for anything. CSS selectors are a way to parse the HTML like a browser and select an element in it.
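
To make the Regex option concrete, here is a minimal sketch that pulls the description out of the HTML snippet shown above. The `html` string is a hard-coded stand-in for a downloaded page, and the variable names are illustrative assumptions rather than part of the tutorial's own code.

```python
import re

# Hypothetical sample: the snippet from above, standing in for a downloaded page.
html = """
<p class="f4 mt-3">
    Work in progress tutorial for scraping streaming sites
</p>
"""

# Capture everything between the opening <p class="f4 mt-3"> tag and the first </p>.
# re.DOTALL lets "." match newlines; the lazy ".*?" stops at the nearest closing tag.
pattern = re.compile(r'<p class="f4 mt-3">(.*?)</p>', re.DOTALL)

match = pattern.search(html)
# strip() removes the surrounding newlines and indentation inside the tag.
description = match.group(1).strip() if match else None
print(description)
```

A pattern like this depends on the exact attribute text, so it stops matching if the site changes the class order or adds another attribute.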
300
+
301
+
## CSS Selectors
302
+
303
+
The element is a paragraph tag, i.e. `<p>`, which can be found using the CSS selector: "p".
304
+
305
+
classes help to narrow down the CSS selector search, in this case: `class="f4 mt-3"`
306
+
307
+
This can be represented with
308
+
```css
309
+
p.f4.mt-3
310
+
```
311
+
a dot for every class [full list of CSS selectors found here](https://www.w3schools.com/cssref/css_selectors.asp)
312
+
313
+
You can test if this CSS selector works by opening the console tab and typing:
314
+
315
+
```js
316
+
document.querySelectorAll("p.f4.mt-3");
317
+
```
318
+
319
+
This prints:
320
+
```java
321
+
NodeList [p.f4.mt-3]
322
+
```
323
+
324
+
### **NOTE**: You may not get the same results when scraping from the command line, because classes and elements are sometimes created by JavaScript on the site.
325
+
326
+
327
+
**Python**
328
+
329
+
```Python
330
+
import requests
114
331
from bs4 import BeautifulSoup # Full documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/