Commit 813a42b

Fixed grammatical and spelling errors (#34)
1 parent b8b38b9 commit 813a42b

1 file changed: devs/scraping/starting.md
Lines changed: 231 additions & 14 deletions
@@ -4,10 +4,10 @@ order: 999
 icon: rocket
 ---
 Scraping is just downloading a webpage and getting the wanted information from it.
-As a start you can scrape the README.md
+As a start, you can scrape the README.md
 
 
-I'll use khttp for the kotlin implementation because of the ease of use, if you want something company-tier I'd recommend OkHttp.
+I'll use khttp for the Kotlin implementation because of the ease of use, if you want something company-tier I'd recommend OkHttp.
 
 
 **Update**: I have made an okhttp wrapper **for android apps**, check out [NiceHttp](https://github.com/Blatzar/NiceHttp)
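The intro this hunk touches describes scraping as two steps: download a page, then pull the wanted information out of it. A minimal offline sketch of that idea in Python (the inlined page is illustrative; in the tutorial the text would come from an HTTP GET with requests or khttp):

```python
import re

# Illustrative stand-in for a downloaded page; normally this string would
# come from an HTTP GET (requests in Python, khttp in Kotlin).
page = """<html>
<head><title>scraping-tutorial</title></head>
<body><p class="f4 mt-3">Work in progress tutorial for scraping streaming sites</p></body>
</html>"""

# "Getting the wanted information" is just locating a known pattern in the text.
title = re.search(r"<title>(.*?)</title>", page).group(1)
print(title)  # scraping-tutorial
```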
@@ -47,7 +47,7 @@ fun main() {
 ```
 
 
-# **2. Getting the github project description**
+# **2. Getting the GitHub project description**
 Scraping is all about getting what you want in a good format you can use to automate stuff.
 
 Start by opening up the developer tools, using
@@ -60,32 +60,32 @@ or
 
 or
 
-Right click and press *Inspect*
+Right-click and press *Inspect*
 
-In here you can look at all the network requests the browser is making and much more, but the important part currently is the HTML displayed. You need to find the HTML responsible for showing the project description, but how?
+Here you can look at all the network requests the browser is making and much more, but the important part currently is the HTML displayed. You need to find the HTML responsible for showing the project description, but how?
 
 Either click the small mouse in the top left of the developer tools or press
 
 <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>
 
-This makes your mouse highlight any element you hover over. Press the description to highlight up the element responsible for showing it.
+This makes your mouse highlight any element you hover over. Press the description to highlight the element responsible for showing it.
 
 Your HTML will now be focused on something like:
 
 
-```html
+```HTML
 <p class="f4 mt-3">
   Work in progress tutorial for scraping streaming sites
 </p>
 ```
 
-Now there's multiple ways to get the text, but the 2 methods I always use is Regex and CSS selectors. Regex is basically a ctrl+f on steroids, you can search for anything. CSS selectors is a way to parse the HTML like a browser and select an element in it.
+Now there are multiple ways to get the text, but the 2 methods I always use are Regex and CSS selectors. Regex is a ctrl+f on steroids, you can search for anything. CSS selectors are a way to parse the HTML like a browser and select an element in it.
 
 ## CSS Selectors
 
 The element is a paragraph tag, eg `<p>`, which can be found using the CSS selector: "p".
 
-classes helps to narrow down the CSS selector search, in this case: `class="f4 mt-3"`
+classes help to narrow down the CSS selector search, in this case: `class="f4 mt-3"`
 
 This can be represented with
 ```css
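The selector discussed in this hunk, `p.f4.mt-3`, reads as: a `<p>` element whose class list contains both `f4` and `mt-3`. A stdlib-only Python sketch of that matching rule (the tutorial itself uses BeautifulSoup and Jsoup for this; `DescriptionFinder` is a made-up name for illustration):

```python
from html.parser import HTMLParser

class DescriptionFinder(HTMLParser):
    """Collects the text of <p> tags carrying both the f4 and mt-3 classes."""
    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # tag and class checks mirror the "p" and ".f4.mt-3" parts of the selector
        if tag == "p" and {"f4", "mt-3"} <= set(classes):
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data)

html = '<p class="f4 mt-3">\n  Work in progress tutorial for scraping streaming sites\n</p>'
finder = DescriptionFinder()
finder.feed(html)
print("".join(finder.text).strip())
```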
@@ -104,13 +104,230 @@ This prints:
 NodeList [p.f4.mt-3]
 ```
 
-### **NOTE**: You may not get the same results when scraping from command line, classes and elements are sometimes created by javascript on the site.
+### **NOTE**: You may not get the same results when scraping from the command line, classes and elements are sometimes created by javascript on the site.
 
 
 **Python**
 
+```Python
+import requests
+from bs4 import BeautifulSoup # Full documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
+
+url = "https://github.com/Blatzar/scraping-tutorial"
+response = requests.get(url)
+soup = BeautifulSoup(response.text, 'lxml')
+element = soup.select("p.f4.mt-3") # Using the CSS selector
+print(element[0].text.strip()) # Selects the first element, gets the text and strips it (removes starting and ending spaces)
+```
+
+**Kotlin**
+
+In build.gradle:
+```gradle
+repositories {
+    mavenCentral()
+    jcenter()
+    maven { url 'https://jitpack.io' }
+}
+
+dependencies {
+    // Other dependencies above
+    implementation "org.jsoup:jsoup:1.11.3"
+    compile group: 'khttp', name: 'khttp', version: '1.0.0'
+}
+```
+In main.kt
+```kotlin
+fun main() {
+    val url = "https://github.com/Blatzar/scraping-tutorial"
+    val response = khttp.get(url)
+    val soup = Jsoup.parse(response.text)
+    val element = soup.select("p.f4.mt-3") // Using the CSS selector
+    println(element.text().trim()) // Gets the text and strips it (removes starting and ending spaces)
+}
+```
+
+
+## **Regex:**
+
+When working with Regex I highly recommend using [regex101.com](https://regex101.com/) (using the python flavor)
+
+Press <kbd>Ctrl</kbd> + <kbd>U</kbd>
+
+to get the whole site document as text and copy everything
+
+Paste it in the test string in regex101 and try to write an expression to only capture the text you want.
+
+In this case, the elements are
+
+```HTML
+<p class="f4 mt-3">
+  Work in progress tutorial for scraping streaming sites
+</p>
+```
+
+Maybe we can search for `<p class="f4 mt-3">` (backslashes for ")
+
+```regex
+<p class=\"f4 mt-3\">
+```
+
+Gives a match, so let's expand the match to all characters between the two brackets ( p>....</ )
+Some important tokens for that would be:
+
+- `.*?` to indicate everything except a newline any number of times, but take as little as possible
+- `\s*` to indicate whitespaces except a newline any number of times
+- `(*expression inside*)` to indicate groups
+
+Which gives:
+
+```regex
+<p class=\"f4 mt-3\">\s*(.*)?\s*<
+```
+
+**Explained**:
+
+Any text exactly matching `<p class="f4 mt-3">`
+then any number of whitespaces
+then any number of any characters (which will be stored in group 1)
+then any number of whitespaces
+then the text `<`
+
+
+In code:
+
+**Python**
+
+```python
+import requests
+import re # regex
+
+url = "https://github.com/Blatzar/scraping-tutorial"
+response = requests.get(url)
+description_regex = r"<p class=\"f4 mt-3\">\s*(.*)?\s*<" # r"" stands for raw, which makes backslashes work better, used for regexes
+description = re.search(description_regex, response.text).groups()[0]
+print(description)
+```
+
+**Kotlin**
+In main.kt
+```kotlin
+fun main() {
+    val url = "https://github.com/Blatzar/scraping-tutorial"
+    val response = khttp.get(url)
+    val descriptionRegex = Regex("""<p class---
+label: Starting
+order: 999
+icon: rocket
+---
+Scraping is just downloading a webpage and getting the wanted information from it.
+As a start, you can scrape the README.md
+
+
+I'll use khttp for the Kotlin implementation because of the ease of use, if you want something company-tier I'd recommend OkHttp.
+
+**Update**: I have made an okhttp wrapper **for android apps**, check out [NiceHttp](https://github.com/Blatzar/NiceHttp)
+
+
+# **1. Scraping the Readme**
+
+**Python**
 ```python
 import requests
+url = "https://recloudstream.github.io/devs/scraping/"
+response = requests.get(url)
+print(response.text) # Prints the readme
+```
+
+**Kotlin**
+
+In build.gradle:
+```gradle
+repositories {
+    mavenCentral()
+    jcenter()
+    maven { url 'https://jitpack.io' }
+}
+
+dependencies {
+    // Other dependencies above
+    compile group: 'khttp', name: 'khttp', version: '1.0.0'
+}
+```
+In main.kt
+```kotlin
+fun main() {
+    val url = "https://recloudstream.github.io/devs/scraping/"
+    val response = khttp.get(url)
+    println(response.text)
+}
+```
+
+
+# **2. Getting the GitHub project description**
+Scraping is all about getting what you want in a good format you can use to automate stuff.
+
+Start by opening up the developer tools, using
+
+<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>I</kbd>
+
+or
+
+<kbd>f12</kbd>
+
+or
+
+Right-click and press *Inspect*
+
+Here you can look at all the network requests the browser is making and much more, but the important part currently is the HTML displayed. You need to find the HTML responsible for showing the project description, but how?
+
+Either click the small mouse in the top left of the developer tools or press
+
+<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>
+
+This makes your mouse highlight any element you hover over. Press the description to highlight the element responsible for showing it.
+
+Your HTML will now be focused on something like:
+
+
+```HTML
+<p class="f4 mt-3">
+  Work in progress tutorial for scraping streaming sites
+</p>
+```
+
+Now there are multiple ways to get the text, but the 2 methods I always use are Regex and CSS selectors. Regex is a ctrl+f on steroids, you can search for anything. CSS selectors are a way to parse the HTML like a browser and select an element in it.
+
+## CSS Selectors
+
+The element is a paragraph tag, eg `<p>`, which can be found using the CSS selector: "p".
+
+classes help to narrow down the CSS selector search, in this case: `class="f4 mt-3"`
+
+This can be represented with
+```css
+p.f4.mt-3
+```
+a dot for every class [full list of CSS selectors found here](https://www.w3schools.com/cssref/css_selectors.asp)
+
+You can test if this CSS selector works by opening the console tab and typing:
+
+```js
+document.querySelectorAll("p.f4.mt-3");
+```
+
+This prints:
+```java
+NodeList [p.f4.mt-3]
+```
+
+### **NOTE**: You may not get the same results when scraping from the command line, classes and elements are sometimes created by javascript on the site.
+
+
+**Python**
+
+```Python
+import requests
 from bs4 import BeautifulSoup # Full documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
 
 url = "https://github.com/Blatzar/scraping-tutorial"
@@ -158,9 +375,9 @@ to get the whole site document as text and copy everything
 
 Paste it in the test string in regex101 and try to write an expression to only capture the text you want.
 
-In this case the elements is
+In this case, the elements are
 
-```html
+```HTML
 <p class="f4 mt-3">
   Work in progress tutorial for scraping streaming sites
 </p>
@@ -172,7 +389,7 @@ Maybe we can search for `<p class="f4 mt-3">` (backslashes for ")
 <p class=\"f4 mt-3\">
 ```
 
-Gives a match, so lets expand the match to all characters between the two brackets ( p>....</ )
+Gives a match, so let's expand the match to all characters between the two brackets ( p>....</ )
 Some important tokens for that would be:
 
 - `.*?` to indicate everything except a newline any number of times, but take as little as possible
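Of the tokens listed in this part of the tutorial, `.*?` versus plain `.*` is the one that most often trips people up. A two-line Python comparison on a string with two possible end points (the `<b>` string is a made-up example, not from the tutorial):

```python
import re

text = "<b>one</b><b>two</b>"
greedy = re.search(r"<b>(.*)</b>", text).group(1)   # takes as much as possible
lazy = re.search(r"<b>(.*?)</b>", text).group(1)    # takes as little as possible
print(greedy)  # one</b><b>two
print(lazy)    # one
```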
@@ -204,7 +421,7 @@ import re # regex
 
 url = "https://github.com/Blatzar/scraping-tutorial"
 response = requests.get(url)
-description_regex = r"<p class=\"f4 mt-3\">\s*(.*)?\s*<" # r"" stands for raw, which makes blackslashes work better, used for regexes
+description_regex = r"<p class=\"f4 mt-3\">\s*(.*)?\s*<" # r"" stands for raw, which makes backslashes work better, used for regexes
 description = re.search(description_regex, response.text).groups()[0]
 print(description)
 ```
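The corrected Python in this last hunk fetches the live page; the same expression can be checked offline against the sample element quoted earlier in the tutorial (the inlined HTML is the tutorial's own example):

```python
import re

html = '<p class="f4 mt-3">\n  Work in progress tutorial for scraping streaming sites\n</p>'
description_regex = r"<p class=\"f4 mt-3\">\s*(.*)?\s*<"  # same expression as in the diff
description = re.search(description_regex, html).groups()[0]
print(description)  # Work in progress tutorial for scraping streaming sites
```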
