| 1 | +# Fetch a Webpage — Step-by-Step Walkthrough |
| 2 | + |
| 3 | +[<- Back to Project README](./README.md) |
| 4 | + |
| 5 | +## Before You Start |
| 6 | + |
| 7 | +Read the [project README](./README.md) first. Try to solve it on your own before following this guide. Spend at least 15 minutes attempting it independently. The goal is to fetch a webpage with `requests.get()`, check the status code, and print some of the HTML content. If you can do that much, you are on the right track. |
| 8 | + |
| 9 | +## Thinking Process |
| 10 | + |
| 11 | +When you hear "fetch a webpage," think about what your browser does when you type a URL and press Enter. It sends an HTTP GET request to a server, the server sends back a response (status code, headers, and a body of HTML), and your browser renders the HTML into a visual page. Your Python script does the same thing, except instead of rendering the page, you inspect the raw response. |
| 12 | + |
| 13 | +The `requests` library is one of the most widely used Python tools for making HTTP requests. It wraps the messy details of HTTP into a clean, simple API. The response object it returns carries the three things you need: the status code (did it work?), the headers (metadata about the response), and the body text (the actual HTML). Your job is to call `requests.get()`, check whether the request succeeded, and display the interesting parts. |
| 14 | + |
| 15 | +Start by thinking about what could go wrong. The server might not exist (connection error). The page might not exist (404 status). The server might hit an internal error (500 status). Handling these cases is what separates a script that works from one that is actually useful. |
| 16 | + |
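The connection-error case above can be sketched in code. This is a minimal sketch, not the project's solution: the helper name `fetch_or_none` and the 10-second timeout are illustrative assumptions. It relies on `requests.exceptions.RequestException`, the base class `requests` uses for connection failures, timeouts, and malformed URLs.

```python
import requests


def fetch_or_none(url, timeout=10):
    """Return the Response, or None if no response arrived at all.

    Hypothetical helper for illustration; the name and the 10-second
    timeout are assumptions, not part of the project.
    """
    try:
        # Without a timeout, requests can wait indefinitely on a slow server.
        return requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException as error:
        # Covers connection failures, timeouts, and malformed URLs.
        print(f"Could not fetch {url}: {error}")
        return None
```

A 404 or 500 does not raise an exception here. The server did answer, so those cases still show up in `response.status_code`.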
| 17 | +## Step 1: Import requests and Define the URL |
| 18 | + |
| 19 | +**What to do:** Import the `requests` library and choose a URL to fetch. |
| 20 | + |
| 21 | +**Why:** The `requests` library does not come with Python; you installed it with `pip install requests`. The URL `http://books.toscrape.com/` points to a website built specifically for scraping practice, so fetching it is unlikely to get you blocked or cause problems for anyone. |
| 22 | + |
| 23 | +```python |
| 24 | +import requests |
| 25 | + |
| 26 | +url = "http://books.toscrape.com/" |
| 27 | +``` |
| 28 | + |
| 29 | +**Predict:** What happens if you try to run the script without installing `requests` first? What does the error message look like? |
| 30 | + |
| 31 | +## Step 2: Send the GET Request |
| 32 | + |
| 33 | +**What to do:** Call `requests.get(url)` and store the response object. |
| 34 | + |
| 35 | +**Why:** `requests.get()` sends an HTTP GET request — the same type of request your browser sends when you visit a URL. The function returns a `Response` object that contains everything the server sent back. Think of it as an envelope: the status code is stamped on the outside, the headers are metadata inside the flap, and the body (HTML) is the letter inside. |
| 36 | + |
| 37 | +```python |
| 38 | +print(f"Fetching {url} ...") |
| 39 | +response = requests.get(url) |
| 40 | +``` |
| 41 | + |
| 42 | +**Predict:** After this line runs, `response` holds the entire server response. What type is `response`? Try `print(type(response))` to find out. |
| 43 | + |
| 44 | +## Step 3: Check the Status Code |
| 45 | + |
| 46 | +**What to do:** Read `response.status_code` and decide what to do based on its value. |
| 47 | + |
| 48 | +**Why:** The status code tells you whether the request succeeded. 200 means "OK": the server found the page and sent it back. 404 means "not found." 500 means the server had an internal error. Checking the status code before processing the response keeps you from treating an error page as the content you asked for. |
| 49 | + |
| 50 | +```python |
| 51 | +if response.status_code == 200: |
| 52 | +    print(f"Status code: {response.status_code}") |
| 53 | +    # Proceed to display the content |
| 54 | +else: |
| 55 | +    print(f"Request failed with status code: {response.status_code}") |
| 56 | +``` |
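To see what a status code means beyond 200, the standard library's `http.HTTPStatus` enum maps codes to their standard phrases. The sketch below uses a hypothetical helper, `describe_status`, which is not part of the project; it only illustrates how the codes group into families.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '404 Not Found (client error)'.

    Hypothetical helper for illustration only.
    """
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Servers can send codes that HTTPStatus does not list.
        phrase = "Unknown"

    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirect"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "other"
    return f"{code} {phrase} ({family})"


for code in (200, 404, 500):
    print(describe_status(code))
```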
| 57 | + |
| 58 | +**Predict:** If you change the URL to `http://books.toscrape.com/this-does-not-exist`, what status code will you get? |
| 59 | + |
| 60 | +## Step 4: Inspect Headers and Content |
| 61 | + |
| 62 | +**What to do:** Print the Content-Type header and a preview of the response body. |
| 63 | + |
| 64 | +**Why:** Headers are metadata that the server sends along with the response. The `Content-Type` header tells you what kind of content came back (HTML, JSON, an image, etc.). The response body (`response.text`) is the actual content — in this case, raw HTML. Printing the first 500 characters gives you a preview without flooding your terminal. |
| 65 | + |
| 66 | +```python |
| 67 | +content_type = response.headers.get("Content-Type", "unknown") |
| 68 | +print(f"Content type: {content_type}") |
| 69 | +print(f"Content length: {len(response.text)} characters") |
| 70 | + |
| 71 | +print("\nFirst 500 characters of the page:") |
| 72 | +print("-" * 50) |
| 73 | +print(response.text[:500]) |
| 74 | +print("-" * 50) |
| 75 | +``` |
| 76 | + |
| 77 | +Two details to notice: |
| 78 | + |
| 79 | +- **`response.headers.get("Content-Type", "unknown")`** uses `.get()` with a default value instead of direct bracket access. This prevents a crash if the header is missing. |
| 80 | +- **`response.text[:500]`** is a string slice. It returns the first 500 characters. The full HTML of a webpage can run to tens of thousands of characters, so a preview keeps the output readable. |
| 81 | + |
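The header lookup can be tried without a network call. As a sketch, the block below builds a `requests.structures.CaseInsensitiveDict`, the mapping type `requests` uses for `response.headers`, from made-up header values. It shows the default-value behavior and that header names match regardless of case.

```python
from requests.structures import CaseInsensitiveDict

# Made-up headers standing in for response.headers (values are assumptions).
headers = CaseInsensitiveDict({"Content-Type": "text/html"})

# .get() returns the default instead of raising KeyError for a missing header.
content_type = headers.get("Content-Type", "unknown")
missing = headers.get("X-Not-Sent", "unknown")

# Header names are case-insensitive, so this finds the same value.
lowercase_lookup = headers.get("content-type", "unknown")

print(content_type, missing, lowercase_lookup)
```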
| 82 | +**Predict:** What is the difference between `response.text` and `response.content`? Try printing both and look at the types. |
| 83 | + |
| 84 | +## Step 5: Wrap It in a Function and Add a Main Guard |
| 85 | + |
| 86 | +**What to do:** Organize your code into functions and add the `if __name__ == "__main__"` guard. |
| 87 | + |
| 88 | +**Why:** Putting the logic in functions makes the code reusable — another script could import `fetch_page()` without running the whole program. The `__name__` guard ensures `main()` only runs when you execute the file directly, not when someone imports it. |
| 89 | + |
| 90 | +```python |
| 91 | +def fetch_page(url): |
| 92 | +    print(f"Fetching {url} ...") |
| 93 | +    response = requests.get(url) |
| 94 | +    return response |
| 95 | + |
| 96 | +def display_response_info(response): |
| 97 | +    print(f"Status code: {response.status_code}") |
| 98 | +    # ... rest of the display logic |
| 99 | + |
| 100 | +def main(): |
| 101 | +    url = "http://books.toscrape.com/" |
| 102 | +    response = fetch_page(url) |
| 103 | +    if response.status_code == 200: |
| 104 | +        display_response_info(response) |
| 105 | +    else: |
| 106 | +        print(f"Request failed with status code: {response.status_code}") |
| 107 | +    print("\nDone.") |
| 108 | + |
| 109 | +if __name__ == "__main__": |
| 110 | +    main() |
| 111 | +``` |
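The `# ... rest of the display logic` placeholder can be filled with the Step 4 code. One possible version is sketched below; the `preview_chars` parameter is an addition for illustration, not something the project requires. Because the function only reads `status_code`, `headers`, and `text`, it can be tried on a stand-in object (here a `types.SimpleNamespace` with made-up values) without touching the network.

```python
from types import SimpleNamespace


def display_response_info(response, preview_chars=500):
    """Print the status, content type, length, and a preview of the body."""
    print(f"Status code: {response.status_code}")

    content_type = response.headers.get("Content-Type", "unknown")
    print(f"Content type: {content_type}")
    print(f"Content length: {len(response.text)} characters")

    print(f"\nFirst {preview_chars} characters of the page:")
    print("-" * 50)
    print(response.text[:preview_chars])
    print("-" * 50)


# Stand-in for a real Response; every value here is made up.
fake_response = SimpleNamespace(
    status_code=200,
    headers={"Content-Type": "text/html"},
    text="<html><head><title>Demo</title></head></html>",
)
display_response_info(fake_response, preview_chars=20)
```

In the real script you would pass the object returned by `fetch_page(url)` instead of the stand-in.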
| 112 | + |
| 113 | +**Predict:** What happens if you import this file from another Python file? Does `main()` run? |
| 114 | + |
| 115 | +## Common Mistakes |
| 116 | + |
| 117 | +| Mistake | Why It Happens | Fix | |
| 118 | +|---------|---------------|-----| |
| 119 | +| `ModuleNotFoundError: No module named 'requests'` | `requests` is not installed | Run `pip install requests` in your terminal | |
| 120 | +| Printing `response` instead of `response.text` | Confusion between the object and its content | `response` is the whole object; `.text` is the HTML string | |
| 121 | +| Not checking the status code | Assuming every request succeeds | Always check `response.status_code` before processing | |
| 122 | +| Using `response.content` when you want text | Mixing up bytes and strings | `.text` returns a string (decoded), `.content` returns raw bytes | |
| 123 | + |
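The bytes-versus-string distinction in the last row can be shown with plain Python, no request needed. The sketch below uses a made-up byte string as a stand-in for `response.content` and decodes it by hand, which is roughly what `.text` does for you. UTF-8 is assumed here, whereas `requests` picks the encoding based on the response.

```python
# Made-up bytes standing in for response.content (an assumption for the demo).
raw_body = "<p>Café menu: £3</p>".encode("utf-8")

# Decoding turns bytes into a string, similar in spirit to response.text.
decoded_body = raw_body.decode("utf-8")

print(type(raw_body).__name__)      # bytes
print(type(decoded_body).__name__)  # str

# Non-ASCII characters take more than one byte in UTF-8,
# so the byte count and the character count differ.
print(len(raw_body), len(decoded_body))
```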
| 124 | +## Testing Your Solution |
| 125 | + |
| 126 | +There are no pytest tests for this project — it is a script that fetches a live website. Run it and check the output: |
| 127 | + |
| 128 | +```bash |
| 129 | +python project.py |
| 130 | +``` |
| 131 | + |
| 132 | +Expected output: |
| 133 | +```text |
| 134 | +Fetching http://books.toscrape.com/ ... |
| 135 | +Status code: 200 |
| 136 | +Content type: text/html |
| 137 | +Content length: 51696 characters |
| 138 | +... |
| 139 | +Done. |
| 140 | +``` |
| 141 | + |
| 142 | +The exact character count may vary, but you should see status code 200 and recognizable HTML tags. |
| 143 | + |
| 144 | +## What You Learned |
| 145 | + |
| 146 | +- **`requests.get()`** sends an HTTP GET request and returns a Response object — the same kind of request your browser makes when you visit a URL. |
| 147 | +- **Status codes** tell you whether a request succeeded (200), the page was not found (404), or the server had an error (500). Always check before processing. |
| 148 | +- **`response.text`** gives you the response body as a string, while **`response.headers`** gives you the metadata the server sent back. |
| 149 | +- **The `if __name__ == "__main__"` pattern** lets you write code that works both as a standalone script and as an importable module. |