|
267 | 267 | "id": "b7bf16d7", |
268 | 268 | "metadata": {}, |
269 | 269 | "source": [ |
270 | | - "### The OECD API\n", |
| 270 | + "### The Eurostat SDMX API\n", |
271 | 271 | "\n", |
272 | | - "Sometimes it's convenient to use APIs directly, and, as an example, the OECD API comes with a LOT of complexity that direct access can take advantage of. The OECD API makes data available in both JSON and XML formats, and we'll use [**pandasdmx**](https://pandasdmx.readthedocs.io/) (aka the Statistical Data and Metadata eXchange (SDMX) package for the Python data ecosystem) to pull down the XML format data and turn it into a regular **pandas** data frame.\n", |
|  | 272 | + "Sometimes it's convenient to use APIs directly. The Eurostat API provides access to a huge repository of European statistical data via the SDMX (Statistical Data and Metadata eXchange) standard. Eurostat offers several formats; we'll use the **sdmx1** package (imported as `sdmx`) to pull down the SDMX-ML (XML) format data and turn it into a regular data frame.\n",
273 | 273 | "\n", |
274 | | - "Now, key to using the OECD API is knowledge of its many codes: for countries, times, resources, and series. You can find some broad guidance on what codes the API uses [here](https://data.oecd.org/api/sdmx-ml-documentation/) but to find exactly what you need can be a bit tricky. Two tips are:\n", |
275 | | - "1. If you know what you're looking for is in a particular named dataset, eg \"QNA\" (Quarterly National Accounts), put `https://stats.oecd.org/restsdmx/sdmx.ashx/GetDataStructure/QNA/all?format=SDMX-ML` into your browser and look through the XML file; you can pick out the sub-codes and the countries that are available.\n", |
276 | | - "2. Browse around on https://stats.oecd.org/ and use Customise then check all the \"Use Codes\" boxes to see whatever your browsing's code names.\n", |
| 274 | + "Key to using the Eurostat API is understanding the Data Structure Definition (DSD). Every dataset is essentially a multidimensional \"cube\" where each dimension (like Geography, Unit, or Frequency) has specific codes.\n", |
277 | 275 | "\n", |
278 | | - "Let's see an example of this in action. We'd like to see the productivity (GDP per hour) data for a range of countries since 2010. We are going to be in the productivity resource (code \"PDB_LV\") and we want the USD current prices (code \"CPC\") measure of GDP per employed worker (code \"T_GDPEMP) from 2010 onwards (code \"startTime=2010\"). We'll grab this for some developed countries where productivity measurements might be slightly more comparable. The comments below explain what's happening in each step." |
|  | 276 | + "Two tips for finding the exact codes you need:\n",
|  | 277 | + "\n",
|  | 278 | + "1. **The data browser**: browse the Eurostat data navigation tree. Once you find a table (e.g. \"HICP - monthly data\"), the dataset code (like `prc_hicp_manr`) is shown in brackets after its title.\n",
|  | 279 | + "\n",
|  | 280 | + "2. **Positional keys**: Eurostat's REST API expects a \"key string\" in which codes appear in a specific order, separated by dots (here, `freq.unit.coicop.geo`). If you know the order, you can \"slice\" the data cube directly.\n",
|  | 281 | + "\n",
|  | 282 | + "Let's see an example of this in action. We want the Harmonised Index of Consumer Prices (HICP), specifically the annual rate of change for all items, for Germany and France. We will use the resource `prc_hicp_manr`, requesting monthly frequency (`M`), the annual rate of change unit (`RCH_A`), and the \"all-items\" classification (`CP00`)."
279 | 283 | ] |
280 | 284 | }, |
281 | 285 | { |
|
285 | 289 | "metadata": {}, |
286 | 290 | "source": [ |
287 | 291 | "```python\n", |
288 | | - "import pandasdmx as pdmx\n", |
289 | | - "# Tell pdmx we want OECD data\n", |
290 | | - "oecd = pdmx.Request(\"OECD\")\n", |
291 | | - "# Set out everything about the request in the format specified by the OECD API\n", |
292 | | - "data = oecd.data(\n", |
293 | | - " resource_id=\"PDB_LV\",\n", |
294 | | - " key=\"GBR+FRA+CAN+ITA+DEU+JPN+USA.T_GDPEMP.CPC/all?startTime=2010\",\n", |
295 | | - ").to_pandas()\n", |
296 | | - "\n", |
297 | | - "df = pd.DataFrame(data).reset_index()\n", |
298 | | - "df.head()\n", |
| 292 | + "import sdmx\n", |
| 293 | + "import polars as pl\n", |
| 294 | + "\n", |
|  | 295 | + "# 1. Tell sdmx we want Eurostat (ESTAT) data\n",
|  | 296 | + "client = sdmx.Client('ESTAT')\n",
|  | 297 | + "\n",
|  | 298 | + "# 2. Build the dot-separated positional key\n",
|  | 299 | + "# Format: freq.unit.coicop.geo\n",
| 300 | + "# We use '+' to join multiple countries (DE and FR)\n", |
| 301 | + "resource_id = 'prc_hicp_manr'\n", |
| 302 | + "key_string = 'M.RCH_A.CP00.DE+FR'\n", |
| 303 | + "\n", |
| 304 | + "# 3. Fetch the data directly\n", |
| 305 | + "# 'startPeriod' limits the timeline to recent data\n", |
| 306 | + "response = client.data(\n", |
| 307 | + " resource_id=resource_id,\n", |
| 308 | + " key=key_string,\n", |
| 309 | + " params={'startPeriod': '2024-01'}\n", |
| 310 | + ")\n", |
| 311 | + "\n", |
| 312 | + "# 4. Convert the SDMX-ML response to a Polars DataFrame\n", |
| 313 | + "# We bridge through Pandas as sdmx1 is optimized for it\n", |
| 314 | + "df_pd = sdmx.to_pandas(response).to_frame(name='value').reset_index()\n", |
| 315 | + "df = pl.from_pandas(df_pd)\n", |
| 316 | + "\n", |
| 317 | + "print(df.head())\n", |
299 | 318 | "```" |
300 | 319 | ] |
301 | 320 | }, |
|
305 | 324 | "id": "e5cac233", |
306 | 325 | "metadata": {}, |
307 | 326 | "source": [ |
308 | | - "| | LOCATION | SUBJECT | MEASURE | TIME_PERIOD | value |\n", |
309 | | - "|--:|---------:|---------:|--------:|------------:|-------------:|\n", |
310 | | - "| 0 | CAN | T_GDPEMP | CPC | 2010 | 78848.604088 |\n", |
311 | | - "| 1 | CAN | T_GDPEMP | CPC | 2011 | 81422.364748 |\n", |
312 | | - "| 2 | CAN | T_GDPEMP | CPC | 2012 | 82663.028058 |\n", |
313 | | - "| 3 | CAN | T_GDPEMP | CPC | 2013 | 86368.582158 |\n", |
314 | | - "| 4 | CAN | T_GDPEMP | CPC | 2014 | 89617.632446 |" |
|  | 327 | + "| | freq | unit | coicop | geo | TIME_PERIOD | value |\n",
|  | 328 | + "|--:|:-----|:------|:-------|:----|------------:|------:|\n",
|  | 329 | + "| 0 | M | RCH_A | CP00 | DE | 2024-01 | 3.1 |\n",
|  | 330 | + "| 1 | M | RCH_A | CP00 | DE | 2024-02 | 2.7 |\n",
|  | 331 | + "| 2 | M | RCH_A | CP00 | DE | 2024-03 | 2.3 |\n",
|  | 332 | + "| 3 | M | RCH_A | CP00 | DE | 2024-04 | 2.4 |\n",
|  | 333 | + "| 4 | M | RCH_A | CP00 | DE | 2024-05 | 2.8 |"
315 | 334 | ] |
316 | 335 | }, |
317 | 336 | { |
|
489 | 508 | "source": [ |
490 | 509 | "### Webscraping Tables\n", |
491 | 510 | "\n", |
492 | | - "Often there are times when you don't actually want to scrape an entire webpage and all you want is the data from a *table* within the page. Fortunately, there is an easy way to scrape individual tables using the **pandas** package.\n", |
|  | 511 | + "There are times when you don't need to scrape an entire webpage; you simply want the structured data from a specific table. While **polars** is a high-performance data engine, it focuses on structured data formats (like Parquet or CSV) and does not include an HTML parser. However, we can easily bridge this gap by using **pandas** to fetch the table and then converting the result into a **polars** data frame.\n",
493 | 512 | "\n", |
494 | | - "We will read data from a table on 'https://webscraper.io/test-sites/tables' using **pandas**. The function we'll use is `read_html()`, which returns a list of data frames of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the `match=` keyword argument with text that only appears in the table(s) you're interested in.\n", |
|  | 513 | + "We will read data from 'https://webscraper.io/test-sites/tables' using `pd.read_html()`. This function scans the webpage and returns a list of all the tables it finds as data frames. To target a specific table, we use the `match=` keyword argument with text that appears only in the table(s) we're interested in; in this case, \"First Name\".\n",
495 | 514 | "\n", |
496 | | - "The example below shows how this works; looking at the website, we can see that the table we're interested in, has a 'First Name' column. Therefore we run:" |
|  | 515 | + "Once captured, we convert the result to **polars** using `pl.from_pandas()` to take advantage of **polars**' query performance and expression API."
497 | 516 | ] |
498 | 517 | }, |
499 | 518 | { |
|
503 | 522 | "metadata": {}, |
504 | 523 | "outputs": [], |
505 | 524 | "source": [ |
506 | | - "df_list = pd.read_html(\"https://webscraper.io/test-sites/tables\", match=\"First Name\")\n", |
| 525 | + "import polars as pl\n", |
| 526 | + "import pandas as pd\n", |
|  | 527 | + "\n",
| 529 | + "pd_list = pd.read_html(\"https://webscraper.io/test-sites/tables\", match=\"First Name\")\n", |
507 |  | - "# Retrieve first entry from list of data frames\n",
|  | 530 | + "# Convert the first table in the list to a polars data frame\n",
508 | | - "df = df_list[0]\n", |
509 | | - "df.head()" |
| 531 | + "df = pl.from_pandas(pd_list[0])\n", |
| 532 | + "\n", |
| 533 | + "print(df.head())" |
510 | 534 | ] |
511 | 535 | }, |
512 | 536 | { |
|
515 | 539 | "id": "31e49317", |
516 | 540 | "metadata": {}, |
517 | 541 | "source": [ |
518 | | - "This gives us the table neatly loaded into a **pandas** data frame ready for further use.\n", |
| 542 | + "This gives us the table neatly loaded into a **polars** data frame ready for further use.\n", |
519 | 543 | "\n", |
520 | 544 | "If you get a '403' error, it means that the website has blocked **pandas** because it can see that you are engaged in web scraping. This is because some people web scrape irresponsibly, or because websites have provided other, preferred ways for you to obtain the data, eg via a download of the whole thing (think Wikipedia) or through an API. (If you really need to, [you can often get around the 403 error](https://stackoverflow.com/questions/43590153/http-error-403-forbidden-when-reading-html) though.)" |
521 | 545 | ] |
|