{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NYC taxi line rasterization with a custom merge\n",
    "\n",
    "This notebook rasterizes 3.4 million yellow taxi trips from January 2025 as\n",
    "straight lines between pickup and dropoff zone centroids. We compare the\n",
    "built-in `sum` merge against a custom log-sum merge that compresses dynamic\n",
    "range, letting mid-volume corridors stand out alongside the busiest routes.\n",
    "\n",
    "Data source: [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import xarray as xr\n",
    "import geopandas as gpd\n",
    "import matplotlib.pyplot as plt\n",
    "from shapely.geometry import LineString\n",
    "\n",
    "import xrspatial  # registers the .xrs accessor\n",
    "from xrspatial.utils import ngjit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load taxi zones and trip data\n",
    "\n",
    "The TLC publishes taxi zone boundaries as a shapefile and monthly trip records\n",
    "as parquet files. We download both, reproject zones to WGS 84, and read only\n",
    "the columns we need from the ~100 MB parquet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "import requests, zipfile, io, tempfile, os\nimport pyproj\n\n# Download and extract taxi zone shapefile\nresp = requests.get('https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip')\ntmpdir = tempfile.mkdtemp()\nwith zipfile.ZipFile(io.BytesIO(resp.content)) as z:\n    z.extractall(tmpdir)\n\nzones = gpd.read_file(os.path.join(tmpdir, 'taxi_zones', 'taxi_zones.shp'))\n\n# Compute centroids in the native projected CRS (EPSG:2263, feet) where\n# they're accurate, then transform to WGS 84 lon/lat for rasterization.\nmanhattan_proj = zones[zones.borough == 'Manhattan'].copy()\ncx_proj = manhattan_proj.geometry.centroid.x.values\ncy_proj = manhattan_proj.geometry.centroid.y.values\n\ntransformer = pyproj.Transformer.from_crs('EPSG:2263', 'EPSG:4326', always_xy=True)\ncx_wgs, cy_wgs = transformer.transform(cx_proj, cy_proj)\n\nzones = zones.to_crs(epsg=4326)\nmanhattan = zones[zones.borough == 'Manhattan'].copy()\nmanhattan['cx'] = cx_wgs\nmanhattan['cy'] = cy_wgs\ncentroids = manhattan.set_index('LocationID')[['cx', 'cy']]\n\nprint(f'{len(manhattan)} Manhattan taxi zones')\n\nfig, ax = plt.subplots(figsize=(4, 8))\nmanhattan.plot(ax=ax, facecolor='#f0f0f0', edgecolor='#999')\nax.scatter(centroids.cx, centroids.cy, s=8, color='steelblue', zorder=5)\nax.set_title('Manhattan taxi zones + centroids')\nax.set_axis_off()\nplt.tight_layout()"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trips = pd.read_parquet(\n",
    "    'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet',\n",
    "    columns=['PULocationID', 'DOLocationID', 'total_amount', 'trip_distance'],\n",
    ")\n",
    "\n",
    "# Keep Manhattan-to-Manhattan trips with positive fare and nontrivial distance\n",
    "mh_ids = set(centroids.index)\n",
    "mask = (\n",
    "    trips.PULocationID.isin(mh_ids)\n",
    "    & trips.DOLocationID.isin(mh_ids)\n",
    "    & (trips.total_amount > 0)\n",
    "    & (trips.trip_distance > 0.1)\n",
    "    & (trips.PULocationID != trips.DOLocationID)\n",
    ")\n",
    "trips = trips[mask]\n",
    "print(f'{len(trips):,} Manhattan-to-Manhattan trips after filtering')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build route lines\n",
    "\n",
    "Group trips by pickup-dropoff zone pair and sum fares and counts, then draw a\n",
    "straight line between zone centroids. This collapses millions of trips into a\n",
    "few thousand unique routes, each carrying aggregate statistics as columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "routes = (\n",
    "    trips\n",
    "    .groupby(['PULocationID', 'DOLocationID'])\n",
    "    .agg(trip_count=('total_amount', 'size'),\n",
    "         total_fare=('total_amount', 'sum'))\n",
    "    .reset_index()\n",
    ")\n",
    "\n",
    "# Attach pickup and dropoff centroids\n",
    "routes = routes.merge(centroids, left_on='PULocationID', right_index=True)\n",
    "routes = routes.rename(columns={'cx': 'pu_x', 'cy': 'pu_y'})\n",
    "routes = routes.merge(centroids, left_on='DOLocationID', right_index=True)\n",
    "routes = routes.rename(columns={'cx': 'do_x', 'cy': 'do_y'})\n",
    "\n",
    "routes['geometry'] = [\n",
    "    LineString([(r.pu_x, r.pu_y), (r.do_x, r.do_y)])\n",
    "    for _, r in routes.iterrows()\n",
    "]\n",
    "gdf = gpd.GeoDataFrame(routes, geometry='geometry', crs='EPSG:4326')\n",
    "\n",
    "print(f'{len(gdf):,} unique routes')\n",
    "print(f'Trip count per route: min={gdf.trip_count.min()}, '\n",
    "      f'median={gdf.trip_count.median():.0f}, max={gdf.trip_count.max()}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(6, 10))\n",
    "manhattan.plot(ax=ax, facecolor='#f0f0f0', edgecolor='#ccc')\n",
    "gdf.plot(ax=ax, linewidth=0.3, alpha=0.4, color='steelblue')\n",
    "ax.set_title(f'{len(gdf):,} taxi routes across Manhattan')\n",
    "ax.set_axis_off()\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rasterize: built-in sum vs. custom log-sum\n",
    "\n",
    "We create a raster grid covering Manhattan and burn route lines three ways:\n",
    "\n",
    "1. **Trip count** (`merge='sum'` on `trip_count`): raw volume per pixel\n",
    "2. **Total fare** (`merge='sum'` on `total_fare`): revenue per pixel\n",
    "3. **Log-fare** (custom merge): `sum(log(1 + fare))` per pixel\n",
    "\n",
    "The log transform compresses dynamic range. A \\$1M corridor and a \\$10K\n",
    "corridor differ 100x in linear sum but only about 1.5x in log-sum. Cross-town\n",
    "routes and side streets that are invisible in the linear panels become visible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Manhattan bounding box with a bit of padding\n",
    "bounds = (-74.025, 40.695, -73.930, 40.880)\n",
    "width, height = 400, 800\n",
    "\n",
    "def make_template(w, h, bounds):\n",
    "    xmin, ymin, xmax, ymax = bounds\n",
    "    px, py = (xmax - xmin) / w, (ymax - ymin) / h\n",
    "    x = np.linspace(xmin + px / 2, xmax - px / 2, w)\n",
    "    y = np.linspace(ymax - py / 2, ymin + py / 2, h)\n",
    "    return xr.DataArray(np.zeros((h, w)), dims=['y', 'x'],\n",
    "                        coords={'y': y, 'x': x})\n",
    "\n",
    "template = make_template(width, height, bounds)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Trip count -- linear sum\n",
    "count_raster = template.xrs.rasterize(gdf, column='trip_count', merge='sum', fill=0)\n",
    "\n",
    "# 2. Total fare -- linear sum\n",
    "fare_raster = template.xrs.rasterize(gdf, column='total_fare', merge='sum', fill=0)\n",
    "\n",
    "# 3. Log-fare -- custom non-linear merge\n",
    "@ngjit\n",
    "def log_fare_sum(pixel, props, is_first):\n",
    "    \"\"\"Sum log(1 + fare) instead of raw fares.\n",
    "\n",
    "    Non-linear: compresses high values, widens the low range.\n",
    "    A $1M route contributes log(1M) ~ 13.8 while a $1K route\n",
    "    contributes log(1K) ~ 6.9, a 2:1 ratio instead of 1000:1.\n",
    "    \"\"\"\n",
    "    val = np.log1p(props[0])\n",
    "    if is_first:\n",
    "        return val\n",
    "    return pixel + val\n",
    "\n",
    "log_raster = template.xrs.rasterize(gdf, column='total_fare', merge=log_fare_sum, fill=0)"
   ]
  },
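  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the compression claim -- using made-up corridor totals,\n",
    "not values taken from the TLC data -- compare the linear and log ratios directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical fare totals for a heavy and a light corridor\n",
    "heavy, light = 1_000_000.0, 10_000.0\n",
    "print(f'linear ratio:  {heavy / light:.0f}x')                      # 100x\n",
    "print(f'log-sum ratio: {np.log1p(heavy) / np.log1p(light):.2f}x')  # ~1.50x"
   ]
  },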
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "import pathlib\npathlib.Path('images').mkdir(exist_ok=True)\n\nfig, axes = plt.subplots(1, 3, figsize=(18, 12), facecolor='black')\n\ntitles = ['Trip count (sum)', 'Total fare (sum)', 'Log-fare (custom merge)']\nrasters = [count_raster, fare_raster, log_raster]\n\nfor ax, raster, title in zip(axes, rasters, titles):\n    ax.set_facecolor('black')\n    masked = raster.where(raster > 0)\n    masked.plot.imshow(ax=ax, cmap='hot', add_colorbar=False,\n                       interpolation='nearest')\n    ax.set_title(title, color='white', fontsize=14, pad=10)\n    ax.set_axis_off()\n\nplt.tight_layout()\nplt.savefig('images/nyc_taxi_lines_preview.png',\n            bbox_inches='tight', dpi=120, facecolor='black')"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The left two panels are dominated by a handful of bright corridors -- the\n",
    "busiest routes running up and down the avenues. Everything else fades to\n",
    "near-black.\n",
    "\n",
    "The right panel applies `log(1 + x)` before summing, which compresses the\n",
    "hundred-fold difference between heavy and light corridors down to roughly\n",
    "1.5x. Cross-town routes and side-street connections that are invisible in the\n",
    "linear versions show up clearly. Same data, different merge function.\n",
    "\n",
    "**When to use a non-linear merge:** any time a few features dominate the\n",
    "signal and you want to see the rest of the distribution. Log-sum works well\n",
    "for count-like or revenue-like data that spans several orders of magnitude."
   ]
  },
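  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same machinery supports merges other than summing. The sketch below --\n",
    "following the same `(pixel, props, is_first)` signature as `log_fare_sum`\n",
    "above -- keeps only the largest single-route fare crossing each pixel, which\n",
    "highlights the dominant route per corridor rather than aggregate volume:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@ngjit\n",
    "def fare_max(pixel, props, is_first):\n",
    "    \"\"\"Keep the largest single-route fare crossing each pixel.\"\"\"\n",
    "    val = props[0]\n",
    "    if is_first:\n",
    "        return val\n",
    "    return pixel if pixel > val else val\n",
    "\n",
    "max_raster = template.xrs.rasterize(gdf, column='total_fare', merge=fare_max, fill=0)"
   ]
  },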
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### References\n",
    "\n",
    "- [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)\n",
    "- [Bresenham's line algorithm (Wikipedia)](https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm)\n",
    "- [xrspatial.rasterize API docs](https://xarray-spatial.readthedocs.io/en/latest/reference/_autosummary/xrspatial.rasterize.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.14.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}