Skip to content

Commit 878fe67

Browse files
committed
Add NYC taxi line rasterization example with custom log-sum merge (#1021)
1 parent 14113bc commit 878fe67

File tree

1 file changed

+250
-0
lines changed

1 file changed

+250
-0
lines changed
Lines changed: 250 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,250 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# NYC taxi line rasterization with a custom merge\n",
8+
"\n",
9+
"This notebook rasterizes 3.4 million yellow taxi trips from January 2025 as\n",
10+
"straight lines between pickup and dropoff zone centroids. We compare the\n",
11+
"built-in `sum` merge against a custom log-sum merge that compresses dynamic\n",
12+
"range, letting mid-volume corridors stand out alongside the busiest routes.\n",
13+
"\n",
14+
"Data source: [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)"
15+
]
16+
},
17+
{
18+
"cell_type": "code",
19+
"execution_count": null,
20+
"metadata": {},
21+
"outputs": [],
22+
"source": [
23+
"%matplotlib inline\n",
24+
"import numpy as np\n",
25+
"import pandas as pd\n",
26+
"import xarray as xr\n",
27+
"import geopandas as gpd\n",
28+
"import matplotlib.pyplot as plt\n",
29+
"from shapely.geometry import LineString\n",
30+
"\n",
31+
"import xrspatial # registers .xrs accessor\n",
32+
"from xrspatial.utils import ngjit"
33+
]
34+
},
35+
{
36+
"cell_type": "markdown",
37+
"metadata": {},
38+
"source": [
39+
"## Load taxi zones and trip data\n",
40+
"\n",
41+
"The TLC publishes taxi zone boundaries as a shapefile and monthly trip records\n",
42+
"as parquet files. We download both, reproject zones to WGS 84, and read only\n",
43+
"the columns we need from the ~100 MB parquet."
44+
]
45+
},
46+
{
47+
"cell_type": "code",
48+
"execution_count": null,
49+
"metadata": {},
50+
"outputs": [],
51+
"source": "import requests, zipfile, io, tempfile, os\nimport pyproj\n\n# Download and extract taxi zone shapefile\nresp = requests.get('https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip')\ntmpdir = tempfile.mkdtemp()\nwith zipfile.ZipFile(io.BytesIO(resp.content)) as z:\n z.extractall(tmpdir)\n\nzones = gpd.read_file(os.path.join(tmpdir, 'taxi_zones', 'taxi_zones.shp'))\n\n# Compute centroids in the native projected CRS (EPSG:2263, feet) where\n# they're accurate, then transform to WGS 84 lon/lat for rasterization.\nmanhattan_proj = zones[zones.borough == 'Manhattan'].copy()\ncx_proj = manhattan_proj.geometry.centroid.x.values\ncy_proj = manhattan_proj.geometry.centroid.y.values\n\ntransformer = pyproj.Transformer.from_crs('EPSG:2263', 'EPSG:4326', always_xy=True)\ncx_wgs, cy_wgs = transformer.transform(cx_proj, cy_proj)\n\nzones = zones.to_crs(epsg=4326)\nmanhattan = zones[zones.borough == 'Manhattan'].copy()\nmanhattan['cx'] = cx_wgs\nmanhattan['cy'] = cy_wgs\ncentroids = manhattan.set_index('LocationID')[['cx', 'cy']]\n\nprint(f'{len(manhattan)} Manhattan taxi zones')\n\nfig, ax = plt.subplots(figsize=(4, 8))\nmanhattan.plot(ax=ax, facecolor='#f0f0f0', edgecolor='#999')\nax.scatter(centroids.cx, centroids.cy, s=8, color='steelblue', zorder=5)\nax.set_title('Manhattan taxi zones + centroids')\nax.set_axis_off()\nplt.tight_layout()"
52+
},
53+
{
54+
"cell_type": "code",
55+
"execution_count": null,
56+
"metadata": {},
57+
"outputs": [],
58+
"source": [
59+
"trips = pd.read_parquet(\n",
60+
" 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet',\n",
61+
" columns=['PULocationID', 'DOLocationID', 'total_amount', 'trip_distance'],\n",
62+
")\n",
63+
"\n",
64+
"# Keep Manhattan-to-Manhattan trips with positive fare and nontrivial distance\n",
65+
"mh_ids = set(centroids.index)\n",
66+
"mask = (\n",
67+
" trips.PULocationID.isin(mh_ids)\n",
68+
" & trips.DOLocationID.isin(mh_ids)\n",
69+
" & (trips.total_amount > 0)\n",
70+
" & (trips.trip_distance > 0.1)\n",
71+
" & (trips.PULocationID != trips.DOLocationID)\n",
72+
")\n",
73+
"trips = trips[mask]\n",
74+
"print(f'{len(trips):,} Manhattan-to-Manhattan trips after filtering')"
75+
]
76+
},
77+
{
78+
"cell_type": "markdown",
79+
"metadata": {},
80+
"source": [
81+
"## Build route lines\n",
82+
"\n",
83+
"Group trips by pickup--dropoff zone pair and sum fares and counts, then draw a\n",
84+
"straight line between zone centroids. This collapses millions of trips into a\n",
85+
"few thousand unique routes, each carrying aggregate statistics as columns."
86+
]
87+
},
88+
{
89+
"cell_type": "code",
90+
"execution_count": null,
91+
"metadata": {},
92+
"outputs": [],
93+
"source": [
94+
"routes = (\n",
95+
" trips\n",
96+
" .groupby(['PULocationID', 'DOLocationID'])\n",
97+
" .agg(trip_count=('total_amount', 'size'),\n",
98+
" total_fare=('total_amount', 'sum'))\n",
99+
" .reset_index()\n",
100+
")\n",
101+
"\n",
102+
"# Attach pickup and dropoff centroids\n",
103+
"routes = routes.merge(centroids, left_on='PULocationID', right_index=True)\n",
104+
"routes = routes.rename(columns={'cx': 'pu_x', 'cy': 'pu_y'})\n",
105+
"routes = routes.merge(centroids, left_on='DOLocationID', right_index=True)\n",
106+
"routes = routes.rename(columns={'cx': 'do_x', 'cy': 'do_y'})\n",
107+
"\n",
108+
"routes['geometry'] = [\n",
109+
" LineString([(r.pu_x, r.pu_y), (r.do_x, r.do_y)])\n",
110+
" for _, r in routes.iterrows()\n",
111+
"]\n",
112+
"gdf = gpd.GeoDataFrame(routes, geometry='geometry', crs='EPSG:4326')\n",
113+
"\n",
114+
"print(f'{len(gdf):,} unique routes')\n",
115+
"print(f'Trip count per route: min={gdf.trip_count.min()}, '\n",
116+
" f'median={gdf.trip_count.median():.0f}, max={gdf.trip_count.max()}')"
117+
]
118+
},
119+
{
120+
"cell_type": "code",
121+
"execution_count": null,
122+
"metadata": {},
123+
"outputs": [],
124+
"source": [
125+
"fig, ax = plt.subplots(figsize=(6, 10))\n",
126+
"manhattan.plot(ax=ax, facecolor='#f0f0f0', edgecolor='#ccc')\n",
127+
"gdf.plot(ax=ax, linewidth=0.3, alpha=0.4, color='steelblue')\n",
128+
"ax.set_title(f'{len(gdf):,} taxi routes across Manhattan')\n",
129+
"ax.set_axis_off()\n",
130+
"plt.tight_layout()"
131+
]
132+
},
133+
{
134+
"cell_type": "markdown",
135+
"metadata": {},
136+
"source": [
137+
"## Rasterize: built-in sum vs. custom log-sum\n",
138+
"\n",
139+
"We create a raster grid covering Manhattan and burn route lines three ways:\n",
140+
"\n",
141+
"1. **Trip count** (`merge='sum'` on `trip_count`): raw volume per pixel\n",
142+
"2. **Total fare** (`merge='sum'` on `total_fare`): revenue per pixel\n",
143+
"3. **Log-fare** (custom merge): `sum(log(1 + fare))` per pixel\n",
144+
"\n",
145+
"The log transform compresses dynamic range. A \\$1M corridor and a \\$10K\n",
146+
"corridor differ 100x in linear sum but only about 2.5x in log-sum. Cross-town\n",
147+
"routes and side streets that are invisible in the linear panels become visible."
148+
]
149+
},
150+
{
151+
"cell_type": "code",
152+
"execution_count": null,
153+
"metadata": {},
154+
"outputs": [],
155+
"source": [
156+
"# Manhattan bounding box with a bit of padding\n",
157+
"bounds = (-74.025, 40.695, -73.930, 40.880)\n",
158+
"width, height = 400, 800\n",
159+
"\n",
160+
"def make_template(w, h, bounds):\n",
161+
" xmin, ymin, xmax, ymax = bounds\n",
162+
" px, py = (xmax - xmin) / w, (ymax - ymin) / h\n",
163+
" x = np.linspace(xmin + px / 2, xmax - px / 2, w)\n",
164+
" y = np.linspace(ymax - py / 2, ymin + py / 2, h)\n",
165+
" return xr.DataArray(np.zeros((h, w)), dims=['y', 'x'],\n",
166+
" coords={'y': y, 'x': x})\n",
167+
"\n",
168+
"template = make_template(width, height, bounds)"
169+
]
170+
},
171+
{
172+
"cell_type": "code",
173+
"execution_count": null,
174+
"metadata": {},
175+
"outputs": [],
176+
"source": [
177+
"# 1. Trip count -- linear sum\n",
178+
"count_raster = template.xrs.rasterize(gdf, column='trip_count', merge='sum', fill=0)\n",
179+
"\n",
180+
"# 2. Total fare -- linear sum\n",
181+
"fare_raster = template.xrs.rasterize(gdf, column='total_fare', merge='sum', fill=0)\n",
182+
"\n",
183+
"# 3. Log-fare -- custom non-linear merge\n",
184+
"@ngjit\n",
185+
"def log_fare_sum(pixel, props, is_first):\n",
186+
" \"\"\"Sum log(1 + fare) instead of raw fares.\n",
187+
"\n",
188+
" Non-linear: compresses high values, widens the low range.\n",
189+
" A $1M route contributes log(1M) ~ 13.8 while a $1K route\n",
190+
" contributes log(1K) ~ 6.9, a 2:1 ratio instead of 1000:1.\n",
191+
" \"\"\"\n",
192+
" val = np.log1p(props[0])\n",
193+
" if is_first:\n",
194+
" return val\n",
195+
" return pixel + val\n",
196+
"\n",
197+
"log_raster = template.xrs.rasterize(gdf, column='total_fare', merge=log_fare_sum, fill=0)"
198+
]
199+
},
200+
{
201+
"cell_type": "code",
202+
"execution_count": null,
203+
"metadata": {},
204+
"outputs": [],
205+
"source": "import pathlib\npathlib.Path('images').mkdir(exist_ok=True)\n\nfig, axes = plt.subplots(1, 3, figsize=(18, 12), facecolor='black')\n\ntitles = ['Trip count (sum)', 'Total fare (sum)', 'Log-fare (custom merge)']\nrasters = [count_raster, fare_raster, log_raster]\n\nfor ax, raster, title in zip(axes, rasters, titles):\n ax.set_facecolor('black')\n masked = raster.where(raster > 0)\n masked.plot.imshow(ax=ax, cmap='hot', add_colorbar=False,\n interpolation='nearest')\n ax.set_title(title, color='white', fontsize=14, pad=10)\n ax.set_axis_off()\n\nplt.tight_layout()\nplt.savefig('images/nyc_taxi_lines_preview.png',\n bbox_inches='tight', dpi=120, facecolor='black')"
206+
},
207+
{
208+
"cell_type": "markdown",
209+
"metadata": {},
210+
"source": [
211+
"The left two panels are dominated by a handful of bright corridors -- the\n",
212+
"busiest routes running up and down the avenues. Everything else fades to\n",
213+
"near-black.\n",
214+
"\n",
215+
"The right panel applies `log(1 + x)` before summing, which compresses the\n",
216+
"hundred-fold difference between heavy and light corridors down to about 2--3x.\n",
217+
"Cross-town routes and side-street connections that are invisible in the linear\n",
218+
"versions show up clearly. Same data, different merge function.\n",
219+
"\n",
220+
"**When to use a non-linear merge:** any time a few features dominate the\n",
221+
"signal and you want to see the rest of the distribution. Log-sum works well\n",
222+
"for count-like or revenue-like data that spans several orders of magnitude."
223+
]
224+
},
225+
{
226+
"cell_type": "markdown",
227+
"metadata": {},
228+
"source": [
229+
"### References\n",
230+
"\n",
231+
"- [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)\n",
232+
"- [Bresenham's line algorithm (Wikipedia)](https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm)\n",
233+
"- [xrspatial.rasterize API docs](https://xarray-spatial.readthedocs.io/en/latest/reference/_autosummary/xrspatial.rasterize.html)"
234+
]
235+
}
236+
],
237+
"metadata": {
238+
"kernelspec": {
239+
"display_name": "Python 3 (ipykernel)",
240+
"language": "python",
241+
"name": "python3"
242+
},
243+
"language_info": {
244+
"name": "python",
245+
"version": "3.14.0"
246+
}
247+
},
248+
"nbformat": 4,
249+
"nbformat_minor": 5
250+
}

0 commit comments

Comments
 (0)