{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NYC taxi line rasterization with a custom merge\n",
    "\n",
    "This notebook rasterizes 3.4 million yellow taxi trips from January 2025 as\n",
    "straight lines between pickup and dropoff zone centroids. We compare the\n",
    "built-in `sum` merge against a custom log-sum merge that compresses dynamic\n",
    "range, letting mid-volume corridors stand out alongside the busiest routes.\n",
    "\n",
    "Data source: [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import xarray as xr\n",
    "import geopandas as gpd\n",
    "import matplotlib.pyplot as plt\n",
    "from shapely.geometry import LineString\n",
    "\n",
    "import xrspatial  # registers the .xrs accessor\n",
    "from xrspatial.utils import ngjit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load taxi zones and trip data\n",
    "\n",
    "The TLC publishes taxi zone boundaries as a shapefile and monthly trip records\n",
    "as parquet files. We download both, reproject zones to WGS 84, and read only\n",
    "the columns we need from the ~100 MB parquet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "import requests, zipfile, io, tempfile, os\nimport pyproj\n\n# Download and extract taxi zone shapefile\nresp = requests.get('https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip')\ntmpdir = tempfile.mkdtemp()\nwith zipfile.ZipFile(io.BytesIO(resp.content)) as z:\n    z.extractall(tmpdir)\n\nzones = gpd.read_file(os.path.join(tmpdir, 'taxi_zones', 'taxi_zones.shp'))\n\n# Compute centroids in the native projected CRS (EPSG:2263, feet) where\n# they're accurate, then transform to WGS 84 lon/lat for rasterization.\nmanhattan_proj = zones[zones.borough == 'Manhattan'].copy()\ncx_proj = manhattan_proj.geometry.centroid.x.values\ncy_proj = manhattan_proj.geometry.centroid.y.values\n\ntransformer = pyproj.Transformer.from_crs('EPSG:2263', 'EPSG:4326', always_xy=True)\ncx_wgs, cy_wgs = transformer.transform(cx_proj, cy_proj)\n\nzones = zones.to_crs(epsg=4326)\nmanhattan = zones[zones.borough == 'Manhattan'].copy()\nmanhattan['cx'] = cx_wgs\nmanhattan['cy'] = cy_wgs\ncentroids = manhattan.set_index('LocationID')[['cx', 'cy']]\n\nprint(f'{len(manhattan)} Manhattan taxi zones')\n\nfig, ax = plt.subplots(figsize=(4, 8))\nmanhattan.plot(ax=ax, facecolor='#f0f0f0', edgecolor='#999')\nax.scatter(centroids.cx, centroids.cy, s=8, color='steelblue', zorder=5)\nax.set_title('Manhattan taxi zones + centroids')\nax.set_axis_off()\nplt.tight_layout()"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trips = pd.read_parquet(\n",
    "    'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-01.parquet',\n",
    "    columns=['PULocationID', 'DOLocationID', 'total_amount', 'trip_distance'],\n",
    ")\n",
    "\n",
    "# Keep Manhattan-to-Manhattan trips with positive fare and nontrivial distance\n",
    "mh_ids = set(centroids.index)\n",
    "mask = (\n",
    "    trips.PULocationID.isin(mh_ids)\n",
    "    & trips.DOLocationID.isin(mh_ids)\n",
    "    & (trips.total_amount > 0)\n",
    "    & (trips.trip_distance > 0.1)\n",
    "    & (trips.PULocationID != trips.DOLocationID)\n",
    ")\n",
    "trips = trips[mask]\n",
    "print(f'{len(trips):,} Manhattan-to-Manhattan trips after filtering')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build route lines\n",
    "\n",
    "Group trips by pickup-dropoff zone pair and sum fares and counts, then draw a\n",
    "straight line between zone centroids. This collapses millions of trips into a\n",
    "few thousand unique routes, each carrying aggregate statistics as columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "routes = (\n",
    "    trips\n",
    "    .groupby(['PULocationID', 'DOLocationID'])\n",
    "    .agg(trip_count=('total_amount', 'size'),\n",
    "         total_fare=('total_amount', 'sum'))\n",
    "    .reset_index()\n",
    ")\n",
    "\n",
    "# Attach pickup and dropoff centroids\n",
    "routes = routes.merge(centroids, left_on='PULocationID', right_index=True)\n",
    "routes = routes.rename(columns={'cx': 'pu_x', 'cy': 'pu_y'})\n",
    "routes = routes.merge(centroids, left_on='DOLocationID', right_index=True)\n",
    "routes = routes.rename(columns={'cx': 'do_x', 'cy': 'do_y'})\n",
    "\n",
    "routes['geometry'] = [\n",
    "    LineString([(r.pu_x, r.pu_y), (r.do_x, r.do_y)])\n",
    "    for _, r in routes.iterrows()\n",
    "]\n",
    "gdf = gpd.GeoDataFrame(routes, geometry='geometry', crs='EPSG:4326')\n",
    "\n",
    "print(f'{len(gdf):,} unique routes')\n",
    "print(f'Trip count per route: min={gdf.trip_count.min()}, '\n",
    "      f'median={gdf.trip_count.median():.0f}, max={gdf.trip_count.max()}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(6, 10))\n",
    "manhattan.plot(ax=ax, facecolor='#f0f0f0', edgecolor='#ccc')\n",
    "gdf.plot(ax=ax, linewidth=0.3, alpha=0.4, color='steelblue')\n",
    "ax.set_title(f'{len(gdf):,} taxi routes across Manhattan')\n",
    "ax.set_axis_off()\n",
    "plt.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rasterize: built-in sum vs. custom log-sum\n",
    "\n",
    "We create a raster grid covering Manhattan and burn route lines three ways:\n",
    "\n",
    "1. **Trip count** (`merge='sum'` on `trip_count`): raw volume per pixel\n",
    "2. **Total fare** (`merge='sum'` on `total_fare`): revenue per pixel\n",
    "3. **Log-fare** (custom merge): `sum(log(1 + fare))` per pixel\n",
    "\n",
    "The log transform compresses dynamic range. A \\$1M corridor and a \\$10K\n",
    "corridor differ 100x in linear sum but only about 1.5x in log-sum. Cross-town\n",
    "routes and side streets that are invisible in the linear panels become visible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Manhattan bounding box with a bit of padding\n",
    "bounds = (-74.025, 40.695, -73.930, 40.880)\n",
    "width, height = 400, 800\n",
    "\n",
    "def make_template(w, h, bounds):\n",
    "    xmin, ymin, xmax, ymax = bounds\n",
    "    px, py = (xmax - xmin) / w, (ymax - ymin) / h\n",
    "    x = np.linspace(xmin + px / 2, xmax - px / 2, w)\n",
    "    y = np.linspace(ymax - py / 2, ymin + py / 2, h)\n",
    "    return xr.DataArray(np.zeros((h, w)), dims=['y', 'x'],\n",
    "                        coords={'y': y, 'x': x})\n",
    "\n",
    "template = make_template(width, height, bounds)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Trip count -- linear sum\n",
    "count_raster = template.xrs.rasterize(gdf, column='trip_count', merge='sum', fill=0)\n",
    "\n",
    "# 2. Total fare -- linear sum\n",
    "fare_raster = template.xrs.rasterize(gdf, column='total_fare', merge='sum', fill=0)\n",
    "\n",
    "# 3. Log-fare -- custom non-linear merge\n",
    "@ngjit\n",
    "def log_fare_sum(pixel, props, is_first):\n",
    "    \"\"\"Sum log(1 + fare) instead of raw fares.\n",
    "\n",
    "    Non-linear: compresses high values, widens the low range.\n",
    "    A $1M route contributes log(1M) ~ 13.8 while a $1K route\n",
    "    contributes log(1K) ~ 6.9, a 2:1 ratio instead of 1000:1.\n",
    "    \"\"\"\n",
    "    val = np.log1p(props[0])\n",
    "    if is_first:\n",
    "        return val\n",
    "    return pixel + val\n",
    "\n",
    "log_raster = template.xrs.rasterize(gdf, column='total_fare', merge=log_fare_sum, fill=0)"
   ]
  },
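  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the compression claim -- using made-up corridor totals,\n",
    "not values taken from the TLC data -- compare the linear and log ratios directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical fare totals for a heavy and a light corridor\n",
    "heavy, light = 1_000_000.0, 10_000.0\n",
    "print(f'linear ratio:  {heavy / light:.0f}x')                      # 100x\n",
    "print(f'log-sum ratio: {np.log1p(heavy) / np.log1p(light):.2f}x')  # ~1.50x"
   ]
  },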
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "import pathlib\npathlib.Path('images').mkdir(exist_ok=True)\n\nfig, axes = plt.subplots(1, 3, figsize=(18, 12), facecolor='black')\n\ntitles = ['Trip count (sum)', 'Total fare (sum)', 'Log-fare (custom merge)']\nrasters = [count_raster, fare_raster, log_raster]\n\nfor ax, raster, title in zip(axes, rasters, titles):\n    ax.set_facecolor('black')\n    masked = raster.where(raster > 0)\n    masked.plot.imshow(ax=ax, cmap='hot', add_colorbar=False,\n                       interpolation='nearest')\n    ax.set_title(title, color='white', fontsize=14, pad=10)\n    ax.set_axis_off()\n\nplt.tight_layout()\nplt.savefig('images/nyc_taxi_lines_preview.png',\n            bbox_inches='tight', dpi=120, facecolor='black')"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The left two panels are dominated by a handful of bright corridors -- the\n",
    "busiest routes running up and down the avenues. Everything else fades to\n",
    "near-black.\n",
    "\n",
    "The right panel applies `log(1 + x)` before summing, which compresses the\n",
    "hundred-fold difference between heavy and light corridors down to roughly\n",
    "1.5x. Cross-town routes and side-street connections that are invisible in the\n",
    "linear versions show up clearly. Same data, different merge function.\n",
    "\n",
    "**When to use a non-linear merge:** any time a few features dominate the\n",
    "signal and you want to see the rest of the distribution. Log-sum works well\n",
    "for count-like or revenue-like data that spans several orders of magnitude."
   ]
  },
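  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same machinery supports merges other than summing. The sketch below --\n",
    "following the same `(pixel, props, is_first)` signature as `log_fare_sum`\n",
    "above -- keeps only the largest single-route fare crossing each pixel, which\n",
    "highlights the dominant route per corridor rather than aggregate volume:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@ngjit\n",
    "def fare_max(pixel, props, is_first):\n",
    "    \"\"\"Keep the largest single-route fare crossing each pixel.\"\"\"\n",
    "    val = props[0]\n",
    "    if is_first:\n",
    "        return val\n",
    "    return pixel if pixel > val else val\n",
    "\n",
    "max_raster = template.xrs.rasterize(gdf, column='total_fare', merge=fare_max, fill=0)"
   ]
  },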
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### References\n",
    "\n",
    "- [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)\n",
    "- [Bresenham's line algorithm (Wikipedia)](https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm)\n",
    "- [xrspatial.rasterize API docs](https://xarray-spatial.readthedocs.io/en/latest/reference/_autosummary/xrspatial.rasterize.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.14.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}