NYC bike share: Tourist versus native (round 1)
Investigating New York’s Citi Bike system data with SQL, Python, and QGIS.
One in a series of projects highlighting my progress as a self-taught programmer.
I began teaching myself Python, SQL, and GIS in spring 2023 — starting from zero. I therefore welcome feedback on these projects and review for errors. And I’d be interested in taking a crack at your own data and geospatial questions too. Please get in touch.
29 August 2023
Background
As a long-time urban, town, and rail-to-trail cyclist, I have never quite understood the attraction of public bike share systems. Who wants to rent a bike when, surely, in the long term, cost and convenience favor owning one’s own?
But thousands of these systems operate across the world, from Beijing to Barcelona to my corner of western Massachusetts. In 2022, over 30 million rides were logged on New York City’s Citi Bike system alone, with cyclists grabbing bikes from (and returning them to) some 1,800 docking stations across the city.
BIXI bike share riders in Montreal | Matthew Muspratt
Like many systems, NYC’s Citi Bike offers both pay-as-you-go and membership pricing. Currently, non-members pay $4.49 for a 30-minute ride, adding $0.23 per additional minute, while a $205 annual membership grants unlimited 45-minute rides, with $0.17 owed per minute thereafter.
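The pricing above reduces to a little arithmetic. The sketch below is only an illustration: it assumes whole-minute billing and ignores taxes, e-bike surcharges, and any other fees, none of which are covered in this post. The function names are my own.

```python
import math

# Prices quoted above, held in cents to avoid floating-point drift.
NON_MEMBER_BASE_CENTS = 449    # covers the first 30 minutes
NON_MEMBER_EXTRA_CENTS = 23    # each additional minute
MEMBER_ANNUAL_CENTS = 20500    # annual membership
MEMBER_EXTRA_CENTS = 17        # each minute beyond 45


def non_member_ride_cents(minutes: int) -> int:
    """Cost of one pay-as-you-go ride (assumes whole-minute billing)."""
    return NON_MEMBER_BASE_CENTS + max(0, minutes - 30) * NON_MEMBER_EXTRA_CENTS


def member_overage_cents(minutes: int) -> int:
    """Per-ride charge a member owes on top of the annual fee."""
    return max(0, minutes - 45) * MEMBER_EXTRA_CENTS


def break_even_rides() -> int:
    """Fewest rides of 30 minutes or less whose pay-as-you-go total
    reaches the annual membership price."""
    return math.ceil(MEMBER_ANNUAL_CENTS / NON_MEMBER_BASE_CENTS)


print(non_member_ride_cents(50) / 100)  # 50-minute ride as a non-member
print(member_overage_cents(50) / 100)   # same ride as a member
print(break_even_rides())
```

Under these assumptions, a rider who takes short trips about once a week or more over a year would already come out ahead with a membership.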
My own bike share experience is limited to a few rides in Montreal and Washington, D.C., where, riding as a tourist, I cycled under non-member schemes like New York’s and wondered just how much a member’s pattern of use differed from my own. As it happens, data on all 30 million NYC bike share trips is freely available, and in a maiden SQL effort, I sought to find out.
Though I am curious about users’ routes, speed, and range, for this post, my question is basic: In New York, do tourists (non-members) kick off their trips from the same starting stations as natives (members)?
Project outputs
Answer: They don’t. And, as we’ll see, visualizing in QGIS the distribution of popular starting stations (i.e. where riders pick up a bike) helps justify calling non-members “tourists.” But first, here are a few numbers for context, all derived from my initial Python scripts and simple SQL queries of the Citi Bike dataset in the SQLite command line in Terminal for Mac:
Total number of Citi Bike docking stations in the 2022 New York dataset: 1,843 stations
Total number of rides taken in 2022: 30,689,921, but pieces of data were missing for a small fraction of these (e.g. end station, member versus non-member rider) so, ditching those, the clean full dataset I started with totaled 30,618,148 rides.
Of those rides, 78.2 percent were taken by members and 21.8 percent by non-members. Call it an 80/20 ridership split in favor of members.
Number of rides longer than 60 seconds and less than or equal to 2 hours: 29,644,932. Citi Bike believes trip length times under a minute are “potentially false starts or users trying to re-dock a bike to ensure it's secure.” Likewise suspicious of rides exceeding two hours—and their impact on trip time averages and other calculations of interest—I ran several queries probing trip-length times and found that constraining my dataset to rides between 60 seconds and 2 hours in length eliminated only 3.5 percent of all trips and did next to nothing to the 80/20 user type ratio.
In short, after throwing out rides with incomplete data and rides shorter than 60 seconds or longer than 2 hours, I found myself working with nearly 97 percent of the original dataset: some 29.6 million trips, over 23 million of which were taken by Citi Bike members, starting from roughly 1,800 different stations.
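The cleaning steps above can be sketched as a single SQL query, run here through Python's built-in sqlite3 module against a handful of invented rows. This is not my exact query. The table name `trips`, the column names (`started_at`, `ended_at`, `start_station_name`, `end_station_name`, `member_casual`), and the `member`/`casual` labels are assumptions made for illustration.

```python
import sqlite3

# Invented sample rows; table and column names are assumptions.
SAMPLE_TRIPS = [
    ("2022-06-01 08:00:00", "2022-06-01 08:12:30", "A", "B", "member"),
    ("2022-06-01 09:00:00", "2022-06-01 09:00:45", "A", "A", "member"),   # under 60 s
    ("2022-06-01 10:00:00", "2022-06-01 13:30:00", "C", "D", "casual"),   # over 2 h
    ("2022-06-01 11:00:00", "2022-06-01 11:40:00", "C", None, "casual"),  # no end station
    ("2022-06-01 12:00:00", "2022-06-01 12:25:00", "C", "B", "casual"),
    ("2022-06-01 13:00:00", "2022-06-01 15:00:00", "B", "A", "member"),   # exactly 2 h
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (started_at TEXT, ended_at TEXT, "
    "start_station_name TEXT, end_station_name TEXT, member_casual TEXT)"
)
conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?, ?)", SAMPLE_TRIPS)

# Drop incomplete rows, then keep rides longer than 60 s and at most 2 h.
CLEAN_COUNTS_SQL = """
WITH timed AS (
    SELECT member_casual,
           CAST(strftime('%s', ended_at) AS INTEGER)
             - CAST(strftime('%s', started_at) AS INTEGER) AS trip_seconds
    FROM trips
    WHERE start_station_name IS NOT NULL
      AND end_station_name IS NOT NULL
      AND member_casual IS NOT NULL
)
SELECT member_casual, COUNT(*)
FROM timed
WHERE trip_seconds > 60 AND trip_seconds <= 7200
GROUP BY member_casual
"""

counts = dict(conn.execute(CLEAN_COUNTS_SQL).fetchall())
kept = sum(counts.values())
member_share = counts.get("member", 0) / kept
print(counts, f"member share of kept rides: {member_share:.1%}")
```

Filtering for completeness inside the CTE and for duration outside it keeps the two rules separate, so either can be changed without touching the other.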
And so, if Citi Bike users can rent a bike from 1,800 stations across New York City, which do they choose most frequently? I used an SQL query to identify the 100 most common stations from which trips started and mapped the results with QGIS:
Note the yellow dots around Central Park’s edges.
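A Top 100 query of the kind described above might look like the following sketch, again using Python's sqlite3 module and invented rows. The `trips` table, its column names, and the helper `top_start_stations` are assumptions for illustration, not my original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (start_station_name TEXT, member_casual TEXT)")

# Invented sample data: station letters stand in for real station names.
sample = (
    [("A", "member")] * 3 + [("B", "member")] * 2 + [("C", "member")]
    + [("C", "casual")] * 3 + [("A", "casual")]
)
conn.executemany("INSERT INTO trips VALUES (?, ?)", sample)

TOP_STARTS_SQL = """
SELECT start_station_name, COUNT(*) AS starts
FROM trips
WHERE member_casual = ?
GROUP BY start_station_name
ORDER BY starts DESC, start_station_name
LIMIT ?
"""


def top_start_stations(connection, rider_type, n=100):
    """Return the n busiest start stations for one rider type,
    as (station, number_of_starts) pairs, busiest first."""
    return connection.execute(TOP_STARTS_SQL, (rider_type, n)).fetchall()


print(top_start_stations(conn, "member", 2))
print(top_start_stations(conn, "casual", 2))
```

The secondary sort on station name only makes ties deterministic; it has no effect on which stations are busiest.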
Below I present alternative visualizations of these results, which perhaps more clearly demonstrate non-members’ interest in cruising New York’s major parks. But it is first worth noting calculations I made on the histogram of starting stations: for members and non-members alike, roughly 25 percent of all trips started at a Top 100 docking station, even though those stations represent only 5.7 percent of all possible starting stations. In other words, the Top 100 list is a decently meaningful indicator of station popularity.
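The concentration figures above come down to two ratios: the share of trips that start at the busiest N stations, and the share of all stations that those N represent. A minimal sketch, with an invented list of start stations and a hypothetical helper name:

```python
from collections import Counter


def top_n_share(start_stations, n):
    """Return (share of trips starting at the n busiest stations,
    share of all start stations that those n stations represent)."""
    counts = Counter(start_stations)
    if not counts:
        raise ValueError("no trips supplied")
    top = counts.most_common(n)
    trip_share = sum(starts for _, starts in top) / sum(counts.values())
    station_share = len(top) / len(counts)
    return trip_share, station_share


# Invented example: 10 trips spread over 4 stations.
starts = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
trip_share, station_share = top_n_share(starts, 1)
print(f"{trip_share:.0%} of trips start at {station_share:.0%} of stations")
```

A trip share well above the station share, as in the Citi Bike numbers, is what makes a Top N list informative.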
Now for the alternative maps, which employ a third-party QGIS plug-in for 3D visualizations:
Top 100 member start stations. Peak height represents fraction of Top 100 starts.
Top 100 non-member start stations. Peak height represents fraction of Top 100 starts.
My main take-homes from eyeballing these maps are (1) Midtown down to Lower Manhattan are popular spots to pick up bikes for members and non-members alike; and (2) while the distribution of starts among Top 100 start stations is fairly even for members, non-members have a clear interest in picking up bikes on Central Park’s perimeter (not to mention Prospect Park’s in Brooklyn).
Here’s another angle, this time depicting both member and non-member data:
Top 100 start stations, members (grey) and non-members (black). Peak height represents fraction of Top 100 member/non-member starts.
Techniques
Code and QGIS tools to produce the above included:
Python scripts and the glob and pandas libraries to merge 12 large Citi Bike CSV data files into a single dataframe, confirm datetime formats, calculate and add trip-length columns, search for missing data, and calculate basic dataset statistics.
Python script to create a definitive, QGIS-friendly dataset of docking station coordinates (longitudes and latitudes).
Created database and multiple tables in DB Browser for SQLite to run SQL queries. Soon discovered my 12.46 GB database was much more quickly analyzed in the SQLite command line in Terminal for Mac, so subsequently ran queries and exported results into QGIS-friendly files using that platform.
Joined layers and summed columns for various calculations in QGIS; added spatial properties to attribute-only layers through joins and geopackage exports.
Used QGIS Translate tool to slightly offset station locations in order to ensure clearer depiction of dots and cones.
Used qgis2threejs plug-in to create 3D visualizations, experimenting with Geometry Height formulas to achieve desired effect.
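The first two Python steps in the list above can be sketched roughly as follows. This is a minimal illustration, not my actual scripts: the column names (`started_at`, `ended_at`, `start_station_name`, `start_lat`, `start_lng`), the file names, and the choice to average each station's reported coordinates are all assumptions, and the two small CSV files are invented so the example runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_trip_csvs(pattern):
    """Concatenate every CSV matching `pattern` and add a trip-length column."""
    paths = sorted(glob.glob(pattern))
    frames = [pd.read_csv(p, parse_dates=["started_at", "ended_at"]) for p in paths]
    trips = pd.concat(frames, ignore_index=True)
    trips["trip_seconds"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds()
    return trips


def station_coordinates(trips):
    """One row per start station; averaging the coordinates is an assumption."""
    return (
        trips.dropna(subset=["start_station_name"])
        .groupby("start_station_name", as_index=False)[["start_lng", "start_lat"]]
        .mean()
    )


# Invented monthly files standing in for the 12 real downloads.
january = pd.DataFrame({
    "started_at": ["2022-01-01 08:00:00", "2022-01-01 09:00:00"],
    "ended_at": ["2022-01-01 08:10:00", "2022-01-01 09:30:00"],
    "start_station_name": ["A", "B"],
    "start_lat": [40.0, 41.0],
    "start_lng": [-74.0, -73.0],
})
february = pd.DataFrame({
    "started_at": ["2022-02-01 08:00:00", "2022-02-01 10:00:00"],
    "ended_at": ["2022-02-01 08:05:00", "2022-02-01 10:20:00"],
    "start_station_name": ["A", None],  # one missing station name
    "start_lat": [40.2, 40.5],
    "start_lng": [-74.2, -73.5],
})

with tempfile.TemporaryDirectory() as tmp:
    january.to_csv(os.path.join(tmp, "202201-demo.csv"), index=False)
    february.to_csv(os.path.join(tmp, "202202-demo.csv"), index=False)
    trips = merge_trip_csvs(os.path.join(tmp, "*.csv"))

missing_per_column = trips.isna().sum()  # quick search for missing data
stations = station_coordinates(trips)
print(missing_per_column)
print(stations)
```

The `stations` frame could then be written out with `to_csv` and loaded into QGIS as a delimited-text layer.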
Data sources
In addition to downloading trip data from the Citi Bike website, I found street, park, bike route, and other New York City shapefiles at NYC Open Data.