Traditional urban planning models often assume that freight traffic is a simple function of general economic activity. As e-commerce and modern logistics reshape cities, I assisted Professor Stathopoulos in a project designed to challenge this assumption. Our primary objective was to identify the specific, modern drivers of freight demand in Chicago by analyzing the relationship between freight trip volumes and a nuanced set of land use, employment, and infrastructure indicators.
The final paper I produced for this project is on the left.
My role involved contributing to the entire research lifecycle, from initial data gathering and cleaning to the final model validation and interpretation of the results.
The first step was to build a comprehensive, cross-sectional dataset. I sourced data from multiple public and private entities, including Replica HQ for weekly freight trip counts, the Chicago Metropolitan Agency for Planning (CMAP), and the Chicago Data Portal. My key task was to clean and aggregate this disparate data to a consistent unit of analysis: Chicago's 77 community areas. This required spatial joins and careful reconciliation of the different sources to produce a single unified dataset ready for analysis.
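To illustrate the aggregation logic at the heart of this step, here is a minimal sketch of the spatial join and group-by, written in Python with geopandas for illustration only (the project itself was carried out in R); the file names and column names are hypothetical stand-ins.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical inputs: the 77 community-area boundaries and point-level trip ends
areas = gpd.read_file("community_areas.geojson")           # hypothetical path
trips = gpd.read_file("weekly_freight_trip_ends.geojson")  # hypothetical path

# Assign each trip end to the community area that contains it
joined = gpd.sjoin(trips, areas, how="inner", predicate="within")

# Aggregate to one row per community area
freight_by_area = (
    joined.groupby("community")                 # hypothetical area-name column
          .agg(trip_ends=("trip_id", "count"))  # hypothetical trip identifier
          .reset_index()
)

# Merge with employment / land-use indicators keyed on the same unit
indicators = pd.read_csv("cmap_indicators.csv")             # hypothetical path
dataset = freight_by_area.merge(indicators, on="community", how="left")
```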
Before modeling, I performed exploratory data analysis to visualize initial relationships and refine our hypothesis. This involved creating GIS-based choropleth maps (like Figure 1) to visually compare the geographic distribution of freight hotspots with indicators like median household income. This initial step immediately confirmed a key insight: the areas with the highest freight activity were not the same as the city’s wealthiest areas, suggesting a more complex story.
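As a rough illustration of the side-by-side comparison behind Figure 1, a choropleth pair can be drawn from a merged GeoDataFrame; the file and column names below are assumptions, and the maps in the paper were produced with GIS tooling rather than this exact code.

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Hypothetical merged file: boundaries plus the aggregated indicators
map_df = gpd.read_file("community_areas_with_indicators.geojson")

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
map_df.plot(column="trip_ends", cmap="OrRd", legend=True, ax=axes[0])
axes[0].set_title("Weekly freight trip ends")
map_df.plot(column="median_hh_income", cmap="Blues", legend=True, ax=axes[1])  # hypothetical column
axes[1].set_title("Median household income")
plt.tight_layout()
plt.show()
```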
The core of the project was the development of a predictive model.
Model Selection: Recognizing that freight trip data is count-based and heavily right-skewed, I implemented a log-linear regression model. Applying a natural log transformation to the dependent variable (log(Trip Ends + 1)) reduced that skew and brought the data closer to the assumptions of ordinary least squares, which was critical for valid inference.
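A minimal sketch of this specification, written with Python's statsmodels purely for illustration (the actual model was estimated in R); the predictor names are stand-ins for the paper's indicators, not the exact variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("freight_by_community_area.csv")   # hypothetical path
df["log_trip_ends"] = np.log(df["trip_ends"] + 1)   # log(Trip Ends + 1)

# Illustrative predictors only; the paper's indicator set differs
model = smf.ols(
    "log_trip_ends ~ warehousing_employment + job_density + truck_route_miles",
    data=df,
).fit()
print(model.summary())
```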
Problem Diagnosis & Correction: An initial version of the model suffered from severe multicollinearity, where predictor variables were too closely related. I diagnosed this by calculating the Variance Inflation Factors (VIF), which revealed values exceeding 30 for Job_Density_i and Warehousing_Employment_1000s. This indicated a flawed model. I addressed this by refining the model's variables, which resulted in a corrected model where all VIFs were close to 1.0, confirming a robust and statistically sound structure (see VIF results).
A model is only useful if its assumptions are met. I validated the final model by generating and analyzing a series of diagnostic plots (Figure 8). I assessed the Residuals vs. Fitted plot to confirm linearity and the Normal Q-Q plot to check for normality in the residuals. While the model showed some "heavy tails" due to outliers like the Loop, the diagnostics confirmed it provided a sound and defensible framework for our conclusions.
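The two diagnostics mentioned above can be generated roughly as follows, assuming a fitted OLS result like the `model` object sketched earlier; this mirrors R's Residuals-vs-Fitted and Normal Q-Q panels.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. Fitted: look for curvature or fanning
ax1.scatter(model.fittedvalues, model.resid, alpha=0.7)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
ax1.set_title("Residuals vs. Fitted")

# Normal Q-Q: heavy tails show up as departures at the extremes
sm.qqplot(model.resid, line="45", fit=True, ax=ax2)
ax2.set_title("Normal Q-Q")

plt.tight_layout()
plt.show()
```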
The final step was to translate the model's statistical output into a clear and compelling narrative. I generated a series of data visualizations to communicate our findings, including the correlation matrix (Figure 3) and various scatter plots (Figure 4) that clearly showed that warehousing employment was a far stronger predictor of freight trips than manufacturing or wholesale trade employment.
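A correlation heatmap of the kind shown in Figure 3 can be produced along these lines; the column names are illustrative, and the published figures came from the R workflow.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("freight_by_community_area.csv")   # hypothetical path
cols = ["trip_ends", "warehousing_employment",
        "manufacturing_employment", "wholesale_employment"]  # illustrative names

sns.heatmap(df[cols].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of freight trips with employment indicators")
plt.tight_layout()
plt.show()
```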
Our analysis successfully demonstrated that Chicago’s freight landscape has decoupled from centralized commercial activity. The true drivers are specialized logistics indicators—specifically, warehousing employment and the presence of major distribution centers.
This project was a deep dive into the practical application of statistical modeling to solve real-world urban challenges. I learned the critical importance of a methodical process: starting with careful data aggregation, diagnosing and correcting model deficiencies like multicollinearity, and rigorously validating the final results before drawing conclusions. Most importantly, it taught me how to use data to tell a nuanced story and provide data-driven tools for more equitable and effective urban planning.
Tools & Skills: Statistical Modeling (Log-Linear Regression), R, Data Visualization, Data Cleaning & Aggregation, Model Diagnostics (VIF, Residual Analysis), Technical Communication.
The objective of this project was to build a robust machine learning model capable of classifying atmospheric conditions as 'clear' or 'cloudy' using hyperspectral radiance data from the AIRS satellite instrument. This is a fundamental task in atmospheric science and remote sensing, enabling more accurate climate modeling and weather forecasting.
I began by sourcing and preprocessing a large, real-world dataset stored in .mat scientific format. The primary features consisted of over 1,500 spectral channels of brightness temperature data. My first step was to perform exploratory data analysis, which involved visualizing the average spectra for clear vs. cloudy scenes to understand the fundamental physical differences that a model could learn.
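A minimal sketch of this loading and EDA step, assuming a v7.3 .mat file (an HDF5 container readable with h5py); the dataset keys, array layout, and file name are assumptions, not the instrument's actual field names.

```python
import h5py
import numpy as np
import matplotlib.pyplot as plt

with h5py.File("airs_granule.mat", "r") as f:        # hypothetical file name
    radiances = np.array(f["brightness_temp"])       # assumed (n_scenes, n_channels) layout
    labels = np.array(f["cloud_flag"]).ravel()       # assumed key: 1 = cloudy, 0 = clear
    wavenumbers = np.array(f["wavenumbers"]).ravel() # assumed key

# Compare the mean spectrum of clear vs. cloudy scenes
plt.plot(wavenumbers, radiances[labels == 0].mean(axis=0), label="Clear")
plt.plot(wavenumbers, radiances[labels == 1].mean(axis=0), label="Cloudy")
plt.xlabel("Wavenumber (cm$^{-1}$)")
plt.ylabel("Brightness temperature (K)")
plt.legend()
plt.show()
```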
I developed a Random Forest classifier using Scikit-learn, chosen for its robustness and interpretability. A key part of the process was to prevent overfitting and maximize performance by systematically tuning the model's hyperparameters. I implemented GridSearchCV to find the optimal max_depth, using 5-fold cross-validation to ensure the model would generalize well to unseen data.
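The tuning step looked roughly like the sketch below, reusing the `radiances` and `labels` arrays from the loading step; the grid values and split parameters shown are illustrative rather than the exact settings used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set before any tuning
X_train, X_test, y_train, y_test = train_test_split(
    radiances, labels, test_size=0.2, random_state=42, stratify=labels
)

# 5-fold cross-validated search over max_depth (illustrative grid)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=42),
    param_grid={"max_depth": [5, 10, 20, None]},
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_train, y_train)
best_rf = grid.best_estimator_
print(grid.best_params_, grid.best_score_)
```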
To assess the final model's performance, I went beyond simple accuracy. I generated and analyzed a confusion matrix to understand the specific types of errors (false positives vs. false negatives) and plotted the ROC curve, using its AUC as a comprehensive measure of the classifier's predictive power across all decision thresholds.
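A sketch of that evaluation, reusing the held-out test split and the tuned `best_rf` estimator from the previous step.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

y_pred = best_rf.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class

# ROC curve from predicted probabilities for the positive ("cloudy") class
y_score = best_rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```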
The most critical phase was interpreting the "black box." I extracted and analyzed the model's feature importances to identify exactly which spectral channels were most influential in the classification decision. By plotting these importances against their corresponding wavenumbers, I was able to connect the model's statistical findings back to the underlying atmospheric science, highlighting the spectral regions most affected by clouds.
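The interpretation step can be sketched as follows, assuming the `best_rf` model and `wavenumbers` array from earlier; the most important channels can then be matched against known absorption and window regions.

```python
import numpy as np
import matplotlib.pyplot as plt

importances = best_rf.feature_importances_

# Importance of each spectral channel, plotted against its wavenumber
plt.plot(wavenumbers, importances)
plt.xlabel("Wavenumber (cm$^{-1}$)")
plt.ylabel("Random Forest feature importance")
plt.title("Channels most influential for the clear/cloudy decision")
plt.show()

# Top channels for comparison with the underlying atmospheric physics
top = np.argsort(importances)[::-1][:10]
print(wavenumbers[top])
```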
The final, tuned model achieved a high classification accuracy on the test set. This project demonstrated a full machine learning workflow: from handling a complex scientific dataset to building, optimizing, and—most importantly—interpreting a predictive model to derive actionable insights.
Tools & Skills: Python, Scikit-learn, Pandas, NumPy, H5Py, Machine Learning, Random Forest, GridSearchCV, Cross-Validation, Feature Importance Analysis, Data Visualization.