Predicting Homeless Population Sizes

Homelessness in America is a deepening crisis that is inextricably linked to poverty and systemic injustice. The current system for preventing homelessness and housing homeless individuals is inadequate and should be reconsidered. In this project, I gathered publicly available data to build a predictive model that provides insight into homeless population growth.

*video from Princeton Eviction Lab

The United States Department of Housing and Urban Development (HUD) gives grants to "Continuum of Care" applicants through an annual competition. A Continuum of Care (CoC) is a regional or local planning body that coordinates the funding for housing and homelessness prevention services. CoCs distribute resources to nonprofits and local government programs across the regions they represent. The population sizes and geographic areas covered by CoCs vary dramatically from state to state.

Two critical activities entrusted to CoCs are the biannual physical count of homeless people and an annual enumeration of transitional housing units and homeless shelter beds. These counts provide an overview of the state of homelessness in a CoC area and help HUD allocate resources. The counts do not reflect the people who do not wish to be seen or those who may have found shelter on that particular day.

Scope of Project:

For this final project at the Flatiron School, I had 10 days to determine a topic/hypothesis, assemble the dataset, and develop an appropriate model. This was a self-directed project with weekly student check-ins.

Primary Data Sources | Years 2007:2016

US Department of Housing and Urban Development: CoC Homeless Counts

US Social Security Administration: Supplemental Security Income Annual Reports

US Department of Labor's Bureau of Labor Statistics: Unemployment Rates

Princeton University: Eviction Lab Statistics

Kaiser Family Foundation: State Mental Health Agency Expenditures per Capita

HUD Exchange: Continuum of Care Coverage (compiled the county list manually)

Compiling Data

Using Python and pandas, annual HUD CoC files were converted into a single dataframe containing CoC number, year, and homeless count.
Excel workbooks of annual Social Security data contained sheets for each state with spend itemized by county and demographic on each sheet. This data was compiled into a single dataframe.
Unemployment data was downloaded via Kaggle. I grouped monthly observations into annual averages.
Location data was compiled from three sources and merged into a dataframe with CoC number, FIPS code, latitude, longitude, county, state, and state abbreviation.
Eviction state/county data was merged with location data using FIPS/GEOID codes. Social Security and unemployment data were then added to the eviction data.
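As a rough illustration of two of the steps above (annual averaging of monthly unemployment data and a FIPS-keyed merge), here is a minimal pandas sketch. The column names and all values are made up for the example and are not taken from the project notebooks.

```python
import pandas as pd

# Hypothetical monthly unemployment rows; column names and values are assumptions.
# FIPS codes are kept as strings so leading zeros survive the merge.
monthly = pd.DataFrame({
    "fips": ["01001"] * 4 + ["01003"] * 4,
    "year": [2015, 2015, 2016, 2016] * 2,
    "month": [1, 2, 1, 2] * 2,
    "unemployment_rate": [5.0, 6.0, 4.0, 5.0, 7.0, 9.0, 6.0, 6.0],
})

# Collapse monthly observations into one annual average per county.
annual = (
    monthly.groupby(["fips", "year"], as_index=False)["unemployment_rate"]
    .mean()
)

# Hypothetical eviction rows keyed by the same FIPS code and year.
evictions = pd.DataFrame({
    "fips": ["01001", "01001", "01003", "01003"],
    "year": [2015, 2016, 2015, 2016],
    "eviction_rate": [2.1, 2.4, 3.0, 2.8],
})

# A left merge keeps every eviction row even if unemployment data is missing.
merged = evictions.merge(annual, on=["fips", "year"], how="left")
```

Storing FIPS codes as strings rather than integers is a design choice in this sketch; integer codes would silently drop the leading zero for states such as Alabama.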

To combine the compiled data with CoC homeless counts, I grouped the observations into geographic areas using the CoC number. Features containing totals were summed. Features containing percentages, averages, or medians were averaged. The data was then merged to create the initial dataset with 3157 observations and 34 columns.
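The county-to-CoC aggregation described here could look roughly like the sketch below: totals are summed and rates or medians are averaged within each CoC and year. The column names and values are hypothetical, chosen only to show one column of each kind.

```python
import pandas as pd

# Hypothetical county-level rows already tagged with a CoC number.
county = pd.DataFrame({
    "coc_number": ["AL-500", "AL-500", "AL-501"],
    "year": [2016, 2016, 2016],
    "population": [100_000, 50_000, 80_000],
    "renter_occupied_households": [20_000, 8_000, 15_000],
    "poverty_rate": [12.0, 18.0, 15.0],
    "median_gross_rent": [800.0, 700.0, 750.0],
})

# Totals are summed; percentages, averages, and medians are averaged.
agg_rules = {
    "population": "sum",
    "renter_occupied_households": "sum",
    "poverty_rate": "mean",
    "median_gross_rent": "mean",
}

coc_level = county.groupby(["coc_number", "year"], as_index=False).agg(agg_rules)
```

Note that a plain mean treats every county equally regardless of size; a population-weighted mean would be an alternative, but this sketch uses the simple average to match the description above.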

Exploratory Data Analysis

A scatterplot of population size against homeless count showed outliers with homeless counts above 20,000 and population sizes above 5,000,000. To keep these data points in the set, homeless counts were capped at 20,000 and population sizes were capped at 5,000,000.
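One simple way to cap outliers while keeping the rows is `Series.clip`. The sketch below caps the homeless count at 20,000 and the population at 5,000,000, the two outlier thresholds identified in the scatterplot; the column names and sample values are assumptions.

```python
import pandas as pd

# Hypothetical rows; the last one exceeds both outlier thresholds.
df = pd.DataFrame({
    "homeless_count": [350, 4_200, 75_000],
    "population": [120_000, 900_000, 8_500_000],
})

# Cap extreme values instead of dropping the rows.
capped = df.assign(
    homeless_count=df["homeless_count"].clip(upper=20_000),
    population=df["population"].clip(upper=5_000_000),
)
```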

I checked for missing data and found that 498 observations were missing from the eviction rate feature and 448 observations were missing from the eviction filing rate feature. Missing values were imputed with the median value of each feature.
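Median imputation of this kind can be sketched in pandas as follows. The two column names mirror the features mentioned above, but the exact names in the dataset and the values shown are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each eviction feature.
df = pd.DataFrame({
    "eviction_rate": [1.0, np.nan, 3.0, 5.0],
    "eviction_filing_rate": [2.0, 4.0, np.nan, 10.0],
})

cols = ["eviction_rate", "eviction_filing_rate"]

# median() skips NaN by default, so each median uses only observed values.
medians = df[cols].median()

# fillna with a Series fills each column with its own median.
df[cols] = df[cols].fillna(medians)
```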

The distributions of many of the features are not normal, which is a poor fit for the assumptions of linear regression. A logarithmic transformation did improve most of the feature distributions, but I decided not to use it in the final model.
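To show how a log transform can be checked against a skewed feature, the sketch below compares skewness before and after `np.log1p`. The data is synthetic (drawn from a log-normal distribution as a stand-in for a right-skewed feature such as population), so it illustrates the check rather than the project's actual results.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature; not project data.
rng = np.random.default_rng(0)
population = pd.Series(rng.lognormal(mean=11, sigma=1, size=500))

# Skewness near 0 suggests a roughly symmetric distribution.
skew_before = population.skew()

# log1p computes log(1 + x), which stays defined when a value is 0.
skew_after = np.log1p(population).skew()
```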

Feature Engineering

Data on per capita State Mental Health Agency (SMHA) spending for 2014-2016 was scraped from the samhsa.gov site, and the Kaiser Family Foundation provided the data for 2007-2013. Two new features were added: per capita smha and total (budget) smha.

Feature Selection

Highly correlated features were initially removed in order to prevent bias in the model, but after determining that the final model would be a random forest algorithm, they were left in.
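For reference, a common way to drop one feature from each highly correlated pair is to scan the upper triangle of the absolute correlation matrix. The helper `drop_highly_correlated` below is hypothetical, not a function from the project; the 0.90 default matches the threshold named in this README, and the sample data is invented.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features, threshold=0.90):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

# Hypothetical features: the first two are almost perfectly correlated.
features = pd.DataFrame({
    "population": [1.0, 2.0, 3.0, 4.0, 5.0],
    "renter_households": [2.1, 3.9, 6.2, 8.0, 9.9],
    "poverty_rate": [5.0, 1.0, 4.0, 2.0, 3.0],
})

reduced, dropped = drop_highly_correlated(features)
```

Tree ensembles such as random forests are far less sensitive to correlated inputs than linear regression, which is consistent with the decision to leave these features in the final model.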

Transformations

After testing scaling, logarithmic transformations, and polynomial regression, I decided that the final model would use no transformations or scaling in order to maximize interpretability.

Initial tests

Models with logged, scaled and interaction features + dropping features with > .90 correlation

Linear Regression - with logged & interaction features, R2: 0.77

Decision Tree Regression with Grid Search CV, R2: 0.85

Random Forest Regression with Grid Search CV, R2: 0.93

Final Model: Random Forest Regression

After calculating the cumulative sum of the feature importances, any features falling outside the 95% cumulative importance threshold were removed.
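One reading of this step is: sort features by importance, take the running total, and keep features until the total first reaches 95%. The sketch below follows that reading. The importance values are invented for illustration and the feature names are placeholders, so neither reflects the fitted model.

```python
import numpy as np
import pandas as pd

# Made-up importances that sum to 1.0; not the model's actual values.
importances = pd.Series({
    "renter_occupied_households": 0.50,
    "elderly_in_poverty": 0.25,
    "population": 0.15,
    "unemployment_rate": 0.07,
    "eviction_rate": 0.03,
}).sort_values(ascending=False)

# Running total of importance, from most to least important feature.
cumulative = importances.cumsum()

# Index of the first feature at which the running total reaches 95%.
n_keep = int(np.searchsorted(cumulative.to_numpy(), 0.95)) + 1

# Features up to and including that one are kept; the rest are dropped.
selected = importances.index[:n_keep].tolist()
```

With a fitted scikit-learn forest, the `importances` series would typically be built from the estimator's `feature_importances_` attribute and the training column names.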

Random Forest Regression with Grid Search CV

Parameters = {'criterion': 'mse', 'max_depth': 14, 'max_samples': 0.5, 'min_samples_split': 2, 'n_estimators': 1000}

R2: 0.94

MAE: 288

OOB: 0.93
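A grid search of this shape can be sketched with scikit-learn as shown below. This is not the project's training code: the data is synthetic, the grid is deliberately tiny, and `n_estimators` is reduced so the example runs quickly. The `criterion` parameter is left at its default because recent scikit-learn releases renamed `'mse'` to `'squared_error'`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the CoC dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A small illustrative grid; the real search space is an assumption.
param_grid = {
    "max_depth": [6, 14],
    "min_samples_split": [2, 5],
}

# oob_score=True makes the forest report an out-of-bag R^2 after fitting.
forest = RandomForestRegressor(
    n_estimators=50, max_samples=0.5, oob_score=True, random_state=0
)

search = GridSearchCV(forest, param_grid, cv=3, scoring="r2")
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training set with the best parameters.
best = search.best_estimator_
predictions = best.predict(X_test)

r2 = r2_score(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
oob = best.oob_score_
```

The three values computed at the end correspond to the R2, MAE, and OOB metrics reported above, though the numbers from this synthetic example are not comparable to them.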

Conclusion

The number of renter-occupied households is a strong indicator of homelessness, followed by the size of the elderly population living in poverty. This suggests that the housing crisis is driven by a lack of affordable housing and renter protections.

A predictive model can potentially eliminate the problematic manual count and/or grant competitions, allowing for systematic distribution of resources. Funding for homelessness services should be determined by need, irrespective of the current political climate and free from potential institutional bias. Homelessness can be determined at the county level, which could lead to a community-based approach and stronger data collection.

There are additional contributing factors to homelessness such as domestic violence, natural disasters, immigration policy, addiction, and divorce. Including some or all measurements of these factors could improve the model as well.

jen-mckaig/predicting-homelessness
