---
#Add in title
#title: "| \n| \n| Estimating the Annual Size of the Los Angeles Homeless Population
# Using Point-in-Time Data \n| \n| \n"
#--author: "Jane Carlen"
bibliography: annualized_model.bib
fontsize: 12pt
subparagraph: yes
output:
pdf_document:
includes:
in_header: mystyles.sty
abstract: In this report we introduce a model-based framework for estimating the number
of people who experience homelessness in Los Angeles over the course of a year based
on point-in-time data. We refer to this as the annualized estimated of the population
size. Unlike capture-recapture methods, this model is based on a single set of anonymous
point-in-time data. This makes it viable for nearly all jurisdictions that conduct
annual counts of their homeless population and gather basic information about the
duration of homelessness within this population. In addition, it allows us to estimate the breakdown
of the annualized population by duration of homelessness. A practical application of this is to quantify the distribution of homeless services needed to provide progressively more intensive levels of intervention for individuals with longer durations of homelessness. We incorporate
measures of uncertainty in our final estimates, reflecting that demographic
data is derived from a sample of the target population. As such, our model provides more complete information about the annualized population than the model developed by @burtwilkins05 that is currently in use in Los Angeles.
---
\hrulefill
```{r setup, cache = T, echo = F, results = "hide", warning = F, message = F, fig.keep = "none"}
library(xtable)
library(plotrix)
library(dplyr)
library(questionr)
library(weights)
library(foreign)
library(stringr)
library(ggplot2)
library(scales)
library(lubridate)
source("annualized_model_data.R")
```
# Introduction
Every year Los Angeles invests significant resources, including mobilizing thousands of volunteers, to estimate the number of people experiencing homelessness on a given night. The annual Los Angeles Homeless Count, which includes information collected from sheltered and unsheltered people, is the primary tool for measuring the size and demographics of this population and how they have changed over time. This information is critical to understanding the growing need for emergency shelter, housing services, and other interventions. With the availability of hundreds of millions of dollars in funding annually for housing and supportive services, through the county's Measure H and the city's Measure HHH, this information becomes even more vital to policymakers.
The primary goal of the Homeless Count is to measure the population that is homeless on a given night. The resulting population size estimate, which has been in the ballpark of fifty thousand for the last two years, is often taken as the definitive size of the homeless population. It is the figure most widely reported in local media, and when it declines, as reported in 2018, it is considered evidence that resources are being deployed successfully.
However, the size of the point-in-time (PIT) population, the number of people who are homeless on a given night, may not always be the primary data point for policymakers. When planning for housing and other services that are provided on an extended basis it is equally important to know how many people experience homelessness over the entire period.
In this report we address the question of how to estimate the number of people who experience homelessness over the course of a year, the annualized estimate, based only on PIT data. To do this we draw on survey data that describes how long individuals have been homeless and how many stints of homelessness they have recently experienced.
We present an estimation method that can be flexibly applied regardless of the time intervals in which these measurements were gathered. Our model also allows us to present our estimate within a range of plausible estimates, reflecting limitations of the sampling methodology. In addition, it enables us to not only estimate the annualized population size, but the distribution of time spent homeless within the whole population. This is extremely important for capturing the prevalence of short-term homelessness. Although chronic homelessness is often the face of homelessness in cities like Los Angeles, more people experience shorter stints of homelessness over the course of a year. Bringing this to the forefront highlights the precarious housing situations that too many Angelenos face.
Research to date on estimating the annualized homeless population based on PIT data is sparse. In fact, we found only the model of @burtwilkins05, the model currently used Los Angeles, designed specifically for this purpose.
Related research has been conducted on modeling durations of homelessness based on shelter data and demographic surveys. For example, @metraux99 present an overview of methods to analyze the lengths of shelter stays using administrative data. This work contains examples of constructing survival curves and hazard curves to describe the distribution of lengths of stays of individuals in shelter, drawing on data from New York City. More generally, this line of inquiry falls under the active research field of survival analysis.
In the next section we present the nuts and bolts of the model of @burtwilkins05 and address some of its shortcomings.
## The Burt and Wilkins model
Although not widely publicized, the Los Angeles Homeless Services Authority (LAHSA) reports annualized estimates of the size of the homeless population and of many demographic subgroups. These estimates are calculated using the following model developed by @burtwilkins05.
\begin{equation}
Annualized~estimate = A + 51*B*(1-C),
\label{eq:original}
\end{equation}
where $A$ is the PIT count of the homeless population, $B$ is the number of currently homeless people who became homeless in the counted area during the last week, and $C$ is the proportion of currently homeless people who had a previous homeless episode during the last year.
This formula accounts for the fact that more people become homeless each week. If we were to conduct a PIT count in the remaining 51 weeks of the year we would initially add $B$ people to the population estimate every week. But, according to the model, the proportion $C$ of them would have been counted during a previous week.
The use of a consistent proportion $C$ does not reflect that fact that if a new count were performed every week then only in the final week of the 52-week period would $C$ approximate the proportion of individuals who were previously counted. On the other hand, if we assume that previous episodes of homelessness in a year are uniformly distributed, then in the second week of the 52-week period the proportion of people previously counted would be $\frac{C}{52}$. Summing over 52 weekly counts, the annualized estimate would be calculated as follows.
\begin{align}
Modified~annualized~estimate &= A + B(1-\frac{C}{52} + 1-\frac{2*C}{52} + ... + 1-\frac{51*C}{52}) \notag \\
& = A + 51*B*(1 - \frac{1}{2}*C). \label{eq:modified}
\end{align}
```{r annualized_estimate, echo = F, warning = F, eval = T, message = F, results = "hide", cache = T}
A = pit_2017 = 52442
unsheltered_2017 = 38470
B = 1513.335 # sheltered plus unsheltered
C = 0.2918899
A + (51*B*(1-C))
A + (51*B*(1-.5*C))
underestimate1 = prettyNum(round(25.5*B*C), big.mark=",")
# Calculation of C ####
# I'm not entirely clear how C is calculated in practice. Seems to take the proportion of people homeless more than once over the whole population, but more appropriate would be to restrict to people who’ve been homeless a week or less.
# Does calculation of C account for the fact that previous homeless episode may not be in LA?
# Burt and Wilkins mention this, but the closeness of the estimate below without accounting it suggests they don't
# About 30pct of people in demographic stated out of county as Location_Before_Current_Stint --> impact would be bigger?
# Attempt to recreate
p1 = prop.table((wtd.table(demographic_17$Times_Homeless_Past_Year, weights = demographic_17$Weights_rescale)$sum.of.weights))
p2 = prop.table((wtd.table(demographic_17$Times_Homeless_3yrs, weights = demographic_17$Weights_rescale)$sum.of.weights))
ratio1 = p2/p1
p3 = prop.table((wtd.table(shelter_17$Times_Homeless_3yrs, weights = shelter_17$Weights_rescale)$sum.of.weights))
#Re-created estimate
C2 = 1 - (unsheltered_2017*p1[1]/sum(p1[1:3]) + (pit_2017-unsheltered_2017)*p3[1]/sum(p3[1:3])*ratio1[1])/pit_2017
```
Thus, the original model in Equation \eqref{eq:original} is an underestimate by $\frac{51}{2}*B*C$ of the annualized population, as compared to the revised estimate in Equation \eqref{eq:modified}.
[^1]: This value of B is actually a composite estimate that we use for simplicity. In fact, since 2016 the annualized estimate calculated by LAHSA is a sum of seperate calculations for sheltered adults, sheltered youth, unsheltered adults, and unsheltered youth, where B and C are calculated for each sub-group.
For example, in 2017 when $B$ was estimated to be 1,513 people [^1] and $C$ to be 29.2 percent, the magnitude of underestimate was `r underestimate1` or about ten percent.
```{r annualized_range, echo = F, warning = F, eval = T, message = F, cache = T, results= "hide"}
#A + (51*B*(1 - 0.5*C))
B2 = 1000 #hypothetical total from unsheltered
C = .292
N = 5800
# Uncertainty with binomial ####
set.seed(5)
#range for B
B.range = prettyNum(round(quantile(rbinom(1000000, N, B2/unsheltered_2017)/N, probs = c(0.025, .5,.975))*unsheltered_2017), big.mark = ",")
#range for C
rbinom.C = rbinom(1000000, N, C)/N
quantile(rbinom.C, probs = c(0.025, .5,.975))
#overall range
range1 = 52442 + quantile(51*(rbinom(1000000, N, B2/unsheltered_2017)/N)*unsheltered_2017*(1 - 0.5*rbinom.C), probs = c(0.025, .5,.975)) + 51*(B - B2)*(1 - 0.5*C) #PIT + unsheltered plus sheltered
overall.range = prettyNum(round(range1), big.mark=",")
```
In addition, the model introduces variability due to sampling of the unsheltered population. Although the Homeless Count attempts to count every person who is homeless on a given night, only a fraction of the unsheltered population completes demographic surveys. For example, in 2017 about 5,800 individual responses were gathered from unsheltered individuals containing information about their duration of homelessness. This represents about fifteen percent of the estimated PIT unsheltered population of 38,470 individuals. On the other hand, the survey of sheltered individuals is a near-complete census of the entire unsheltered population in a given time window, with a few minor exceptions such as domestic violence shelters.
While sampling is a necessary method for collecting data about any large and hard-to-reach population, it is important to relate the impact that it has on the precision of the resulting estimate. To illustrate this, let us make the simplifying assumptions that the true value of B is 1,000 for the unsheltered population and each individual's response is independent of others who are surveyed. Using the binomial distribution and the sampling parameters from above, our estimate of B has a 95 percent chance of being between `r B.range[1]` and `r B.range[3]`. Our estimate of C has a 95 percent chance of being between 28.05 percent and 30.35 percent. Using our updated formula, we would expect that 95 percent of the time our derived annualized estimate would fall in the range from `r overall.range[1]` to `r overall.range[3]`. This is a fairly wide range. Although the underlying estimate of measurement uncertainty is not perfect due to the simplifying assumptions, we advocate for presenting such a range to better reflect the data collection process.
Another weakness of the @burtwilkins05 model is that is it highly dependent on a single measurement, B, of the number of new entrants to homelessness in the previous week. Such a measure is impacted by the time of the month and clumping in reported durations of homelessness, among other factors. (Clumping in this context refers to the tendency to round off time estimates, such as if everyone who was homeless for about a week reports being homeless for one week. See Figure \ref{fig:why_smooth} for evidence of clumping.) The measurement of B is also sensitive to changes in data collection. For instance, in 2017 the youth-specific survey stopped asking questions about how long an individual was in Los Angeles County. Data from 2016 was therefore used to fill in missing pieces of the calculation [@methodology17].
The upshot of the sensitivity of the Burt and Wilkins model to sampling variation, measurement error and changes in data collection procedures is shown in the fluctuation of the annualized estimates. While estimates of the PIT homeless population in Los Angeles always increased in reported counts from 2011 to 2017, the annualized estimate has decreased in all but one year, as shown in Table \ref{tab:annualized_estimate_table}. Part of the reason why there is such a large drop in the annualized estimate from 2016 to 2017 is that the method for calculating the length of stay in a shelter changed. In 2016, the length of stay in the shelter was used. In 2017, this length could be added to the length of stay in the prior living situation to determine whether the individual was homeless for a week or less [@methodology17].
```{r annualized_estimate_table, eval = T, echo = F, results = "asis", warning = F, message = F, cache = T}
annualized_estimate_table = data.frame(Year = c("2011", "2013", "2015", "2016", "2017"), "PIT Estimate" = c("34622", "35524", "41174", "43854", "52442"), "Annualized Estimate" = c("120,072", "187,119", "162,769", "156,431", "107,094"))
print.xtable(xtable(annualized_estimate_table, caption = "LAHSA PIT and annualized population estimates based on the Burt and Wilkins model for the Los Angeles CoC.", label = "tab:annualized_estimate_table"), sanitize.colnames.function=function(x) gsub("\\."," ",x), include.rownames = F, comment = F, caption.placement = "top")
```
A final shortcoming of the Burt and Wilkins model that we address is that that it does not account for potentially unobserved very short-term homeless populations. The proportion of people homeless for at most a week is an underestimate because it does not account for people who are homeless for only a few days but not at the time of the survey. They would drive up the value of B in the model, which means that the reported value of B is likely an underestimate.
One way to account for this would be to base the model on the number of newly homeless people in the last day rather than in the last week, but this single-day measure would be even more prone to measurement error and natural variability than B. A related concern is that people who have only been homeless for a very short time may be the most likely to be skipped over by people conducting demographic surveys, even if they are homeless when the survey is conducted. Although the count of the PIT unsheltered population occurs overnight, the demographic survey is conducted largely during the day, when it is more difficult to determine who is homeless. This would also lead to an underestimate of B.
Admittedly, this unobserved group is difficult to measure directly. It would require identifying people who experience very short stints of homelessness, who may or may not be homeless when the survey is implemented. It may be possible to estimate the size of this group using data from shelters, for which we have records over an entire month. However, that would require gathering reliable information about individuals' destinations after leaving a shelter. In 2017 most of that data was missing.
For these reasons, we propose a model that leverages a much larger pool of information about durations of homelessness to make inferences about the size of the short-term homeless population. It reflects our assessment that the percentage of individuals in the PIT population who have been homeless for a given duration should in general decrease as the duration increases, barring strong seasonal effects, policy changes or shocks to the economy or housing market. For instance, the number of people who have been homeless for two months is a subset of those who were homeless for one month a month ago. Therefore, we expect the PIT two-month group to be smaller than the one-month group. Given the minimal seasonality in Los Angeles, we assume that population entry and exit rates are fairly stable throughout the year.
## Data
As mentioned above, annual Homeless Count data includes both the overnight count data used to estimate the size of the PIT population and demographic data that contains detailed information about a subset of the population. Throughout this report we use data that describes the Los Angeles Continuum of Care (CoC), which includes all of Los Angeles County except the cities of Pasadena, Glendale and Long Beach. If not otherwise specified, the data employed is from 2017, the most recent year for which we have demographic survey data.
The demographic data covers a wide array of topics, but the available variables depend on the year, whether the survey respondent was sheltered or unsheltered, and whether they were in the ``youth'' group (age 18-24). For our model, the most important variables are those describing time spent homeless. Specifically, we use the reported durations of current stints of homelessness and number of times homeless in the previous year or, if that is unavailable, three years. We find that the it is necessary to pre-process some of the information about durations of current stints, and we describe this process with the introduction of our model framework in the next section.
# Model Framework
The most important advancement of the estimating methodology we present below over the Burt and Wilkins model is that we first model the distribution of the lengths of current stints of homelessness. We then use this fit curve to calculate an annualized estimate and derive other valuable information about the distribution of time spent homeless among the annualized population.
## Smooth the duration data
Because of the clumping of survey responses mentioned previously, we must take steps to smooth the distribution of reported stint durations before modeling it. We restrict our attention to durations under a year as this is sufficient for our model, and detailed stint durations become less reliable for longer periods. Figure \ref{fig:why_smooth} shows the distribution of durations by month before and after this process. Before smoothing there are spikes in the distribution at one month, six months and nine months. This suggest a tendency for survey respondents to significantly round off the duration of their current stint of homelessness.
Accordingly, to smooth the data we first group the data into quarters of the year. We then fit a piecewise exponential distribution to each quarter. This agrees with our previous assumption that the likelihood of observing a certain stint duration decreases with its length. It also makes the simplifying assumption that exit rates from homelessness are constant within each quarter. Finally, the data is resampled from the piecewise distribution to reflect the natural variability in observed data. While we expect a decreasing curve overall, we know that in practice it will be spiky rather than smooth. The smoothed distribution shown in Figure \ref{fig:why_smooth} represents a single sample from the piecewise exponential, but we will ultimately draw on many samples to obtain our annualized estimate and measure uncertainty.
Note that this smoothing process was designed for our data, but it may not be optimal in all cases. In some cases this step may not even be necessary, for example if most homeless people are sheltered and duration data is carefully recorded rather than self-reported.
```{r fit_curve, eval = T, echo = F, results = "hide", message = F, warning = F, fig.keep = "none", cache = T}
set.seed(2)
# Make shelter + unshletered data set ####
names1 = c(intersect(names(dplyr::select(demographic_17, contains("Duration"))),
names(dplyr::select(shelter_17, contains("Duration")))), "Weights_rescale")
total_17 = rbind(cbind(shelter_17[,names1], source = "shelter"), cbind(demographic_17[,names1], source = "unshelter"))
# Prepare the data ####
smooth1 <- function(data1, time1, days1) {
#quarters
if (time1=="quarters") {
sample1 = as.numeric(as.character(data1$Current_Stint_Duration_Quarters))[!is.na(as.numeric(as.character(data1$Current_Stint_Duration_Quarters)))]
weights1 = data1$Weights_rescale[!is.na(as.numeric(as.character(data1$Current_Stint_Duration_Quarters)))]
max1 = max(sample1)
w1 = wtd.table(sample1, weights1/sum(weights1))
#smooth out by piecewise exponential -- better than uniform
sample1 = sapply(sample1, function(x) {
fn1 = function(y) {abs(exp(-y*(x-1)) - exp(-y*x)-w1$sum.of.weights[x])}
lambda1 = optim(fn1, par = .5, method = "L-BFGS-B", lower = 0, control = list(maxit = 1000, pgtol = .001))
sample(x = seq((x-(days1-1)/days1)*days1, (x-(days1-1)/days1)*days1+(days1-1), length = days1),
size = 1,
prob = dexp(seq((x-(days1-1)/days1)/4,x/4, length = days1), lambda1$par),
replace = T)})/days1 #+ .00001 avoid error
#uniform:
#sample1 = sapply(sample1, function(x) {runif(1, x-1, x)})
#wtd.hist(sample1, weight = weights1, col = "red")
return(list(sample1 = sample1, weights1 = weights1, max1 = max1, days1 = days1))
}
if (time1=="months") {
sample1 = as.numeric(data1$Current_Stint_Duration_Months)[!is.na(as.numeric(data1$Current_Stint_Duration_Months)) & as.numeric(data1$Current_Stint_Duration_Months) < 13]
weights1 = data1$Weights_rescale[!is.na(as.numeric(data1$Current_Stint_Duration_Months)) & as.numeric(data1$Current_Stint_Duration_Months) < 13]
max1 = max(sample1)
w1 = wtd.table(c(sample1,2), c(weights1/sum(weights1),0)) #fix for no 2 values
#better than uniform?
#sample1 = sapply(sample1, function(x) {sample(seq((x-1)*31, (x-1)*31+30, length = 31),
# prob = dexp(seq((x-1)/12,x/12, length = 31), -log(1 - w1$sum.of.weights[x])),
# replace = T)})
sample1 = sapply(sample1, function(x) {runif(1, x-1, x)})
return(list(sample1 = sample1, weights1 = weights1, max1 = max1, days1 = days1))
}
}
#days
#sample1 = data1$Current_Stint_Duration_Days[!is.na(data1$Current_Stint_Duration_Days) & data1$Current_Stint_Duration_Days < 365]
#weights1 = data1$Weights_rescale[!is.na(data1$Current_Stint_Duration_Days) & data1$Current_Stint_Duration_Days < 365]
#max1 = max(sample1)
#sample1 = sapply(sample1, function(x) {runif(1, x-1, x)})
results.smooth1 = smooth1(total_17, "quarters", 92)
sample1 = results.smooth1[["sample1"]]
weights1 = results.smooth1[["weights1"]]
max1 = results.smooth1[["max1"]]
days1 = results.smooth1[["days1"]]
# Fit the data ####
# MLE method for truncated data: https://stats.stackexchange.com/questions/117782/truncated-gamma-distribution-parameter-estimation
# - exponential (benefit of always decreasing, makes sense for overall distribution) ####
par(mfrow = c(2,2))
fn.exp = function(params) {
sum(weights1*log(dexp(sample1, params[1])))
}
optim.exp = optim(par = 1, fn = fn.exp, control = list(fnscale = -1, maxit = 500), method = "BFGS")
optim.exp
wtd.hist(sample1, breaks = 40, freq = F, add = F, w = weights1, col = "green")
curve(dexp(x, optim.exp$par), xlim = c(0,max1), ylim = c(0,1), add =T)
# - truncated exponential ####
dexptrunc <- function(x, lambda, lower, upper) {
dexp(x, lambda) / diff(pexp(c(lower,upper), lambda))
}
fn.exptrunc = function(lambda) {
sum(weights1*log(dexptrunc(sample1, lambda, lower = 0, upper = max1)))
}
optim.exptrunc = optim(par = 1, fn = fn.exptrunc, control = list(fnscale = -1, maxit = 500), method = "BFGS")
optim.exptrunc
wtd.hist(sample1, breaks = 40, freq = F, add = F, w = weights1, col = "green")
curve(dexp(x, optim.exptrunc$par[1]), xlim = c(0,max1), ylim = c(0,.75), add =T)
# - weibull ####
fn.weibull = function(params) {
sum(weights1*log(dweibull(sample1, shape = params[1], scale = params[2])))
}
optim.weibull = optim(par = c(3,3), fn = fn.weibull, control = list(fnscale = -1, maxit = 500), method = "BFGS")
optim.weibull
wtd.hist(sample1, breaks = 30, freq = F, w = weights1, col= "red")
curve(dweibull(x, optim.weibull$par[1], optim.weibull$par[2]), xlim = c(0,max1), ylim = c(0,.75), add = T)
# - truncated weibull ####
dweibulltrunc <- function(x, shape, scale, lower, upper) {
dweibull(x, shape, scale=scale) / diff(pweibull(c(lower,upper), shape, scale=scale))
}
fn.weibulltrunc = function(params) {
sum(weights1*log(dweibulltrunc(sample1, shape = params[1], scale = params[2], lower = 1/days1, upper = max1)))
}
#optim.weibulltrunc = optim(par = c(1,1), fn = fn.weibulltrunc, control = list(fnscale = -1, maxit = 500), method = "BFGS")
# with upper bound on shape param:
optim.weibulltrunc = optim(par = c(1,1), fn = fn.weibulltrunc, control = list(fnscale = -1, maxit = 1000), method = "L-BFGS-B", upper = c(1,Inf))
optim.weibulltrunc
wtd.hist(sample1, breaks = 30, freq = F, w = weights1, col= "red")
curve(dweibull(x, optim.weibulltrunc$par[1], optim.weibulltrunc$par[2]), xlim = c(0,max1), ylim = c(0,.75), add = T)
# Imlied exit rates ####
int1 = 1/3
breakdown1 = pweibull(seq(int1,max1,by=int1), optim.weibulltrunc$par[1], scale = optim.weibulltrunc$par[2]) - pweibull(seq(0,max1-int1,by=int1), optim.weibulltrunc$par[1], scale = optim.weibulltrunc$par[2])
count1 = round(c(breakdown1, 1- sum(breakdown1))*pit_2017)
percent1 = c(breakdown1, 1- sum(breakdown1))
exit1 = 1 - breakdown1[-1]/(breakdown1)[-length(breakdown1)]#exit rates
exit11 = prettyNum(round(100*exit1[1], 1))
# Exponential distribution implies constant exit rates
#breakdown1 = pexp(seq(1,max1,1), optim.exptrunc$par[1]) - pexp(seq(0,max1-1,1), optim.exptrunc$par[1])
#1 - (breakdown1)[-1]/(breakdown1)[-length(breakdown1)]
# these exit rates would look more like those in the Escape Routes model
breakdown2 = pweibull(seq(int1,max1,by=int1), .7, scale = optim.weibulltrunc$par[2]) -
pweibull(seq(0,max1-int1,by=int1), .7, scale = optim.weibulltrunc$par[2])
1 - breakdown2[-1]/(breakdown2)[-length(breakdown2)]#exit rates
# What does some of the data actually suggest about exit rates?
w1 = wtd.table(shelter_17$Current_Stint_Duration_Quarters, weights = shelter_17$Weights_rescale)
barplot(w1$sum.of.weights, names.arg = w1$x)
w1 = wtd.table(shelter_17$Current_Stint_Duration_Months, weights = shelter_17$Weights_rescale)
barplot(w1$sum.of.weights, names.arg = w1$x)
```
```{r why_smooth, eval = T, echo = F, results = "hide", message = F, warning = F, cache = T, fig.cap = "\\label{fig:why_smooth}Raw and smoothed distribution of duration of homelessness (2017).", fig.height = 3, dependson = c("fit_curve")}
raw = wtd.table(as.numeric(total_17$Current_Stint_Duration_Months),
weight = total_17$Weights_rescale)$sum.of.weights[1:12]
smooth = wtd.table(ceil(sample1*3), weight = weights1/sum(weights1))$sum.of.weights
why_smooth = data.frame(Month = c(1:12,1:12),
Type = c(rep("smooth", 12), rep("raw", 12)),
Percent = c(smooth, raw/sum(raw))
)
ggplot(why_smooth, aes(x = Month, y = Percent, fill = Type, group = Type, color = Type)) +
geom_bar(stat = "identity", position = "dodge", width=.5) +
scale_x_discrete(name = "Months", limits = 1:12, labels = levels(total_17$Current_Stint_Duration_Months)[1:12]) +
theme_bw() +
theme(axis.text = element_text(size = 8))
```
## Model the distribution of durations
The next step in our modeling process is to fit a curve to the smoothed data. The class of distribution we employ to do this conveys certain assumptions about exit rates from homelessness over time.
We focus our attention on two commonly used and easily interpretable distributions used in survival analysis, the exponential distribution and the Weibull, which encompasses the exponential distribution as a special case. The ``memoryless'' property of the exponential distribution implies consistent exit rates from homelessness from month to month. In other words, the chance that a person who is homeless finds housing or leaves the area does not change over time. Although this is probably not realistic, it may be a useful conservative assumption if no additional information is available.
The Weibull distribution can capture increasing, constant or decreasing exit rates from homelessness over time depending on its parameters. The most appropriate trend in exit rates will vary by jurisdiction. We can accordingly place bounds on the parameters when optimizing based on local conditions. For example, @metraux99 note that in New York City some households stayed in a shelter until they received a subsidized housing placement, but this would not occur until at least 90 days into the episode and typically not until 9 months to a year had elapsed. This would impact exit rates in a non-intuitive way. In that case other classes of distributions or non-parametric curves may be called upon to fit the duration data.
In the absence of such programmatic considerations in Los Angeles, we believe that exit rates are likely to decrease over the course of a year. The longer one is homeless, the more likely he or she is to face health problems, social isolation, or involvement with the criminal justice system that can create a more permanent barrier to finding housing. Accordingly, we place an upper bound of one on the estimated shape parameter of the Weibull distribution, which forces exit rates to be constant or decreasing over time.
In addition, we continue to restrict our attention to durations of up to a year for the reasons given above. Accordingly, our parameters are optimized based on a truncated distribution of data between one day and less than a year. In future applications we can account for underrepresentation of very short-term homeless people in demographic survey data by further truncating the fit region to exclude very short durations. For simplicity, we have not done this here, but it may be a useful strategy for basing estimates on the most trustworthy data.
```{r plot_fit, eval = T, echo = F, warning = F, cache = T, fig.height = 3, fig.cap = "\\label{fig:plot_fit}Estimated Weibull distribution shown against the smoothed data it models (2017)."}
fit.weibulltrunc = function(x) {dweibull(x, optim.weibulltrunc$par[1], optim.weibulltrunc$par[2])}
plot_fit = ggplot(data.frame(Months = sample1, weights = weights1), aes(Months)) +
geom_histogram(position = "identity", breaks = seq(0,4,.25), fill = hue_pal()(1),
color = "black", aes(y = ..density.., weight = weights1)) +
stat_function(fun = fit.weibulltrunc) +
scale_x_continuous(name = "Months", labels = as.character(seq(0,12,3))) +
theme_bw()
plot_fit
```
Figure \ref{fig:plot_fit} plots the estimated Weibull distribution curve against the background of smoothed duration data. We found during estimation that our data did not support our hypothesis of significantly decreasing exit rates over time. Given our pre-determined parameter bounds, this resulted in fit distributions to resampled data with constant or near constant exit rates. In the plot above, the shape and scale parameters of the estimated Weibull distribution are `r as.character(round(optim.weibulltrunc$par[1], 3))` and `r as.character(round(optim.weibulltrunc$par[2], 3))`, implying a constant monthly exit rate of `r exit11[1]` percent.
## Annualized estimate
```{r annualized_estimate2, eval = T, echo = F, results = "hide", warning = F, message = F, cache = T}
# What percent have been homeless for at least a year?
persistent1 = prop.table(table(total_17$Current_Stint_Duration_Quarters =="5+"))
A + pweibull(7/days1, optim.weibulltrunc$par[1], optim.weibulltrunc$par[2])/
pweibull(max1, optim.weibulltrunc$par[1], optim.weibulltrunc$par[2])*
(pit_2017*persistent1[1])*51*(1-.5*C)
A + pweibull(1/days1, optim.weibulltrunc$par[1], optim.weibulltrunc$par[2])/
pweibull(max1, optim.weibulltrunc$par[1], optim.weibulltrunc$par[2])*
(pit_2017*persistent1[1])*364*(1-.5*C)
```
We use our fit distribution to determine the number of newly homeless individuals in a single day or week more reliably than deriving this figure from survey responses. In fact, a benefit of our approach is that it is not tied to the time intervals in which the data was gathered. We adapt the formula in Equation \eqref{eq:modified} to accept as input a daily estimate of newly homeless people, $D$, instead of the weekly value (B) used by Burt and Wilkins.
\begin{equation}
Day\text{-}based~annualized~estimate = A + 364*D*(1 - \frac{1}{2}*C) \label{eq:modified_day}.
\end{equation}
We present the estimate based on this model in Table \ref{tab:percentiles}, along with a measure of uncertainty in our estimate.
### Uncertainty
Our annualized estimate deviates from the true value for several reasons, including:
- Demographic survey respondents are a sample of the target population.
- There is natural variation in daily entries and exits from homelessness despite long-term trends.
- Demographic surveys of unsheltered people may not be representative. One factor is the challenge of surveying people who have been homeless for only a few days.
We reflect these sources of measurement error in our estimates in several ways. First, we use bootstrapping -- repeated resampling of demographic survey data -- to capture the variation due to sampling of the unsheltered population. Second, we resample from the resulting piecewise exponential fit of the duration data to capture natural variation over time. Lastly, we restrict the Weibull shape parameter to be less than one to ensure that the duration distribution curve is always decreasing. This captures the fact that we expect to find more people on their first night of homelessness on any given night than on their second, and so forth, which helps to account for undercounting. As mentioned above, we can further account for undercounting by tailoring our truncated distribution, though we did not employ that strategy here.
```{r annualized_uncertainty2, eval = T, echo = F, results = "hide", warning = F, message = F, cache = T}
# Uncertainty with bootstraping (for B), resamplng, and binomial (for C)####
# If we had the calculation formula for C we would re-calculate it from the bootstrapped samples,
# but without that we just use the binomial
set.seed(3)
data1 = total_17
time1 = "quarters"
days1 = 92
n.boot = 1000
C.sample = rbinom(n.boot, N, C)/N
results.day = 1:n.boot
results.week = 1:n.boot
results.year = 1:n.boot
for (i in 1:n.boot) {
shelter_rows = which(data1$source == "shelter")
demo_rows = which(data1$source == "unshelter")
s = sample(demo_rows, size = nrow(demographic_17), replace = T,
prob = total_17$Weights_rescale[demo_rows])
results.boot = smooth1(total_17[c(shelter_rows,s),], time1 = "quarters", days1 = 92)
sample1 = results.boot[["sample1"]]
weights1 = results.boot[["weights1"]]
max1 = results.boot[["max1"]]
days1 = results.boot[["days1"]]
optim.boot = optim(par = c(.5,.5), fn = fn.weibulltrunc, control = list(fnscale = -1, maxit = 500),
method = "L-BFGS-B", upper = c(1,Inf))
results.year[i] = A + (pit_2017*persistent1[1])*364*(1-.5*C.sample[i])*
pweibull(1/days1, optim.boot$par[1], optim.boot$par[2])/
pweibull(max1, optim.boot$par[1], optim.boot$par[2])
results.day[i] = pweibull(1/days1, optim.boot$par[1], optim.boot$par[2])/
pweibull(max1, optim.boot$par[1], optim.boot$par[2])
results.week[i] = pweibull(7/days1, optim.boot$par[1], optim.boot$par[2])/
pweibull(max1, optim.boot$par[1], optim.boot$par[2])
}
```
Applying these resampling techniques is roughly equivalent to gathering new hypothetical data sets under the same conditions that generated our observed data. Each resampled data set leads to new estimates of the annualized population, the number of new entrants to homelessness each day (D), and the PIT estimate of people whose current duration of homelessness is at most a week (B).
We repeated this resampling process `r prettyNum(n.boot, big.mark = ",")` times. Table \ref{tab:percentiles} illustrates the 5th percentile, 50th percentile (median), and 95th percentile values out of the `r prettyNum(n.boot, big.mark = ",")` estimates. The median values are used as ``point estimates,'' single values we present as estimates of the annualized population, D and B respectively. The spread between the 5th and 95th percentile values is a measure of the uncertainty in our point estimates. We find that the variation is not very large, so our estimates are fairly precise.
```{r annualized_uncertainty_summary, eval = T, warning = F, message = F, cache = T, echo = F, results = "asis", dependson = c("annualized_uncertainty2")}
print(xtable(data.frame(
"Annualized population" = prettyNum(round(quantile(results.year,
prob = c(0.05, .5,.95))), big.mark = ","),
"D" = prettyNum(round(quantile(results.day,
prob = c(0.05, .5,.95))*pit_2017*persistent1[1]), big.mark = ","),
"B" = prettyNum(round(quantile(results.week,
prob = c(0.05, .5,.95))*pit_2017*persistent1[1]), big.mark = ",")),
caption = "Percentiles of repeated estimates of the annualized population, D and B.",
label = "tab:percentiles"),
sanitize.colnames.function=function(x) gsub("\\."," ",x), comment = F, caption.placement = "top")
day1 = prettyNum(round(quantile(results.day, prob = c(.5))*pit_2017*persistent1[1]), big.mark = ",")
week1 = prettyNum(round(quantile(results.week, prob = c(.5))*pit_2017*persistent1[1]), big.mark = ",")
annualized1 = prettyNum(round(quantile(results.year, prob = c(.5))), big.mark = ",")
D7 = prettyNum(as.numeric(round( 7*quantile(results.day, prob = c(.5))*pit_2017*persistent1[1])), big.mark = ",")
undercount1 = prettyNum(as.numeric(round( 7*quantile(results.day, prob = c(.5))*pit_2017*persistent1[1] -
quantile(results.week, prob = c(.5))*pit_2017*persistent1[1])), big.mark = ",")
larger1 = prettyNum(round(quantile(results.year, prob = c(.5))-107094), big.mark = ",")
```
Our annualized population estimate of `r annualized1` shown in Table \ref{tab:percentiles} differs from LAHSA's annualized estimate for 2017 by `r larger1`, and is well below LAHSA's three previously reported annualized estimates (see Table \ref{tab:annualized_estimate_table}). We estimate B to be `r week1` and D to be `r day1`. Multiplying our estimate of D by seven to make it a weekly estimate produces a value that is `r undercount1` higher than our estimate of B. This implies that using the snapshot measure B as a proxy for the number of new entrants to homelessness over the course of a week undercounts very short-term homeless people by `r undercount1` people per week. This was described earlier as a shortcoming of the Burt and Wilkins model, and our methodology allows us to quantify its impact.
## Annualized distribution of duration of homelessness
Modeling current stint durations not only allows us to estimate the annualized population. We can also estimate how long those in the annualized population have experienced homelessness. Below we compare the distribution of duration of homelessness in PIT data, i.e. the length of the stint when surveyed, to the distribution from annualized data, where duration can include PIT stint length and time spent homeless during the year.
```{r annualized_distribution, eval = T, cache = T, message = F, warning = F, echo = F, results = "hide"}
periods1 = 365 #12 if by month
max.stints = 2 #For simplicity, currently 1-stint vs. multi. Could further track by stint number.
params1 = optim.weibulltrunc$par
# Proportions of times homeless (used C instead) ####
#times1 = (unsheltered_2017*p1[1:3]/sum(p1[1:3]) + (pit_2017-unsheltered_2017)*p3[1:3]/sum(p3[1:3])*ratio1[1:3])/pit_2017
#times1 = times1/sum(times1)
#C1 = c(times1[1],times1[2]/2,times1[2]/2, times1[3]/3,times1[3]/3, times1[3]/3)
#round(100*C1, 1)
# Breakdown of PIT population by current stint duration ####
int1 = max1/periods1
breakdown3 = pweibull(seq(int1,max1+int1,by=int1), params1[1], scale = params1[2]) -
pweibull(seq(0,max1,by=int1), params1[1], scale = params1[2]) #add an extra at the end for final exit rate
exit3 = 1 - breakdown3[-1]/(breakdown3)[-length(breakdown3)]#exit rate
breakdown3 = breakdown3[-length(breakdown3)] #then remove the extra
breakdown3 = c(persistent1[1]*breakdown3/sum(breakdown3), persistent1[2]) #scale for >year pop
# Transition matrix: ####
n1 = periods1
n2 = n1+1
transition1 = matrix(0, n2*max.stints*2, n2*max.stints*2) # index represents total duration (order in set of n2) and if single or multi stint (order/n2)
stages1 = data.frame(matrix(1:(n2*2*max.stints), nrow = n2, ncol = 2*max.stints))
names(stages1) = c("in1", "out1", "in2", "out2") #if more specifically examine second and third stints later, "in3", "out3", "in4", "out4", "in5", "out5", "in6", "out6")
out_indices = unlist(stages1[,c(str_detect(names(stages1), "out"))])
#first stint in year, more time homeless:
transition1[cbind(stages1$in1, c(1 + stages1$in1[-n2], stages1$in1[n2]))] = c(1-exit3, 1)
#multiple stints in year, more time homeless:
transition1[cbind(stages1$in2, c(1 + stages1$in2[-n2], stages1$in2[n2]))] = c(1-exit3, 1)
#exits first stint:
transition1[cbind(stages1$in1, stages1$out1)] = c(exit3, 0) #never exit persistent group
#exits additional stint:
transition1[cbind(stages1$in2, stages1$out2)] = c(exit3, 0)
#begins additional stint - will be updated in function:
reentry.t = c(0)
transition1[cbind(stages1$out1, stages1$in2+1)] = reentry.t
transition1[cbind(stages1$out2, stages1$in2+1)] = reentry.t
#stays exited:
transition1[cbind(stages1$out1, stages1$out1)] = c(rep(1 - reentry.t, periods1), 0)
transition1[cbind(stages1$out2, stages1$out2)] = c(rep(1 - reentry.t, periods1), 0)
# Progression function ####
# See what happens over time (T)
cohort_progression <- function(breakdown.start, transition, t.start, t.end = 12, C) {
n2 = length(breakdown.start)
n1 - n2-1
max.stints = nrow(transition)/n2
dist1 = c(breakdown.start*pit_2017, rep(0, nrow(transition) - n2)) # T = 1
new_stint_each_period = pit_2017*breakdown.start[1]
for (T in t.start+1:t.end) {
dist1 = dist1%*%transition #state of cohort after a period
c1 = (T-2)*C/n1 #percent with previous stint in year (current model assumes a period has to spent away before coming back in. This simplifies and works best for to day-long (365-period) version.)
#number needed to enter each cohort as REPEATS =
new_to_pop = c(new_stint_each_period*(1-c1), rep(0, nrow(transition) - 1))
dist1 = dist1 + new_to_pop
#print(sum(dist1))
c1 = (T-1)*C/n1 #next
#begins additional stint -
reentry.t = c1*new_stint_each_period/sum(dist1[out_indices])
transition[cbind(stages1$out1, stages1$in2+1)] = reentry.t
transition[cbind(stages1$out2, stages1$in2+1)] = reentry.t
#stays exited:
transition[cbind(stages1$out1, stages1$out1)] = c(rep(1 - reentry.t, periods1), 0)
transition[cbind(stages1$out2, stages1$out2)] = c(rep(1 - reentry.t, periods1), 0)
}
return(dist1)
}
# note that this function creates a lag in re-entry, and therefore a slight discrepancy with the Burt and Wilkins model (where we say that in the second week of the 52-week period the proportion of people previously counted would be $\frac{C}{52}. The way this function operates that would be true of the third week so that person has to join the exited group before rejoining the homeless group.) As long as the period is daily though (periods1 = 365) this is a very slight discrepancy, and the lag is necessary -- you must exit for at least a day to have a second stint.
# Execute cohort progression function ####
results1 = cohort_progression(breakdown3, transition1, 1, periods1, C)
```
```{r annualized_distribution_results, eval = T, cache = T, message = F, warning = F, echo = F, results= "asis", dependson = c("annualized_distribution")}
ANNUAL = matrix(results1, ncol = 4, nrow = n2)
ANNUAL = data.frame(ANNUAL)
if(nrow(ANNUAL) == 366) {
ANNUAL$month = c(month(seq(as.Date("01-01-2017", format= c("%m-%d-%Y")),
as.Date("12-31-2017", format= c("%m-%d-%Y")), by = "day")),13)
ANNUAL = aggregate(ANNUAL, by=list(ANNUAL$month), FUN=sum)[,2:5]}
ANNUAL$"Annualized total" = rowSums(ANNUAL)
ANNUAL$"Annualized percent" = ANNUAL$"Annualized total"/sum(ANNUAL[,"Annualized total"])
# We use the original data to show the PIT pop since the results1 PIT estimate has accrued 12+ month homeless people
breakdown4 = c(breakdown1[1:12]/sum(breakdown1[1:12])*persistent1[1], persistent1[2])
tmp = cbind(round(breakdown4*pit_2017), paste0(round(100*breakdown4),"%"),
round(ANNUAL[,5]), paste0(100*round(ANNUAL[,6],2),"%"))
tmp = apply(tmp, 2, FUN = prettyNum, big.mark = ",")
tmp = rbind(tmp, c(prettyNum(round(sum(breakdown4*pit_2017)), big.mark = ","), NA,
prettyNum(round(sum(ANNUAL[,5])), big.mark = ","), NA))
tmp = cbind(c("up to 1", "1-2", "2-3", "3-4", "4-5", "5-6", "6-7",
"7-8", "8-9", "9-10", "10-11", "11-12", "12+", "total"), tmp)
colnames(tmp) = c("Duration (months)", "PIT total", "PIT percent", "Annualized total", "Annualized percent")
print.xtable(xtable(tmp, caption = "PIT versus annualized distribution of duration of homelessness based on one simulation (2017). Duration for the annualized population may occur over several stints.", label = "tab:annualized_breakdown"),
sanitize.colnames.function=function(x) gsub("\\."," ",x),
include.rownames = F, include.colnames = T, comment = F, caption.placement = "top",
add.to.row = list(pos=list(13), command="\\hline \n"))
```
Table \ref{tab:annualized_breakdown} shows that those who have been homeless for at least a year make up just over a third of the annualized population, as opposed to over half of the PIT population. If the shape parameter of the estimated Weibull distribution were estimated to be less than one, implying decreasing exit rates over time, we would find this pattern to be even more stark, and the annualized population to be significantly larger.
# Discussion
In this report we have presented a model-based framework for annualized population estimation that builds on the model of Burt and Wilkins currently in use. Our method is more stable, more flexible for use with different data sources, and provides greater information about durations of homelessness experienced over the course of a year.
## Practical implications
Reliable estimates of time spent homeless during a year support an evidence-based intervention framework. Understanding the prevalence of short-term versus persistent episodes of homelessness enables accurate allocation of resources based on differing levels of need. The prevailing approach to homelessness prevention and intervention is ``progressive engagement.'' This entails treating new entrants into homelessness with a light touch and providing progressively more intensive services as needed to help individuals reestablish themselves in stable housing. The most intensive and costly intervention, permanent supportive housing, is reserved for chronically homeless individuals for whom other interventions have failed. A population model, as shown in Figure 2, can be used to plan, budget and implement the range of services needed to help individuals with differing barriers to exiting homelessness.
## Recommendations
[^2]: The code and data used to create this report are available at the Economic Roundtable github repository: \url{https://github.com/economicroundtable/homeless_LA/}.
We recommend that the estimation method presented in this report be adopted [^2]. Alternatively, if that is not feasible, based on the findings above we recommend several improvements to the Burt and Wilkins model that can easily be implemented:
- The formula adjustment shown in Equation \eqref{eq:modified}.
- Use averages of B and C over the last two to three years instead of a single-year's estimate, applying the most up-to-date method for calculating each. This will reduce year-to-year fluctuations, and is appropriate since we do not expect major changes in either from one year to the next. (An exception might be if shelter or housing policy changes significantly.)
- Publish the formulas and data sources used to calculate B and C, noting which survey variables were used. This makes the causes of year-to-year fluctuations more transparent.
- Present annualized estimates with measures of uncertainty, even if they rely on simplifying assumptions. This can be accomplished using bootstrapping or the binomial distribution.
- Ensure that the information needed to calculate the variables in the model is gathered for both sheltered and unsheltered people. For example, both surveys should ask about the number of times one has been homeless in the last year, the total amount of time spent homeless, and whether the individual was in Los Angeles during that time. Currently, the intake questionnaire for sheltered people does not ask about the number of times an individual was homeless in the previous year, but only in the previous three years. To compensate, analysts "adjust the percentage homeless more than once in the past three years for the shelter population by the ratio of the one year to three year measure taken from the data in the youth and adult demographic surveys" [@methodology17]. There are significant reported differences between the sheltered and unsheltered populations which suggest this introduces bias. For example, in 2017 the vast majority of unsheltered homeless people reported being homeless only one time in the past three years, whereas for the sheltered population this figure was similar to the number reporting multiple instances of homelessness. We refer the reader to explore this and other differences through our interactive data visualization tool on the Economic Roundtable website, \url{https://economicrt.org/los-angeles-homelessness-data-dive-app/}.
## Limitations and future work
Despite the advancements our method offers, there are many ways it can be improved upon in future work.
- The calculation of C could be adapted to specifically describe people whose current stint of homelessness has lasted a week or less. For simplicity and comparability with previous estimates we employed the estimate of C used by LAHSA. However, it seems to be biased. It is used to describe the likelihood that recently homeless individuals have already experienced homelessness during the annual period, yet it is based on the probability that anyone surveyed experienced homelessness during that time. Of course, those who have been homeless for a long time have no chance of a previous instance of homelessness during the year.
- Address seasonality of new entrants to homelessness. While it is somewhat reasonable to ignore seasonality in Los Angeles where outdoor sleeping is possible for nearly all of the year, many jurisdictions exhibit seasonality in the number of new entrants to homelessness and duration of stints of homelessness. To accurately extend our model to these jurisdictions would require knowledge of season effects in this context.
- Use our model framework to estimate annualized sub-populations. LAHSA provides annualized estimates for many sub-groups, and it would be interesting to apply our methodology to these groups and provide measures of uncertainty, especially for small sub-groups.
Finally, although we have attempted to correct for or measure the causes of error in our annualized estimate, there are deeper sources of error that we can do little about. Our annualized estimate depends on the PIT count, which is also subject to error. For example, the count is conducted by volunteers who may miss or miscount individuals. The multipliers used to count individuals in tents and vehicles are imperfect approximations. These issues are addressed in greater depth in @flaming17, and @smith18, for example. They affect both the @burtwilkins05 model and ours and are beyond the scope of this report.
# Acknowledgements
The estimation method presented in this report builds on the work of many volunteers who participated in the data dive that Economic Roundtable held with DataKind in July, 2018. At this event, over 40 data scientists in Los Angeles and San Francisco applied their skills to analyze Los Angeles Homeless Count data. We especially want to thank the members of the team that worked on annualized population size estimation: Harry Brisson, Po-Nien Chiang, Johanna He, Geovani Montoya, Bradley Rava, Mark Rodighiero, Giovanni Rosati, Sameer Soi, Fishu Wu, David Ami Wulf, and Tian Yang.
We are also indebted to the volunteer organizers with DataKind who made the data dive possible: Wade Fuller, William High, Abhishek Kapatker, Sonali Murarka, Stacey Ronaghan, and Haitham Shammaa.
# References