Title: | List Balancing for Reweighting and Population Synthesis |
---|---|
Description: | Performs iterative proportional updating given a seed table and an arbitrary number of marginal distributions. This is commonly used in population synthesis, survey raking, matrix rebalancing, and other applications. For example, a household survey may be weighted to match the known distribution of households by size from the census. An origin/destination trip matrix might be balanced to match traffic counts. The approach used by this package is based on a paper from Arizona State University (Ye, Xin, et al. (2009) <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf>). Some enhancements have been made to their work including primary and secondary target balance/importance, general marginal agreement, and weight restriction. |
Authors: | Kyle Ward [aut, cre, cph], Greg Macfarlane [ctb] |
Maintainer: | Kyle Ward <[email protected]> |
License: | Apache License (== 2.0) |
Version: | 1.0.2 |
Built: | 2024-11-06 03:26:52 UTC |
Source: | https://github.com/dkyleward/ipfr |
The main function is ipu. For a 2D/matrix problem, the ipu_matrix function is easier to use. The resulting weight_tbl from ipu() can be fed into synthesize to generate a synthetic population.
Maintainer: Kyle Ward [email protected] [copyright holder]
Other contributors:
Greg Macfarlane [email protected] [contributor]
A general case of iterative proportional fitting. It can satisfy two disparate sets of marginals that do not agree on a single total. A common example is balancing population data using household- and person-level marginal controls. This could be for survey expansion or synthetic population creation. The second set of marginal/seed data is optional, meaning it can also be used for more basic IPF tasks.
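The household- and person-level idea can be sketched in base R. This is a minimal illustration of the general updating technique as commonly described for iterative proportional updating, not the package's internal code. The incidence table, targets, seed weights, and the helper name ipu_sweep are all made up for the example.

```r
# Made-up incidence table: one row per household.
# Household-level columns flag the household type; person-level columns
# count the persons of each type living in that household.
incidence <- rbind(
  hh1 = c(small = 1, large = 0, child = 0, adult = 1),
  hh2 = c(small = 1, large = 0, child = 1, adult = 1),
  hh3 = c(small = 0, large = 1, child = 2, adult = 2),
  hh4 = c(small = 0, large = 1, child = 1, adult = 3)
)

# Made-up controls: the first two are household totals, the last two are
# person totals, so the two sets do not share a single grand total.
targets <- c(small = 60, large = 40, child = 90, adult = 160)

# Made-up seed weights, one per household.
weights <- c(hh1 = 1, hh2 = 2, hh3 = 1, hh4 = 1)

# Hypothetical helper: one pass over every control. Each step rescales the
# households that contribute to a control so that control is met exactly,
# which may disturb the controls adjusted earlier in the pass.
ipu_sweep <- function(weights, incidence, targets) {
  for (control in names(targets)) {
    counts <- incidence[, control]
    current <- sum(weights * counts)
    affected <- counts > 0
    weights[affected] <- weights[affected] * targets[[control]] / current
  }
  weights
}

for (i in 1:50) {
  weights <- ipu_sweep(weights, incidence, targets)
}

# Weighted totals per control, for comparison against the targets.
fitted <- colSums(weights * incidence)
print(rbind(target = targets, fitted = fitted))
```

Only the control adjusted last in each pass is guaranteed to match exactly; how closely the others match depends on the data.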
ipu(
  primary_seed,
  primary_targets,
  secondary_seed = NULL,
  secondary_targets = NULL,
  primary_id = "id",
  secondary_importance = 1,
  relative_gap = 0.01,
  max_iterations = 100,
  absolute_diff = 10,
  weight_floor = 1e-05,
  verbose = FALSE,
  max_ratio = 10000,
  min_ratio = 1e-04
)
primary_seed |
In population synthesis or household survey expansion, this would be the household seed table (each record would represent a household). It could also be a trip table, where each row represents an origin-destination pair. |
primary_targets |
A |
secondary_seed |
Most commonly, if the primary_seed describes
households, the secondary seed table would describe the persons in each
household. Must contain the same |
secondary_targets |
Same format as |
primary_id |
The field used to join the primary and secondary seed
tables. Only necessary if |
secondary_importance |
A |
relative_gap |
After each iteration, the weights are compared to the previous weights and the |
max_iterations |
maximum number of iterations to perform, even if
|
absolute_diff |
Upon completion, the … For example, if a target value was 2 and the expanded weights equaled 1, that's a 100% difference, but the absolute difference is only 1. Defaults to 10. |
weight_floor |
Minimum weight to allow in any cell to prevent zero weights. Defaults to 1e-05. Should be arbitrarily small compared to your seed table weights. |
verbose |
Print iteration details and worst marginal stats upon
completion? Default |
max_ratio |
|
min_ratio |
|
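How relative_gap, max_iterations, and weight_floor might interact in an iteration loop can be sketched as follows. This is an assumption-laden illustration: the change measure (rel_rmse, a hypothetical helper) and the stand-in update step are not taken from the package, which may compare weights differently.

```r
# Hypothetical change measure: root-mean-square change between two weight
# vectors, relative to the mean of the previous weights.
rel_rmse <- function(new, old) {
  sqrt(mean((new - old)^2)) / mean(old)
}

weights <- c(1, 1, 1, 1)   # made-up seed weights
target_total <- 100        # made-up single control total

relative_gap <- 0.01
max_iterations <- 100
weight_floor <- 1e-05

for (iter in seq_len(max_iterations)) {
  previous <- weights
  # Stand-in update: scale all weights to hit the control total.
  weights <- weights * target_total / sum(weights)
  # Keep every weight at or above the floor so none collapses to zero.
  weights <- pmax(weights, weight_floor)
  # Stop early once the weights have essentially stopped changing.
  if (rel_rmse(weights, previous) < relative_gap) break
}

cat("stopped after", iter, "iterations\n")
```

The loop ends either when the change falls below relative_gap or when max_iterations is reached, whichever comes first.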
A named list with the primary_seed with weight, a histogram of the weight distribution, and two comparison tables to aid in reporting.
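A comparison of weighted results against targets can be built by hand along the following lines. The column names and data here are hypothetical and are not the layout of the tables that ipu() returns. The sketch also assumes that absolute_diff acts as a threshold on the absolute difference, below which a large percent difference is not treated as important, as the example in its description suggests.

```r
# Made-up weighted seed table for a single geography.
weighted <- data.frame(
  id = 1:4,
  siz = c(1, 2, 2, 1),
  weight = c(40, 25, 25, 35)
)

# Made-up targets by household size category.
targets <- c(`1` = 75, `2` = 25)

# Sum the weights within each size category.
result <- tapply(weighted$weight, weighted$siz, sum)

comparison <- data.frame(
  category = names(targets),
  target = unname(targets),
  result = as.numeric(result[names(targets)]),
  stringsAsFactors = FALSE
)
comparison$diff <- comparison$result - comparison$target
comparison$pct_diff <- 100 * comparison$diff / comparison$target

# Assumed use of the threshold: only flag categories whose absolute
# difference exceeds it.
absolute_diff <- 10
comparison$flagged <- abs(comparison$diff) > absolute_diff

print(comparison)
```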
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf
hh_seed <- dplyr::tibble(
  id = c(1, 2, 3, 4),
  siz = c(1, 2, 2, 1),
  weight = c(1, 1, 1, 1),
  geo_cluster = c(1, 1, 2, 2)
)
hh_targets <- list()
hh_targets$siz <- dplyr::tibble(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150)
)
result <- ipu(hh_seed, hh_targets, max_iterations = 5)
This function simplifies the call to 'ipu()' for the simple case of a matrix and row/column targets.
ipu_matrix(mtx, row_targets, column_targets, ...)
mtx |
a |
row_targets |
a vector of targets that the row sums must match |
column_targets |
a vector of targets that the column sums must match |
... |
additional arguments that are passed to 'ipu()'. See
|
A matrix that matches row and column targets.
mtx <- matrix(data = runif(9), nrow = 3, ncol = 3)
row_targets <- c(3, 4, 5)
column_targets <- c(5, 4, 3)
ipu_matrix(mtx, row_targets, column_targets)
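For intuition about what balancing a matrix to row and column targets involves, classic two-dimensional iterative proportional fitting can be written in a few lines of base R. This is a sketch of the general technique with a made-up seed matrix. It does not call ipu_matrix() and is not a description of how that function is implemented.

```r
# Made-up positive seed matrix; the targets reuse the values from the
# example above and share the same grand total (12).
seed <- matrix(1:9, nrow = 3, ncol = 3)
row_targets <- c(3, 4, 5)
column_targets <- c(5, 4, 3)

balanced <- seed
for (iter in 1:100) {
  # Scale each row so the row sums match the row targets.
  balanced <- balanced * (row_targets / rowSums(balanced))
  # Scale each column so the column sums match the column targets.
  balanced <- sweep(balanced, 2, column_targets / colSums(balanced), "*")
  # Column sums are now exact; stop once the row sums are close as well.
  if (max(abs(rowSums(balanced) - row_targets)) < 1e-8) break
}

print(round(balanced, 4))
print(rowSums(balanced))
print(colSums(balanced))
```

Because the row and column targets agree on a single total, the alternating scaling settles on a matrix that satisfies both sets.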
Sets up the Arizona example IPU problem and is used in multiple places throughout the package (vignettes/tests).
setup_arizona()
A list of four variables: hh_seed, hh_targets, per_seed, and per_targets. These can be used directly by ipu.
setup_arizona()
A simple function that takes the weight_tbl output from ipu and randomly samples based on the weight.
synthesize(weight_tbl, group_by = NULL, primary_id = "id")
weight_tbl |
the |
group_by |
if provided, the |
primary_id |
The field used to join the primary and secondary seed
tables. Only necessary if |
A data.frame with one record for each synthesized member of the population (e.g. household). A new_id column is created, but the previous primary_id column is maintained to facilitate joining back to other data sources (e.g. a person attribute table).
hh_seed <- dplyr::tibble(
  id = c(1, 2, 3, 4),
  siz = c(1, 2, 2, 1),
  weight = c(1, 1, 1, 1),
  geo_cluster = c(1, 1, 2, 2)
)
hh_targets <- list()
hh_targets$siz <- dplyr::tibble(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150)
)
result <- ipu(hh_seed, hh_targets, max_iterations = 5)
synthesize(result$weight_tbl, "geo_cluster")
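The idea of sampling records in proportion to their weights can be illustrated with base R's sample(). This is a rough sketch with made-up weights. Choosing the number of draws as the rounded sum of the weights is an assumption made for the example, not a statement of how synthesize() sizes its output.

```r
set.seed(42)  # fixed seed so the draw is repeatable

# Made-up weight table standing in for the weight_tbl from ipu().
weight_tbl <- data.frame(
  id = c(1, 2, 3, 4),
  siz = c(1, 2, 2, 1),
  weight = c(40.2, 9.8, 30.5, 19.5)
)

# Assumption: synthesize as many records as the weights sum to.
n <- round(sum(weight_tbl$weight))

# Draw row indices with replacement, with probability proportional to weight.
picked <- sample(
  seq_len(nrow(weight_tbl)),
  size = n,
  replace = TRUE,
  prob = weight_tbl$weight
)

# Keep the original id for joining back to other tables and add a new,
# unique id for each synthesized record.
synthetic <- weight_tbl[picked, c("id", "siz")]
synthetic <- cbind(new_id = seq_len(n), synthetic)
rownames(synthetic) <- NULL

print(head(synthetic))
print(table(synthetic$id))
```

Keeping the original id alongside new_id mirrors the return value described above, where the previous primary_id column is retained for joining.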