Iterative proportional fitting (IPF) is used in many disciplines to adjust an initial set of weights to match various marginal distributions. This package implements the iterative proportional updating algorithm based on the paper from Arizona State University (IPU). In survey raking or population synthesis, the IPU algorithm has the added advantage of being able to match household- and person-level marginal targets.
The primary functions of the ipfr
package are discussed
below.
ipu_matrix()
ipu()
synthesize()
The first example shows how to use ipfr
to solve one of
the most common IPF problems: Given a matrix of starting weights and a
set of row and column targets, balance the matrix in order to satisfy
both row and column targets.
set.seed(42)
mtx <- matrix(data = runif(9), nrow = 3, ncol = 3)
row_targets <- c(3, 4, 5)
column_targets <- c(5, 4, 3)
mtx
#> [,1] [,2] [,3]
#> [1,] 0.9148060 0.8304476 0.7365883
#> [2,] 0.9370754 0.6417455 0.1346666
#> [3,] 0.2861395 0.5190959 0.6569923
The resulting matrix satisfies both row and column targets while retaining a similar relative distribution of weights in each cell.
The matrix example is common and conceptually simple, but makes thinking about higher-dimension problems difficult. A much better way to think about the problem (whether in two or ten dimensions) is as a table.
This example creates a household survey that will act as the seed
(like the matrix from the last example). The number of persons and autos
for each household is listed in the size
and
autos
columns. A starting weight of 1 is also included.
survey <- tibble(
size = c(1, 2, 1, 1),
autos = c(0, 2, 2, 1),
weight = 1
)
survey
#> # A tibble: 4 × 3
#> size autos weight
#> <dbl> <dbl> <dbl>
#> 1 1 0 1
#> 2 2 2 1
#> 3 1 2 1
#> 4 1 1 1
Assume that we know the total number of households by size and
households by auto-ownership from the Census. These target marginal
distributions are just like our row and column totals from the previous
example. Targets are passed into the ipu
function as shown
below. Take the size targets, for example:
size
corresponds to the size
column in the survey.1
and 2
correspond to the
unique values in the survey’s size
column.targets <- list()
targets$size <- tibble(
`1` = 75,
`2` = 25
)
targets$autos <- tibble(
`0` = 25,
`1` = 50,
`2` = 25
)
targets
#> $size
#> # A tibble: 1 × 2
#> `1` `2`
#> <dbl> <dbl>
#> 1 75 25
#>
#> $autos
#> # A tibble: 1 × 3
#> `0` `1` `2`
#> <dbl> <dbl> <dbl>
#> 1 25 50 25
A call to ipu()
returns a named list of results.
The first element is the resulting weight table. The primary seed has three columns appended:
result$weight_tbl
#> # A tibble: 4 × 6
#> size autos weight id avg_weight weight_factor
#> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 0 25 1 25 1
#> 2 2 2 24.7 2 25 0.990
#> 3 1 2 0.255 3 25 0.0102
#> 4 1 1 50 4 25 2
The second element is a histogram of the weight_factor
.
This provides a quick overview of the distribution of weights. Many
large or small weights can indicate underlying problems even if the ipu
converged and matches targets. For instance, a record with a weight
factor of 50 means that it is greatly over-represented in the
re-weighted survey.
result$weight_dist
#> Warning: Use of `primary_seed$weight_factor` is discouraged.
#> ℹ Use `weight_factor` instead.
The next element is a comparison back to the targets provided. With complex seed and target tables, this makes investigating results quick and easy.
result$primary_comp
#> # A tibble: 5 × 6
#> geography category target result diff pct_diff
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 geo_all autos_0 25 25 0 0
#> 2 geo_all autos_1 50 50 0 0
#> 3 geo_all autos_2 25 25 0 0
#> 4 geo_all size_1 75 75.3 0.26 0.34
#> 5 geo_all size_2 25 24.7 -0.26 -1.02
If secondary targets are provided to ipu()
, a fourth
item in the list will contain a secondary_comp
table.
In household survey expansion, it is common to want to control for
certain features that describe households (like size) while controlling
for other attributes that describe people (like age). This is possible
with the ipu()
function.
This example is taken directly from the Arizona paper on page 20: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf
In this example, household type could represent size (e.g. 1-person and 2-person households). Person type could represent age groups (e.g. under 18, between 18 and 50, and over 50).
setup_arizona()
creates the seed and target tables used
in the example.
result <- setup_arizona()
hh_seed <- result$hh_seed
hh_targets <- result$hh_targets
per_seed <- result$per_seed
per_targets <- result$per_targets
primary_seed
primary_target
secondary_seed
secondary_target
In the interest of keeping vignette build time short, the
ipu()
algorithm is only run for 30 iterations. After
running for 400 or more iterations, the results match closely to those
shown in the paper.
The comparison table for the secondary (person) marginals is now
included in result
. Feel free to run the code chunk above
for 400 or more iterations and then look again.
result$weight_tbl
#> # A tibble: 8 × 5
#> id hhtype weight avg_weight weight_factor
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3.77 12.5 0.301
#> 2 2 1 23.0 12.5 1.84
#> 3 3 1 6.50 12.5 0.520
#> 4 4 2 25.7 12.5 2.06
#> 5 5 2 17.4 12.5 1.39
#> 6 6 2 7.26 12.5 0.581
#> 7 7 2 4.21 12.5 0.337
#> 8 8 2 7.26 12.5 0.581
result$primary_comp
#> # A tibble: 2 × 6
#> geography category target result diff pct_diff
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 geo_all hhtype_1 35 33.3 -1.73 -4.94
#> 2 geo_all hhtype_2 65 61.9 -3.15 -4.84
result$secondary_comp
#> # A tibble: 3 × 6
#> geography category target result diff pct_diff
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 geo_all pertype_1 91 88.4 -2.58 -2.83
#> 2 geo_all pertype_2 65 63.8 -1.16 -1.79
#> 3 geo_all pertype_3 104 104 0 0
ipu()
allows different geographies to be specified for
different marginal tables. There are a few rules that make this
possible, but in short, a geography field in each target table tells the
algorithm which scale to constrain to. If no geography field is present
in a target table, it applies to the entire seed table
(i.e. region).
All of the following rules are checked by the function and a warning message will show which (if any) is violated.
To demonstrate, the Arizona example is modified to add two different tracts for household controls but to still control the person targets at the regional level.
# Repeat the hh_seed to create tract 1 and 2 households
new_hh_seed <- hh_seed %>%
mutate(geo_tract = 1)
new_hh_seed <- bind_rows(
new_hh_seed,
new_hh_seed %>%
mutate(geo_tract = 2, id = id + 8)
)
new_hh_seed$geo_region <- 1
new_hh_seed
#> # A tibble: 16 × 4
#> id hhtype geo_tract geo_region
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1
#> 2 2 1 1 1
#> 3 3 1 1 1
#> 4 4 2 1 1
#> 5 5 2 1 1
#> 6 6 2 1 1
#> 7 7 2 1 1
#> 8 8 2 1 1
#> 9 9 1 2 1
#> 10 10 1 2 1
#> 11 11 1 2 1
#> 12 12 2 2 1
#> 13 13 2 2 1
#> 14 14 2 2 1
#> 15 15 2 2 1
#> 16 16 2 2 1
# Repeat the household targets for two tracts
new_hh_targets <- hh_targets
new_hh_targets$hhtype <- bind_rows(hh_targets$hhtype, hh_targets$hhtype)
new_hh_targets$hhtype <- new_hh_targets$hhtype %>%
mutate(geo_tract = c(1, 2))
new_hh_targets$hhtype
#> # A tibble: 2 × 3
#> `1` `2` geo_tract
#> <dbl> <dbl> <dbl>
#> 1 35 65 1
#> 2 35 65 2
# Repeat the per_seed to create tract 1 and 2 persons
new_per_seed <- bind_rows(
per_seed,
per_seed %>%
mutate(id = id + 8)
)
new_per_seed
#> # A tibble: 46 × 2
#> id pertype
#> <dbl> <dbl>
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> 4 2 1
#> 5 2 3
#> 6 3 1
#> 7 3 1
#> 8 3 2
#> 9 4 1
#> 10 4 3
#> # ℹ 36 more rows
# Double the regional person targets
new_per_targets <- per_targets
new_per_targets$pertype <- per_targets$pertype %>%
mutate_all(list(~. * 2)) %>%
mutate(geo_region = 1)
new_per_targets$pertype
#> # A tibble: 1 × 4
#> `1` `2` `3` geo_region
#> <dbl> <dbl> <dbl> <dbl>
#> 1 182 130 208 1
Run the IPU algorithm. Again, for vignette build time, only 30
iterations are performed. Run the code yourself with
max_iterations
set to 600 to see the converged result.
The tables below show the results compared back to targets.
result$primary_comp
#> # A tibble: 4 × 6
#> geography category target result diff pct_diff
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 geo_tract_1 hhtype_1 35 33.3 -1.73 -4.94
#> 2 geo_tract_1 hhtype_2 65 61.9 -3.15 -4.84
#> 3 geo_tract_2 hhtype_1 35 33.3 -1.73 -4.94
#> 4 geo_tract_2 hhtype_2 65 61.9 -3.15 -4.84
result$secondary_comp
#> # A tibble: 3 × 6
#> geography category target result diff pct_diff
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 geo_region_1 pertype_1 182 177. -5.16 -2.83
#> 2 geo_region_1 pertype_2 130 128. -2.33 -1.79
#> 3 geo_region_1 pertype_3 208 208 0 0
ipu
offers some advanced options for controlling the
behavior. Weights can be capped with min_ratio
and
max_ratio
options. Secondary target importance can be
adjusted using the secondary_importance
option. For info on
using these options appropriately, see the
common_ipf_problems
vignette.
The weight_tbl
from ipu()
can be fed into
the synthesize()
function to generate a synthetic
population. The code block below takes the output from the previous
example to demonstrate. The group_by
option tells the
function to group by the geo_tract
field before random
sampling takes places. This will ensure that the number of sampled
records in each tract equals the total weight in each tract.
A new_id
is created for each synthetic member (usually a
household), but the old ID is preserved to facilitate joins or other
operations where it could be useful.
set.seed(42)
synthesize(result$weight_tbl, group_by = "geo_tract") %>%
head()
#> # A tibble: 6 × 5
#> new_id id hhtype geo_tract geo_region
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3 1 1 1
#> 2 2 7 2 1 1
#> 3 3 2 1 1 1
#> 4 4 8 2 1 1
#> 5 5 5 2 1 1
#> 6 6 5 2 1 1
A tidyverse-style call also works, as shown below.