Title: | Reverse Engineering Summarized Data |
---|---|
Description: | Decoupled (e.g. separate averages) and censored (e.g. > 100 species) variables are continually reported by many well-established organizations (e.g. World Health Organization (WHO), Centers for Disease Control and Prevention (CDC), World Bank, and various national censuses). The challenge therefore is to infer what the original data could have been given summarized information. We present an R package that reverse engineers decoupled and/or censored count data with two main functions. The cnbinom.pars() function estimates the average and dispersion parameter of a censored univariate frequency table. The rec() function reverse engineers summarized data into an uncensored bivariate table of probabilities. |
Authors: | Samantha Duchscherer [aut, cre], UT-Battelle, LLC [cph] |
Maintainer: | Samantha Duchscherer <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.3 |
Built: | 2024-11-12 04:25:43 UTC |
Source: | https://github.com/gshs-ornl/revengc |
A univariate censored frequency table is fit to a negative binomial distribution using a likelihood function customized to handle left, right, and interval censored data. The output is a list containing the average and dispersion parameter that maximizes the custom function.
cnbinom.pars(censoredtable)
cnbinom.pars(censoredtable)
censoredtable |
A frequency table. A data.frame and matrix are acceptable classes. See Details section for formatting. |
The censored table for the cnbinom.pars function has restrictions. The univariate frequency table, which can be a data.frame or matrix class, must have two columns and n number of rows. The categories must be in the first column with frequencies or probabilities in the second column. Row names should never be placed in this table (the default row names should always be 1:n). Column names can be any character string. The only symbols accepted for censored data are listed below. Note, less than or equal to (<= and LE) is not equivalent to less than (< and L) and greater than or equal to (>=, +, and GE) is not equivalent to greater than (> and G). Also, calculations use closed intervals.
left censoring: <, L, <=, LE
interval censoring: - or I (symbol has to be placed in the middle of the two category values)
right censoring: >, >=, +, G, GE
uncensored: no symbol (only provide category value)
Below are three correctly formatted tables.
Category | Frequency |
<=6 | 11800 |
7-12 | 57100 |
13-19 | 14800 |
20+ | 3900 |
Category | Frequency |
LE6 | 11800 |
7I12 | 57100 |
13I19 | 14800 |
GE20 | 3900 |
Category | Frequency |
<7 | 11800 |
7I12 | 57100 |
13-19 | 14800 |
>=20 | 3900 |
The cnbinom.pars function outputs a list consisting of the estimated average (mu) and dispersion (r) parameter.
# create frequency table that follows a Poisson distribution set.seed(123) testdata1<-data.frame(table(rpois(100000,lambda = 10))) # negative binomial converges in distribution to the Poisson ## when dispersion->infinity cnbinom.pars(testdata1) # censor table of testdata1 (lambda = 10) testdata2<-cbind(as.character(c("<=5", "6-10", "11-15", "16-20", ">20")), c(6718,51329,37041,4732, 180)) cnbinom.pars(testdata2) # create frequency table that follows a negative binomial distribution # different ways to parameterize the negative binomial distribution mu = 10 r = 2 p = r/(r+mu) set.seed(123) testdata3<-data.frame(table(rnbinom(100000, size = r, mu = mu))) cnbinom.pars(testdata3) set.seed(123) testdata4<-data.frame(table(rnbinom(100000, size = r, p = p))) cnbinom.pars(testdata4) # censor table with mu = 10 and r = 2 testdata5<-cbind(as.character(c("<5", "5-15", "16I26", "27-37", "38I48", ">48")), c(26130,53979,15899,3267, 593, 132)) cnbinom.pars(testdata5)
# create frequency table that follows a Poisson distribution set.seed(123) testdata1<-data.frame(table(rpois(100000,lambda = 10))) # negative binomial converges in distribution to the Poisson ## when dispersion->infinity cnbinom.pars(testdata1) # censor table of testdata1 (lambda = 10) testdata2<-cbind(as.character(c("<=5", "6-10", "11-15", "16-20", ">20")), c(6718,51329,37041,4732, 180)) cnbinom.pars(testdata2) # create frequency table that follows a negative binomial distribution # different ways to parameterize the negative binomial distribution mu = 10 r = 2 p = r/(r+mu) set.seed(123) testdata3<-data.frame(table(rnbinom(100000, size = r, mu = mu))) cnbinom.pars(testdata3) set.seed(123) testdata4<-data.frame(table(rnbinom(100000, size = r, p = p))) cnbinom.pars(testdata4) # censor table with mu = 10 and r = 2 testdata5<-cbind(as.character(c("<5", "5-15", "16I26", "27-37", "38I48", ">48")), c(26130,53979,15899,3267, 593, 132)) cnbinom.pars(testdata5)
This function inputs a contingency table and produces a univariate frequency table from the column marginals.
column.marginal(contingencytable)
column.marginal(contingencytable)
contingencytable |
A censored contingency table. Accepted classes are a data.frame or matrix. See Details section below formatting. |
Table Format:
The only symbols accepted for censored data are listed below. Note, less than or equal to (<= and LE) is not equivalent to less than (< and L) and greater than or equal to (>=, +, and GE) is not equivalent to greater than (> and G). Also, calculations use closed intervals.
left censoring: <, L, <=, LE
interval censoring: - or I (symbol has to be placed in the middle of the two category values)
right censoring: >, >=, +, G, GE
uncensored: no symbol (only provide category value)
The column names should be the Y category values. The first column should be the X category values and the row names can be arbitrary. The inside of the table are X * Y cross tabulation, which are either positive frequency values or probabilities. The row and column marginal totals corresponding to their X and Y category values need to be placed in this table. The top left, top right, and bottom left corners of the table should be NA or blank. The bottom right corner can be a total cross tabulation sum value, NA, or blank. The table below is a formatted example.
NA | <20 | 20-30 | >30 | NA |
<5 | 18 | 19 | 8 | 45 |
5-9 | 13 | 8 | 12 | 33 |
>=10 | 7 | 5 | 10 | 21 |
NA | 38 | 32 | 31 | NA |
A univariate frequency table (data.frame) created from the column marginals.
# contingencytable.csv is a preloaded contingency table example column.marginal(contingencytable.csv)
# contingencytable.csv is a preloaded contingency table example column.marginal(contingencytable.csv)
In 2010, the Population Census Data - Statistics Indonesia provided over 60 censored contingency tables containing Floor Area of Dwelling Unit (square meter) by Household Member Size. The tables are separated by province, urban, and rural. Here we show the household size by area contingency table for Indonesia 's rural Aceh Province.
contingencytable.csv is a preloaded example that is correctly formatted for the revengc package. Below shows how a user should read in a csv file to have the same format:
contingencytable.csv<-read.csv("filename.csv", row.names = NULL, header= TRUE, check.names=FALSE)
contingencytable.csv
contingencytable.csv
Population Census Data - Statistics Indonesia. (2010). Household by Floor Area of Dwelling Unit and Households Member Size. Retrieved from: http://sp2010.bps.go.id/index.php/site/tabel?wid=1100000000&tid=334&fi1=586&fi2=
rec is a statistical approach that estimates what "true" uncensored bivariate table could have been given summarized information. Note, there are assumptions used in this function. First, rec relies on a Poisson distribution when a user only provides an average, which is assuming the variance and average of that variable are equal. A more descriptive input variable(s), such as a decoupled univariate table(s) or a censored frequency table, can account for dispersion found in data. However, independence between decoupled variables still has to be assumed when there is no external information about the joint distribution. Because of these assumptions, rec also provides two options for sensitivity analysis: seed matrix the method used in updating the seed matrix are both arbitrary inputs. For more information it is recommended for a user to read the Details section below and more information can be found in the vignettes.
rec(X, Y, Xlowerbound, Xupperbound, Ylowerbound, Yupperbound, seed.matrix, seed.estimation.method)
rec(X, Y, Xlowerbound, Xupperbound, Ylowerbound, Yupperbound, seed.matrix, seed.estimation.method)
X |
Argument can be an average, a univariate frequency table, or a censored contingency table. The average value should be a numeric class while a data.frame or matrix are acceptable table classes. Y defaults to NULL if X argument is a censored contingency table. See Details section below formatting. |
Y |
Same description as X but this argument is for the Y variable. X defaults to NULL if Y argument is a censored contingency table. |
Xlowerbound |
A numeric class value to represent the left bound for X (row in contingency table). The value must strictly be a non-negative integer and cannot be greater than the lowest category/average value provided for X (e.g. the lower bound cannot be 6 if a table has '>= 5' as a X or row category). |
Xupperbound |
A numeric class value to represent the right bound for X (row in contingency table). The value cannot be less than the highest category/average value provided for X (e.g. the upper bound cannot be 90 if a table has '> 100' as a X or row category). |
Ylowerbound |
Same description as Xlowerbound but this argument is for Y (column in contingency table). |
Yupperbound |
Same description as Xupperbound but this argument is for Y (column in contingency table). |
seed.matrix |
An intial probability matrix to be updated. If decoupled variables is provided the default is a Xlowerbound:Xupperbound by Ylowerbound:Yupperbound seed.matrix with interior cells of 1 / sum(seed.matrix). If a censored contingency table is provided the default is the seedmatrix()$Probabilities output. |
seed.estimation.method |
A character string indicating which method is used for updating the seed.matrix. The choices are: "ipfp", "ml", "chi2", or "lsq". Default is "ipfp". |
Overview:
The rec function handles four cases.
Case I. When provided an average for both X and Y, the averages represent lambda values. These lambdas create truncated Poisson X and Y probability densities for uncensored vectors ranging from Xlowerbound:Xupperbound and Ylowerbound:Yupperbound, respectively. The Xlowerbound:Xupperbound vector with its corresponding density values represents the new row marginal. The Ylowerbound:Yupperbound vector with its corresponding density values represents the new column marginal. This is a decoupled case, and thus the seed (initial cross tabulation matrix to be updated) defaults to (a matrix of ones)/(length(Xlowerbound:Xupperbound) * length(Ylowerbound:Yupperbound)). The mipfp R package then estimates cross tabulations with a selected seed estimation method, new uncensored marginals, and seed matrix. The final result is an uncensored contingency table of probabilities.
Case II. When provided an univariate frequency table for both X and Y, the negative binomial average and dispersion parameters are estimated with a customized maximum likelihood function. These parameters then create truncated negative binomial X and Y probability densities for uncensored vectors, Xlowerbound:Xupperbound and Ylowerbound:Yupperbound, respectively. The methods listed in Case I are then implemented.
Case III. When provided a combination of an average and frequency table (X and Y could be either), the same methods stated in Case I and II are implemented.
Case IV. When provided a censored X*Y contingency table, the row marginals create a univariate X frequency table while the column marginals create a univariate Y frequency table. Both tables estimate there corresponding negative binomial average and dispersion parameters. These parameters then create truncated negative binomial X and Y probability densities for uncensored vectors Xlowerbound:Xupperbound and Ylowerbound:Yupperbound, respectively. The Xlowerbound:Xupperbound vector with its corresponding density values represents the new row marginal. The Ylowerbound:Yupperbound vector with its corresponding density values represents the new column marginal. This is not a decoupled case, and thus the default seed repeats the probability cells, which corresponding to the censored contingency table, for the newly created and compatible uncensored cross tabulations and then makes the matrix equal 1 (i.e. for each cell value j in the seed: seed[j]/sum(seed)). The mipfp R package then estimates cross tabulations with a selected seed estimation method, new uncensored marginals, and seed matrix. The final result is an uncensored contingency table of probabilities.
Table Format:
The table(s) for Case II and III has restrictions. The univariate frequency table, which can be a data.frame or matrix class, must have two columns and n number of rows. The categories must be in the first column with frequencies or probabilities in the second column. Row names should never be placed in this table (the default row names should always be 1:n). Column names can be any character string. The only symbols accepted for censored data are listed below. Note, less than or equal to (<= and LE) is not equivalent to less than (< and L) and greater than or equal to (>=, +, and GE) is not equivalent to greater than (> and G). Also, calculations use closed intervals.
left censoring: <, L, <=, LE
interval censoring: - or I (symbol has to be placed in the middle of the two category values)
right censoring: >, >=, +, G, GE
uncensored: no symbol (only provide category value)
Below are three correctly formatted tables.
Category | Frequency |
<=6 | 11800 |
7-12 | 57100 |
13-19 | 14800 |
20+ | 3900 |
Category | Frequency |
LE6 | 11800 |
7I12 | 57100 |
13I19 | 14800 |
GE20 | 3900 |
Category | Frequency |
<7 | 11800 |
7I12 | 57100 |
13-19 | 14800 |
>=20 | 3900 |
The table for Case IV also has restrictions. The censored symbols should follow the requirements listed above. The table's class can be a data.frame or a matrix. The column names should be the Y category values. The first column should be the X category values and the row names can be arbitrary. The inside of the table are X * Y frequency values, which are either nonnegative frequencies or probabilities if seed_estimation_method is "ipfp" or strictly positive when method is "ml", "lsq" or "chi2". The row and column marginal totals corresponding to their X and Y category values need to be placed in this table. The top left, top right, and bottom left corners of the table should be NA or blank. The bottom right corner can be a total cross tabulation sum value, NA, or blank. The table below is a formatted example.
NA | <20 | 20-30 | >30 | NA |
<5 | 18 | 19 | 8 | 45 |
5-9 | 13 | 8 | 12 | 33 |
>=10 | 7 | 5 | 10 | 21 |
NA | 38 | 32 | 31 | NA |
Bounds:
Ideally, the four bounds should be chosen based off prior knowledge and expert elicitation, but they can also be selected intuitively with a brute force method. If rec outputs a final contingency table with higher probabilities near the edge(s) of the table, then it would make sense to increase the range of the bound(s). For both the X and Y variables, this would just involve making the lower bound less, making the upper bound more, or doing a combination of the two. The opposite holds true as well. If the final contingency table has very low probabilities near the edge(s) of the table, then a user should decrease the range of the particular bound(s).
Seed Estimation Methods:
This function implements the mipfp R package, which offers four methods to estimate cross tabulations when provided fixed marginals.
Method | Abbreviation |
Iterative proportional fitting procedure | ipfp |
Maximum likelihood method | ml |
Minimum chi-squared | chi2 |
Weighted least squares | lsq |
For a summary and understanding of all methods please refer to the vignettes and/or the papers by Little et al. (1991) and Suesse et al. (2017).
The output is a list containing an uncensored contingency table of probabilities (rows range from Xlowerbound:Xupperbound and the columns range from Ylowerbound:Yupperbound) as well as the row X and column Y parameters used in making the margins for the mipfp R package.
Frederick Novomestky and Saralees Nadarajah (2016). truncdist: Truncated Random Variables. R package version 1.0-2. https://CRAN.R-project.org/package=truncdist
Johan Barthelemy and Thomas Suesse (2018). mipfp: Multidimensional Iterative Proportional Fitting and Alternative Models. R package version 3.2. https://CRAN.R-project.org/package=mipfp
Little, R. J., Wu, M. M. (1991) Models for contingency tables with known margins when target and sampled populations differ. Journal of the American Statistical Association, 86(413): 87-95. doi: https://doi.org/10.2307/2289718
Suesse, T., Namazi-Rad, M., Mokhtarian, P., & Barthelemy, J. (2017). Estimating Cross-Classified Population Counts of Multidimensional Tables: An Application to Regional Australia to Obtain Pseudo-Census Counts, Journal of Official Statistics, 33(4), 1021-1050. doi: https://doi.org/10.1515/jos-2017-0048
# provide two averages # seed.matrix defaults to a matrix of ones # seed.estimation.method defaults to ipfp twoaverages.results<-rec( X= 4.4, Y = 571.3, Xlowerbound = 1, Xupperbound = 20, Ylowerbound = 520, Yupperbound = 620) # provide one average and one table # create a censored univariate table # seed.matrix defaults to a matrix of ones # seed.estimation.method defaults to ipfp Y.table = cbind(as.character(c("<7", "7-12", "13-19", ">19")), c(11800,57100,14800,3900)) combo.results<-rec(X= 2.3, Y = Y.table, Xlowerbound = 1, Xupperbound = 15, Ylowerbound = 1, Yupperbound = 30) # provide a censored contingency table contingencytable<-matrix(c(6185,9797,16809,11126,6156,3637,908,147,69,4, 5408,12748,26506,21486,14018,9165,2658,567,196,78, 7403,20444,44370,36285,23576,15750,4715,994,364,136, 4793,17376,44065,40751,28900,20404,6557,1296,555,228, 2354,11143,32837,33910,26203,19301,6835,1438,618,245, 1060,6038,19256,21298,17774,13864,4656,1039,430,178, 273,2521,9110,11188,9626,7433,2608,578,196,112, 119,1130,4183,5566,5053,3938,1367,318,119,66, 33,388,1707,2367,2328,1972,719,171,68,37, 38,178,1047,1672,1740,1666,757,193,158,164), nrow=10,ncol=10, byrow=TRUE) rowmarginal<-apply(contingencytable,1,sum) contingencytable<-cbind(contingencytable, rowmarginal) colmarginal<-apply(contingencytable,2,sum) contingencytable<-rbind(contingencytable, colmarginal) row.names(contingencytable)[row.names(contingencytable)=="colmarginal"]<-"" contingencytable<-data.frame(c("1","2","3","4","5","6", "7", "8","9","10+", NA), contingencytable) colnames(contingencytable)<-c(NA,"<20","20-29","30-39","40-49","50-69","70-99", "100-149","150-199","200-299","300+", NA) # the contingencytable input could be put in X or Y (opposing argument = NULL) # X = rows and Y = columns # seed.matrix default = repeating the cross tabulations in the censored contingency ## table for the newly created and compatible uncensored cross tabulations # seed.estimation.method defaults to ipfp contingencytable.results<-rec( X= contingencytable, Xlowerbound = 1, Xupperbound = 15, Ylowerbound = 10, Yupperbound = 310)
# provide two averages # seed.matrix defaults to a matrix of ones # seed.estimation.method defaults to ipfp twoaverages.results<-rec( X= 4.4, Y = 571.3, Xlowerbound = 1, Xupperbound = 20, Ylowerbound = 520, Yupperbound = 620) # provide one average and one table # create a censored univariate table # seed.matrix defaults to a matrix of ones # seed.estimation.method defaults to ipfp Y.table = cbind(as.character(c("<7", "7-12", "13-19", ">19")), c(11800,57100,14800,3900)) combo.results<-rec(X= 2.3, Y = Y.table, Xlowerbound = 1, Xupperbound = 15, Ylowerbound = 1, Yupperbound = 30) # provide a censored contingency table contingencytable<-matrix(c(6185,9797,16809,11126,6156,3637,908,147,69,4, 5408,12748,26506,21486,14018,9165,2658,567,196,78, 7403,20444,44370,36285,23576,15750,4715,994,364,136, 4793,17376,44065,40751,28900,20404,6557,1296,555,228, 2354,11143,32837,33910,26203,19301,6835,1438,618,245, 1060,6038,19256,21298,17774,13864,4656,1039,430,178, 273,2521,9110,11188,9626,7433,2608,578,196,112, 119,1130,4183,5566,5053,3938,1367,318,119,66, 33,388,1707,2367,2328,1972,719,171,68,37, 38,178,1047,1672,1740,1666,757,193,158,164), nrow=10,ncol=10, byrow=TRUE) rowmarginal<-apply(contingencytable,1,sum) contingencytable<-cbind(contingencytable, rowmarginal) colmarginal<-apply(contingencytable,2,sum) contingencytable<-rbind(contingencytable, colmarginal) row.names(contingencytable)[row.names(contingencytable)=="colmarginal"]<-"" contingencytable<-data.frame(c("1","2","3","4","5","6", "7", "8","9","10+", NA), contingencytable) colnames(contingencytable)<-c(NA,"<20","20-29","30-39","40-49","50-69","70-99", "100-149","150-199","200-299","300+", NA) # the contingencytable input could be put in X or Y (opposing argument = NULL) # X = rows and Y = columns # seed.matrix default = repeating the cross tabulations in the censored contingency ## table for the newly created and compatible uncensored cross tabulations # seed.estimation.method defaults to ipfp contingencytable.results<-rec( X= contingencytable, Xlowerbound = 1, Xupperbound = 15, Ylowerbound = 10, Yupperbound = 310)
This function is used in the main function: rec.
reweight.contingencytable(observed.table, estimated.table)
reweight.contingencytable(observed.table, estimated.table)
observed.table |
A censored contingency table. See Details section below formatting this data.frame. |
estimated.table |
A data.frame with uncensored row names and column names. |
Format for observed.table:
The only symbols accepted for censored data are listed below. Note, less than or equal to (<= and LE) is not equivalent to less than (< and L) and greater than or equal to (>=, +, and GE) is not equivalent to greater than (> and G). Also, calculations use closed intervals.
left censoring: <, L, <=, LE
interval censoring: - or I (symbol has to be placed in the middle of the two category values)
right censoring: >, >=, +, G, GE
uncensored: no symbol (only provide category value)
The column names should be the Y category values. The first column should be the X category values and the row names can be arbitrary. The inside of the table are X * Y cross tabulation, which are either positive frequency values or probabilities. The row and column marginal totals corresponding to their X and Y category values need to be placed in this table. The top left, top right, and bottom left corners of the table should be NA or blank. The bottom right corner can be a total cross tabulation sum value, NA, or blank. The table below is a formatted example.
NA | <20 | 20-30 | >30 | NA |
<5 | 18 | 19 | 8 | 45 |
5-9 | 13 | 8 | 12 | 33 |
>=10 | 7 | 5 | 10 | 21 |
NA | 38 | 32 | 31 | NA |
Interior probability cells of a censored contingency table (observed.table) is reweighted to match interior probability cells of an uncensored contingency table (estimated.table). If observed.table consist of frequencies, the reweight.contingencytable() function changes frequencies to probabilities.
## going through the coding step of rec ## # first create contingency table contingencytable<-matrix(1:9, nrow = 3, ncol = 3) rowmarginal<-apply(contingencytable,1,sum) contingencytable<-cbind(contingencytable, rowmarginal) colmarginal<-apply(contingencytable,2,sum) contingencytable<-rbind(contingencytable, colmarginal) row.names(contingencytable)[row.names(contingencytable)=="colmarginal"]<-"" contingencytable<-data.frame(c("<5", "5I9", ">9", NA), contingencytable) colnames(contingencytable)<-c(NA,"<=19","20-30",">=31", NA) # provided upper and lower bound values for table # X=row and Y=column Xlowerbound=1 Xupperbound=15 Ylowerbound=15 Yupperbound=35 # table of row marginals provides average x and phi x row.marginal.table<-row.marginal(contingencytable) x<-cnbinom.pars(row.marginal.table) # table of column marginals provides average y and phi y column.marginal.table<-column.marginal(contingencytable) y<-cnbinom.pars(column.marginal.table) # create row and column ranges rowrange<-Xlowerbound:Xupperbound colrange<-Ylowerbound:Yupperbound library(truncdist) # new uncensored row marginal table = truncated negative binomial distribution # rowrange = X is distributed given a < X <= b uncensored.row.margin<-dtrunc(rowrange, mu=x$Average, size = x$Dispersion, a = Xlowerbound-1, b = Xupperbound, spec = "nbinom") # new uncensored column margin table = = truncated negative binomial distribution # colrange = Y is distributed given a < Y <= b uncensored.column.margin<-dtrunc(colrange, mu=y$Average, size = y$Dispersion, a = Ylowerbound-1, b = Yupperbound, spec = "nbinom") # sum of truncated distributions equal 1 sum(uncensored.row.margin) sum(uncensored.column.margin) # look at the seed for this example (probabilities) seed.output<-seedmatrix(contingencytable, Xlowerbound, Xupperbound, Ylowerbound, Yupperbound)$Probabilities # run mipfp # store the new margins in a list tgt.data<-list(uncensored.row.margin, uncensored.column.margin) # list of dimensions of each marginal constrain tgt.list<-list(1,2) # calling the estimated function ## seed has to be in array format for mipfp package ## ipfp is the selected seed.estimation.method ## $p.hat gives probabilies = x.hat/sum(x.hat) library(mipfp) final1<-Estimate(array(seed.output,dim=c(length(Xlowerbound:Xupperbound), length(Ylowerbound:Yupperbound))), tgt.list, tgt.data, method="ipfp")$x.hat # filling in names of updated seed final1<-data.frame(final1) row.names(final1)<-Xlowerbound:Xupperbound names(final1)<-Ylowerbound:Yupperbound # reweight estimates to known censored interior cell probabilities final1<-reweight.contingencytable(observed.table = contingencytable, estimated.table = final1) # see that they sum to one sum(final1) # rec function outputs the same table # default of rec seed.estimation.method is ipfp # default of rec seed.matrix is the output of the seedmatrix() function final2<-rec(X= contingencytable, Xlowerbound = 1, Xupperbound = 15, Ylowerbound = 15, Yupperbound = 35) # check that both data.frame results have same values all(final1 == final2$Probability.Estimates)
## going through the coding step of rec ## # first create contingency table contingencytable<-matrix(1:9, nrow = 3, ncol = 3) rowmarginal<-apply(contingencytable,1,sum) contingencytable<-cbind(contingencytable, rowmarginal) colmarginal<-apply(contingencytable,2,sum) contingencytable<-rbind(contingencytable, colmarginal) row.names(contingencytable)[row.names(contingencytable)=="colmarginal"]<-"" contingencytable<-data.frame(c("<5", "5I9", ">9", NA), contingencytable) colnames(contingencytable)<-c(NA,"<=19","20-30",">=31", NA) # provided upper and lower bound values for table # X=row and Y=column Xlowerbound=1 Xupperbound=15 Ylowerbound=15 Yupperbound=35 # table of row marginals provides average x and phi x row.marginal.table<-row.marginal(contingencytable) x<-cnbinom.pars(row.marginal.table) # table of column marginals provides average y and phi y column.marginal.table<-column.marginal(contingencytable) y<-cnbinom.pars(column.marginal.table) # create row and column ranges rowrange<-Xlowerbound:Xupperbound colrange<-Ylowerbound:Yupperbound library(truncdist) # new uncensored row marginal table = truncated negative binomial distribution # rowrange = X is distributed given a < X <= b uncensored.row.margin<-dtrunc(rowrange, mu=x$Average, size = x$Dispersion, a = Xlowerbound-1, b = Xupperbound, spec = "nbinom") # new uncensored column margin table = = truncated negative binomial distribution # colrange = Y is distributed given a < Y <= b uncensored.column.margin<-dtrunc(colrange, mu=y$Average, size = y$Dispersion, a = Ylowerbound-1, b = Yupperbound, spec = "nbinom") # sum of truncated distributions equal 1 sum(uncensored.row.margin) sum(uncensored.column.margin) # look at the seed for this example (probabilities) seed.output<-seedmatrix(contingencytable, Xlowerbound, Xupperbound, Ylowerbound, Yupperbound)$Probabilities # run mipfp # store the new margins in a list tgt.data<-list(uncensored.row.margin, uncensored.column.margin) # list of dimensions of each marginal constrain tgt.list<-list(1,2) # calling the estimated function ## seed has to be in array format for mipfp package ## ipfp is the selected seed.estimation.method ## $p.hat gives probabilies = x.hat/sum(x.hat) library(mipfp) final1<-Estimate(array(seed.output,dim=c(length(Xlowerbound:Xupperbound), length(Ylowerbound:Yupperbound))), tgt.list, tgt.data, method="ipfp")$x.hat # filling in names of updated seed final1<-data.frame(final1) row.names(final1)<-Xlowerbound:Xupperbound names(final1)<-Ylowerbound:Yupperbound # reweight estimates to known censored interior cell probabilities final1<-reweight.contingencytable(observed.table = contingencytable, estimated.table = final1) # see that they sum to one sum(final1) # rec function outputs the same table # default of rec seed.estimation.method is ipfp # default of rec seed.matrix is the output of the seedmatrix() function final2<-rec(X= contingencytable, Xlowerbound = 1, Xupperbound = 15, Ylowerbound = 15, Yupperbound = 35) # check that both data.frame results have same values all(final1 == final2$Probability.Estimates)
This function is used in the main function: rec.
reweight.univariatetable(observed.table, estimated.table)
reweight.univariatetable(observed.table, estimated.table)
observed.table |
A data.frame or matrix with two columns. See Details section for formatting. |
estimated.table |
A numeric vector with uncensored row names. |
Format for observed.table:
This univariate frequency table, which can be a data.frame or matrix class, must have two columns and n number of rows. The categories must be in the first column with frequencies or probabilities in the second column. Row names should never be placed in this table (the default row names should always be 1:n). Column names can be any character string. The only symbols accepted for censored data are listed below. Note, less than or equal to (<= and LE) is not equivalent to less than (< and L) and greater than or equal to (>=, +, and GE) is not equivalent to greater than (> and G). Also, calculations use closed intervals.
left censoring: <, L, <=, LE
interval censoring: - or I (symbol has to be placed in the middle of the two category values)
right censoring: >, >=, +, G, GE
uncensored: no symbol (only provide category value)
Below are three correctly formatted tables.
Category | Frequency |
<=6 | 11800 |
7-12 | 57100 |
13-19 | 14800 |
20+ | 3900 |
Category | Frequency |
LE6 | 11800 |
7I12 | 57100 |
13I19 | 14800 |
GE20 | 3900 |
Category | Frequency |
<7 | 11800 |
7I12 | 57100 |
13-19 | 14800 |
>=20 | 3900 |
A censored univariate table (observed.table) of probabilities is reweighted to match probabilities in an uncensored numeric vector (estimated.table). If observed.table consist of frequencies, the reweight.contingencytable() function changes the frequencies to probabilities.
# use preloaded univariate table for observed table observed.table<-univariatetable.csv # estimate parameters from observed table pars=cnbinom.pars(observed.table) mu=pars$Average r = pars$Dispersion # create estimated table # truncated negative binomial probabilities # uncensored range is from 1:15 library(truncdist) estimated.table<-dtrunc(1:15, size = r, mu = mu, spec = "nbinom", a = 1-1, b = 15) names(estimated.table)<-1:15 # reweight observed table to estimated table results<-reweight.univariatetable(observed.table, estimated.table) # check results for >=9 reweightedresults<-sum(results[9:15]) # observed.table$V2[5] = 4.1 observedresults<-observed.table$V2[5]/sum(observed.table$V2) # matching probabilities all.equal(reweightedresults, observedresults)
# use preloaded univariate table for observed table observed.table<-univariatetable.csv # estimate parameters from observed table pars=cnbinom.pars(observed.table) mu=pars$Average r = pars$Dispersion # create estimated table # truncated negative binomial probabilities # uncensored range is from 1:15 library(truncdist) estimated.table<-dtrunc(1:15, size = r, mu = mu, spec = "nbinom", a = 1-1, b = 15) names(estimated.table)<-1:15 # reweight observed table to estimated table results<-reweight.univariatetable(observed.table, estimated.table) # check results for >=9 reweightedresults<-sum(results[9:15]) # observed.table$V2[5] = 4.1 observedresults<-observed.table$V2[5]/sum(observed.table$V2) # matching probabilities all.equal(reweightedresults, observedresults)
This function inputs a contingency table and produces a univariate frequency table from the row marginals.
row.marginal(contingencytable)
row.marginal(contingencytable)
contingencytable |
A censored contingency table. Accepted classes are a data.frame or matrix. See Details section for formatting. |
Table Format:
The only symbols accepted for censored data are listed below. Note, less than or equal to (<= and LE) is not equivalent to less than (< and L) and greater than or equal to (>=, +, and GE) is not equivalent to greater than (> and G). Also, calculations use closed intervals.
left censoring: <, L, <=, LE
interval censoring: - or I (symbol has to be placed in the middle of the two category values)
right censoring: >, >=, +, G, GE
uncensored: no symbol (only provide category value)
The column names should be the Y category values. The first column should be the X category values and the row names can be arbitrary. The inside of the table are X * Y cross tabulation, which are either positive frequency values or probabilities. The row and column marginal totals corresponding to their X and Y category values need to be placed in this table. The top left, top right, and bottom left corners of the table should be NA or blank. The bottom right corner can be a total cross tabulation sum value, NA, or blank. The table below is a formatted example.
NA | <20 | 20-30 | >30 | NA |
<5 | 18 | 19 | 8 | 45 |
5-9 | 13 | 8 | 12 | 33 |
>=10 | 7 | 5 | 10 | 21 |
NA | 38 | 32 | 31 | NA |
A univariate frequency table (data.frame) created from the row marginals.
# contingencytable.csv is a preloaded contingency table example row.marginal(contingencytable.csv)
# contingencytable.csv is a preloaded contingency table example row.marginal(contingencytable.csv)
To implement the mipfp R package in our 'rec' function, an initial N-dimensional array (called a seed) needs to be provided. This function creates a seed matrix for an uncensored contingency table (Xlowerbound:Xupperbound, Ylowerbound:Yupperbound) when provided a censored contingency table.
seedmatrix(censoredtable, Xlowerbound, Xupperbound, Ylowerbound, Yupperbound)
seedmatrix(censoredtable, Xlowerbound, Xupperbound, Ylowerbound, Yupperbound)
censoredtable |
A censored contingency table. A data.frame and matrix are accepted table classes. See Details section for formatting. |
Xlowerbound |
A numeric class value to represent a lower bound for the row values. The value must strictly be >= 0 and cannot be greater than the lowest row category value (e.g. the lower bound cannot be 6 if a table has '>= 5' as a row category value) |
Xupperbound |
A numeric class value to represent an upper bound for the row values. The value cannot be less than the highest row category value (e.g. the upper bound cannot be 90 if a table has '> 100' as a row category value). |
Ylowerbound |
Same description as Xlowerbound but this argument is for the columns in contingency table). |
Yupperbound |
Same description as Xupperbound but this argument is for the columns in contingency table). |
Table Format:
The only symbols accepted for censored data are listed below. Note, less than or equal to (<= and LE) is not equivalent to less than (< and L) and greater than or equal to (>=, +, and GE) is not equivalent to greater than (> and G). Also, calculations use closed intervals.
left censoring: <, L, <=, LE
interval censoring: - or I (symbol has to be placed in the middle of the two category values)
right censoring: >, >=, +, G, GE
uncensored: no symbol (only provide category value)
The column names should be the Y category values. The first column should be the X category values and the row names can be arbitrary. The inside of the table are X * Y cross tabulation, which are either positive frequency values or probabilities. The row and column marginal totals corresponding to their X and Y category values need to be placed in this table. The top left, top right, and bottom left corners of the table should be NA or blank. The bottom right corner can be a total cross tabulation sum value, NA, or blank. The table below is a formatted example.
NA | <20 | 20-30 | >30 | NA |
<5 | 18 | 19 | 8 | 45 |
5-9 | 13 | 8 | 12 | 33 |
>=10 | 7 | 5 | 10 | 21 |
NA | 38 | 32 | 31 | NA |
The output is a list of two tables: 'Exact' and 'Probabilities'. Both tables have rows ranging from Xlowerbound, Xlowerbound + 1,..., Xupperbound and columns ranging from Ylowerbound, Ylowerbound + 1,..., Yupperbound. Also, the marginals are not placed in either table. The 'Exact' table uses the cross tabulation probabilities corresponding to the censored categories and repeats these values to the newly created and compatible uncensored cross tabulations (frequencies are changed to probabilities). The 'Probabilities' table takes the the cells of the 'Exact' table are divides by sum(Exact). The rec() function uses the 'Probabilities' table.
# create a censored contingency table contingencytable<-matrix(1:9, nrow = 3, ncol = 3) rowmarginal<-apply(contingencytable,1,sum) contingencytable<-cbind(contingencytable, rowmarginal) colmarginal<-apply(contingencytable,2,sum) contingencytable<-rbind(contingencytable, colmarginal) row.names(contingencytable)[row.names(contingencytable)=="colmarginal"]<-"" contingencytable<-data.frame(c("<5", "5I9", ">9", NA), contingencytable) colnames(contingencytable)<-c(NA,"<=19","20-30",">=31", NA) # look at the seed for this example (probabilities) seed.output<-seedmatrix(contingencytable, 1, 15, 15, 35)$Probabilities
# create a censored contingency table contingencytable<-matrix(1:9, nrow = 3, ncol = 3) rowmarginal<-apply(contingencytable,1,sum) contingencytable<-cbind(contingencytable, rowmarginal) colmarginal<-apply(contingencytable,2,sum) contingencytable<-rbind(contingencytable, colmarginal) row.names(contingencytable)[row.names(contingencytable)=="colmarginal"]<-"" contingencytable<-data.frame(c("<5", "5I9", ">9", NA), contingencytable) colnames(contingencytable)<-c(NA,"<=19","20-30",">=31", NA) # look at the seed for this example (probabilities) seed.output<-seedmatrix(contingencytable, 1, 15, 15, 35)$Probabilities
In 2011, a Nepal Living Standards Survey provided multiple censored freqency tables containing household size (percent). Here we show the urban household size table.
univariatetable.csv is a preloaded example that is correctly formatted for the revengc package. Below shows how a user should read in a csv file to have the same format:
univariatetable.csv<-read.csv("filename.csv", row.names = NULL, header= FALSE, check.names=FALSE)
univariatetable.csv
univariatetable.csv
National Planning Commissions Secretariat, Government of Nepal. (2011). Nepal Living Standards Survey. Page 28. Retrieved from: http://siteresources.worldbank.org/INTLSMS/Resources/3358986-1181743055198/3877319-1329489437402/Statistical_Report_Vol1.pdf