Title: | Functions for Optimal Non-Bipartite Matching |
---|---|
Description: | Perform non-bipartite matching and matched randomization. A "bipartite" matching utilizes two separate groups, e.g. smokers being matched to nonsmokers or cases being matched to controls. A "non-bipartite" matching creates mates from one big group, e.g. 100 hospitals being randomized for a two-arm cluster randomized trial or 5000 children who have been exposed to various levels of secondhand smoke and are being paired to form a greater exposure vs. lesser exposure comparison. At the core of a non-bipartite matching is a N x N distance matrix for N potential mates. The distance between two units expresses a measure of similarity or quality as mates (the lower the better). The 'gendistance()' and 'distancematrix()' functions assist in creating this. The 'nonbimatch()' function creates the matching that minimizes the total sum of distances between mates; hence, it is referred to as an "optimal" matching. The 'assign.grp()' function aids in performing a matched randomization. Note bipartite matching can be performed using the prevent option in 'gendistance()'. |
Authors: | Cole Beck [aut, cre], Bo Lu [aut], Robert Greevy [aut] |
Maintainer: | Cole Beck <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.5.6 |
Built: | 2025-02-21 05:46:08 UTC |
Source: | https://github.com/couthcommander/nbpmatching |
This package will take an input distance matrix and generate the set of pairwise matches that minimizes the sum of distances between the pairs by running nonbimatch.
The most current documentation is available at https://github.com/couthcommander/nbpMatching.
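To make the meaning of an "optimal" matching concrete, the following base-R sketch enumerates every perfect matching of a tiny distance matrix and keeps the one with the smallest total distance. It is only an illustration: the helper `best_matching` and the 4 x 4 matrix are hypothetical, and nonbimatch itself relies on Fortran code rather than brute force, which would be infeasible beyond a handful of elements.

```r
# Brute-force search over all perfect matchings (feasible only for tiny N).
# best_matching is a hypothetical helper, not part of nbpMatching.
best_matching <- function(d, ids = seq_len(nrow(d))) {
  if (length(ids) == 0) return(list(total = 0, pairs = NULL))
  first <- ids[1]
  best <- list(total = Inf, pairs = NULL)
  for (mate in ids[-1]) {
    # Pair `first` with `mate`, then match the remaining elements recursively.
    rest <- best_matching(d, setdiff(ids, c(first, mate)))
    total <- d[first, mate] + rest$total
    if (total < best$total) {
      best <- list(total = total, pairs = rbind(c(first, mate), rest$pairs))
    }
  }
  best
}

# Symmetric 4 x 4 distance matrix (made-up values).
d <- matrix(c(0, 1, 5, 6,
              1, 0, 7, 5,
              5, 7, 0, 2,
              6, 5, 2, 0), nrow = 4)
res <- best_matching(d)
res$total  # 3: pairs (1,2) and (3,4) beat the alternatives (totals 10 and 13)
res$pairs
```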
Bo Lu, Robert Greevy, Cole Beck
Maintainer: Cole Beck [email protected]
Lu B, Greevy R, Xu X, Beck C. Optimal Nonbipartite Matching and its Statistical Applications. The American Statistician. Vol. 65, no. 1. : 21-30. 2011.
Greevy RA Jr, Grijalva CG, Roumie CL, Beck C, Hung AM, Murff HJ, Liu X, Griffin MR. Reweighted Mahalanobis distance matching for cluster-randomized trials with missing data. Pharmacoepidemiol Drug Saf. 2012 May;21 Suppl 2:148-54. doi: 10.1002/pds.3260.
    # create a covariate matrix
    df <- data.frame(id=LETTERS[1:25], val1=rnorm(25), val2=rnorm(25))
    # create distances
    df.dist <- gendistance(df, idcol=1)
    # create distancematrix object
    df.mdm <- distancematrix(df.dist)
    # create matches
    df.match <- nonbimatch(df.mdm)
    # review quality of matches
    df.qom <- qom(df.dist$cov, df.match$matches)
    # some helper functions are available
    # runner -- start with the covariate, run through the entire process
    df.1 <- runner(df, idcol=1)
    # full.qom -- start with the covariate, generate a full quality of match report
    df.2 <- full.qom(df)
    ## Not run:
    # try a large matrix
    nonbimatch(distancematrix(as.matrix(dist(sample(1:10^8, 5000, replace=TRUE)))))
    ## End(Not run)
Randomly assign each element into treatment group A or B.
assign.grp(matches, seed = 68, ...)
matches |
A data.frame or nonbimatch object. Contains information on how to match the covariate data set. |
seed |
Seed provided for random-number generation. Default value of 68. |
... |
Additional arguments, not used at the moment. |
This function takes the matched pairs generated by nonbimatch and randomly assigns each element to a group.
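The idea of randomizing within matched pairs can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the internals of assign.grp: the `pairs` data.frame, its column names, and the coin-flip scheme are hypothetical.

```r
# Hypothetical matched pairs: each row holds the IDs of two mates.
pairs <- data.frame(a = c("A", "C", "E"), b = c("B", "D", "F"),
                    stringsAsFactors = FALSE)

set.seed(68)  # fixed seed so the assignment is reproducible
# For each pair, a coin flip decides which mate is placed in group "A".
flip <- sample(c(TRUE, FALSE), nrow(pairs), replace = TRUE)

assignment <- data.frame(
  id  = c(pairs$a, pairs$b),
  grp = c(ifelse(flip, "A", "B"), ifelse(flip, "B", "A")),
  stringsAsFactors = FALSE
)
assignment
```

Because the two mates of a pair always land in opposite groups, both groups end up the same size and balanced on whatever the distance matrix was built from.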
original data.frame with treatment group column
Cole Beck
    df <- data.frame(id=LETTERS[1:25], val1=rnorm(25), val2=rnorm(25))
    df.dist <- gendistance(df, idcol=1)
    df.mdm <- distancematrix(df.dist)
    df.match <- nonbimatch(df.mdm)
    assign.grp(df.match)
    assign.grp(df.match$matches)
The distancematrix function is used to reformat the input distance matrix
into the format required by the nonbipartite matching Fortran code. The
original matrix should have dimensions N x N, where N is the total
number of elements to be matched. The matrix may be created in R and input
into the distancematrix function. Alternately, the matrix may be read in
from a CSV file, i.e. a text file where distances in a given row are
delimited by commas. If a list element is given, it should have a data.frame
element named "dist", preferably generated by the gendistance function.
distancematrix(x, ...)
x |
A matrix, data.frame, list or filename. This should be an N x N distance matrix. |
... |
Additional arguments, potentially used when reading in a filename and passed into read.csv. |
The distancematrix function is used to reformat the input distance matrix into the format required by the nonbipartite matching Fortran code.
If an extra column or row is present, it will be converted into row
names. In other words, if the matrix has dimensions (N+1) x N, or
N x (N+1), then the function will take the first row, or column, as
an ID column. If both row and column names are present, i.e. a
(N+1) x (N+1) matrix, the function cannot identify the names.
If an odd number of elements exists, a ghost element, or sink, will be created whose distance is zero to all of the other elements. For example, when matching 17 elements, the function will create an 18th element that matches every element perfectly. This sink may or may not be appropriate for your application. Naturally, you may create sinks as needed in the distance matrix you input to the distancematrix function.
The elements of distancematrix may not be re-assigned once created. In other words, you cannot edit the formatted distance matrix. You need to edit the matrix being input into the distancematrix function.
distancematrix S4 object
Cole Beck
    plainmatrix <- as.matrix(dist(sample(1:25, 8, replace=TRUE)))
    # setting diagonal to an infinite distance for pedagogical reasons
    # (the diagonal may be left as zero)
    diag(plainmatrix) <- 99999
    mdm <- distancematrix(plainmatrix)
    df <- data.frame(id=LETTERS[1:25], val1=rnorm(25), val2=rnorm(25))
    df[sample(seq_len(nrow(df)), ceiling(nrow(df)*0.1)), 2] <- NA
    df.dist <- gendistance(df, idcol=1, ndiscard=2)
    mdm2 <- distancematrix(df.dist)
The fill.missing function uses the transcan
function from the
Hmisc package to impute values for the given data.frame.
fill.missing(x, seed = 101, simplify = TRUE, idcol = "id", ...)
x |
A data.frame object. It should have missing values. |
seed |
Seed provided for random-number generation. Default value of 101. |
simplify |
logical: whether to remove duplicate missingness columns. |
idcol |
An integer value or character string. Indicates the column containing IDs, specified as column index or column name. Defaults to "id", or NA, when not found. |
... |
Additional arguments, potentially passed to transcan. |
The fill.missing function will fill the missing values within a data.frame
with the values imputed with the transcan
function. An idcol may be
specified to keep the IDs out of the imputation. In addition,
for every column that contains missing data, a new column will be attached to
the data.frame containing an indicator of missingness. A "1" indicates that
the value was missing and has been imputed.
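The missingness-indicator idea can be illustrated without the package. The sketch below is an assumption-laden stand-in: it imputes with the observed mean rather than with transcan, and the column name `val1.missing` is purely illustrative and may not match the naming fill.missing uses.

```r
df <- data.frame(id = LETTERS[1:5], val1 = c(1.0, NA, 3.0, 4.0, NA),
                 stringsAsFactors = FALSE)

# Indicator column: 1 where the value was missing, 0 otherwise.
# The name "val1.missing" is illustrative, not the package's naming.
df$val1.missing <- as.integer(is.na(df$val1))

# Stand-in imputation: replace NA with the observed mean.
# fill.missing itself imputes with Hmisc's transcan, not the mean.
df$val1[is.na(df$val1)] <- mean(df$val1, na.rm = TRUE)
df
```

Keeping the indicator alongside the imputed value lets a later distance calculation distinguish observed values from imputed ones.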
data.frame with imputed values
Cole Beck
    set.seed(1)
    df <- data.frame(id=LETTERS[1:25], val1=rnorm(25), val2=rnorm(25))
    df[sample(seq_len(nrow(df)), ceiling(nrow(df)*0.1)), 2] <- NA
    df <- fill.missing(df)
The gendistance function creates an (N+K) x (N+K) distance matrix
from an N x P covariates matrix, where N is the number
of subjects, P the number of covariates, and K the number of
phantom subjects requested (see the ndiscard option). Provided the
covariates' covariance matrix is invertible, the distances computed are
Mahalanobis distances, or if covariate weights are provided, Reweighted
Mahalanobis distances (see the weights option and Greevy, et al.,
Pharmacoepidemiology and Drug Safety 2012).
gendistance( covariate, idcol = NULL, weights = NULL, prevent = NULL, force = NULL, rankcols = NULL, missing.weight = 0.1, ndiscard = 0, singular.method = "solve", talisman = NULL, prevent.res.match = NULL, outRawDist = FALSE, ... )
covariate |
A data.frame object, containing the covariates of the data set. |
idcol |
An integer or column name, providing the index of the column containing row ID's. |
weights |
A numeric vector, the length should match the number of columns. This value determines how much weight is given to each column when generating the distance matrix. |
prevent |
A vector of integers or column names, providing the index of columns that should be used to prevent matches. When generating the distance matrix, elements that match on these columns are given a maximum distance. |
force |
An integer or column name, providing the index of the column containing information used to force pairs to match. |
rankcols |
A vector of integers or column names, providing the index of columns that should have the rank function applied to them before generating the distance matrix. |
missing.weight |
A numeric value, or vector, used to generate the weight of missingness indicator columns. Missingness indicator columns are created if there is missing data within the data set. Defaults to 0.1. If a single value is supplied, weights are generated by multiplying this by the original columns' weight. If a vector is supplied, its length should match the number of columns with missing data, and the weight is used as is. |
ndiscard |
An integer, providing the number of elements that should be allowed to match phantom values. The default value is 0. |
singular.method |
A character string, indicating the function to use
when encountering a singular matrix. By default, solve is called; "ginv" (from the MASS package) is the alternative used in the examples. |
talisman |
An integer or column name, providing the location of the talisman column. The talisman column should only contain values of 0 and 1. Records with zero will match phantoms perfectly, while other records will match phantoms at max distance. |
prevent.res.match |
An integer or column name, providing the location of the column containing assigned treatment groups. This is useful in some settings, such as trickle-in randomized trials. When set, non-NA values from this column are replaced with the value 1. This prevents records with previously assigned treatments (the ‘reservoir’) from matching each other. |
outRawDist |
a logical, indicating if the raw distance matrix should also be returned. The raw form is before distance modifiers such as ‘prevent’ take effect. |
... |
Additional arguments, not used at this time. |
Given a data.frame of covariates, generate a distance matrix. Missing values
are imputed with fill.missing. For each column with missing
data, a missingness indicator column will be added. Phantoms are fake
elements that perfectly match all elements. They can be used to discard a
certain number of elements.
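For readers unfamiliar with Mahalanobis distances, the base-R sketch below builds a pairwise distance matrix from a small covariate matrix using `stats::mahalanobis`. It is a minimal illustration that assumes an invertible covariance matrix and no weights or phantoms; the square root is taken here for readability, and the scaling gendistance applies may differ.

```r
set.seed(1)
# Hypothetical covariates: 6 subjects, 2 variables.
X <- cbind(val1 = rnorm(6), val2 = rnorm(6))
S <- cov(X)          # covariance matrix of the covariates
n <- nrow(X)

D <- matrix(0, n, n)
for (i in seq_len(n)) {
  # mahalanobis() returns squared distances from every row of X to X[i, ].
  D[i, ] <- sqrt(mahalanobis(X, center = X[i, ], cov = S))
}
round(D, 2)
```

The result is an N x N symmetric matrix with a zero diagonal, which is the shape distancematrix expects as input.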
a list object with several elements
dist |
generated distance matrix |
cov |
covariate matrix used to generate distances |
ignored |
ignored columns from original covariate matrix |
weights |
weights applied to each column in covariate matrix |
prevent |
columns used to prevent matches |
mates |
index of rows that should be forced to match |
rankcols |
index of columns that should use rank |
missing.weight |
weight to apply to missingness indicator columns |
ndiscard |
number of elements that will match phantoms |
rawDist |
raw distance matrix, only provided if ‘outRawDist’ is TRUE |
Cole Beck
    set.seed(1)
    df <- data.frame(id=LETTERS[1:25], val1=rnorm(25), val2=rnorm(25))
    # add some missing data
    df[sample(seq_len(nrow(df)), ceiling(nrow(df)*0.1)), 2] <- NA
    df.dist <- gendistance(df, idcol=1, ndiscard=2)
    # up-weight the second column
    df.weighted <- gendistance(df, idcol=1, weights=c(1,2,1), ndiscard=2, missing.weight=0.25)
    df[,3] <- df[,2]*2
    df.sing.solve <- gendistance(df, idcol=1, ndiscard=2)
    df.sing.ginv <- gendistance(df, idcol=1, ndiscard=2, singular.method="ginv")
Create a factor variable using the names from a matched data set.
get.sets(matches, remove.unpaired = TRUE, ...)
matches |
A data.frame or nonbimatch object. Contains information on how to match the covariate data set. |
remove.unpaired |
A boolean value. The default is to remove elements matched to phantom elements. |
... |
Additional arguments, not used at this time. |
Calculate a name for each pair by using the ID columns from the matched data set. Return a factor of these named pairs.
a factor vector
Jake Bowers, http://www.jakebowers.org/, Cole Beck
    df <- data.frame(id=LETTERS[1:25], val1=rnorm(25), val2=rnorm(25))
    df.dist <- gendistance(df, idcol=1)
    df.mdm <- distancematrix(df.dist)
    df.match <- nonbimatch(df.mdm)
    get.sets(df.match)
    get.sets(df.match$matches)
    # include the phantom match
    get.sets(df.match$matches, FALSE)
The make.phantoms function will take an N x N matrix and add NP
phantom elements, thus creating a matrix with (N+NP) x (N+NP)
dimensions.
make.phantoms(x, nphantoms, name = "phantom", maxval = Inf, ...)
x |
A matrix or data.frame object, with N x N dimensions. |
nphantoms |
An integer, providing the number of phantom elements to add. |
name |
A character string, indicating the name attribute for new elements. Defaults to "phantom". |
maxval |
An integer value, the default value to give the pairs of phantoms (indices [N+1:N+NP, N+1:N+NP]), assumed to be a maximum distance. Defaults to Inf. |
... |
Additional arguments, not used at this time. |
This function is internal to the gendistance
function, but may be
useful in manufacturing personalized distance matrices. Phantoms are fake
elements that perfectly match all elements. They can be used to discard a
certain number of elements.
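The layout of a phantom-augmented matrix can be sketched in base R. The helper `add_phantoms` below is hypothetical and only mirrors the description given here (phantoms at distance zero from real elements, `maxval` within the phantom block); it does not set names and is not the package's implementation.

```r
# Hypothetical helper mirroring the description of make.phantoms.
add_phantoms <- function(d, nphantoms, maxval = Inf) {
  n <- nrow(d)
  total <- n + nphantoms
  out <- matrix(0, total, total)       # phantoms sit at distance 0 from real elements
  out[seq_len(n), seq_len(n)] <- d     # keep the original distances
  ph <- n + seq_len(nphantoms)
  out[ph, ph] <- maxval                # whole phantom block [N+1:N+NP, N+1:N+NP] gets maxval
  out
}

# 5x5 distance matrix (same values as in the example for make.phantoms)
dist.mat <- matrix(c(0,5,10,15,20,5,0,15,25,35,10,15,0,25,40,
                     15,25,25,0,15,20,35,40,15,0), nrow = 5)
aug <- add_phantoms(dist.mat, 2)
dim(aug)  # 7 7
```

Setting the phantom-to-phantom distances to a maximum discourages two phantoms from pairing with each other, so each phantom absorbs one real element.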
a matrix or data.frame object
Cole Beck
    # 5x5 distance matrix
    dist.mat <- matrix(c(0,5,10,15,20,5,0,15,25,35,10,15,0,25,40,15,25,25,0,15,20,35,40,15,0), nrow=5)
    # add one phantom element
    dm.ph <- make.phantoms(dist.mat, 1)
    # create distancematrix object
    distancematrix(dm.ph)
    # add three phantoms
    make.phantoms(dist.mat, 3)
The nonbimatch function creates the set of pairwise matches that minimizes the sum of distances between the pairs.
nonbimatch(mdm, threshold = NA, precision = 6, ...)
mdm |
A distancematrix object. See the distancematrix function. |
threshold |
A numeric value, indicating the distance needed to create chameleon matches. |
precision |
The largest value in the matrix will have at most this many digits. The default value is six. |
... |
Additional arguments, these are not used. |
The nonbimatch function calls the Fortran code (Derigs) and returns the set of pairwise matches that minimizes the sum of distances between the pairs.
nonbimatch S4 object with several elements
matches |
data.frame containing matches |
halves |
data.frame containing each match |
total |
sum of the distances across all pairs |
mean |
mean distance for each pair |
Cole Beck
    plainmatrix <- as.matrix(dist(sample(1:25, 8, replace=TRUE)))
    # setting diagonal to an infinite distance for pedagogical reasons
    # (the diagonal may be left as zero)
    diag(plainmatrix) <- 99999
    mdm <- distancematrix(plainmatrix)
    res <- nonbimatch(mdm)
Quality of match shows how much the members of matched pairs differ. For each variable the average distance is generated. Each item in a pair is assigned a group and after several iterations the quantile of these average distances is returned.
qom( covariate, matches, iterations = 10000, probs = NA, use.se = FALSE, all.vals = FALSE, seed = 101, ... )
covariate |
A data.frame object. |
matches |
A data.frame or nonbimatch object. Contains information on how to match the covariate data set. |
iterations |
An integer. Number of iterations to run, defaults to 10,000. |
probs |
A numeric vector. Probabilities to pass to the quantile function. |
use.se |
A logical value. Determines if the standard error should be computed. Default value of FALSE. |
all.vals |
A logical value. Determines if false matches should be included in comparison. Default value of FALSE. |
seed |
Seed provided for random-number generation. Default value of 101. |
... |
Additional arguments, not used at the moment. |
This function is useful for determining the effectiveness of your weights
(when generating a distance matrix). Weighting a variable more will lower
the average distance, but it could penalize the distance of the other
variables. Calculating the standard error requires calling hdquantile
from Hmisc. The quantiles may be slightly different when using hdquantile.
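The resampling idea behind a quality-of-match summary can be sketched in base R. This is one plausible reading of the description of qom, not its actual computation: the data, the pairing, and the choice of "absolute difference in group means" as the per-iteration statistic are all assumptions made for illustration.

```r
set.seed(101)
# Hypothetical data: one covariate for 6 units, already paired as (1,2), (3,4), (5,6).
x <- c(2.1, 2.4, 5.0, 4.6, 7.3, 7.9)
pairs <- matrix(1:6, ncol = 2, byrow = TRUE)

iterations <- 1000
avg_diff <- numeric(iterations)
for (i in seq_len(iterations)) {
  # Randomly decide which mate of each pair goes to the first group.
  flip <- sample(c(TRUE, FALSE), nrow(pairs), replace = TRUE)
  g1 <- ifelse(flip, pairs[, 1], pairs[, 2])
  g2 <- ifelse(flip, pairs[, 2], pairs[, 1])
  # Absolute difference between the two group means for this covariate.
  avg_diff[i] <- abs(mean(x[g1]) - mean(x[g2]))
}
q <- quantile(avg_diff, probs = c(0, 0.25, 0.5, 0.75, 1))
q
```

Smaller quantiles indicate that, however the within-pair coin flips fall, the two groups stay close on that covariate.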
a list object containing elements with quality of match information
q |
data.frame with quantiles for each covariate |
se |
data.frame with standard error for each covariate |
sd |
vector with standard deviation for each covariate |
Cole Beck
    df <- data.frame(id=LETTERS[1:25], val1=rnorm(25), val2=rnorm(25))
    df.dist <- gendistance(df, idcol=1)
    df.mdm <- distancematrix(df.dist)
    df.match <- nonbimatch(df.mdm)
    qom(df.dist$cov, df.match)
    qom(df.dist$cov, df.match$matches)
Extend the stats quantile
function for handling distancematrix objects.
## S4 method for signature 'distancematrix' quantile(x, probs, ...)
x |
A distancematrix object. |
probs |
numeric vector of probabilities with values in [0,1]. |
... |
Additional arguments, passed to quantile. |
The upper-triangular values of the distance matrix object are passed to the
quantile
function.
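On a plain matrix, the same idea can be written directly in base R. This sketch assumes that only the strictly upper-triangular entries are used, so that each pairwise distance counts once and the diagonal is excluded; whether the method treats the diagonal the same way is not stated here.

```r
set.seed(1)
m <- as.matrix(dist(sample(1:25, 8, replace = TRUE)))

# Each pair (i, j) with i < j appears exactly once above the diagonal.
upper <- m[upper.tri(m)]
q <- quantile(upper, probs = c(0, 0.25, 0.5, 0.75, 1))
q
```

Such quantiles give a quick sense of the scale of the distances, which can help when choosing a value such as the threshold argument of nonbimatch.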
numeric vector of quantiles corresponding to the given probabilities
Cole Beck
    plainmatrix <- as.matrix(dist(sample(1:25, 8, replace=TRUE)))
    mdm <- distancematrix(plainmatrix)
    quantile(mdm, probs=c(0.0, 0.25, 0.50, 0.75, 1.00))
Calculate the scalar distance between elements of a vector.
scalar.dist(x, ...)
x |
A vector of numeric values. |
... |
Additional arguments, not used at this time. |
Take the absolute difference between all elements in a vector, and return a matrix of the distances.
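The operation described above has a compact base-R equivalent, shown here as a sketch rather than as the function's source: `outer` forms every pairwise difference and `abs` turns the differences into distances.

```r
x <- 1:10

# d[i, j] is |x[i] - x[j]| for every pair of elements.
d <- abs(outer(x, x, "-"))
d[1, 10]  # 9
dim(d)    # 10 10
```

The result is symmetric with a zero diagonal, so it can be used as a distance matrix for a single covariate.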
a matrix object
Jake Bowers, http://www.jakebowers.org/, Cole Beck
scalar.dist(1:10)
Remove unpaired or unnecessary matches.
subsetMatches( matches, phantom = TRUE, chameleon = TRUE, ghost = TRUE, infinite = TRUE, halvesOnly = TRUE )
matches |
A nonbimatch object. |
phantom |
A logical value. Remove elements matched to phantom elements. |
chameleon |
A logical value. Remove elements matched to chameleon elements. |
ghost |
A logical value. Remove elements matched to ghost elements. |
infinite |
A logical value. Remove elements matched at infinite
distance. This will include elements forced to match in spite of having an
infinite distance set by the prevent option in gendistance. |
halvesOnly |
A logical value. Use halves element instead of matches. |
Given a nonbimatch object, remove elements matched to phantoms, chameleons, or ghosts. Also remove pairs whose distance is infinite.
a data.frame
Cole Beck
    df <- data.frame(id=LETTERS[1:25], val1=rnorm(25), val2=rnorm(25))
    df.dist <- gendistance(df, idcol=1, ndiscard=4)
    df.mdm <- distancematrix(df.dist)
    df.match <- nonbimatch(df.mdm)
    subsetMatches(df.match)
    subsetMatches(df.match, halvesOnly=FALSE)
    subsetMatches(df.match, phantom=FALSE)