Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GLM with design-based standard errors #304

Merged
merged 60 commits into from
Nov 23, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
60 commits
Select commit Hold shift + click to select a range
2914fa6
reg.jl : variable names + comments on svyglm() and added tests for 4 …
nadiaenh Jun 2, 2023
95f9a99
added GLM dependency to Project.toml
nadiaenh Jun 2, 2023
e6c0b88
include GLM tests in runtests.jl
nadiaenh Jun 3, 2023
7bf9872
added GLM dependency for tests
nadiaenh Jun 3, 2023
85a5243
start documentation on GLM
nadiaenh Jun 11, 2023
9916170
fix missing dependencies for reg.jl testing
nadiaenh Jun 11, 2023
40bb225
lower replicates in reg.jl test for faster runtime
nadiaenh Jun 11, 2023
b7ed7e4
comment out failing tests for now
nadiaenh Jun 11, 2023
97bfb60
update GLMs documentation
nadiaenh Jun 17, 2023
3def6f3
add tests for reg.jl
nadiaenh Jun 17, 2023
57f0700
update docstring for svyglm
nadiaenh Jun 17, 2023
a0845e6
update example in docstring
nadiaenh Jun 17, 2023
5603d88
update tests to use 500 replicates instead of 4000
nadiaenh Jun 17, 2023
0db8232
added DataFrames dependency for reg.jl tests
nadiaenh Jun 17, 2023
20e553e
add new variance definition and associate mean and ratio definitions
nadiaenh Jul 14, 2023
b1ee02d
remove 1 doctest until total() gets redefined for use in new variance…
nadiaenh Jul 14, 2023
afb23a5
restore _bootweights_cluset_sorted
nadiaenh Jul 14, 2023
efe7ace
update documentation
nadiaenh Jul 19, 2023
649ffaf
update variance function to support jackknife
nadiaenh Jul 19, 2023
69aba95
comment out ratio and jackknife tests temporarily
nadiaenh Jul 19, 2023
4290ad0
update doctests
nadiaenh Jul 19, 2023
ba5689b
change function names
nadiaenh Aug 4, 2023
df7c81b
Change by.jl to incorporate the new variance function.
ayushpatnaikgit Aug 4, 2023
42130f5
Remove comment
ayushpatnaikgit Aug 4, 2023
b1c05da
Add datatype to the list.
ayushpatnaikgit Aug 4, 2023
0aed990
Code reuse in bydomain for SurveyDesign
ayushpatnaikgit Aug 4, 2023
6689fe3
More code reuse
ayushpatnaikgit Aug 4, 2023
59f639d
use new variance function in ratio and total for domain
nadiaenh Aug 4, 2023
54f8220
remove jackknife replicate support in bydomain
nadiaenh Aug 9, 2023
b7d8e7b
define quantile by domain
nadiaenh Aug 9, 2023
c4acd9d
add domain name column
nadiaenh Aug 14, 2023
5f89ce1
add domain name column
nadiaenh Aug 14, 2023
b47edb6
update doctest for bydomain
nadiaenh Aug 15, 2023
56cd148
update docstring
nadiaenh Aug 15, 2023
230cb47
update docstring
nadiaenh Aug 15, 2023
36b93ab
update docstring
nadiaenh Aug 15, 2023
18a77dd
Merge branch 'repdesign-variance' of github.com:nadiaenh/Survey.jl in…
nadiaenh Aug 15, 2023
8c1d86c
uncomment tests
nadiaenh Aug 15, 2023
6f75a27
remove show
nadiaenh Aug 15, 2023
7a472dc
Merge pull request #2 from ayushpatnaikgit/dontderive
nadiaenh Aug 18, 2023
eaab6b1
Merge branch 'repdesign-variance' of github.com:nadiaenh/Survey.jl in…
nadiaenh Aug 18, 2023
8507360
update by.jl
nadiaenh Aug 18, 2023
ae2daa1
update tests
nadiaenh Aug 21, 2023
e74bee9
fix NaN SE error
nadiaenh Aug 21, 2023
737b1d6
replicatedesign input in bydomain
nadiaenh Aug 21, 2023
47ee696
Merge pull request #4 from nadiaenh/repdesign-variance
nadiaenh Aug 21, 2023
b0015c1
update svyglm to use new variance func
nadiaenh Aug 25, 2023
fde31ba
use new variance in svyglm + update tests
nadiaenh Aug 25, 2023
eae8d98
Update docs/src/man/glm.md
nadiaenh Aug 27, 2023
32f5b2c
update GLM tests
nadiaenh Aug 27, 2023
8e7fd70
Merge branch 'glm' of github.com:nadiaenh/Survey.jl into glm
nadiaenh Aug 27, 2023
58b2fc2
fix documentation errors
nadiaenh Aug 27, 2023
66baa4d
add GLM tests
nadiaenh Aug 27, 2023
ade6f10
update doctests + rename variance func
nadiaenh Aug 27, 2023
d3077aa
rename svyglm to glm
nadiaenh Aug 27, 2023
7d45116
Label was already used.
ayushpatnaikgit Aug 27, 2023
9e00ac3
glm exported twice
ayushpatnaikgit Aug 27, 2023
f9856f8
Add the glm documentation page to the TOC. Remove fix = true
ayushpatnaikgit Aug 27, 2023
f4e93b3
variance has been changed to standerr, and glm needs to be in API ref…
ayushpatnaikgit Aug 27, 2023
c677962
Minor formatting corrections.
ayushpatnaikgit Aug 28, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
CairoMakie = "13f3f980-e62b-5c42-98c6-ff1f3baf88f0"
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,7 @@ to compute the standard errors.

```julia
julia> bootsrs = bootweights(srs; replicates=1000)
ReplicateDesign:
ReplicateDesign{BootstrapReplicates}:
data: 200×1047 DataFrame
strata: none
cluster: none
Expand Down
1 change: 1 addition & 0 deletions docs/make.jl
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ makedocs(;
"Replicate weights" => "man/replicate.md",
"Plotting" => "man/plotting.md",
"Comparison with other survey analysis tools" => "man/comparisons.md",
"Generalised linear models" => "man/glm.md",
],
"API reference" => "api.md",
],
Expand Down
3 changes: 2 additions & 1 deletion docs/src/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,11 +17,12 @@ JackknifeReplicates
load_data
bootweights
jackknifeweights
variance
stderr
mean
total
quantile
ratio
glm
plot
boxplot
hist
Expand Down
99 changes: 99 additions & 0 deletions docs/src/man/glm.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
# Generalized Linear Models in Survey
The `glm()` function in the Julia Survey package is used to fit generalized linear models (GLMs) to survey data. It incorporates survey design information, such as sampling weights, stratification, and clustering, to produce valid estimates and standard errors that account for the type of survey design.

As of June 2023, the [GLM.jl documentation](https://juliastats.org/GLM.jl/stable/) lists the supported distribution families and their link functions as:
```txt
Bernoulli (LogitLink)
Binomial (LogitLink)
Gamma (InverseLink)
InverseGaussian (InverseSquareLink)
NegativeBinomial (NegativeBinomialLink, often used with LogLink)
Normal (IdentityLink)
Poisson (LogLink)
```

Refer to the GLM.jl documentation for more information about the GLM package.

## Fitting a GLM to a Survey Design object

You can fit a GLM to a Survey Design object the same way you would fit it to a regular data frame. The only difference is that you need to specify the survey design object as the second argument to the `glm()` function.

```julia
using Survey
apisrs = load_data("apisrs")

# Simple random sample survey
srs = SurveyDesign(apisrs, weights = :pw)

# Survey stratified by stype
dstrat = SurveyDesign(apistrat, strata = :stype, weights = :pw)

# Survey clustered by dnum
dclus1 = SurveyDesign(apiclus1, clusters = :dnum, weights = :pw)
```

Once you have the survey design object, you can fit a GLM using the `glm()` function. Specify the formula for the model and the distribution family.

The `glm()` function supports all distribution families supported by GLM.jl, i.e. Bernoulli, Binomial, Gamma, Geometric, InverseGaussian, NegativeBinomial, Normal, and Poisson.

For example, to fit a GLM with a Bernoulli distribution and a Logit link function to the `srs` survey design object we created above:
```julia
nadiaenh marked this conversation as resolved.
Show resolved Hide resolved
formula = @formula(api00 ~ api99)
my_glm = glm(formula, srs, family = Normal())

# View the coefficients and standard errors
my_glm.Coefficients
my_glm.SE
```

## Examples

The examples below use the `api` datasets, which contain survey data collected about California schools. The datasets are included in the Survey.jl package and can be loaded by calling `load_data("name_of_dataset")`.

### Bernoulli with Logit Link

A school being eligible for the awards program (`awards`) is a binary outcome (0 or 1). Let's assume it follows a Bernoulli distribution. Suppose we want to predict `awards` based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below:

```julia
using Survey
apisrs = load_data("apisrs")
srs = SurveyDesign(apisrs, weights = :pw)

# Convert yes/no to 1/0
apisrs.awards = ifelse.(apisrs.awards .== "Yes", 1, 0)

# Fit the model
model = glm(@formula(awards ~ meals + ell), apisrs, Bernoulli(), LogitLink())
```

### Poisson with Log Link

Let us assume that the number of students tested (`api_stu`) follows a Poisson distribution, which models the number of successes out of a fixed number of trials. Suppose we want to predict the number of students tested based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below:

```julia
using Survey
apisrs = load_data("apisrs")
srs = SurveyDesign(apisrs, weights = :pw)

# Rename api.stu to api_stu
rename!(apisrs, Symbol("api.stu") => :api_stu)

# Fit the model
model = glm(@formula(api_stu ~ meals + ell), apisrs, Poisson(), LogLink())
```

### Gamma with Inverse Link

Let us assume that the average parental education level (`avg_ed`) follows a Gamma distribution, which is suitable for modeling continuous, positive-valued variables with a skewed distribution. Suppose we want to predict the average parental education level based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below:

```julia
using Survey
apisrs = load_data("apisrs")
srs = SurveyDesign(apisrs, weights = :pw)

# Rename api.stu to api_stu
rename!(apisrs, Symbol("avg.ed") => :avg_ed)

# Fit the model
model = glm(@formula(avg_ed ~ meals + ell), apisrs, Gamma(), InverseLink())
```
18 changes: 14 additions & 4 deletions docs/src/man/replicate.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,9 @@ Replicate weights are a method for estimating the standard errors of survey stat

The basic idea behind replicate weights is to create multiple versions of the original sample weights, each with small, randomly generated perturbations. The multiple versions of the sample weights are then used to calculate the survey statistic of interest, such as the mean or total, on multiple replicate samples. The variance of the survey statistic is then estimated by computing the variance across the replicate samples.

Currently, the Rao-Wu bootstrap[^1] is the only method in the package for generating replicate weights.
Currently, the Rao-Wu bootstrap[^1] and the Jackknife [^2] are the only methods in the package for generating replicate weights. In the future, the package will support additional types of inference methods, which will be passed when creating a `ReplicateDesign` object.

The `bootweights` function of the package can be used to generate a `ReplicateDesign` from a `SurveyDesign`
The `bootweights` function of the package can be used to generate a `ReplicateDesign` using the Rao-Wu bootstrap method from a `SurveyDesign`.
For example:
```@repl bootstrap
using Survey
Expand All @@ -15,7 +15,16 @@ dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
bstrat = bootweights(dstrat; replicates = 10)
```

For each replicate, the `DataFrame` of `ReplicateDesign` has an additional column. The of the column is `replicate_` followed by the replicate number.
The `jackknifeweights` function of the package can be used to generate a `ReplicateDesign` using the Jackknife method from a `SurveyDesign`.
For example:
```@repl bootstrap
using Survey
apistrat = load_data("apistrat")
dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
bstrat = jackknifeweights(dstrat; replicates = 10)
```

For each replicate, the `DataFrame` of `ReplicateDesign` has an additional column. The name of the column is `replicate_` followed by the replicate number.

```@repl bootstrap
names(bstrat.data)
Expand All @@ -38,4 +47,5 @@ For each replicate weight, the statistic is calculated using it instead of the w

## References

[^1]: [Rust, Keith F., and J. N. K. Rao. "Variance estimation for complex surveys using replication techniques." Statistical methods in medical research 5.3 (1996): 283-310.](https://journals.sagepub.com/doi/abs/10.1177/096228029600500305?journalCode=smma)
[^1]: [Rust, Keith F., and J. N. K. Rao. "Variance estimation for complex surveys using replication techniques." Statistical methods in medical research 5.3 (1996): 283-310.](https://journals.sagepub.com/doi/abs/10.1177/096228029600500305?journalCode=smma)
[^2]: [Miller, Rupert G. “The Jackknife--A Review.” Biometrika 61, no. 1 (1974): 1–15. https://doi.org/10.2307/2334280.](https://www.jstor.org/stable/2334280)
9 changes: 7 additions & 2 deletions src/Survey.jl
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@
module Survey

using DataFrames
import DataFrames: rename!
using Statistics
import Statistics: quantile
import Statistics: std, quantile
using StatsBase
import StatsBase: mean, quantile
using CSV
Expand All @@ -12,6 +13,8 @@ using AlgebraOfGraphics
using CategoricalArrays
using Random
using Missings
using GLM
import GLM: @formula, glm

include("SurveyDesign.jl")
include("bootstrap.jl")
Expand All @@ -26,17 +29,19 @@ include("boxplot.jl")
include("show.jl")
include("ratio.jl")
include("by.jl")
include("reg.jl")

export load_data
export AbstractSurveyDesign, SurveyDesign, ReplicateDesign
export BootstrapReplicates, JackknifeReplicates
export dim, colnames, dimnames
export mean, total, quantile
export mean, total, quantile, std
export plot
export hist, sturges, freedman_diaconis
export boxplot
export bootweights
export ratio
export jackknifeweights, variance
export @formula, glm

end
68 changes: 49 additions & 19 deletions src/bootstrap.jl
Original file line number Diff line number Diff line change
Expand Up @@ -54,45 +54,75 @@ function bootweights(design::SurveyDesign; replicates = 4000, rng = MersenneTwis
end

"""
variance(x::Symbol, func::Function, design::ReplicateDesign{BootstrapReplicates})
stderr(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{BootstrapReplicates}, args...; kwargs...)

Compute the standard error of the estimated mean using the bootstrap method.

Use replicate weights to compute the standard error of the estimated mean using the bootstrap method. The variance is calculated using the formula
# Arguments
- `x::Union{Symbol, Vector{Symbol}}`: Symbol or vector of symbols representing the variable(s) for which the mean is estimated.
- `func::Function`: Function used to calculate the mean.
- `design::ReplicateDesign{BootstrapReplicates}`: Replicate design object.
- `args...`: Additional arguments to be passed to the function.
- `kwargs...`: Additional keyword arguments.

# Returns
- `df`: DataFrame containing the estimated mean and its standard error.

The standard error is calculated using the formula

```math
\\hat{V}(\\hat{\\theta}) = \\dfrac{1}{R}\\sum_{i = 1}^R(\\theta_i - \\hat{\\theta})^2
```

where above ``R`` is the number of replicate weights, ``\\theta_i`` is the estimator computed using the ``i``th set of replicate weights, and ``\\hat{\\theta}`` is the estimator computed using the original weights.

```jldoctest
julia> using Survey, StatsBase;
# Examples

julia> apiclus1 = load_data("apiclus1");
```jldoctest; setup = :(using Survey, StatsBase, DataFrames; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;)

julia> dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw);
julia> mean(df::DataFrame, column, weights) = StatsBase.mean(df[!, column], StatsBase.weights(df[!, weights]));

julia> bclus1 = dclus1 |> bootweights;

julia> weightedmean(x, y) = mean(x, weights(y));

julia> variance(:api00, weightedmean, bclus1)
julia> stderr(:api00, mean, bclus1)
1×2 DataFrame
Row │ estimator SE
│ Float64 Float64
─────┼────────────────────
1 │ 644.169 23.4107

```
"""
function variance(x::Symbol, func::Function, design::ReplicateDesign{BootstrapReplicates})
θ̂ = func(design.data[!, x], design.data[!, design.weights])
θ̂t = [
func(design.data[!, x], design.data[!, "replicate_"*string(i)]) for
i = 1:design.replicates
function stderr(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{BootstrapReplicates}, args...; kwargs...)

# Compute the estimators
θs = func(design.data, x, design.weights, args...; kwargs...)

# Compute the estimators for each replicate
θts = [
func(design.data, x, "replicate_" * string(i), args...; kwargs...) for i in 1:design.replicates
]
variance = sum((θ̂t .- θ̂) .^ 2) / design.replicates
return DataFrame(estimator = θ̂, SE = sqrt(variance))

# Convert θs and θts to a vector if they are not already
θs = (θs isa Vector) ? θs : [θs]
θts2 = Vector[]
if !(θts[1] isa Vector)
for θt in θts
push!(θts2, [θt])
end
θts = θts2
end

# Calculate variances for each estimator
variance = Float64[]

θts = hcat(θts...)
for i in 1:length(θs)
θ = θs[i]
θt = θts[i, :]
θt = filter(!isnan, θt)
num = sum((θt .- θ) .^ 2) / length(θt)
push!(variance, num)
end

return DataFrame(estimator = θs, SE = sqrt.(variance))
end

function _bootweights_cluster_sorted!(cluster_sorted,
Expand Down
41 changes: 18 additions & 23 deletions src/by.jl
Original file line number Diff line number Diff line change
@@ -1,28 +1,23 @@
function bydomain(x::Symbol, domain, design::SurveyDesign, func::Function)
gdf = groupby(design.data, domain)
X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic)
return X
function subset(group, design::SurveyDesign)
return SurveyDesign(DataFrame(group);clusters = design.cluster, strata = design.strata, popsize = design.popsize, weights = design.weights)
end

function subset(group, design::ReplicateDesign)
return ReplicateDesign{typeof(design.inference_method)}(DataFrame(group), design.replicate_weights;clusters = design.cluster, strata = design.strata, popsize = design.popsize, weights = design.weights)
end

function bydomain(x::Symbol, domain, design::ReplicateDesign, func::Function)
function bydomain(x::Union{Symbol, Vector{Symbol}}, domain,design::Union{SurveyDesign, ReplicateDesign}, func::Function, args...; kwargs...)
domain_names = unique(design.data[!, domain])
gdf = groupby(design.data, domain)
nd = length(gdf)
X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic)
Xt_mat = Array{Float64,2}(undef, (nd, design.replicates))
for i = 1:design.replicates
Xt_mat[:, i] =
combine(
gdf,
[x, Symbol("replicate_" * string(i))] =>
((a, c) -> func(a, weights(c))) => :statistic,
).statistic
domain_names = [join(collect(keys(gdf)[i]), "-") for i in 1:length(gdf)]
vars = DataFrame[]
for group in gdf
push!(vars, func(x, subset(group, design), args...; kwargs...))
end
ses = Float64[]
for i = 1:nd
filtered_dx = filter(!isnan, Xt_mat[i, :] .- X.statistic[i])
push!(ses, sqrt(sum(filtered_dx .^ 2) / length(filtered_dx)))
estimates = vcat(vars...)
if isa(domain, Vector{Symbol})
domain = join(domain, "_")
end
replace!(ses, NaN => 0)
X.SE = ses
return X
end
estimates[!, domain] = domain_names
return estimates
end
Loading