Advanced Topics¶
Deep dive into the mathematical foundations, advanced configurations, and production deployment of GeoLift.
Mathematical Background¶
Synthetic Control Method¶
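The synthetic control method creates a weighted combination of control units to serve as a counterfactual for treated units. For a treated unit \(j\), the synthetic control in period \(t\) is:
\[ \hat{Y}_{jt} = \sum_{i \in \mathcal{D}} w_i Y_{it} \]
where \(Y_{it}\) is the outcome of unit \(i\) in period \(t\), \(\mathcal{D}\) denotes the donor pool (control units), and \(w_i\) are donor weights that minimize the pre-treatment prediction error:
\[ \min_{w} \sum_{t=1}^{T_0} \left( Y_{jt} - \sum_{i \in \mathcal{D}} w_i Y_{it} \right)^2 \]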
subject to the constraints:
\(\sum_{i \in \mathcal{D}} w_i = 1\) (weights sum to one)
\(w_i \geq 0\) for all \(i \in \mathcal{D}\) (non-negative weights)
where \(T_0\) denotes the last pre-treatment period.
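The constrained least-squares problem above can be solved in several ways. The following is a minimal sketch, assuming projected gradient descent onto the probability simplex; the function names `project_to_simplex` and `fit_sc_weights` are illustrative and not part of GeoLift or SparseSC, and the data are simulated.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                      # sort descending
    cumulative = np.cumsum(u) - 1.0
    ranks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - cumulative / ranks > 0)[0][-1]
    theta = cumulative[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_sc_weights(Y_donors_pre, y_treated_pre, n_iter=2000):
    """Minimize ||y - Y w||^2 subject to w >= 0 and sum(w) = 1.

    Y_donors_pre: (T0, N) donor outcomes in the pre-treatment period.
    y_treated_pre: (T0,) treated-unit outcomes in the same periods.
    """
    n_donors = Y_donors_pre.shape[1]
    w = np.full(n_donors, 1.0 / n_donors)     # start from equal weights
    # Step size 1/L, with L the Lipschitz constant of the gradient
    lipschitz = 2.0 * np.linalg.norm(Y_donors_pre, ord=2) ** 2
    for _ in range(n_iter):
        gradient = 2.0 * Y_donors_pre.T @ (Y_donors_pre @ w - y_treated_pre)
        w = project_to_simplex(w - gradient / lipschitz)
    return w

# Simulated example: the treated unit is an exact mix of donors 0 and 1
rng = np.random.default_rng(0)
Y_donors_pre = rng.normal(size=(30, 4))
true_w = np.array([0.6, 0.4, 0.0, 0.0])
y_treated_pre = Y_donors_pre @ true_w
w_hat = fit_sc_weights(Y_donors_pre, y_treated_pre)
```

Both constraints hold at every iteration because each step ends with a projection, so the returned weights can be used directly to form the counterfactual.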
SparseSC Enhancement¶
SparseSC improves traditional synthetic control by jointly optimizing:
Feature weights (V-matrix): Diagonal matrix determining which pre-treatment features matter most
Unit weights (W-matrix): Vector of weights for combining control units
The optimization problem becomes:
\[ \min_{V, W} \; \| X_1 - X_0 W \|_V^2 + \lambda \| W \|_1 \]
where:
\(X_1 \in \mathbb{R}^{K \times 1}\) represents \(K\) pre-treatment features for the treated unit
\(X_0 \in \mathbb{R}^{K \times N}\) represents \(K\) pre-treatment features for \(N\) control units
\(W \in \mathbb{R}^{N \times 1}\) is the vector of unit weights
\(V \in \mathbb{R}^{K \times K}\) is a diagonal matrix of feature weights
\(\lambda > 0\) is the regularization parameter controlling sparsity
\(\|x\|_V^2 = x^T V x\) is the V-weighted squared norm
\(\|W\|_1 = \sum_{i=1}^{N} |w_i|\) is the L1 norm promoting sparsity
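To make the notation concrete, here is a small sketch that evaluates a penalized objective built from the two norms defined above, assuming the objective is the V-weighted fit plus \(\lambda\) times the L1 penalty. The function name `sparse_sc_objective` and the numbers are illustrative only; this is not the SparseSC implementation.

```python
import numpy as np

def sparse_sc_objective(X1, X0, W, v_diag, lam):
    """Evaluate ||X1 - X0 W||_V^2 + lam * ||W||_1 for V = diag(v_diag)."""
    residual = X1 - X0 @ W                          # shape (K,)
    weighted_fit = residual @ (v_diag * residual)   # x^T V x for diagonal V
    l1_penalty = lam * np.abs(W).sum()
    return weighted_fit + l1_penalty

X0 = np.array([[1.0, 2.0],
               [3.0, 4.0]])        # K = 2 features, N = 2 control units
X1 = np.array([1.5, 3.0])          # features of the treated unit
W = np.array([0.5, 0.5])           # unit weights
v_diag = np.array([2.0, 1.0])      # diagonal of V (feature weights)

value = sparse_sc_objective(X1, X0, W, v_diag, lam=0.1)
```

Here the first feature is matched exactly, so only the second feature contributes to the fit term regardless of its larger weight in V.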
Statistical Inference¶
Bootstrap Inference¶
Resample the control units with replacement to build an empirical distribution of treatment effects:
```python
import numpy as np

# Bootstrap procedure for statistical inference.
# `original_data` (indexable by control-unit position), `control_units`,
# `treated_unit`, and `fit_and_estimate` are placeholders for your own objects.
bootstrap_effects = []
n_bootstrap = 1000  # Number of bootstrap iterations

for b in range(n_bootstrap):
    # Resample control units with replacement
    boot_indices = np.random.choice(len(control_units),
                                    size=len(control_units),
                                    replace=True)
    boot_data = original_data[boot_indices]

    # Refit synthetic control model
    boot_effect = fit_and_estimate(boot_data, treated_unit)
    bootstrap_effects.append(boot_effect)

# Calculate 95% confidence intervals (percentile method)
ci_lower = np.percentile(bootstrap_effects, 2.5)
ci_upper = np.percentile(bootstrap_effects, 97.5)

# Standard error estimation (sample standard deviation of the bootstrap draws)
se_bootstrap = np.std(bootstrap_effects, ddof=1)
```
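For a version that runs end to end, the sketch below applies the same percentile method to a toy problem. It assumes, as a simplification, that the statistic can be recomputed directly from resampled per-unit values instead of refitting a synthetic control; `percentile_bootstrap` and `unit_effects` are hypothetical names and numbers.

```python
import numpy as np

def percentile_bootstrap(values, statistic, n_bootstrap=1000, alpha=0.05, seed=0):
    """Return (ci_lower, ci_upper, standard_error) for statistic(values)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    draws = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Resample with replacement, keeping the original sample size
        sample = rng.choice(values, size=len(values), replace=True)
        draws[b] = statistic(sample)
    lower, upper = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper, draws.std(ddof=1)

# Hypothetical per-market lift estimates
unit_effects = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.3]
ci_lower, ci_upper, se = percentile_bootstrap(unit_effects, np.mean)
```

Passing an explicit seed keeps the interval reproducible between runs, which helps when results are compared across configurations.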
Placebo Inference (Permutation Test)¶
Test significance by applying the same method to untreated units:
```python
import numpy as np

# Placebo test procedure for inference.
# `estimate_effect_for_unit`, `fit_synthetic_control`, `control_units`,
# `pre_period`, and `post_period` are placeholders for your own objects.
placebo_effects = []
true_effect = estimate_effect_for_unit(treated_unit)

for control_unit in control_units:
    # Pretend control unit received treatment at same time
    # Use remaining controls as donor pool
    remaining_controls = [c for c in control_units if c != control_unit]
    placebo_effect = fit_synthetic_control(
        treated=control_unit,
        controls=remaining_controls,
        pre_period=pre_period,
        post_period=post_period
    )
    placebo_effects.append(placebo_effect)

# Convert to an array so the element-wise comparisons below are well defined
placebo_effects = np.asarray(placebo_effects)

# Two-sided p-value in the spirit of Fisher's randomization inference:
# proportion of placebo effects at least as extreme as the true effect
p_value = np.mean(np.abs(placebo_effects) >= np.abs(true_effect))

# Alternative: one-sided test if direction is hypothesized
p_value_right = np.mean(placebo_effects >= true_effect)
p_value_left = np.mean(placebo_effects <= true_effect)
```
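With a small donor pool the proportion above can be exactly zero. A common convention in permutation testing counts the treated unit as one of the permutations, which bounds the p-value below by 1/(N+1). The sketch below shows both variants; `placebo_p_value` is a hypothetical helper and the effects are made-up numbers.

```python
def placebo_p_value(true_effect, placebo_effects, include_treated=True):
    """Two-sided placebo p-value.

    With include_treated=True the treated unit counts as one permutation,
    giving (extreme + 1) / (n + 1); otherwise the plain proportion is used.
    """
    extreme = sum(abs(e) >= abs(true_effect) for e in placebo_effects)
    if include_treated:
        return (extreme + 1) / (len(placebo_effects) + 1)
    return extreme / len(placebo_effects)

placebo_effects = [0.5, -1.0, 3.0, -2.5]
p_plain = placebo_p_value(2.0, placebo_effects, include_treated=False)
p_adjusted = placebo_p_value(2.0, placebo_effects)
```

Either way, the resolution of the test is limited by the number of control units: four placebos cannot produce a p-value below 0.2 under the adjusted convention.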
Advanced Configuration¶
Custom Donor Selection¶
```python
# Manual donor specification
analyzer.set_custom_donors(
    donor_markets=[501, 505, 506, 507, 508, 509, 510],
    donor_weights=[0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05]
)

# Geographic constraints
analyzer.set_donor_constraints(
    exclude_regions=['West Coast'],  # Exclude specific regions
    min_distance_km=500,             # Minimum distance from treatment
    max_correlation=0.95             # Avoid overly similar markets
)

# Time-varying donor weights
analyzer.set_time_varying_donors(
    pre_period_donors=[501, 502, 503],
    post_period_donors=[504, 505, 506]  # Different donors for different periods
)
```
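Manually specified weights should still satisfy the synthetic control constraints (non-negative, summing to one). A minimal sketch of a pre-flight check follows; `validate_donor_weights` is a hypothetical helper, not a GeoLift function.

```python
import math

def validate_donor_weights(donor_markets, donor_weights, tol=1e-9):
    """Check manual donor weights and return a {market: weight} mapping."""
    if len(donor_markets) != len(donor_weights):
        raise ValueError("donor_markets and donor_weights differ in length")
    if len(set(donor_markets)) != len(donor_markets):
        raise ValueError("donor_markets contains duplicates")
    if any(w < 0 for w in donor_weights):
        raise ValueError("donor weights must be non-negative")
    if not math.isclose(sum(donor_weights), 1.0, abs_tol=tol):
        raise ValueError("donor weights must sum to one")
    return dict(zip(donor_markets, donor_weights))

donor_map = validate_donor_weights(
    [501, 505, 506, 507, 508, 509, 510],
    [0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05],
)
```

The tolerance matters because decimal weights such as 0.1 are not exactly representable in floating point, so an exact equality check on the sum could fail spuriously.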
Multiple Treatment Cohorts¶
```python
# Staggered adoption design
cohort_config = {
    'cohort_1': {
        'markets': [502, 503],
        'start_date': '2023-06-01',
        'end_date': '2023-08-31'
    },
    'cohort_2': {
        'markets': [504, 505],
        'start_date': '2023-07-01',
        'end_date': '2023-09-30'
    },
    'cohort_3': {
        'markets': [506, 507],
        'start_date': '2023-08-01',
        'end_date': '2023-10-31'
    }
}

results = analyzer.analyze_staggered_adoption(cohort_config)
```
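Before running a staggered design it can help to check the cohort definitions for inverted date ranges and for markets assigned to more than one cohort. The sketch below assumes the dictionary layout shown above; `validate_cohorts` is a hypothetical helper.

```python
from datetime import date

def validate_cohorts(cohort_config):
    """Parse cohort windows and reject overlapping market assignments."""
    market_owner = {}
    windows = {}
    for name, cohort in cohort_config.items():
        start = date.fromisoformat(cohort["start_date"])
        end = date.fromisoformat(cohort["end_date"])
        if start > end:
            raise ValueError(f"{name}: start_date is after end_date")
        for market in cohort["markets"]:
            if market in market_owner:
                raise ValueError(
                    f"market {market} is in both {market_owner[market]} and {name}"
                )
            market_owner[market] = name
        windows[name] = (start, end)
    return windows

example_config = {
    "cohort_1": {"markets": [502, 503], "start_date": "2023-06-01", "end_date": "2023-08-31"},
    "cohort_2": {"markets": [504, 505], "start_date": "2023-07-01", "end_date": "2023-09-30"},
    "cohort_3": {"markets": [506, 507], "start_date": "2023-08-01", "end_date": "2023-10-31"},
}
windows = validate_cohorts(example_config)

# The earliest start date bounds the pre-period shared by all cohorts
first_treatment = min(start for start, _ in windows.values())
```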
Regularization and Model Selection¶
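```python
# Cross-validation for optimal regularization
import numpy as np
from sparsesc.cross_validation import CV_score

# `features`, `outcomes`, `treated`, and `controls` are your own arrays.
# The module path and keyword names are those given in this guide; check
# them against the CV_score signature of your installed SparseSC version.

# Grid search over lambda values
lambda_grid = np.logspace(-4, 2, 20)
cv_scores = []

for lam in lambda_grid:
    score = CV_score(
        X=features,
        Y=outcomes,
        treated_units=treated,
        control_units=controls,
        lambda_=lam,
        cv_folds=5
    )
    cv_scores.append(score)

# Lower score is better, so pick the lambda with the smallest CV error
optimal_lambda = lambda_grid[np.argmin(cv_scores)]
```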
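The same grid-search pattern can be tried without SparseSC installed. The sketch below uses ridge regression as a stand-in model purely to illustrate selecting a penalty by K-fold cross-validation; it is not the SparseSC objective, and `ridge_fit` and `kfold_cv_mse` are hypothetical helpers run on simulated data.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients: (X'X + lam I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def kfold_cv_mse(X, y, lam, n_folds=5):
    """Mean held-out squared error over contiguous folds."""
    all_rows = np.arange(len(y))
    errors = []
    for held_out in np.array_split(all_rows, n_folds):
        train = np.setdiff1d(all_rows, held_out)
        beta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[held_out] - X[held_out] @ beta) ** 2))
    return float(np.mean(errors))

# Simulated data with a sparse true coefficient vector
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=60)

lambda_grid = np.logspace(-4, 2, 20)
cv_scores = [kfold_cv_mse(X, y, lam) for lam in lambda_grid]
optimal_lambda = lambda_grid[int(np.argmin(cv_scores))]
```

A logarithmic grid is a reasonable default because the useful range of a penalty usually spans several orders of magnitude.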
Performance considerations¶
The following options are supported and relevant to performance:
Stage 1 (Power analysis)
CPU parallelism: set `parallel: true` and `n_jobs: -1` in `power_analysis_config.yaml`
GPU acceleration: set `use_gpu: true` and install CuPy matching your CUDA version, e.g. `pip install cupy-cuda12x`
Stage 2 (Donor evaluation)
CPU parallelism: set `parallel: true` and `n_jobs: -1` in `donor_eval_config.yaml`
Stage 3 (Inference)
Progress indicator: set `use_progress: true` (estimation runs on CPU)
General
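When using CPU parallelism, avoid oversubscription by setting environment variables such as `OMP_NUM_THREADS` or `MKL_NUM_THREADS` so that the number of worker processes times the threads per worker stays at or below your core count (e.g. `1` if `n_jobs: -1` starts one worker per core).
```python
analyzer.export_diagnostic_report(diagnostics, 'diagnostic_report.html')
```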
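For the thread-count variables mentioned under General in Performance considerations, the sketch below shows one way to derive a per-worker thread budget. It assumes `n_jobs: -1` resolves to one worker per core, which is the joblib convention and may differ in your setup; `threads_per_worker` is a hypothetical helper.

```python
import os

def threads_per_worker(n_cores, n_workers):
    """Largest thread count per worker that keeps total threads <= n_cores."""
    return max(1, n_cores // max(1, n_workers))

n_cores = os.cpu_count() or 1
n_workers = n_cores              # assumption: n_jobs = -1 means one worker per core
budget = threads_per_worker(n_cores, n_workers)

# Numerical libraries typically read these when they are first loaded,
# so set them before importing NumPy or SciPy.
os.environ["OMP_NUM_THREADS"] = str(budget)
os.environ["MKL_NUM_THREADS"] = str(budget)
```

With fewer workers than cores, for example 2 workers on 8 cores, the same helper returns 4 threads per worker.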
Sensitivity Analysis¶
```python
# Systematic sensitivity testing.
# `config` is the baseline keyword-argument dict for analyzer.run_analysis.
sensitivity_tests = {
    'pre_period_length': [12, 16, 20, 24],
    'donor_threshold': [0.6, 0.7, 0.8, 0.9],
    'regularization': [0.001, 0.01, 0.1, 1.0],
    'inference_method': ['bootstrap', 'placebo', 'jackknife']
}

sensitivity_results = {}
for param, values in sensitivity_tests.items():
    param_results = []
    for value in values:
        # Update configuration (vary one parameter at a time)
        config_copy = config.copy()
        config_copy[param] = value

        # Run analysis
        result = analyzer.run_analysis(**config_copy)
        param_results.append({
            'value': value,
            'effect': result.relative_lift,
            'p_value': result.p_value
        })
    sensitivity_results[param] = param_results

# Visualize sensitivity
analyzer.plot_sensitivity_analysis(sensitivity_results)
```
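One simple way to summarize a one-at-a-time sweep like this is the spread of the estimated effect per parameter. The sketch below runs against a toy analysis function, since the real analyzer is not available here; `run_sensitivity`, `effect_range`, and `toy_analysis` are hypothetical names.

```python
def run_sensitivity(run_analysis, base_config, sensitivity_tests):
    """Vary one parameter at a time and record the resulting effect."""
    results = {}
    for param, values in sensitivity_tests.items():
        rows = []
        for value in values:
            config = {**base_config, param: value}   # base_config is not mutated
            rows.append({"value": value, "effect": run_analysis(**config)})
        results[param] = rows
    return results

def effect_range(rows):
    """Spread between the largest and smallest effect in a sweep."""
    effects = [row["effect"] for row in rows]
    return max(effects) - min(effects)

def toy_analysis(pre_period_length, regularization):
    # Stand-in for a real analysis: returns a made-up relative lift
    return 0.10 + 0.001 * (pre_period_length - 16) - 0.02 * regularization

base_config = {"pre_period_length": 16, "regularization": 0.01}
tests = {
    "pre_period_length": [12, 16, 20, 24],
    "regularization": [0.001, 0.01, 0.1, 1.0],
}
results = run_sensitivity(toy_analysis, base_config, tests)
spreads = {param: effect_range(rows) for param, rows in results.items()}
```

A parameter with a large spread relative to the estimated effect deserves a closer look before the headline result is reported.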
Research Extensions¶
Difference-in-Differences Integration¶
```python
import numpy as np

# Combine synthetic control with DiD for robust causal inference
class SyntheticDiD:
    """
    Fixed-weight ensemble of a synthetic control (SC) estimate and a
    classic 2x2 difference-in-differences (DiD) estimate.

    Note: this averages two separate estimators. It is not the Synthetic
    Difference-in-Differences (SDID) estimator, which applies unit and
    time weights inside a single weighted regression.
    """

    def __init__(self, sc_weight=0.7, did_weight=0.3):
        self.sc_weight = sc_weight
        self.did_weight = did_weight

    def fit(self, data, treatment_markets, control_markets,
            sc_variance=None, did_variance=None):
        # Fit synthetic control weights
        # (compute_sc_weights and compute_sc_effect are left to the implementer)
        sc_weights = self.compute_sc_weights(
            data, treatment_markets, control_markets
        )

        # Compute DiD estimator
        # ATT = E[Y₁(1) - Y₀(1)|D=1] where D is treatment indicator
        did_effect = self.compute_did(
            data, treatment_markets, control_markets
        )

        # Compute SC estimator with the fitted weights
        sc_effect = self.compute_sc_effect(
            data, treatment_markets, control_markets, sc_weights
        )

        # Weighted average of estimators (ensemble approach)
        combined_effect = (
            self.sc_weight * sc_effect +
            self.did_weight * did_effect
        )

        # Variance of a linear combination, assuming the two estimates are
        # independent. compute_did returns a plain number with no .variance
        # attribute, so the variances are passed in by the caller
        # (sc_variance and did_variance, e.g. from a bootstrap).
        if sc_variance is None or did_variance is None:
            return combined_effect, None
        combined_variance = (
            self.sc_weight**2 * sc_variance +
            self.did_weight**2 * did_variance
        )
        return combined_effect, np.sqrt(combined_variance)

    def compute_did(self, data, treated, controls):
        """
        Classic 2x2 DiD: (Ȳ₁ᵗʳᵉᵃᵗ - Ȳ₀ᵗʳᵉᵃᵗ) - (Ȳ₁ᶜᵒⁿᵗʳᵒˡ - Ȳ₀ᶜᵒⁿᵗʳᵒˡ)

        `data` is a pandas DataFrame with columns `unit`, `post` (bool), and `y`.
        """
        # Pre and post period means
        y_treat_post = data[data.unit.isin(treated) & data.post].y.mean()
        y_treat_pre = data[data.unit.isin(treated) & ~data.post].y.mean()
        y_control_post = data[data.unit.isin(controls) & data.post].y.mean()
        y_control_pre = data[data.unit.isin(controls) & ~data.post].y.mean()

        # DiD estimate
        did = (y_treat_post - y_treat_pre) - (y_control_post - y_control_pre)
        return did

# Use the hybrid approach with custom weights
hybrid_analyzer = SyntheticDiD(sc_weight=0.8, did_weight=0.2)
```
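If variance estimates are available for both estimators, one principled way to choose the ensemble weights is inverse-variance weighting, which minimizes the variance of the combined estimate when the two estimates are independent. That independence is an assumption here (both estimators use the same data), and the helper names and numbers below are hypothetical.

```python
import math

def inverse_variance_weights(variances):
    """Weights proportional to 1/variance, normalized to sum to one."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    return [p / total for p in precisions]

def combine_estimates(estimates, variances):
    """Return (combined_effect, standard_error, weights)."""
    weights = inverse_variance_weights(variances)
    effect = sum(w * e for w, e in zip(weights, estimates))
    # Variance of a weighted sum of independent estimates
    variance = sum(w**2 * v for w, v in zip(weights, variances))
    return effect, math.sqrt(variance), weights

# Hypothetical SC and DiD estimates with their variances
effect, se, weights = combine_estimates([0.12, 0.08], [0.0001, 0.0004])
```

In this example the SC estimate has a quarter of the DiD variance, so it receives a weight of 0.8 against 0.2 for DiD.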
Machine Learning Integration¶
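```python
# ML-enhanced donor selection
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class MLDonorSelector:
    def __init__(self, model=None):
        # Compare against None explicitly: truth-testing an unfitted
        # ensemble estimator can raise instead of returning True
        self.model = model if model is not None else RandomForestRegressor(n_estimators=100)

    def select_donors(self, features, outcomes, treatment_markets,
                      donor_features=None, n_donors=10):
        # Use ML to predict which donors will work best
        # (prepare_training_data is left to the implementer)
        X_train, y_train = self.prepare_training_data(features, outcomes)
        self.model.fit(X_train, y_train)

        # Score potential donors; as an assumption, fall back to `features`
        # when no separate donor feature matrix is supplied
        if donor_features is None:
            donor_features = features
        donor_scores = self.model.predict(donor_features)

        # Indices of the top n_donors, best first
        best_donors = np.argsort(donor_scores)[::-1][:n_donors]
        return best_donors

# Use ML-enhanced selection
ml_selector = MLDonorSelector()
analyzer.set_donor_selector(ml_selector)
```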
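A learned selector is easier to judge against a simple baseline. The sketch below ranks donors by the correlation of their pre-period series with the treated series, using only NumPy; `rank_donors_by_correlation` is a hypothetical helper and the series are synthetic.

```python
import numpy as np

def rank_donors_by_correlation(treated_pre, donors_pre, n_donors=3):
    """Return (indices, correlations) of the most correlated donors, best first.

    treated_pre: (T0,) treated series; donors_pre: (T0, N) donor series.
    """
    correlations = np.array([
        np.corrcoef(treated_pre, donors_pre[:, j])[0, 1]
        for j in range(donors_pre.shape[1])
    ])
    order = np.argsort(correlations)[::-1][:n_donors]
    return order, correlations[order]

t = np.arange(10.0)
treated_pre = t
donors_pre = np.column_stack([
    -t,                          # donor 0: moves opposite to the treated unit
    2 * t + 1,                   # donor 1: perfectly correlated
    t + (-1.0) ** np.arange(10)  # donor 2: correlated with alternating noise
])
order, scores = rank_donors_by_correlation(treated_pre, donors_pre, n_donors=2)
```

If the ML selector cannot beat this ranking on held-out pre-period fit, the added complexity is probably not paying for itself.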
This advanced guide covers the mathematical foundations, production deployment options, and research extensions for GeoLift. For practical usage, start with the Quick Start Guide and User Guide.