Changelog
Source:NEWS.md
fastconley 0.10.0
GLM support and documented IV support.
fixest::feglm() / fixest::fepois() fits supported
vcovSpHAC.fixest now accepts GLM fits (any feglm family, including fepois). The variance is the M-estimation sandwich H^{-1} B H^{-1}, built from the maximum-likelihood score matrix and inverse Hessian that fixest stores on every (non-lean) fit — the same construction fixest’s own vcov_conley() uses for GLMs, verified against it at the distance-formulation tolerance and against an exact same-distance yardstick at ~1e-15. No estimation flag is needed (demeaned = TRUE is only required for feols); weights, offsets, and the fixed-effect profiling are already folded into the stored scores. Because the scores ride through the existing engines unchanged, everything composes: pairwise and grid/FFT engines, pixel aggregation, ssc, psd_fix, and — beyond what fixest offers — the panel spatial + serial HAC via lag_cutoff, now available for Poisson/GLM panels. lean = TRUE fits (no stored scores) and femlm()/feNmlm() fits are rejected with clear errors.
IV/2SLS support documented and tested
IV fits have in fact always produced the correct 2SLS Conley sandwich through both methods — lfe and fixest store the projected (second-stage) design in cX / X_demeaned and the structural residuals in residuals, which is exactly what the score construction needs. This is now documented and covered by the validation suite: felm IV and feols IV agree with each other and with the exact yardstick at ~1e-15, including weighted IV, multiple endogenous regressors, and IV panels with serial HAC. See tests/manual/test-glm-iv-parity.R.
fastconley 0.9.0
fixest feature parity: weighted fits, small-sample correction, PSD repair, lat/lon auto-detection. Breaking: ssc and psd_fix default to TRUE to match fixest’s defaults out of the box; pass ssc = FALSE, psd_fix = FALSE to reproduce earlier fastconley versions and rbluhm/conley bit-for-bit.
Weighted (WLS) fits supported
vcovSpHAC now accepts weighted felm() and feols() fits. The meat scores become s_i = w_i * e_i * x_i and the bread (X'WX)^{-1} — the formula fixest’s own weighted Conley vcov uses (verified exactly against it with a self-pairs-only cutoff, rel. err ~1e-15). Weights enter only the scores and the bread, so every engine — pairwise, grid/FFT, serial HAC, pixel aggregation, balanced CSR reuse — works unchanged. Note lfe stores sqrt(w) on the fit; vcovSpHAC squares it back.
ssc: small-sample correction (default TRUE)
Scales the variance matrix by n / (n - K), with K counting all estimated parameters including absorbed fixed-effect levels (taken from the fit’s residual degrees of freedom). This is exactly fixest’s default Conley correction — its cluster adjustment (G.adj / cluster.adj) is a no-op for Conley vcovs, so this one factor reproduces fixest defaults. ssc = FALSE applies no correction, matching rbluhm/conley and previous fastconley versions.
psd_fix: positive semi-definite repair (default TRUE)
Conley spatial kernels do not guarantee a PSD variance matrix. With psd_fix = TRUE (default, as in fixest) negative eigenvalues are clamped to 1e-16 — the same semantics as fixest’s vcov_fix — with a warning when the fix noticeably changed the matrix (> 1e-8). With psd_fix = FALSE the matrix is returned as computed and a warning is emitted when it is noticeably non-PSD.
lat/lon auto-detection
lat and lon now default to NULL and are auto-detected from the data’s column names (lat/latitude and lon/long/longitude/lng, case-insensitive exact matches). A message reports the pick; ambiguous or missing matches error with instructions.
Validation
- 24-config battery (balanced/unbalanced/cross-section x kernels x distances x lags x engines) bitwise-identical to v0.8.0 with
ssc = FALSE, psd_fix = FALSE(the C++ engines and the default-off code paths are untouched). - Weighted fits: exact match to
fixestat self-pairs-only cutoff; ~1e-10 against a dense brute-force reference at 200 km for both kernels; felm and fixest paths agree to ~1e-10. - New testthat coverage:
tests/testthat/test-weights-ssc.R(14 tests).
fastconley 0.8.0
Grid engine, part two: bartlett support (ring-FFT) and dateline wrap.
method = "grid" / "auto" now covers kernel = "bartlett"
On a lattice the bartlett weight varies with the longitude offset, so the per-ring-pair inner sum is a true 1D convolution rather than a boxcar. FastGridMeat computes it via FFT (arma::fft): per ring pair, the even-symmetric weight vector’s (real) spectrum multiplies cached per-ring score spectra, with one inverse FFT per target ring. Score spectra are cached for a sliding latitude band plus a cutoff halo, and the reduction is deterministically chunked — results remain bit-identical across ncores.
Weights use the same per-distance arithmetic as the pairwise engine (atan2 haversine, acos spherical, sqrt chord), and the same-cell distance is hard-set to 0, so agreement with the pairwise engine is ~1e-15 (haversine) to ~1e-12 (spherical/chord — inherent conditioning of acos/sqrt near zero distance, not algorithm error). The "auto" rule uses an FFT-aware cost model for bartlett.
- C1-study config (2.25M cells at ~1.1 km, 250 km cutoff, ~1.8e11 pairs, bartlett/spherical): 2.9 s vs 710 s for the 16-core pairwise engine (242x); single-threaded (20.5 s) it still beats 16-core pairwise 35x. The kernel’s evenness halves the work (T(r2,r1) = T(r1,r2)’, so only upper-triangle ring pairs are computed, with self-ring weights halved and the half-meat symmetrized).
Dateline wrap (bug fix for global rasters)
v0.7.0’s grid engine clamped longitude windows at the lattice edges, so on a raster spanning the full 360° circle it silently missed pairs that are close “the short way” across the dateline (observed ~5e-3 relative error on a global test raster). The engine now detects when the accept window reaches across the dateline gap and switches both kernels to circular windows (modular prefix-sum arcs for uniform; circular convolution with period n_col_full for bartlett) — exact, validated against the pairwise engine. When wrap would be needed but the lon step does not tile 360° evenly (no consistent circular lattice exists), method = "grid" stops with an informative error and method = "auto" falls back to the pairwise engine. Non-wrapping rasters are unaffected: results are bitwise identical to v0.7.0 (verified on the 30-config battery).
fastconley 0.7.0
Workstream C2: exact grid-native meat for raster data.
New: method = c("auto", "pairwise", "grid")
For the uniform kernel on a regular lat/lon lattice (raster cell centers, gridded covariates), the within-cutoff accept set between two latitude rings is a longitude-index interval, so the spatial meat reduces to sliding-window sums over per-ring prefix sums — FastGridMeat. Cost is O(n_ring * window * n_col * k), independent of the pair count, and the accept threshold is the same dot-product constant the pairwise engine uses, so the result is exact (agrees to FP summation order; no approximation anywhere for natively gridded data).
- C1-study config (2.25M cells at ~1.1 km, 250 km cutoff, ~1.8e11 pairs): 0.81 s vs 244 s for the 16-core pairwise engine — 302x, error 5e-15.
- 9M-cell continental raster (3000 x 3000, 250 km, ~7e11 pairs): 3.5 s.
- Supports all three
dist_fns (monotone in the chord), multiple time blocks (panels apply it per period), sparse occupancy, duplicate cells, and is deterministic acrossncores.
method = "auto" (the new default) switches to the grid engine only when a lattice is detected, the kernel is uniform, and a flop-balance estimate says it wins; otherwise the pairwise engine runs as before. Scattered (non-lattice) data is unaffected. method = "grid" errors informatively when its requirements are not met.
The bartlett ring-FFT variant (exact per the C1 study) is planned as a follow-up; bartlett rasters currently stay on the pairwise engine.
fastconley 0.6.1
Minor-backlog items M1-M4 from notes/OPTIMIZATION_PLAN.md.
- Serial HAC is now O(T * k) per unit instead of O(T^2 * k): the Bartlett-in-time weight decomposes over a sliding two-pointer window (with a per-block time shift for conditioning). A 100k-row panel with T = 2000 and lag = 50 computes in ~6 ms. Rows must be time-sorted within unit blocks (the R layer always sorts; direct callers get a clear error).
-
vcovSpHAC.felmgainsdata =: when the passed frame has the same row count as the fit (no NAs dropped, no subset), coordinates are taken by direct column access with no model-frame re-evaluation; otherwise the frame overrides the call-recovered data in the aligned model-frame path. -
Prep path parallelized (grid sort via TBB parallel_sort, coordinate cache trig, permutation gathers) — all order-preserving, so results stay bit-identical across
ncores. 1M-row global cross-section at 16 threads: 0.46 -> 0.21 s (scaling 2.2x -> 4.8x). - Spatial results are bitwise identical to v0.6.0; the serial meat differs by ~1e-15 relative from the new summation algebra.
- README benchmarks refreshed against fixest 0.14.1 at scale: 10.7-45.8x on cross-sections up to n = 1M, 1.4-2.9x on panel SHAC up to 1.5M obs (where fastconley computes ~T x more spatial work by construction). A brute-force arbitration shows fastconley matches the exact great-circle uniform meat to ~1e-15 while fixest 0.14.1’s “spherical” distance deviates ~2e-2.
fastconley 0.6.0
Phase 2 of notes/OPTIMIZATION_PLAN.md: memory diet, deterministic reduction, and screen work. Results remain exact (same pairs, same weights); summation order changed, so values differ from v0.5.0 by <= ~5e-15 relative.
Results are now invariant to ncores
All meat accumulations use a deterministic chunked reduction (fixed-size row/block chunks, partials summed in chunk order). ncores = 1 and ncores = 16 produce bit-identical matrices — the old multicore tolerance caveat is gone.
Memory
- The C++ entry points take a pre-computed score matrix and alias R memory directly (no Rcpp input copies, no intermediate unsorted score buffer; one gather pass builds the sorted layout).
vcovSpHACpasses the aggregated scores straight through;X/earguments remain on the internal wrappers for convenience. Measured peak RSS: 4M-row cross-section 2321 -> 1558 MB (-33%); 40k x 40 balanced panel 1028 -> 748 MB (-27%). Same speed, identical checksums. - New
csr_weight = c("double", "float"): opt-in float storage for the balanced-path bartlett weights (8 -> 4 bytes per pair, <= ~6e-8 relative error per weight). Default stays exact double.
Speed
- Haversine gains a conservative 3D dot-product pre-screen (exact identity
a = (1 - dot)/2); the exact a-test remains the arbiter, so results are bit-identical. CONUS 100k / 500 km bartlett/haversine: 15.1 -> 10.5 s single-threaded (1.4x). - Balanced 16-thread uniform meat: 2.30 -> 1.56 s on a 40k x 40 panel (better task granularity from the chunked reduction).
Tried and reverted (documented for the record)
A unit-major T*k stacked score layout that streams the balanced CSR once instead of once per period was implemented and benchmarked. It lost to the period-major layout: spatial sorting makes neighbor gathers a sliding window of k-wide rows that stays L2-resident per period; any wider stacking pushes the window past L2 and thrashes the shared L3 at high thread counts (2.6 s vs 2.0 s at 16 cores). The period-major traversal stays, now deterministic and float-capable.
fastconley 0.5.0
Phase 1 of notes/OPTIMIZATION_PLAN.md: 3D cell-grid neighbor search.
New: neighbor = c("grid", "band")
The spatial meat’s candidate enumeration now defaults to a 3D cell grid. Points are bucketed by their unit vectors into a cubic grid whose edge is the unit-sphere chord equivalent of the cutoff; every supported distance is monotone in the chord, so accepted pairs are never more than one cell apart per axis — no pole or dateline special cases. Each row scans its own cell plus five contiguous row ranges covering the 13 forward neighbor cells: ~3–4 candidates per accepted pair, independent of geographic extent, versus the latitude band scan’s 2*L_lon/(pi*r).
Both strategies call the identical per-pair accept test, so pair sets and weights are exactly the same; results differ only by floating-point summation order (observed <= 6e-15 relative across the validation matrix; neighbor = "band" remains bitwise identical to v0.4.1). The band path is kept for one release and will be removed in v0.6.0.
Measured single-threaded speedups (FastSpatialMeat, k = 10):
- global extent, n = 1M, 100 km: 8.6x (uniform/spherical) to 25.3x (bartlett/haversine)
- CONUS, n = 100k, 100 km: 4.5x; 500 km: 2.1-2.3x (that config is dominated by true-pair accumulation, not candidate search)
- balanced CSR build, 200k units x 4 periods, 100 km: 3.6x
- 16-thread scaling on CONUS 100k/500 km improves from ~5.3x to ~7.9x (less memory-bandwidth pressure from candidate streaming)
fastconley 0.4.1
Phase 0 quick wins from notes/OPTIMIZATION_PLAN.md (Q1–Q6). Numerical results are unchanged (verified bitwise against v0.4.0 on the balanced / general / unbalanced × kernel × distance matrix at ncores = 1).
Memory
- The balanced-path CSR no longer allocates its weight array for
kernel = "uniform"(it was filled but never read), and column indices are now 32-bit. Per stored pair: 16 bytes -> 4 (uniform) / 12 (bartlett). A guard errors if a single period exceeds 2^32 - 1 units.
fastconley 0.3.0.9000
Development build for testing a faster spatial HAC path.
Spatial meat changes
- Adds a
FastSpatialMeat()internal Rcpp routine. - Replaces dense
n x nspatial distance matrices invcovSpHAC()with a screened spatial edge list. - Computes the meat via cumulative scores:
S = e * X,C_i = 0.5 S_i + sum_j w_ij S_j, andS'C + C'S. - Supports both
kernel = "bartlett"andkernel = "uniform". - Supports
dist_fnchoices: haversine, spherical, chord. The upstreamflatearthoption was dropped — its formula was not symmetric in(i, j)and the equirectangular approximation is no longer worth a separate code path now that spherical+uniform runs as a 3D dot-product threshold with no trig. - In balanced panels, builds one spatial edge list from the first period and reuses it for all periods.
Performance work (most recent)
-
Templated
(dist_fn, kernel)dispatch. The screen + distance + kernel-weight stack is specialised at compile time for each of the six(distance, kernel)combinations, removing per-pair runtime branches. -
3D unit-vector threshold for spherical and chord.
(spherical, uniform)pair inclusion now reduces to a 3D dot product compared againstcos(cutoff / R)— no trig in the inner loop.(chord, uniform)uses squared-Euclidean against(cutoff / R)². Thebartlettvariants only payacos/sqrton accepted pairs. -
Row-major score buffer. Scores
s_i = e_i · X_iare stored row-major in a single flatstd::vector<double>instead of being recomputed against column-major Armadillo memory on every pair. - Sorted-order layout. Both the coordinate cache and the scores buffer are permuted into latitude-sorted order before the meat workers run, so every read in the hot loop is sequential — independent of how the caller laid out the input.
-
pixelargument for score pre-aggregation (R/vcovSpHAC.R). Atpixel = 0rows that share(lat, lon)within a time period are collapsed exactly; atpixel > 0rows are first snapped to a uniformpixel-km grid (speed/accuracy trade-off). - Cross-section vs
fixest::vcov_conley(kernel = "uniform", spherical, 500 km cutoff):fastconleyis now 3-4× faster atpixel = 0and 27-62× faster atpixel = 25acrossn ∈ {5k … 100k}, and scales better than fixest past 4 threads.
Notes
- This is meant as a test branch/source drop, not a CRAN-ready release.
- The implementation uses RcppParallel for CSR construction and meat accumulation.
- Rcpp exports were updated manually. Running
Rcpp::compileAttributes()is recommended after further edits.