We propose new ways of robustifying goodness-of-fit tests for structural equation modeling under non-normality. These test statistics have limit distributions characterized by eigenvalues whose estimates are highly unstable and biased in known directions. To take this into account, we design model-based trend predictions to approximate the population eigenvalues. We evaluate the new procedures in a large-scale simulation study with three confirmatory factor models of varying size (10, 20, or 40 manifest variables) and six non-normal data conditions. The eigenvalues in each simulated dataset are available in a database. Some of the new procedures markedly outperform presently available methods. We demonstrate how the new tests are calculated with a new R package and provide practical recommendations.
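The mechanics behind such tests can be sketched numerically. Assuming a small set of hypothetical eigenvalue estimates (placeholders, not the paper's model-based trend predictions), the limiting null distribution, a weighted sum of independent one-degree-of-freedom chi-squares, can be simulated directly and used to obtain a p-value:

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical eigenvalue estimates (NOT the paper's trend predictions;
# just placeholders to show the mechanics of the limit distribution).
eigenvalues = np.array([2.1, 1.4, 1.1, 0.9, 0.6])

# The limit law of the fit statistic: sum_j lambda_j * chi2(1), simulated.
reps = 100_000
null_draws = rng.chisquare(df=1, size=(reps, eigenvalues.size)) @ eigenvalues

observed_stat = 14.0  # hypothetical observed goodness-of-fit statistic
p_value = np.mean(null_draws >= observed_stat)
```

The mean of the simulated reference distribution equals the sum of the eigenvalues, which offers a quick sanity check on the draws.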
Current methods for reading difficulty risk detection at school entry remain error-prone. We present a novel approach utilizing machine learning analysis of data from GraphoGame, a fun and pedagogical literacy app. The app was played in class daily for ten minutes by 1676 Norwegian first graders, over a five-week period during the first months of schooling, generating rich process data. Models were trained on the process data combined with results from the end-of-year national screening test. The best machine learning models correctly identified 75% of the students at risk for developing reading difficulties. The present study is among the first to investigate the potential of predicting emerging learning difficulties using machine learning on game process data.
Grønneberg, Steffen & Irmer, Julien (2024)
Non-parametric Regression Among Factor Scores: Motivation and Diagnostics for Nonlinear Structural Equation Models
We provide a framework for motivating and diagnosing the functional form in the structural part of nonlinear or linear structural equation models when the measurement model is a correctly specified linear confirmatory factor model. A mathematical population-based analysis provides asymptotic identification results for conditional expectations of a coordinate of an endogenous latent variable given exogenous and possibly other endogenous latent variables, and theoretically well-founded estimates of this conditional expectation are suggested. Simulation studies show that these estimators behave well compared to presently available alternatives. Practically, we recommend the estimator using Bartlett factor scores as input to classical non-parametric regression methods.
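The recommendation in the last sentence can be made concrete. Below is a minimal sketch of Bartlett factor scores for a one-factor measurement model with known, illustrative loadings; the paper works with estimated models and a full diagnostic framework, which this does not reproduce:

```python
import numpy as np

rng = np.random.default_rng(19)

# One-factor model x = loadings * eta + eps with unit-variance items.
# Loadings are illustrative, not taken from the paper.
n = 20_000
loadings = np.array([0.8, 0.7, 0.6])
err_var = 1.0 - loadings**2

eta = rng.standard_normal(n)
x = np.outer(eta, loadings) + rng.standard_normal((n, 3)) * np.sqrt(err_var)

# Bartlett scores: (L' Psi^-1 L)^-1 L' Psi^-1 x, computed row-wise.
psi_inv_loadings = loadings / err_var
denom = loadings @ psi_inv_loadings
scores = x @ psi_inv_loadings / denom

# Bartlett scores are conditionally unbiased: regressing the scores on
# the true eta gives a slope near 1.
slope = np.polyfit(eta, scores, 1)[0]
corr = np.corrcoef(scores, eta)[0, 1]
```

Such scores could then be fed into any off-the-shelf nonparametric smoother to inspect the functional form between latent variables, which is the use case the abstract describes.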
Moss, Jonas & Grønneberg, Steffen (2023)
Partial Identification of Latent Correlations with Ordinal Data
The polychoric correlation is a popular measure of association for ordinal data. It estimates a latent correlation, i.e., the correlation of a latent vector. This vector is assumed to be bivariate normal, an assumption that cannot always be justified. When bivariate normality does not hold, the polychoric correlation will not necessarily approximate the true latent correlation, even when the observed variables have many categories. We calculate the sets of possible values of the latent correlation when latent bivariate normality is not necessarily true, but at least the latent marginals are known. The resulting sets are called partial identification sets, and are shown to shrink to the true latent correlation as the number of categories increases. Moreover, we investigate partial identification under the additional assumption that the latent copula is symmetric, and calculate the partial identification set when one variable is ordinal and the other is continuous. We show that little can be said about latent correlations, unless we have impractically many categories or we know a great deal about the distribution of the latent vector. An open-source R package is available for applying our results.
Grønneberg, Steffen & Foldnes, Njål (2022)
Factor analyzing ordinal items requires substantive knowledge of response marginals
In the social sciences, measurement scales often consist of ordinal items and are commonly analyzed using factor analysis. Either data are treated as continuous, or a discretization framework is imposed in order to take the ordinal scale properly into account. Correlational analysis is central in both approaches, and we review recent theory on correlations obtained from ordinal data. To ensure appropriate estimation, the item distributions prior to discretization should be (approximately) known, or the thresholds should be known to be equally spaced. We refer to such knowledge as substantive because it may not be extracted from the data, but must be rooted in expert knowledge about the data-generating process. An illustrative case is presented where absence of substantive knowledge of the item distributions inevitably leads the analyst to conclude that a truly two-dimensional case is perfectly one-dimensional. Additional studies probe the extent to which violation of the standard assumption of underlying normality leads to bias in correlations and factor models. As a remedy, we propose an adjusted polychoric estimator for ordinal factor analysis that takes substantive knowledge into account. Also, we demonstrate how to use the adjusted estimator in sensitivity analysis when the continuous item distributions are known only approximately.
In factor analysis and structural equation modeling non-normal data simulation is traditionally performed by specifying univariate skewness and kurtosis together with the target covariance matrix. However, this leaves little control over the univariate distributions and the multivariate copula of the simulated vector. In this paper we explain how a more flexible simulation method called vine-to-anything (VITA) may be obtained from copula-based techniques, as implemented in a new R package, covsim. VITA is based on the concept of a regular vine, where bivariate copulas are coupled together into a full multivariate copula. We illustrate how to simulate continuous and ordinal data for covariance modeling, and how to use the new package discnorm to test for underlying normality in ordinal data. An introduction to copula and vine simulation is provided in the appendix.
Foldnes, Njål & Grønneberg, Steffen (2021)
Non-normal Data Simulation using Piecewise Linear Transforms
We present PLSIM, a new method for generating nonnormal data with a pre-specified covariance matrix that is based on coordinate-wise piecewise linear transformations of standard normal variables. In our presentation, the piecewise linear transforms are chosen to match pre-specified skewness and kurtosis values for each marginal distribution. We demonstrate the flexibility of the new method, and an implementation using R software is provided.
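To illustrate the basic building block, though not the matching algorithm itself, here is a piecewise linear transform of a standard normal with two illustrative slopes; PLSIM chooses knots and slopes so that skewness and kurtosis hit pre-specified targets, which this sketch does not attempt:

```python
import numpy as np

rng = np.random.default_rng(11)

def pl_transform(z, slope_lo=0.5, slope_hi=2.0):
    """Piecewise linear map with a knot at 0: gentler slope on the left,
    steeper on the right, which stretches the upper tail."""
    y = np.where(z < 0.0, slope_lo * z, slope_hi * z)
    return y - y.mean()

z = rng.standard_normal(200_000)
y = pl_transform(z)

# Sample skewness of the transformed variable (clearly positive here,
# since the right tail is stretched relative to the left).
skew = np.mean(y**3) / np.mean(y**2) ** 1.5
```

The full method applies such coordinate-wise transforms while also controlling the covariance matrix, which requires the solver described in the paper.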
Foldnes, Njål & Grønneberg, Steffen (2021)
The sensitivity of structural equation modeling with ordinal data to underlying non-normality and observed distributional forms
Structural equation modeling (SEM) of ordinal data is often performed using normal theory maximum likelihood estimation based on the Pearson correlation (cont-ML) or using least squares principles based on the polychoric correlation matrix (cat-LS). While cont-ML ignores the categorical nature of the data, cat-LS assumes underlying multivariate normality. Theoretical results are provided on the validity of treating ordinal data as continuous when the number of categories increases, leading to an adjustment to cont-ML (cont-ML-adj). Previous simulation studies have concluded that cat-LS outperforms cont-ML, and that it is quite robust to violations of underlying normality. However, this conclusion was based on a data simulation methodology equivalent to discretizing exactly normal data. The present study employs a new simulation method for ordinal data to reinvestigate whether ordinal SEM is robust to underlying non-normality. In contrast to previous studies, we include a large set of ordinal distributions, and our results indicate that ordinal SEM estimation and inference is highly sensitive to the interaction between underlying non-normality and the ordinal observed distributions. Our results show that cont-ML-adj consistently outperforms cont-ML, and that cat-LS is less biased than cont-ML-adj. The sensitivity of cat-LS to violation of underlying normality necessitates a test of underlying normality. A bootstrap test is found to reliably detect underlying non-normality.
Grønneberg, Steffen; Moss, Jonas & Foldnes, Njål (2020)
Partial identification of latent correlations with binary data
The tetrachoric correlation is a popular measure of association for binary data and estimates the correlation of an underlying normal latent vector. However, when the underlying vector is not normal, the tetrachoric correlation will be different from the underlying correlation. Since assuming underlying normality is often done on pragmatic and not substantive grounds, the estimated tetrachoric correlation may therefore be quite different from the true underlying correlation that is modeled in structural equation modeling. This motivates studying the range of latent correlations that are compatible with given binary data, when the distribution of the latent vector is partly or completely unknown. We show that nothing can be said about the latent correlations unless we know more than what can be derived from the data. We identify an interval constituting all latent correlations compatible with observed data when the marginals of the latent variables are known. Also, we quantify how partial knowledge of the dependence structure of the latent variables affects the range of compatible latent correlations. Implications for tests of underlying normality are briefly discussed.
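The discrepancy admits a closed-form illustration. Take a hypothetical non-normal latent vector (not an example from the paper): an equal mixture of two bivariate normals with correlations 0.9 and -0.3, so the true latent correlation is 0.3, dichotomized at the medians. Using the orthant formula P(Z1 > 0, Z2 > 0) = 1/4 + arcsin(r)/(2π) for a bivariate normal with correlation r:

```python
import numpy as np

# Equal mixture of bivariate normals with correlations r1 and r2;
# its true latent correlation is the average of the two.
r1, r2 = 0.9, -0.3
true_latent_corr = 0.5 * r1 + 0.5 * r2  # 0.3

# Probability that both coordinates exceed their medians, under the mixture.
p11 = 0.5 * (0.25 + np.arcsin(r1) / (2 * np.pi)) \
    + 0.5 * (0.25 + np.arcsin(r2) / (2 * np.pi))

# The tetrachoric correlation inverts the same orthant formula under an
# (here false) assumption of bivariate normality.
tetrachoric = np.sin(2 * np.pi * (p11 - 0.25))

bias = tetrachoric - true_latent_corr  # roughly +0.10
```

Even with this mild departure from normality, the tetrachoric correlation overstates the latent correlation by about a third of its value.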
Sucarrat, Genaro & Grønneberg, Steffen (2020)
Risk Estimation with a Time-Varying Probability of Zero Returns
The probability of an observed financial return being equal to zero is not necessarily zero, or constant. In ordinary models of financial return, however, e.g. ARCH, SV, GAS and continuous-time models, the zero-probability is zero, constant or both, thus frequently resulting in biased risk estimates (volatility, Value-at-Risk, Expected Shortfall, etc.). We propose a new class of models that allows for a time-varying zero-probability that can either be stationary or non-stationary. The new class is the natural generalisation of ordinary models of financial return, so ordinary models are nested and obtained as special cases. The main properties (e.g. volatility, skewness, kurtosis, Value-at-Risk, Expected Shortfall) of the new model class are derived as functions of the assumed volatility and zero-probability specifications, and estimation methods are proposed and illustrated. In a comprehensive study of the stocks at the New York Stock Exchange (NYSE) we find extensive evidence of time-varying zero-probabilities in daily returns, and an out-of-sample experiment shows that corrected risk estimates can provide significantly better forecasts in a large number of instances.
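The basic bias mechanism can be sketched with a constant zero-probability (the paper's models allow it to vary over time and be non-stationary, which this sketch does not). With returns R = I·X, where I is Bernoulli(1 − π) and X is mean-zero with variance σ², the naive sample variance estimates (1 − π)σ² rather than σ²:

```python
import numpy as np

rng = np.random.default_rng(9)

# Zero-inflated returns: R = I * X, I ~ Bernoulli(1 - pi), X ~ N(0, sigma^2).
# Constant pi here, purely for illustration.
sigma, pi_zero, n = 0.02, 0.3, 100_000
x = rng.normal(0.0, sigma, n)
r = np.where(rng.random(n) > pi_zero, x, 0.0)

pi_hat = np.mean(r == 0.0)    # X is continuous, so all zeros are injected
naive_var = r.var()                          # estimates (1 - pi) * sigma^2
corrected_var = naive_var / (1.0 - pi_hat)   # estimates sigma^2 itself
```

The naive estimate understates the volatility of the non-zero returns by a factor of roughly sqrt(1 − π), which then propagates into Value-at-Risk and Expected Shortfall estimates.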
Grønneberg, Steffen & Foldnes, Njål (2019)
A problem with discretizing Vale-Maurelli in simulation studies
Previous influential simulation studies investigate the effect of underlying non-normality in ordinal data using the Vale–Maurelli (VM) simulation method. We show that discretized data stemming from the VM method with a prescribed target covariance matrix are usually numerically equal to data stemming from discretizing a multivariate normal vector. This normal vector has, however, a different covariance matrix than the target. It follows that these simulation studies have in fact studied discretized normal data with a possibly misspecified covariance structure. This observation affects the interpretation of previous simulation studies.
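The core observation can be verified directly for a monotone Fleishman-type polynomial: discretizing f(Z) at a threshold yields exactly the same binary data as discretizing the normal Z itself at a back-transformed threshold. Coefficients and threshold below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# A monotone Fleishman-type polynomial f(z) = a + b z + c z^2 + d z^3.
# Coefficients are illustrative and chosen so f is strictly increasing.
a, b, c, d = 0.0, 0.9, 0.1, 0.05
f = lambda z: a + b * z + c * z**2 + d * z**3

z = rng.standard_normal(10_000)
tau = 0.5  # discretization threshold on the transformed scale

# Binary data from discretizing the VM-type variable f(Z)...
y_vm = (f(z) > tau).astype(int)

# ...coincides with discretizing the normal Z itself at the back-transformed
# threshold f^{-1}(tau) (the unique real root, since f is monotone).
roots = np.roots([d, c, b, a - tau])
tau_z = roots[np.argmin(np.abs(roots.imag))].real
y_norm = (z > tau_z).astype(int)

all_equal = bool(np.all(y_vm == y_norm))  # True
```

Since the binary data are identical, any apparent non-normality of f(Z) is invisible after discretization; only the implied thresholds and the underlying normal copula matter.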
Foldnes, Njål & Grønneberg, Steffen (2019)
On Identification and Non-normal Simulation in Ordinal Covariance and Item Response Models
A standard approach for handling ordinal data in covariance analysis such as structural equation modeling is to assume that the data were produced by discretizing a multivariate normal vector. Recently, concern has been raised that this approach may be less robust to violation of the normality assumption than previously reported. We propose a new perspective for studying the robustness toward distributional misspecification in ordinal models using a class of non-normal ordinal covariance models. We show how to simulate data from such models, and our simulation results indicate that standard methodology is sensitive to violation of normality. This emphasizes the importance of testing distributional assumptions in empirical studies. We include simulation results on the performance of such tests.
The assessment of model fit has received widespread interest among researchers in the structural equation modeling literature for many years. Various model fit test statistics have been suggested for conducting this assessment. Selecting an appropriate test statistic in order to evaluate model fit, however, can be difficult as the selection depends on the distributional characteristics of the sampled data, the sample size, and/or the proposed model features. The purpose of this paper is to present a selection procedure that can be used to algorithmically identify the best test statistic and simplify the whole assessment process. The procedure is illustrated using empirical data along with an easy-to-use computerized implementation.
Foldnes, Njål & Grønneberg, Steffen (2019)
Pernicious Polychorics: The Impact and Detection of Underlying Non-normality
Ordinal data in social science statistics are often modeled as discretizations of a multivariate normal vector. In contrast to the continuous case, where SEM estimation is also consistent under non-normality, violation of underlying normality in ordinal SEM may lead to inconsistent estimation. In this article, we illustrate how underlying non-normality induces bias in polychoric estimates and their standard errors. This bias is strongly affected by how we discretize. It is therefore important to consider tests of underlying multivariate normality. In this study we propose a parametric bootstrap test for this purpose. Its performance relative to the test of Maydeu-Olivares is evaluated in a Monte Carlo study. At realistic sample sizes, the bootstrap exhibited substantially better Type I error control and power than the Maydeu-Olivares test in ordinal data with ten dimensions or higher. R code for the bootstrap test is provided.
We establish general and versatile results regarding the limit behavior of the partial-sum process of ARMAX residuals. Illustrations include ARMA with seasonal dummies, misspecified ARMAX models with autocorrelated errors, nonlinear ARMAX models, ARMA with a structural break, a wide range of ARMAX models with infinite-variance errors, weak GARCH models and the consistency of kernel estimation of the density of ARMAX errors. Our results identify the limit distributions, and provide a general algorithm to obtain pivot statistics for CUSUM tests.
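As a bare-bones illustration of the object studied, and not of the paper's results, here is the scaled partial-sum (CUSUM) statistic computed on i.i.d. stand-in residuals. The paper's point is precisely that for genuine ARMAX residuals the limit can differ from the i.i.d. benchmark, which is why pivot statistics are needed:

```python
import numpy as np

rng = np.random.default_rng(13)

# Scaled partial sums of demeaned residuals. For i.i.d. errors the process
# converges to a Brownian bridge, and the sup statistic follows the
# Kolmogorov law (5% critical value about 1.358).
resid = rng.standard_normal(1_000)   # i.i.d. stand-in residuals, null case
resid = resid - resid.mean()

partial = np.cumsum(resid)
cusum_stat = np.abs(partial).max() / (resid.std() * np.sqrt(resid.size))
```

With estimated ARMAX residuals in place of the i.i.d. stand-ins, estimation error can alter the limit process, which is what the paper's general results characterize.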
Over the last few decades, many robust statistics have been proposed in order to assess the fit of structural equation models. To date, however, no clear recommendations have emerged as to which test statistic performs best. It is likely that no single statistic will universally outperform all contenders across all conditions of data, sample size, and model characteristics. In a real-world situation, a researcher must choose which statistic to report. We propose a bootstrap selection mechanism that identifies the test statistic that exhibits the best performance under the given data and model conditions among any set of candidates. This mechanism eliminates the ambiguity of the current practice and offers a wide array of test statistics available for reporting. In a Monte Carlo study, the bootstrap selector demonstrated promising performance in controlling Type I errors compared to current test statistics.
We propose a new and flexible simulation method for non-normal data with user-specified marginal distributions, covariance matrix and certain bivariate dependencies. The VITA (VIne To Anything) method is based on regular vines and generalizes the NORTA (NORmal To Anything) method. Fundamental theoretical properties of the VITA method are deduced. Two illustrations demonstrate the flexibility and usefulness of VITA in the context of structural equation models. R code for the implementation is provided.
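For contrast, here is a sketch of NORTA, the method VITA generalizes: a Gaussian copula pushed through user-chosen inverse marginal CDFs. The marginals below (exponential and t with 5 degrees of freedom) are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import norm, expon, t

rng = np.random.default_rng(1)

# NORTA: multivariate normal -> uniforms via the normal CDF -> target
# marginals via inverse CDFs. The copula stays Gaussian, which is the
# restriction that VITA's regular vines remove.
rho = np.array([[1.0, 0.6], [0.6, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=rho, size=50_000)

u = norm.cdf(z)               # Gaussian copula, uniform marginals
x1 = expon.ppf(u[:, 0])       # exponential(1) marginal
x2 = t.ppf(u[:, 1], df=5)     # heavy-tailed t(5) marginal

sample_corr = np.corrcoef(x1, x2)[0, 1]
```

Note that the Pearson correlation of (x1, x2) is attenuated relative to the latent 0.6, one reason methods with explicit covariance control, such as VITA, are useful in covariance modeling studies.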
Foldnes, Njål & Grønneberg, Steffen (2017)
The asymptotic covariance matrix and its use in simulation studies
The asymptotic performance of structural equation modeling tests and standard errors is influenced by two factors: the model and the asymptotic covariance matrix Γ of the sample covariances. Although most simulation studies clearly specify model conditions, specification of Γ is usually limited to values of univariate skewness and kurtosis. We illustrate that marginal skewness and kurtosis are not sufficient to adequately specify a nonnormal simulation condition by showing that asymptotic standard errors and test statistics vary substantially among distributions with skewness and kurtosis that are identical. We argue therefore that Γ should be reported when presenting the design of simulation studies. We show how Γ can be exactly calculated under the widely used Vale–Maurelli transform. We suggest plotting the elements of Γ and reporting the eigenvalues associated with the test statistic. R code is provided.
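A generic sample counterpart of Γ can be sketched as follows (the paper additionally derives Γ exactly under the Vale–Maurelli transform, which this does not do): Γ is the covariance matrix of vech((X − μ)(X − μ)'), estimable from fourth-order sample moments:

```python
import numpy as np

rng = np.random.default_rng(3)

# Sample estimate of Gamma = Cov(vech((X - mu)(X - mu)')).
n, p = 5_000, 3
x = rng.standard_normal((n, p)) ** 3   # an arbitrary non-normal sample
xc = x - x.mean(axis=0)

rows, cols = np.tril_indices(p)            # vech: lower-triangular indices
outer = np.einsum('ni,nj->nij', xc, xc)    # per-observation outer products
vech = outer[:, rows, cols]                # n x p(p+1)/2

gamma_hat = np.cov(vech, rowvar=False)
```

For p = 3 this yields a 6 x 6 matrix; its elements (or eigenvalues, in combination with the model) are the quantities the paper recommends reporting in simulation designs.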
Foldnes, Njål & Grønneberg, Steffen (2017)
Approximating Test Statistics Using Eigenvalue Block Averaging
We introduce and evaluate a new class of approximations to common test statistics in structural equation modeling. Such test statistics asymptotically follow the distribution of a weighted sum of i.i.d. chi-square variates, where the weights are eigenvalues of a certain matrix. The proposed eigenvalue block averaging (EBA) method involves creating blocks of these eigenvalues and replacing them within each block with the block average. The Satorra–Bentler scaling procedure is a special case of this framework, using one single block. The proposed procedure applies also to difference testing among nested models. We investigate the EBA procedure both theoretically in the asymptotic case, and with simulation studies for the finite-sample case, under both maximum likelihood and diagonally weighted least squares estimation. Comparison is made with three established approximations: Satorra–Bentler, the scaled and shifted, and the scaled F tests.
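The block-averaging step itself is simple to sketch. The eigenvalues and the observed statistic below are hypothetical, and the reference distribution is simulated rather than computed analytically:

```python
import numpy as np

rng = np.random.default_rng(5)

def eba_pvalue(stat, eigenvalues, n_blocks, reps=100_000):
    """Sort the eigenvalues, split them into contiguous blocks, replace each
    eigenvalue by its block mean, and simulate the resulting weighted
    chi-square reference distribution."""
    lam = np.sort(eigenvalues)[::-1]
    blocks = np.array_split(lam, n_blocks)
    lam_eba = np.concatenate([np.full(b.size, b.mean()) for b in blocks])
    draws = rng.chisquare(df=1, size=(reps, lam_eba.size)) @ lam_eba
    return np.mean(draws >= stat)

lam = np.array([3.0, 2.2, 1.5, 1.2, 1.0, 0.8])   # hypothetical eigenvalues
stat = 15.0                                       # hypothetical fit statistic

p_exact = eba_pvalue(stat, lam, n_blocks=lam.size)  # no averaging
p_sb = eba_pvalue(stat, lam, n_blocks=1)            # one block: Satorra-Bentler
p_eba2 = eba_pvalue(stat, lam, n_blocks=2)          # two-block EBA
```

With one block per eigenvalue nothing is averaged, and with a single block the reference distribution reduces to a mean-scaled chi-square, recovering Satorra–Bentler scaling as the abstract notes.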
The Vale-Maurelli (VM) approach to generating non-normal multivariate data involves the use of Fleishman polynomials applied to an underlying Gaussian random vector. This method has been extensively used in Monte Carlo studies during the last three decades to investigate the finite-sample performance of estimators under non-Gaussian conditions. The validity of conclusions drawn from these studies clearly depends on the range of distributions obtainable with the VM method. We deduce the distribution and the copula for a vector generated by a generalized VM transformation, and show that it is fundamentally linked to the underlying Gaussian distribution and copula. In the process we derive the distribution of the Fleishman polynomial in full generality. While data generated with the VM approach appears to be highly non-normal, its truly multivariate properties are close to the Gaussian case. A Monte Carlo study illustrates that generating data with a different copula than that implied by the VM approach severely weakens the performance of normal-theory based ML estimates.
We derive two types of Akaike information criterion (AIC)-like model-selection formulae for the semiparametric pseudo-maximum likelihood procedure. We first adapt the arguments leading to the original AIC formula, related to empirical estimation of a certain Kullback–Leibler information distance. This gives a significantly different formula compared with the AIC, which we name the copula information criterion. However, we show that such a model-selection procedure cannot exist for copula models with densities that grow very fast near the edge of the unit cube. This problem affects most popular copula models. We then derive what we call the cross-validation copula information criterion, which exists under weak conditions and is a first-order approximation to exact cross validation. This formula is very similar to the standard AIC formula but has slightly different motivation. A brief illustration with real data is given.
Grønneberg, Steffen (2011)
The Copula Information Criterion and Its Implications for the Maximum Pseudo-Likelihood Estimator
Consider a sequence of estimators θ̂_n which converges almost surely to θ_0 as the sample size n tends to infinity. Under weak smoothness conditions, we identify the asymptotic limit of the last time θ̂_n is further than ε away from θ_0 when ε → 0+. These limits lead to the construction of sequentially fixed-width confidence regions for which we find analytic approximations. The smoothness condition we impose is that θ̂_n is to be close to a Hadamard-differentiable functional of the empirical distribution, an assumption valid for a large class of widely used statistical estimators. Similar results were derived in Hjort and Fenstad (1992, Annals of Statistics) for the case of Euclidean parameter spaces; part of the present contribution is to lift these results to situations involving parameter functionals. The apparatus we develop is also used to derive appropriate limit distributions of other quantities related to the far tail of an almost surely convergent sequence of estimators, like the number of times the estimator is more than ε away from its target. We illustrate our results by giving a new sequential simultaneous confidence set for the cumulative hazard function based on the Nelson–Aalen estimator and investigate a problem in stochastic programming related to computational complexity.
Foldnes, Njål; Grønneberg, Steffen; Hermansen, Gudmund Horn & Wellen, Einar Christopher (2024)
Statistikk og dataanalyse
[Textbook].
Grønneberg, Steffen (2022)
Bias in Ordinal SEM
[Conference Lecture].
Grønneberg, Steffen (2022)
Substantive knowledge is required for ordinal factor analysis