## Abstract

Although statistical models serve as the foundation of data analysis in clinical studies, their interpretation requires a sufficient understanding of the underlying statistical framework. Statistical modeling is inherently difficult because of our general lack of information about the nature of observable data. In this article, we aim to provide some guidance on the use of regression models, to help clinical researchers better interpret the results of their statistical models and to encourage investigators to collaborate with a statistician to ensure that their studies are designed and analyzed appropriately.

## Introduction

A statistical model is a mathematical representation of statistical assumptions on how observable data are generated. It is particularly useful for clinical studies that relate multiple variables, such as patients’ background factors, to an outcome, such as survival time, because it allows the compact representation of the relationship as a mathematical function, called a regression function. However, a statistical model is just a simplification of the true underlying relationship, and with incorrect assumptions, it can easily lead to misleading results. “All models are wrong, but some are useful”; this is the famous remark by the statistician George E. P. Box.

Statistical modeling is inherently a difficult task because of our general lack of understanding regarding the nature of observable data and it should be appropriately guided by clinical expertise. Another challenging aspect of statistical modeling is the underlying statistical assumptions that may not be well understood by the intended audience or even the analysts. This is particularly concerning because statistical software is readily available and often used by researchers without appropriate expertise to perform complicated data analyses. This article thus intends to provide some basis and principles of statistical models, specifically, regression models, with the hope of helping clinical researchers to better interpret results from their statistical models, and more importantly, to strongly encourage investigators to collaborate with a statistician to ensure that their studies are designed and analyzed appropriately.

## Components of a Statistical Model

Statistical models are typically expressed as equations with the outcome of interest (called the dependent variable) on the left side of the equation and a set of predictors (called covariates or independent variables) on the right side; such models are called regression models.

### Outcome Variables and Types of Regression Model

The outcome of interest can be a continuous variable, a dichotomous variable, a count variable, or a time-to-event variable. The type of outcome variable dictates the type of regression model used to analyze the data. This is because a statistical model is fit to the observed data not only to understand the relationship between the outcome and the predictors for the observed patients but also to generalize the conclusions drawn from the observed data to a larger population. The generalization is inferred from the observed data on the basis of a set of assumptions about the probability distribution of the outcome. Table 1 summarizes the common types of outcome variables, example outcomes, their associated probability distributions, and the corresponding regression models.

**Table 1. Overview of Regression Models**

| | Continuous | Binary^a | Count | Time-to-Event |
|---|---|---|---|---|
| Examples | Weight change | Objective response (complete/partial response or not)^{4} | Number of occurrences of a rare adverse event per patient | OS or PFS times |
| Assumed probability distribution | Normal distribution or unspecified (nonparametric) | Bernoulli or binomial | Poisson | Exponential, Weibull, or log-normal, or can be unspecified |
| Common regression model | Linear regression | Logistic regression | Poisson regression | Cox proportional hazards regression^b |
| Function of outcome being modeled | The mean | Odds (= P/[1 − P]) or log(odds) (= log(P/[1 − P]))^c | Expected count (e.g., expected number of adverse events per patient) | Hazard rate (e.g., hazard rate of mortality for OS, of death or disease progression for PFS) |

OS, overall survival; PFS, progression-free survival.

^a Binary variables are frequently created from multinominal variables (such as cancer types) or ordinal variables (such as disease severities or grades) by introducing a grouping or threshold. Alternatively, such variables can be modeled using extensions of logistic models, that is, multinominal logistic models and ordinal logistic models (such as adjacent-category logistic models and cumulative or proportional-odds models).^{20}

^b The Cox regression is semiparametric because the particular distributional form of the time-to-event (survival) distribution is unspecified, whereas the particular form of the predictor effects on the hazard rate is specified.

^c *P*, probability of success.

Examples of outcome types and regression models in the lung cancer literature include the following: weight change (continuous variable) from the start of cancer therapy in patients with NSCLC was modeled using linear regression^{3}; objective overall response rate (binary variable) and grade 3 or worse adverse events (binary variable) after treatment with gemcitabine and carboplatin with or without cediranib as first-line therapy in advanced NSCLC were evaluated using logistic regression models^{4}; and overall survival (time-to-event) and progression-free survival (time-to-event) in patients with advanced NSCLC treated with programmed cell death protein-1 or programmed death-ligand 1 checkpoint inhibitors were evaluated using Cox proportional hazards models.

### Predictor Variables and Interpretation

The predictors in a regression model can be categorical or continuous variables. The simplest categorical predictor has two levels, for example, sex (male versus female). A linear regression model evaluating the association between sex and weight loss^{3} can be stated as $\text{Mean}(\text{weight loss}) = \alpha + \beta \cdot \text{sex}$, where female is the reference category ($\text{sex} = 0$) and $\text{sex} = 1$ represents male. In this model, the intercept $\alpha$ represents the mean weight loss of female patients, whereas $\beta$ represents the difference in mean weight loss between male and female patients. A positive $\beta$ means that, on average, male patients lost more weight than female patients, whereas a negative $\beta$ means that, on average, male patients lost less weight than female patients.
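The interpretation of $\alpha$ and $\beta$ as the reference-group mean and the between-group difference can be verified numerically. The following sketch (not from the article; all numbers are simulated and hypothetical) fits this two-level model by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: weight loss (kg) for 50 female (sex=0) and 50 male (sex=1) patients
sex = np.repeat([0.0, 1.0], 50)
weight_loss = np.where(sex == 1, 3.5, 2.0) + rng.normal(0, 1.0, size=100)

# Design matrix: an intercept column plus the sex dummy; fit by least squares
X = np.column_stack([np.ones_like(sex), sex])
alpha, beta = np.linalg.lstsq(X, weight_loss, rcond=None)[0]

# With a single binary dummy, alpha equals the female mean weight loss exactly,
# and beta equals the male-minus-female difference in mean weight loss
print(round(alpha, 3), round(beta, 3))
```

Because the model is saturated for a two-level predictor, the fitted coefficients reproduce the observed group means exactly, which makes the algebraic interpretation in the text concrete.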

In general, a categorical predictor with K categories is represented by K − 1 dummy variables, that is, binary (0/1) variables, in a regression model, with one category serving as the reference. For example, a model evaluating the association between posttreatment weight loss and body mass index (BMI, categorized into four groups: underweight, normal weight, overweight, or obese) at the start of chemotherapy can be stated as $\text{Mean}(\text{weight loss}) = \alpha + \gamma_1 \cdot \text{bmi}_1 + \gamma_2 \cdot \text{bmi}_2 + \gamma_3 \cdot \text{bmi}_3$, where normal weight is the reference category and $\text{bmi}_1 = 1$ for underweight and $\text{bmi}_1 = 0$ otherwise, $\text{bmi}_2 = 1$ for overweight and $\text{bmi}_2 = 0$ otherwise, and $\text{bmi}_3 = 1$ for obese and $\text{bmi}_3 = 0$ otherwise. Here, $\alpha$ is the mean weight loss for patients with normal weight, and $\gamma_1, \gamma_2, \gamma_3$ are the differences in mean weight loss for underweight, overweight, and obese patients, respectively, compared with patients with normal weight.
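The K − 1 dummy-coding scheme can be checked the same way. A minimal sketch with simulated data (category labels and group means are hypothetical) shows that the intercept recovers the reference-group mean and each $\gamma$ recovers a difference from that reference:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Four hypothetical BMI categories; "normal" serves as the reference
categories = ["underweight", "normal", "overweight", "obese"]
bmi = rng.choice(categories, size=n)

# K - 1 = 3 dummy variables, one per non-reference category
d1 = (bmi == "underweight").astype(float)
d2 = (bmi == "overweight").astype(float)
d3 = (bmi == "obese").astype(float)

# Simulated weight loss with different means per category
true_means = {"underweight": 3.0, "normal": 2.0, "overweight": 1.5, "obese": 1.0}
weight_loss = np.array([true_means[b] for b in bmi]) + rng.normal(0, 0.5, size=n)

X = np.column_stack([np.ones(n), d1, d2, d3])
alpha, g1, g2, g3 = np.linalg.lstsq(X, weight_loss, rcond=None)[0]

# alpha: mean weight loss of the reference (normal-weight) group;
# g1, g2, g3: each group's difference from that reference mean
```

Adding a fourth dummy for the reference category would make the design matrix collinear with the intercept, which is why only K − 1 dummies are used.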

For a continuous predictor, for example, age at baseline in years, its association with the outcome can be evaluated in the linear model $\text{Mean}(\text{weight loss}) = \alpha + \varphi \cdot \text{age}$. The intercept $\alpha$ represents $\text{Mean}(\text{weight loss})$ at $\text{age} = 0$, and $\varphi$ represents the change in mean weight loss with every 1-year increase in age. However, it is worth noting that $\text{age} = 0$ may be far outside the range of the study population; one can avoid such an extrapolation by introducing a typical age, such as 50, as the reference, by subtracting 50 from $\text{age}$. In this case, the model can be expressed as $\text{Mean}(\text{weight loss}) = \alpha + \varphi \cdot (\text{age} - 50)$, where $\alpha$ now represents $\text{Mean}(\text{weight loss})$ at $\text{age} = 50$. In some cases, when $\varphi$, the change per 1-year increment, is deemed negligibly small, it is more meaningful to consider the effect of a substantial change in age, such as 10 years. As such, one may work with a modified predictor, $\text{age}_{\text{trans}} = (\text{age} - 50)/10$, and assume $\text{Mean}(\text{weight loss}) = \alpha + \varphi \cdot \text{age}_{\text{trans}}$, where $\varphi$ now represents the change in mean weight loss associated with every 10-year increase in age.
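Centering and rescaling a continuous predictor change only how the coefficients are read, not the fitted model. A short simulated sketch (ages and effects are hypothetical) makes the bookkeeping explicit:

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(40, 80, size=150)
weight_loss = 1.0 + 0.05 * (age - 50) + rng.normal(0, 0.8, size=150)

def ols(x, y):
    """Simple linear regression by least squares; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

a_raw, phi_raw = ols(age, weight_loss)               # intercept at age 0 (extrapolation)
a_ctr, phi_ctr = ols(age - 50, weight_loss)          # intercept at the reference age 50
a_dec, phi_dec = ols((age - 50) / 10, weight_loss)   # slope per 10-year increase

# Centering shifts only the intercept; dividing by 10 multiplies the slope by 10
```

The three fits describe exactly the same line; only the meaning of the intercept (mean at age 0 vs. age 50) and the slope's unit (per year vs. per decade) differ.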

This example assumes that the effect of age is linear, that is, the incremental impact on mean weight loss associated with each 1-year increase in age is constant over its entire range. We can check the linearity assumption simply on the basis of a scattergram of the observed weight loss versus age in this univariable setting. Such a plot can also help in identifying outliers that can substantially influence the parameter estimation (see the "Univariable Versus Multivariable Models" section for related plots and detection of influential observations in the multivariable setting). When nonlinearity is suggested, a more complex model should be considered, for example, a model with a quadratic effect of age, $\text{Mean}(\text{weight loss}) = \alpha + \varphi_1 \cdot \text{age}_{\text{trans}} + \varphi_2 \cdot \text{age}_{\text{trans}}^2$. More flexible splines and other nonparametric models are also available. Another approach that addresses nonlinearity is transformation of the outcome variable, such as log transformation; this can also help to yield a distribution that is closer to a normal distribution, with variance that is constant across the levels of the predictor variable. Alternatively, continuous predictor variables such as age are sometimes transformed into a categorical variable, for example, less than or greater than 65 years. Such transformation may facilitate interpretation, but at the cost of information loss owing to grouping.
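When the true relationship is curved, adding the quadratic term reduces the unexplained variation substantially. A simulated sketch (the quadratic effect size is hypothetical) compares the residual sum of squares of the linear and quadratic fits:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
age_trans = rng.uniform(-2, 2, size=n)   # (age - 50) / 10, i.e., ages 30 to 70

# Data generated with a genuinely quadratic age effect
weight_loss = 2.0 + 0.5 * age_trans + 0.8 * age_trans**2 + rng.normal(0, 0.3, n)

def rss(X, y):
    """Residual sum of squares after a least-squares fit."""
    coefs = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coefs
    return float(resid @ resid)

X_lin = np.column_stack([np.ones(n), age_trans])
X_quad = np.column_stack([np.ones(n), age_trans, age_trans**2])

rss_lin = rss(X_lin, weight_loss)
rss_quad = rss(X_quad, weight_loss)
# The quadratic model leaves far less residual variation than the linear one
```

In practice the same comparison would be accompanied by a residual plot; a systematic curve in the residuals of the linear fit is the visual counterpart of the drop in residual sum of squares.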

Categorical and continuous predictors are included in logistic regression models and Cox proportional hazards models in the same manner as described previously for the linear regression model (Table 1). However, their associations are modeled on the log(odds) of a binary outcome and on the log(hazard rate) of a time-to-event outcome, respectively. Because of the complexity of time-to-event data analysis, the Cox proportional hazards model will be covered in more detail in a future article in this series.

## Univariable Versus Multivariable Models

So far, we have discussed models with only one predictor, often called univariable models. Such models can be extended to include multiple predictors simultaneously, called multivariable models. In the clinical literature, these are most often referred to incorrectly as univariate and multivariate models; it is important to emphasize that the appropriate terminology is univariable and multivariable. An example of a multivariable model is

$\text{Mean}(\text{weight loss}) = \alpha + \beta_1 \cdot \text{sex} + \beta_2 \cdot \text{bmi}_1 + \beta_3 \cdot \text{bmi}_2 + \beta_4 \cdot \text{bmi}_3 + \beta_5 \cdot \text{age}_{\text{trans}},$ (Eq. 1)

where $\text{sex}$, $\text{bmi}$, and $\text{age}_{\text{trans}}$ are defined as in the previous models. Here, the intercept $\alpha$ represents the mean weight loss of a female patient with normal weight whose age is 50 years, corresponding to $\text{age}_{\text{trans}} = (\text{age} - 50)/10 = 0$. However, interpretation of the regression parameters $\beta$'s is now conditional on the values of the other predictors. For example, $\beta_1$ represents the difference in mean weight loss between a male patient and a female patient in the same BMI category and of the same age. Similarly, $\beta_2$ represents the difference in mean weight loss between an underweight patient and a normal-weight patient of the same sex and age. More generally, the effect of each predictor in this multivariable model compares the outcomes of individuals with the same attributes except for the predictor being evaluated. This aspect of multivariable models is particularly useful when adjusting for confounding. Confounding arises when the effect of treatment is not distinguishable from the effects of other factors, called confounding factors, which typically relate to both the treatment selection and the outcome variable but are not mediators of the treatment effect on the outcome. Multivariable models allow evaluation of the treatment effect conditional on the same values of the confounding factors by including them as predictors.

Another aspect of the above multivariable model is that the effect of each predictor is the same regardless of the values of the other predictors. For example, the effect of sex $\beta_1$ on mean weight loss is constant irrespective of BMI level or age (Fig. 1A). This assumption can be relaxed by introducing an interaction term, such as $\text{sex} \times \text{age}_{\text{trans}}$. Specifically,

$\text{Mean}(\text{weight loss}) = \alpha + \beta_1 \cdot \text{sex} + \beta_2 \cdot \text{bmi}_1 + \beta_3 \cdot \text{bmi}_2 + \beta_4 \cdot \text{bmi}_3 + \beta_5 \cdot \text{age}_{\text{trans}} + \beta_6 \cdot (\text{sex} \times \text{age}_{\text{trans}}),$ (Eq. 2)

which allows for a differential effect of sex on mean weight loss across the levels of age (Fig. 1B). Inclusion of interaction terms should be considered to capture effect modification between predictors, even though it can make a model and its interpretation more complex.

We have thus far considered additive effect models. The advantages of this type of model are ease of interpretation and suitability for evaluating absolute effects of factors or interventions in a population. However, additive effect models can suffer from technical issues, especially for noncontinuous outcome variables, and models with multiplicative effects, such as logistic, Poisson, and Cox regression models, can be considered (Table 1).
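The interaction in Eq. 2 can be sketched numerically. The following simulated example (BMI terms omitted for brevity; all coefficient values are hypothetical) fits a model with a sex × age interaction and shows that the modeled sex effect, $\beta_1 + \beta_6 \cdot \text{age}_{\text{trans}}$, varies with age:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
sex = rng.integers(0, 2, size=n).astype(float)
age_trans = rng.uniform(-2, 2, size=n)   # (age - 50) / 10

# Simulated data in which the sex effect grows with age (true beta6 = 0.4)
weight_loss = (1.0 + 0.6 * sex + 0.3 * age_trans
               + 0.4 * sex * age_trans + rng.normal(0, 0.3, n))

# Design matrix: intercept, sex, age_trans, and the interaction term
X = np.column_stack([np.ones(n), sex, age_trans, sex * age_trans])
alpha, b1, b5, b6 = np.linalg.lstsq(X, weight_loss, rcond=None)[0]

# The modeled sex effect at a given age is b1 + b6 * age_trans:
effect_at_50 = b1              # age 50 -> age_trans = 0
effect_at_70 = b1 + b6 * 2.0   # age 70 -> age_trans = 2
```

Without the interaction column, the fitted sex effect would be a single number averaged over ages (the Fig. 1A situation); with it, the effect is a line in age (Fig. 1B).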

## Model Complexity Versus Data Information

In general, statistical models become complex as the number of parameters (${\beta}^{\prime}\text{s}$) increases, for example, by entering many predictors, possibly including nonlinear or interaction terms, in the regression model. However, a complex model will not work when it is fit to a data set that does not contain enough information to estimate the parameters. The limiting sample size represents the amount of data information required for model fitting, typically measured by the number of subjects for continuous outcome variables and by the number of events in the analysis of censored time-to-event data (Table 1).

In regression modeling, there are rules of thumb for the limiting sample size, such as "at least 10 subjects per predictor." However, it should be noted that these are crude criteria that do not take into account the joint distribution of the variables, in particular multicollinearity, that is, high correlations among predictors, which makes it difficult to separate their individual effects. An estimation that is unstable owing to an insufficient limiting sample size, multicollinearity, or other reasons can be recognized by unrealistic parameter estimates or confidence intervals, or by warning messages from the statistical software indicating that the estimates or their variances cannot be obtained.
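One common diagnostic for multicollinearity, not named in the text but standard in the statistical literature, is the variance inflation factor (VIF): each predictor is regressed on the others, and VIF = 1/(1 − R²). A minimal sketch with simulated predictors (the cutoffs and data are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = 0.98 * x1 + 0.02 * rng.normal(size=n)   # nearly a copy of x1 -> collinear pair
x3 = rng.normal(size=n)                       # independent predictor

def vif(target, others):
    """Variance inflation factor: 1 / (1 - R^2) from regressing
    one predictor on the remaining predictors."""
    X = np.column_stack([np.ones(len(target))] + others)
    coefs = np.linalg.lstsq(X, target, rcond=None)[0]
    resid = target - X @ coefs
    tss = (target - target.mean()) @ (target - target.mean())
    r2 = 1.0 - (resid @ resid) / tss
    return 1.0 / (1.0 - r2)

vif_x1 = vif(x1, [x2, x3])   # huge: x1 is nearly determined by x2
vif_x3 = vif(x3, [x1, x2])   # near 1: x3 carries independent information
```

A VIF far above common rule-of-thumb thresholds (often quoted as 5 or 10) flags a predictor whose coefficient and standard error will be unstable, matching the warning signs described above.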

## Modeling for Effect Assessment Versus Modeling for Classification or Prediction

The strategy for regression modeling depends on the intended use of the model. Two common uses of statistical models are effect assessment and risk classification or prediction. If the model is used to evaluate the treatment effect or the impact of a risk factor, it is important that the model provides unbiased estimates of the treatment effect or the impact of the risk factor of interest after adjusting for established prognostic factors and confounding factors. As such, careful selection of predictors on the basis of both statistical and clinical perspectives is warranted. Various variable selection techniques and their implications can be found in the statistical literature.^{7,8,9}

It must be emphasized that model checking and diagnostics are critical. For example, in the multivariable linear model of mean weight loss, the linearity assumption for a continuous predictor, for example, age, can be checked by a scattergram of the residuals versus age, in which the residual for a patient is defined as the observed weight loss minus the fitted weight loss for that patient. Another aspect of model diagnostics is identification of influential observations, that is, patients or groups of patients with particular predictor profiles that strongly affect the estimates of the regression coefficients. Specifically, when the change in a regression coefficient estimate after deleting certain observations, called the delta-beta, is substantial, those observations are regarded as influential in estimating that coefficient. As a more theoretical remark, a linear model for a continuous outcome variable rests on four main assumptions: linearity, independence (observations are independent of each other), normality (residuals follow a normal distribution), and homoscedasticity (residual variance is constant across the levels of the predictors), although linear regression is relatively robust to deviations from the latter two. In cases in which different regression models, with different sets of predictors or fit to different sets of observations, are considered important, it is often wise to present the results of all these models; as a sensitivity analysis, this helps to evaluate the robustness of the main results or conclusions to the choice of regression model.
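The delta-beta diagnostic described above can be computed directly by refitting the model with each observation deleted in turn. A simulated sketch (the planted outlier and all values are hypothetical) identifies the influential observation:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 80
age = rng.uniform(40, 80, size=n)
age[0] = 78.0                            # high-leverage age for the planted point
weight_loss = 1.0 + 0.04 * (age - 50) + rng.normal(0, 0.5, n)
weight_loss[0] = 15.0                    # plant one grossly outlying observation

X = np.column_stack([np.ones(n), age - 50])

def slope(X, y):
    """Slope coefficient from a least-squares fit of y on [1, x]."""
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

full_slope = slope(X, weight_loss)

# Delta-beta: change in the slope estimate when each observation is deleted
delta_beta = np.array([
    full_slope - slope(np.delete(X, i, axis=0), np.delete(weight_loss, i))
    for i in range(n)
])
most_influential = int(np.argmax(np.abs(delta_beta)))   # index of the planted outlier
```

Leave-one-out refitting is the brute-force version; standard software reports the same quantity (often as DFBETA) from a single fit using closed-form leverage adjustments.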

If the regression model is intended to be used as a scoring system for risk classification or for prediction, then the overall prediction accuracy of the model (as measured by sensitivity, specificity, and the C-statistic for classification, and by the Hosmer-Lemeshow statistic and Brier score for prediction) is more important than providing unbiased estimates for each predictor in the model. For example, Mandrekar et al. developed a prognostic model for advanced NSCLC and evaluated its accuracy in classifying prognostic risk using the C-statistic, which represents the probability that a randomly selected patient who develops an event of interest has a higher risk score than a patient who has not developed the event.^{11}
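The pairwise definition of the C-statistic can be computed directly. The risk scores and event indicators below are hypothetical, and ties are counted as one-half concordant, a common convention:

```python
from itertools import product

# Hypothetical risk scores from a prognostic model, and observed events (1 = event)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
event  = [1,   1,   0,   1,   0,    0,   0,   0]

def c_statistic(scores, event):
    """Probability that a randomly chosen patient with the event has a higher
    risk score than a randomly chosen patient without it; ties count one-half."""
    pairs = [(s1, s0)
             for (s1, e1), (s0, e0) in product(zip(scores, event), repeat=2)
             if e1 == 1 and e0 == 0]
    concordant = sum(1.0 if s1 > s0 else 0.5 if s1 == s0 else 0.0
                     for s1, s0 in pairs)
    return concordant / len(pairs)

c = c_statistic(scores, event)
# Here 14 of the 15 event/non-event pairs are correctly ordered, so c = 14/15
```

A value of 0.5 corresponds to a model no better than chance at ranking risk, and 1.0 to perfect discrimination; for binary outcomes the C-statistic coincides with the area under the ROC curve.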

In classification or prediction, the estimates of the $\beta$'s are regarded simply as weights, rather than effects, and are tuned to achieve high prediction accuracy. For the standard regression models, such as linear, logistic, Poisson, and Cox regression models (Table 1), penalized regressions, such as ridge and lasso,^{13} shrink the regression parameters or weights toward zero. The resultant weights are thus biased, but more stable (i.e., they have less variance). Because the penalization substantially reduces the effective number of parameters (the degrees of freedom) in the estimation process, it is especially effective when the number of predictors or parameters is large relative to the limiting sample size.

All regression models are subject to overfitting to random noise, rather than systematic variation, in the data used to build the model.^{15} In classification or prediction, resampling techniques, such as split-sample, cross-validation, or bootstrap, can be used for internal validation, that is, evaluation of accuracy within the study population in which the model was developed.^{7,13} However, an external validation study using an independent set of samples is generally warranted. See the TRIPOD guidelines for reporting both model building and validation studies.^{16}
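Ridge shrinkage has a simple closed form that illustrates the bias-stability trade-off described above. In this sketch (simulated data; the penalty value λ = 50 is arbitrary, and the predictors are assumed centered so no intercept is penalized), increasing the penalty pulls the weight vector toward zero:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.5, -1.0, 0.5]          # only 3 of 10 predictors matter
y = X @ beta_true + rng.normal(0, 1.0, n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam * I)^(-1) X'y.
    With lam = 0 this reduces to ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)      # unpenalized weights
beta_shrunk = ridge(X, y, 50.0)  # heavier penalty shrinks the weights toward zero
```

In practice λ is not fixed by hand but chosen by cross-validation to maximize out-of-sample prediction accuracy, which ties the penalization directly to the internal validation techniques mentioned above; lasso behaves similarly but can shrink some weights exactly to zero.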

## Concluding Remarks

In this article, our focus has been on the basics and principles of statistical models after a data set has been collected for statistical analysis. We emphasize that the key to successful data analysis is designing the study to enhance the quality and quantity of the data relevant to the study objective.^{17,18}

## Acknowledgments

This work is supported by a Grant-in-Aid for Scientific Research (16H06299) from the Ministry of Education, Culture, Sports, Science and Technology of Japan. This work was partially supported by the National Institutes of Health Grant P30CA15083 (Mayo Comprehensive Cancer Center Grant) and U10CA180882 (Alliance for Clinical Trials in Oncology Statistics and Data Management Grant).

## References

1. Empirical Model-Building and Response Surfaces. Wiley, New York, NY; 1987.
2. Introduction to regression models. In: Rothman KJ, Greenland S, Lash TL, eds. Modern Epidemiology. 3rd ed. Lippincott Williams & Wilkins, Philadelphia, PA; 2008:381-417.
3. Weight loss over time and survival: a landmark analysis of 1000+ prospectively treated and monitored lung cancer patients. *J Cachexia Sarcopenia Muscle.* 2020;11:1501-1508.
4. A randomized phase II study of gemcitabine and carboplatin with or without cediranib as first-line therapy in advanced non-small-cell lung cancer: North Central Cancer Treatment Group Study N0528. *J Thorac Oncol.* 2013;8:79-88.
5. PD-L1 expression, tumor mutational burden, and cancer gene mutations are stronger predictors of benefit from immune checkpoint blockade than HLA class I genotype in non-small cell lung cancer. *J Thorac Oncol.* 2019;14:1021-1031.
6. Generalized Additive Models. Chapman & Hall/CRC, Boca Raton, FL; 1990.
7. Regression Modeling Strategies With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer, Berlin, Switzerland; 2015.
8. Causal diagrams for epidemiologic research. *Epidemiology.* 1999;10:37-48.
9. Introduction to causal diagrams for confounder selection. *Respirology.* 2014;19:303-311.
10. Five myths about variable selection. *Transpl Int.* 2017;30:6-10.
11. Assessing the performance of prediction models: a framework for traditional and novel measures. *Epidemiology.* 2010;21:128-138.
12. A prognostic model for advanced stage nonsmall cell lung cancer: pooled analysis of North Central Cancer Treatment Group trials. *Cancer.* 2006;107:781-792.
13. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer, Berlin, Switzerland; 2009.
14. Developing and validating risk assessment models of clinical outcomes in modern oncology. *JCO Precis Oncol.* 2019;3:PO.19.00068.
15. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. *Psychosom Med.* 2004;66:411-421.
16. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. *Ann Intern Med.* 2015;162:W1-W73.
17. Using big data to emulate a target trial when a randomized trial is not available. *Am J Epidemiol.* 2016;183:758-764.
18. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Springer, Berlin, Switzerland; 2019.
19. Calculating the sample size required for developing a clinical prediction model. *BMJ.* 2020;368:m441.
20. Applied Logistic Regression. 3rd ed. John Wiley & Sons, Chichester, United Kingdom; 2013.

## Article Info

### Publication History

Published online: February 25, 2021

Accepted: February 15, 2021

Received in revised form: February 12, 2021

Received: December 15, 2020

### Footnotes

*Disclosure:* The authors declare no conflict of interest.

### Copyright

© 2021 International Association for the Study of Lung Cancer. Published by Elsevier Inc.
