Find Relationship Between Continuous Variable and Nominal Variable
The title of this question suggests a fundamental misunderstanding. The most basic idea of correlation is "as one variable increases, does the other variable increase (positive correlation), decrease (negative correlation), or stay the same (no correlation)" with a scale such that perfect positive correlation is +1, no correlation is 0, and perfect negative correlation is -1. The meaning of "perfect" depends on which measure of correlation is used: for Pearson correlation it means the points on a scatter plot lie right on a straight line (sloped upwards for +1 and downwards for -1), for Spearman correlation that the ranks exactly agree (or exactly disagree, so first is paired with last, for -1), and for Kendall's tau that all pairs of observations have concordant ranks (or discordant for -1). An intuition for how this works in practice can be gleaned from the Pearson correlations of various scatter plot shapes; the original answer illustrated this with an image of scatter plots labelled with their correlation coefficients (not reproduced here).
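To make the difference between the three measures concrete, here is a small sketch (the variables x and y are made up for illustration) using R's built-in cor() function. A monotone but nonlinear relationship gives a Pearson correlation below 1, while Spearman and Kendall are exactly 1:

```r
# Monotone but nonlinear relationship: y increases whenever x does,
# but the points do not lie on a straight line
x <- 1:20
y <- x^2

cor(x, y, method = "pearson")   # less than 1: not a perfect straight line
cor(x, y, method = "spearman")  # exactly 1: the ranks agree perfectly
cor(x, y, method = "kendall")   # exactly 1: every pair is concordant
```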
Further insight comes from considering Anscombe's Quartet, where all four data sets have Pearson correlation +0.816 even though they follow the pattern "as x increases, y tends to increase" in very different ways (the original answer included an image of the four plots, not reproduced here).
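This is easy to verify, since Anscombe's Quartet ships with base R as the anscombe data frame:

```r
# The four x-y pairs in the built-in anscombe data frame all have
# almost exactly the same Pearson correlation, despite very different shapes
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
cors  # all approximately 0.816
```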
If your independent variable is nominal then it doesn't make sense to talk about what happens "as x increases". In your case, "Topic of conversation" doesn't have a numerical value that can go up and down. So you can't correlate "Topic of conversation" with "Duration of conversation". But as @ttnphns wrote in the comments, there are measures of strength of association you can use that are somewhat analogous. Here is some fake data and accompanying R code:
data.df <- data.frame(
  topic    = c(rep(c("Gossip", "Sports", "Weather"), each = 4)),
  duration = c(6:9, 2:5, 4:7)
)
print(data.df)
boxplot(duration ~ topic, data = data.df,
        ylab = "Duration of conversation")
Which gives:
> print(data.df)
     topic duration
1   Gossip        6
2   Gossip        7
3   Gossip        8
4   Gossip        9
5   Sports        2
6   Sports        3
7   Sports        4
8   Sports        5
9  Weather        4
10 Weather        5
11 Weather        6
12 Weather        7
By using "Gossip" as the reference level for "Topic", and defining binary dummy variables for "Sports" and "Weather", we can perform a multiple regression.
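To see this dummy coding explicitly (a sketch using the same fake data), model.matrix() shows the design matrix R builds behind the scenes:

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)

# "Gossip" is the reference level (first alphabetically), so it gets no
# column of its own; the other two topics become 0/1 dummy variables
mm <- model.matrix(~ topic, data = data.df)
head(mm)
```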
> model.lm <- lm(duration ~ topic, data = data.df)
> summary(model.lm)

Call:
lm(formula = duration ~ topic, data = data.df)

Residuals:
   Min     1Q Median     3Q    Max 
 -1.50  -0.75   0.00   0.75   1.50 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.5000     0.6455  11.619 1.01e-06 ***
topicSports   -4.0000     0.9129  -4.382  0.00177 ** 
topicWeather  -2.0000     0.9129  -2.191  0.05617 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.291 on 9 degrees of freedom
Multiple R-squared:  0.6809, Adjusted R-squared:  0.6099 
F-statistic:   9.6 on 2 and 9 DF,  p-value: 0.005861
We can interpret the estimated intercept as giving the mean duration of Gossip conversations as 7.5 minutes, and the estimated coefficients for the dummy variables as showing that Sports conversations were on average 4 minutes shorter than Gossip ones, while Weather conversations were 2 minutes shorter than Gossip. Part of the output is the coefficient of determination, R² = 0.68. One interpretation of this is that our model explains 68% of the variance in conversation duration. Another interpretation of R² is that by taking its square root, we can find the multiple correlation coefficient R.
> rsq <- summary(model.lm)$r.squared
> rsq
[1] 0.6808511
> sqrt(rsq)
[1] 0.825137
Note that 0.825 isn't the correlation between Duration and Topic - we can't correlate those two variables because Topic is nominal. What it actually represents is the correlation between the observed durations, and the ones predicted (fitted) by our model. Both of these variables are numerical so we are able to correlate them. In fact the fitted values are just the mean durations for each group:
> print(model.lm$fitted)
  1   2   3   4   5   6   7   8   9  10  11  12 
7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 
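We can confirm that these fitted values are just the per-topic means (the fake data is recreated here so the snippet stands alone):

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)

# Per-group mean durations: these match the regression's fitted values
group.means <- tapply(data.df$duration, data.df$topic, mean)
group.means
#  Gossip  Sports Weather 
#     7.5     3.5     5.5 
```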
Just to check, the Pearson correlation between observed and fitted values is:
> cor(data.df$duration, model.lm$fitted)
[1] 0.825137
We can visualise this on a scatter plot:
plot(x = model.lm$fitted, y = data.df$duration,
     xlab = "Fitted duration", ylab = "Observed duration")
abline(lm(data.df$duration ~ model.lm$fitted), col = "red")
The strength of this relationship is visually very similar to those of the Anscombe's Quartet plots, which is unsurprising as they all had Pearson correlations about 0.82.
You might be surprised that with a categorical independent variable, I chose to do a (multiple) regression rather than a one-way ANOVA. But in fact this turns out to be an equivalent approach.
library(heplots)  # for etasq()
model.aov <- aov(duration ~ topic, data = data.df)
summary(model.aov)
This gives a summary with identical F statistic and p-value:
            Df Sum Sq Mean Sq F value  Pr(>F)   
topic        2     32  16.000     9.6 0.00586 **
Residuals    9     15   1.667                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
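The same table can be obtained directly from the regression fit with anova(), which makes the equivalence concrete (the model is refitted here so the snippet is self-contained):

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)
model.lm <- lm(duration ~ topic, data = data.df)

# anova() on the regression model reproduces the one-way ANOVA table:
# same sums of squares, same F statistic, same p-value
anova(model.lm)
```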
Again, the ANOVA model fits the group means, just as the regression did:
> print(model.aov$fitted)
  1   2   3   4   5   6   7   8   9  10  11  12 
7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 
This means that the correlation between fitted and observed values of the dependent variable is the same as it was for the multiple regression model. The "proportion of variance explained" measure R² for multiple regression has an ANOVA equivalent, η² (eta squared). We can see that they match.
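Even without the heplots package, η² can be computed by hand as the between-groups sum of squares divided by the total sum of squares (a sketch using the same fake data):

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)
model.aov <- aov(duration ~ topic, data = data.df)

# Eta squared = SS(between groups) / SS(total)
ss <- summary(model.aov)[[1]][["Sum Sq"]]  # between-groups SS, residual SS
eta.sq <- ss[1] / sum(ss)                  # 32 / (32 + 15)
eta.sq                                     # matches the regression R-squared
```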
> etasq(model.aov, partial = FALSE)
              eta^2
topic     0.6808511
Residuals        NA
In this sense, the closest analogue to a "correlation" between a nominal explanatory variable and a continuous response would be η, the square root of η², which is the equivalent of the multiple correlation coefficient R for regression. This explains the comment that "The most natural measure of association / correlation between a nominal (taken as IV) and a scale (taken as DV) variables is eta". If you are more interested in the proportion of variance explained, then you can stick with eta squared (or its regression equivalent, R²). For ANOVA, one often comes across the partial eta squared. As this ANOVA was one-way (there was only one categorical predictor), the partial eta squared is the same as eta squared, but things change in models with more predictors.
> etasq(model.aov, partial = TRUE)
          Partial eta^2
topic         0.6808511
Residuals            NA
However it's quite possible that neither "correlation" nor "proportion of variance explained" is the measure of effect size you wish to use. For instance, your focus may lie more on how means differ between groups. This question and answer contain more information on eta squared, partial eta squared, and various alternatives.
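If group mean differences are the focus, one common follow-up (by no means the only option) is Tukey's honest significant differences on the fitted ANOVA, sketched here with the same fake data:

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)
model.aov <- aov(duration ~ topic, data = data.df)

# Pairwise differences in mean duration between topics,
# with confidence intervals adjusted for multiple comparisons
TukeyHSD(model.aov)
```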
Source: https://www.webpages.uidaho.edu/~stevel/519/Correlation%20between%20a%20nominal%20%28IV%29%20and%20a%20continuous%20%28DV%29%20variable.html