Find Relationship Between Continuous Variable and Nominal Variable
The title of this question suggests a fundamental misunderstanding. The most basic idea of correlation is "as one variable increases, does the other variable increase (positive correlation), decrease (negative correlation), or stay the same (no correlation)" with a scale such that perfect positive correlation is +1, no correlation is 0, and perfect negative correlation is -1. The meaning of "perfect" depends on which measure of correlation is used: for Pearson correlation it means the points on a scatter plot lie right on a straight line (sloped upwards for +1 and downwards for -1), for Spearman correlation that the ranks exactly agree (or exactly disagree, so first is paired with last, for -1), and for Kendall's tau that all pairs of observations have concordant ranks (or discordant for -1). An intuition for how this works in practice can be gleaned from the Pearson correlations of various scatter plot shapes; the original answer illustrated this with an image of scatter plots labelled with their correlation coefficients (not reproduced here).
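To make the difference between the three measures concrete, here is a small sketch (the variables x and y are made up for illustration) using R's built-in cor() function. A monotone but nonlinear relationship gives a Pearson correlation below 1, while Spearman and Kendall are exactly 1:

```r
# Monotone but nonlinear relationship: y increases whenever x does,
# but the points do not lie on a straight line
x <- 1:20
y <- x^2

cor(x, y, method = "pearson")   # less than 1: not a perfect straight line
cor(x, y, method = "spearman")  # exactly 1: the ranks agree perfectly
cor(x, y, method = "kendall")   # exactly 1: every pair is concordant
```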
Further insight comes from considering Anscombe's Quartet, where all four data sets have Pearson correlation +0.816 even though they follow the pattern "as x increases, y tends to increase" in very different ways (the original answer included an image of the four plots, not reproduced here).
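This is easy to verify, since Anscombe's Quartet ships with base R as the anscombe data frame:

```r
# The four x-y pairs in the built-in anscombe data frame all have
# almost exactly the same Pearson correlation, despite very different shapes
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
cors  # all approximately 0.816
```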
If your independent variable is nominal then it doesn't make sense to talk about what happens "as x increases". In your case, "Topic of conversation" doesn't have a numerical value that can go up and down. So you can't correlate "Topic of conversation" with "Duration of conversation". But as @ttnphns wrote in the comments, there are measures of strength of association you can use that are somewhat analogous. Here is some fake data and accompanying R code:
data.df <- data.frame(
  topic    = c(rep(c("Gossip", "Sports", "Weather"), each = 4)),
  duration = c(6:9, 2:5, 4:7)
)
print(data.df)
boxplot(duration ~ topic, data = data.df,
        ylab = "Duration of conversation")
Which gives:
> print(data.df)
     topic duration
1   Gossip        6
2   Gossip        7
3   Gossip        8
4   Gossip        9
5   Sports        2
6   Sports        3
7   Sports        4
8   Sports        5
9  Weather        4
10 Weather        5
11 Weather        6
12 Weather        7
By using "Gossip" as the reference level for "Topic", and defining binary dummy variables for "Sports" and "Weather", we can perform a multiple regression.
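To see this dummy coding explicitly (a sketch using the same fake data), model.matrix() shows the design matrix R builds behind the scenes:

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)

# "Gossip" is the reference level (first alphabetically), so it gets no
# column of its own; the other two topics become 0/1 dummy variables
mm <- model.matrix(~ topic, data = data.df)
head(mm)
```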
> model.lm <- lm(duration ~ topic, data = data.df)
> summary(model.lm)

Call:
lm(formula = duration ~ topic, data = data.df)

Residuals:
   Min     1Q Median     3Q    Max 
 -1.50  -0.75   0.00   0.75   1.50 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.5000     0.6455  11.619 1.01e-06 ***
topicSports   -4.0000     0.9129  -4.382  0.00177 ** 
topicWeather  -2.0000     0.9129  -2.191  0.05617 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.291 on 9 degrees of freedom
Multiple R-squared:  0.6809, Adjusted R-squared:  0.6099 
F-statistic:   9.6 on 2 and 9 DF,  p-value: 0.005861
We can interpret the estimated intercept as giving the mean duration of Gossip conversations as 7.5 minutes, and the estimated coefficients for the dummy variables as showing that Sports conversations were on average 4 minutes shorter than Gossip ones, while Weather conversations were 2 minutes shorter than Gossip. Part of the output is the coefficient of determination, R² = 0.68. One interpretation of this is that our model explains 68% of the variance in conversation duration. Another interpretation of R² is that by taking its square root, we can find the multiple correlation coefficient R.
> rsq <- summary(model.lm)$r.squared
> rsq
[1] 0.6808511
> sqrt(rsq)
[1] 0.825137
Note that 0.825 isn't the correlation between Duration and Topic - we can't correlate those two variables because Topic is nominal. What it actually represents is the correlation between the observed durations, and the ones predicted (fitted) by our model. Both of these variables are numerical so we are able to correlate them. In fact the fitted values are just the mean durations for each group:
> print(model.lm$fitted)
  1   2   3   4   5   6   7   8   9  10  11  12 
7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 
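We can confirm that these fitted values are just the per-topic means (the fake data is recreated here so the snippet stands alone):

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)

# Per-group mean durations: these match the regression's fitted values
group.means <- tapply(data.df$duration, data.df$topic, mean)
group.means
#  Gossip  Sports Weather 
#     7.5     3.5     5.5 
```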
Just to check, the Pearson correlation between observed and fitted values is:
> cor(data.df$duration, model.lm$fitted)
[1] 0.825137
We can visualise this on a scatter plot:
plot(x = model.lm$fitted, y = data.df$duration,
     xlab = "Fitted duration", ylab = "Observed duration")
abline(lm(data.df$duration ~ model.lm$fitted), col = "red")
The strength of this relationship is visually very similar to those of the Anscombe's Quartet plots, which is unsurprising as they all had Pearson correlations about 0.82.
You might be surprised that with a categorical independent variable, I chose to do a (multiple) regression rather than a one-way ANOVA. But in fact this turns out to be an equivalent approach.
library(heplots)  # for etasq()
model.aov <- aov(duration ~ topic, data = data.df)
summary(model.aov)
This gives a summary with identical F statistic and p-value:
            Df Sum Sq Mean Sq F value  Pr(>F)   
topic        2     32  16.000     9.6 0.00586 **
Residuals    9     15   1.667                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
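The same table can be obtained directly from the regression fit with anova(), which makes the equivalence concrete (the model is refitted here so the snippet is self-contained):

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)
model.lm <- lm(duration ~ topic, data = data.df)

# anova() on the regression model reproduces the one-way ANOVA table:
# same sums of squares, same F statistic, same p-value
anova(model.lm)
```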
Again, the ANOVA model fits the group means, just as the regression did:
> print(model.aov$fitted)
  1   2   3   4   5   6   7   8   9  10  11  12 
7.5 7.5 7.5 7.5 3.5 3.5 3.5 3.5 5.5 5.5 5.5 5.5 
This means that the correlation between fitted and observed values of the dependent variable is the same as it was for the multiple regression model. The "proportion of variance explained" measure R² for multiple regression has an ANOVA equivalent, η² (eta squared). We can see that they match.
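Even without the heplots package, η² can be computed by hand as the between-groups sum of squares divided by the total sum of squares (a sketch using the same fake data):

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)
model.aov <- aov(duration ~ topic, data = data.df)

# Eta squared = SS(between groups) / SS(total)
ss <- summary(model.aov)[[1]][["Sum Sq"]]  # between-groups SS, residual SS
eta.sq <- ss[1] / sum(ss)                  # 32 / (32 + 15)
eta.sq                                     # matches the regression R-squared
```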
> etasq(model.aov, partial = FALSE)
              eta^2
topic     0.6808511
Residuals        NA
In this sense, the closest analogue to a "correlation" between a nominal explanatory variable and a continuous response would be η, the square root of η², which is the equivalent of the multiple correlation coefficient R for regression. This explains the comment that "The most natural measure of association / correlation between a nominal (taken as IV) and a scale (taken as DV) variables is eta". If you are more interested in the proportion of variance explained, then you can stick with eta squared (or its regression equivalent, R²). For ANOVA, one often comes across the partial eta squared. As this ANOVA was one-way (there was only one categorical predictor), the partial eta squared is the same as eta squared, but things change in models with more predictors.
> etasq(model.aov, partial = TRUE)
          Partial eta^2
topic         0.6808511
Residuals            NA
However it's quite possible that neither "correlation" nor "proportion of variance explained" is the measure of effect size you wish to use. For instance, your focus may lie more on how means differ between groups. This question and answer contain more information on eta squared, partial eta squared, and various alternatives.
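If group mean differences are the focus, one common follow-up (by no means the only option) is Tukey's honest significant differences on the fitted ANOVA, sketched here with the same fake data:

```r
data.df <- data.frame(
  topic    = rep(c("Gossip", "Sports", "Weather"), each = 4),
  duration = c(6:9, 2:5, 4:7)
)
model.aov <- aov(duration ~ topic, data = data.df)

# Pairwise differences in mean duration between topics,
# with confidence intervals adjusted for multiple comparisons
TukeyHSD(model.aov)
```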
Source: https://www.webpages.uidaho.edu/~stevel/519/Correlation%20between%20a%20nominal%20%28IV%29%20and%20a%20continuous%20%28DV%29%20variable.html