National IQs summarized

Lynn and Vanhanen have a new paper out in which they summarize, in 19 tables, the research to date on National IQs. (Citation below.) It’s interesting that IQ tests and international assessments predict as they do, but it’s not at all clear what the national IQ differences represent (Lynn’s updated National IQs, here); it’s not even clear whether there are actual differences in ability. With regard to the latter point, here are the results from a recent study on measurement invariance and the 1999 TIMSS (Trends in International Mathematics and Science Study):

The authors rightly conclude:

For cross-culture MI examinations, only weak invariance, at best, is achieved. This result indicates that intercept invariance does not hold for any of the cross-culture comparisons, hence, the mathematics test, as a whole, was consistently biased against one of the countries in the pairs. One cannot infer that there is true group difference even if the hypothesis test, such as a t-test, is significant because the detected difference might be an artifact of the measurement bias. Any research or policy exercise such as ranking performances or explaining group differences based on such mathematics proficiency scores is not meaningful because mathematics proficiency scores were not measured on the same metric unless some forms of linking or equating, which have their own variation of MI assumptions, is performed before comparison.

Lynn and Vanhanen, 2012. National IQs: A review of their educational, cognitive, economic, political, demographic, sociological, epidemiological, geographic and climatic correlates


10 Responses to National IQs summarized

  1. Steve Sailer says:

    Thanks for linking to Wu, Li, and Zumbo’s paper on measurement invariance, which begins, “Measurement invariance (MI) has been developed in a very technical language and manner that is generally not widely accessible to social and behavioral researchers and applied measurement specialists.”

    How right they are. I’m just not smart enough to be able to keep the concept of MI clear in my head for more than about 30 seconds.

    What might be helpful for duffers like me are a couple of real world examples where MI explains something useful.

    • Chuck says:

      So do “test scores” relate to “cognitive ability” (a latent construct) the same way in Mexico as in the US? If measurement invariance holds, yes. A blunt example of a situation in which it wouldn’t hold is if tests were given in Spanish in both the US and Mexico. In this case, you clearly wouldn’t be measuring the same thing in the US as in Mexico (i.e., measurement invariance would not hold). Inferences about latent ability differences then become problematic. In this case, US test scores would be lower, but not cognitive ability.

      Here’s a discussion with citations:

      “Group Comparisons

      The interpretation of group differences on observed scores, in terms of psychological attributes, depends on the invariance of measurement models across the groups that figure in the comparison. In psychometrics, a significant array of theoretical models and associated techniques has been developed to get some grip on this problem (Mellenbergh, 1989; Meredith, 1993; Millsap & Everson, 1993). In practice, however, group differences are often simply evaluated through the examination of observed scores—without testing the invariance of measurement models that relate these scores to psychological attributes.
      Tests of measurement invariance are conspicuously lacking, for instance, in some of the most influential studies on group differences in intelligence. Consider the controversial work of Herrnstein and Murray (1994) and Lynn and Vanhanen (2002). These researchers infer latent intelligence differences between groups from observed differences in IQ (across race and nationality, respectively) without having done a single test for measurement invariance. (It is also illustrative, in this context, that their many critics rarely note this omission.) What these researchers do instead is check whether correlations between test scores and criterion variables are comparable (e.g., Lynn & Vanhanen, 2002, pp. 66–71), or whether regressing some criterion on the observed test scores gives comparable regression parameters in the different groups (e.g., Herrnstein & Murray, 1994, p. 627). This is called prediction invariance. Prediction invariance is then interpreted as evidence for the hypothesis that the tests in question are unbiased.

      In 1997 Millsap published an important paper in Psychological Methods on the relation between prediction invariance and measurement invariance. The paper showed that, under realistic conditions, prediction invariance does not support measurement invariance. In fact, prediction invariance is generally indicative of violations of measurement invariance: if two groups differ in their latent means, and a test has prediction invariance across the levels of the grouping variable, it must have measurement bias with regard to group membership. Conversely, when a test is measurement invariant, it will generally show differences in predictive regression parameters. One would expect a clearly written paper that reports a result, which is so central to group comparisons, to make a splash in psychology. If the relations between psychometrics and psychology were in good shape, to put forward invariant regression parameters as evidence for measurement invariance would be out of the question in every professional and scientific work that appeared after 1997.” (Borsboom, 2006. The attack of the psychometricians.)
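The claim quoted above, that a measurement-invariant test will generally show different predictive regression parameters when the groups differ in latent means, can be illustrated with a small simulation. This is only a sketch under assumed values: the loadings (0.8 and 0.5), the error standard deviations, the latent means (0 and -1), and the function names are all hypothetical choices, not taken from Millsap or Borsboom.

```python
import numpy as np

def simulate_group(n, latent_mean, rng):
    """Draw latent ability T, then test score X and criterion Y from the SAME
    measurement model in every group, so measurement invariance holds by design."""
    t = rng.normal(latent_mean, 1.0, n)       # latent ability
    x = 0.8 * t + rng.normal(0.0, 0.6, n)     # observed test score (assumed loading 0.8)
    y = 0.5 * t + rng.normal(0.0, 0.8, n)     # observed criterion (assumed loading 0.5)
    return x, y

def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

rng = np.random.default_rng(0)
x_hi, y_hi = simulate_group(200_000, 0.0, rng)    # higher latent mean
x_lo, y_lo = simulate_group(200_000, -1.0, rng)   # lower latent mean

a_hi, b_hi = ols(x_hi, y_hi)
a_lo, b_lo = ols(x_lo, y_lo)

# Slopes agree, but intercepts differ: prediction invariance fails
# even though the measurement model is identical in both groups.
intercept_gap = a_hi - a_lo
slope_gap = b_hi - b_lo
print(f"slopes: {b_hi:.3f} vs {b_lo:.3f}; intercept gap: {intercept_gap:.3f}")
```

Under these assumed parameters the within-group slope is 0.4 and the intercept is 0.18 times the latent mean, so the two regression lines are parallel but offset by about 0.18. A single pooled regression line would therefore overpredict the criterion for the group with the lower latent mean, purely because the test score is an error-laden measure of the latent variable.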

  2. Steve Sailer says:

    Okay, so to extend your example, the last time the National Assessment of Educational Progress asked students if they were born in the U.S. was in, I believe, 1992. Hispanic students not born in the U.S. scored worse than African-Americans on the NAEP, which is given in English, while Hispanics born in the U.S. scored better than African-Americans. Presumably, language is a major cause of this difference.

    In contrast, on “culture-free” IQ tests, such as Ravens, both Mexicans and Mexican-Americans tend to outscore African-Americans on average.

    On the other hand, the NAEP results tend to be more accurate predictors of outcomes that are highly dependent upon good English skills, such as going to law school in the U.S. (a higher proportion of African-Americans than Mexican-Americans go to law school, I believe) than the Ravens results.

    On the other other hand, the Ravens might well be a more accurate predictor than the NAEP of, say, whether somebody who didn’t grow up speaking English would make a decent carpenter or do well in some other job that requires some brainpower but is less dependent upon English skills.

    So, how do we apply the concepts “prediction invariance” and “measurement invariance” to these (stylized, but not implausible) facts? And what benefit do we get from using those concepts?

    • Chuck says:


      A good example of a situation where the “prediction invariance” and “measurement invariance” issue comes up is stereotype threat (ST). As Wicherts has pointed out, if MI holds, differences are unlikely to be due to ST, since MI means that differences are not a function of group membership, but rather are a function of the differential distributions of the latent variable. In the case of the US Black-White difference, a number of studies have shown that MI holds. Let me just quote James Lee on this:

      “A population difference that does not arise from common factors is said to arise from measurement bias. In this situation a member of the minority population with the same latent ability as a member of the majority population is expected to obtain a different observed score (Fig. 2). For example, if the mean difference between two populations in vocabulary size arises from some cultural barrier impeding the minority population’s acquisition of the majority language, then members of the minority population with given latent scores on g and the broad verbal factor will obtain lower scores on a vocabulary test than their majority peers with equal latent scores. This form of measurement bias corresponds to different intercepts in the regression of test scores on common factors. However, differences in slopes and residual variances are also forms of measurement bias, since under such conditions observed scores continue to depend on both latent abilities and group membership.

      Three studies examining the factorial nature of the black–white IQ difference have found that the difference does not arise from measurement bias (Dolan, 2000; Dolan & Hamaker, 2001; Lubke, Dolan, Kelderman, & Mellenbergh, 2003). This implies that the black–white difference is indeed a difference in very general abilities. In contrast, a study of stereotype threat employing similarly sized samples found measurement bias to be an important contributor to the differences between treatment groups (Wicherts, Dolan, & Hessen, 2005). Various experiments revealed discrepancies in intercepts, slopes (factor loadings), and residual variances, leading to the conclusion that the differences introduced by stereotype threat are generally not impairments of broad abilities. When combined with the tenability of unbiased measurement for blacks and whites in more typical settings, this finding suggests that stereotype threat may be yet another curiosity of the psychological laboratory with minimal relevance to behavior in real-world situations. (Lee, 2009. Review of intelligence and how to get it: Why schools and cultures count, R.E. Nisbett, Norton, New York, NY (2009). ISBN: 9780393065053)”
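The "regression of test scores on common factors" that Lee describes, with its intercepts, slopes, and residual variances, can be written out as a linear factor model. The notation below (tau, lambda, eta, theta, kappa) is the conventional one from the factor-analysis literature and is an assumption here, as are the labels weak, strong, and strict; none of it is quoted from Lee's review.

```latex
% Linear factor model for subtest j, person i, in group g (assumed notation)
x_{ijg} = \tau_{jg} + \lambda_{jg}\,\eta_{ig} + \varepsilon_{ijg},
\qquad \operatorname{Var}(\varepsilon_{ijg}) = \theta_{jg}

% Measurement invariance across groups g = 1, 2 requires, for every subtest j:
\lambda_{j1} = \lambda_{j2} \quad \text{(equal slopes / loadings: weak invariance)}
\tau_{j1} = \tau_{j2} \quad \text{(equal intercepts: strong invariance)}
\theta_{j1} = \theta_{j2} \quad \text{(equal residual variances: strict invariance)}

% With equal intercepts and loadings, the expected observed difference
% is carried entirely by the latent means \kappa_g = E[\eta_{ig}]:
E[x_{j1}] - E[x_{j2}] = \lambda_j\,(\kappa_1 - \kappa_2)
```

In this notation, the vocabulary example in the quote corresponds to unequal intercepts: two people with the same latent score get different expected observed scores depending on group.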

      But it’s a hassle to establish MI, as tests of MI have rather restrictive assumptions. It’s much easier to establish prediction invariance and then to infer MI or to test theories of ST based on some hypothesized predictor-criterion relations. For example, the latter has been tried in a number of studies (e.g., Cullen et al., 2004; Cullen et al., 2006). These attempts have been criticized (Wicherts and Millsap, 2009) and rebuttals have been given (Sackett et al., 2009):

      “Wicherts and Millsap (2009, this issue) rightly noted the distinction between measurement bias and predictive bias and the fact that a finding of no predictive bias does not rule out the possibility that measurement bias still exists. They took issue with a statement we cited from Cullen, Hardison, and Sackett (2004) that if motivational mechanisms, such as stereotype threat, result in minority group members obtaining lower observed scores than true scores (i.e., a form of measurement bias), then the performance of minority group members should be underpredicted. Our characterization of Cullen et al.’s (2004) statement was too cryptic; what was intended was a statement to the effect that if the regression lines for majority and minority groups are identical at the level of true predictor scores, then a biasing factor resulting in lower observed scores than true scores for minority group members would shift the minority group regression line to result in underprediction for that group. Cullen et al. (2004) also noted that this pattern would be expected only in the absence of other biasing factors, such as a similar motivational mechanism affecting the criterion. Thus, they took great pains to identify a criterion measure not subject to potential biasing effects of stereotype threat.”

      With regard to inferences of MI from PI, one of the subtle psychometric issues involved concerns the relation between the two. MI proponents (e.g., Borsboom, 2006) argue that PI does not support MI. Simulations, in fact, show that the presence of MI often corresponds to the absence of PI. But others point out that this finding might be an artifact of simulations and that in real-world data PI tends to correspond to MI (e.g., Moses, 2011):

      “Prediction invariance indicates that the expected scores of Y given the observed X scores computed for subpopulation G = g are equal to those computed for the total group,

      E(Y | X, G = g) = E(Y | X).

      Measurement invariance indicates that the expected scores of Y given latent variable T computed in subpopulation G = g are equal to those computed in the total group,

      E(Y | T, G = g) = E(Y | T).

      Some work has suggested that the invariances, and in particular prediction and measurement invariance, are inconsistent. Subgroup differences in the intercepts of observed score regressions have been demonstrated in hypothetical situations where measurement invariance is assumed (Linn & Werts, 1971). The subgroup correlations that are one aspect of prediction invariance can be nearly identical for conditions where measurement invariance does not hold (Drasgow, 1982). Comparisons of measurement and prediction invariance have also been studied in terms of the slopes in regression and the pattern loadings in factor analysis models, with the invariances shown to be contradictory under all but the most extreme conditions (Millsap, 1995; Millsap, 1997). Although the work of Linn and Werts is hypothetical and the works of Drasgow and Millsap are theoretical, Millsap argued that his results are realistic and encouraged empirical investigations that evaluate and show the inconsistencies among prediction and measurement invariance.

      The disagreements about whether prediction, measurement, and scaling invariance are consistent or inconsistent suggest that empirical evaluations of the invariances may be useful. Empirical studies have the potential to inform suggestions that the invariances are consistent, suggestions which are primarily based on empirical studies that have focused on evaluating only one type of invariance and inferring the results of other invariances (Dorans, 2004; Humphreys, 1986; Kolen, 2004). Empirical studies also have the potential to clarify suggestions that prediction and measurement invariance are inconsistent, suggestions which are based on theoretical studies that have compared the invariances with respect to theoretical models rather than empirical data (Drasgow, 1982; Millsap, 1995; Millsap, 1997). This paper’s use of empirical studies to address whether prediction, measurement, and scaling invariance are consistent or inconsistent may be useful for both extending prior empirical studies and clarifying theoretical suggestions.

      The overall results of this study’s demonstrations replicate and extend the results of prior empirical studies by finding that

       • invariance in X-to-Y prediction, measurement and scaling functions is more likely to be achieved when X is highly correlated to Y (Dorans, 2000),
       • prediction, measurement, and scaling invariance results are often consistent (Dorans, 2004; Humphreys, 1986; Kolen, 2004),
       • the major source of prediction, scaling, and measurement invariance results tends to be in the differences of subpopulation functions’ intercepts rather than in functions’ slope or in nonlinear functions (Houston & Novick, 1987; Humphreys, 1986; Hunter, et al., 1984; Liu & Holland, 2008; Rotundo & Sackett, 1999; Rushton & Jensen, 2005; Sackett, Schmitt, Ellingson, & Kabin, 2001; Schmidt & Hunter, 1981).”

      ……So that’s the background. Now let’s put this in real-world context using our stereotype threat example. US Blacks under-perform US whites on cognitive ability tests (CATs) and in work performance (Sackett and Lievens, 2007), and the difference in CAT scores predicts the difference in work performance. Some argue that the difference in CAT performance is due to “stereotype threat.” Since CAT performance predicts work performance, they must also argue that the difference in work performance is due to some non-cognitive factors and that the two sets of non-cognitive factors correlate as if there were a cognitive difference. And they do argue this. Whether or not it’s plausible that this is the case, given the specific relation between the predictors and criterion (e.g., Cullen et al., 2004), the people who argue this are correct that this could be the case; predictivity does not establish the non-existence of ST; nor does PI.

      Now, if MI holds, ST isn’t a plausible candidate for causing group differences. But MI is troublesome to establish. What about prediction invariance? That depends on our reading of the literature on the relation between the two. The absence of PI of course does not imply the absence of MI or, in this case, the absence of a latent difference. Nor does the absence of PI imply the absence of predictivity. As technically defined [e.g., E(Y | X, G = g) = E(Y | X)], PI may or may not hold if MI does. And if MI does hold, CAT differences will predict work performance differences; in the absence of PI, the subgroup regressions will just be different.

      ….So getting back to your question: “So, how do we apply the concepts “prediction invariance” and “measurement invariance” to these (stylized, but not implausible) facts? And what benefit do we get from using those concepts?”

      In the case of 1st gen Hispanics and African Americans, MI most probably doesn’t hold for NAEP scores. MI might hold when it comes to Ravens. Regardless of whether MI holds for either, PI may or may not hold in some instances. Regardless of whether MI and/or PI holds, there may be a latent difference; you just wouldn’t know for certain (this goes for both the presence and absence of measured differences). And regardless of whether MI and/or PI holds, differences may be predictive, and they may be predictive regardless of whether there are latent differences.

      To put it another way, establishing MI establishes that measured differences reflect latent differences. You can try to infer latent differences otherwise, but some methods, such as inference from PI or predictivity, are questionable.

      Hope that helps.

      Cullen et al., 2004. Using SAT-grade and ability-job performance relationships to test predictions derived from stereotype threat theory. J

      Cullen et al., 2006. Testing stereotype threat theory predictions for math-identified and nonmath-identified students by gender.

      Moses, 2011. Evaluating Empirical Relationships Among Prediction, Measurement, and Scaling Invariance

      Wicherts and Millsap, 2009. The absence of underprediction does not imply the absence of measurement bias.

      Sackett and Lievens, 2007. Personnel Selection

  3. JL says:

    Statsquatch wrote about the measurement vs. prediction invariance problem here. If a test is measurement invariant, the fair thing would be to demand higher scores from members of low-scoring groups than high-scoring groups in selection situations. This, of course, won’t happen.

    In the case of the TIMSS, I wonder if it would be possible to select only items that are not biased toward any group and look at differences in those. In general, it would be interesting to know what sort of items are unbiased in cross-cultural comparisons.

  4. JL says:

    In the book “The Black-White Test Score Gap” (1998), the authors consistently found in their analyses of various data sets that cognitive tests overpredict black performance compared to whites, but they generally attributed this to stereotype threat.

  5. statsquatch says:

    The overprediction noted in “The Black-White Test Score Gap” is a case where “prediction invariance” does not hold, probably because “measurement invariance” does. I find measurement invariance more abstract in that it says a black and a white with an IQ of 100 have the same latent unobservable level of intelligence. As Chuck notes, the existence of measurement invariance implies that prediction invariance does not hold if the distribution of scores differs between two groups. That is, ironically, a “fair test” that is measurement invariant will lead to an “unfair” outcome where prediction invariance does not hold.

    Measurement invariance is really interesting when you think about the Flynn effect, a situation where you can’t easily measure prediction invariance. Is the sharecropper from 1940 who scores 70 on 2012 norms at exactly the same latent level as someone who scores 70 today? According to Wicherts and others, the answer is no: measurement invariance does not hold over time.

    • Chuck says:

      The issue seems to be more complex than that. A lack of measurement invariance does not mean a lack of difference in a latent factor, rather it means that the measured difference does not completely correspond to a latent factor difference. That is, it means that the measured difference is not solely caused by a latent difference, not that there is necessarily no latent difference.
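The distinction drawn above can be put in arithmetic form. In a one-factor model with equal loadings, an observed mean gap on a subtest is the loading times the latent gap plus any intercept difference. The sketch below is a hypothetical illustration: the function name and all numbers are made up, and it assumes loadings are equal across groups so that only the intercept can carry bias.

```python
def decompose_gap(loading, latent_gap, intercept_gap):
    """Split an expected observed mean gap into a latent part and a bias part.

    Assumes a one-factor linear model with the same loading in both groups:
        observed gap = loading * latent gap + intercept gap
    """
    latent_part = loading * latent_gap   # portion attributable to the latent factor
    bias_part = intercept_gap            # portion attributable to measurement bias
    return latent_part + bias_part, latent_part, bias_part

# Three hypothetical scenarios that all produce the same observed gap of 0.7:
invariant = decompose_gap(loading=0.8, latent_gap=0.875, intercept_gap=0.0)  # MI holds
mixed = decompose_gap(loading=0.8, latent_gap=0.5, intercept_gap=0.3)        # MI fails, latent gap exists
all_bias = decompose_gap(loading=0.8, latent_gap=0.0, intercept_gap=0.7)     # MI fails, no latent gap

for name, (observed, latent_part, bias_part) in [
    ("invariant", invariant), ("mixed", mixed), ("all_bias", all_bias)
]:
    print(f"{name}: observed={observed:.2f} latent={latent_part:.2f} bias={bias_part:.2f}")
```

The observed gap is identical in all three scenarios, which is why a failure of invariance by itself says that the measured difference is not purely latent, without saying how much of it, if any, is.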

      When it comes to the secular rise, I think Wai and Putallaz (2011) drew the correct conclusion in their massive study, “The Flynn effect puzzle: A 30-year examination from the right tail of the ability distribution provides some missing pieces,” which looked at over 1.7 million scores:

      “For example, for tests that are most g loaded such as the SAT, ACT, and EXPLORE composites, the gains should be lower than on individual subtests such as the SAT-M, ACT-M, and EXPLORE-M. This is precisely the pattern we have found within each set of measures and this suggests that the gain is likely not due as much to genuine increases in g, but perhaps is more likely on the specific knowledge content of the measures. Additionally, following Wicherts et al. (2004), we used multigroup confirmatory factor analysis (MGCFA) to further investigate whether the gains on the ACT and EXPLORE (the two measures with enough subtests for this analysis) were due to g or to other factors. Using time period as the grouping variable, we uncovered that both tests were not factorially invariant with respect to cohort which aligns with the findings of Wicherts et al. (2004) among multiple tests from the general ability distribution. Therefore, it is unclear whether the gains on these tests are due to g or to other factors, although increases could indeed be due to g, the true aspect, at least in part. (a)

      (a)…Under this model the g gain on the ACT was estimated at 0.078 of the time 1 SD. This result was highly sensitive to model assumptions. Models that allowed g loadings and intercepts for math to change resulted in Flynn effect estimates ranging from zero to 0.30 of the time 1 SD. Models where the math intercept was allowed to change resulted in no gains on g. This indicates that g gain estimates are unreliable and depend heavily on assumptions about measurement invariance. However, all models tested consistently showed an ACT g variance increase of 30 to 40%. Flynn effect gains appeared more robust on the EXPLORE, with all model variations showing a g gain of at least 30% of the time 1 SD. The full scalar invariance model estimated a gain of 30% but showed poor fit. Freeing intercepts on reading and English as well as their residual covariance resulted in a model with very good fit: χ2 (7) = 3024, RMSEA=0.086, CFI=0.985, BIC=2,310,919, SRMR=0.037. Estimates for g gains were quite large under this partial invariance model (50% of the time 1 SD). Contrary to the results from the ACT, all the EXPLORE models found a decrease in g variance of about 30%. This demonstrates that both the ACT and EXPLORE are not factorially invariant with respect to cohort which aligns with the findings of Wicherts et al. (2004) investigating multiple samples from the general ability distribution. Following Wicherts et al. (2004, p. 529), “This implies that the gains in intelligence test scores are not simply manifestations of increases in the constructs that the tests purport to measure (i.e., the common factors).” In other words, gains may still be due to g in part but due to the lack of full measurement invariance, exact estimates of changes in the g distribution depend heavily on complex partial measurement invariance assumptions that are difficult to test. Overall the EXPLORE showed stronger evidence of potential g gains than did the ACT.”

      (It seems queer that there would be g gains on EXPLORE but not ACT, but less so when you make the distinction between statistical g and biological/genetic g. In principle you can have massive statistical g gains with no biological/genetic g gains.)

      Here is Mingroni’s discussion:

      “In recent years, multigroup confirmatory factor analysis (MGCFA) has been applied by some researchers in an effort to better understand the nature of IQ differences observed among different groups (see Lubke, Dolan, Kelderman, & Mellenbergh, 2003, for a general description). Briefly, in this type of analysis MGCFA is used to test the proposition that observed group IQ differences are due to differences in the latent factors thought to underlie the test, such as verbal ability, spatial ability, or a general intelligence factor g (e.g., Carroll, 1993). When the results of the analysis are consistent with group differences solely in the latent factors, the test is said to be measurement invariant with respect to the groups studied. The failure to observe measurement invariance suggests that the differences are due to other factors, instead of or in addition to, differences in the latent factors….

      Because heterosis would almost certainly be expected to affect the latent factors, one would initially expect measurement invariance to be observed between cohorts if heterosis were the sole cause of the Flynn effect. Wicherts et al. (2004) analyzed five data sets in which IQ test data were available for two different cohorts and found that measurement invariance was untenable in all five data sets. This finding would appear to be inconsistent with heterosis as the sole cause of the rise in IQ. It is important to note, however, that the researchers also found that several of the data sets displayed partial measurement invariance. That is, when some of the subtests were effectively taken out of the analysis, measurement invariance was found to be tenable for the remaining subtests. This suggests that the trends observed on at least some subtests are likely due to changes in the latent factors.

      The failure to observe complete measurement invariance between cohorts does not allow one to preclude the possibility that heterosis is a partial or even major cause of the Flynn effect. Therefore, findings like that of Wicherts et al. (2004) provide little justification for abandoning other available opportunities to test a heterosis hypothesis, such as conclusively determining whether the Flynn effect has occurred within families or testing for intergenerational genetic changes in the frequency of heterozygotes. It must also be mentioned that unlike IQ, observed changes in traits like height, age at menarche, or myopia almost certainly could not be explained by artifacts resulting from the testing instruments.

      Efforts like those of Wicherts et al., which try to understand the nature of the Flynn effect, can only complement efforts like those discussed in this article, which primarily try to get at the cause of the Flynn effect. For example, Wicherts (personal communication, May 15, 2006) cited the case of a specific vocabulary test item, terminate, which became much easier over time relative to other items, causing measurement invariance to be less tenable between cohorts. The likely reason for this was that a popular movie, The Terminator, came out between the times when the two cohorts took the test. Because exposure to popular movie titles represents an aspect of the environment that should have a large nonshared component, one would expect that gains caused by this type of effect should show up within families. Although it might be difficult to find a data set suitable for the purpose, it would be interesting to try to identify specific test items that display Flynn effects within families. Such changes cannot be due to genetic factors like heterosis, and so a heterosis hypothesis would initially predict that measurement invariance should become more tenable after removal of items that display within-family trends. One could also look for items in which the heritability markedly increases or decreases over time. In the particular case cited above, one would also expect a breakdown in the heritability of the test item, as evidenced, for example, by a change in the probability of an individual answering correctly given his or her parents’ responses.”

  6. Steve Sailer says:

    So, if they gave the 1980 AFQT test to the children of the NLSY79 panel, we ought to be able to get some useful data on the Flynn Effect? Sounds like it would be worth doing…

  7. Greying Wanderer says:

    “it’s not at all clear what the national IQ differences represent”

    I think a lot of it is the ability to deal with abstractions. Someone with low IQ can learn to hunt or drive a car or perform other mechanical tasks which are learned through practising, but has a harder time with anything abstract, so mechanics would come easier than electrics, which would come easier than electronics, etc.

    The future is an abstract concept and the further ahead the more abstract. Lower IQ people – or young children – have a hard time imagining more than a few days ahead. So the lower IQ would have a problem both with tasks that were abstract in themselves plus those that required study or future planning.

    As an example of the distinction: in the recent Libyan conflict there was heavy use of technicals, Toyota pickup trucks with a machine gun mounted on the back. If you tested IQ-representative teams from a variety of ethnic groups purely on driving and shooting technicals around an assault course cum target range, I don’t think there would be much of a noticeable difference based on national IQ. However, if you ran the tests weekly over six months with a cumulative score and with each team maintaining their own technicals, the higher-IQ groups would rapidly pull ahead because of maintenance, as the lower-IQ teams’ technicals broke down and jammed more.

    If you accept the future as being an abstract concept then you get the link. Those populations that needed to think ahead to survive got the abstraction chip, i.e. winter, i.e. latitude.
