'''Consensus-based assessment''' ('''CBA''') expands on the common practice of [[consensus decision-making]] and the theoretical observation that expertise can be closely approximated by large numbers of novices or journeymen. It provides a method for determining [[Rubric (academic)|measurement standards]] for very ambiguous domains of knowledge, such as politics, religion, values, and culture in general. From this perspective, the shared knowledge that forms cultural consensus can be assessed in much the same way as expertise or general intelligence.
{{Essay-entry}}


== Measurement standards for general intelligence ==

Consensus-based assessment rests on the observation that samples of individuals with differing competence (e.g., experts and apprentices) provide similar mean ratings when they rate relevant scenarios using [[Likert scale|Likert scales]]. Thus, from the perspective of a CBA framework, cultural standards for scoring keys can be derived from the same population that is being assessed. Peter Legree and Joseph Psotka, working together over the past decades, proposed that [[General intelligence factor|psychometric g]] could be measured unobtrusively through survey-like scales requiring judgments. The measurement could use either a deviation score for each person from the group or expert mean, or a [[correlation|Pearson correlation]] between each person's judgments and the group mean; the two techniques are perfectly correlated. Legree and Psotka subsequently created scales that asked individuals to estimate word frequencies, judge binary probabilities of good continuation, identify knowledge implications, and approximate employment distributions. The items were carefully chosen to lack objective referents, so the scales required respondents to provide judgments that were scored against broadly developed, consensual standards. Performance on this judgment battery correlated approximately 0.80 with conventional measures of psychometric g. The response keys were consensually derived: unlike mathematics or physics questions, the items, scenarios, and options used to assess psychometric g were guided only roughly by a theory emphasizing complex judgment, and the explicit keys were unknown until after the assessments were made, when they were determined from the means of everyone's responses using deviation scores, correlations, or factor scores.
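
The scales and scoring procedures themselves are not reproduced here, but a minimal Python sketch (hypothetical ratings, invented variable names, [[NumPy]] assumed as the tool) illustrates the two scoring techniques described above: derive a consensus key from the sample's own mean ratings, then score each respondent either by deviation from that key or by correlation with it.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical Likert ratings: rows = respondents, columns = judgment items
# that lack objective referents (e.g., "how frequent is this word?").
ratings = np.array([
    [5, 3, 4, 2, 4],
    [4, 3, 5, 2, 4],
    [5, 2, 4, 1, 5],
    [2, 5, 1, 4, 2],   # an atypical respondent
    [4, 3, 4, 2, 4],
], dtype=float)

# Consensus key: the mean rating the whole sample assigns to each item.
consensus_key = ratings.mean(axis=0)

# Technique 1: deviation score (negated so that higher = closer to consensus).
deviation_scores = -np.abs(ratings - consensus_key).mean(axis=1)

# Technique 2: Pearson correlation between each person's profile and the key.
correlation_scores = np.array(
    [np.corrcoef(person, consensus_key)[0, 1] for person in ratings]
)

print(np.round(deviation_scores, 2))
print(np.round(correlation_scores, 2))
</syntaxhighlight>

If an expert subsample were available, <code>consensus_key</code> could instead be the experts' mean ratings; the point of CBA is that, with enough respondents, the two keys converge.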

== Measurement standards for cultural knowledge ==

One way to understand the connection between expertise and consensus is to consider that, for many performance domains, expertise largely reflects knowledge derived from experience. Since novices tend to have fewer experiences, their opinions err in various, inconsistent directions. As experience is acquired, however, the opinions of journeymen through to experts become more consistent. According to this view, errors are random, so ratings data collected from large samples of respondents of varying expertise can be used to approximate the rating means that a substantial number of experts would provide were they available. Because the standard deviation of a mean approaches zero as the number of observations becomes very large, estimates based on groups of varying competence will provide converging estimates of the best performance standards. The means of these groups' responses can therefore be used to create effective scoring [[Rubric (academic)|rubrics]] or measurement standards to evaluate performance. This approach is particularly relevant to scoring subjective areas of knowledge that are scaled with Likert response scales, and it has been applied to develop scoring standards for several domains that lack experts.
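
The convergence argument is essentially the fact that the standard error of a mean, σ/√''N'', shrinks toward zero as ''N'' grows. A rough simulation sketch (all numbers hypothetical, and assuming novice errors really are random around the value experts would give) illustrates why a large, mostly novice sample recovers the expert standard:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

expert_value = 4.2               # hypothetical "true" expert rating of one scenario
expert_sd, novice_sd = 0.3, 1.2  # novices err more, but (by assumption) at random

for n in (10, 100, 1_000, 10_000):
    # A mixed-competence sample: 90% novices, 10% experts.
    novices = rng.normal(expert_value, novice_sd, size=int(0.9 * n))
    experts = rng.normal(expert_value, expert_sd, size=n - novices.size)
    sample_mean = np.concatenate([novices, experts]).mean()
    # The sample mean converges on 4.2 as n grows, so a consensus key from a
    # large mixed sample approaches the standard a panel of experts would set.
    print(f"n = {n:>6}: sample mean = {sample_mean:.3f}")
</syntaxhighlight>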


== Experimental results ==

In practice, analyses have demonstrated high levels of convergence between expert and CBA standards: the values quantifying those standards are highly correlated (''R''<sub>s-e,s-c</sub> ranging from 0.72 to 0.95), and scores based on those standards are also highly correlated (''R''<sub>e,c</sub> ranging from 0.88 to 0.99), provided the sample sizes of both groups are large (Legree, Psotka, Tremble & Bourne, 2005). For the five domains with both expert and CBA standards, and for an additional three domains that could only be scored using consensus-based measurement, scores based on expert and CBA standards correlated with relevant criteria. This convergence between CBA-referenced and expert-referenced scores, together with the associated validity data, indicates that CBA-based and expert-based scoring evaluate item options consistently, provided the ratings data used to construct the scoring rubrics are collected from large samples of experts and examinees.

== Factor analysis ==


CBA is often computed as the Pearson correlation of each person's [[Likert scale]] judgments across a set of items with the mean of all people's judgments on those same items. The correlation is then a measure of that person's proximity to the consensus. It is also sometimes computed as a standardized deviation score from the consensus means of the group. These two procedures are mathematically isomorphic. If culture is considered to be shared knowledge, and the mean of the group's ratings on a focused domain of knowledge is considered a measure of the cultural consensus in that domain, then both procedures assess CBA as a measure of an individual person's cultural understanding.
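
One natural reading of this isomorphism can be checked numerically: once a person's profile and the consensus profile are both standardized, the Pearson correlation with the consensus is an exact linear function of the mean squared deviation from it (''r'' = 1 − MSD/2). A sketch with hypothetical data (names and numbers invented):

<syntaxhighlight lang="python">
import numpy as np

def zscore(v):
    # Population standardization (ddof=0); assumes the profile is not constant.
    return (v - v.mean()) / v.std()

rng = np.random.default_rng(1)
ratings = rng.integers(1, 8, size=(30, 12)).astype(float)  # hypothetical Likert data
consensus = zscore(ratings.mean(axis=0))

for person in ratings[:5]:
    z = zscore(person)
    r = np.corrcoef(z, consensus)[0, 1]        # correlation-based CBA score
    msd = np.mean((z - consensus) ** 2)        # (standardized) deviation-based score
    print(round(r, 6), round(1 - msd / 2, 6))  # the two columns agree
</syntaxhighlight>

Ranking people by correlation with the consensus or by standardized deviation from it therefore orders them identically.
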
However, it may be that the consensus is not evenly distributed over all subordinate items about a topic. Perhaps the knowledge content of the items is distributed over domains with differing consensus. For instance, conservatives who are libertarians may feel differently about invasion of privacy than conservatives who feel strongly about law and order. In fact, standard [[factor analysis]] brings this issue to the fore.


In either centroid or [[principal components analysis]] (PCA), the first factor scores are created by multiplying each rating by the correlation of the factor (usually the mean of all standardized ratings for each person) with each item's ratings. This multiplication weights each item by the correlation of the pattern of individual differences on that item (the component scores). If consensus is unevenly distributed over these items, some items may bear more directly on the overall issues of the common factor. If an item correlates highly with the pattern of overall individual differences, then it is weighted more strongly in the overall factor scores. This weighting implicitly also weights the CBA score, since it is the items sharing a common CBA pattern of consensus that are weighted more heavily in the factor analysis.
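
A PCA-based sketch (invented data; not the centroid procedure, with NumPy's SVD standing in for a dedicated factor analysis routine) of how item weights emerge from the first component and carry through to the factor scores:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
X = rng.integers(1, 8, size=(40, 10)).astype(float)  # hypothetical people x items

# Standardize each item (column), as in a correlation-matrix PCA.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# First principal component via SVD of the standardized data matrix.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
weights = Vt[0]              # how strongly each item tracks the common factor
factor_scores = Z @ weights  # each person's score on that first factor

# Items whose pattern of individual differences matches the common factor get
# the largest |weights|, so they dominate the factor scores and, implicitly,
# any CBA score built from them.
print(np.round(weights, 2))
print(np.round(factor_scores[:5], 2))
</syntaxhighlight>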


The transposed, or Q technique, factor analysis (see [[Q methodology]]) brings this relationship out explicitly. CBA scores are statistically isomorphic to the component scores of a Q technique PCA: they are the loadings of each person's responses on the mean of all people's responses. Q technique analysis may therefore provide a superior CBA measure if it is first used to select the people who represent the dominant dimension over the items that best represent a subordinate attribute dimension of a domain (such as liberalism in a political domain); factor analysis can then provide the CBA of individuals along that particular axis of the domain.
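
A rough sketch of the transposed view (again with invented data): in Q technique the people, rather than the items, are treated as the variables to be correlated, and each person's loading on the average profile serves as the CBA score.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(1, 8, size=(40, 10)).astype(float)  # hypothetical people x items

# Q technique transposes the problem: a Q factor analysis would decompose
# np.corrcoef(X), the person-by-person correlation matrix.
mean_profile = X.mean(axis=0)
cba = np.array([np.corrcoef(person, mean_profile)[0, 1] for person in X])

# People loading highest on the consensus define the dominant dimension; a
# follow-up analysis restricted to them could then key a subordinate
# dimension (e.g., one wing of a political domain).
dominant = np.argsort(cba)[-10:]
print(dominant, np.round(cba[dominant], 2))
</syntaxhighlight>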


In practice, when items are not easily created and arrayed to provide a highly reliable scale, the Q factor analysis is not necessary, since the original factor analysis should also select the items that have a common consensus. For instance, in a scale of items about political attitudes, the items may ask about attitudes toward big government, law and order, economic issues, labor issues, or libertarian issues. Which of these items bear most strongly on the political attitudes of the groups polled may be difficult to determine a priori. However, since factor analysis is a symmetric computation on the matrix of items and people, the original factor analysis of the items (when these are Likert scales) selects not just the items that are in a similar domain but, more generally, the items that have a similar consensus, without demanding a prior Q factor analysis to select the people with the greatest consensus, whose ratings correlate best with the different groups of political views. An added advantage of this factor analytic technique is that items are automatically arranged along a factor so that the highest Likert ratings are also the highest CBA standard scores. Once selected, that factor determines the CBA (component) scores.


== Critiques ==


The most common critique of CBA standards is to question how an average could possibly be a maximal standard.
One common rejoinder is to point out how averaging multiple images of stellar objects taken through a noisy atmosphere can improve the clarity of the image ([[speckle imaging]]); NASA performs such image averaging routinely. A more theoretical rejoinder is to point out that there is a long history in psychology of thinking about concept formation and mental representations as averages, beginning with [[Francis Galton]]'s early speculations about image averaging in the 1880s and [[composite photography]]. [[Digital compositing]] creates composite images, such as faces, that are more "representative" and "beautiful".


==See also==
==References==


Legree, P. J., Psotka, J., Tremble, T. R., & Bourne, D. (2005). Using Consensus Based Measurement to Assess Emotional Intelligence. In R. Schulze & R. Roberts (Eds.), ''International Handbook of Emotional Intelligence'' (pp. 99–123). Berlin, Germany: Hogrefe & Huber.




== External links ==
* [http://www.smartmobs.com/ Smart Mobs]
* [http://www.randomhouse.com/features/wisdomofcrowds/index.html The Wisdom of Crowds]




[[Category:Intelligence]]
[[Category:Statistics]]
