PersonaMatrix Lab · Research Notes

Establishing Baseline Reliability for LLM-Based Psychometric Scoring

January 14, 2026 · Anatoliy Drobakha · 4 min read

Initial reliability analysis of GPT-4o as a psychometric scorer across three validated instruments. Inter-rater agreement (κ = 0.81) meets the threshold for research use with structured prompting.

Background

This note documents the first systematic reliability pass for using GPT-4o as a psychometric scorer within the PersonaMatrix framework.

Method

We compared GPT-4o scores against two trained human raters on 120 response samples drawn from three instruments: IPIP-NEO-120, HEXACO-60, and a custom reflective writing rubric.

Scoring prompts were structured using a chain-of-thought template with explicit anchoring examples per dimension.
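The note does not reproduce the actual scoring template, so the following is a hypothetical sketch of what a chain-of-thought prompt with explicit per-dimension rubric anchors might look like; the dimension names, anchor texts, and helper names are illustrative, not taken from the study.

```python
# Hypothetical chain-of-thought scoring template with rubric anchors.
# All placeholder values below are illustrative, not the study's actual prompts.
SCORING_PROMPT = """\
You are scoring a participant response on the dimension: {dimension}.

Rubric anchors:
  1 (low):  {anchor_low}
  3 (mid):  {anchor_mid}
  5 (high): {anchor_high}

Response:
{response}

First, reason step by step about which anchor the response most
resembles. Then output a single integer score from 1 to 5 on the
final line, formatted as: SCORE: <n>
"""

def build_prompt(dimension: str, anchors: dict, response: str) -> str:
    """Fill the template with one dimension's anchors and a response."""
    return SCORING_PROMPT.format(
        dimension=dimension,
        anchor_low=anchors["low"],
        anchor_mid=anchors["mid"],
        anchor_high=anchors["high"],
        response=response,
    )
```

Anchoring each score level to a concrete exemplar, rather than asking for a bare number, is the design choice the method relies on.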

Results

Inter-rater agreement between GPT-4o and the consensus human score reached κ = 0.81 (almost perfect agreement by the Landis and Koch benchmarks, which place 0.81–1.00 in that band). Agreement was highest for Conscientiousness (κ = 0.87) and lowest for Openness (κ = 0.74).
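Cohen's κ can be computed directly from the two raters' paired labels; a minimal pure-Python sketch (variable names are illustrative, not from our scoring pipeline):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance
    from each rater's marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Identical label sequences yield κ = 1.0, while agreement at exactly the chance rate yields κ = 0.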

Implications

These results suggest that structured LLM scoring is viable for research-grade psychometric work when prompts include explicit rubric anchors and examples. We will proceed to Phase 2 validation with a larger sample (n = 500).

Next Steps

  • Expand sample to n = 500 across diverse demographic groups
  • Test prompt robustness across model versions (GPT-4o-mini, Claude 3.5)
  • Document failure modes for Openness dimension