Background
This note documents the first systematic reliability pass for using GPT-4o as a psychometric scorer within the PersonaMatrix framework.
Method
We compared GPT-4o scores against two trained human raters on 120 response samples drawn from three instruments: IPIP-NEO-120, HEXACO-60, and a custom reflective writing rubric.
Scoring prompts were structured with a chain-of-thought template that included explicit anchor examples for each dimension.
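A template of this shape can be sketched as follows. This is a hypothetical illustration, not the actual PersonaMatrix prompt: the dimension names are standard, but the anchor wording, scale, and output format here are assumptions.

```python
# Hypothetical per-dimension scoring prompt with rubric anchors and a
# chain-of-thought instruction; the real template is not shown in this note.
SCORING_PROMPT = """\
You are scoring a written response on the dimension: {dimension}.

Rubric anchors (illustrative):
  1 = {anchor_low}
  3 = {anchor_mid}
  5 = {anchor_high}

First, reason step by step about which anchor the response most
resembles, citing specific phrases. Then output a single integer 1-5.

Response:
{response_text}
"""

prompt = SCORING_PROMPT.format(
    dimension="Conscientiousness",
    anchor_low="disorganized, avoids planning",
    anchor_mid="moderately reliable, plans inconsistently",
    anchor_high="highly organized, detail-oriented",
    response_text="I keep a daily checklist and review it each evening.",
)
print(prompt)
```

The key design choice is that the model must cite anchor-relevant evidence before emitting a score, which constrains the final integer to the rubric rather than the model's prior.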
Results
Inter-rater agreement between GPT-4o and the consensus human score reached κ = 0.81, which falls in the "almost perfect" band of the Landis-Koch scale. Agreement was highest for Conscientiousness (κ = 0.87) and lowest for Openness (κ = 0.74).
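For reference, unweighted Cohen's κ between two raters can be computed in a few lines. This is a minimal sketch on toy data, not the study's analysis pipeline; the note does not specify whether unweighted or weighted κ was used.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two equal-length lists of
    categorical scores: (observed - expected) / (1 - expected)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Toy example: two raters scoring six items on a 1-5 scale.
a = [1, 2, 3, 4, 5, 3]
b = [1, 2, 3, 4, 5, 2]
print(round(cohens_kappa(a, b), 3))  # → 0.793
```

With multiple categories, a weighted variant (linear or quadratic) is often preferred for ordinal rubric scores, since it penalizes near-misses less than distant disagreements.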
Implications
These results suggest that structured LLM scoring is viable for research-grade psychometric work when prompts include explicit rubric anchors and examples. We will proceed to Phase 2 validation with a larger sample (n = 500).