PersonaMatrix Lab · Research Notes

Establishing Baseline Reliability for LLM-Based Psychometric Scoring

January 14, 2026 · Anatoliy Drobakha · 4 min read

Initial reliability analysis of GPT-4o as a psychometric scorer across three validated instruments. Inter-rater agreement (κ = 0.81) meets the threshold for research use with structured prompting.

Background

This note documents the first systematic reliability pass for using GPT-4o as a psychometric scorer within the PersonaMatrix framework.

Method

We compared GPT-4o scores against two trained human raters on 120 response samples drawn from three instruments: IPIP-NEO-120, HEXACO-60, and a custom reflective writing rubric.

Scoring prompts were structured using a chain-of-thought template with explicit anchoring examples per dimension.
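The note does not reproduce the actual scoring template, so the following is a hypothetical sketch of what a chain-of-thought prompt with explicit per-dimension rubric anchors might look like; the dimension names, anchor texts, and helper names are illustrative, not taken from the study.

```python
# Hypothetical chain-of-thought scoring template with rubric anchors.
# All placeholder values below are illustrative, not the study's actual prompts.
SCORING_PROMPT = """\
You are scoring a participant response on the dimension: {dimension}.

Rubric anchors:
  1 (low):  {anchor_low}
  3 (mid):  {anchor_mid}
  5 (high): {anchor_high}

Response:
{response}

First, reason step by step about which anchor the response most
resembles. Then output a single integer score from 1 to 5 on the
final line, formatted as: SCORE: <n>
"""

def build_prompt(dimension: str, anchors: dict, response: str) -> str:
    """Fill the template with one dimension's anchors and a response."""
    return SCORING_PROMPT.format(
        dimension=dimension,
        anchor_low=anchors["low"],
        anchor_mid=anchors["mid"],
        anchor_high=anchors["high"],
        response=response,
    )
```

Anchoring each score level to a concrete exemplar, rather than asking for a bare number, is the design choice the method relies on.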

Results

Inter-rater agreement between GPT-4o and the consensus human score reached κ = 0.81 (almost perfect agreement by the Landis and Koch benchmarks, which place 0.81–1.00 in that band). Agreement was highest for Conscientiousness (κ = 0.87) and lowest for Openness (κ = 0.74).
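Cohen's κ can be computed directly from the two raters' paired labels; a minimal pure-Python sketch (variable names are illustrative, not from our scoring pipeline):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance
    from each rater's marginal label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Identical label sequences yield κ = 1.0, while agreement at exactly the chance rate yields κ = 0.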

Implications

These results suggest that structured LLM scoring is viable for research-grade psychometric work when prompts include explicit rubric anchors and examples. We will proceed to Phase 2 validation with a larger sample (n = 500).

Next Steps

  • Expand sample to n = 500 across diverse demographic groups
  • Test prompt robustness across model versions (GPT-4o-mini, Claude 3.5)
  • Document failure modes for Openness dimension