What Changed in v2.1
The v2.0 prompt architecture used a single-pass scoring approach where all five dimensions were scored in one LLM call. This created cross-dimension contamination: for example, the reasoning produced while scoring Neuroticism influenced how the model framed Agreeableness.
v2.1 changes:
1. Dimension isolation: each Big Five dimension is scored in a separate LLM call with no prior dimension context (see the sketch after this list).
2. Chain-of-thought anchoring: the model is required to identify specific textual evidence before assigning a score.
3. Confidence calibration: the model outputs a confidence score (0–1) alongside each dimension score; low-confidence outputs are flagged for human review.
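A minimal sketch of how this pipeline could be wired up, assuming an OpenAI-style chat-completion client. The model name, confidence threshold, JSON reply format, and function names are illustrative assumptions, not the project's actual implementation:

```python
import json
from openai import OpenAI  # assumed client; any chat-completion API would work

client = OpenAI()
DIMENSIONS = ["Openness", "Conscientiousness", "Extraversion", "Agreeableness", "Neuroticism"]
CONFIDENCE_THRESHOLD = 0.6  # illustrative cutoff for flagging human review

def score_dimension(dimension: str, participant_text: str) -> dict:
    """Score one Big Five dimension in its own call, with no other dimension in context."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a psychometric scoring assistant..."},
            {"role": "user", "content": (
                f"[DIMENSION]: {dimension}\n"
                f"Text to score: {participant_text}\n"
                "Step 1: Identify evidence. Step 2: Assign a 1-5 score. "
                "Step 3: Give a 0-1 confidence. "
                'Reply as JSON: {"evidence": ..., "score": ..., "confidence": ...}'
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)

def score_all(participant_text: str) -> list[dict]:
    """Run each dimension as an isolated call and flag low-confidence outputs."""
    results = []
    for dim in DIMENSIONS:
        result = score_dimension(dim, participant_text)  # fresh call per dimension
        result["dimension"] = dim
        result["needs_review"] = result["confidence"] < CONFIDENCE_THRESHOLD
        results.append(result)
    return results
```

Because each dimension gets its own call, the per-dimension prompts stay short and no earlier dimension's chain of thought can leak into a later score.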
Performance Impact
On our internal benchmark (n = 200 scored responses), v2.1 achieved:
Prompt Template
The full prompt template is available in the project repository. Key structure:
```
System: You are a psychometric scoring assistant...
User: [DIMENSION]: {dimension_name}
Anchor low (1): {low_anchor_example}
Anchor high (5): {high_anchor_example}
Text to score: {participant_text}
Step 1: Identify evidence...
Step 2: Assign score...
Step 3: Confidence...
```
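For illustration, here is how the placeholders might be filled for a single dimension. The anchor sentences and participant text below are invented examples; the real calibrated anchors live in the project repository:

```python
# Hypothetical rendering of the template; anchor texts are invented examples.
TEMPLATE = (
    "[DIMENSION]: {dimension_name}\n"
    "Anchor low (1): {low_anchor_example}\n"
    "Anchor high (5): {high_anchor_example}\n"
    "Text to score: {participant_text}\n"
    "Step 1: Identify evidence...\n"
    "Step 2: Assign score...\n"
    "Step 3: Confidence..."
)

prompt = TEMPLATE.format(
    dimension_name="Extraversion",
    low_anchor_example="I usually keep to myself at gatherings.",
    high_anchor_example="I'm always the one starting conversations in a crowd.",
    participant_text="I enjoy meeting new people but need quiet evenings to recharge.",
)
print(prompt)
```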
Next Version
v2.2 will explore multi-turn dialogue for ambiguous responses.