Job Description
- Deconstruct complex semantic and behavioral model outputs into observable, quantifiable rating criteria that function much like diagnostic behavioral anchors.
- Establish baseline Inter-Rater Reliability (IRR) metrics (e.g., intraclass correlation coefficient (ICC), Cohen’s/Fleiss’ kappa) and architect programmatic pipelines to monitor the longitudinal psychometric health of data collections.
- Partner with ML Engineering to integrate human assessment workflows directly into the model development lifecycle and deploy automated evaluation tooling.
- Design automated data quality checks to detect careless responding, systematic rater bias, straight-lining, and other threats to data integrity at scale.
- Apply advanced statistical frameworks (drawing on Classical Test Theory or Item Response Theory) to detect rater drift and identify differential item functioning, then implement systemic interventions to address them.
