ISSN 1975-6321 (Print)
ISSN 2713-8372 (Online)
통번역학연구, Vol. 29, No. 1 (2025)
pp. 29-51
How Does AI Evaluate Korean-English Translation? On the Correlation between ChatGPT and Human Ratings and the Characteristics of ChatGPT's Evaluations
In this study, we carried out a series of experiments to explore how ChatGPT (version 4o) evaluated Korean-English translations. Using two datasets of human translations (n=57) and two datasets of post-edited translations (n=56), all drawn from Lee and Lee (2021), we adopted two evaluation approaches with strict prompt control. In Experiment A, ChatGPT rated the four datasets freely on a five-point scale without specific criteria. In Experiment B, which was conducted concurrently with Experiment A, ChatGPT rated the same datasets using a prescribed, criterion-referenced five-point scale. To assess intra-rater reliability, we repeated both experiments one month later. This study yielded both quantitative and qualitative findings, including the following: (1) ChatGPT’s average scores differed significantly from those of human raters; (2) correlations between human and ChatGPT scores ranged from ‘moderate’ to ‘strong’; (3) the use of the prescribed rating scale improved ChatGPT’s reliability as a rater; (4) ChatGPT exhibited very low intra-rater reliability; and (5) ChatGPT’s self-justifications for its ratings varied in quality, often failing to identify obvious errors.
Keywords: translation evaluation, post-editing, translation quality, translation education, ChatGPT as a translation rater
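
The abstract describes prompting ChatGPT (4o) to rate translations on a five-point scale, with and without a prescribed rubric, and then correlating those ratings with human scores. The paper's actual prompts and tooling are not given here, so the following is only a minimal sketch of how such a criterion-referenced rating request and the correlation step could be scripted against the OpenAI API. The rubric wording and the helper functions rate_translation and correlate are illustrative assumptions, not the authors' procedure.

```python
# Illustrative sketch only: the study's exact prompts, rubric, and tooling are not
# disclosed in the abstract. Assumes the OpenAI Python SDK (openai>=1.0) and SciPy.
from openai import OpenAI
from scipy.stats import pearsonr, spearmanr

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical criterion-referenced five-point rubric (not the paper's wording).
RUBRIC = """Rate the English translation of the Korean source on a 1-5 scale:
5 = accurate and fluent; 4 = minor errors; 3 = noticeable errors that do not
block comprehension; 2 = serious accuracy or fluency problems; 1 = unusable."""

def rate_translation(source_ko: str, translation_en: str) -> int:
    """Ask gpt-4o for a single 1-5 rating of one Korean-English translation."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # limit run-to-run variation when repeating the rating
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Source (Korean): {source_ko}\n"
                                        f"Translation (English): {translation_en}\n"
                                        "Reply with the score only."},
        ],
    )
    # Take the leading digit of the reply as the score.
    return int(response.choices[0].message.content.strip()[0])

def correlate(human_scores: list[float], model_scores: list[float]) -> None:
    """Report Pearson and Spearman correlations between human and ChatGPT ratings."""
    r, p_r = pearsonr(human_scores, model_scores)
    rho, p_rho = spearmanr(human_scores, model_scores)
    print(f"Pearson r = {r:.2f} (p = {p_r:.3f}), "
          f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```

Repeating rate_translation over the same dataset after an interval and correlating the two runs would give a comparable (if simplified) check of intra-rater reliability to the one reported in the study.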
