Large Language Model-Based Automated Item Generation in STEM Assessments: Historical Mapping and a Scoping Review of Empirical Studies
Educational assessments, from low-stakes classroom tests to high-stakes national examinations, require item pools that are valid, fair, and secure. Automated Item Generation (AIG) aims to efficiently produce large pools of calibrated test items. This paper adopts a two-part design: (1) a brief historical mapping that situates large language model (LLM)-based AIG within the broader AIG trajectory; and (2) a scoping review of empirical studies on LLM-based AIG for STEM assessments published between January 2022 and January 2026. A structured search of ERIC, Lens, and OpenAlex yielded 1,267 records; after deduplication and screening, 7 studies were retained for synthesis. In all studies, LLMs were used primarily to draft stems, keys, distractors, and explanations by prompting instruction-tuned models, sometimes enhanced with retrieval and human-in-the-loop review. Empirical evidence on item quality is generally promising: several studies reported acceptable expert evaluations and, in a subset, psychometric properties comparable to those of human-authored items. Nevertheless, recurrent limitations were observed, including factual inaccuracies, construct drift, poor calibration of item difficulty, and variable distractor plausibility. Few studies reported robust fairness audits or provided reproducibility details such as complete prompts and decoding settings. Overall, LLM-based AIG can substantially increase throughput in STEM item development, but high-stakes deployment requires layered validation protocols (expert review, pilot testing, psychometric analysis, and bias audits) and governance controls to ensure traceability and item security.