PinnedMMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it MeasuresIn examining the low performance of large language models on the Moral Scenarios task, part of the widely-used MMLU benchmark by Hendrycks…Sep 27, 20231Sep 27, 20231
PinnedPreliminary Analysis of MMLU-by-task: Insights from the Evaluation of Over 500 Open Source ModelsRecently Hugging face released a dataset of evaluation results for the Measuring Massive Multitask Language Understanding (MMLU)…Aug 7, 2023Aug 7, 2023