MMLU’s Moral Scenarios Benchmark Doesn’t Measure What You Think it Measures
In examining the low performance of large language models on the Moral Scenarios task, part of the widely-used MMLU benchmark by Hendrycks…
Sep 27, 2023
Preliminary Analysis of MMLU-by-task: Insights from the Evaluation of Over 500 Open Source Models
Recently, Hugging Face released a dataset of evaluation results for the Measuring Massive Multitask Language Understanding (MMLU)…
Aug 7, 2023