Evaluation
The table presents model performance across difficulty levels. All LLMs' performance degrades as difficulty increases. Even at the easy level, SCALAR already differentiates LLMs' long-context capabilities: the best model, GPT-4o, achieves 95% accuracy, while the lowest-performing models hover around 30-37%, against a random baseline of 25%. On the hard set, half of the models perform at chance level, and even current SOTA models answer fewer than half of the questions correctly, demonstrating how challenging our dataset is.
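To make "chance level" concrete: the 25% baseline implies a four-option multiple-choice setup, so whether a model's accuracy is distinguishable from random guessing can be checked with a one-sided binomial test. The sketch below is illustrative only; the function name and the example counts are hypothetical and not taken from our evaluation.

```python
from scipy.stats import binomtest

RANDOM_BASELINE = 0.25  # four-option multiple choice, as implied by the 25% baseline


def above_chance(num_correct: int, num_questions: int, alpha: float = 0.05) -> bool:
    """Test whether observed accuracy is significantly above the random baseline.

    Uses a one-sided binomial test of H0: accuracy == RANDOM_BASELINE
    against H1: accuracy > RANDOM_BASELINE.
    """
    result = binomtest(num_correct, num_questions,
                       p=RANDOM_BASELINE, alternative="greater")
    return result.pvalue < alpha


# Hypothetical example: 37 correct out of 100 questions (37% accuracy)
# is significantly above the 25% baseline at alpha = 0.05.
print(above_chance(37, 100))  # True
```

Under this kind of test, a model whose accuracy is not significantly above 25% would be counted as performing at random, which is the sense in which half of the models fail the hard set.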