Evaluating the Riemann Sum

MathVista: Evaluating Math Reasoning in Visual Contexts

🔔 The automatic evaluation on CodaLab is under construction. The MathVista dataset is derived from three newly collected datasets: IQTest, FunctionQA, and Paper, as well as 28 other source datasets.

GitHub

MIRAI: Evaluating LLM Agents for Event Forecasting

@misc{ye2024miraievaluatingllmagents, title={MIRAI: Evaluating LLM Agents for Event Forecasting}, author={Chenchen Ye and Ziniu Hu and Yihe Deng and Zijie Huang and Mingyu Derek Ma and Yanqiao Zhu and ...

Journal of Medical Internet Research

Evaluating Conversational Agents for Mental Health: Scoping Review of Outcomes and Outcome Measurement Instruments

We included experimental studies evaluating CA mental health interventions. The screening and data extraction were performed independently by 2 review authors in parallel. Descriptive and thematic ...

Microsoft

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate ...
