CUP&A Public Research Dataset Releases

The Cambridge PictureStories Dataset

Paper: https://www.repository.cam.ac.uk/items/4995fd8e-1edc-4137-9e19-f09edb6c79f7

Description: The Cambridge PictureStories Dataset contains essays written by second language (L2) learners of English at various proficiency levels, responding to prompts featuring a series of images which depict a story. For instance, in the pictures below, a man is seated reading a book, then he walks away leaving his umbrella behind, and in the final image a woman returns the umbrella to him. 

The Umbrella picture story

 

Essays were submitted to the Write & Improve essay practice platform (W&I), on which automated error feedback and marking are provided. The learner can then revise and resubmit their essay in order to request new error feedback and marks. The auto-marker provides an estimate of writing proficiency level without specifically marking the relevance of the essay to the images, or checking whether it is a sufficient description of them. We therefore define a marking rubric for the relevance and sufficiency of essays written in response to picture-story prompts. Within this new rubric, an essay responding to picture-stories should be written in the style of a story, should be relevant and comprehensive in writing about the images, and should be sufficiently descriptive so that the reader can re-draw the images based on the text.

For this PictureStories dataset, we include essays written in response to 5 different picture-story prompts. We sampled 713 essays written by W&I users in 2024; each is the learner's first attempt at the prompt and meets the prompt's word limit requirements. Six expert annotators labelled the essays according to our proposed 5-part marking rubric, described in the paper. Note that not all of the essays are good attempts at responding to the prompts: some are off-topic or otherwise fail to meet the requirements. They are included so that the dataset contains both positive and negative examples. In the paper we describe our experiments on automatically predicting the reference marks for each essay by fine-tuning and prompting several LLMs.

Data security: Please be aware of the problems of leaking benchmark datasets to LLMs (e.g. Balloccu et al., EACL 2024). Please only use this dataset with LLMs hosted locally (e.g. models downloaded from Hugging Face and run with Transformers) or, if using LLMs via commercial APIs, only where no data is retained for training.

Publication date: 2026

Keywords: Cambridge University Press & Assessment, Common European Framework of Reference for Languages, CEFR, learners of English as a second language, essay writing, picture stories, Write & Improve

Citing this paper: Marie Bexte, Andrew Caines, Diane Nicholls, Paula Buttery, and Torsten Zesch (2026). PictureStories: Predicting the Task Adherence of Language Learner Answers to a Picture-Story-Based Writing Task. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Association for Computational Linguistics.

@inproceedings{picturestories,
  author = {Marie Bexte and Andrew Caines and Diane Nicholls and Paula Buttery and Torsten Zesch},
  year = {2026},
  title = {PictureStories: Predicting the Task Adherence of Language Learner Answers to a Picture-Story-Based Writing Task},
  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  publisher = {Association for Computational Linguistics}
}

You may publish the results of research using this dataset.  In any such publication you must acknowledge use of the dataset in your research by citing Cambridge University Press & Assessment and the Authors and Contributors as shown. 

We ask you to inform us of any such publications by emailing: researchdatasets@cambridge.org

Please report any issues or problems in downloading the dataset by emailing: researchdatasets@cambridge.org

 

Licence Agreement 

  1. By downloading this dataset and licence, this licence agreement (the “Agreement”) is entered into, effective this date, between you (the “Licensee”), and the Chancellor, Masters and Scholars of the University of Cambridge acting through its department Cambridge University Press & Assessment (the “Licensor”).

     

  2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee, nor shall the Licensee have any rights in the dataset other than the right to use the dataset in accordance with this Agreement.

 

  3. The Licensor hereby grants the Licensee a non-exclusive non-transferable right to use the licensed dataset for non-commercial research and educational purposes only. The Licensee shall not sub-licence or assign the benefit or burden of this Agreement in whole or in part.

 

  4. Non-commercial purposes exclude without limitation any use of the licensed dataset or information derived from the dataset for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.

 

  5. The Licensee shall expressly acknowledge and reference the Licensor when making use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the paper at the top of the dataset details page.

 

  6. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.

 

  7. The Licensor grants the Licensee this right to use the licensed dataset "as is". Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever. The Licensor has no liability for any loss or damage whatsoever sustained by Licensee as a result of the availability or use of or reliance on the dataset.

 

  8. The Licensor shall not be liable for any indirect or consequential loss or damage or for any loss of or corruption of data, loss of programs, profit or goodwill (whether direct or indirect) arising out of or in connection with the access, availability, use of or reliance on the dataset.

 

  9. The Licensee shall indemnify and hold the Licensor harmless against any loss or damage which it may suffer or incur as a result of the Licensee’s breach of any terms of this Agreement.

 

  10. This Agreement constitutes the entire agreement between the parties and supersedes any previous agreement between the parties relating to its subject-matter. Each party acknowledges and agrees that, in entering into this Agreement, it does not rely on, and shall have no remedy in respect of, any statement, representation, warranty or understanding (whether negligently or innocently made) other than as expressly set out in this Agreement.

 

  11. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction.

 

 

You may download this dataset if you agree to the licence terms above and complete the following registration form.  Publications using this dataset must acknowledge and reference Cambridge University Press & Assessment as the source of the data.


 

Registration form

Name
Title