
SlideImages

  • David Morris (Creator)
  • Eric Müller-Budack (German National Library of Science and Technology (TIB)) (Creator)
  • Ralph Ewerth (German National Library of Science and Technology (TIB)) (Creator)

Dataset

Description

Please note: this archive requires support for dangling symlinks, which excludes the Windows operating system.

To use this dataset, you will need to download the MS COCO 2017 detection images and extract them to a folder called coco17 in the train_val_combined directory (download: https://cocodataset.org/#download). You will also need to download the AI2D image description dataset and extract it to a folder called ai2d in the train_val_combined directory (download: https://prior.allenai.org/projects/diagram-understanding).
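The setup above can be checked with a short script. The following is a minimal sketch, assuming Python on a POSIX system; `check_layout` is a hypothetical helper and not part of the dataset. It reports which of the two expected folders (coco17, ai2d) are missing under train_val_combined and lists symlinks whose targets do not exist. It assumes, without confirmation from the dataset authors, that any link still dangling after both downloads are in place indicates an incomplete extraction.

```python
import os


def check_layout(root):
    """Return (missing_dirs, dangling_links) for a train_val_combined directory.

    Hypothetical helper: the folder names come from the dataset description,
    everything else is an illustrative assumption.
    """
    # Folders the description says must be created by the user.
    missing = [name for name in ("coco17", "ai2d")
               if not os.path.isdir(os.path.join(root, name))]

    dangling = []
    # os.walk does not follow directory symlinks by default, so each link
    # is inspected once, whether it points at a file or a directory.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return missing, dangling


if __name__ == "__main__":
    missing, dangling = check_layout("train_val_combined")
    print("missing folders:", missing)
    print("dangling symlinks:", len(dangling))
```

Running it once before and once after extracting the two external datasets gives a quick view of how many links remain unresolved.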

License Notes for Train and Val: Since the images in this dataset come from different sources, they are bound by different licenses.

Images for bar charts, x-y plots, maps, pie charts, tables, and technical drawings were downloaded directly from Wikimedia Commons. License and authorship information is stored independently for each image in these categories in the wikimedia_commons_licenses.csv file. Each row (note: some rows are multi-line) is formatted so: ,,,;

Images in the slides category were taken from presentations that were downloaded from Wikimedia Commons. The name of each presentation on Wikimedia Commons omits the image file's trailing underscore, number, and file extension, and ends with .pdf instead. The source materials' licenses are listed in source_slices_licenses.csv.
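The naming rule for slide images can be expressed as a small function. This is a minimal sketch under the assumption that slide image file names have the shape name, underscore, number, extension; `presentation_name` and the example file name are hypothetical and not taken from the dataset.

```python
import re


def presentation_name(image_filename):
    """Map a slide image file name to its presentation name on Wikimedia Commons.

    Drops the trailing underscore, number, and file extension, then
    appends ".pdf", as described in the license notes.
    """
    match = re.fullmatch(r"(.+)_\d+\.[^.]+", image_filename)
    if match is None:
        # The name does not follow the assumed pattern; do not guess.
        raise ValueError(f"unexpected slide image name: {image_filename!r}")
    return match.group(1) + ".pdf"


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(presentation_name("Example_Talk_12.png"))
```

For the hypothetical input `Example_Talk_12.png` this yields `Example_Talk.pdf`, which can then be looked up in the license file.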

The information page for each Wikimedia Commons image can be found at "https://commons.wikimedia.org/wiki/File:".
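A minimal sketch of building such a page address, assuming the file name is appended directly to the prefix given above, with spaces written as underscores as is usual for Wikimedia Commons page titles. `commons_info_url` and the example file name are hypothetical.

```python
from urllib.parse import quote

# Prefix as given in the dataset description.
COMMONS_FILE_PREFIX = "https://commons.wikimedia.org/wiki/File:"


def commons_info_url(filename):
    """Build the information page URL for a Wikimedia Commons file name."""
    # Page titles use underscores in place of spaces; quote() percent-encodes
    # any remaining characters that are not safe in a URL path.
    return COMMONS_FILE_PREFIX + quote(filename.replace(" ", "_"))


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(commons_info_url("Example chart.png"))
```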

License Notes for Testing: The testing images have been uploaded to SlideWiki by SlideWiki users. The image authorship and copyright information is available in authors.csv.

Further information can be found for each image using the SlideWiki file service. Documentation is available at https://fileservice.slidewiki.org/documentation#/ and in particular: metadata is available at "https://fileservice.slidewiki.org/metadata/", and the image can be accessed at "https://fileservice.slidewiki.org/picture/".

This is the SlideImages dataset, which has been assembled for the SlideImages paper. If you find the dataset useful, please cite our paper: https://doi.org/10.1007/978-3-030-45442-5_36
Date made available: 2020
Publisher: Forschungsdaten-Repositorium der LUH
  • SlideImages: A dataset for educational image classification

    Morris, D., Müller-Budack, E. & Ewerth, R., 8 Apr 2020, Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Proceedings. Jose, J. M., Yilmaz, E., Magalhães, J., Martins, F., Castells, P., Ferro, N. & Silva, M. J. (eds.). Cham: Springer, p. 289-296 8 p. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); vol. 12036 LNCS).

    Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
