Skip to main navigation Skip to search Skip to main content

In-Memory Indexed Caching for Distributed Data Processing

Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Research output: Chapter in book/report/conference proceedingConference contributionResearchpeer review

Abstract

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds-up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.
Original languageEnglish
Title of host publicationProceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages104-114
Number of pages11
ISBN (Electronic)9781665481069
DOIs
Publication statusPublished - 2022
Externally publishedYes
Event36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 - Virtual, Online, France
Duration: 30 May 20223 Jun 2022

Publication series

NameProceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Conference

Conference36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022
Country/TerritoryFrance
CityVirtual, Online
Period30 May 20223 Jun 2022

Keywords

  • cs.DC

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Computer Science Applications

Cite this