Talk:Apache Spark

From Wikipedia, the free encyclopedia

NPOV?

Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database.

Sounds like someone has an axe to grind here. Isn't everything in Spark read-only (i.e., that is one of the intentional aspects of the design: "it's not a bug, it's a feature")? Harping on how Spark isn't a database sounds a lot like somebody doesn't like it, or has something else they want people to use or buy. -- Preceding unsigned comment added by 75.73.1.89 (talk) 15:55, 28 September 2016 (UTC)

I wrote that line in this Wikipedia article. I'm also the author of the book Spark GraphX in Action. I attempted to present a balanced view, and chose to highlight the immutability of graphs because the question sometimes comes up on the Apache mailing lists. See [1] and [2]. Also, until recently, GraphX was listed in the Graph database article! See [3]. The lack of mutability was even acknowledged as a weakness by Ankur Dave, one of the primary authors of GraphX, and he attempted to address it via the external package IndexedRDD. Michaelmalak (talk) 17:48, 28 September 2016 (UTC)

Links to potential references

RDD Versus Dataset.

This article states that Spark is built around RDDs, but the official documentation at https://spark.apache.org/docs/latest/quick-start.html says that RDD is deprecated and Datasets are the new paradigm. It's beyond my knowledge and experience in Spark to fix the article, but it would be great if someone with expertise in the change could update this. I find wiki articles to be a better intro than most software documentation, so I'd love to see a good, updated intro to Spark here. -- Preceding unsigned comment added by 138.32.32.166 (talk) 17:31, 19 October 2017 (UTC)

Done. Michaelmalak (talk) 00:16, 20 October 2017 (UTC)

PySpark

PySpark redirects here but isn't actually mentioned in the article. The article should explain what PySpark is. --Jameboy (talk) 11:14, 1 November 2022 (UTC)