I am currently working with a large newline-delimited JSON dataset, nearly a terabyte in size. I want to clean the data before processing it further, but I can’t load the whole JSON file into memory at once. All I need to do is read through the whole dataset and filter out the unnecessary classes. I have heard that Apache Spark can load large volumes of data and process them in parallel using workers. What I am unsure about is which Spark API to use: the native Scala one or the Python one. I would like to know the reasons for choosing one over the other.
The choice of language depends on what you need and on what you are comfortable working with to get the job done. In most cases, Scala works better for handling large datasets. Python works better for smaller datasets and for experimental work.
To give you a binary answer, my suggestion would be to go with Scala, considering you will be working on bigger projects in production later. I’ve also written more about this here:
Let me know what you think.
Both Python and Scala are easy to program and help data experts get productive fast. Choosing a programming language for Apache Spark depends on the type of application to be developed.
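For the cleaning job described in the question, the filter itself is a per-record check, so it looks much the same in either language. Here is a minimal sketch in plain Python (standard library only, not Spark) that streams a newline-delimited JSON file one line at a time, so the whole file never has to fit in memory. The field name `class`, the labels in `UNWANTED_CLASSES`, and the function names are assumptions made for illustration; the question does not give them.

```python
import json
from typing import Iterable, Iterator, Set

# Hypothetical labels to drop; replace with the real unwanted classes.
UNWANTED_CLASSES = {"spam", "test"}


def filter_records(lines: Iterable[str], unwanted: Set[str],
                   field: str = "class") -> Iterator[str]:
    """Yield the JSON lines whose `field` value is not in `unwanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip values that are not JSON objects
        value = record.get(field)
        if isinstance(value, str) and value in unwanted:
            continue  # drop records belonging to an unwanted class
        yield line


def clean_file(src_path: str, dst_path: str, unwanted: Set[str]) -> None:
    """Stream src_path line by line and write the kept records to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for kept in filter_records(src, unwanted):
            dst.write(kept + "\n")
```

Calling `clean_file("input.jsonl", "cleaned.jsonl", UNWANTED_CLASSES)` (both paths hypothetical) processes the file in a single pass on one machine. Because each line is checked independently, the same check can be split across Spark workers. In PySpark it could likely be written with `spark.read.json(...)` followed by `DataFrame.filter(...)`, and the Scala version would have essentially the same shape.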