Jun 19, 2018

Big Data Juggernaut IV

Anil Vaidya

Big data is taking big leaps with Spark-based products. Cloud as well as on-premises solutions deploy Spark in their offerings. I wrote earlier that Spark is taking over from Hadoop as the big data mainstay. Spark needs to be supported by additional access mechanisms and programming languages. Not surprisingly, Python is rising to the occasion. PySpark, the Python API for Apache Spark, does this very well. It is the programming interface for accessing Spark-based datasets from Python, with built-in libraries that allow programmers to run computations on data held in Apache Spark.

This simply means that anyone who wants to use Spark from Python needs to work with PySpark too. Going further, since PySpark is based on Python, one needs to know a bit of Python as well. One of the easier ways to start working with PySpark is to use Jupyter Notebook. By now you will have gauged the number of different technologies that have to integrate before one can get into a big data project. It is imperative to combine a business mindset with a liking for technological innovation.

Technology is developing at a rapid pace, beyond imagination. The number of people and companies working in this arena is phenomenally high, and they are spread all over the world. To be successful, we have to not only keep an eye on these developments but also upgrade ourselves all the time. Just think of how many different technologies I brought together in this short blog: Spark, Python, PySpark and Jupyter Notebook, all within the ambit of big data.
