When you have set up Apache Spark and use Jupyter to run analyses on it, you’ll need to connect to the Jupyter notebooks by forwarding the port the notebooks run on to your local machine.
Depending on how the server that runs Spark is secured, you might need to do that through a “jump box”, a server that is hardened to prevent unauthorized access and lets you reach a network that’s otherwise not directly accessible from the Internet.
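To make the idea concrete, here is a minimal sketch that assembles the `ssh` command for such a tunnel. The host names are placeholders, the helper `build_tunnel_command` is a hypothetical name, and port 8888 is only Jupyter’s usual default, so adjust all of them to your setup. It relies on OpenSSH’s `-J` (ProxyJump) option, which needs OpenSSH 7.3 or newer.

```python
import shlex


def build_tunnel_command(jump_host, spark_host, remote_port=8888, local_port=8888):
    """Build an ssh argument list that forwards local_port to remote_port
    on spark_host, hopping through jump_host.

    All arguments are illustrative; nothing here is specific to Spark.
    """
    return [
        "ssh",
        "-N",              # do not run a remote command, only forward ports
        "-J", jump_host,   # connect via the jump box (OpenSSH 7.3+)
        # "localhost" is resolved on spark_host, the final destination
        "-L", f"{local_port}:localhost:{remote_port}",
        spark_host,
    ]


if __name__ == "__main__":
    # Placeholder hosts: replace with your own jump box and Spark server.
    cmd = build_tunnel_command("alice@jump.example.com", "alice@spark.internal")
    print(shlex.join(cmd))
```

Running the printed command in a terminal keeps the tunnel open; with the defaults above, the notebooks should then be reachable at `http://localhost:8888` on your own machine.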
Whether it’s for social science, marketing, business intelligence or something else, the number of data analyses that benefit from heavy-duty parallelization is growing all the time.
Apache Spark is an awesome platform for big data analysis, so getting to know how it works and how to use it is probably a good idea. Setting up and administering your own cluster is a bit of a hassle if you just want to learn the basics, though (although Amazon EMR or Databricks make that quite easy, and you can even build your own Raspberry Pi cluster if you want…), so getting Spark and PySpark running on your local machine seems like a better idea.
I’m currently evaluating different publishing workflows for my academic writing. One option that seems to be increasingly popular is to use RMarkdown as the source document, which you then compile into HTML, LaTeX or whatever else you need.
For the code used in the document, RMarkdown supports executing not only R but a whole bunch of other languages, including Python, thanks to the knitr package.
One problem I had when I first came across RMarkdown a couple of years ago was that it didn’t really support virtual environments for Python.
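A quick way to see which interpreter a Python chunk actually runs under is to ask Python itself. This is only a rough sketch using the standard library: the helper name `in_virtualenv` is my own, and the check assumes an environment created with `venv` or a recent `virtualenv`, both of which set `sys.base_prefix`.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment.

    Inside a venv, sys.prefix points at the environment while
    sys.base_prefix still points at the base installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


# Path of the interpreter that is executing this code
print(sys.executable)
# Whether that interpreter lives in a virtual environment
print(in_virtualenv())
```

If the printed path is the system Python rather than the one in your environment, the chunk is not using the virtual environment you expected.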