8 Using this Spark Guide

8.1 Data science programming

Each language has its view on the vocabulary of data munging. However, data science is starting to reach a point where each language is more like a dialect (British, Australian, or US English) than a unique language (Chinese, Finish, English). We focus on PySpark while providing dialect maps to a few other data-munging languages.

Our first dialect discussion focuses on how each names the data object upon which we will write munging scripts. R has data.frame, Pandas has DataFrame, Pyspark has DataFrame, and SQL has TABLE to describe the rectangular rows and columns that so often represent how data is stored. In this book we will use the general dataframe to refer to all of them at once.

Each data object comes with column types or classes that facilitate data munging and analysis. Most of these types are very similar. However, they are not fully interchangeable.

8.2 Example Code Snippets

The examples will show code for a few popular data-munging languages to help you understand the mapping between each paradigm.

viewof language = Inputs.checkbox(
  ["Tidyverse", "Pandas", "Polars", "SQL", "Pyspark"], 
  { value: ["Tidyverse", "Pandas", "Polars", "SQL", "Pyspark"], 
    label: md`__Display languages:__`
  }
)

import {Editor} from '@cmudig/editor'
import {makeEditor} from '@cmudig/editor'
import {makeHl} from '@cmudig/highlighter'
import {columns} from "@bcardiff/observable-columns"

SQL = makeHl('sql')
PY = makeHl('py')
R = makeHl('r')

data = FileAttachment("code_snippets.json").json()
df = data._filter.filter(function(dat){
    return language.includes(dat._api)
})

function hl(query, language='js') {
  const result = md`
~~~${language}
${String(query).trim()}
~~~`;

  result.value = String(query).trim();

  return result;
}

SQLEditor = makeEditor({language: 'sql', label: 'SQL'})
PythonEditor = makeEditor({language: 'py', label: 'Python'})
REditor = makeEditor({language: 'r', label: 'R'})




md`### Selecting Columns from a *dataframe*`

titles = df.map(d => md`### ${d._api}`)
// codes = df.map(d => htl.html`<code>${d.code}</code>`)
codes = df.map(d => hl(d.code, d.language))

columns(titles)

columns(codes)