1  What is Big Data?

Big data is commonly described by three key characteristics, Volume, Velocity, and Variety, that signal the inclusion of the word big in front of data. Generally, at least two of these characteristics need to be present.

This course will focus on simplified versions of Volume and Velocity. We will work with data too large to process on a typical student computer. Much of the data could also stress your local file system. Our data and project will focus on batch processing toward developing a predictive model. We will not focus on streaming data or real-time decision-making.

1.1 Note about the class data

We will limit our data sizes to less than a terabyte in total, with the largest tables in the 100 GB range. This is not big data in the strict sense of the term, but it is large enough to require a different approach to data analysis than what you have learned in previous courses. Throughout this course, we will use the term big data to refer to this data, and we will focus on the tools and techniques used to analyze big data. At times these tools may be slower than modern tools for medium-sized data, like the polars package mentioned above. However, the tools we will use scale to much larger data sizes and will be useful in your future work.

1.1.1 Dump truck analogy

We want to figure out how to move and shape data with big data tools. We need to learn to drive the massive mining dump truck while imagining the massive loads it is built to carry. Nobody would drive a mining dump truck down neighborhood roads to help a friend move when the load fits in a small truck. In this class, you will be tempted to drop into polars or pandas if you focus on the size of the load, because the data can get small enough to fit into those packages and routines. Stay firmly in the dump truck and learn to drive it; you will need to drive it when you get into industry. The short sketch below shows what staying in the dump truck looks like in practice.
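
To make the analogy concrete, here is a minimal sketch of a small aggregation done the dump-truck way. It assumes a Spark-based toolchain (PySpark), which is only one possible choice of engine, and the file and column names are hypothetical placeholders; the habit it illustrates is doing the work in the scalable engine even when the data would comfortably fit in pandas or polars.

```python
# A minimal sketch of "staying in the dump truck": a small aggregation done
# with a scalable engine rather than pandas or polars. PySpark is assumed
# here purely for illustration, and the file and column names
# (rides.parquet, pickup_date, fare_amount) are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dump-truck-demo").getOrCreate()

# Even a file small enough for pandas is read the scalable way.
rides = spark.read.parquet("rides.parquet")

# Build the aggregation lazily; nothing runs until an action is called.
daily_totals = (
    rides.groupBy("pickup_date")
    .agg(
        F.count("*").alias("n_rides"),
        F.avg("fare_amount").alias("avg_fare"),
    )
)

# Trigger the computation and look at a few rows.
daily_totals.show(5)
```

The same group-by written in polars would likely run faster on a laptop for data this size, but it would not scale past a single machine's memory, which is exactly the point of the analogy.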