Big Data Statistical Analysis

Course Syllabus

Course Title: Big Data Statistical Analysis
Instructor: Sewon Park
Email: swpark0413@sookmyung.ac.kr
Semester: Spring 2026
Class Time & Location: Monday & Wednesday, 10:00–11:50, Room B116, Changhak B
Office Hours: Wednesday 13:00–14:00 or by appointment


Course Description

This course introduces fundamental tools and methodologies for big data collection and processing. Building on prior knowledge of Python and statistics, students will learn practical data workflows using shell scripting, SQL, and distributed computing frameworks such as Apache Spark. Emphasis is placed on real-world applications including web crawling, text preprocessing, large-scale data analysis, and machine learning.


Learning Objectives

By the end of this course, students will be able to:

  • Independently collect and analyze large-scale data for their own research questions.
  • Communicate effectively with database professionals as a peer analyst.
  • Build data analysis environments using Python and Linux.
  • Perform data preprocessing and exploratory data analysis (EDA).
  • Use SQL databases for structured data manipulation.
  • Conduct distributed data processing using Apache Spark.
  • Apply machine learning techniques to large-scale datasets.
  • Understand parallel and GPU-based computation concepts.

Prerequisites

Students are expected to have:

  • Prior coursework in statistics (e.g., regression analysis, multivariate analysis)
  • Basic Python programming skills

Course Materials

Textbook

  • No required textbook. Lecture notes will be provided.

References

  • Jules Damji et al. (translated by Jong-young Park & Seong-su Lee), Learning Spark, 2nd ed., Jpub, 2022

Grading Policy

Component        Times    Percentage
Attendance       -        10%
Assignments      7        35%
Project 1        1        10%
Project 2        1        10%
Final Project    1        35%

Weekly Schedule

Week    Topic
1       Orientation
2       Python & Linux Environment Setup; Understanding Linux/Unix
3       Shell Scripting; NumPy & Pandas Basics
4       Web Data Collection: Static & Dynamic HTML Crawling
5       Open API Data Collection; Text Preprocessing & Regular Expressions
6       MySQL Fundamentals
7       MySQL Advanced: Table Joins
8       Spark Fundamentals: Installation, Environment Setup, Data I/O
9       Spark SQL: DataFrames & Distributed Queries
10      Spark Analysis: Preprocessing, EDA & Visualization
11      Spark Machine Learning: Supervised/Unsupervised & Evaluation
12      Spark NLP & Text Analytics
13      Kafka: Installation, Environment Setup & Structured Streaming
14      PyTorch Basics; CPU vs GPU; Parallel Processing
15      WandB & DeepSpeed Introduction; Final Project Evaluation