Big Data Statistical Analysis

Course Syllabus

Course Title: Big Data Statistical Analysis
Instructor: Sewon Park
Email: swpark0413@sookmyung.ac.kr
Semester: Spring 2026
Class Time & Location: Monday & Wednesday, 10:00–11:50, Room B116, Changhak B
Office Hours: Wednesday 13:00–14:00 or by appointment

Course Description

This course introduces fundamental tools and methodologies for big data collection and processing. Building on prior knowledge of Python and statistics, students will learn practical data workflows using shell scripting, SQL, and distributed computing frameworks such as Apache Spark. Emphasis is placed on real-world applications including web crawling, text preprocessing, large-scale data analysis, and machine learning.

Learning Objectives

By the end of this course, students will be able to:

Independently collect and analyze large-scale data for their own research questions.
Communicate effectively with database professionals as a peer analyst.
Build data analysis environments using Python and Linux.
Perform data preprocessing and exploratory data analysis (EDA).
Use SQL databases for structured data manipulation.
Conduct distributed data processing using Apache Spark.
Apply machine learning techniques to large-scale datasets.
Understand parallel and GPU-based computation concepts.

Prerequisites

Students are expected to have:

Prior coursework in statistics (e.g., regression analysis, multivariate analysis)
Basic Python programming skills

Course Materials

Textbook

No required textbook. Lecture notes will be provided.

References

Jules Damji et al. (translated by Jong-young Park & Seong-su Lee), Learning Spark, 2nd ed., Jpub, 2022

Grading Policy

Component	Count	Percentage
Attendance	—	10%
Assignments	7	35%
Project 1	1	10%
Project 2	1	10%
Final Project	1	35%

Weekly Schedule

Week	Topic
1	Orientation
2	Python & Linux Environment Setup; Understanding Linux/Unix
3	Shell Scripting; NumPy & Pandas Basics
4	Web Data Collection: Static & Dynamic HTML Crawling
5	Open API Data Collection; Text Preprocessing & Regular Expressions
6	MySQL Fundamentals
7	MySQL Advanced: Table Joins
8	Spark Fundamentals: Installation, Environment Setup, Data I/O
9	Spark SQL: DataFrames & Distributed Queries
10	Spark Analysis: Preprocessing, EDA & Visualization
11	Spark Machine Learning: Supervised/Unsupervised & Evaluation
12	Spark NLP & Text Analytics
13	Kafka: Installation, Environment Setup & Structured Streaming
14	PyTorch Basics; CPU vs GPU; Parallel Processing
15	Introduction to Weights & Biases (WandB) and DeepSpeed; Final Project Evaluation

Last updated on Feb 22, 2026
