Big Data Statistical Analysis
Course Syllabus
Course Title: Big Data Statistical Analysis
Instructor: Sewon Park
Email: swpark0413@sookmyung.ac.kr
Semester: Spring 2026
Class Time & Location: Monday & Wednesday, 10:00–11:50, Room B116, Changhak B
Office Hours: Wednesday 13:00–14:00 or by appointment
Course Description
This course introduces fundamental tools and methodologies for big data collection and processing. Building on prior knowledge of Python and statistics, students will learn practical data workflows using shell scripting, SQL, and distributed computing frameworks such as Apache Spark. Emphasis is placed on real-world applications including web crawling, text preprocessing, large-scale data analysis, and machine learning.
Learning Objectives
By the end of this course, students will be able to:
- Independently collect and analyze large-scale data for their own research questions.
- Communicate effectively with database professionals as peer analysts.
- Build data analysis environments using Python and Linux.
- Perform data preprocessing and exploratory data analysis (EDA).
- Use SQL databases for structured data manipulation.
- Conduct distributed data processing using Apache Spark.
- Apply machine learning techniques to large-scale datasets.
- Understand parallel and GPU-based computation concepts.
Prerequisites
Students are expected to have:
- Prior coursework in statistics (e.g., regression analysis, multivariate analysis)
- Basic Python programming skills
Course Materials
Textbook
- No required textbook. Lecture notes will be provided.
References
- Jules Damji et al. (translated by Jong-young Park & Seong-su Lee), Learning Spark, 2nd ed., Jpub, 2022
Grading Policy
| Component | Count | Weight |
|---|---|---|
| Attendance | — | 10% |
| Assignments | 7 | 35% |
| Project 1 | 1 | 10% |
| Project 2 | 1 | 10% |
| Final Project | 1 | 35% |
Weekly Schedule
| Week | Topic |
|---|---|
| 1 | Orientation |
| 2 | Python & Linux Environment Setup; Understanding Linux/Unix |
| 3 | Shell Scripting; NumPy & Pandas Basics |
| 4 | Web Data Collection: Static & Dynamic HTML Crawling |
| 5 | Open API Data Collection; Text Preprocessing & Regular Expressions |
| 6 | MySQL Fundamentals |
| 7 | Advanced MySQL: Table Joins |
| 8 | Spark Fundamentals: Installation, Environment Setup, Data I/O |
| 9 | Spark SQL: DataFrames & Distributed Queries |
| 10 | Spark Analysis: Preprocessing, EDA & Visualization |
| 11 | Spark Machine Learning: Supervised & Unsupervised Learning; Model Evaluation |
| 12 | Spark NLP & Text Analytics |
| 13 | Kafka: Installation, Environment Setup & Structured Streaming |
| 14 | PyTorch Basics; CPU vs GPU; Parallel Processing |
| 15 | WandB & DeepSpeed Introduction; Final Project Evaluation |