
IST 557: Data Mining: Techniques and Applications, Fall 2023

Basic Information

Class Location: Westgate Bldg E208
Class Time: Tue/Thu, 10:35 AM – 11:50 AM
Instructor: Lu Lin

  • Contact: lulin[at]psu.edu
  • Office hours: Wed, 2:00 PM – 3:00 PM or By Appointment
  • Office: Westgate Bldg E373

TA: Tianrong Zhang

  • Contact: tbz5156[at]psu.edu
  • Office hours: Fri, 1:00 PM – 2:00 PM
  • Office: Westgate Bldg E301

Course Overview

Objective: The course will cover a broad range of topics in data mining, including machine learning foundations (regression, classification, and clustering) and recent trends in computer vision (image/video data mining), natural language processing (text mining), and graph learning (structured data mining). This course is designed for graduate students who are interested in using machine learning techniques to discover patterns and gain knowledge about data.

Prerequisites: Students are expected to have a programming background in C, Java, Python (recommended), or another programming language in order to complete the course projects. However, the course will not require students to program things from scratch: Python has many machine learning libraries that already implement many models, and these can be used simply by importing the library and calling its functions. Sample code showing how to call a model will also be provided when the model details are introduced in class. Students are also expected to have a math background in linear algebra and probability in order to understand the machine learning principles.

Course Material:

Tentative Schedule and Readings

Slides will be posted before each class.

 

 

Week Date Lectures
1 08/22 Introduction
08/24 Review of Linear Algebra and Probability I

Team sign up for paper presentation [form]
Due Friday 09/01, 11:59pm (ET)

2 08/29 Review of Linear Algebra and Probability II
08/31 Data Preprocessing and Representation

09/01: Paper presentation team sign up due

Group project team sign up [form] and proposal [template]
Due Friday 10/06, 11:59pm (ET)

3 09/05 Machine Learning Foundations I: Linear Regression
09/07 Machine Learning Foundations II: Linear Classification

Individual Project I on Heart Attack Prediction
Due Friday 09/22, 11:59pm (ET)

4 09/12 Machine Learning Foundations III: Perceptron and Evaluation
09/14 Machine Learning Foundations IV: Naive Bayes

5 09/19 Machine Learning Foundations V: Decision Tree
09/21 Machine Learning Foundations VI: Ensemble Method

09/22: Individual Project I due

6 09/26 class canceled
09/28 Machine Learning Foundations VII: Clustering

7 10/03 Machine Learning Foundations VIII: Clustering
10/05 Machine Learning Foundations IX: Review and Application
In-class quiz on Machine Learning

10/06: Group project proposal due

8 10/10 Multi-layer Perceptrons and Back-propagation
10/12 Image Mining I: Convolutional Neural Networks
  • Deep Learning Chapter 9
  • He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

Individual Project II on Image Classification
Due Friday 10/27, 11:59pm (ET)

9 10/17 Image Mining II: Review and Application
In-class quiz on Computer Vision
10/19 Text Mining I: Word Embedding
  • Mikolov, Tomas, et al. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).
  • Learning Word Embedding by Lilian Weng

10 10/24 Text Mining II: Language Model
10/26 Text Mining III: Recurrent Neural Networks

10/27: Individual Project II due

11 10/31 Text Mining IV: Transformer
  • Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
  • Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
  • Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.
11/02 Text Mining V: Review and Application
In-class quiz on Text Mining

Individual Project III on Text Classification
Due Mon 11/27, 11:59pm (ET)

12 11/07 Graph Mining I: Node Embedding
  • Grover, Aditya, and Jure Leskovec. “node2vec: Scalable feature learning for networks.” Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.
11/09 Graph Mining II: Graph Neural Networks
  • Kipf, Thomas N., and Max Welling. “Semi-supervised classification with graph convolutional networks.” arXiv preprint arXiv:1609.02907 (2016).
  • Hamilton, Will, Zhitao Ying, and Jure Leskovec. “Inductive representation learning on large graphs.” Advances in neural information processing systems 30 (2017).

13 11/14 Graph Mining III: Review and Application
In-class quiz on Graph Mining
11/16 Lab: Project Discussion

14 11/21 Thanksgiving, No Class
11/23 Thanksgiving, No Class

11/27: Individual Project III due

15 11/28 Advanced Topic I: Interpretability
11/30 Advanced Topic II: Robustness

16 12/05 Advanced Topic III: Fairness
12/07 Group Project Expo (Lightning talk + Demo)

12/08: Group project report due

 

Grading

  • Paper Presentation (10 points)
    • 10-min presentation about the chosen paper
  • In-class Quiz (10 points) – 2.5 points * 4
  • Individual Project (60 points) – 20 points * 3
    • Predefined data mining problem on Kaggle
    • Each project is graded based on the evaluation metric on Kaggle and the quality of report
    • Top-ranked teams will be awarded bonus points
  • Group Project (20 points)
    • Exploratory data mining problem defined by you
    • The project is graded based on the quality of proposal (5 points) and final report (15 points)
    • In the last class (the project expo), selected teams will present their work (10-min lightning talk) and be awarded bonus points; the remaining time will be a workshop among all the teams
  • Bonus points can be earned
  • Cutoff:
    • A: [93, 100]
    • A-: [90, 93)
    • B+: [87, 90)
    • B: [83, 87)
    • B-: [80, 83)
    • C+: [77, 80)
    • C: [70, 77)
    • D: [60, 70)
    • F: [0, 60)

The instructor reserves the right to curve grades so as to improve letter grades if warranted by unpredictable circumstances (e.g., an assignment proves too difficult).
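The cutoff intervals above can be read as a simple lookup. The following is a minimal sketch, not an official grading tool: the function name `letter_grade` is hypothetical, the sketch assumes that a total above 100 (possible with bonus points) still maps to an A, and it does not model any curve the instructor may apply.

```python
# Lower bounds copied from the cutoff list; each letter covers
# [lower bound, next higher bound), and A covers [93, 100].
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"),
    (80, "B-"), (77, "C+"), (70, "C"), (60, "D"),
]


def letter_grade(total_points: float) -> str:
    """Map a final point total to a letter grade (hypothetical helper)."""
    if total_points < 0:
        raise ValueError("total_points cannot be negative")
    # Assumption: totals above 100 (from bonus points) are treated as an A.
    for lower_bound, letter in CUTOFFS:
        if total_points >= lower_bound:
            return letter
    return "F"  # everything in [0, 60)


if __name__ == "__main__":
    for points in (95, 92.9, 83, 59.9):
        print(points, letter_grade(points))
```

Because the intervals are half-open, a total of exactly 90 falls in A- and a total of exactly 93 falls in A.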

Grading criteria for paper presentation

At the beginning of each lecture starting 09/26, there will be a paper reading session. Students are required to form a team (of 1-3 members), select one paper from the list (or propose another paper with the instructor’s approval), and prepare a 10-min presentation for the class, followed by at most 5 minutes of Q&A. The session is therefore 15 minutes in total. Students are required to prepare the slides themselves (the original authors’ slides may not be used for this presentation). Presenters must present the selected/assigned paper on the scheduled date. No extensions will be given due to the tight schedule of this course. The purpose of the paper presentation is to help students practice giving talks in public, at conferences or in other settings.

Both the instructor and the other students will grade the presentation (presenters do not grade themselves). The detailed grading criteria are as follows. The criteria total 50 pts, which count for 10 pts in the final grade.

Aspect Score range
Slides quality – Slide content was clearly visible and self-explanatory [1, 5]
Idea delivery – Important messages of the paper were properly highlighted [1, 5]
Organization – Structure and logic of the presentation were well organized [1, 5]
Clarity – Explained approaches/methods clearly [1, 5]
Pace – Moderate pace for the audience to follow [1, 5]
Engagement – Presenter(s) did not just read off the slides [1, 5]
Team Work – All students in the team understood the paper well [1, 5]
Timing – Perfect timing [1, 5]
Q&A – Responded to the audience’s questions well [1, 5]
Inspiration – I have learned something and was inspired by this presentation, and would like to read the paper in the future [1, 5]
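The conversion from the 50-pt rubric to the 10 pts in the final grade is a linear rescaling. The sketch below shows that arithmetic for a single grader's sheet. The function name `presentation_points` is hypothetical, and how the instructor's and students' sheets are combined is not specified here, so the sketch does not attempt it.

```python
# Aspect names copied from the rubric; each is scored in [1, 5].
ASPECTS = [
    "Slides quality", "Idea delivery", "Organization", "Clarity", "Pace",
    "Engagement", "Team Work", "Timing", "Q&A", "Inspiration",
]
RUBRIC_MAX = 5 * len(ASPECTS)  # 50 pts
GRADE_WEIGHT = 10              # pts in the final grade


def presentation_points(scores: dict) -> float:
    """Rescale one grader's rubric sheet to final-grade points (hypothetical helper)."""
    missing = [aspect for aspect in ASPECTS if aspect not in scores]
    if missing:
        raise ValueError(f"missing aspects: {missing}")
    for aspect in ASPECTS:
        if not 1 <= scores[aspect] <= 5:
            raise ValueError(f"{aspect} must be scored in [1, 5]")
    raw_total = sum(scores[aspect] for aspect in ASPECTS)
    return raw_total * GRADE_WEIGHT / RUBRIC_MAX


if __name__ == "__main__":
    sheet = {aspect: 4 for aspect in ASPECTS}
    print(presentation_points(sheet))  # 40 of 50 rubric pts
```

Since every aspect has a minimum score of 1, the lowest possible sheet is 10 of 50 rubric pts.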

 

Grading criteria for group project

The purpose of the course project is to give students hands-on experience solving novel data mining problems. The project thus emphasizes either research-oriented problems or “deliverables.” It is preferred that the outcome of your project be publishable or tangible, typically a novel research problem or a prototype system that can be demonstrated. Group work is strongly encouraged, and each team can have 2-3 members. The group project topics are flexible:

  1. You could use your own research project related to data mining, preferably one that integrates data mining techniques well;
  2. You could define and solve a data mining problem in a specific application that poses some novel challenge;
  3. You could explore and identify interesting weaknesses, failures, or behaviors of trending techniques (e.g., ChatGPT, diffusion models), reason about why they occur, and propose possible solutions that you will try on open-source models;
  4. You could do a literature survey, but please be advised that it needs to be up-to-date and novel (i.e., it should not be similar to existing survey papers). A good survey paper is also expected to cover the following: a summary of and reflection on existing works, your own understanding of their pros and cons, unique challenges, your proposed methods, preliminary results that support the motivation/design, and future directions.

 

The grade consists of two major parts: the proposal report (50 pts) and the final report (150 pts), which together count for 20 pts in the final grade. The detailed grading criteria are as follows. Three teams will be selected to give a 10-min lightning talk in the last lecture, with bonus points applied.

Proposal report grading criteria:
Aspect Score range
Strictly follow the provided template and page limit [0, 10]
Background and studied problem were clearly stated in the introduction [0, 10]
Sufficient discussion of state-of-the-art in related work section [0, 10]
The proposed solution is reasonable and not too trivial [0, 10]
Detailed and reasonable schedule for deliverables [0, 10]

 

Final project report grading criteria:
Aspect Score range
Strictly follow the provided template and page limit [0, 10]
Background, studied problem and motivation were clearly stated in the introduction, and the logic and argument were reasonable [0, 15]
Contribution of the work was properly articulated in the introduction [0, 15]
Sufficient discussion of state-of-the-art and how this work differentiates from existing works in related work section [0, 15]
Description of the proposed method was clear, comprehensive, coherent and consistent with the claim in the introduction [0, 35]
Clear and precise description of evaluation design and dataset [0, 10]
Thorough evaluation of the proposed method and detailed analysis of the results [0, 35]
Summarization of the work, reasonable discussion of limitation of the proposed solution and possible future work [0, 15]
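The two rubrics above total 50 pts and 150 pts, which together map onto the 20 pts the group project carries in the final grade. The following is a minimal sketch of that rescaling. The function name `group_project_points` is hypothetical, scores are passed in the same order as the rubric rows, and bonus points for lightning talks are not modeled.

```python
# Per-criterion maxima copied from the two rubrics, in row order.
PROPOSAL_MAX = [10, 10, 10, 10, 10]          # sums to 50
FINAL_MAX = [10, 15, 15, 15, 35, 10, 35, 15]  # sums to 150
GRADE_WEIGHT = 20                             # pts in the final grade


def _check(scores, maxima, label):
    """Validate that each score lies within [0, max] for its criterion."""
    if len(scores) != len(maxima):
        raise ValueError(f"{label}: expected {len(maxima)} scores")
    for score, maximum in zip(scores, maxima):
        if not 0 <= score <= maximum:
            raise ValueError(f"{label}: score {score} outside [0, {maximum}]")


def group_project_points(proposal_scores, final_scores) -> float:
    """Rescale rubric scores (out of 200) to final-grade points (hypothetical helper)."""
    _check(proposal_scores, PROPOSAL_MAX, "proposal")
    _check(final_scores, FINAL_MAX, "final report")
    raw_total = sum(proposal_scores) + sum(final_scores)
    rubric_max = sum(PROPOSAL_MAX) + sum(FINAL_MAX)
    return raw_total * GRADE_WEIGHT / rubric_max


if __name__ == "__main__":
    print(group_project_points([8, 8, 8, 8, 8], FINAL_MAX))
```

Under this scaling, every 10 rubric pts are worth 1 pt in the final grade, which matches the 5-point and 15-point split listed for the proposal and the final report.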

Assignment Submission Policy

  • Assignments must be TYPED and submitted to the proper Canvas drop boxes
  • Late submissions incur a penalty of 25% for every 12 hours late (up to 2 days)
  • After 2 days, no late submissions are accepted
  • Unless the schedule states otherwise, all deadlines are Fridays at 11:59pm (ET)
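The late policy above can be illustrated with a short calculation. This is a sketch under two assumptions that the policy does not spell out: the deduction is taken as a fraction of the score the submission would otherwise earn, and every started 12-hour block counts as a full block. The function name `late_adjusted_score` is hypothetical. Under these assumptions, a submission more than 36 hours late earns no credit, so confirm the intended reading with the instructor.

```python
import math

BLOCK_HOURS = 12            # penalty is applied per 12 hours late
DEDUCTION_PER_BLOCK = 0.25  # 25% per block
MAX_LATE_HOURS = 48         # no submissions accepted after 2 days


def late_adjusted_score(score: float, hours_late: float) -> float:
    """Apply the late penalty to an earned score (hypothetical helper)."""
    if hours_late <= 0:
        return score  # on time, no penalty
    if hours_late > MAX_LATE_HOURS:
        raise ValueError("submissions more than 2 days late are not accepted")
    # Assumption: each started 12-hour block counts as a full block.
    blocks = math.ceil(hours_late / BLOCK_HOURS)
    return score * (1 - DEDUCTION_PER_BLOCK * blocks)


if __name__ == "__main__":
    for hours in (0, 5, 13, 30):
        print(hours, late_adjusted_score(20, hours))
```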

Academic Integrity

According to the Penn State Principles and University Code of Conduct: Academic integrity is a basic guiding principle for all academic activity at Penn State University, allowing the pursuit of scholarly activity in an open, honest, and responsible manner. In accordance with the University’s Code of Conduct, you must not engage in or tolerate academic dishonesty. This includes, but is not limited to cheating, plagiarism, fabrication of information or citations, facilitating acts of academic dishonesty by others, unauthorized possession of examinations, submitting work of another person, or work previously used without informing the instructor, or tampering with the academic work of other students. Any violation of academic integrity will be investigated, and where warranted, punitive action will be taken. For every incident when a penalty of any kind is assessed, a report must be filed.

Plagiarism (Cheating): Talking over your ideas and getting comments on your writing from friends are NOT examples of plagiarism. Taking someone else’s words (published or not) and calling them your own IS plagiarism. Plagiarism has dire consequences, including flunking the paper in question, flunking the course, and university disciplinary action, depending on the circumstances of the offense. The simplest way to avoid plagiarism is to document the sources of your information carefully.

Projects: When discussing projects and paper presentations, you may:

  • Discuss the material presented in class or included in assigned readings, documentation, user manual, etc.
  • Assist another student in understanding the statement of the problem (e.g., you may assist a non-native speaker by translating some English phrases unfamiliar to that student)
  • Discuss high-level ideas about how to complete the lab assignment, including problem specification, general strategies for the solution, strategies for debugging and testing code, etc. without examining code written by other students, or sharing code written by you with other students.

It is expected that you have independently arrived at solutions that you turn in for laboratory assignments. The following are examples of activities that are PROHIBITED:

  • Examining or copying code or code fragments from someone else (including online sources), other than the code that is provided to you by the instructor or included in the reference books.
  • Sharing code or code fragments (via email, discussion groups, social media, whiteboard, handwritten or printed copies, etc.)

! Warning

  • A violation of the Academic Integrity policy will result in an automatic F for the submission concerned.
  • Two violations will result in a failing grade in the course.

Student Disability

Americans with Disabilities Act: The School of Information Sciences and Technology welcomes persons with disabilities to all of its classes, programs, and events. If you need accommodations or have questions about access to buildings where IST activities are held, please contact us in advance of your participation or visit. If you need assistance during a class, program, or event, please contact the member of our staff or faculty in charge. Access to IST courses should be arranged by contacting the Office of Human Resources, 332 IST Building: (814) 865-8949.

Students with Disabilities: It is Penn State’s policy to not discriminate against qualified students with documented disabilities in its educational programs. (You may refer to the Nondiscrimination Policy in the Student Guide to University Policies and Rules.) If you have a disability-related need for reasonable academic adjustments in this course, contact the Office for Disability Services (ODS) at 814-863-1807 (V/TTY). For further information regarding ODS, please visit the Office for Disability Services Web site at http://equity.psu.edu/ods/.

In order to receive consideration for course accommodations, you must contact ODS and provide documentation (see documentation guidelines at http://equity.psu.edu/ods/guidelines/documentation-guidelines). If the documentation supports the need for academic adjustments, ODS will provide a letter identifying appropriate academic adjustments. Please share this letter and discuss the adjustments with your instructor as early in the course as possible. You must contact ODS and request academic adjustment letters at the beginning of each semester.