Machine Learning Project - Guideline
DANL 320: Big Data Analytics
Overview
For the final project in DANL 320, you will complete a machine learning project.
- You may work alone or with one other student; groups of more than two are not allowed.
- You may use any dataset for the project.
- You may also build on a project from another course such as DANL 310 or DANL 410, as long as you substantially extend it to meet the requirements of this course.
- Your project must include:
  - thoughtful data preparation and transformation,
  - appropriate data visualization and descriptive analysis,
  - multiple supervised machine learning models, and
  - at least one unsupervised learning model.
- A literature review is optional.
The main goal of this project is to show that you can take a dataset, prepare it carefully, apply and compare machine learning methods, interpret the results clearly, and communicate the value of your analysis in a coherent and professional way.
Presentation
Your presentation will take place during class time on May 4 and May 6.
- Each student will give a 12-minute presentation.
- If you work in a two-person group, your group will have 24 minutes total.
- Each member should speak for roughly 12 minutes.
- The presentation order will be determined randomly.
- If your project uses a machine learning method that was not covered in class, you should provide a brief and accessible explanation of that method.
- If your topic involves technical, scientific, or domain-specific knowledge, you should also provide enough background so that your classmates can understand the context and importance of your work.
What to Submit for the Presentation
- Please be prepared to present your slides during class on your assigned day.
- You should use clear and professional visual materials.
- Your slides may be prepared in PowerPoint, Google Slides, or another presentation format approved by the instructor.
Key Components in the Presentation
- Title
  - Choose a title that clearly reflects your project.
- Introduction
  - Background: Explain the topic and why it matters.
  - Project Motivation: Describe what interested you about the problem.
  - Research Question or Goal: State clearly what you are trying to predict, classify, cluster, or learn from the data.
- Data
  - Introduce the dataset and explain where it came from.
  - Describe the key variables used in your analysis.
  - Briefly explain how you cleaned, transformed, or prepared the data.
- Exploratory Analysis
  - Present descriptive statistics and visualizations that help the audience understand the data.
  - Highlight patterns that motivate your modeling choices.
- Machine Learning Analysis
  - You must include multiple supervised learning models.
  - You must also include at least one unsupervised learning model.
  - Clearly explain why you chose your models.
  - Present model performance and interpret the results in a meaningful way.
  - Focus on comparison, insight, and interpretation rather than simply reporting numbers.
- Significance of the Project
  - Explain why your findings matter.
  - Discuss possible business, policy, scientific, or practical implications.
- References
  - Cite all relevant sources consistently.
Structure of the Project Write-Up
Your write-up should be posted on your personal GitHub website by May 14, 2026, at 11:59 PM.
You may prepare the write-up in Quarto, which renders to a webpage that you can publish on your personal GitHub website.
1. Introduction
- Provide background on your topic.
- Explain why the topic is interesting or important.
- Clearly state the main research question, modeling goal, or analytical objective.
2. Literature Review (Optional)
- You may include a short literature review if it helps motivate your project; it is not required.
3. Data
- Source and Scope
  - Explain where the data came from.
  - Describe the time period, unit of observation, and scope of the data.
- Variables
  - Define the main variables used in the project.
- Cleaning and Preparation
  - Explain how you handled missing values, recoded variables, engineered features, normalized values, or otherwise prepared the data.
- Exploratory Data Analysis
  - Include descriptive statistics and visualizations.
  - Show patterns in the data that help motivate later modeling choices.
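As one illustration of the cleaning and preparation steps listed above, the following is a minimal sketch in plain Python. The `ages` column, the choice of median imputation, and the helper names `impute_median` and `min_max_scale` are all assumptions made for this example; your project may call for different choices, and in practice you would likely use a data analysis library rather than hand-written helpers.

```python
from statistics import mean, median

# Hypothetical toy column; None marks a missing value (an assumption for this sketch).
ages = [23, None, 31, 45, None, 52]


def impute_median(values):
    """Replace each None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages_filled = impute_median(ages)
ages_scaled = min_max_scale(ages_filled)

# Simple descriptive statistics to report alongside the cleaned column.
summary = {
    "n": len(ages_filled),
    "n_missing_before": sum(v is None for v in ages),
    "mean": mean(ages_filled),
    "min": min(ages_filled),
    "max": max(ages_filled),
}
print(summary)
print(ages_scaled)
```

Whatever choices you make, the write-up should state them explicitly (for example, why the median rather than the mean was used to fill missing values).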
4. Supervised Machine Learning Analysis
- Include multiple supervised machine learning models.
  - These may include, for example, linear regression, logistic regression, regularized regression, decision trees, random forests, gradient boosting, support vector machines, or other appropriate supervised methods.
- Clearly explain:
  - the modeling goal,
  - the predictors and outcome,
  - how the data were split or validated,
  - the evaluation metric(s), and
  - how the models compare.
- Interpret the results in clear language.
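The workflow described above (split the data, fit more than one model, evaluate each on held-out data with the same metric, then compare) can be sketched as follows. This is only an illustration in plain Python: the data are synthetic, the 80/20 split and the RMSE metric are assumptions, and the two models (a mean baseline and a one-predictor least-squares regression) are deliberately simple stand-ins for whichever models you choose.

```python
import random
from math import sqrt

# Synthetic data (an assumption for illustration): y is roughly 2x + 1 plus noise.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]

# Hold out 20% of the rows as a test set; fit only on the remaining 80%.
indices = list(range(len(xs)))
random.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
x_train = [xs[i] for i in train_idx]
y_train = [ys[i] for i in train_idx]
x_test = [xs[i] for i in test_idx]
y_test = [ys[i] for i in test_idx]


def fit_mean_baseline(y):
    """Model 1: always predict the mean of the training outcome."""
    y_bar = sum(y) / len(y)
    return lambda x: y_bar


def fit_simple_linear(x, y):
    """Model 2: one-predictor ordinary least squares."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda v: intercept + slope * v


def rmse(model, x, y):
    """Root mean squared error of a fitted model on (x, y)."""
    return sqrt(sum((model(a) - b) ** 2 for a, b in zip(x, y)) / len(y))


models = {
    "mean baseline": fit_mean_baseline(y_train),
    "linear regression": fit_simple_linear(x_train, y_train),
}
# Every model is scored on the same held-out rows with the same metric.
results = {name: rmse(m, x_test, y_test) for name, m in models.items()}
for name, score in results.items():
    print(f"{name}: test RMSE = {score:.3f}")
```

Including a naive baseline is a design choice worth considering: it gives the comparison a reference point, so the reader can see how much each model improves on a trivial prediction.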
5. Unsupervised Learning Analysis
- Include at least one unsupervised learning model.
- This may include, for example, clustering, principal component analysis, association rules, or another appropriate unsupervised method.
- Explain why the method is useful for your dataset and what insights it provides.
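As a concrete example of one unsupervised method, here is a minimal k-means clustering sketch in plain Python (requires Python 3.8+ for `math.dist`). The six two-dimensional `points`, the choice of `k = 2`, and the initialization from the first `k` points are assumptions made to keep the example short and deterministic; in a real project you would typically use a library implementation with a more robust initialization and would justify the choice of `k`.

```python
from math import dist

# Hypothetical toy data: two visibly separated groups of 2-D points.
points = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.3)]


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    # Simplification for this sketch: start from the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # Converged: centroids no longer move.
        centroids = new_centroids
    return labels, centroids


labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary; what matters for interpretation is which observations are grouped together and how the groups differ on the original variables.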
6. Discussion / Implications
- Discuss what your results mean.
- Explain the practical significance of your findings.
- Reflect on strengths and limitations of your analysis.
7. Conclusion
- Summarize your main findings.
- Briefly explain what you learned from the project.
- Suggest possible extensions or next steps.
8. References
- Use a consistent citation style.
- Include all sources you cited.
General Requirements
- Format: Your write-up should be presented in a clear, readable, and reproducible format.
- Website Posting: The final write-up must be posted on your personal GitHub website.
- Deadline: May 14, 2026, 11:59 PM.
- Code and Output: Include code, results, tables, and figures as appropriate.
- Organization: Use clear section headings and logical flow.
- Clarity: Your write-up should explain what you did and why you did it, not just show code.
Suggested Project Workflow
- Choose a topic and dataset.
- Clean and prepare the data.
- Explore the data with tables and visualizations.
- Fit and compare multiple supervised models.
- Apply at least one unsupervised learning method.
- Interpret the results.
- Prepare presentation slides.
- Publish the final write-up on your personal GitHub website.
Rubric
Presentation
| Attribute | Very Deficient (1) | Somewhat Deficient (2) | Acceptable (3) | Very Good (4) | Outstanding (5) |
|---|---|---|---|---|---|
| 1. Quality of Data Preparation and Exploratory Analysis | Little or no preparation shown; major errors | Minimal preparation; several errors | Adequate preparation and exploratory analysis | Strong preparation and thoughtful exploratory analysis | Excellent and highly effective preparation and exploratory analysis |
| 2. Quality of Data Visualization | Missing, unclear, or misleading visuals | Basic visuals with limited clarity | Clear and appropriate visuals | Insightful and well-designed visuals | Exceptional, polished, and highly effective visuals |
| 3. Quality of Supervised Learning Analysis | Inappropriate or missing supervised models | Limited supervised modeling with weak explanation | Appropriate supervised models with adequate explanation | Strong supervised modeling with good comparison and interpretation | Excellent supervised modeling with strong justification, comparison, and insight |
| 4. Quality of Unsupervised Learning Analysis | Missing or inappropriate unsupervised method | Minimal unsupervised analysis with weak explanation | Appropriate unsupervised analysis with adequate explanation | Strong unsupervised analysis with useful interpretation | Excellent unsupervised analysis with deep and meaningful insight |
| 5. Effectiveness of Communication and Storytelling | No clear narrative or purpose | Weak narrative and limited clarity | Clear overall structure and message | Compelling and well-organized presentation | Exceptionally engaging, coherent, and polished presentation |
| 6. Quality of Presentation Delivery | Difficult to follow; poor delivery | Uneven delivery; limited preparedness | Clear and reasonably organized delivery | Professional and confident delivery | Highly polished, confident, and engaging delivery |
Write-Up
| Attribute | Very Deficient (1) | Somewhat Deficient (2) | Acceptable (3) | Very Good (4) | Outstanding (5) |
|---|---|---|---|---|---|
| 1. Quality of Project Question / Goal | Unclear or missing | Somewhat unclear | Clearly stated | Clear and well motivated | Exceptionally clear, interesting, and well motivated |
| 2. Quality of Data Preparation and Visualization | Poorly prepared and poorly visualized | Some preparation and visualization, but several weaknesses | Adequate preparation and visualization | Strong preparation and clear visualization | Excellent preparation and highly effective visualization |
| 3. Quality of Supervised Modeling Analysis | Missing or inappropriate | Basic and weakly explained | Appropriate and adequately explained | Strong and well interpreted | Excellent, thoughtful, and well justified |
| 4. Quality of Unsupervised Learning Analysis | Missing or inappropriate | Basic and weakly explained | Appropriate and adequately explained | Strong and well interpreted | Excellent, thoughtful, and insightful |
| 5. Quality of Interpretation and Discussion | Little or no interpretation | Limited interpretation | Adequate interpretation | Strong interpretation with meaningful implications | Deep, thoughtful, and compelling interpretation |
| 6. Quality of Writing and Organization | Very difficult to follow; many errors | Somewhat disorganized; several errors | Generally clear and organized | Well organized and easy to read | Exceptionally clear, polished, and professional |
| 7. Quality of Reproducible Computing / Website Presentation | Major issues with code, output, or website presentation | Several issues with reproducibility or presentation | Adequate reproducibility and website presentation | Strong reproducibility and clear website presentation | Excellent reproducibility, presentation, and technical polish |