Capstone Project Guide
Research Kick-off Report (DANL 410)
Overview
In DANL 410: Data Analytics Capstone, you will complete an end-to-end analytics project: you start from a focused question, move through data work and analysis, and end with a polished report and presentation.
This course is designed to help you practice what data analysts actually do:
- Translate a real-world problem into an analytics question
- Find and evaluate data (or create a clean dataset)
- Clean, transform, and explore the data
- Apply an appropriate machine learning or statistical method (e.g., regression, classification, clustering)
- Communicate findings clearly with visuals and writing
- Publish your work on your Quarto website (GitHub Pages)
You do not need the "perfect" topic at the beginning. Early in the semester, your goal is to build a feasible plan with a clear question, a credible dataset, and a reasonable method.
Team Formation
- Each team may have one or two members.
- Every team member is expected to contribute actively and understand the entire project.
- Teams should be formed by Wednesday, Feb 11, 11:59 P.M.
- If your team has two members, one member should email Prof. Choe at bchoe@geneseo.edu and cc the other member.
Project Timeline (Spring 2026)
| Component | Description | Due Date |
|---|---|---|
| Research Kick-off Report | Define question, motivation, data, and plan (2–3 pages) | Feb 25 (11:59 P.M.) |
| Midterm Presentation | Short progress presentation (what you have, what you learned, next steps) | Mar 11 (class time) |
| Progress & Insights Report | What you did + what you learned + obstacles + updated plan | Mar 25 |
| Research Synthesis Report | Stronger draft narrative + results + interpretation | Apr 15 |
| GREAT Day PowerPoint | Slide deck submission | Apr 22 |
| Final Capstone Report | Final written report (published on website) | May 11 (11:59 P.M.) |
Research Kick-off Report: Structure
Your Research Kick-off Report (2–3 pages) is a concise proposal that explains:
- what you plan to study,
- why it matters (business, social, or policy relevance),
- how you will study it using data.
Include these sections:
- Working Title & Topic
  - A clear, descriptive title.
  - 2–4 sentences describing the topic and setting.
- Research Question
  - State one main research question in a single sentence.
  - Add 1–3 sub-questions if helpful.
- Motivation / Value
  - Why would someone care?
  - Who is the "stakeholder" (a business, customers, local government, nonprofit, campus office, etc.)?
  - What decision, insight, or understanding might the analysis support?
- Data Plan
  - What dataset(s) will you use?
  - Unit of analysis (customer? county? day? product? student? transaction?)
  - Key variables you expect to use (inputs and outcomes)
  - Data limitations you already anticipate (missingness, small sample, measurement, bias)
- Method Plan (Required: include one ML/statistical component)
  - What approach will you use and why?
  - Examples:
    - Regression (prediction or cause-and-effect)
    - Classification (e.g., churn / pass-fail / fraud / sentiment categories)
    - Clustering (e.g., segmentation)
    - Time-series forecasting (if appropriate)
  - What is your evaluation plan? (train/test split, accuracy/RMSE, cross-validation, or a clear validation idea)
- References
  - Use a consistent citation style.
  - For online sources: include URL and date of access.
Think of the Kick-off Report as your project blueprint. You are not expected to have final results yet, but your plan should be credible and feasible.
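As one possible way to make the evaluation plan in the Method Plan concrete, the sketch below holds out 20% of the rows as a test set, fits a one-predictor least-squares line on the remaining rows, and compares test RMSE against a "predict the training mean" baseline. It uses only the Python standard library. The data are synthetic, and the helper names `fit_simple_ols` and `rmse` are illustrative assumptions rather than part of any course template; in a real project you would likely use a modeling library instead.

```python
import math
import random
import statistics

# Synthetic data for illustration only: y depends linearly on x plus noise.
rng = random.Random(410)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1.5) for x in xs]

# Shuffle the row indices, then hold out 20% of the rows as a test set.
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of a one-predictor least-squares line."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(statistics.fmean(squared_errors))


x_train = [xs[i] for i in train_idx]
y_train = [ys[i] for i in train_idx]
x_test = [xs[i] for i in test_idx]
y_test = [ys[i] for i in test_idx]

# Fit on the training rows only, then score on the held-out rows.
intercept, slope = fit_simple_ols(x_train, y_train)
model_rmse = rmse(y_test, [intercept + slope * x for x in x_test])

# Baseline: always predict the training-set mean of y.
baseline_rmse = rmse(y_test, [statistics.fmean(y_train)] * len(y_test))

print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

Reporting the model next to a simple baseline shows whether the method adds anything. If the test RMSE is not clearly below the baseline, that is itself a result worth discussing in the report.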
Brainstorming Ideas (Capstone-Friendly)
| Area | Example Research Question | Data Type | Possible Methods |
|---|---|---|---|
| Business / Marketing | Which factors predict customer churn or repeat purchase? | transactions, CRM, reviews | classification, regression |
| Operations | What drives delays or failures in a process? | timestamps, logs | regression, clustering |
| Finance / Risk | Can we predict risk outcomes or identify risky segments? | firm/household indicators | classification, clustering |
| Sports / Performance | Which player/team features predict win probability? | game logs | regression, classification |
| Public Policy | Which communities face higher risk or lower access? | census + admin data | regression, mapping, clustering |
| Education | What predicts course performance or retention patterns? | LMS / grades / surveys | regression, classification |
| Climate Change | Which factors predict household solar adoption (or energy burden) across counties? | American Community Survey (ACS) + solar installs + weather/energy | regression, classification, mapping |
Research Idea Logic
A strong capstone topic is:
- Specific enough to answer within one semester
- Data-supported (you can actually obtain usable data)
- Method-feasible (you can execute and interpret a reasonable model)
- Meaningful (it has a real stakeholder, decision, or insight)
Think in "question types":
| Type | Purpose | Example |
|---|---|---|
| Descriptive | Summarize what is happening | "How have housing prices changed across counties since 2018?" |
| Comparative | Compare groups or time periods | "Do outcomes differ by region, income level, or policy change?" |
| Predictive | Predict an outcome using features | "Can we predict churn using usage behavior?" |
| Segmentation | Identify clusters / groups | "Can we cluster customers into meaningful segments?" |
| Diagnostic | Identify drivers of an outcome | "Which factors are most associated with delays?" |
Avoid questions that are too broad:
- "How does social media affect society?"
Instead, narrow the scope:
- "Which post features predict higher engagement for a specific account/category?"
Using Data Effectively
You are encouraged to use real-world data and present evidence using:
- Clean tables (summary statistics, grouped means, counts)
- Clear visualizations (distributions, trends, comparisons, relationships)
- A method you can explain and justify
- Transparent code and reproducible workflow
Guidelines:
- Start with a dataset you can actually obtain quickly (Weeks 2–4).
- Prefer data that has:
  - Enough observations to learn patterns
  - Clear variable definitions
  - A usable target/outcome variable (for prediction/classification), or a clear grouping structure (for clustering)
- Use modeling only when it adds value:
  - A "fancy" model with weak interpretation is worse than a simple model well explained.
- Always cite data sources and document data cleaning steps.
You don't need complex modeling to do a strong capstone, but you do need clear evidence + clear reasoning + clear communication.
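The kind of clean summary table described above (counts and grouped means), together with a quick missingness check, can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The records, the column names `region` and `sales`, and the use of `None` to mark a missing value are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical records for illustration only; None marks a missing value.
rows = [
    {"region": "North", "sales": 120.0},
    {"region": "North", "sales": 80.0},
    {"region": "South", "sales": 200.0},
    {"region": "South", "sales": None},
    {"region": "West", "sales": 50.0},
]

# Count missing values per column before computing any statistics.
missing = {col: sum(r[col] is None for r in rows) for col in rows[0]}

# Group the non-missing sales values by region.
by_region = defaultdict(list)
for r in rows:
    if r["sales"] is not None:
        by_region[r["region"]].append(r["sales"])

# One summary row per group: number of usable observations and their mean.
summary = {g: {"n": len(v), "mean": fmean(v)} for g, v in by_region.items()}

print("missing values per column:", missing)
for group, stats in sorted(summary.items()):
    print(f"{group:<6} n={stats['n']} mean={stats['mean']:.1f}")
```

Showing the count next to each mean lets a reader see how many observations stand behind every number, and reporting the missing-value counts documents one of the data cleaning steps.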
Recommended Open Data Sources
Public Opinion, Education & General Statistics
| Source | Description |
|---|---|
| Our World in Data | Cross-country datasets on CO₂ emissions, energy use, and climate impacts. |
| Statista | Statistics portal covering energy, environment, and sustainability topics. (Access through SUNY Geneseo library subscription.) |
| Yale Climate Opinion Map | U.S. public opinion on climate change beliefs, attitudes, and policy support. |
Climate & Environmental Data
| Source | Description |
|---|---|
| NOAA Climate Data Online | Temperature, precipitation, and extreme weather records for U.S. and global stations. |
| PRISM Climate Group (Oregon State University) | High-resolution U.S. climate data (temperature, precipitation, and normals) for local and regional analysis. |
| Berkeley Earth | Long-term global datasets on temperature, air quality, and climate change. |
| Global Carbon Budget Office | Annual data on global carbon emissions and carbon budget analysis. |
| UNEP Environment Data Explorer | World Environment Situation Room (WESR), an open data platform by the UN Environment Programme (UNEP). |
| AQUASTAT (FAO Water Data) | Global data on water resources, irrigation, and agricultural water management. |
Energy, Agriculture & Resource Use
| Source | Description |
|---|---|
| U.S. Energy Information Administration (EIA) | Data on energy production, consumption, and fuel prices. |
| IEA (International Energy Agency) | Global datasets on energy production, consumption, efficiency, and emissions by sector and country. |
| FAOSTAT (UN Food and Agriculture Organization) | Global data on agriculture, food systems, land use, and greenhouse gas emissions by sector and country. |
| OECD Environment Statistics | Data on environmental performance, green growth, energy intensity, and environmental taxes for OECD countries. |
| U.S. Census Bureau: Natural & Built Environments | Datasets on pollution, agriculture, transportation emissions, recycling, and rural development. |
Always check:
- whether the data is legally usable (license/terms),
- whether variables are well-defined,
- and whether the data is appropriate for your question.
Checklist (Research Kick-off Report)
Before submitting your Research Kick-off Report:
- Team confirmed (1–2 people)
- Clear research question (one sentence)
- Stakeholder / value explained (why it matters)
- Data source identified and link provided
- Unit of analysis + key variables listed
- Method plan includes at least one ML/stat component (regression/classification/clustering/etc.)
- Realistic timeline and deliverables described
- References included (URLs + access date for online sources)
- 2–3 pages, professional formatting
- Submitted via Brightspace by Feb 25 (11:59 P.M.)
Closing Thought
A strong capstone project is not about doing everything. It is about doing a focused analysis well and communicating results clearly enough that someone can act on them.