Welcome to DS701

Course overview

Welcome to Fall 2025 DS701 Tools for Data Science.

This course is a Master’s level introduction to data science.

In this course you will:

  • develop proficiency in working with and analyzing data
  • work with a wide range of data analysis techniques and tools

Instructor

Thomas Gardos

  • Office Hours: Thursdays, 3:30 pm – 4:45 pm
    • and by appointment (email me to schedule)
  • Office Location: CDS 1623
  • Email: tgardos <at> bu <dot> edu

Teaching assistants

We have the following teaching assistants:

Atul Aravind Das (MSDS ’25)

  • Office Hours: Wednesdays, 2:00 pm – 3:30 pm
  • Office Hours Location: TBA
  • Email: atuladas <at> bu <dot> edu

Samritha Aadhi Ravikumar (MSDS ’25)

  • Office Hours: Mondays, 2:00 pm – 3:30 pm
  • Office Hours Location: TBA
  • Email: samritha <at> bu <dot> edu

Course Assistants

We have the following course assistants:

Caslow Chien (MSDS ’25)

  • Office Hours: TBA
  • Office Hours Location: TBA
  • Email: caslow <at> bu <dot> edu

Carlos Fernando Garcia Padilla (MSDS ’25)

  • Office Hours: TBA
  • Office Hours Location: TBA
  • Email: cfg001 <at> bu <dot> edu

Course Meeting Times

Lecture (A1)

  • Tu,Th 11:00 am – 12:15 pm
  • 928 Commonwealth Ave SHA 110

A2 Discussions (Atul)

  • We 10:10 am – 11:00 am
  • 665 Commonwealth Ave CDS 164

A3 Discussions (Samritha)

  • We 11:15 am – 12:05 pm
  • 665 Commonwealth Ave CDS 164

A4 Discussions (Samritha)

  • We 12:20 pm – 1:10 pm
  • 685-725 Commonwealth Ave CAS B06B

A5 Discussions (Atul)

  • We 9:05 am – 9:55 am
  • 685-725 Commonwealth Ave CAS 204A

Please attend the discussion section you signed up for.

Learning outcomes

The goal of the class is to:

  • give you a hands-on understanding of classical (but still useful!) data analysis techniques (although we cover neural networks too!)
  • help you gain proficiency in applying these techniques in a modern programming language (Python)

Broadly speaking, the course breaks down into three main categories of approaches:

  1. unsupervised methods
  2. supervised methods
  3. methods for structured (e.g. tabular) data

Lectures present the fundamentals of each technique.

The focus is not on the theoretical analysis of the methods, but rather on helping students understand the practical settings in which these methods are useful.

Class discussion will study use cases and cover relevant Python packages to enable the students to perform hands-on experiments with their data.

We’ll Cover

  • Introductory material (probability and linear algebra refreshers, pandas and scikit-learn)
  • Clustering techniques (k-means, hierarchical clustering, Gaussian mixture models)
  • Classification and regression with decision trees
  • Linear and logistic regression
  • Hands-on neural networks
  • Recommendation systems
  • Networks/Graphs

With lots of hands-on practice!

You’ll be using

  • Python
  • Jupyter notebooks
  • numpy
  • pandas
  • scikit-learn
  • statsmodels
  • matplotlib
  • scipy
  • networkx
  • pytorch
  • and others…
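
Before the first hands-on session, you may want to confirm that the packages listed above are visible to your Python installation. The snippet below is only a minimal sketch and does not replace the setup instructions in the repo README. The `COURSE_PACKAGES` mapping and the `find_missing` helper are names made up for this illustration; the one thing to notice is that scikit-learn is imported as `sklearn` and PyTorch as `torch`.

```python
import importlib.util
import sys

# Display name -> import name for the packages listed above.
# scikit-learn is imported as `sklearn` and PyTorch as `torch`.
COURSE_PACKAGES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "statsmodels": "statsmodels",
    "matplotlib": "matplotlib",
    "scipy": "scipy",
    "networkx": "networkx",
    "pytorch": "torch",
}


def find_missing(packages):
    """Return the display names of packages Python cannot locate."""
    return [
        name
        for name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


print("Python", sys.version.split()[0])
missing = find_missing(COURSE_PACKAGES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

If anything shows up as not installed, follow the setup instructions for the course repository rather than installing packages one at a time.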

Course webpage

The syllabus, lecture schedule, and course notes can be found on this webpage.

This online text will evolve as the course progresses.

Bookmark this page!

Course Notes Source

You have a few ways to follow along with the course lectures/notes:

  1. Open the Colab version of the lecture and save it to your own Google Drive when you see the Colab badge.
  2. Fork and clone the repository and run the Jupyter notebooks version of the lectures locally.
  3. Use an annotation tool like hypothes.is on the website.

In general, you’ll want to have your laptop computer with you at the lectures to follow along and participate in in-class activities.

Course Repository

  • The course notes are written in Quarto markdown.
  • The slides are Quarto RevealJS presentations.
  • Setup instructions are on the repo README.

Feel free to fork the repository to explore, experiment, and modify this code as you see fit.

If you want to clone or fork the repository using git, please feel free. If you find a bug, feel free to submit a pull request.

Some of the lectures are based on Introduction to Data Mining, by Tan, Steinbach and Kumar. This is a good place to go for more detail if some methodological aspect is not clear.

For an up-to-date reference on pandas, scikit-learn, or any of the other software tools we use, there is no substitute for online resources. Google will quickly bring you to the authoritative (and current) references on software tools.

Tools and platforms

We will use:

  1. Piazza for questions
  2. GitHub for homeworks, the midterm, and lectures
  3. Gradescope for grading and grade management
  4. Kaggle for the midterm challenge

You should already be signed up for Gradescope (if not, enroll using code sent via welcome email).

You can add yourself to Piazza if you are not already enrolled (again, use the code sent via welcome email).

You will need an account on GitHub. Please tell us your GitHub user name on this form.

If you don’t have an up-to-date Python installation, take care of that right away.

Piazza

We will be using Piazza for class discussion.

You can use Piazza to get help quickly and efficiently from classmates, the TAs, and the professors.

I encourage you to post your questions on Piazza.

Our class Piazza page is at

Piazza Etiquette

Please be respectful on Piazza.

Do

  • Ask questions about course material, logistics, etc.
  • Answer posted questions when you know the answer (but not homework questions)
  • Tell people where to look for answers

Don’t

  • Provide solutions to homework questions.

Programming environment

We will use Python as the language for teaching and for assignments that require coding.

We have a chapter on installing Python and a brief recap of some Python fundamentals.

If you don’t feel proficient in Python, you should review the chapter or, even better, complete a more comprehensive online course on Python.

Grading

Final grades will be computed based on the following:

  • 15% participation and in-class activities
  • 15% homework (~7 assignments)
  • 30% midterm challenge
  • 40% final project

The exact cutoffs for final grades will be determined after the class is complete.
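
As a rough illustration of how the weights above combine, here is a minimal sketch. It assumes each component has already been reduced to a single score on a 0 to 100 scale, which this page does not specify, and it ignores any bonus points. The `course_score` function and the example scores are hypothetical.

```python
# Component weights from the grading breakdown above.
WEIGHTS = {
    "participation": 0.15,
    "homework": 0.15,
    "midterm": 0.30,
    "project": 0.40,
}


def course_score(scores):
    """Combine per-component scores (0-100 each) into a weighted total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {"participation": 100, "homework": 90, "midterm": 80, "project": 85}
print(round(course_score(example), 1))  # 15 + 13.5 + 24 + 34 = 86.5
```

The sketch only produces a weighted total; it says nothing about letter grades, since the cutoffs are set after the class is complete.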

Project Grading

The project’s 40% share of the course grade is further weighted as follows:

Percentage Category
50% Project quality
10% Repo quality
30% Individual Contribution
10% Collaboration
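
To see what each project category is worth in terms of the overall course grade, you can multiply its percentage in the table above by the project's 40% share. The sketch below does only that arithmetic; the variable names are made up for this illustration.

```python
PROJECT_SHARE = 0.40  # the project's share of the course grade

# Sub-weights within the project grade, from the table above.
PROJECT_WEIGHTS = {
    "Project quality": 0.50,
    "Repo quality": 0.10,
    "Individual Contribution": 0.30,
    "Collaboration": 0.10,
}

# Each category's share of the overall course grade.
effective = {name: PROJECT_SHARE * w for name, w in PROJECT_WEIGHTS.items()}

for name, share in effective.items():
    print(f"{name}: {share:.0%} of the course grade")
```

For example, individual contribution alone works out to 12% of the course grade, nearly as much as all of the homework.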

Project Grading Details

  • Project quality (50%)
    • Did the project accomplish a sufficient number of (possibly revised) objectives?
    • Was the client relationship managed well?
  • Repo quality (10%)
    • Is the GitHub repository well organized and easy to navigate?
    • Is the repo well documented, especially with replication steps?
    • Can one start from a new environment and easily set up and run the project?
  • Individual Contribution (30%)
    • Is there clear evidence of attendance and active participation in class lab time, client and team meetings?
    • Documented activities in sprint plan history?
    • Git commit history and co-authored git commits?
    • Record of individual’s contributions in document and presentation revision history?
  • Collaboration (10%)
    • Is there indication, for example from peer reviews, of positive collaborations and constructive teamwork?

Homeworks

Homeworks are due at 11:59 pm on the date shown on the syllabus.

Assignments will be submitted using GitHub and/or Gradescope.

Please review the instructions for submitting homeworks on the Resources page of Piazza.

NOTE: Late assignments will be accepted up to 48 hours after the deadline with a 10% penalty.
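
One way to read the late policy is sketched below. Two things in it are assumptions rather than course policy: that the 10% penalty is taken off the score you earned (instead of, say, 10 points off the maximum), and the example due date. The `adjusted_score` function is hypothetical; ask on Piazza if you need the exact rule.

```python
from datetime import datetime, timedelta

LATE_WINDOW = timedelta(hours=48)  # late work is accepted up to 48 hours
LATE_PENALTY = 0.10                # 10% penalty for late work


def adjusted_score(raw_score, deadline, submitted):
    """Return the score after the late policy, or None if not accepted."""
    if submitted <= deadline:
        return raw_score
    if submitted - deadline <= LATE_WINDOW:
        # Assumption: the penalty is 10% of the earned score.
        return raw_score * (1 - LATE_PENALTY)
    return None  # more than 48 hours late


# Hypothetical due date, for illustration only.
deadline = datetime(2025, 9, 19, 23, 59)
one_hour_late = deadline + timedelta(hours=1)
print(round(adjusted_score(90, deadline, one_hour_late), 1))  # 90 * 0.9 = 81.0
```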

In a typical assignment you will analyze one or more datasets using the tools and techniques presented in class.

You are expected to work individually on homeworks.

Midterm

The midterm will be a Kaggle Data Science competition among the students in the class with a live leaderboard.

Students will need to submit predictions based on a training dataset and a report detailing the methods used and decisions made.

The intent is not to use the leaderboard to determine your grade, but to help you assess how effective your work is.

Having said that, we will give a few bonus points to the leaders.

Project

A major goal of this course is to gain experience with real-world data science problems in the form of a group project.

In your project, you and your team will extract some knowledge or conclusions from the analysis of a dataset of your choice.

Grading will be based on specific deliverables as well as your performance in your team throughout the semester.

For the project, students will get the opportunity to work with BU Spark! on a real-world, data-driven project for a company, non-profit, or institution.

Spark projects have already been curated and will be presented during “Pitch Day”.

Every team will need to upload a SCRUM file to the final project repository every week which gives a short report on the status of their project. These SCRUM reports are a fast and concise way to answer:

  • What have I worked on?
  • What will I be working on next?
  • Have I run into any issues? Do I need help?
  • Have I talked to the client recently? When are we meeting with them next?

Project Expectations

  • All team members should contribute equally and proactively to project work; we will evaluate team contributions through a peer evaluation at the end of the semester and this will be factored into your grade.
  • You / your team lead should make yourself available to speak with your client on a bi-weekly basis (depends on client availability)
  • You / your team lead should meet with your Spark PM on a weekly basis
  • You should meet with your team every other day (can / should be a short meeting)
  • For any team communication issues, please let your Spark! PMs know as soon as possible; they are here to help. If the problem persists, please email me with a description of the situation.
  • All students are expected to abide by University conduct policies as detailed in the following links:

Spark! project teams

All Spark! project teams have the following structure:

  • Project Managers (Spark! provided): These are the project leads and will communicate with the client directly. They will assist with administrative support (meeting scheduling, agenda setting) and will be a point of contact for project questions / concerns. Project Managers are also responsible for grading all Spark! project deliverables.
  • Team Lead: These students will assist the Project Manager in attending client meetings, organizing team questions, and facilitating team meetings.
  • Team Members: These students work collaboratively with each other on the project goals.

Spark! Collaboration

BU Spark! offers students an opportunity to work on technical projects provided by companies or organizations in the Greater Boston area through our experiential learning lab (X-Lab).

Spark! has partnered with DS701 to offer a diverse selection of external data science projects scoped to support the course’s learning outcomes and enhance the student experience.

Your project team will be led by one of the Spark! Project Managers. Their role is to support the student team’s work plan, manage client communication and expectations, organize weekly and biweekly meetings, and oversee project deliverable grading.

Spark! projects are a great opportunity for students to get real-world project experience to highlight on their GitHub and CV.

Participation and In-Class Activities

Lecture sessions will be a combination of lecture and in-class activities.

Participation is an important part of the class and is intended to keep you engaged and to give you practice “thinking on your feet”. It will be a combination of volunteered answers and randomly calling on students.

In-Class Activities will be where you work in small groups on challenges that reinforce your understanding of the material in the lecture.

You’ll get credit for participation and activities (not graded).

AI Use Discussion

Think-pair-share style, each ~6-7 minutes, with wrap-up.

See team assignment in Gradescope.

Round 1: Learning Impact

“How might GenAI tools help your learning in this course? How might they get in the way?”

Round 2: Values & Fairness

“What expectations do you have for how other students in this course will or won’t use GenAI? What expectations do you have for the teaching team so we can assess your learning fairly given easy access to these tools?”

Round 3: Real Decisions

“Picture yourself stuck on a challenging problem at midnight with an assignment deadline looming. What options do you have? What would help you make decisions you’d feel good about?”

Thoughts on GenAI

GenAI is a powerful tool that can help you learn and be more productive and undoubtedly will be part of your workflow going forward. We will use it!

But, would you…

(a) …build strength by having a robot lift weights for you?
(b) …increase your fitness by having a robot jog for you?
Figure 1: Via ChatGPT – And yes, I’m aware of the irony.

Generative AI Assistance (GAIA) Policy

In general, we follow the policy outlined in the CDS GAIA Policy.

In particular students shall:

  1. When using GenAI on written assignments, unless prohibited, add an appendix citing which model was used and prompts used.
  2. When using AI tools on coding assignments, unless prohibited:
    1. If generating several lines of code or entire functions, add the prompt text and tool used as comments before the generated code.
    2. If auto-completing function syntax, no need to cite the tool, but add a comment at the top of the file citing the tool used (e.g. “This file was written with help of Copilot”).

Note

For foundational concepts such as those taught in this course, it is in your best interest, and worth it, to struggle a bit in creating your answers and solutions. It is just as important to learn what doesn’t work, and which paths are dead ends, as it is to learn what does work.
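
For coding assignments, the citation rules in the policy above might look something like the sketch below. The comment layout, the placeholder tool name, the prompt text, and the `median` function are all made up for illustration; this is not a required format, only one way to record the tool and prompt before generated code.

```python
# This file was written with help of Copilot.

# --- Begin AI-generated code ---
# Tool: <name of the GenAI tool and model you used>
# Prompt: "Write a Python function that returns the median of a list
#          of numbers."
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2
# --- End AI-generated code ---


print(median([3, 1, 2]))
```

The first comment covers auto-completed syntax for the whole file, while the tool and prompt comments sit directly before the generated function.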

Academic honesty

You may collaborate and discuss homework assignments with classmates, but you are solely responsible for what you turn in.

All forms of cheating (copying parts of a classmate’s assignment, plagiarism from books or old posted solutions) are NOT allowed.

We – both teaching staff and students – are expected to abide by the guidelines and rules of the Academic Code of Conduct.

Graduate students must also be aware of and abide by the GRS Academic Conduct code.

  • If you are looking online for an answer because you don’t know how to start thinking about a problem, talk to your Professors or TAs, who may be able to give you pointers to get you started. Piazza is great for this – you can usually get an answer in an hour if not a few minutes.
  • If you are looking online for an answer because you want to see if your solution is correct, ask yourself if there is some way to verify the solution yourself. Usually, there is. You will understand what you have done much better if you do that. So … it would be better to simply submit what you have at the deadline (without going online to cheat) and plan to allocate more time for homeworks in the future.

For Thursday

  • We will have Spark! project pitches and open up the project preferences survey.
  • Bring your laptop and have Python installed as well as VS Code and/or Cursor.

And fill out the intro survey and indicate so in Gradescope.
