Welcome to DS701

Course overview

Welcome to DS701 Tools for Data Science.

This course is a Master’s level introduction to data science.

In this course you will:

  • develop proficiency in working with and analyzing data
  • work with a wide range of data analysis techniques

Professors

Prof Thomas Gardos (Section A1)

  • Office Hours: Thursday 11:00am - 12:00pm and by appointment
  • Office Location: CDS 1635
  • Email: tgardos <at> bu <dot> edu

Prof Scott Ladenheim (Section C1)

  • Office Hours: We, Fr 11:00am - 12:00pm
  • Office Location: CDS 1545
  • Email: saladenh <at> bu <dot> edu

Teaching assistants

We have the following teaching assistants:

Chandrahas Aroori

  • Office Hours: Th 10:00am - 11:00am and by appointment
  • Office Hours Location: CCDS 15th floor, open desks on the south side.
  • Email: charoori <at> bu <dot> edu

Farid Karimli

  • Office Hours: Tu 11:00am - 12:00pm and by appointment
  • Office Hours Location: CCDS 15th floor, open desks on the south side.
  • Email: faridkar <at> bu <dot> edu

Shreyas Sudarsan

  • Office Hours: Wed 2:00pm - 3:00pm
  • Office Hours Location: CCDS 15th floor, open desks on the south side.
  • Email: shreyas9 <at> bu <dot> edu

Course overview

There are two sections of this course:

  • A1 lecture (Prof Gardos)
    • Tu,Th 9:30am – 10:45am
    • 665 Comm Ave CDS B62
  • A2 discussion (TAs)
    • We 12:20pm – 1:10pm
    • 750 Comm Ave EPC 205
  • C1 lecture (Prof Ladenheim)
    • Mo,We 4:40pm - 5:55pm
    • 595 Comm Ave HAR 324
  • C2 discussion (TAs)
    • Fr 10:10am - 11:00am
    • 685-725 Comm Ave CAS 313

Learning outcomes

The goal of the class is to give students a hands-on understanding of classical data analysis techniques and to develop proficiency in applying these techniques in a modern programming language (Python).

Broadly speaking, the course breaks down into three main components, which we will take in order of increasing complexity:

  1. unsupervised methods
  2. supervised methods
  3. methods for structured data

Lectures present the fundamentals of each technique.

The focus is not on the theoretical analysis of the methods, but rather on helping students understand the practical settings in which these methods are useful.

Class discussion will study use cases and cover relevant Python packages to enable the students to perform hands-on experiments with their data.

Learning outcomes

Students who successfully complete this course will be proficient in

  • data acquisition
  • data manipulation
  • data analysis

They will have good working knowledge of the most commonly used methods of clustering, classification, and regression. They will also understand the efficiency issues and systems issues related to working on very large datasets.

Course webpage

Almost all details of the course can be found on this webpage

Bookmark this page!

Textbook and slides

The textbook used in the course is published at

This online text will evolve as the course progresses, but we will work to keep it up-to-date.

The slides used in the lecture are Quarto RevealJS presentations. Setup instructions are on the repo README.

For lectures with Python, you’ll see a badge. If you click on the badge, it will open the Jupyter notebook version of the lecture in Google Colab.

Feel free to explore, experiment, and modify this code as you see fit.

The lecture slides and everything else used in lecture are published on GitHub. The repository is

If you want to clone or fork the repository using git, please feel free. If you find a bug, feel free to submit a pull request.

Some of the lectures are based on Introduction to Data Mining, by Tan, Steinbach and Kumar. This is a good place to go for more detail if some methodological aspect is not clear.

For up-to-date reference on Pandas, scikit-learn, or any of the other software tools we use, there is no substitute for online resources. Google will quickly bring you to the authoritative (and current) references on software tools.

Tools and platforms

We will use:

  1. Piazza for questions
  2. GitHub for homeworks, the midterm, and lectures
  3. Gradescope for grading and grade management
  4. Kaggle for the midterm

You should already be signed up for Gradescope (if not, enroll using the code sent in the welcome email).

You can add yourself to Piazza if you are not already enrolled (again, use the code sent in the welcome email).

You will need an account on GitHub. Please tell us your GitHub username on this form.

If you don’t have an up-to-date Python installation, take care of that right away.
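One way to check your installation is a short script like the sketch below. The minimum version shown is an illustrative assumption, not a course requirement, and the function name `check_environment` is hypothetical. The package list uses the import names of Pandas and scikit-learn, which are mentioned elsewhere on this page.

```python
import importlib.util
import sys

# Assumption: this minimum version is illustrative, not a course requirement.
MIN_PYTHON = (3, 9)
# Import names for Pandas and scikit-learn.
PACKAGES = ["pandas", "sklearn"]


def check_environment(min_python=MIN_PYTHON, packages=PACKAGES):
    """Return a dict saying whether Python and each package look usable."""
    report = {"python_ok": sys.version_info[:2] >= tuple(min_python)}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'MISSING'}")
```

If anything is reported as missing, follow the installation instructions in the online textbook.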

Piazza

We will be using Piazza for class discussion.

You can use Piazza to get help fast and efficiently from classmates, the TAs, and the professors.

I encourage you to post your questions on Piazza.

Our class Piazza page is at

Piazza Etiquette

Please be respectful on Piazza.

Do

  • Ask questions about course material, logistics, etc.
  • Answer posted questions when you know the answer (not homework)
  • Tell people where to look for answers

Don’t

  • Provide solutions to homework questions.

Programming environment

We will use Python as the language for teaching and for assignments that require coding. Instructions for installing and using Python are in the online textbook.

Grading

Homeworks are due at midnight on the date shown on the syllabus.

Assignments will be submitted using Github and Gradescope.

Please review the instructions for submitting homeworks, on the Resources page of Piazza.

IMPORTANT: Late assignments will NOT be accepted, with one exception: you may submit one homework up to 3 days late. To use this exception, you must post a note to “Instructors” on Piazza before the deadline.

Final grades will be computed based on the following:

  • 20% Midterm
  • 40% Homework assignments
  • 40% Final Project

The exact cutoffs for final grades will be determined after the class is complete.
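As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale, which this page does not state. The function name `final_score` and the example scores are hypothetical. It produces only a numeric score, since the letter-grade cutoffs are not fixed in advance.

```python
# Weights taken from the list above; they sum to 1.
WEIGHTS = {"midterm": 0.20, "homework": 0.40, "project": 0.40}


def final_score(scores):
    """Combine component scores (each assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student, for illustration only.
example = {"midterm": 85.0, "homework": 90.0, "project": 80.0}
print(round(final_score(example), 2))  # 0.2*85 + 0.4*90 + 0.4*80 = 85.0
```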

Project Grading

The project’s 40% share of the course grade is further weighted as follows:

Percentage   Category
50%          Project quality
10%          Repo quality
30%          Individual Contribution
10%          Collaboration
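Because these percentages apply within the project's 40% share, it can help to convert them into shares of the overall course grade. The sketch below does that arithmetic. The function and variable names are hypothetical.

```python
PROJECT_SHARE = 0.40  # the project's share of the course grade
PROJECT_WEIGHTS = {
    "Project quality": 0.50,
    "Repo quality": 0.10,
    "Individual Contribution": 0.30,
    "Collaboration": 0.10,
}


def course_share(weights=PROJECT_WEIGHTS, project_share=PROJECT_SHARE):
    """Return each category's share of the overall course grade, in percent."""
    return {name: round(100 * project_share * w, 2) for name, w in weights.items()}


for name, pct in course_share().items():
    print(f"{name}: {pct}% of the course grade")
```

For example, project quality is 50% of 40%, which is 20% of the course grade, and individual contribution is 30% of 40%, which is 12%.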

Project Grading Details

  • Project quality (50%)
    • Did the project accomplish a sufficient number of (possibly revised) objectives?
    • Was the client relationship managed well?
  • Repo quality (10%)
    • Is the Github repository well organized and easy to navigate?
    • Is the repo well documented especially with replication steps?
    • Can one start from a new environment and easily set up and run the project?
  • Individual Contribution (30%)
    • Is there clear evidence of attendance and active participation in class lab time, client and team meetings?
    • Documented activities in sprint plan history?
    • Git commit history and co-authored git commits?
    • Record of individual’s contributions in document and presentation revision history?
  • Collaboration (10%)
    • Is there indication, for example from peer reviews, of positive collaborations and constructive teamwork?

Homeworks

There will be weekly homework assignments.

In a typical assignment you will analyze one or more datasets using the tools and techniques presented in class.

Homeworks will be assigned and submitted via Gradescope.

The format of the homeworks will be Jupyter notebooks.

You are expected to work individually on homeworks.

Midterm

The midterm will be a Kaggle Data Science competition among the students in the class with a live leaderboard.

Students will need to submit predictions based on a training dataset and a report detailing the methods used and decisions made.

The intent is not to use the leaderboard to determine your grade, but to help you assess how effective your work is.

Accordingly, 80% of the grade will be based on the report and only 20% will be based on the competition score related to the quality of the predictions made.
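The split can be written as a one-line formula. The sketch below assumes both parts are scored on a 0-100 scale, which is not stated here. The function name `midterm_grade` and the example scores are hypothetical.

```python
def midterm_grade(report_score, competition_score):
    """Blend the two parts; both inputs are assumed to be on a 0-100 scale."""
    return 0.80 * report_score + 0.20 * competition_score


# Hypothetical example: a strong report with a mid-table competition score.
print(round(midterm_grade(90, 60), 2))  # 0.8*90 + 0.2*60 = 84.0
```

Under these assumptions, moving from the bottom to the top of the leaderboard changes the midterm grade by at most 20 points.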

Project

A major goal of this course is to gain experience with real-world data science problems in the form of a group project.

In your project, you and your team will extract some knowledge or conclusions from the analysis of a dataset of your choice.

Grading will be based on specific deliverables as well as your performance in your team throughout the semester.

For the project, students will get the opportunity to work with BU Spark! on a real-world, data-driven project for a company, non-profit, or institution.

Project

Spark projects have already been curated and will be presented during “Pitch Day”.

Every team will need to upload a SCRUM file to the final project repository every week which gives a short report on the status of their project. These SCRUM reports are a fast and concise way to answer:

  • What have I worked on?
  • What will I be working on next?
  • Have I run into any issues? Do I need help?
  • Have I talked to the client recently? When are we meeting with them next?

Project Expectations

  • All team members should contribute equally and proactively to project work; we will evaluate team contributions through a peer evaluation at the end of the semester and this will be factored into your grade.
  • You / your team lead should make yourself available to speak with your client on a bi-weekly basis (depends on client availability)
  • You / your team lead should meet with your Spark PM on a weekly basis
  • You should meet with your team every other day (can / should be a short meeting)
  • For any team communication issues, please let your Spark! PMs know as soon as possible; they are here to help. If the problem persists, please email me with a description of the situation.
  • All students are expected to abide by University conduct policies as detailed in the following links:

Spark! project teams

All Spark! project teams have the following structure:

  • Project Managers (Spark! provided): These are the project leads. They communicate with the client directly, assist with administrative support (meeting scheduling, agenda setting), and serve as a point of contact for project questions and concerns. Project Managers are also responsible for grading all Spark! project deliverables.
  • Team Lead: These students will assist the Project Manager in attending client meetings, organizing team questions, and facilitating team meetings.
  • Team Members: These students work collaboratively with each other on the project goals.

Spark! Collaboration

BU Spark! offers students an opportunity to work on technical projects provided by companies or organizations in the Greater Boston area through our experiential learning lab (X-Lab).

Spark! has partnered with DS701 to offer a diverse selection of external data science projects scoped to support the course’s learning outcomes and enhance the student experience.

Your project team will be led by one of the Spark! Project Managers. Their role is to support the student team’s work plan, manage client communication and expectations, organize weekly and biweekly meetings, and oversee project deliverable grading.

Spark! projects are a great opportunity for students to get real-world project experience to highlight on their GitHub and CV.

Generative AI Assistance (GAIA) Policy

In general, we follow the policy outlined in the CDS GAIA Policy.

In particular, students shall:

  1. When using GenAI on written assignments, unless prohibited, add an appendix citing which model and which prompts were used.
  2. When using AI tools on coding assignments, unless prohibited:
    1. If generating several lines of code or entire functions, add the prompt text and the tool used as comments before the generated code.
    2. If auto-completing function syntax, there is no need to cite the tool for each completion, but add a comment at the top of the file citing the tool used (e.g. “This file was written with help of Copilot”).
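As an illustration of the coding-assignment rules above, a cited file might look like the sketch below. The comment layout (the marker lines and the `Tool:` and `Prompt:` labels) is an assumption. The policy only asks that the prompt text and the tool appear as comments before the generated code. The function and the prompt are hypothetical examples.

```python
# This file was written with help of Copilot (auto-completion of function syntax).

import statistics


# --- AI-generated code (hypothetical example) ---------------------
# Tool: <name and version of the GenAI tool used>
# Prompt: "Write a Python function that standardizes a list of
#          numbers to zero mean and unit variance."
def standardize(values):
    """Return z-scores of values using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mean) / std for v in values]
# --- end of AI-generated code -------------------------------------


print(standardize([1.0, 2.0, 3.0]))
```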
Note

For foundational concepts such as those taught in this course, it is in your best interest, and worth the effort, to struggle somewhat in creating your answers and solutions. It is just as important to learn what doesn’t work, and which paths are dead ends, as it is to learn what does work.

DS701 AI Tutor

In fact, we’re offering an experimental generative AI tutor tailored to this course’s content.

Screenshots: AI Tutor Sign In, AI Tutor Start, AI Tutor Chat Interface

Academic honesty

You may collaborate and discuss homework assignments with classmates, but you are solely responsible for what you turn in.

All forms of cheating (copying parts of a classmate’s assignment, plagiarism from books or old posted solutions) are NOT allowed.

We – both teaching staff and students – are expected to abide by the guidelines and rules of the Academic Code of Conduct.

Graduate students must also be aware of and abide by the GRS Academic Conduct code.

  • If you are looking online for an answer because you don’t know how to start thinking about a problem, talk to your Professors or TAs, who may be able to give you pointers to get you started. Piazza is great for this – you can usually get an answer in an hour if not a few minutes.
  • If you are looking online for an answer because you want to see if your solution is correct, ask yourself if there is some way to verify the solution yourself. Usually, there is. You will understand what you have done much better if you do that. So it would be better to simply submit what you have at the deadline (without going online to cheat) and plan to allocate more time for homeworks in the future.