Syllabus

“Data is the new oil.”

– Clive Humby, mathematician and architect of the Tesco Clubcard

“Data is just like crude. It’s valuable but if unrefined it cannot really be used.”

– Michael Palmer, blog

“The goal is to turn data into information, and information into insight.”

– Carly Fiorina, former CEO of HP

“People … operate with beliefs and biases. To the extent you can eliminate both and replace them with data, you gain a clear advantage.”

– Michael Lewis, Moneyball

Preface

Welcome to Tools for Data Science!

These are lecture notes for DS701, Tools for Data Science, as taught at Boston University.

This course has evolved from CS 506, which had major contributions from Evimaria Terzi, George Kollios, and Lance Galletti.

If you have suggestions or find typos or errors, feel free to create an issue or, better yet, submit a pull request!

If you are taking the course, please start with the Course Overview.

Format

This course notes site is built using Quarto and all the source pages are in Quarto markdown files.

We also provide the lecture notes in Jupyter notebook form to make it easier to follow along and experiment. Demos and most figures are included as executable Python code.

All course materials are in the course GitHub repository.

When you see the “Open in Colab” badge at the beginning of a chapter, you can also open the Jupyter notebook in Google Colab.

Each chapter is based on a single Quarto markdown file (also available as a Jupyter notebook), and each file/notebook forms the basis for one lecture.

Course Abstract

This course is a Master’s level introduction to data science, focusing on proficiency in working with and analyzing data. The course emphasizes practical skills in working with data, while introducing students to a wide range of techniques that are commonly used in the analysis of data, such as clustering, classification, regression, and network analysis.

Logistics

Lecture Section A1

Meeting Place: 665 Comm Ave CDS B62

Meeting Time: Tu, Th 9:30am – 10:45am

Instructor: Thomas Gardos

  • Office: CCDS 1623
  • Office Hours: Thursdays 11:00 AM – 12:00 noon and by appointment
  • Office Hours Location: CCDS 1623
  • Email: tgardos <at> bu <dot> edu

Teaching Assistants

  • Name: TBA
    • Office Hours: TBA
    • Office Hours Location: TBA
    • Email: TBA
  • Name: TBA
    • Office Hours: TBA
    • Office Hours Location: TBA
    • Email: TBA
  • Name: TBA
    • Office Hours: TBA
    • Office Hours Location: TBA
    • Email: TBA

Lecture (A1)

  • Tu, Th 2:00 pm – 3:15 pm
  • 765 Commonwealth Ave LAW AUD

A2 Discussions (TA TBA)

  • We 12:20 pm – 1:20 pm
  • 635 Commonwealth Ave SAR 300

A3 Discussions (TA TBA)

  • We 1:25 pm – 2:15 pm
  • 888 Commonwealth Ave IEC B10

A4 Discussions (TA TBA)

  • We 2:30 pm – 3:20 pm
  • 871 Commonwealth Ave CGS 311

A5 Discussions (TA TBA)

  • We 3:35 pm – 4:25 pm
  • 835 Commonwealth Ave CGS 315

Overview of the Course

This course is a Master’s level introduction to data science, focusing on proficiency in working with and analyzing data. The course emphasizes practical skills in working with data, while introducing students to a wide range of techniques that are commonly used in the analysis of data, such as clustering, classification, regression, and network analysis. The goal of the class is to give students a hands-on understanding of classical data analysis techniques and to develop proficiency in applying these techniques in a modern programming language (Python).

Broadly speaking, the course breaks down into three main components, which we will take in order of increasing complexity: (a) unsupervised methods; (b) supervised methods; and (c) methods for structured data.

Lectures will present the fundamentals of each technique; the focus is not on the theoretical analysis of the methods, but rather on helping students understand the practical settings in which these methods are useful. Class discussion will study use cases and will go over relevant Python packages that will enable students to perform hands-on experiments with their data.

Pre/Co-requisites

Prerequisite: Students taking this class must be proficient in Python at the level of an undergraduate Python programming course such as DS110 or equivalent. If you are not confident in your Python skills, we recommend you take one of the many online courses.

Corequisite: Most of the techniques we will cover in this course are based on linear algebra and probability. Having a strong foundation in these subjects will serve you well in this course. We will cover the basics of linear algebra and probability, but we strongly encourage you to take Mathematics for Data Science at the same time as this course.

Learning Outcomes

Students who successfully complete this course will be proficient in data acquisition, manipulation, and analysis. They will have good working knowledge of the most commonly used methods of clustering, classification, and regression. They will also understand the efficiency issues and systems issues related to working on very large datasets.

Textbook and Slides

The notes used in the course are published at https://tools4ds.github.io/DS701-Course-Notes/. This online text will evolve as the course progresses, but we will work to keep it up-to-date.

The slides used in lecture are written in Quarto markdown, which can include executable Python cells. When we show you code in lecture, it will almost always be runnable. If you install Quarto, you can execute the cells directly in the Quarto markdown files. We recommend installing VS Code and adding the Quarto extension to execute the cells right in VS Code.

For all the lectures with Python code, we also add an Open in Colab link so you can open the page directly from the Course Notes web page and execute the Python cells that way.

Feel free to fork and clone the Course Notes repository and run the notes on your own computer. You can modify them any way you’d like, play around with them, experiment, etc. If you find a bug, feel free to submit a pull request.

Some of the lectures were previously based on Introduction to Data Mining [1]. This is a good place to go for more detail if some methodological aspect is not clear. For up-to-date reference on Pandas, scikit-learn, or any of the other software tools we use, there is no substitute for online resources.

Tools and Platforms

We will use:

  1. Piazza for questions
  2. GitHub for notes source, some homeworks and assignments
  3. Gradescope for grading and grade management
  4. Most likely also Kaggle for the midterm competition.

You should already be signed up for Gradescope and Piazza; if not, see the enrollment codes in the course welcome email or contact an instructor.

You will need an account on GitHub. Please also add your real name to your GitHub account profile so we can easily associate you with your GitHub username.

If you don’t have an up-to-date Python installation, take care of that right away. We recommend Python 3.10 or later.
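
A quick way to check which version you have is sketched below. This is only an illustrative check, not a course requirement; the helper name `meets_minimum` and the constant `MIN_VERSION` are our own, chosen for this example.

```python
import sys

# Minimum version recommended in this syllabus, as (major, minor).
MIN_VERSION = (3, 10)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) part of version_info is at least minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report on the interpreter that is running this script.
current = ".".join(str(part) for part in sys.version_info[:3])
status = "OK" if meets_minimum() else "please upgrade"
print(f"Python {current}: {status}")
```

Run it with the same interpreter you plan to use for the course, since a machine can have several Python installations.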

Piazza

We will be using Piazza for class discussion. The system is well suited to getting you help quickly and efficiently from classmates, the teaching fellows, and the instructor. Rather than emailing questions to the teaching staff, we encourage you to post your questions on Piazza. Our class Piazza page is at: https://piazza.com/bu/fall2025/ds701.

When someone posts a question on Piazza, if you know the answer, please go ahead and post it. However, please don’t provide answers to homework questions on Piazza. It’s OK to tell people where to look to get answers, or to correct mistakes in the assignment; just don’t provide actual solutions to homework questions.

Programming Environment

We will use Python as the language for teaching and for assignments that require coding. Instructions for installing and using Python are in the online textbook.

We’ll also be using Jupyter notebooks and Google Colab environments.

Course and Grading Administration

Homeworks are due at midnight on the date shown on the syllabus. Assignments will be submitted using GitHub and Gradescope. Please review the instructions for submitting homeworks on the Resources page of Piazza.

IMPORTANT: Late assignments WILL NOT be accepted, with one exception: you may submit one homework up to 3 days late. You must notify the TAs on Piazza before the deadline if you intend to submit a homework late.

Final grades will be computed based on the following:

Percentage Category
20% Midterm
40% Homework assignments
40% Final Project

The exact cutoffs for final grades will be determined after the class is complete.
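
To make the weighting concrete, here is a minimal sketch of how the three categories combine into one overall score. The scores are hypothetical, each category is assumed to be on a 0 to 100 scale, and the function name is ours; letter-grade cutoffs are not modeled because they are set after the class is complete.

```python
# Category weights from the grading table above, in percent.
WEIGHTS = {"midterm": 20, "homework": 40, "project": 40}


def overall_score(scores, weights=WEIGHTS):
    """Combine per-category scores (assumed 0-100) into a weighted overall score."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights) / 100


# Hypothetical student: 80 on the midterm, 90 on homework, 85 on the project.
example = {"midterm": 80, "homework": 90, "project": 85}
print(overall_score(example))  # (20*80 + 40*90 + 40*85) / 100
```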

Project Grading

The project’s 40% of the course grade is further weighted as follows:

  • 50% Project quality
    • Did the project accomplish a sufficient number of (possibly revised) objectives?
    • Was the client relationship managed well?
  • 10% Repo quality
    • Is the GitHub repository well organized and easy to navigate?
    • Is the repo well documented, especially with replication steps?
    • Can one start from a new environment and easily set up and run the project?
  • 30% Individual Contribution: is there clear evidence of
    • attendance and active participation in class lab time, client meetings, and team meetings?
    • documented activities in the sprint plan history?
    • Git commit history and co-authored Git commits?
    • a record of the individual’s contributions in document and presentation revision history?
  • 10% Collaboration
    • Is there indication, for example from peer reviews, of positive collaboration and constructive teamwork?
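
Because these percentages apply within the project’s 40%, each component’s share of the overall course grade is smaller than it looks. The arithmetic can be sketched as follows; the variable and function names are ours.

```python
# The project is worth 40% of the course grade (see the grading table).
PROJECT_SHARE = 40

# Breakdown within the project, in percent, as listed above.
PROJECT_BREAKDOWN = {
    "Project quality": 50,
    "Repo quality": 10,
    "Individual Contribution": 30,
    "Collaboration": 10,
}


def share_of_course_grade(breakdown=PROJECT_BREAKDOWN, project_share=PROJECT_SHARE):
    """Convert within-project percentages into percentages of the course grade."""
    return {name: project_share * pct / 100 for name, pct in breakdown.items()}


for name, pct in share_of_course_grade().items():
    print(f"{name}: {pct:g}% of the course grade")
```

For example, repo quality is 10% of 40%, or 4% of the course grade.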

Homeworks

In a typical assignment you will analyze one or more datasets using the tools and techniques presented in class.

Homeworks will be submitted via Gradescope or GitHub. For this, we need your GitHub account (create one if you don’t already have it). After you have created it, fill out this form to let us know what it is. We also highly recommend you add your full name to your GitHub account profile.

You are expected to work individually on homeworks.

Midterm

The midterm will be a Kaggle Data Science competition among the students in the class with a live leaderboard. Students will need to submit predictions based on a training dataset and a report detailing the methods used and decisions made. Note that the intent is not to use the leaderboard to determine your grade, but rather to help you assess how effective your work is.

Project

A major goal of this course is to gain experience with real-world data science problems in the form of a project. For the project you will extract some knowledge or conclusions from the analysis of a dataset of your choice. The analysis will be done using a subset of the methods we describe in class. Grading will be based on specific deliverables as well as your performance in your team throughout the semester.

For the final project, students may get the opportunity to work with BU Spark! on a real-world, data-driven project for a company, non-profit, or institution. Spark! projects have already been curated and will be presented during “Pitch Day”. Project descriptions will be made available at the start of the semester. Once every student has a final project, every team will need to upload a SCRUM file to the final project repository every week, giving a short report on the status of their project. SCRUM is an agile method used in many software companies. The weekly file is a fast, concise report answering the following questions:

  • What have I worked on?
  • What will I be working on next?
  • Have I run into any issues? Do I need help?
  • Have I talked to the client recently? When are we meeting with them next?

Project Expectations

  • All team members should contribute equally and proactively to project work; we will evaluate team contributions through a peer evaluation at the end of the semester and this will be factored into your grade.
  • You / your team lead should make yourself available to speak with your client on a bi-weekly basis (depends on client availability)
  • You / your team lead should meet with your Spark PM on a weekly basis
  • You should meet with your team every other day (can / should be a short meeting)
  • For any team communication issues, please let your Spark! PMs know ASAP; they are here to help. If the problem persists, please email me with a description of the situation.
  • All Spark! project teams have the following roles:
    • Project Managers: These are the project leads and will communicate with the client directly. They will assist with administrative support (meeting scheduling, agenda setting) and will be a point of contact for project questions and concerns. Project Managers are also responsible for grading all Spark! project deliverables as detailed in the syllabus below.
    • Team Lead: These students will assist the Project Manager in attending client meetings, organizing team questions, and facilitating team meetings.
    • Team Members: These students work collaboratively with each other on the project goals.
  • For details on what you must submit as part of your project, see the section “Project Deliverables” at the end of this syllabus.

Spark! Collaboration

BU Spark! offers students an opportunity to work on technical projects provided by companies or organizations in the Greater Boston area through our experiential learning lab (X-Lab). For this semester, Spark! has partnered with DS701 to offer a diverse selection of external data science projects scoped to support the course’s learning outcomes and enhance the student experience. To learn more about Spark!, please visit their website.

Your project team will be led by one of the Spark! Project Managers. Their role is to support the student team’s work plan, manage client communication and expectations, organize weekly and biweekly meetings, and oversee project deliverable grading.

Spark! projects are a great opportunity for students to get real-world project experience to highlight on their GitHub and CV. These projects have already been curated and will be presented during “Pitch Day”. Project descriptions will be made available at the start of the semester.

Accommodations for Students with Disabilities

If you have a disability and have an accommodations letter from the Disability & Access Services office, we encourage you to discuss your accommodations and needs with us as early in the semester as possible. We will work with you to ensure that accommodations are provided as appropriate. If you suspect that you may have a disability and would benefit from accommodations but are not yet registered with BU Disability & Access Services, we encourage you to find more information at https://www.bu.edu/disability/.

Generative AI Assistance (GAIA) Policy

In general, we follow the policy outlined in the CDS GAIA Policy.

The student responsibilities below are extracted and paraphrased from that policy. Where the CDS policy and the text below conflict, the text below takes precedence.

Students shall:

  1. Give credit to AI tools whenever used, even if only to generate ideas rather than usable text, illustrations or code.
  2. When using AI tools on written assignments, unless prohibited, add an appendix showing
    1. the entire exchange, highlighting the most relevant sections;
    2. a description of precisely which AI tools were used (e.g. ChatGPT private subscription version or DALL-E free version);
    3. an explanation of how the AI tools were used (e.g. to generate ideas, turns of phrase, elements of text, long stretches of text, lines of argument, pieces of evidence, maps of conceptual territory, illustrations of key concepts, etc.);
    4. an account of why AI tools were used (e.g. to save time, to surmount writer’s block, to stimulate thinking, to handle mounting stress, to clarify prose, to translate text, to experiment for fun, etc.).
    5. Optional but recommended: Employ AI detection tools and originality checks prior to submission, ensuring that their submitted work is not mistakenly flagged.
  3. When using AI tools on coding assignments, unless prohibited
    1. Add the prompt text and tool used as comments before the generated code. Clarify whether the code was used as is, or modified somewhat, moderately or significantly.
  4. When using AI assistants incorporated into IDEs, such as GitHub Copilot in VS Code, be extra mindful of when to allow Copilot generation. It can be handy for looking up syntax, but it will also generate entire functions. If the assignment allows it and you generate complete functions, cite the tool in the comments in the function.
  5. Not use AI tools during in-class examinations or assignments unless explicitly permitted and instructed.
  6. Use AI tools wisely and intelligently, aiming to deepen understanding of subject matter and to support learning.
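
For coding assignments, the disclosure described in item 3 might look like the sketch below. The comment layout, the tool named, the prompt, and the function itself are all hypothetical; the policy asks for this information but does not prescribe a format.

```python
# AI assistance disclosure (hypothetical layout, not an official template)
# Tool: ChatGPT, free version
# Prompt: "Write a Python function that min-max scales a list of numbers to [0, 1]."
# Use of output: modified moderately (added handling for constant input)
def min_max_scale(values):
    """Rescale a non-empty list of numbers linearly so min maps to 0 and max to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([2, 4, 6]))
```

Placing the disclosure directly above the generated code keeps the citation attached to the code it describes, even if the function is later moved to another file.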

As these generative assistive tools become widely deployed and pervasive, we believe they will become integral to most people’s workflow. However, for foundational concepts such as those taught in this course, it is in your best interest, and worth the effort, to struggle a bit in creating your own answers and solutions. It is just as important to learn what doesn’t work, and which paths are dead ends, as it is to learn what does work. When you are faced with new and unique problems, the intuition you develop will be vital in choosing directions. More pragmatically, some of the most coveted jobs at the most selective companies require technical interviews where they expect you to know these foundational concepts without assistance.

And finally, to reiterate, it is vitally important, and a core part of academic integrity, to cite when you are using Generative AI Assistive technologies. Arguably, not citing and risking plagiarism is worse than using and citing GAIA.

Academic Honesty

You may discuss homework assignments with classmates, but you are solely responsible for what you turn in. Collaboration in the form of discussion is allowed, but all forms of cheating (copying parts of a classmate’s assignment, plagiarism from books or old posted solutions) are NOT allowed. We – both teaching staff and students – are expected to abide by the guidelines and rules of the Academic Code of Conduct (which is at http://www.bu.edu/academics/policies/academic-conduct-code/).

Graduate students must also be aware of and abide by the GRS Academic Conduct code at http://www.bu.edu/cas/students/graduate/forms-policies-procedures/academic-discipline-procedures/.

You can probably, if you try hard enough, find solutions for homework problems online. Given the nature of the Internet, this is inevitable. Let me make a couple of comments about that:

  1. If you are looking online for an answer because you don’t know how to start thinking about a problem, talk to Ms. Lu or me; we may be able to give you pointers to get you started. Piazza is great for this – you can usually get an answer in an hour if not a few minutes.
  2. If you are looking online for an answer because you want to see if your solution is correct, ask yourself if there is some way to verify the solution yourself. Usually, there is. You will understand what you have done much better if you do that. So … it would be better to simply submit what you have at the deadline (without going online to cheat) and plan to allocate more time for homeworks in the future.


This syllabus provides a general plan for the course; deviations may be necessary depending on the progress of the class.


Footnotes

  1. Tan et al., Introduction to Data Mining, Pearson, 2019, https://www-users.cse.umn.edu/~kumar001/dmbook/index.php