Advanced Topics in Cloud Networking and Computing

Graduate Seminar, University of Maryland, CS, 2023

The course explores the latest advances in cloud networking and computing in light of emerging workloads (e.g., machine learning and large-scale analytics), including communication platforms, compute parallelism, and datacenter networking. The class will discuss recent developments across the entire networking stack, the interactions between networks and higher-level applications, and their connections to other system components such as compute and storage. The course combines group readings and presentations of influential publications in the field, lectures by the instructor, talks by invited speakers, and a final project.

Instructor

  • Instructor: Prof. Alan Zaoxing Liu
  • Class Time: Tue/Thu 2:00-3:15 PM, IRB 1207
  • Office Hours: IRB 5138, Thu 4-5 PM or by appointment

Announcements

  • The link to submit paper reviews is here.
  • In-person classes are cancelled for the first two weeks.

Prerequisites

All levels are welcome. Experience with computer networking and software systems is recommended, including one or more of CMSC330, CMSC412, CMSC414, or CMSC417, or permission of the instructor. The assignments and projects assume familiarity with programming (e.g., Python and C/C++).

Textbook

There are no mandatory textbooks for this course, but every class will have corresponding readings from research papers. A reading list with links to the papers will be provided.

Course Overview

  1. Paper Reviews: Each student reviews one paper per class from top conferences or journals. Submit reviews before class; each review has four sections: summary, paper strengths, paper weaknesses, and detailed comments.

  2. Paper Presentations: Each student will select papers from the reading list (the list will be provided; selections are first-come, first-served) and present them during lectures. Each presentation will be followed by a technical discussion.

  3. Lectures: For each topic, the instructor will give one or two introductory lectures, followed by paper presentations by class participants.

  4. Programming Assignments: There will be (tentatively) two programming assignments during the semester. These assignments assume basic computer systems knowledge and some familiarity with network programming.

  5. Project: This class has a final project:

    • Topic: Reproduce a paper discussed in class, or conduct novel research with a system-building component.
    • Students can work alone or in groups of two. The project must involve writing some code or conducting measurement studies.
    • The project can overlap with other research projects, with the instructor's permission.

Academic Conduct Statement

This statement covers expectations for academic honesty, the consequences for cheating or plagiarism, course-specific guidelines (e.g., the extent of allowable collaboration on assignments), and a link to the University of Maryland Academic Conduct Code.

Grading

  • Class participation: 10%
  • Paper reviews: 20%
  • Paper presentation: 10%
  • Programming/measurement assignments: 20%
  • Project: 40%

Late Policy: Programming/measurement assignments lose 10% of the grade for each 24-hour period past the deadline, with partial periods rounded up.
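
For concreteness, here is a minimal Python sketch of the penalty arithmetic; the late_penalty helper and the cap at a 100% deduction are assumptions for illustration, not an official grading script.

    import math

    def late_penalty(hours_late: float) -> float:
        # Illustrative helper (an assumption, not the course's grading script):
        # each started 24-hour period past the deadline deducts 10% of the grade.
        if hours_late <= 0:
            return 0.0
        periods = math.ceil(hours_late / 24)  # "rounded up" to whole 24-hour periods
        return min(1.0, 0.10 * periods)       # fraction of the grade deducted, capped at 100%

    # Example: an assignment submitted 30 hours late spans two 24-hour periods,
    # so it loses 20% of its grade.
    print(late_penalty(30))  # 0.2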

Tentative Course Schedule

Date | Topics | Readings
Week 1 | Administrative delay | No class; (Optional) How to Read a Paper
Week 2 | Administrative delay | No class; (Optional) MLSys: The New Frontier of Machine Learning Systems
Week 3 | Course Overview (slides) | A Datacenter Infrastructure Perspective for ML (HPCA’18)
Week 4 | New Architecture | Empowering Azure Storage with RDMA (NSDI’23); Optimized Network Architectures for Large Language Model Training with Billions of Parameters
Week 5 | Data Parallelism | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel (VLDB’23); A Berkeley View of Systems Challenges for AI
Week 6 | Model Parallelism | PipeDream: Generalized Pipeline Parallelism for DNN Training (SOSP’19); GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (NeurIPS’19)
Week 7 | Tensor and Automated Parallelism | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism; Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI’22)
Week 8 | Communication Library | A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (OSDI’20); NCCL Communication Primitives
Week 9 | Congestion Control | Congestion Control in Machine Learning Clusters (HotNets’22); Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation (HotNets’22)
Week 10 | ML Workflow: Data | Understanding Data Storage and Ingestion for Large-scale Deep Recommendation Model Training (ISCA’22); Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models (NSDI’22)
Week 11 | Preprocessing | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines (SIGMOD’22); FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline (VLDB’23)
Week 12 | Job Scheduling and Energy | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines (SIGMOD’22); Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (OSDI’22)
Week 13 | Model Serving | Cocktail: A Multidimensional Optimization for Model Serving in Cloud (NSDI’22); Thanksgiving Break
Week 14 | New Hardware | TPUv4; Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software-Defined Networking (SIGCOMM’22); FAERY: An FPGA-accelerated Embedding-based Retrieval System (OSDI’22)
Week 15 | Final Presentations |

Paper Reviews

The goal of the reviews is to get you comfortable reading research papers in the software systems and networking space.

  • Students are expected to write reviews for the papers in each class. Scores will be based on your top 90% of reviews, so it is okay to miss up to 10% of the reviews over the semester.
  • Reviews are due at noon the day before class (Monday noon for Tuesday classes; Wednesday noon for Thursday classes), so the presenter of the paper has time to collect your questions for discussion in class. For lectures given by guest speakers, we will collect your questions in advance; please also raise them in class.

Project Proposal and Project Pitch Presentation

The project proposal is not graded, but it serves as a good basis for your individual meeting with the instructor and for your pitch presentation. Each student should give a 10-minute talk on their project ideas. The talk should include:

  • What problem are you solving?
  • Why is it an important problem?
  • What are the potential challenges you may face in solving the problem?
  • What are the first steps (your plan for the next month)?

Midterm Project Report

  • Describe the problem you plan to solve, why it is novel/unique, and what the major challenges are.
  • Describe the detailed design for your project and what you have implemented/evaluated so far.
  • Describe the remaining challenges, how you would address them, and your plan for the remaining time.
  • The midterm report should be about 2-4 pages and serve as a starting point for your final project report (see the detailed requirements for the final report below).

Final Project Presentations

This should be similar to a workshop talk. You might consider covering the following content (not necessarily in this order):

  • What problem are you trying to solve?
  • Why is it an important problem?
  • What’s your basic solution to the problem?
  • What are the challenges in the problem?
  • How did you solve these challenges? Or how do you plan to solve the challenges?
  • Some preliminary results
  • Future directions