At any given time, the Stanford Platform Lab has a few themes. These are platform areas where we have multiple projects involving multiple faculty. The themes provide a vehicle for collaboration among the faculty and students in the lab. Our themes change every few years, as technologies and faculty interests evolve.
In addition to themes, we also have a variety of smaller projects. Some of these projects will grow to become future themes.
Systems and Machine Learning
Machine learning is having a revolutionary impact on society. This theme explores two intersection points between systems and machine learning. First, we are developing new systems to support the next generation of machine learning applications. Second, we are exploring how machine learning techniques can be used to improve systems.
The underlying systems infrastructure needed to support the performance, security, and robustness of ML deployments at scale is still in its infancy. As a result, tremendous human effort is currently devoted to creating usable, production-worthy ML systems; for example, a single repeatable test-debug-deploy cycle for developers often takes hours or days. Our research on systems for machine learning addresses three overall questions:
- How do we scale the training, deployment, monitoring, and debugging of models? We are developing new systems for scalable data management and scale-out training techniques to enable fast and interactive model deployment. We are also exploring how to help developers after a model is deployed: we are developing AIOps algorithms and creating tools to monitor, debug, and explain the diversity in inference performance when models are deployed on devices with radically different capabilities.
- How do we deliver privacy-preserving AI at the edge? One approach is to leverage hardware secure-enclave capabilities to enforce privacy guarantees when deploying models on edge devices; the research focus is on feasible and enforceable privacy policies, both for single devices and for clusters of devices that are within a trust zone (e.g., all devices in a home). Another area of research is the auditability of models, specifically how to design generative models that can test for privacy leakage violations. In general, we are interested in how to enable users to specify privacy policies and how to leverage systems mechanisms to implement those policies.
- How can edge devices intelligently leverage the cloud to deliver scalable and collaborative inference and learning? One approach is to develop new algorithms for distributed learning and inference in which models running on edge devices intelligently and selectively leverage more sophisticated models running in the cloud.
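One simple way to picture selective use of a cloud model is a confidence-threshold cascade: the edge model answers when it is confident and defers to the cloud model otherwise. The sketch below is only an illustration under that assumption and is not the lab's algorithm; `cascade_predict`, the toy models, and the `threshold` value are all hypothetical.

```python
from typing import Callable, Tuple

# A model maps an input to (predicted label, confidence in [0, 1]).
Model = Callable[[float], Tuple[int, float]]


def cascade_predict(x: float, edge_model: Model, cloud_model: Model,
                    threshold: float = 0.8) -> Tuple[int, str]:
    """Return (label, source): use the edge answer when it is confident
    enough, otherwise escalate the input to the cloud model."""
    label, confidence = edge_model(x)
    if confidence >= threshold:
        return label, "edge"
    cloud_label, _ = cloud_model(x)
    return cloud_label, "cloud"


# Toy stand-ins (hypothetical): a small edge model that is unsure near the
# decision boundary, and a larger cloud model that is always confident.
def toy_edge_model(x: float) -> Tuple[int, float]:
    if x < 0.3:
        return 0, 0.95
    if x > 0.7:
        return 1, 0.95
    return 0, 0.55  # low confidence in the ambiguous middle region


def toy_cloud_model(x: float) -> Tuple[int, float]:
    return (1 if x >= 0.5 else 0), 0.99


if __name__ == "__main__":
    for x in (0.1, 0.6, 0.9):
        print(x, cascade_predict(x, toy_edge_model, toy_cloud_model))
```

In this sketch, raising `threshold` sends more inputs to the cloud (higher accuracy, more latency and bandwidth), while lowering it keeps more inference on the device.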
A substantial body of research seeks to apply recent machine learning techniques to create and optimize computer systems that involve reasoning under uncertainty, e.g., congestion control, job scheduling, database query planning, bitrate adaptation for video, software fuzzing, channel scheduling in 5G networks, intrusion detection, spam filtering, and prevention of abuse and denial-of-service attacks. But practitioners are finding that while many ML approaches appear promising in simulations or small-scale tests, these gains don’t necessarily transfer to real-world deployments. The reasons for this phenomenon are not fully understood. Computer systems and networks are complex, diverse, and challenging to simulate faithfully in training; individual nodes may observe only a noisy sliver of the system dynamics; and behavior is often heavy-tailed and changes with time or with the behavior of adversaries. In the Puffer research project, Platform Lab researchers led by Keith Winstein are running a series of real-world randomized blinded experiments, evaluating ML techniques for bitrate adaptation, network prediction and bandwidth estimation, and congestion control on a popular video-streaming website and a real-time videoconferencing platform, both operated by the lab. The findings are leading to new methods of training and developing ML-guided systems, including new simulation and augmentation techniques, with the goal that the resulting systems generalize and hold up in real-world deployment environments.
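To make the idea of a randomized blinded experiment concrete, the sketch below assigns each session to a scheme at random and hides scheme names behind opaque arm labels until the analysis is done. It is a minimal illustration, not Puffer's actual implementation; the function names, the arm-label format, and the example metric are assumptions.

```python
import random
from collections import defaultdict
from statistics import mean


def assign_arms(session_ids, schemes, seed):
    """Randomly assign each session to an opaque arm label.

    Returns (assignment, key). `key` maps arm labels to scheme names and
    would be kept away from the analyst until results are final (blinding).
    """
    rng = random.Random(seed)
    shuffled = list(schemes)
    rng.shuffle(shuffled)  # so the arm index does not reveal the scheme
    key = {f"arm-{i}": scheme for i, scheme in enumerate(shuffled)}
    arms = sorted(key)
    assignment = {sid: rng.choice(arms) for sid in session_ids}
    return assignment, key


def summarize(assignment, metric_by_session):
    """Average a per-session metric within each (still blinded) arm."""
    by_arm = defaultdict(list)
    for sid, arm in assignment.items():
        by_arm[arm].append(metric_by_session[sid])
    return {arm: mean(values) for arm, values in by_arm.items()}


if __name__ == "__main__":
    sessions = [f"session-{i}" for i in range(6)]
    assignment, key = assign_arms(sessions, ["scheme_a", "scheme_b"], seed=1)
    # Hypothetical metric values, e.g. stall time in seconds per session.
    metrics = {sid: float(i) for i, sid in enumerate(sessions)}
    print(summarize(assignment, metrics))  # analyst sees arm labels only
    print(key)                             # unblinding happens last
```

Random assignment is what lets a difference between arms be attributed to the scheme rather than to which users or networks happened to receive it.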
Building Time-Critical Systems and Powering Time-Sensitive Applications
The first decade of the 21st century witnessed the explosive adoption of server virtualization and data center technologies, culminating in the cloud computing revolution. The second decade (2011–2020) saw the development of technologies for batch processing and bulk storage of “big” data. This data typically pertained to records generated in the past (e.g., yesterday’s or last week’s business receipts or operational data) and time wasn’t a critical factor during processing.
As more of the real world’s interactions are pulled into the digital realm, we believe the coming decade will witness the development of large-scale time-critical systems. Such systems will need to ingest, clean, and process data from the real world, make complex decisions, and devise control actions under strict time deadlines. The time scales in question range from microseconds (e.g., financial exchanges running in the cloud), through tens to hundreds of milliseconds (e.g., self-driving cars, digital advertising, ride hailing/matching, automated manufacturing, online gaming), to a few seconds (e.g., online/mobile banking, digital payments, online retail).
Time-critical systems and time-sensitive applications will give rise to a host of new problems, design paradigms, and architectures. Some examples are: (i) event ordering and scheduling without global coordination, (ii) building deterministic and jitter-free systems on top of heterogeneous and “jittery” networks, (iii) large-scale monitoring and control using “time perimeters”, (iv) large-scale and high-frequency snapshotting without a central controller, and (v) building multi-leader consensus protocols and state replication systems. These are just a few early examples from what we believe will become an area of massive interest, and we expect time-critical systems to have a large impact on industry and on academic research.
Taking SDNs to the Next Level
With the advent of Software Defined Networks (SDNs), including programmable forwarding and P4, the network has become a deeply programmable platform that network owners can control to suit their needs. It is now possible to program the control plane and the forwarding pipeline end-to-end on servers, NICs, switches, and middleboxes. With this, we can reimagine the network as a dynamic distributed computing platform.
With the network now programmable end-to-end and top-to-bottom, software defines network behavior, and network functions can be placed where they are best suited, which may be in hardware, in on-device software, or on the central SDN controller. In this theme we are leveraging network programmability to enable deep and wide network visibility, verification, and closed-loop control, giving programmers tools to build and dynamically deploy customized network functionality in a secure and reliable manner.
A cornerstone of this theme is Project Pronto, a new cross-organizational effort that includes researchers at Stanford University, Cornell University, Princeton University, and the Open Networking Foundation. Pronto is building and deploying a beta-production, end-to-end, 5G-connected edge cloud. It leverages a fully programmable network, empowered by unprecedented visibility, verification, and closed-loop control capabilities, to fuel innovation while helping to secure future network infrastructure.