The @Scale Conference is an invitation-only technical event for engineers who work on large-scale platforms and technologies. This year’s event took place on October 16 at the San Jose Convention Center, where more than 1,300 attendees gathered to discuss how to build applications and services that scale to millions or even billions of people. The conference featured technical deep dives from engineers at companies that operate at scale, including Amazon Web Services, Box, Confluent, Cloudflare, Facebook, Google, Lyft, and NVIDIA.
AI: The next big scaling frontier
Srinivas Narayanan, Applied AI Research Lead, Facebook
In the past several years, Facebook has made significant progress across computer vision, language understanding, speech recognition, and personalization. But this rapid growth brings with it serious scaling challenges. Srinivas discusses the trends and opportunities in advancing AI, including learning with less supervision while scaling inference, making AI work for people all over the world (across hundreds of languages and accents), making billions of decisions for personalization, and steering the growth of AI in a responsible way. He also walks through recent advancements Facebook AI has made in deploying research into production quickly.
Engineering for respect: Building systems for the world we live in
Lea Kissner, Chief Privacy Officer, Humu
Lea shares tools and examples of how to build great products by building respect into them at every step. When we build systems, we want to build great systems. Great products and systems are respectful, treating both their users and other affected parties with care, concern, and consideration for their needs and feelings. Doing this at scale means that human needs and feelings are also at scale. We must consider the diversity of human experience, and the diversity of human vulnerability, as part of building a great product. Privacy, security, and anti-abuse are all key to building with respect, but at the core, what we need is an understanding of humans and the societies they build.
@Scale 2019: Data Infra
Zanzibar: Google’s consistent, global authorization system
Ruoming Pang, Principal Software Engineer, Google
Determining whether online users are authorized to access digital objects is central to preserving privacy. In this context, Ruoming discusses the design, implementation, and deployment of Zanzibar, a global system for storing and evaluating access control lists. Zanzibar provides a uniform data model and configuration language for expressing a wide range of access control policies from hundreds of client services at Google, including Calendar, Cloud, Drive, Maps, Photos, and YouTube. He conveys how Zanzibar’s authorization decisions respect causal ordering of user actions, therefore providing external consistency amid changes to access control lists and object contents. Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services that billions of people use.
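To make the idea of a uniform data model for access control lists more concrete, here is a minimal sketch. Zanzibar's published design expresses ACL entries as relation tuples of the form object#relation@user, where the user may itself be a userset such as the members of a group. The dictionary layout, the names such as doc:readme and group:eng, and the check function are illustrative assumptions, not Zanzibar's API. The sketch does not model consistency, the configuration language, or scale.

```python
# Hypothetical in-memory store of relation tuples.
# Key: (object, relation). Value: set of subjects, where a subject is either
# a user id string or an (object, relation) pair naming another userset.
ACLS = {
    ("doc:readme", "viewer"): {"user:alice", ("group:eng", "member")},
    ("group:eng", "member"): {"user:bob"},
}


def check(obj, relation, user, acls=ACLS, _seen=None):
    """Return True if `user` holds `relation` on `obj`, following usersets."""
    seen = _seen if _seen is not None else set()
    if (obj, relation) in seen:
        return False  # already visited; guards against cyclic group membership
    seen.add((obj, relation))
    for subject in acls.get((obj, relation), ()):
        if subject == user:
            return True  # direct grant
        if isinstance(subject, tuple) and check(subject[0], subject[1], user, acls, seen):
            return True  # grant inherited through a nested userset
    return False
```

In this toy store, check("doc:readme", "viewer", "user:bob") succeeds only because bob is a member of group:eng, which is the kind of indirection a shared model lets many services reuse.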
6 technical challenges in developing a distributed SQL database
Neha Deodhar, Software Engineer, YugaByte
Neha discusses the experience of developing YugaByte. In doing so, she outlines some of the hardest architectural issues they had to address while building an open source, cloud native, high-performance distributed SQL database. Neha touches on several topics, including architecture, SQL compatibility, distributed transactions, consensus algorithms, atomic clocks, and PostgreSQL code reuse.
Architecture of a highly available sequential data platform
Jana van Greunen, Engineering Manager, Facebook
LogDevice is a unified, high-throughput, low-latency platform for handling a variety of data streaming and logging needs. Jana lays out the architectural details and variants of Paxos used in LogDevice to provide scalability, availability, and data management. She demonstrates that these provide important flexibility to simplify the applications built on top of the system.
Amazon DynamoDB: Fast and flexible NoSQL database service for any scale
Akshat Vig, Principal Software Engineer, Amazon Web Services
Arturo Hinojosa, Senior Product Manager, Amazon Web Services
Amazon DynamoDB is a hyperscale, NoSQL database designed for internet-scale applications, such as serverless web apps, mobile backends, and microservices. DynamoDB provides developers with the security, availability, durability, performance, and manageability they need to run mission-critical workloads at extreme scale. Akshat and Arturo deep-dive into the underpinnings of DynamoDB, conveying how they run a fully managed, nonrelational database service that more than 100,000 customers use. They discuss how features such as DynamoDB Streams, ACID transactions, continuous backups, point-in-time recovery (PITR), and global tables work at scale. They also share key learnings garnered from building a highly durable, scalable, and available key-value store that you can apply when building your large-scale systems.
Kafka @Scale: Confluent’s journey bringing event streaming to the cloud
Ganesh Srinivasan, VP of Engineering, Confluent
Ganesh delves into the evolution and future of event streaming, conveying the lessons that Confluent learned through its journey to make the platform cloud-native. As streaming platforms become central to data strategies, companies both small and large are rethinking their architecture with real-time context at the forefront. What was once a “batch” mind-set is quickly being replaced with stream processing as the demands of the business impose more and more real-time requirements on developers and architects. This started at companies such as Facebook, LinkedIn, Netflix, Uber, and Yelp and has made its way to other companies in a variety of sectors. Today, thousands of companies across the globe build their businesses on top of Apache Kafka.
Disaggregated graph database with rich read semantics
Sebastian Wong, Software Engineering Manager, Facebook
Access to the social graph is an important workload for Facebook. Supporting a graph data model is inherently difficult because the underlying system has to be capable of efficiently supporting the combinatorial explosion of possible access patterns that can arise from even a single traversal. Sebastian discusses the introduction of an ecosystem of disaggregated secondary indexing microservices, which are kept up to date and consistent with the source of truth via access to a shared log of updates into a serving stack. This work has enabled Facebook to efficiently accommodate access pattern diversity by transparently optimizing graph accesses under the hood. It is responsible for a reduction of more than 50 percent in read queries to the source of truth and has unlocked the ability for product developers to access the graph in new game-changing ways. He also discusses the logical next steps that we are exploring as a consequence of this work.
@Scale 2019: AI
Unique challenges and opportunities for self-supervised learning in autonomous driving
Ashesh Jain, Head of Perception, Lyft
Autonomous vehicles generate a lot of raw (unlabeled) data every minute, but only a small fraction of that data can be labeled manually. Ashesh focuses on how we leverage unlabeled data for perception and prediction tasks in a self-supervised manner. He touches on a few ways to achieve this goal that are unique to the AV domain, including cross-modal self-supervised learning, in which one modality can serve as a learning signal for another modality without the need for labeling. Another approach he touches on is using outputs from large-scale optimization as a learning signal: Neural networks are trained to mimic those outputs while running in real time on the AV. Ashesh further explores how we can leverage the Lyft fleet to oversample long-tail events and thereby learn the long tail.
Reading text from visual content at scale
Vinaya Polamreddi, Machine Learning Engineer, Facebook
Vinaya presents multiple innovations across modeling, training infrastructure, deployment infrastructure, and efficiency measures Facebook has made to build its state-of-the-art OCR system running at Facebook scale. There are billions of images and videos posted on Facebook every day, and a significant percentage of them contain text. It is important to understand the text within visual content to provide people with better Facebook product experiences and remove harmful content. Traditional optical character recognition systems are not effective on the huge diversity of text in different languages, shapes, fonts, sizes, and styles. In addition to the complexity of understanding the text, scaling the system to run high volumes of production traffic efficiently and in real time creates another set of engineering challenges.
Multinode: Natural language understanding at scale
Sharan Chetlur, Deep Learning Software Engineering Manager, NVIDIA
This session includes an in-depth look at the world of multinode training for complex NLU models such as BERT. Sharan describes the challenges of tuning for speed and accuracy at the scale needed to bring training times down from weeks to minutes. Drawing from real-world experience running models on as many as 1,500 GPUs with reduced precision techniques, he explores the impact of different optimizers, strategies to reduce communication time, and improvements to per-GPU performance.
Romer Rosales, Senior Director of Artificial Intelligence, LinkedIn
Artificial intelligence powers every product experience at LinkedIn. Whether ranking the member’s feed or recommending new jobs, AI is used to fulfill LinkedIn’s mission of connecting the world’s professionals to make them more productive and successful. Although product functionality can be decomposed into separate components, they are beautifully interconnected, thus creating interesting questions and challenging AI problems that need to be solved in a sound and practical manner. Romer provides an overview of lessons learned and approaches LinkedIn has developed to address problems, including scaling to large problem sizes, handling multiple conflicting objective functions, efficient model tuning, and progress made toward deploying AI to optimize the LinkedIn product ecosystem more holistically.
Pushing the state of the art in AI with PyTorch
Joe Spisak, Product Manager, Facebook
Joe shares how PyTorch is being used to help accelerate the path from novel research to large-scale production deployment in computer vision, natural language processing, and machine translation at Facebook. He further explores the latest product updates, libraries built on top of PyTorch, and new resources for getting started.
Natural language processing for production-level conversational interfaces
Karthik Raghunathan, Head of Machine Learning at Webex Intelligence, Cisco Systems
Conversational applications often are overhyped and underperform. There’s been significant progress in natural language understanding in academia and a huge growing market for conversational technologies, but NLU performance drops significantly when you introduce language with typos or other errors, uncommon vocabulary, and more complex requests. Karthik explains how to build a production-quality conversational app that performs well in a real-world setting. He covers the domain-intent-entity classification hierarchy that has become an industry standard and describes their extensions to this standard architecture such as entity resolution and shallow semantic parsing that further improve system performance for nontrivial use cases. He demonstrates an end-to-end approach for consistently building conversational interfaces with production-level accuracies that have proven to work well for several applications across diverse verticals.
@Scale 2019: Privacy
Fairness and privacy in AI/ML systems
Krishnaram Kenthapadi, Tech Lead, Fairness, Explainability, and Privacy, LinkedIn
With the ongoing explosive growth of AI/ML models and systems, Krishnaram explores some of the ethical, legal, and technical challenges that researchers and practitioners alike encounter. He discusses the need for adopting a fairness- and privacy-by-design approach when developing AI/ML models and systems for different consumer and enterprise applications. Then he focuses on the application of fairness-aware machine learning and privacy-preserving data-mining techniques in practice by presenting case studies spanning different LinkedIn applications, such as fairness-aware talent search ranking, privacy-preserving analytics, and LinkedIn salary privacy and security design.
Fighting fraud blindfolded
Subodh Iyengar, Software Engineer, Facebook
Logging is essential to running a service or an app. But every app faces a dilemma: The more data is logged, the more we understand the problems of users, but the less privacy they have. One way to add privacy is to report events without identifying users; however, this anonymity can allow fraudulent logs to be reported. Subodh introduces the problem of fraudulent reporting alongside an approach to using cryptography, including blind signatures, to enable fraud-resistant anonymous reporting. He presents a specific application to ads, demonstrating how fraud-resistant private reporting can be feasible; some different approaches for implementation; and some open problems.
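As a rough illustration of how a blind signature lets a server vouch for a report without seeing it, here is a minimal sketch using textbook RSA. The abstract does not say which blind signature scheme the talk uses, so RSA is an assumption, and the tiny key (N = 61 * 53) is a toy value that is far too small for real use. The function names are hypothetical. A real deployment would also hash and pad the message.

```python
from math import gcd

# Toy textbook RSA key: N = 61 * 53, public exponent E, private exponent D.
# Illustrative only; real keys are thousands of bits long.
N, E, D = 3233, 17, 2753


def blind(message, r):
    """Client: hide `message` with a random factor r that is coprime to N."""
    if gcd(r, N) != 1:
        raise ValueError("blinding factor must be coprime to N")
    return (message * pow(r, E, N)) % N


def sign(blinded):
    """Server: sign the blinded value without learning the message."""
    return pow(blinded, D, N)


def unblind(blind_signature, r):
    """Client: strip the blinding factor (modular inverse needs Python 3.8+)."""
    return (blind_signature * pow(r, -1, N)) % N


def verify(message, signature):
    """Anyone: check the signature against the public key (N, E)."""
    return pow(signature, E, N) == message % N
```

The client would later attach the unblinded signature to an anonymous report. The server can confirm that it signed the token at some point, but it cannot link the token back to the signing request.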
DNS privacy at scale: Lessons and challenges
Nick Sullivan, Head of Research, Cloudflare
It’s no secret that the use of the domain name system reveals a lot of information about what people do online. The use of traditional unencrypted DNS protocols reveals this information to third parties on the network, introducing privacy risks to users as well as enabling country-level censorship. In recent years, internet protocol designers have sought to retrofit DNS with several new privacy mechanisms to help provide confidentiality to DNS queries. The results of this work include technologies such as DNS-over-TLS, DNS-over-HTTPS, and encrypted SNI for TLS. Nick shares some of the technical and political challenges that arise when deploying these technologies.
Data Transfer Project: Expanding data portability at scale
Jessie Chavez, Engineering Manager, Google
William Morland, Software Engineer, Facebook
The Data Transfer Project was launched in 2018 to create an open source, service-to-service data portability platform so all individuals across the web could easily move their data between online service providers whenever they want. Jessie and William offer a technical overview of the architecture as well as several components developed to support the Data Transfer Project’s ecosystem, including common data models, the use of industry standards, the adapter framework, and more. They walk through a developer setup of the framework, showing how to integrate with the project and give users the ability to port data into and out of their services.
Firefox origin telemetry with Prio
Anthony Miyaguchi, Data Engineer, Mozilla
Measuring browsing behavior by site origin can provide actionable insights into the broader web ecosystem in areas such as blocklist efficacy and web compatibility. But an individual’s browsing history contains deeply personal information that browser vendors should not collect wholesale. Anthony discusses ways to precisely measure aggregate page-level statistics using Prio, a privacy-preserving data collection system developed by Stanford researchers and deployed in Firefox. In Prio, a small set of servers verifies and aggregates data through the exchange of encrypted shares. As long as one server is honest, there is no way to recover individual data points. He explores the challenges Mozilla faced when implementing Prio, both in Firefox and its Data Platform, touching on how Mozilla validated its deployment of Prio through two experiments: one which collects known telemetry data and one which collects new data on the application of Firefox’s blocklists further across the web. He concludes with the results of Mozilla’s experiments and how those results are informing their plans.
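The claim that one honest server is enough to protect individual data points follows from secret sharing, which the sketch below illustrates in its simplest additive form. This is a simplified assumption-laden example, not Prio itself: it omits the encryption of shares and the validity proofs that let servers reject malformed submissions. The modulus and function names are illustrative.

```python
import secrets

# Illustrative modulus for the arithmetic; an assumption, not Prio's parameter.
MODULUS = 2**61 - 1


def split(value, n_servers):
    """Client: split `value` into n_servers random shares that sum to it."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_servers - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def aggregate(per_client_shares):
    """Each server sums the shares it received; only the totals are combined."""
    n_servers = len(per_client_shares[0])
    server_totals = [
        sum(client[s] for client in per_client_shares) % MODULUS
        for s in range(n_servers)
    ]
    return sum(server_totals) % MODULUS
```

Any single share is a uniformly random number, so a server that sees only its own shares learns nothing about one client's value. Combining the per-server totals reveals just the aggregate, for example how many clients reported a 1.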
@Scale 2019: Security
Leveraging the type system to write secure applications
Shannon Zhu, Software Engineer, Facebook
Shannon discusses ways to extend the type system to eliminate entire classes of security vulnerabilities at scale. Application security remains a long-term and high-stakes challenge for most projects that interact with external users. Python’s type system is already widely used for readability, refactoring, and bug detection — Shannon demonstrates how types can also be leveraged to make a project systematically more secure. She investigates (1) how static type checkers such as Pyre or MyPy can be extended with simple library modifications to catch vulnerable patterns, and (2) how deeper type-based static analysis can reliably flag remaining use cases to security engineers.
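One generic way a library can use types to catch a vulnerable pattern is to give sanitized data its own type, so that a static checker rejects raw strings where safe ones are required. The sketch below is an assumption for illustration, not the approach from the talk: SafeHtml, escape, and render_comment are hypothetical names, and it relies only on typing.NewType and html.escape from the Python standard library.

```python
import html
from typing import NewType

# A distinct static type for strings that have already been escaped.
# At runtime SafeHtml is still a plain str; the distinction exists for the checker.
SafeHtml = NewType("SafeHtml", str)


def escape(untrusted: str) -> SafeHtml:
    """The only sanctioned way to produce a SafeHtml value."""
    return SafeHtml(html.escape(untrusted))


def render_comment(body: SafeHtml) -> str:
    """Accepts only escaped text, so raw user input cannot reach the template."""
    return "<p>" + body + "</p>"


rendered = render_comment(escape("<script>alert(1)</script>"))

# render_comment("<script>alert(1)</script>") would still run at runtime,
# but a static type checker reports it: a plain str is not a SafeHtml.
```

The design choice is that the check happens before the code ships: the unsafe call is flagged during type checking rather than discovered in production.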
Securing SSH traffic to 190+ data centers
Samuel Rhea, Product Manager, Cloudflare
Evan Johnson, Security Engineering Manager, Cloudflare
Cloudflare maintains thousands of servers in more than 190 points of presence that need to be accessed from multiple offices. Samuel and Evan discuss their experiences depending on a private network and SSH keys to securely connect to those machines. They share the risk that the private network perimeter poses if breached and the need to carefully manage and revoke those keys as needed. They demonstrate how they resolved these challenges by building and migrating to a model in which they expose the servers to the public internet and require users to authenticate with an identity provider to reach them. To do this, they deployed a system that leverages ephemeral certificates, based on user identity, to eliminate SSH keys across the organization. Samuel and Evan ultimately share what they’ve learned in the three years that Cloudflare has been building a zero-trust layer on top of its existing network to secure both HTTP and non-HTTP traffic.
Enforcing encryption at scale
Mingtao Yang, Software Engineer, Facebook
Ajanthan Asogamoorthy, Software Engineer, Facebook
Facebook runs a global infrastructure that supports thousands of services, with many new ones spinning up daily. Protecting network traffic is taken very seriously, and engineers must have a sustainable way to enforce security policies transparently and globally. One requirement is that all traffic that crosses “unsafe” network links must be encrypted with TLS 1.2 or above using secure modern ciphers and robust key management. Mingtao and Ajanthan describe the infrastructure they built for enforcing the “encrypt all” policy on the end hosts, as well as alternatives and trade-offs encompassing how they use BPF programs. Additionally, they discuss Transparent TLS (TTLS), a solution that they’ve built for services that could not enable TLS natively or could not easily upgrade to a newer version of TLS.
The call is coming from inside the house: Lessons in securing internal apps
Hongyi Hu, Engineering Manager, Dropbox
Locking down internal apps presents unique and frustrating challenges for appsec teams. Your organization may have dozens if not hundreds of sensitive internal tools, dashboards, and control panels, running on heterogeneous technical stacks with varying levels of code quality, technical debt, external dependencies, and maintenance commitments. Hongyi discusses experiences in managing internal appsec, conveying the technical and management lessons Dropbox has learned in the process. He captures what worked well (finding a useful mental model to organize a road map and rolling out content security policy) and what didn’t.
Streaming, flexible log parsing with real-time applications
Bartley Richardson, AI Infrastructure Manager and Senior Data Scientist, NVIDIA
Logs from cybersecurity appliances are numerous, generated from heterogeneous sources, and frequently victim to poor hygiene and malformed content. Relying on an already understaffed human workforce to constantly write new parsers, triage incorrectly parsed data, and keep up with ever-increasing data volumes is bound to fail. Using RAPIDS, an open source data science platform, Bartley conveys how creating a more flexible, neural network approach to log parsing can overcome these obstacles. He presents an end-to-end workflow that begins with raw logs, applies flexible parsing, and then applies stream analytics (e.g., rolling z-score for anomaly detection) to the near real-time parsing. By keeping the entire workflow on GPUs (either on premises or in a cloud environment), he demonstrates near real-time parsing and the ability to scale to large volumes of incoming logs.
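The rolling z-score mentioned above can be sketched in a few lines. This is a plain CPU version in standard-library Python, written only to show the arithmetic; it is not the GPU workflow from the talk and does not use RAPIDS. The function names, window size, and threshold are illustrative assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_zscores(values, window):
    """Score each value against the `window` values that precede it.

    Returns a list with None until enough history has accumulated.
    """
    history = deque(maxlen=window)
    scores = []
    for v in values:
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # A flat window has zero spread; report 0.0 rather than divide by zero.
            scores.append((v - mu) / sigma if sigma > 0 else 0.0)
        else:
            scores.append(None)
        history.append(v)
    return scores


def anomalies(values, window=5, threshold=3.0):
    """Return the indices whose rolling z-score exceeds the threshold."""
    return [
        i
        for i, z in enumerate(rolling_zscores(values, window))
        if z is not None and abs(z) > threshold
    ]
```

Applied to a per-interval count of parsed events such as [10, 11, 9, 10, 11, 10, 100, 10], only the spike at index 6 is flagged, because each point is compared with its own recent history rather than a global average.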
Automated detection of blockchain network events
Shamiq Islam, Director of Security, Coinbase
Ever wondered what goes on behind the scenes to keep user assets safe in the notoriously dangerous field of cryptocurrency custodianship? Turns out you can model cryptocurrency protocols after existing communications networks, then build tooling to monitor and respond to threats as they emerge. Shamiq opens with examples of concerning threats and walks us through ways to detect those threats. He further discusses challenges Coinbase experienced extending their tools from one protocol and heuristic to many protocols and heuristics.