苏黎世联邦理工DS3Lab:构建以数据为中心的机器学习系统

本文涉及的产品
函数计算FC,每月15万CU 3个月
简介: 苏黎世联邦理工DS3Lab:构建以数据为中心的机器学习系统

机器之心知识站与国际顶尖实验室及研究团队合作,将陆续推出系统展现实验室成果的系列技术直播,作为深入国际顶尖团队及其前沿工作的又一个入口。赶紧点击「阅读原文」关注起来吧!


12月15日-12月22日,最新一期「机器之心走近全球顶尖实验室」邀请到苏黎世联邦理工学院(ETH Zurich) DS3Lab实验室带来分享。

苏黎世联邦理工学院(ETH Zurich) DS3Lab 实验室 (https://ds3lab.inf.ethz.ch/) 由该校助理教授张策以及16名博士生和博士后组成,实验室隶属于计算机系系统组(https://systems.ethz.ch/)。我们相信,让所有人都可以经济、高效地开发安全可信的机器学习应用是未来机器学习技术推动社会发展的关键一环。以此为中心,DS3Lab 的研究重点是构建以数据为中心、以人为中心,并具有高效可扩展能力的下一代机器学习平台和系统。


DS3Lab 主要致力于两大研究方向,覆盖了机器学习开发、运行、和运维的整个生命周期:1) Ease.ML项目: 该项目主要研究如何设计、管理、加速以数据为中心的机器学习开发、运行和运维流程;2) ZipML项目: 该项目面向新的软硬件环境(如云计算、分布式、数据库、新硬件等),设计实现高效可扩展的机器学习系统。


DS3Lab 的研究成果影响了多个企业级的机器学习平台或其子系统的设计,比如 openGauss 数据库的 DB4AI 系统、Bagua 分布式机器学习平台和 Persia 分布式推荐系统;实验室同时也和工业界(例如 Microsoft、Google、Alibaba、eBay、Oracle, IBM, Kuaishou)合作,共同研发了一些原型系统。DS3Lab 也致力于促进人工智能对社会的积极影响,与一系列国际组织(如世界自然基金会)合作,将机器学习应用于气候、地理、医疗等领域

近年来,DS3Lab 的成员在研究方面获得过诸如 Google Focused Research Award, ICLR Outstanding Paper Award, SIGMOD Best Paper Award, SIGMOD Research Highlight Award 等奖项。许多 DS3Lab 的博士生和博士后在毕业或者出站后能够在世界知名大学,如丹麦哥本哈根大学(University of Copenhagen),创建自己的研究小组,开启新的学术生涯。

12月15日-12月22日,来自苏黎世联邦理工学院(ETH Zurich) DS3Lab实验室的11位嘉宾将带来6期分享:构建以数据为中心的机器学习系统,详情如下:

12月15日 20:00-21:30

主题一:Building ML Systems in the Era of Data-centric AI

Speaker: Ce Zhang

Biography: Ce is an Assistant Professor in Computer Science at ETH Zurich. The mission of his research is to make machine learning techniques widely accessible---while being cost-efficient and trustworthy---to everyone who wants to use them to make our world a better place. He believes in a system approach in enabling this goal, and his current research focuses on building next generation machine learning platforms and systems that are data-centric, human-centric, and declaratively scalable. Before joining ETH, Ce was advised by Christopher Ré. He finished his PhD at University of Wisconsin-Madison and spent another year as a postdoctoral researcher at Stanford. His work has received recognitions such as the SIGMOD Best Paper Award, SIGMOD Research Highlight Award, and Google Focused Research Award, and has been featured and reported by Science, Nature, the Communications of the ACM, and a various media outlets such as Atlantic, WIRED, Quanta Magazine, etc.

Abstract: In this talk, I will provide a broad overview of the research activities of DS3Lab, centered on improving the scalability and usability of machine learning systems and platforms. Many of these activities will be discussed in detail in the following six sessions.


主题二:Ease.ML/DataScope: Data Debugging for ML

Speaker: Bojan Karlas

Biography: Bojan is a fourth-year PhD student at the Systems Group of ETH Zurich advised by Prof. Ce Zhang. His research revolves around data management systems for machine learning. Particularly, he works on building principled tools for managing machine learning workflows, data debugging and data cleaning for ML. His industry experience involves research internships at Microsoft, Oracle and Logitech, as well as a two-year software engineer position at the Microsoft Development Center Serbia where he worked in the SQL Server Parallel Data Warehouse team. Prior to joining ETH, he got a computer science master's degree at EPFL in Lausanne and a software engineering bachelor's degree at ETF Belgrade.

Abstract: Which training examples in my training set are more “guilty” to make my trained machine learning (ML) model low accuracy, unfair, or non-robust to adversarial attacks? If some of my training examples contain missing information, which incomplete training example should I repair first? Answering questions like these will have a profound impact on future ML systems to guide users through both development (MLDevs) and operation (MLOps). This requires us to understand the fundamental concept of data importance with respect to model training.In this talk, we focus on two types of data errors (missing values and wrong values) and examine principled methods for guiding the attention of our data debugging effort. Specifically, for each of these types of data errors, we will go over two kinds of ways to represent data importance which we focus on in our work. For missing data we look at information gain, and for wrong data we look at the Shapley value. There have been intensive recent efforts in speeding up and scaling up the calculation of these importance measures via MCMC or proxy models such as k-nearest neighbors. However, all these efforts only deal with ML training in isolation and ignore the data processing pipelines (or queries) precursing ML training in most, if not all, ML applications. This significantly limits the application of these techniques in real-world scenarios. Our work takes the first step towards jointly analyzing data processing pipeline with ML training and we develop novel algorithms for computing these important metrics in polynomial time.


12月16日 20:00-21:00

主题:Ease.ML/ModelPicker and Ease.ML/CI: Model Testing and CI/CD for Machine Learning

Speaker: Cedric Renggli and Merve Gurel

Biography:

  • Cedric is a PhD Student supervised by Ce Zhang in the Systems Group at ETH Zurich. His main research interest lies in the foundation of usable, scalable and efficient data-centric systems to support all kinds of interactions in a machine learning model lifecycle, broadly known as MLOps. This notably led to the definition of new engineering principles, such as how to run a feasibility study for ML application development, or how to perform continuous integration (CI) of ML models with statistical guarantees. Cedric is furthermore working on efficient methods to enable model search functionalities in pre-trained model collections. This typically serves as a starting point to solve new machine learning tasks using transfer learning.Cedric holds a bachelor's degree from the Bern University of Applied Sciences and received his MSc in Computer Science from ETH Zurich in 2018. His work on Efficient Sparse AllReduce For Scalable Machine Learning was awarded with the silver medal of ETH Zurich for outstanding master thesis. During his PhD, Cedric has worked as a research intern and student research consultant at Google Brain in Zurich.
  • Merve is a PhD candidate within theSystems Group in theDepartment of Computer Science at ETH Zurich. Her research goal is to form a principled understanding of many aspects of machine learning systems: from hardware to label efficiency and robustness. Moreover, using this knowledge, she aims to build theoretically-sound, usable, robust and scalable machine learning systems. Recently, her efforts have been recognized by Google where she has received the 2021 Google Scholarship award. She obtained her masters from theSchool of Computer and Communication Sciences at EPFL, where she was a research scholar at the Information Theory Laboratory. There she worked on topics of advanced probability and information theory such as convergence rates of bounded martingales and rate-distortion theory. During and after her masters, she was also a researcher at IBM Research in Zurich, and worked on building denoising algorithms for modern radio telescopes.

Abstract:Machine learning projects are usually not completed after having successfully trained a machine learning model satisfying the required performance indicators (e.g., accuracy). Rather the model enters into the operational phase, whereby multiple new challenges can occur. In this talk, we will focus on two main data-centric challenges in the operational stage of a ML model lifecycle: (1) how to continuously integrate and test a model to fulfill some requirements with strong statistical guarantees, and (2) how to select the best generalizing model for a target classification task by querying as minimum label as possible. We illustrate their benefits in several practical scenarios.

In the first part of this talk, we focus on continuous integration, an indispensable step of modern software engineering practices to systematically manage the life cycles of system development. Developing a machine learning model is no difference — it is an engineering process with a life cycle, including design, implementation, tuning, testing, and deployment. However, most, if not all, existing continuous integration engines do not support machine learning as first-class citizens. In this talk we present ease.ml/ci, to our best knowledge, the first continuous integration system for machine learning providing statistical guarantees. The challenge of building ease.ml/ci is to provide rigorous guarantees, e.g., single accuracy point error tolerance with 0.999 reliability, with a practical amount of labeling effort, e.g., 2K labels per test. We design a domain specific language that allows users to specify integration conditions with reliability constraints, and develop simple novel optimizations that can lower the number of labels required by up to two orders of magnitude for test conditions popularly used in real production systems.

In the second part, we present Model Picker, a set of active model selection strategies to select the model with the best generalization capability for a target task. Specifically, we ask: “Given k pre-trained classifiers and a stream or pool of unlabeled data examples, how can we actively decide whose label to query so that we can distinguish the best model from the rest while making a small number of queries?” Towards that, we introduce two active model selection strategies under the umbrella of Model Picker for pool- and stream-based settings. We also establish their theoretical guarantees, and extensively demonstrate its effectiveness in our experimental studies. We show that the competing methods often require up to 2.5x more labels to reach the same performance with Model Picker.


12月17日 20:00-21:00

主题:Ease.ML/Snoopy: Automatic Feasibility Study for Machine Learning

Speaker: Luka Rimanic

Biography:Luka Rimanic earned his PhD in 2018 in the mathematics department at University of Bristol, after completing Part III at the University of Cambridge. Back in the days, he was exploring the field of additive combinatorics, whilst using a number of other fields in conjunction. Upon completing his PhD degree, Luka had a spell in industry as a consultant for machine learning tasks before joining the DS3Lab, Systems Group at ETH, in October 2019. He currently works on several projects concerning differential privacy, transfer learning and the usability of machine learning systems, with particular focus on the theory behind such systems. In recent years his work has been published in various top-tier conferences.

Abstract:A common problem that domain experts who are using today’s AutoML systems encounter is what we call unrealistic expectations - when users are facing a very challenging task with a noisy data acquisition process, while being expected to achieve startlingly high accuracy with machine learning. Many of these are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In practice, if these were done by human machine learning consultants, they would first analyze the representative dataset for the defined task and assess the feasibility of the target accuracy - if the target is not achievable, one can then explore alternative options by refining the dataset, the acquisition process, or investigating different task definitions. In this talk, we ask: Can we automate this feasibility study process?

We will present a system whose goal is to perform an automatic feasibility study before building machine learning applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error. Over the years, we have been conducting a series of work which provide important insights into many decisions taken when designing our system. These consist of building a novel framework for evaluating Bayes error estimators, theory behind applying feature transformation to improve the performance of certain estimators, and system optimisations that involve a multi-armed bandit approach, together with an improved version of the successive-halving algorithm under certain assumptions. We will describe each of these ingredients, culminating in end-to-end experiments that show how users are able to save substantial time and monetary efforts.


12月20日 20:00-21:00

主题:Bagua & Persia: Distributed ML Systems (Bagua & Persia)

Speaker: Shaoduo Gan and Binhang Yuan

Biography:

  • Shaoduo Gan has attained his PhD degree from ETH Zurich in 2021. His research is focused on distributed learning systems and the framework of machine learning. His work has been published at VLDB, SIGMOD, IEEE TPDS, ICML and NeurIPS.
  • Binhang Yuan is a Postdoc researcher in the Computer Science Department,  ETH Zürich. He obtained his Ph.D in Computer Science from the Computer Science Department,  Rice University. His research interests include database for machine learning, distributed machine learning, and recommendation systems.

Abstract:The increasing scalability and performance of distributed machine learning systems has been one of the main driving forces behind the rapid advancement of machine learning techniques. In this talk, we will introduce two recently released open source distributed learning toolkits (Bagua and Persia) developed under the collaboration between DS3Lab and Kwai Seattle AI Lab.

Bagua is a general purpose distributed data-parallel training system, which is designed to bridge the gap between the current landscapes of learning systems and optimization theory. Recent years have witnessed plenty of algorithmic design advances for data parallelism in order to lower the communication via system relaxations: quantization, decentralization, and communication delay. However, existing systems only rely on standard synchronous and asynchronous stochastic gradient (SG) based optimization, therefore, cannot take advantage of all possible optimizations that the machine learning community has been developing recently. Therefore, we build BAGUA, a MPI-style communication library, providing a collection of primitives that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training. Powered by this design, BAGUA has a great ability to implement and extend various state-of-the-art distributed learning algorithms. In a production cluster with up to 128 GPUs, BAGUA can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training time by a significant margin (up to 2 times) across a diverse range of tasks.

Persia is a distributed training system specialized for extremely large scale recommender systems. Deep learning based models have dominated the current landscape of production recommender systems. Furthermore, recent years have witnessed an exponential growth of the model scale--from Google's 2016 model with 1 billion parameters to the latest Facebook's model with 12 trillion parameters. Significant quality boost has come with each jump of the model capacity, which makes us believe the era of 100 trillion parameters is around the corner. However, the training of such models is challenging even within industrial scale data centers. This difficulty is inherited from the staggering heterogeneity of the training computation--the model's embedding layer could include more than 99.99% of the total model size, which is extremely memory-intensive; while the rest of the neural network is increasingly computation-intensive. We resolve this challenge by careful co-design of both the optimization algorithm and the distributed system architecture. Specifically, in order to ensure both the training efficiency and the training accuracy, we design a novel hybrid training algorithm, where the embedding layer and the dense neural network are handled by different synchronization mechanisms. Both theoretical demonstration and empirical study up to 100 trillion parameters have been conducted to justify the system design and implementation of Persia.


12月21日 20:00-21:00

主题:Data Ecosystem Integration for Machine Learning: Serverless and Databases

Speaker: Jiawei Jiang and Lijie Xu

Biography:

  • Jiawei Jiang is now a postdoctoral researcher in the Department of Computer Science of ETH Zürich. He obtained his Ph.D in Computer Science from Peking University in 2018. His research interests include scalable machine learning, distributed optimization, and automatic machine learning.
  • Lijie Xu is now a postdoctoral researcher in the Department of Computer Science of ETH Zürich. He obtained his PhD degree from the Institute of Software, Chinese Academy of Sciences. His current research interests include in-database machine learning, big data systems, and distributed systems.

Abstract:How to integrate machine learning to today’s data ecosystems effectively and efficiently is a challenging problem. Today’s data ecosystems have different data storage and computation paradigms, which may degrade the ML performance and scalability if implemented on them. In this talk, we focus on integrating ML to two types of data ecosystems:

(1) ML on serverless: The appeal of serverless (FaaS) has triggered a growing interest on how to use it in data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda) but with inconclusive results in terms of their performance and relative advantage over serverful infrastructures (IaaS). In this talk, we will present a systematic, comparative study of distributed ML training over FaaS and IaaS. We present a design space covering design choices such as optimisation algorithms and synchronization protocols, and implement a platform, LambdaML, that enables a fair comparison between FaaS and IaaS. We present experimental results using LambdaML, and further develop an analytic model to capture cost/performance tradeoffs that must be considered when opting for a serverless infrastructure. Our results indicate that ML training pays off in serverless only for models with efficient (i.e., reduced) communication and that quickly converge. In general, FaaS can be much faster but it is never significantly cheaper than IaaS.

(2) In-DB ML (DB4AI): One key problem for in-DB ML is about how to implement an effective and efficient Stochastic Gradient Descent (SGD) paradigm in DB. SGD requires random data access that is inherently inefficient when implemented in database systems that rely on block-addressable secondary storage such as HDD and SSD. In this talk, we will present a new SGD paradigm for in-DB ML with a novel data shuffling approach. Compared with existing approaches, our approach avoids a full data shuffle while maintaining a comparable convergence rate of SGD as if a full shuffle was performed. We have integrated it into PostgreSQL and openGauss, by introducing three new physical operators. The experimental results show that our approach can achieve comparable convergence rate with the full shuffle based SGD, and outperform the state-of-the-art in-DB ML tools.


12月22日 20:00-21:00

主题:ML and Modern Hardware Hardware

Speaker: Zeke Wang and Wenqi Jiang

Biography:

  • Dr. Zeke Wang is currently a research Professor, affiliated to the Artificial Intelligence Collaborative Innovation Center,  School of Computer Science, Zhejiang University. In 2011, he received his Ph.D. from Zhejiang University's School of Instrumental Studies, and worked as an assistant researcher at Zhejiang University's School of Instrumental Studies from 2012 to 2013. From 2013 to 2017, he was a postdoctoral fellow at Nanyang Technological University and National University of Singapore. From 2017 to December 2019, he was a postdoctoral fellow in the Systems Group of ETH Zurich. Served as a program committee member of multiple international conferences (such as KDD) and reviewers of multiple international journals (such as TPDS, TC, TCAD).
  • Wenqi Jiang is a PhD student at ETH Zurich advised by Prof. Gustavo Alonso. His research is centered around improving and replacing state-of-the-art software-based systems by emerging heterogeneous hardware. To be more specific, he is exploring how reconfigurable hardware (FPGAs) can be applied and deployed in data centers efficiently. Such research ranges from application-specific designs such as recommendation systems and information retrieval systems to infrastructure development such as distributed frameworks and virtualization on the cloud.

Abstract:Various hardwares have been adopted for (distributed) machine learning, in this talk, we will first talk about some recent attempts to combine SmartNIC and FPGAs to efficiently support machine learning workflows; to be more specific, in the second half of the talk we will talk about some advances about leveraging FPGAs to accelerate the inference procedure in recommendation systems.

As Moore’s Law is about to end, the computing power of traditional CPUs cannot continue to increase rapidly, and the current network speed continues to increase rapidly, so the current CPU computing power cannot meet the requirements of the network for data processing capabilities, so we need to offload the network protocol stack  to the SmartNIC to reduce the computing pressure of the server CPU; on the other hand, we can also reasonably offload some computing tasks in the Machine Learning training system to the SmartNIC to improve the overall performance of the training system. At present, ASIC-based SmartNICs can only support specific and limited application offloading; and ARM processor-based SmartNICs have very limited computing power and cannot support the offloading of complex functions. For this reason, we think that the SmartNIC based on high-end FPGA can flexibly support the Machine Learning training system. Our SmartNIC has enough computing power to allow for communication-related task offloading, such as compression/decompression.

Deep neural networks are widely used in personalized recommendation systems. Unlike regular DNN inference workloads which are typically bound by computation, recommendation inference is largely bound by memory due to the many random accesses needed to lookup the embedding tables. To this end, we first present MicroRec (MLSys'21), a high-performance FPGA inference engine for recommendation systems that tackles the memory bottleneck by both computer architecture and data structure solutions. Once the memory bottleneck is removed, the DNN computation becomes the main bottleneck again due to the limited power of computation delivered by FPGAs. Thus, we further design and implement a high-performance and heterogeneous recommendation inference cluster named FleetRec (KDD'21) that takes advantage of the strengths of both FPGAs and GPUs. Experiments on three production models up to 114 GB show that FleetRec outperforms optimized CPU baseline by more than one order of magnitude in terms of throughput while achieving significantly lower latency.

相关文章
|
14天前
|
人工智能 自然语言处理 安全
通过阿里云Milvus与PAI搭建高效的检索增强对话系统
阿里云向量检索Milvus版是一款全托管的云服务,兼容开源Milvus并支持无缝迁移。它提供大规模AI向量数据的相似性检索服务,具备易用性、可用性、安全性和低成本等优势,适用于多模态搜索、检索增强生成(RAG)、搜索推荐、内容风险识别等场景。用户可通过PAI平台部署RAG系统,创建和配置Milvus实例,并利用Attu工具进行可视化操作,快速开发和部署应用。使用前需确保Milvus实例和PAI在相同地域,并完成相关配置与开通服务。
|
7天前
|
机器学习/深度学习 数据采集 JSON
Pandas数据应用:机器学习预处理
本文介绍如何使用Pandas进行机器学习数据预处理,涵盖数据加载、缺失值处理、类型转换、标准化与归一化及分类变量编码等内容。常见问题包括文件路径错误、编码不正确、数据类型不符、缺失值处理不当等。通过代码案例详细解释每一步骤,并提供解决方案,确保数据质量,提升模型性能。
125 88
|
5天前
|
SQL 存储 人工智能
DMS+X构建Gen-AI时代的一站式Data+AI平台
本文整理自阿里云数据库团队Analytic DB、PostgreSQL产品及生态工具负责人周文超和龙城的分享,主要介绍Gen-AI时代的一站式Data+AI平台DMS+X。 本次分享的内容主要分为以下几个部分: 1.发布背景介绍 2.DMS重磅发布:OneMeta 3.DMS重磅发布:OneOps 4.DMS+X最佳实践,助力企业客户实现产业智能化升级
DMS+X构建Gen-AI时代的一站式Data+AI平台
|
12天前
|
机器学习/深度学习 数据采集 算法
机器学习在生物信息学中的创新应用:解锁生物数据的奥秘
机器学习在生物信息学中的创新应用:解锁生物数据的奥秘
113 36
|
19天前
|
机器学习/深度学习 人工智能
Diff-Instruct:指导任意生成模型训练的通用框架,无需额外训练数据即可提升生成质量
Diff-Instruct 是一种从预训练扩散模型中迁移知识的通用框架,通过最小化积分Kullback-Leibler散度,指导其他生成模型的训练,提升生成性能。
47 11
Diff-Instruct:指导任意生成模型训练的通用框架,无需额外训练数据即可提升生成质量
|
13天前
|
人工智能 Kubernetes Cloud Native
跨越鸿沟:PAI-DSW 支持动态数据挂载新体验
本文讲述了如何在 PAI-DSW 中集成和利用 Fluid 框架,以及通过动态挂载技术实现 OSS 等存储介质上数据集的快速接入和管理。通过案例演示,进一步展示了动态挂载功能的实际应用效果和优势。
|
17天前
|
人工智能 运维 API
PAI企业级能力升级:应用系统构建、高效资源管理、AI治理
PAI平台针对企业用户在AI应用中的复杂需求,提供了全面的企业级能力。涵盖权限管理、资源分配、任务调度与资产管理等模块,确保高效利用AI资源。通过API和SDK支持定制化开发,满足不同企业的特殊需求。典型案例中,某顶尖高校基于PAI构建了融合AI与HPC的科研计算平台,实现了作业、运营及运维三大中心的高效管理,成功服务于校内外多个场景。
|
1月前
|
机器学习/深度学习 人工智能 自然语言处理
模型训练数据-MinerU一款Pdf转Markdown软件
MinerU是由上海人工智能实验室OpenDataLab团队开发的开源智能数据提取工具,专长于复杂PDF文档的高效解析与提取。它能够将含有图片、公式、表格等多模态内容的PDF文档转化为Markdown格式,同时支持从网页和电子书中提取内容,显著提升了AI语料准备的效率。MinerU具备高精度的PDF模型解析工具链,能自动识别乱码,保留文档结构,并将公式转换为LaTeX格式,广泛适用于学术、财务、法律等领域。
230 4
|
1月前
|
机器学习/深度学习 存储 运维
分布式机器学习系统:设计原理、优化策略与实践经验
本文详细探讨了分布式机器学习系统的发展现状与挑战,重点分析了数据并行、模型并行等核心训练范式,以及参数服务器、优化器等关键组件的设计与实现。文章还深入讨论了混合精度训练、梯度累积、ZeRO优化器等高级特性,旨在提供一套全面的技术解决方案,以应对超大规模模型训练中的计算、存储及通信挑战。
84 4
|
2月前
|
机器学习/深度学习 算法 数据挖掘
K-means聚类算法是机器学习中常用的一种聚类方法,通过将数据集划分为K个簇来简化数据结构
K-means聚类算法是机器学习中常用的一种聚类方法,通过将数据集划分为K个簇来简化数据结构。本文介绍了K-means算法的基本原理,包括初始化、数据点分配与簇中心更新等步骤,以及如何在Python中实现该算法,最后讨论了其优缺点及应用场景。
160 4