本文是两篇 serverless 论文笔记合集:第 1 部分整理自 Serverless Computing 笔记(原文 2020-09-09),第 2 部分整理自 SoCC 2020 paper notes(原文 2020-10-28)。


第 1 部分 · SOSP’19 / Berkeley View 系列

(原文写于 2020-09-09)

1.1 Serverless Computing: One Step Forward, Two Steps Back

Three Design Patterns

  1. Embarrassingly parallel functions
    In some applications, each function invocation is an independent task and never needs to communicate with other functions. such “map” functions, which can directly exploit Lambda’s auto-scaling features to scale.

  2. Orchestration functions
    A second class of use cases leverages serverless functions simply to orchestrate calls to proprietary autoscaling services, such as large-scale analytics. For example, using Lambda functions to preprocess event
    streams before funneling them to Athena via S3.

  3. Function Composition
    The third category consists of collections of functions that are composed to build applications and thus need to pass along outputs and inputs.

Serverless Is Too Less

  1. Limited Lifetimes
    After 15 minutes, function invocations are shut down by the Lambda infrastructure.
    There is no way to ensure that subsequent invocations are run on the same VM.

  2. I/O Bottlenecks
    Recent studies show that a single Lambda function can achieve on average 538Mbps network bandwidth.
    Worse, AWS appears to attempt to pack Lambda functions from the same user together on a single VM, so the limited bandwidth is shared by multiple functions.

  3. Communication Through Slow Storage
    Two Lambda functions can only communicate through an autoscaling intermediary service.
    Hence maintaining state across client calls requires writing the state out to slow storage, and reading it back on every subsequent call.

A number of discussion follow directly.

FaaS is a Data-Shipping Architecture.

FaaS hinders Distributed Computing(slow and expensive storage).

Case Studies: Model Training

(1). Lambda (465 minutes):

Each Lambda is allocated the maximum lifetime (15 min) and 640MB RAM and runs as many training iterations as possible.

Each iteration in Lambda took 3.08 seconds: 2.49 to fetch a 100MB batch from S3 and 0.59 seconds to run the AdamOptimizer.

We trained the model over 10 full passes of the training data, which translates to 31 sequential lambda executions, each of which runs for 15 minutes, or 465 minutes total latency. This costs $0.29.

(2). Tensorflow (22 minutes):

For comparison, we trained the same model on an m4.large EC2 instance, which has 8GB of RAM and 2vCPUs. In this setting, each iteration is significantly faster (0.14 seconds): 0.04 seconds to fetch data from an EBS volume and 0.1 seconds to run the optimizer.

The same training process takes about 1300 seconds (just under 22 minutes), which translates to a cost of $0.04.

Lambda’s limited resources and data-shipping architecture mean that running this algorithm on Lambda is 21× slower and 7.3× more expensive than running on EC2.

Case Studies: Prediction Serving

Only on CPU, and we limited all experiments here to 10-message batches.

If the model was retrieved on every invocation, the average latency over 1,000 batch invocations for the Lambda application was 559ms per batch with S3 and was 447ms with SQS queue.

An EC2 machine (not function) to receive SQS message batches—this showed a latency of 13ms per batch averaged over 1,000 batches—27× faster than our “optimized” Lambda implementation.

An EC2 machine (not function) with ZeroMQ had a per batch latency of 2.8ms—127× faster than the optimized Lambda implementation.

SQS request rate alone would cost $1,584 per hour.

Case Studies: Distributed Computing

try alternative solutions to achieve distributed computation.

we measure the cost of 1KB argument for a Lambda function invocation — this incurs both I/O and function overheads (303ms).

The cost of explicit I/O from Lambda to S3 is 108ms and to DynamoDB is 11ms.

The cost of explicit I/O from EC2 to S3 is 106ms and to DynamoDB is 11ms.

The cost of explicit I/O from EC2 to 0MQ is 290us.


1.2 Cirrus: a Serverless Framework for End-to-end ML Workflows

This work proposes Cirrus, a distributed ML training framework that addresses these challenges by leveraging serverless computing.

End-to-end ML Workflow Challenges

Over-provisioning

The heterogeneity of the different tasks in an ML workflow leads to a significant resource imbalance during the execution of a training workflow.

Explicit resource management

The established approach of exposing low-level VM resources, such as storage and CPUs, puts a significant burden on ML developers who are faced with the challenge of provisioning, configuring, and managing these resources for each of their ML workloads.

Serverless Computing limitations

Small local memory and storage

AWS lambdas can only access at most 3GB of local RAM and 512MB of local disk

Low bandwidth and lack of P2P communication

We find that the largest AWS Lambda can only sustain 60MB/s of bandwidth. AWS Lambdas do not allow peer-to-peer communication. Thus, common communication strategies used for datacenter ML, such as tree-structured or ring-structured AllReduce communication [43], become impossible to implement efficiently in such environments.

Short-lived and unpredictable launch times

AWS lambdas can take up to several minutes to start after being launched. This means that during training, lambdas start at unpredictable times and can finish in the middle of training. This requires ML runtimes for lambdas to tolerate the frequent departure and arrival of workers.

Lack of fast shared storage

shared storage needs to be low-latency, high-throughput, and optimized for the type of communications in ML workloads.

Cirrus Design

(1) Worker runtime

The worker runtime provides two APIs.

Data iterator API: prefetches and buffers minibatches in the lambda’s local memory in parallel with the worker’s computations to mitigate the high latency (>10ms) of accessing S3.

Data store client API: data compression, sparse transfers of data, asynchronous communication and sharding across multiple nodes.

(2) Distributed data store

Intermediate stored data(in cloud VMs) to be shared by all workers.

It achieves latencies as low as 300μs versus ≈ 10ms for AWS S3.

Cirrus Implementation

(1) Client backend

Lambdas that are launched during training are relaunched automatically when their lifetime terminates (every 15 minutes).

The backend keeps a pool of threads that can be used for responding to requests for new lambda tasks.

(2) Distributed data store

Cirrus’s distributed data store provides an interface supporting a key-value store interface (set/get) and a parameter-server interface (send gradient / get model).

several optimizations:

  • multithreaded server that distributes work across many cores
  • data compression for the gradient and models
  • sparse gradient and model data structures

(3) Worker runtime

For data access, the runtime provides a minibatch-based iterator backed by a local memory ring-buffer that allows workers to access training minibatches with low latency.

In addition, it provides an efficient API to communicate with the distributed data store.

Evaluation

Tensorflow was executed on a 32-core node (performed better than on 1 Titan V GPU) and Cirrus ran in 10 lambdas.

Problems

  1. evaluation not fair.
  2. Distributed data store how to attend 300μs not clear.
  3. only focus on throughput not on latency(only example for training).

1.3 Narrowing the Gap Between Serverless and its State with Storage Functions (Shredder)

Shredder Design

Internally, Shredder consists of three layers: a networking layer, a storage layer, and a function layer.

Each CPU core runs all three layers, but CPU cores follow a shared-nothing design; the state of these layers is partitioned across CPU cores to avoid contention and synchronization overheads.

The storage layer hosts all tenants’ data in memory and has a get()/put() key-value interface.

The function layer matches incoming requests to their storage function code and context, and it executes the operation within a per-core instance of the V8 runtime.

Each V8 runtime has a set of embedded trusted access methods to avoid expensive calls between the function runtime and the storage layer.

Isolation and Context Management

Shredder relies on V8’s Contexts to isolate tenant-provided code.

Zero-copy Data Access

Local get() and put() operations are optimized to avoid copying records into and out of storage whenever possible.

Eliminating Boundary Crossings with CSA

CSA is portable across hardware architectures, and it can be translated into highly efficient machine code.

Key idea

  1. Separating storage from functions.
  2. V8 runtime per-core to ensure isolation.
  3. Zero-copy Data Access
  4. Trusted CSA code within the V8 runtime is given read-only access to the data storage.

Evaluation

Cost of Isolation; CPU Scalability; Tenant Scalability;

Problems

  1. the Zero-copy Data Access only supports local memory.
  2. for remote get, it only discusses the throughput, not the latency.
  3. V8 JavaScript runtime is heavy.
  4. binding compute with memory limits the scalability.

1.4 Centralized Core-granular Scheduling for Serverless Functions

Existing serverless platforms lack deployment and scheduling mechanisms.

Scheduling granularity

server-level scheduling:

Unpredictable performance due to sharing of physical and virtual resources among function invocations.

distributed task-scheduling:

Scalability challenge stems from latency of scheduling and task migration.

centralized and core-granularity scheduling:

core-granularity improves performance predictability and the centralized design maintains a global view of cluster resources.

A centralized core-granular scheduler

The centralized scheduler runs on a multicore server. We distinguish between cores on the scheduler server, called scheduler cores, and cores on worker servers, called worker cores.

Each scheduler core maintains a list of references to idle worker cores.

i -> iv: when a request comes, it chooses a worker core to execute. After finished, return it.
v -> vii: if no available worker cores, borrow from others.

Problems

  1. no evaluation
  2. not specific to ML

1.5 Discussion notes (设计草稿)

Platform vs user constraints

Platform:

  • virtualization technologies
  • context switch
  • distributed approach (scalability)
  • state locally

User:

  • 640M RAM and 15min limit
  • state from the slow external store
  • event granularity (sub-graph)

Common:

  • warmed up run-time
  • work assignment (event-driven trigger)

Burst

  1. 预测机制:提前 copy weights 到 function 附近位置(存储资源比计算资源便宜)
  2. 拆分机制:把 graph 拆分成多个 stage(function 不宜过大,利用 stage 加载时间去 cover 中间结果传输时间)
  3. 基于 ONNX 来拆分和执行 stage

将 graph 分成 stage:

  • 如果传输中间结果的时间,小于,加载 stage 的时间,可以认为,传输时间不产生费用(和 latency)
  • 选取数据量小的中间结果进行传输
  • 拆分的 stage,满足现有的 lambda 的内存要求
  • 模型未开始阶段,先预测需要多少的 function,提前把各个 stage 的 weight 传输过去

stateful:

  • 预加载到内存
  • 确保传输的中间结果比较小
  • 加入模型压缩,模型蒸馏

trade-off between latency, throughput and cost per request.

Important point:

  • tail-latency

讨论点:

  • cpu 是否 batch 越大越好
  • 预测所需资源(cpu, memory, therefore time)
  • 提出一种度量来描述 burst
  • 全 serverless(包含 frontend)
  • 先考虑技术(拆分 model + 调度执行),再考虑应用场景(如 request burst)

分两步自动化:

  1. 用 state machine 统一管理好已经创建好的 functions
  2. 自动化创建 functions
  3. 预测 workload 并自动拆分 onnx model

第 2 部分 · SoCC’20 系列

(原文写于 2020-10-28)

2.1 WuKong: In Search of a Fast and Efficient Serverless DAG Engine

A serverless-oriented, decentralized, data-locality aware DAG engine.

  • minimize the network communication overhead while maximizing data locality whenever possible.
  • decentralized scheduling where a DAG is partitioned into sub-graphs that are distributed to separate task executors.
  • provides efficient storage mechanisms for managing intermediate data.

A Journey from the serverful to the serverless

  • A Strawman Scheduler
    • Performance bottlenecks: the large number of concurrent TCP connection requests is easy to overwhelm the scheduler.
  • Publish/Subscribe Model
    • Sending task completion messages through pub/sub channels was more efficient than using a large number of concurrent TCP connections.
    • struggles to launch Lambda functions quickly enough for large, bursty workloads due to the large cost of invoking a Lambda function.
  • +Parallel Invokers

WuKong design

  • Static Scheduling
    • A static schedule contains three types of operations: task execution, fan-in, and fan-out.
    • Two leaf nodes implies two static schedules, blue and red.
    • T1, T2, T3, T5.
  • Dynamic Scheduling
    • Scheduling conflict zone between static schedules.
    • dependency counter.
    • T4, T6.
  • Storage Management
    • Task Executors publish their intermediate and final task output objects to the KV Store.
    • Final outputs are relayed to a Subscriber process in the Scheduler for presentation to the Client.
    • Small Fan-out Task Invocations are handled by task itself.
    • Large Fan-out Task Invocations are handled by KVS.

Notes

  1. doesn’t consider the cold-start issue.
    • As mentioned earlier, serverless computing suffers from cold starts. We address this issue by warming up a pool of Lambdas.
  2. only works when computation latency outweighs invocation latency between the two lambdas.
  3. methods:
    • Decentralization of Task Executors
    • Parallel-Invoker
    • KV Store Proxy

2.2 Sequoia: Enabling Quality-of-Service in Serverless Computing

Sequoia is designed as a drop-in framework that improves overall management by enabling QoS in a lightweight and low overhead manner.

Limitations with the current serverless platform

  • Inconsistent and incorrect concurrency limits
    • Default concurrency limits are documented to be 1,000, but up to 1,200 concurrent functions are run in parallel.
  • Mid-chain drops
    • A burst of 1,000 Fan-2’s will ultimately result in 2,000 concurrent functions, only 48-54% chains successfully completing.
  • Burst intolerance
    • Significant losses occur when burst due to the cold start issue.
  • HTTP prioritization
    • HTTP has a higher priority than background invocations.
  • Inefficient resource allocation
    • VM/container pool leads to inefficient resource allocation.
  • Concurrency collapse
    • Concurrency reaches the limit but then drops and does not immediately recover.

Sequoia design

  • QoS Scheduler
    • enqueue DAGs to Pending Queue (PQ), enqueue each node of DAGs to Chain Running Queue (CRQ).
    • Worker Threads in the thread pool are the main driver. When a thread schedules a function it blocks until its function completes and afterwards makes itself available to the thread pool.
  • Logging Framework
    • state information from live metric streams.
  • Policy Framework
    • an entry point to add, remove, or alter policies in the system.

Notes

  1. We can use a thread pool to manage the concurrency more precisely.

2.3 Kappa: A Programming Framework for Serverless Computing

Kappa is a framework that simplifies serverless development by providing a familiar programming model.

It uses checkpointing to handle lambda function timeouts, and provides concurrency mechanisms that enable parallel computation and coordination.

Kappa design

  • Checkpointing
    • To run long tasks on time-bounded lambdas, Kappa checkpoints program state periodically and restores from this checkpoint upon lambda function timeout.
  • Concurrency API
    • To program parallel lambdas, Kappa provides a concurrency API which is modeled after Python’s built-in multiprocessing package and should be familiar to programmers.
  • Fault tolerance
    • Using checkpoints, Kappa ensures that execution never diverges due to nondeterminism.

Checkpoint latency

For 1M size checkpoint, the latency is 80ms for S3 and 6ms for Redis.

Notes

  1. Checkpoint is not necessary for most serverless applications. Parallel composition is more common.
  2. Due to the low cost of serverless computing, failure-retry is more efficient than checkpoint.
  3. Redis is faster than S3.

2.4 Photons: Lambdas on a diet

Photons is a framework leveraging workload parallelism to co-locate multiple instances of the same function within the same runtime.

Concurrent invocations can then share the runtime and application state transparently, without compromising execution safety.

Photons design

  • Vertical and horizontal scaling
    • Vertical scaling: scheduled on the same machine
    • Horizontal scaling: scheduled on another machine
    • drawback: invocations infect each other and resource contention.
    • a good scale policy is hard for users.

Evaluation

  • Memory
  • Cold starts
  • performance / cost tradeoff

Notes

  • This approach sacrifices the auto-scale ability.

2.5 ServerlessBench: Characterizing Serverless Platforms

ServerlessBench is an open-source benchmark suite for characterizing serverless platforms.

  • Four critical metrics:
    • Communication performance.
    • Startup latency.
    • Stateless execution.
    • Resource efficiency and performance isolation.
  • 12 test cases.

Test cases & notes

TestCase 1: Varied resource needs.

These results suggest that splitting the application and provisioning the appropriate amount of resources to the composing functions can save costs with a small additional communication overhead.

TestCase 2: Parallel composition.

Notice that in-function parallelization does not perform as well as inter-function parallelization. This is because serverless platforms restrict the computing resources (e.g., vCPUs) that can be allocated to a function instance.

TestCase 8: Function size comparison.

Functions with larger code sizes suffer from longer startup latency.

A serverless application should only import the minimal needed packages and pack only necessary dependencies.

TestCase 11/12: Memory bandwidth contention / CPU contention.

There exists resource contention between serverless applications running on the same machine.


2.6 Conclusion / 整体观察

  • Decentralization of task executors.
  • Using a thread pool to manage the concurrency more precisely.
  • Performance-cost tradeoff.
    • AWS Lambda provides vCPUs in proportion to the provisioned memory size.
    • Less CPU means less memory, so splitting the model into slices can reduce the memory.
    • Same execution time but less cost.
  • Probability model — model split optimization based on latency probability.

Training

  1. Low cost: split model into slices, less memory but same execution time.
  2. Stateless: each lambda holds a small model slice and only has a single task, either fwd or bwd.
    • no memory swap in and out.
  3. Support distributed training.