Pipelet


Recently, a paper I worked on as an undergraduate researcher at Berkeley's NetSys Lab was published in HotNets '24. The paper introduces a new design pattern, Knactor, for microservices in Kubernetes that makes it easier to build and manage distributed applications.

I worked on this project for almost six months: designing and building the system, thinking through multiple optimizations, and running extensive performance benchmarks. The process taught me a lot about taking abstract ideas, turning them into research proofs of concept, and then using those prototypes to demonstrate their usefulness through data gathering.

In this post, I’ll dive into my contribution to the project: designing an experimental asynchronous data synchronization mechanism for microservices.

What is Knactor?

When building distributed applications, there are two common ways to connect services: RPC calls and Pub/Sub messaging.

Service Composition

Knactor is a service composition design pattern rather than a specific implementation. Different Knactors can synchronize different types of resources — such as data, API calls, or configuration state.

One implementation of this pattern is my project Pipelet, a Knactor that specializes in syncing data between services.

What is Pipelet?

Pipelet is an infrastructure service that synchronizes data asynchronously between source and destination databases. It is Kubernetes-native, written in Python and Golang, and designed for high-performance data synchronization in large-scale systems. Instead of requiring services to expose APIs for data exchange (as in RPC-composed systems), Pipelet treats data as a first-class concept, allowing services to interact with their local stores while Pipelet handles global consistency.

For example, imagine your service generates metrics data. Instead of pushing metrics directly to a central metrics database, the service writes to its own local store. Pipelet collects this data asynchronously, batches and processes it as needed, and then writes it efficiently to the central store.
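Here is a minimal sketch of the service side of that pattern in Go; the Metric type and localStore are hypothetical stand-ins for illustration, not Pipelet APIs:

package main

import (
    "fmt"
    "time"
)

// Metric is a hypothetical metrics record produced by the service.
type Metric struct {
    Name  string
    Value float64
    TS    time.Time
}

// localStore stands in for the service's own datastore (in Pipelet's case,
// a Zed pool the service owns). The service never talks to the central
// metrics database directly.
type localStore struct{ buf []Metric }

func (s *localStore) Append(m Metric) { s.buf = append(s.buf, m) }

func main() {
    store := &localStore{}
    // Cheap local write; Pipelet later collects, batches, and ships the
    // data to the central store asynchronously.
    store.Append(Metric{Name: "requests_total", Value: 1, TS: time.Now()})
    fmt.Printf("buffered %d metric(s) locally\n", len(store.buf))
}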

This approach has several benefits: services stay decoupled from the central store and only ever touch their local data, writes can be batched and transformed in transit, and the destination store receives fewer, more efficient writes.

Design

While designing Pipelet, I split it into three key parts: the Kubernetes operator that tracks Pipelet resources, the data synchronizer that moves and transforms the data, and the Zed lake storage layer that services read from and write to.

[Figure: Pipelet design]

I took this two-layer approach, one layer for orchestration and another for high-speed data transfers, since it let me use each language where it works best. For the operator, I used the Kopf framework in Python to keep track of changes in the Kubernetes state. Whenever a Pipelet CRD appeared or was updated, the operator either spun up a new data sync job in Go or updated an existing one.

The Go synchronizers run as a background process in the same Docker container as the operator, periodically reading data from source stores, applying any transformations, and then writing the results to destination stores. Go is a good fit here because it is fast and its goroutines provide efficient, lightweight concurrency.
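As a rough sketch of what one of those worker loops might look like (readNewData, applyFlow, and writeData are hypothetical stubs standing in for the actual Zed lake calls):

package worker

import (
    "context"
    "log"
    "time"
)

// Stubs standing in for the real Zed lake integration.
func readNewData(src string) ([]byte, error)             { return []byte("..."), nil }
func applyFlow(flow string, data []byte) ([]byte, error) { return data, nil }
func writeData(dst string, data []byte) error            { return nil }

// syncOnce performs one cycle: read newly ingested records from the source,
// apply the configured flow (ETL transformation), and write the result out.
func syncOnce(src, dst, flow string) error {
    data, err := readNewData(src)
    if err != nil {
        return err
    }
    out, err := applyFlow(flow, data)
    if err != nil {
        return err
    }
    return writeData(dst, out)
}

// Run synchronizes src -> dst every interval until ctx is cancelled.
func Run(ctx context.Context, src, dst, flow string, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := syncOnce(src, dst, flow); err != nil {
                log.Printf("sync %s -> %s failed: %v", src, dst, err)
            }
        }
    }
}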

Finally, for data storage, I used Zed lakes (from the Brimdata project). They made it easy to handle large datasets and run simple ETL transformations, which meant Pipelets could filter, aggregate, and reshape data before sending it off.
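For a sense of how a flow might actually be executed, here is a hedged Go sketch that shells out to the zed CLI. I'm assuming the zed query subcommand and that the CLI is already pointed at the right lake; the exact invocation may differ from what Pipelet really does:

package zedquery

import (
    "fmt"
    "os/exec"
)

// runFlow runs a Zed query against the lake by shelling out to the zed CLI.
// It assumes `zed` is on PATH and already configured with the lake location
// (for example via its environment).
func runFlow(pool, flow string) ([]byte, error) {
    // e.g. pool = "load@main", flow = "count() by random_number"
    query := fmt.Sprintf("from %s | %s", pool, flow)
    return exec.Command("zed", "query", query).Output()
}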

By splitting Pipelet into a Python frontend and a Go backend, I experimented with integrating different languages in a single system. There were benefits to doing so, but in hindsight I would use Go for both parts to reduce the system's complexity.

Pipelet Operator

Kubernetes operators provide a way to automate cluster tasks based on resource state changes. In Pipelet, the operator is responsible for detecting Pipelet-specific Custom Resource Definitions (CRDs) and taking the corresponding action, such as starting or modifying a synchronization job. I used the Kopf framework to trigger callbacks whenever Pipelet CRDs are added or updated (for example, when kubectl edit is used).

The operator runs as a pod within the Kubernetes cluster with privileged access to the K8s API, which allows it to read and manage the relevant Pipelet resources.

CRD Example

Using these CRDs, users can specify data sources and destinations, ETL transformations, synchronization intervals, and other key parameters. Below is an example CRD that sets up a single Pipelet:

apiVersion: knactor.io/v1
kind: Pipelet
metadata:
  name: "p1"
spec:
  control:
    intent:
      src: ["load@main"]
      dst: ["c1@main"]
      flow: "ts | sort -r ts | sum(random_number) by random_number"
      interval: 1.0
      eoio: true
      action: "create"
    status:
      assigned: ""
      active: false

This definition instructs Pipelet to pull data from the load source in the main branch, apply the specified transformation (flow), and push the processed data to c1—also in the main branch—every second. Setting eoio (exactly-once-in-order) ensures that data is synchronized without duplication or gaps.

Once this CRD is created or updated, the operator reads it via the Kubernetes API and writes a corresponding update to the sync table in Zed, which kicks off the data synchronization tasks in the Golang binary.

Data Synchronizer

The data synchronizer is a Golang application that runs as a background process in the same K8s pod as the operator. It consists of a manager loop that receives synchronization jobs and spins up worker goroutines. The workers periodically query the source Zed databases, collect newly ingested data, apply the configured ETL transformations, and write the transformed data to one or more destinations.

If the manager receives a specification with a delete action, it sends a stop signal to the worker and cleanly halts the synchronization process.
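A simplified sketch of that manager logic, assuming jobs arrive on a channel and using context cancellation as the stop signal (the Job fields and runWorker body are pared down for illustration):

package manager

import (
    "context"
    "log"
    "time"
)

// Job is a simplified version of a Pipelet synchronization job.
type Job struct {
    Name     string
    Action   string // "create" or "delete"
    Interval time.Duration
}

// runWorker stands in for the real worker loop described above.
func runWorker(ctx context.Context, j Job) {
    ticker := time.NewTicker(j.Interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            log.Printf("worker %s stopped", j.Name)
            return
        case <-ticker.C:
            // one read/transform/write cycle would go here
        }
    }
}

// Manage receives job specs and starts or stops worker goroutines.
func Manage(jobs <-chan Job) {
    cancels := map[string]context.CancelFunc{}
    for j := range jobs {
        switch j.Action {
        case "create":
            ctx, cancel := context.WithCancel(context.Background())
            cancels[j.Name] = cancel
            go runWorker(ctx, j)
        case "delete":
            if cancel, ok := cancels[j.Name]; ok {
                cancel() // signals the worker to halt cleanly
                delete(cancels, j.Name)
            }
        }
    }
}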

[Figure: Pipelet design]

The synchronizer receives jobs as objects similar to the Pipelet CRD, which are managed in a dedicated pipelet datastore. Pipelet job objects contain all the information needed to move and transform data between sources and destinations, and have the following definition:

name: str
 
sources: str[]
dest: str[]
in_flow: str
out_flow: str
 
interval: int
eoio: bool
 
lake: str
action: str

Each field serves a specific purpose, which the annotated sketch below spells out.
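The per-field comments are my reading of each field's role (with concrete examples taken from the CRD shown earlier), rendered here as a Go struct:

package pipelet

// PipeletJob mirrors the job object stored in the pipelet datastore.
type PipeletJob struct {
    Name string // unique job name (matches the Pipelet CRD's metadata.name)

    Sources []string // source pools, e.g. "load@main"
    Dest    []string // destination pools, e.g. "c1@main"
    InFlow  string   // Zed transformation applied when reading from the sources
    OutFlow string   // Zed transformation applied before writing to the destinations

    Interval int  // synchronization period, in seconds
    EOIO     bool // exactly-once-in-order delivery

    Lake   string // address of the Zed lake the job operates against
    Action string // lifecycle action, e.g. "create" or "delete"
}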

Zed Lake Data Storage

At our lab, Zed lakes have been used in a few different projects since they offer a clean interface for storing and querying large datasets. Their bash-like command syntax made it easy to run ETL operations directly on the data without needing extra tools. Building on that experience, Pipelet integrates Zed lakes as its primary storage layer, leveraging the same query language to perform data transformations.

Although my implementation uses Zed lakes as the storage layer, the overall system design is flexible enough to support other databases. The synchronization logic can be adapted to work with other storage backends such as Postgres or MongoDB while reusing the rest of the Pipelet infrastructure.
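One way to express that flexibility, sketched here with hypothetical method names, is to hide the backend behind a small interface that a Zed, Postgres, or MongoDB implementation could each satisfy:

package storage

import "context"

// Record is whatever unit of data a backend stores; kept opaque here.
type Record []byte

// Store abstracts the storage backend used by the synchronizer.
// A ZedStore, PostgresStore, or MongoStore would each implement it.
type Store interface {
    // ReadSince returns records ingested after the given cursor, plus a new cursor.
    ReadSince(ctx context.Context, pool, cursor string) ([]Record, string, error)
    // Write appends records to the given pool (or table/collection).
    Write(ctx context.Context, pool string, recs []Record) error
}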

System Optimizations

To demonstrate the flexibility of the Pipelet design, the project also implements multiple system optimizations.

Pipelet Reconciliation

Maintaining a global view of all active Pipelets allows the operator to consolidate redundant data transfers. If multiple Pipelets read data from the same source but send it to different destinations, the Pipelet operator merges them into a single sync job. This approach cuts down on bandwidth by performing one data download instead of multiple separate downloads, then sending the data to each destination.

Concretely, in the example below, the synchronizer only needs to read data from the source once instead of three times after Pipelet consolidation.

[Figure: Pipelet consolidation example]

Such a reconciliation process is difficult in a pure RPC model, where no central authority orchestrates the overall data flow. By contrast, Pipelet’s operator can see the entire cluster configuration at once, enabling it to unify transfers and drastically reduce overhead.
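A simplified sketch of the consolidation step, assuming jobs are merged whenever they share the same source and flow (the field and function names are mine, not Pipelet's):

package reconcile

// Job is a simplified Pipelet sync job for illustration.
type Job struct {
    Source string // single source pool, e.g. "load@main"
    Flow   string // Zed transformation
    Dests  []string
}

// Consolidate merges jobs that read the same source with the same flow,
// so the data is downloaded once and fanned out to every destination.
func Consolidate(jobs []Job) []Job {
    type key struct{ source, flow string }
    merged := map[key]*Job{}
    var order []key
    for _, j := range jobs {
        k := key{j.Source, j.Flow}
        if m, ok := merged[k]; ok {
            m.Dests = append(m.Dests, j.Dests...)
            continue
        }
        copyJob := j
        merged[k] = &copyJob
        order = append(order, k)
    }
    out := make([]Job, 0, len(order))
    for _, k := range order {
        out = append(out, *merged[k])
    }
    return out
}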

As a side-note, this optimization is inspired by Software Defined Networking (opens in a new tab) where a centralized, software-based control plane manages how network devices are configured and directs traffic through the network. Because the control plane has a comprehensive view of the network, it can implement optimizations that wouldn’t be possible with traditional, device-centric management. Similarly, Pipelet's operator can make more sophisticated decisions about synchronizing and routing information through the Kubernetes cluster.

Decoupling of Control and Data Planes for Data Transfers

By decoupling system management from data transfers, Pipelet creates a clear division of responsibilities between the operator and the synchronizer. This abstraction allows for sophisticated data transfer scheduling and optimizations while using the most suitable language for high-throughput transfers.

This layered approach introduces additional complexity compared to traditional RPC-based models. However, the potential benefits of orchestrating data placement, optimizing transfers at the cluster level, and consolidating multiple jobs into fewer sync processes wouldn’t be possible without these abstractions. While the extent of such optimization gains wasn’t fully explored in my project, the framework is in place for advanced optimizations in the future.

Cross-cluster synchronization

Pipelet specifications support any reachable data lake addresses, enabling seamless data synchronization across multiple Kubernetes clusters without requiring custom cluster routing. This flexibility allows services in different clusters to exchange data as if they were within the same environment.
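To illustrate, nothing in a job spec ties the source and destination to the same cluster; the two endpoints below are made-up addresses pointing at lake services in different clusters:

package crosscluster

// LakeEndpoint pairs a pool reference with the lake service it lives in.
type LakeEndpoint struct {
    Lake string // e.g. "http://zed-lake.cluster-a.example:9867" (hypothetical address)
    Pool string // e.g. "load@main"
}

// A cross-cluster sync simply uses endpoints from two different clusters.
var example = struct {
    Src, Dst LakeEndpoint
}{
    Src: LakeEndpoint{Lake: "http://zed-lake.cluster-a.example:9867", Pool: "load@main"},
    Dst: LakeEndpoint{Lake: "http://zed-lake.cluster-b.example:9867", Pool: "c1@main"},
}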

While these are just a few examples of how Pipelet optimizes data movement, having a global view of all active synchronizations opens the door for more advanced optimizations in the future.

Benchmarking Data Transfer Performance

To assess the impact of the Pipelet design and the Golang rewrite, I compared its performance against the original Python-only approach. The experiment involved a single source Zed data store and 100 client pods running in the cluster. Each client attempted to read large amounts of randomly generated string data from the load store and performed a simple ETL transformation. The graph below aggregates the latency for reading the data from the source and writing it to the destination.
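Each client's measurement boils down to timing a read-transform-write round trip; here is a rough sketch of that loop (readTransformWrite is a placeholder for the actual store calls):

package bench

import (
    "sort"
    "time"
)

// readTransformWrite stands in for one client iteration: read from the
// load store, apply the ETL transformation, write to the destination.
func readTransformWrite() error { return nil }

// MeasureLatencies runs n iterations and returns the sorted latencies,
// from which percentiles (p40, p80, ...) can be read off directly.
func MeasureLatencies(n int) []time.Duration {
    lat := make([]time.Duration, 0, n)
    for i := 0; i < n; i++ {
        start := time.Now()
        if err := readTransformWrite(); err != nil {
            continue // failed iterations are dropped in this sketch
        }
        lat = append(lat, time.Since(start))
    }
    sort.Slice(lat, func(i, j int) bool { return lat[i] < lat[j] })
    return lat
}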

The benchmark compares three different setups: the original Python-only, API-centric approach without Pipelet; Pipelet with the Go-based synchronizer; and Pipelet with sync consolidation enabled.

[Figure: latency comparison, Pipelet (Go) vs. no Pipelet]

The results show the impact of adopting a two-layered system. The biggest performance gain comes from moving data synchronization from Python to Golang. Even at the 40th percentile, the original Python-based system is about 60ms slower than the Pipelet implementation, with an even more significant drop-off beyond the 80th percentile.

Interestingly, even the relatively small optimization of sync consolidation leads to a measurable performance improvement. This suggests that further optimizations — especially in how Pipelet manages heterogeneous workloads — could yield even greater efficiency gains.

An important observation from the benchmarks is that the API-centric approach has a long latency tail. Running 100 client pods on a single-node Minikube cluster led to heavy contention, as each pod queried the storage backend directly. With more time and resources, I’d explore cloud-based setups to better understand performance in a production-like environment.

Impact of Golang on the benchmarking results

While working on the project, I didn't collect performance data for a version of the original design using Golang-based data transfers. As a result, it's unclear how much of the performance gain comes from the Pipelet architecture itself versus the switch to Golang. However, what matters is that this design enables such optimizations in the first place.

If a system is tightly coupled to a specific language and cannot be rewritten in Golang, achieving efficient data transfers could be a major challenge. With the Pipelet approach, the application logic remains independent of the synchronization mechanism. Regardless of the language the services are written in, Pipelet can always leverage the most optimal data transfer implementation.

Conclusions

This project represents the first large-scale "serious" system I designed and implemented mostly from scratch, built in preparation for our paper’s publication at HotNets. While experimental in nature, Pipelet introduces interesting ideas for abstracting data movement in distributed applications.

One of the key takeaways is how separating system management from data synchronization enables high-level global optimizations. By dividing responsibilities between a Python-based operator and a high-performance Go synchronizer, Pipelet can consolidate redundant syncs, optimize data placement, and use the most efficient language for data transfer—regardless of the application’s original implementation. This level of abstraction is particularly useful when application logic cannot be rewritten but still requires robust, high-performance data synchronization.

Beyond the technical aspects, this project taught me a great deal about translating an abstract system idea into a tangible, working prototype: breaking down the design, implementing the system, debugging obscure frameworks, benchmarking its performance, and refining it into a paper-ready proof of concept. This process mirrors what I now see in real-world software engineering: navigating complex technical ecosystems and iterating toward a functional, scalable solution.

Acknowledgements

I want to thank Silvery Fu for working closely with me on this project, providing a lot of system design feedback, and writing the project research paper.

