This paper is included in the Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC '22), July 11–13, 2022, Carlsbad, CA, USA. ISBN 978-1-939133-29-8.
https://www.usenix.org/conference/atc22/presentation/wu
Zero-Change Object Transmission for Distributed Big Data Analytics

Mingyu Wu (1,2), Shuaiwei Wang (1), Haibo Chen (1,3), and Binyu Zang (1,3)

(1) Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University
(2) Shanghai AI Laboratory
(3) Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China
Abstract
Distributed big-data analytics heavily rely on high-level lan-
guages like Java and Scala for their reliability and versatility.
However, those high-level languages also create obstacles for
data exchange. To transfer data across managed runtimes like
Java Virtual Machines (JVMs), objects should be transformed
into byte arrays by the sender (serialization) and transformed
back into objects by the receiver (deserialization). The object
serialization and deserialization (OSD) phase introduces con-
siderable performance overhead. Prior efforts mainly focus on
optimizing some phases in OSD, so object transformation is
still inevitable. Furthermore, they require extra programming
efforts to integrate with existing applications, and their trans-
formation also leads to duplicated object transmission. This
work proposes Zero-Change Object Transmission (ZCOT),
where objects are directly copied among JVMs without any
transformations. ZCOT can be used in existing applications
with minimal effort, and its object-based transmission can
be used for deduplication. The evaluation on state-of-the-art
data analytics frameworks indicates that ZCOT can greatly
boost the performance of data exchange and thus improve the
application performance by up to 23.6%.
1 Introduction
High-level languages like Java and Scala are welcomed in
areas like big-data analytics thanks to their reliable and ver-
satile managed runtime environment. However, the abstrac-
tion provided by the managed runtime also introduces per-
formance overhead, especially for data exchange. Since man-
aged runtimes like Java Virtual Machines (JVMs) store data
in an opaque object-based format, they have to transform
objects into interpretable binary streams before exchanging.
The transformation contains two phases: a serialization phase
transforming objects into a byte array, and a deserialization
phase transforming the byte array back into objects. The
object serialization/deserialization (OSD) mechanism intro-
duces considerable transformation overhead and has become a
significant performance bottleneck in distributed object trans-
mission, especially for applications demanding large-scale
data exchange through network [3,6, 13, 15, 44].
Prior work has recognized the performance problem in
OSD and proposed different approaches, both in software [26,
38, 39] and hardware [16, 32, 40, 46], to mitigate its effect.
However, those approaches mainly focus on optimizing spe-
cific phases in OSD, and the data transformation is still in-
evitable. Furthermore, although they can boost the perfor-
mance of OSD, many of them require extra programming
efforts to annotate serialization points or change the original
inter-JVM communication model. Last but not least, they treat
the transferred data as a monolithic byte array instead of indi-
vidual objects, which makes it difficult to identify duplicated
transmission and misses optimization opportunities.
Instead of optimizing OSD, this work aims at directly elimi-
nating the whole OSD process. To this end, this work proposes
Zero-Change Object Transmission (ZCOT), which provides
an ideal data exchange mechanism where objects are trans-
ferred among JVMs through direct object copying. When a
JVM receives objects from others, it can directly process them
without any modifications (Zero-Change). ZCOT removes the
demands for object transformation and thus improves the
performance of data exchange.
However, it is non-trivial to achieve zero-change communi-
cation given each JVM manages objects in a process-specific
and opaque format. To this end, ZCOT first introduces a glob-
ally shared abstraction named exchange space, a part of the
Java heap space accessible for multiple JVMs in a distributed
environment. ZCOT further adopts its distributed class-data
sharing (DCDS) mechanism, which provides a unified object
format to make objects in the exchange space interpretable
for all JVMs. To remain compatible with traditional OSD-
based applications, ZCOT proposes a two-level transmission
mechanism to bridge the gap between object-based copying
and traditional byte-based transmission.
As ZCOT introduces a globally shared exchange space, it is responsible for managing objects shared among multiple JVMs. By introducing a metadata server, ZCOT memorizes the stored locations of objects and helps build data transmis-
sion channels between JVMs. Since objects in big-data analyt-
ics are usually exchanged as a whole dataset, ZCOT embraces
group-based object management, which organizes objects in
groups and greatly reduces the traffic between the metadata
server and JVMs. Furthermore, ZCOT also integrates with
the garbage collections (GC) triggered in individual JVMs
and reduces the GC pause time.
ZCOT sends objects instead of byte arrays during transmis-
sion, which makes it object-conscious and easier to identify
duplicated objects. This work thus proposes a data dedupli-
cation mechanism to further optimize the data transmission.
The deduplication module in ZCOT leverages the exchange
space abstraction to memorize which objects have been sent
and avoids unnecessary object transmission in the future. Nev-
ertheless, deduplication may introduce references (or depen-
dencies) among different datasets. To this end, ZCOT extends
its distributed memory management module to consider inter-
group dependencies.
This work implements ZCOT in the HotSpot JVM of Open-
JDK 11, the long-term-support version of OpenJDK. ZCOT
is well-integrated with existing features in OpenJDK (like
AppCDS [30]) to remain friendly to Java developers. We
evaluate ZCOT against state-of-the-art OSD libraries and
optimizations with both the micro-benchmark and macro-
benchmark. The micro-benchmark contains both basic and
complicated data structures for data transmission, while the
macro-benchmark contains two big-data analytics frame-
works (Spark and Flink). The result for micro-benchmark
shows that ZCOT outperforms other OSD libraries, espe-
cially for complicated data structures, and reaches up to 4.35× speedup compared with Naos [39], a state-of-the-art optimization on OSD. As for the macro-benchmark, ZCOT outperforms the default OSD libraries in both Spark and Flink and thus improves the application performance by up to 23.6% and 22.2%, respectively.
To summarize, the contribution of ZCOT includes:
• A distributed shared abstraction named exchange space to enable zero-change object transmission among JVMs while remaining compatible with traditional OSD-based applications.
• A memory management mechanism on the globally shared space integrated with GC in individual JVMs.
• A data deduplication module to identify and eliminate unnecessary object transmission for further performance improvement.
• Experiments on communication-intensive workloads to show the performance improvement of ZCOT over existing OSD libraries.
Figure 1: The workflow of OSD
2 OSD Background
2.1 OSD
Language runtimes provide a high-level abstraction for
platform-independent code execution. As for user objects,
runtimes store them with an opaque format, which maintains
object data together with corresponding metadata (type infor-
mation, synchronization, memory management, etc.). Taking
Java as an example, JVMs maintain a header for each object
to store its metadata.
However, when data exchange among JVMs is required,
objects must go beyond the runtime scope. For example, ob-
jects might be persisted into disks and reused by other JVMs
later; they can also be sent and received through network.
To support those scenarios, objects have to be interpretable
even when leaving JVMs. Therefore, JVMs embrace the ob-
ject serialization/deserialization (OSD) mechanism, which
transforms Java objects into a generalized data format (seri-
alization) and transforms back when reusing in JVMs (de-
serialization). The Java system library (JSL) already provides
a built-in OSD library for applications. Figure 1 shows the
workflow of JSL's OSD. As for the serialization part, objects
are transformed into a byte array that follows a data format
agreed among JVMs. The byte array will be written into disks
or sent through network. When another JVM receives the byte
array, it transforms the byte array back into objects through
deserialization.
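For illustration, the following self-contained sketch exercises this workflow with JSL's built-in stream classes; the UrlRank type and the sample values are made up for the example.

import java.io.*;
import java.util.List;

public class JslOsdExample {
    // Made-up application type standing in for real analytics data.
    static class UrlRank implements Serializable {
        private static final long serialVersionUID = 1L;
        final String url;
        final double rank;
        UrlRank(String url, double rank) { this.url = url; this.rank = rank; }
    }

    public static void main(String[] args) throws Exception {
        List<UrlRank> dataset = List.of(new UrlRank("a", 1.0), new UrlRank("b", 0.4));

        // Serialization (sender side): objects -> byte array.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(dataset);
        }
        byte[] bytes = buffer.toByteArray();   // written to disk or sent through the network

        // Deserialization (receiver side): byte array -> objects.
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            @SuppressWarnings("unchecked")
            List<UrlRank> received = (List<UrlRank>) in.readObject();
            System.out.println(received.size() + " objects, " + bytes.length + " serialized bytes");
        }
    }
}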
The OSD mechanism has two major advantages. First, the
library provides a general-purpose data format so that Java ob-
jects can be transformed among JVMs with different versions
and configurations. Second, the serialized data is compressed
and induces smaller footprints in both disks and network.
2.2 Limitations and opportunities
The major disadvantage for OSD is its performance penalty.
The performance problem of OSD in big-data analytics is
three-fold.
Transformation overhead.
OSD introduces extra phases
for object persistence and transmission. To serialize an object,
OSD should traverse all its reachable objects and store their
type information. As for deserialization, OSD should scan
serialized data and reconstruct objects.
Memory footprint.
OSD generates a considerable number
of temporary objects during data transformation. As shown
Figure 2: Duplicated data transmission in the page-rank application
in Figure 1, byte arrays are generated during serialization and
become useless after sending out. Those temporary objects
can increase the memory pressure and cause more frequent
GCs.
Duplicated transmission.
Big-data analytics leverage
OSDs in many rounds of communication and duplicated ob-
jects may be repetitively transformed and exchanged in each
round. Figure 2 shows a concrete example in Spark [44],
which calculates the popularity for each URL (simplified as
letters) with the page-rank algorithm [31]. Since the algorithm
executes for multiple iterations, the data transmission is also
conducted in many rounds. In the first round, the URL-based
network topology is sent through network, which consists of
many string pairs to indicate the point-to relationships among
URLs. In the later rounds, the rank value for each URL is
iteratively propagated, which is organized as key-value pairs.
Note that the strings in the key-value pairs are URLs that have
been sent in the first round. Unfortunately, since all objects
have been transformed and merged into byte arrays, JVMs
cannot tell that some objects have been received before. They
have to receive all objects as a monolithic byte array, which
leads to unnecessary network transmission and OSD phases.
In the Spark page-rank application, over 60% of transferred
objects are duplicated.
Furthermore, the advantages of OSD also fade with ad-
vances in hardware technologies. For example, the band-
width of off-the-shelf network devices can reach 100Gb/s
or larger, which makes network transmission time less im-
portant, so OSD may become a more significant bottleneck.
On the other hand, the general data format is not always re-
quired. Therefore, many optimizations have been proposed
to reduce the performance overhead of OSD, both in soft-
ware [26,38,39] and hardware [16,32,40,46]. Since hardware-
based approaches require building customized hardware ac-
celerators to improve OSD, this work mainly focuses on
software-based approaches with off-the-shelf hardware.
2.3 State-of-the-art optimizations
The basic idea behind OSD is to achieve an agreement on ob-
ject representation among JVMs. Therefore, optimizations on
OSD should consider how to create the agreement so that ob-
jects can cross JVMs’ boundaries. Besides, they also need to
consider issues like compatibility with existing applications.
Kryo.
Kryo [38] is a fast OSD library for Java. Compared
to JSL's OSD, Kryo refines the binary data format to achieve
a smaller serialized data size and better performance. Applica-
tions like Spark have leveraged Kryo as their default serializer.
Nevertheless, Kryo does not eliminate any phases in OSD;
objects still need to be transformed back and forth.
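For reference, a minimal Kryo usage sketch (the Point class and the sample values are made up; Kryo, Output, and Input are Kryo's public API):

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class KryoExample {
    // Made-up data type; Kryo's default field serializer needs a no-arg constructor.
    public static class Point { public double x, y; }

    public static void main(String[] args) {
        Kryo kryo = new Kryo();
        kryo.register(Point.class);      // registration yields compact integer type IDs

        Point p = new Point();
        p.x = 1.0; p.y = 2.0;

        // Serialization: the object is still transformed into a (smaller) byte stream.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Output out = new Output(buf)) {
            kryo.writeObject(out, p);
        }

        // Deserialization: the receiver must reconstruct the object from bytes.
        try (Input in = new Input(new ByteArrayInputStream(buf.toByteArray()))) {
            Point q = kryo.readObject(in, Point.class);
            System.out.println(q.x + ", " + q.y);
        }
    }
}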
Skyway.
Skyway [26] proposes to directly send object
graphs instead of serialized bytes. With Skyway, the seri-
alization phase is nearly removed as objects are no longer
transformed to a binary format. Although Skyway has simpli-
fied phases in OSD, modifications on objects are still required.
First, it needs to transform the type information in the head-
ers to a globally-agreed ID so that it can be identified by all
JVMs. Second, it needs to fix references after copying, as
objects have been moved to different addresses. Moreover,
Skyway also requires programmers to mark the point where
the serialization phase starts manually.
Naos.
Naos [39] is a network-specific data transmission
mechanism. Similar to Skyway, Naos also employs a global
service to reach agreements for types, but it relies on RDMA
technology to achieve rapid zero-copy object transmission.
However, Naos still requires modifications on both object
headers and references. Besides, it only supports network-
based transmission, and existing applications need significant
modifications to leverage Naos.
2.4 Summary
Prior optimizations have proposed different solutions to re-
duce the overhead of OSD. However, they cannot eliminate
the whole OSD process. Table 1 compares the built-in OSD in
JSL with other optimizations. Although recent work like Naos
eliminates the serialization phase, a deserialization phase is
still required to fix the type information and the references.
Besides, none of them has considered the duplicated data
transmission problem.
This work proposes ZCOT (short for Zero-Change Object
Transmission), which aims to eliminate the whole OSD pro-
cess during data exchange. In ZCOT, object transmission is
conducted in the most straightforward way: the sender JVM
copies objects and the receiver can directly use them with-
out any modifications. ZCOT also considers the duplicated
transmission problem and provides a deduplication module.
Finally, ZCOT is not bound to specific network technologies
(like RDMA) and provides easy-to-integrate interfaces for
existing applications.
Data transmission mechanisms | Serialization  | Deserialization | Ease of integration | Data deduplication
JSL                          | Slow           | Slow            | Yes                 | No
Kryo                         | Medium         | Medium          | Yes                 | No
Skyway                       | Fast (removed) | Medium          | Medium              | No
Naos                         | Fast (removed) | Medium          | No                  | No
ZCOT (this work)             | Fast (removed) | Fast (removed)  | Yes                 | Yes
Table 1: Comparisons on existing OSD optimizations against our work ZCOT
3 Design of ZCOT
3.1 Overview
The core idea of ZCOT is to build a distributed-shared-
memory (DSM)-like abstraction for JVMs running on dif-
ferent machines. Figure 3 illustrates the architecture of ZCOT.
In a ZCOT-enabled system, the heap for each JVM consists of
two parts: its private space (the original Java heap) and a glob-
ally shared exchange space. Objects are originally managed
in the private space. When they need to be sent through the
network or persisted into disks, they will be copied to the ex-
change space. The exchange space is an abstraction available
for all JVMs; each JVM can directly access objects therein.
Therefore, object transmission can be achieved with direct
copying to the exchange space, and the whole OSD process
can be eliminated.
Figure 3: The architecture of ZCOT
The idea for building a DSM-like abstraction is well-known
and has been studied for decades [2, 7, 9, 14, 20, 21, 25, 33–
35, 41, 42, 45]. Although our exchange space shares similar
wisdom with DSM, it is only used for data exchange and does
not need to tackle complicated issues like coherence. It also
assumes objects in the exchange space are immutable, which
usually holds for big-data analytics like Spark and Flink. If
a write operation occurs on objects in the exchange space,
ZCOT creates a copy for it on the JVM’s private heap. Never-
theless, combining the DSM concept with data transmission
in high-level languages is still not trivial. To enable efficient
and easy-to-use object transmission, ZCOT has to resolve the
following challenges.
• How to build a shared exchange space so that all JVMs can access it freely? (Section 3.2)
• How to leverage the exchange space abstraction to support OSD-based applications? (Section 3.3)
• How to manage objects in the exchange space in the presence of garbage collections in individual JVMs? (Section 4)
• How to resolve the duplicated transmission problem? (Section 5)
3.2 Distributed class-data sharing
ZCOT relies on its distributed class-data sharing (DCDS)
mechanism to build a globally accessible shared space. DCDS
guarantees that class-related metadata will be mapped into
the same virtual memory address for all JVMs. This helps
JVMs to achieve an agreement on the class metadata, so no
type-related modifications (e.g., identifiers) are required.
Figure 4: The workflow of distributed class-data sharing
Figure 4 elaborates the workflow of DCDS. First, the clus-
ter manager should prepare a shared class archive for all JVMs.
The class archive should contain all classes whose correspond-
ing object instances would be shared during inter-JVM com-
munication. ZCOT relies on the tools provided by OpenJDK
to generate such class archives [30]. Afterward, the archive
will be used during JVM startup, and classes in the archive
will be mapped to a given virtual address. The virtual ad-
dress range is also memorized and marked as a part of the
exchange space (class space in Figure 4). This step assures
that JVMs share the same view on the classes. As shown in
Figure 4, an object in the exchange space stores a reference
to its class-related metadata. Since the reference points to the
class space, the object’s class information is interpretable for
all JVMs. Although DCDS requires that the data types of appli-
cations be known in advance, mainstream big-data
analytics frameworks usually guarantee this by sending a fat
jar file for execution.
Figure 5 shows how ZCOT transfers objects through net-
work with its DCDS support. First, the sender JVM applies for
an available memory chunk in the exchange space for object
copying. This is achieved by communicating with an external
Figure 5: The workflow of ZCOT. (a) Sender: deep copy to a given address. (b) Receiver: trigger a page fault and fetch data from the sender.
metadata server (details in Section 4). Second, the sender
JVM copies objects to the chunk’s memory address. This step
is similar to a deep copy in a normal Java application. To
detect cycles and avoid repeated copying on the same object,
we add a marker word in each object header to store its new
address if it has been copied.
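The copy step is essentially a graph copy with forwarding addresses. The sketch below illustrates the algorithm at the Java level with made-up names (Node, copyGraph); in the real JVM the forwarding address is kept in the marker word of the object header and copies are bump-allocated inside the exchange-space chunk.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.Map;

// Illustrative graph copy with forwarding addresses; an identity map plays the
// role of the per-object marker word used by ZCOT.
public class DeepCopySketch {
    static class Node {            // stand-in for an arbitrary Java object
        int payload;
        Node next;                 // stand-in for reference fields
    }

    static Node copyGraph(Node root, Map<Node, Node> forwarding) {
        if (root == null) return null;
        Node copiedRoot = copyOne(root, forwarding);
        Deque<Node> worklist = new ArrayDeque<>();
        worklist.push(root);
        while (!worklist.isEmpty()) {
            Node original = worklist.pop();
            Node copy = forwarding.get(original);
            if (original.next != null) {
                boolean firstVisit = !forwarding.containsKey(original.next);
                copy.next = copyOne(original.next, forwarding); // cycles reuse the forwarded copy
                if (firstVisit) worklist.push(original.next);
            }
        }
        return copiedRoot;
    }

    // Copy a single object unless it already has a forwarding entry.
    static Node copyOne(Node original, Map<Node, Node> forwarding) {
        return forwarding.computeIfAbsent(original, o -> {
            Node copy = new Node();      // in ZCOT: allocated at a fixed exchange-space address
            copy.payload = o.payload;
            return copy;
        });
    }

    public static void main(String[] args) {
        Node a = new Node(); Node b = new Node();
        a.payload = 1; b.payload = 2;
        a.next = b; b.next = a;          // a cycle, handled by the forwarding map
        Node copied = copyGraph(a, new IdentityHashMap<>());
        System.out.println(copied.next.next == copied);  // true: cycle preserved
    }
}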
The copied objects are kept on the sender machine and
lazily retrieved by receivers. When a receiver JVM tries to
access this part of data (Figure 5b), it encounters a page fault
since the data is unavailable on its machine. We have regis-
tered the page fault handler in ZCOT-enabled JVMs so that
they can request the metadata server for faulted pages. The
metadata server has tracked the ownership of memory ad-
dresses in the exchange space, so it forwards the request to
the data owner. Afterward, the sender builds a connection
with the receiver and puts the requested objects to the desired
address. Now the receiver can directly access those objects
for further processing, with neither metadata updating nor
reference fixing (namely zero-change).
3.3 Supporting OSD-based scenarios
Thanks to the exchange space abstraction, a JVM can directly
access received objects without any modifications. However,
this mechanism is not compatible with traditional OSD-based
applications, which usually adopt byte arrays for commu-
nication. To this end, ZCOT should provide user-friendly
interfaces to integrate easily with applications.
Programming interfaces.
JSL provides stream-based classes for OSD implementation. The ObjectOutputStream class provides the writeObject method to serialize an object into a stream (usually files or network). Similarly, the ObjectInputStream class provides the readObject method to deserialize data into objects. Therefore, prior OSD optimizations like Skyway implement new serializers/deserializers by inheriting those two classes for ease of integration. ZCOT adopts a similar strategy and Figure 6 shows its basic classes: ZCObjectOutputStream and ZCObjectInputStream, which are subclasses of ObjectOutputStream and ObjectInputStream, respectively. Compared with ObjectOutputStream, ZCObjectOutputStream slightly modifies the interface for writeObject to support different OSD-based scenarios (discussed later). To use ZCOT-based communications, applications only need to replace the original stream classes with ours. In contrast, prior work requires developers to modify the original communication model or annotate the serialization points [26, 39].
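A hypothetical usage sketch against the classes declared in Figure 6 (these are not a public library, so the snippet compiles only alongside them); apart from replacing the stream classes, sender and receiver code stays the same as with JSL's OSD, and the boolean mode argument is discussed below.

import java.io.IOException;
import java.net.Socket;
import java.util.List;

// Hypothetical usage of ZCOT's stream classes from Figure 6.  The second
// constructor argument selects the transmission mode (true for network,
// false for files), as described in the following paragraphs.
public class ZcotUsageSketch {
    static void send(Socket socket, List<String> dataset) throws IOException {
        // Previously: new ObjectOutputStream(socket.getOutputStream()).writeObject(dataset)
        try (ZCObjectOutputStream out =
                 new ZCObjectOutputStream(socket.getOutputStream(), true /* network mode */)) {
            out.writeObject(dataset);   // copies the object graph into the exchange space
        }
    }

    static Object receive(Socket socket) throws IOException, ClassNotFoundException {
        try (ZCObjectInputStream in = new ZCObjectInputStream(socket.getInputStream())) {
            return in.readObject();     // objects are fetched lazily on a page fault
        }
    }
}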
OSD-compatibility.
To remain compatible with OSD interfaces (writeObject and readObject), ZCOT should also transfer data with byte arrays. To this end, ZCOT adopts a two-level transmission mechanism. As illustrated in Figure 7, ZCOT transfers data via both frontend and backend. The frontend transmission is compatible with OSD interfaces, but it only sends metadata, including the object's start address and the data length. When ZCObjectInputStream receives the metadata through readObject, it directly accesses the corresponding address and fetches objects through backend transmission if a page fault is triggered (as mentioned in Figure 5b). ZCOT will launch dedicated VM threads in both sender and receiver JVMs to transfer the requested objects. This two-level design fills the gap between the byte-based OSD interfaces and the object-based transmission in ZCOT.
Supporting different OSD scenarios.
In OSD libraries, objects are serialized and written into a stream (e.g., the out variable defined on Line 3 in Figure 6) when invoking writeObject; the stream is usually redirected into files or the network. To support both scenarios, ZCOT adds a parameter volatile in the constructor of ZCObjectOutputStream (Line 6). When volatile is set to false, the copied objects will be written into a file, and the memory pages can be soon reclaimed
1 // Output class
2 class ZCObjectOutputStream extends ObjectOutputStream {
3 private OutputStream out; // Private output stream
4
5 // Constructor
6 public ZCObjectOutputStream(OutputStream out,
boolean volatile /* Mode */)
7 throws IOException {...}
8
9 // Compatible with the serialization interface
10 public void writeObject(Object obj)
11 throws IOException {...}
12 ...
13 }
14
15 // Input class
16 class ZCObjectInputStream extends ObjectInputStream {
17 private InputStream in; // Private input stream
18
19 // Constructor
20 public ZCObjectInputStream(InputStream in)
21 throws IOException {...}
22
23 // Compatible with the deserialization interface
24 public Object readObject()
25 throws IOException{...}
26 ...
27 }
Figure 6: Basic classes in ZCOT
through GC (details in Section 4). Nevertheless, those objects
still reserve a corresponding virtual address in the exchange
space. When the object data is read by other JVMs, the meta-
data server asks the sender to pass the file so the receiver
can map it to the corresponding memory address. The case
is simpler when volatile is true, which indicates a network-
based transmission. In this scenario, objects are only kept in
memory and can be reclaimed only if they have been read by
others.
Figure 7: The two-level data transmission mechanism in ZCOT
Assumptions.
Note that ZCOT is mainly designed to im-
prove the data exchange phase for big data analytics, so it
makes several assumptions about the transferred data. First, all
classes related to communication should be known in advance
so that they can be packed into the class archive. Second, the
transferred objects are read-mostly, otherwise copy-on-write
operations would be triggered for modification operations.
Lastly, objects are managed in large groups and share similar
life cycles, so they can be efficiently managed in the exchange
space. Since representative big-data analytics systems like
Spark conform to the above assumptions, ZCOT works well
for them.
4 Memory Management
Since the global exchange space is built atop a DSM-like ab-
straction, ZCOT should manage objects distributed to differ-
ent machines. Furthermore, the managed runtimes complicate
the scenario as they introduce their own memory manage-
ment strategy: garbage collections (GC). This section will
introduce how ZCOT manages the distributed exchange space
while remaining harmonized with GC in JVMs.
4.1 Group-based management
Unlike traditional DSM-based systems, ZCOT introduces
group, a semantic-aware notion for distributed memory man-
agement. As analyzed in Section 2, big-data analytics frame-
works treat serialized objects as a whole dataset (monolithic
byte array) and retrieve them together. Therefore, ZCOT puts
all objects copied in the same writeObject invocation to a
group so that they are managed together. When a receiver
encounters a page fault, ZCOT will send all related data pages
belonging to the same group to the receiver and avoid fu-
ture faults. This mechanism, namely group-based prefetching,
leverages the semantics in the OSD scenario to mitigate the
page-based management overhead in traditional DSM.
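A minimal sketch of the sender-side bookkeeping this implies; GroupTableSketch and its method names are invented for illustration and do not mirror the actual JVM code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative group bookkeeping: every writeObject call produces one group,
// and a fault on any chunk of a group prefetches the whole group.
public class GroupTableSketch {
    static class Group {
        final List<Integer> chunkIds = new ArrayList<>();   // chunks filled by one writeObject call
    }

    private final Map<Integer, Group> groupOfChunk = new HashMap<>();

    // Called when a writeObject invocation finishes copying its objects.
    void registerGroup(Group group) {
        for (int chunkId : group.chunkIds) {
            groupOfChunk.put(chunkId, group);
        }
    }

    // Called when asked for a faulted chunk: ship the whole group to avoid future faults.
    List<Integer> chunksToShip(int faultedChunkId) {
        Group group = groupOfChunk.get(faultedChunkId);
        return group != null ? group.chunkIds : List.of(faultedChunkId);
    }
}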
4.2 Metadata server
The metadata server is the core module for ZCOT’s memory
management. JVMs communicate with the metadata server
through remote procedure calls (RPCs) to acquire or release
memory resources in the exchange space. Figure 8 illustrates
the core data structures in the metadata server. The metadata
server is agnostic to groups; groups are only managed by
individual JVMs. It partitions the shared exchange space into
equal-sized memory chunks (256MB by default) for memory
allocation and deallocation. It also maintains an allocation
bitmap to mark if a chunk has been allocated. Each chunk is
assigned with an integer ID, which is calculated by its relative
offset compared with the exchange space’s start address. To
track the stored locations of chunks, the metadata server main-
tains a copy set for each chunk, which is stored in a chunk
mapping table. The copy set contains JVMs storing a copy of
the corresponding chunk (in memory or disk), which are also
represented with integer IDs. The mapping between a JVM’s
ID and its information (e.g., IP address) is stored in a separate
member table.
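The following sketch models these structures in a few lines of Java for illustration; the actual metadata server is a standalone service, and all names here (MetaServerState, addCopy, etc.) are made up, with the 256 MB chunk size taken from the text.

import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model of the metadata-server state described in Figure 8.
public class MetaServerState {
    static final long CHUNK_SIZE = 256L << 20;           // 256 MB chunks (default)
    final long exchangeSpaceBase;                         // start address of the exchange space
    final BitSet allocationBitmap = new BitSet();         // bit i set => chunk i allocated
    final Map<Integer, Set<Integer>> copySets = new HashMap<>();  // chunk ID -> JVM IDs holding a copy
    final Map<Integer, String> memberTable = new HashMap<>();     // JVM ID -> "ip:port"
    private int nextJvmId = 0;

    MetaServerState(long exchangeSpaceBase) { this.exchangeSpaceBase = exchangeSpaceBase; }

    // register(): record the JVM's endpoint and hand back its integer ID.
    synchronized int register(String ip, int port) {
        int id = nextJvmId++;
        memberTable.put(id, ip + ":" + port);
        return id;
    }

    // Chunk IDs are derived from the chunk's offset within the exchange space.
    int chunkIdOf(long address) {
        return (int) ((address - exchangeSpaceBase) / CHUNK_SIZE);
    }

    // Record that a JVM now stores a copy of the chunk covering this address.
    synchronized void addCopy(long address, int jvmId) {
        copySets.computeIfAbsent(chunkIdOf(address), k -> new HashSet<>()).add(jvmId);
    }
}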
Since each JVM needs to communicate with the metadata
server, its reliability becomes a considerable concern. To tolerate fail-
ures on the metadata server, we can introduce replicas for
it, and the overhead would be acceptable given the low fre-
quency of communications between the metadata server and
worker JVMs (several times in a data-processing stage lasting
for seconds).
Figure 8: Important data structures in the metadata server.
4.3 RPC interfaces
The metadata server provides four important RPC interfaces
listed below.
int register(std::string ip, int port);
Chunk* acquire();
Chunk* get_remote(Address addr);
int release(Chunk* chunk);
register. register is only invoked when a JVM is launched. ZCOT provides a JVM option -XX:+UseZCOT, and a JVM enabling this option automatically spawns an RPC thread and sends a register RPC to the metadata server with its IP address and listening port. After receiving the RPC, the metadata server saves the IP and port number to the member table, generates an integer as the JVM's ID, and returns the ID. For subsequent RPCs, JVMs should always attach the returned ID to help the metadata server maintain the stored locations of objects (omitted in the interfaces above).
acquire. When a JVM runs out of memory allocated from the exchange space, it should send acquire RPCs for more memory resources. After receiving an acquire request, the metadata server scans its bitmap to allocate an available chunk. Afterward, the metadata server memorizes the relationship between the allocated chunk and the JVM's ID and returns the chunk. To reduce the overhead of bitmap scanning, ZCOT memorizes the address of the last successfully allocated chunk and starts scanning there. If the scanned address reaches the end of the exchange space, ZCOT will continue scanning from the beginning. To handle simultaneous acquire requests, ZCOT introduces a bitmap lock to ensure the bitmap is exclusively accessed.
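A sketch of this allocation policy (a next-fit scan with wrap-around under a lock); it is illustrative only and not the server's actual code.

import java.util.BitSet;

// Illustrative next-fit chunk allocator backing the acquire() RPC.
public class ChunkAllocator {
    private final BitSet bitmap;       // bit i set => chunk i allocated
    private final int totalChunks;
    private int lastAllocated = 0;     // start the next scan where the last one succeeded

    ChunkAllocator(int totalChunks) {
        this.totalChunks = totalChunks;
        this.bitmap = new BitSet(totalChunks);
    }

    // Returns a free chunk ID, or -1 if the exchange space is exhausted.
    synchronized int acquire(int requestingJvmId) {   // the "bitmap lock" is this monitor
        for (int i = 0; i < totalChunks; i++) {
            int candidate = (lastAllocated + i) % totalChunks;   // wrap around at the end
            if (!bitmap.get(candidate)) {
                bitmap.set(candidate);
                lastAllocated = candidate;
                // the real server also records candidate -> {requestingJvmId} in its copy set
                return candidate;
            }
        }
        return -1;
    }

    // release(): mark the chunk free once no JVM stores it any more.
    synchronized void release(int chunkId) {
        bitmap.clear(chunkId);
    }
}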
get_remote. The get_remote interface is used by JVMs encountering a page fault when accessing a virtual address. Since a page fault indicates the requested objects are not stored locally, the JVM sends get_remote to fetch the corresponding chunk. After receiving get_remote, the metadata server gets the corresponding chunk containing the address and finds which JVMs store the chunk by scanning the chunk mapping table. As illustrated in Section 3.2, the metadata server forwards the request to the corresponding JVM for actual data transmission. Since the size of a chunk is relatively large, sending chunks may introduce considerable performance overhead. To reduce the transferred data size, the sender JVM only sends used pages in the chunk, which are represented as the length of data in the frontend transmission (Figure 7). Due to ZCOT's group-based prefetching mechanism, the sender may directly send multiple chunks in the same group to the receiver. In this case, the receiver is responsible for sending an auxiliary RPC to update the copy set in the metadata server.
release. The release interface is relatively simple. When a JVM finds that objects in a chunk are no longer used, it sends release to give up this chunk. After receiving release, the metadata server removes the JVM's ID from the corresponding copy set in the chunk mapping table. If no JVM stores this chunk, the metadata server will reclaim it by marking the corresponding bit as free in the bitmap.
4.4 Garbage collection
JVMs have already implemented their garbage collection
(GC) algorithms to automatically reclaim unused memory.
When GC is triggered, JVMs track all live objects and re-
claim memory consumed by dead ones. Since objects in the
exchange space are reachable from individual JVMs, they
will also be affected by GC. To this end, ZCOT has integrated
its memory management strategy with G1, the default GC al-
gorithm in OpenJDK, to ensure the correctness of distributed
memory management and reduce GC overhead.
G1 basics.
G1GC (short for Garbage-First Garbage Collection [8]) is the default garbage collector since OpenJDK 9 [29]. G1 divides the Java heap into equal-sized regions for ease of management. It also maintains per-region metadata named remembered sets to memorize all references pointing to objects in each region. The remembered set is updated by instrumenting all write operations in Java code (also known as write barriers). The G1 algorithm is mostly stop-the-world,
which means that application threads should be paused until
GC ends. During GC (for simplicity, we only discuss the young and mixed GC in G1 here), each selected region is processed simultaneously: a dedicated GC thread scans the remembered set of a region, finds all reachable objects, and copies them to an empty region (named a survivor region).
Integrated with G1.
ZCOT extends the region-based design of G1 to support the exchange space. It proposes ZCRegion, a new kind of region allocated from the metadata server.
Compared with regions in G1, the size of ZCRegion is not
fixed. Each ZCRegion corresponds to a group in the exchange
space, and all objects therein are expected to have the same
life cycle. Since objects in ZCRegions have different behav-
iors compared with those in other regions, G1 should treat
them specially. First, we modify the behavior of write barriers
to consider ZCRegions. When a reference points to objects in
a ZCRegion, we do not memorize this reference but only mark
the ZCRegion as used. This is because objects in a ZCRegion
are only collected when no references point to any of them.
Similarly, GC threads do not need to scan ZCRegions during
GC because all objects are treated as alive if there exists any
reference pointing to the region. When GC ends, the JVM will
scan all ZCRegions and find those containing no incoming ref-
erences. For those ZCRegions, the JVM invokes the release
RPC to reclaim the corresponding chunks. If objects in a group
are written into disks, the corresponding ZCRegion can also
be reclaimed by GC, but the JVM does not invoke release,
since the virtual address is still reserved by the group.
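The following pseudocode-style sketch summarizes the modified barrier and the post-GC sweep; all type and method names are invented, and the real logic lives in HotSpot's C++ barrier and G1 code.

// Illustrative handling of ZCRegions in the write barrier and after GC.
public class ZCRegionGcSketch {
    interface Region {
        boolean isZCRegion();
        void setReferenced(boolean referenced);   // coarse-grained "region is used" flag
        boolean isReferenced();
    }
    interface RememberedSet { void record(Object fromHolder); }
    interface MetaServerClient { void release(Region region); }

    // Conceptually invoked on every reference store "holder.field = target".
    static void writeBarrier(Object holder, Region targetRegion, RememberedSet rs) {
        if (targetRegion.isZCRegion()) {
            targetRegion.setReferenced(true);   // no per-reference remembered-set entry
        } else {
            rs.record(holder);                  // normal G1 remembered-set update
        }
    }

    // After GC: ZCRegions that ended up with no incoming references are handed back
    // to the metadata server (groups spilled to files keep their address range and
    // skip the release RPC).
    static void afterGc(Iterable<Region> zcRegions, MetaServerClient metaServer) {
        for (Region region : zcRegions) {
            if (!region.isReferenced()) {
                metaServer.release(region);     // maps to the release RPC
            }
        }
    }
}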
In summary, our design successfully integrates the memory
management of the exchange space with G1GC. When GC
ends, the memory resource in the exchange space is automati-
cally reclaimed by following the reachability-based algorithm.
Furthermore, by specially handling regions in the exchange
space, we avoid unnecessary metadata tracking and object
scanning. In some cases, this design can even reduce GC
pause time (as shown in Section 6.3).
5 Transmission Deduplication
Since ZCOT sends objects instead of byte arrays during trans-
mission, it would be much easier to track transmitted objects
and conduct deduplication. This section introduces the data
deduplication module in ZCOT based on its object-centric
transmission mechanism.
5.1 Overview
Figure 9 shows the effect of ZCOT’s data deduplication mod-
ule in the aforementioned page-rank example (compared
against Figure 2). When sending the URL-based network
topology in the first round, the sender has copied all URL
string pairs (together with two string objects) into their corre-
sponding addresses. In the next few rounds, the application
sends key-value pairs to update rank values for each URL.
Since all key-value pairs are sent as objects, it is much simpler
for ZCOT to find that all URL objects have been sent. There-
fore, the sender can directly update the references in those
key-value pairs with the addresses in the exchange space and
thus remove duplicated transmission on URL objects.
ZCOT runtime should be further extended to achieve data
deduplication. First, ZCOT should track copied objects to
rapidly find duplicated transmission. Second, ZCOT should
manage dependencies among object groups for safe memory
reclamation.
5.2 Duplication detection
A straw-man design for duplication detection would be scan-
ning all objects in the exchange space. However, this design
would induce considerable overhead given the large number
of objects. ZCOT instead follows a simple detection criterion:
if an object is in the exchange space, an attempt to copy it is
a duplicated one.
We still use page-rank as an example to explain ZCOT’s
duplication detection. Suppose a JVM receives the network
topology in round 1 (consider Figure 9); it reads URL objects
from the exchange space and uses them in the following
rounds. Therefore, when it propagates updated rank values to
other JVMs, it still uses the URL objects received from others.
When copying the URL-rank pairs in the next few rounds,
ZCOT checks each object’s address and thus avoids copying
those URL objects already in the exchange space.
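In effect, the copy routine adds one address-range test before copying an object, roughly as sketched below (AddressSpace and Copier are made-up stand-ins for the JVM-internal checks):

// Illustrative duplication check during the copy phase: an object whose address
// already falls inside the exchange space is referenced instead of re-copied.
public class DedupCheckSketch {
    static class AddressSpace {
        final long base, limit;               // [base, limit) covers the exchange space
        AddressSpace(long base, long limit) { this.base = base; this.limit = limit; }
        boolean contains(long address) { return address >= base && address < limit; }
    }

    // Returns the address to store into the outgoing reference: either the object's
    // existing exchange-space address (duplicate, nothing copied) or a fresh copy.
    static long copyOrReuse(long objectAddress, AddressSpace exchangeSpace, Copier copier) {
        if (exchangeSpace.contains(objectAddress)) {
            return objectAddress;             // already shared: just reference it
        }
        return copier.copyIntoExchangeSpace(objectAddress);
    }

    interface Copier { long copyIntoExchangeSpace(long objectAddress); }
}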
5.3 Dependency management
Although data deduplication in ZCOT can reduce the net-
work overhead by avoiding repeated copying on the same
object, it also complicates memory management by intro-
ducing inter-group references. As mentioned in Section 4.1,
each invocation to writeObject creates a new group for object management, and each group is separately used by calling readObject. After deduplicating objects from different
groups, objects in a group can hold references to those in
another group, which should be correctly handled especially
when a group is being garbage collected. To this end, ZCOT
has managed those references as dependencies among groups.
Due to the large number of inter-group references, ZCOT
does not maintain reference-level dependencies. When a
group holds a reference to any objects in other groups, ZCOT
marks the group as dependent on others. The dependency
tracking is still achieved by extending the write barriers. To
memorize all dependencies, ZCOT extends the chunk map-
ping table in the metadata server to contain a dependency
set for each chunk, which stores all other chunks it relies on.
When a JVM finds that its group relies on another group af-
ter deduplication, it sends a new RPC, add_dependency, to
the metadata server. Since the metadata server is not aware
of groups, the RPC should specify all chunk IDs owned by
the group it relies on. Those chunk IDs will be added to the
corresponding dependency set by the metadata server.
Figure 10 uses an example to illustrate how ZCOT lever-
ages dependencies during object copying. After encounter-
ing a page fault on chunk 4, the receiver JVM sends a get_remote RPC to the metadata server. By fetching the
dependency set in the chunk mapping table, the metadata
server finds all chunks that chunk 4 depends on. Afterward,
Figure 9: ZCOT avoids duplicated object transmission in page-rank
Figure 10: ZCOT avoids sending duplicated data with dependency tracking
ZCOT checks the copy set for each chunk to find if the re-
ceiver JVM already has a copy. If the receiver does not have
a copy, ZCOT adds the corresponding chunk ID in a message,
which will be forwarded to the sender JVM later for real data
transmission. In our example, since the receiver JVM already
holds copies of chunks 1 and 2, it only needs to receive chunk 4
(requested) and chunk 3 (dependent). This example indicates
that ZCOT can avoid duplicated data transmission with slight
modifications on the metadata server.
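A sketch of the server-side decision in this example; names are made up, and the actual metadata server implements this inside its RPC handler.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative server-side handling of get_remote with dependency sets (Figure 10).
public class GetRemoteSketch {
    // Chunks the receiver actually needs: the requested chunk plus any dependencies it lacks.
    static List<Integer> chunksToForward(int requestedChunk,
                                         int receiverJvmId,
                                         Map<Integer, Set<Integer>> dependencySets,
                                         Map<Integer, Set<Integer>> copySets) {
        List<Integer> toSend = new ArrayList<>();
        toSend.add(requestedChunk);
        for (int dep : dependencySets.getOrDefault(requestedChunk, Set.of())) {
            Set<Integer> holders = copySets.getOrDefault(dep, Set.of());
            if (!holders.contains(receiverJvmId)) {
                toSend.add(dep);              // receiver has no copy yet: include the dependency
            }
        }
        return toSend;                        // forwarded to a JVM in requestedChunk's copy set
    }
}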
5.4 Garbage collection
Adding dependencies also complicates GC for individual
JVMs. Since a group (represented as ZCRegions) can be ref-
erenced by others stored in remote JVMs, local GC cannot
determine if a group can be safely reclaimed. For example,
suppose JVM 0 stores a group (chunk 0) that contains a refer-
ence to another group (chunk 1) stored on JVM 1. Although
JVM 1 no longer contains references to chunk 1, the chunk
should not be collected because JVM 0 may access it through
references in chunk 0. To this end, we extend G1GC to con-
sider remote inter-group references.
In our refined GC algorithm, once a JVM detects a ZCRe-
gion has incoming references from other ZCRegions (through
write barriers), it marks the region as pinned so that it cannot
be reclaimed. It also sends the dependency relationship to
the metadata server through RPCs. When GC ends, the JVM
skips all pinned ZCRegions and only collects those with no
incoming references. A pinned ZCRegion can be reclaimed
when the metadata server finds that all chunks relying on
it have been released. In this case, the metadata server will
send a canRelease message to all JVMs in the correspond-
ing copy set, and those JVMs will mark the ZCRegion as
unpinned to safely reclaim it in later GC cycles.
5.5 Internalization
Big-data analytics usually generate a large number of objects
with simple types, such as Integer, String, Double, etc. Open-
JDK has provided an internalization mechanism to merge
those objects with the same content together. For example,
Integer objects whose values are between -128 and 127 would
be merged into one if their values are equal. ZCOT also em-
braces this mechanism for deduplication, but in its distributed
exchange space. It extends DCDS so that all JVMs allocate
a small region at the same virtual address during start-up to
contain globally-shared Integer objects. Thanks to this opti-
mization, the number of transferred Integers can be greatly
reduced.
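Conceptually this mirrors the JDK's per-JVM Integer cache, except that the table lives in a shared exchange-space region mapped at the same address everywhere. A per-process sketch of the idea (SharedIntegerTable is a made-up name):

// Illustrative internalization of small Integers.  In ZCOT the pre-populated
// objects live in a small exchange-space region shared by all JVMs; here the
// table is just a per-process array.
public class InternalizationSketch {
    static class SharedIntegerTable {
        private final Integer[] cache = new Integer[256];
        SharedIntegerTable() {
            for (int v = -128; v <= 127; v++) cache[v + 128] = v;   // pre-populated at start-up
        }
        // Return the shared object when possible, so it never needs re-transmission.
        Integer intern(int value) {
            if (value >= -128 && value <= 127) {
                return cache[value + 128];   // same shared object across transmissions
            }
            return value;                    // boxed fresh; not covered by the shared table
        }
    }
}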
6 Evaluation
6.1 Experimental setup
ZCOT is implemented atop the HotSpot JVM in OpenJDK
11.0.8-GA, with 8,327 lines of C code and 654 lines of Java
code. We leverage the following workloads to evaluate ZCOT.
Microbenchmark.
The microbenchmark contains four dif-
ferent data types used in prior work [26, 39]: 2-dimensional
points, key-value pairs, hashmaps, and media objects. To sim-
ulate big-data scenarios, we transfer them in large arrays
whose length is 65536. Since some baselines crashed for
large arrays of media objects, we reduced the length to 16384
for this data structure.
Spark.
Spark (v3.0.0) is a data analytics engine that re-
quires massive data transmission among JVMs.
Flink.
Apache Flink [6] (v1.14) is a distributed data pro-
cessing engine for both batch and streaming workloads.
As for baselines, we compare ZCOT with two commonly-
used OSD libraries (JSL and Kryo) and two state-of-the-art
OSD optimizations (Naos and Skyway, both implemented by Naos' authors).
Our test environment includes a cluster with four nodes
connected by 100 Gbit/s Mellanox ConnectX-5 NICs. Each
node contains dual Xeon E5-2650 CPUs and 128GB DRAM.
6.2 Microbenchmark
To directly compare ZCOT with state-of-the-art OSD opti-
mizations, we leverage the microperf tester in Naos' open-
source repository for evaluation. The tester involves a sender
and a receiver deployed on two separate machines and reports
the communication time with different types of data objects.
The heap size for all workloads is 16GB.
Figure 11 shows the results for ZCOT and other baselines,
which are averaged over 1,000 repeated executions.
ZCOT achieves the best performance of all except for 2-dimensional points. The average speedup is 2.28×, 1.94×, 2.19×, and 3.95× compared with Naos, Skyway, Kryo, and JSL, respectively. The result also suggests that ZCOT performs better for complicated data structures. The media class from the Java serialization benchmark set (JSBS) [37] is the most complicated one, so the improvement is the largest, especially against Naos (4.35×). This is because the computation over-
head increases when the data structure becomes more com-
plex. For simple data structures like points, ZCOT’s reduction
on data transformation is offset by larger network overhead,
so it performs slightly worse than Naos and Skyway.
Figure 11: The evaluation results for microbenchmark
6.3 Spark
Ease of integration.
To adopt ZCOT in Spark, we need to implement a new data serializer, ZCSerializer, to replace the default KryoSerializer. Although the name seems to
Figure 12: The performance of Spark applications
involve OSD phases, it is only for compatibility considerations and still remains zero-change during transmission. ZCSerializer contains 70 lines of code, and most of them are inherited from the JSL serializer. Furthermore, we replace the original stream classes from JSL with ours. If a Spark user wants to enable ZCOT, she only needs to (1) configure the spark.serializer to ZCSerializer and (2) add -XX:+UseZCOT to the launch option of all JVMs, which is quite simple.
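If the serializer is set programmatically instead of in spark-defaults.conf, the change would look roughly as follows (the ZCSerializer class path is hypothetical; spark.serializer and the extraJavaOptions keys are standard Spark configuration):

import org.apache.spark.SparkConf;

public class ZcotSparkConfSketch {
    public static void main(String[] args) {
        // (1) point spark.serializer at the ZCOT serializer (class path is hypothetical)
        SparkConf conf = new SparkConf()
                .setAppName("pagerank-with-zcot")
                .set("spark.serializer", "org.apache.spark.serializer.ZCSerializer")
                // (2) enable ZCOT in every executor and driver JVM
                .set("spark.executor.extraJavaOptions", "-XX:+UseZCOT")
                .set("spark.driver.extraJavaOptions", "-XX:+UseZCOT");
        // ... create the SparkSession / JavaSparkContext with this conf as usual
    }
}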
Evaluation results.
We leverage five applications in the
example directory of Spark for evaluation. Their descriptions
and evaluated datasets are shown in Table 2. We configure
one node as the metadata server and Spark master while the
other three servers as Spark workers. The Java heap size for
each node is set to 80GB.
Figure 12 shows the results for all applications. The results
indicate that ZCOT can improve the performance by 13.9%
and 24.1% on average compared with Kryo and JSL, respec-
tively. Although Kryo has optimized the OSD performance
over JSL, our evaluation shows that the data transmission can
be further improved.
Application Dataset
PageRank (PR) LiveJournal [4]
Word Count (WC) LiveJournal
KMeans (KM) USCensus1990 [10]
Transitive Closure (TC) Blogs [1, 17]
Logistic Regression (LR) SUSY [5]
Table 2: Evaluated applications and datasets for Spark
We have further broken the results into four different
phases: write (serialization), read (deserialization), compu-
tation, and garbage collection (GC). Since the four phases
are not overlapped in Spark (the GC phase only contains
stop-the-world time), the accumulated time is equal to the
overall execution time. Figure 12 indicates that the perfor-
mance mainly comes from the improvement in OSD-related
parts. Since OSD occupies a considerable portion in page-
rank execution, ZCOT can reach its best improvement (23.6%
and 38.1% w.r.t. Kryo and JSL). Averaged across all applica-
tions, ZCOT can reach 4.19× speedup in the write part and 2.95× in the read part over the default Kryo serializer (4.52× and 3.81× speedup for the write and read parts in JSL). As for
GC, ZCOT shows comparable pause time with others. In PR,
LR, and TC, the GC time is even shorter than JSL and Kryo.
Although ZCOT needs to manage the copied groups (ZCRe-
gions), its coarse-grained collection strategy avoids scanning
objects inside ZCRegions. Moreover, ZCOT avoids generat-
ing monolithic byte arrays by eliminating the serialization
phase, which can mitigate the memory pressure and introduce
less frequent GC.
Note that the computation time in ZCOT is somewhat larger
than that in JSL and Kryo. This can be explained by two rea-
sons. First, since ZCOT does not compress the object contents
during transmission to achieve zero-change, the transferred
data size is larger than JSL and Kryo, which leads to larger
network overhead (included in the computation part). Second,
the data deduplication module makes objects in the same
dataset scattered into different virtual address ranges, which
may lead to more random memory accesses and cache misses.
Nevertheless, the overall performance improvement is satis-
fying.
Results for deduplication.
We have also studied the ef-
fect of our data deduplication module. As shown in Table 3,
ZCOT can reduce the transferred data size for all four applica-
tions, ranging from 8.1% to 53.8%. Even for the non-iterative
application (WC), ZCOT is also helpful thanks to its inter-
nalization optimization technique. Meanwhile, LR and KM
receive smaller savings because they generate many different
Double objects in each iteration, which cannot be reused and
deduplicated. The result indicates that duplicated transmis-
sion is common in data analytics and ZCOT’s optimizations
are helpful. Note that the number of transferred bytes after
deduplication is still much larger than that in Kryo and JSL,
since both of them convert objects into a compact format before
transmission. Therefore, it is still preferred to use ZCOT with
larger network bandwidth.
PR WC TC KM LR
dedup 15.25 4.13 5.03 5.37 5.55
no-dedup 31.64 5.50 10.88 5.86 6.04
Table 3: Average transferred bytes (GB) for Spark executors
Various settings and overhead analysis.
We evaluate the
performance of ZCOT with various settings on the heap size
and the chunk size by using PR as an example. The results in
Figure 13 show that ZCOT is not sensitive to different settings
and reaches similar performance. We have also studied the
overhead of write barriers by running Spark applications atop
ZCOT’s JVMs (with Kryo serializers) and comparing the per-
formance against vanilla JVMs. The average overhead among
all applications is 2.73%, which is much smaller compared
with the improvement brought by ZCOT. We also find the
average communication overhead with the metadata server is
only several milliseconds for each data-processing iteration,
which usually lasts for seconds.
Figure 13: Results for PR under various settings: (a) maximum heap size; (b) chunk size.
6.4 Flink
Ease of integration.
We have also integrated ZCOT to Flink,
another big-data analytics framework. Although Flink adopts
its built-in serializer and deserializer for OSD, the integration
is not complicated since we only need to replace them with
ZCOT’s OSD-compatible interfaces and streams.
Evaluation results.
We leverage four representative SQL
queries in the TPC-H benchmark for evaluation (Q1, Q3, Q6,
and Q10) and rely on its built-in generator to create input
data (10GB). The configuration is similar to Spark: we launch
three workers on different machines for evaluation, but the
Java heap for each node is 20GB. Since the read and write
phases are overlapped in Flink, we do not break the execution
time into parts. The results in Figure 14 show that ZCOT
outperforms the built-in serializer in Flink for three out of
four queries and leads to 2.3%-22.2% improvement in query
execution time. ZCOT does not improve Q6 since it does
not involve a reduce operator and the amount of transferred
data is limited. It performs the best for Q10 (22.2%) since it
reaches 4.40× improvement for the write part and 1.44× for
the read part. The speedup is smaller compared with Spark
since Flink’s built-in serializers are manually optimized for
specific data structures (like tuples). Nevertheless, ZCOT still
shows better performance than the vanilla version of Flink,
which suggests the importance of zero-change transmission
mechanism.
Figure 14: The performance of Flink applications
7 Related work
7.1 OSD optimizations
OSD has become a considerable performance bottleneck es-
pecially for large-scale communication-intensive applications.
To optimize the time-consuming phases in OSD, prior work
such as Kryo [38], Skyway [26], and Naos [39] has refined
the transmission data format or leveraged the advances in
network hardware technologies. ZCOT instead aims at elim-
inating the whole OSD process. Apart from software-based
techniques, another line of work adopts hardware-based ap-
proaches to reduce OSD overhead. Optimus Prime [32] builds
a data transformation accelerator (DTA) to improve the OSD
throughput for microservices. Cereal [16] co-designs the data
transmission format with hardware accelerators to improve
the performance and energy efficiency of Spark applications.
Morpheus [40] moves the deserialization phase into smart
SSDs, while Hgum [46] leverages FPGAs to handle OSD
tasks. ZCOT is based on off-the-shelf hardware and thus or-
thogonal to those hardware-based optimizations.
7.2 Distributed language runtimes
The idea for building a distributed language runtime (e.g., dis-
tributed JVMs) has been explored for decades. Java/DSM [43]
builds a distributed JVM atop DSM for heterogeneous com-
puting. JESSICA [21, 47] provides a single global thread
space and transparently migrates Java threads for load bal-
ance. Comet [14] builds a DSM-abstraction for JVMs running
on both mobile devices and the cloud and relies on its mem-
ory model to achieve effective code offloading. Semeru [41]
proposes a universal Java heap abstraction so that a Java appli-
cation can freely access all memory resources in a memory-
disaggregated architecture. Those systems leverage a shared
heap to synchronize data among different endpoints, but they
do not consider the performance overhead of inter-JVM com-
munication for large applications. XMem [42] enables effi-
cient type-safe object sharing among multiple JVMs on the
same physical machine, but it does not consider distributed en-
vironments. ZCOT also proposes a distributed runtime design,
but it mainly focuses on boosting data transmission among
multiple JVMs.
7.3 Runtime optimizations for Java
High-level languages like Java are intensively used in large-
scale, distributed applications, which stimulates research inter-
ests in runtime optimizations for performance improvement.
ITask [11] makes data processing tasks interruptible when
facing large memory pressure, which leads to better perfor-
mance and fewer out-of-memory errors. Yak [27] divides the
application execution into epochs and triggers GC when an
epoch ends. Broom [12] embraces a region-based design and
puts objects with the same lifecycle into the same region for
fast reclamation. ScissorGC [18,19] proposes shadow regions
to improve the scalability of full GC phase. Taurus [22, 23]
coordinates GC from different JVMs to reach better perfor-
mance or smaller tail latency. Facade [28] and Deca [36]
store massive data objects in off-heap memory to reduce GC
pressure, while Gerenuk [24] enables speculative execution
on serialized data to reduce both memory footprint and GC
overhead. ZCOT focuses on eliminating the OSD process and
duplicated object transmission, and it also collects objects by
coordinating with the metadata server.
8 Conclusion
This work introduces ZCOT, which aims to eliminate the
object serialization/deserialization phase in data exchange
among language runtimes (like JVMs). ZCOT provides an
exchange space where objects are interpretable for all JVMs,
which removes the need for any data transformation during
object transmission. It also uncovers the duplicated object
transmission problem and provides a corresponding dedu-
plication mechanism. The evaluation shows that ZCOT can
significantly improve the performance of object transmission.
9 Acknowledgement
We sincerely thank our anonymous shepherd and reviewers
for their insightful suggestions. This work is supported in
part by the National Natural Science Foundation of China
(No. 62172272, 61925206, 62132014). Binyu Zang (byzang@sjtu.edu.cn) is the corresponding author.
A Artifact Appendix
Abstract
ZCOT, or Zero-Change Object Transmission, is proposed to
optimize data exchange among multiple Java virtual machines
(JVMs) in a distributed environment. Instead of sending and
receiving data with the costly object serialization/deserial-
ization (OSD) phase, ZCOT allows JVMs to directly com-
municate with Java objects, which significantly improves the
data exchange time, especially for applications like big data
analytics.
Scope
This artifact (including binaries, source code, documents, and
scripts) is used to conduct the main experiments in ZCOT, which consist of the following two parts:
Micro-benchmark performance. The result should show that ZCOT outperforms recent OSD optimizations (Skyway [26] and Naos [39]) and state-of-the-art OSD libraries (Kryo [38] and JSL) for most data structures used in Naos’ micro-benchmark; typical baseline Kryo usage is sketched below.
Spark performance. The result should show that ZCOT outperforms Kryo- and JSL-based Spark applications in both data exchange and task execution.
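To make the Kryo baseline concrete, the sketch below is our own illustration (not taken from the artifact or from Naos’ micro-benchmark) of typical usage of the Kryo library [38]: the application still registers classes and performs an explicit serialize/deserialize round trip through a byte buffer, which is exactly the step ZCOT avoids.

import java.util.ArrayList;
import java.util.List;
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

// Our own illustration of a typical Kryo baseline (not ZCOT or artifact code).
public class KryoBaseline {
    public static void main(String[] args) {
        Kryo kryo = new Kryo();
        kryo.register(ArrayList.class);  // registration is required by default in recent Kryo versions

        ArrayList<String> data = new ArrayList<>(List.of("a", "b", "c"));

        Output out = new Output(64, -1); // growable in-memory buffer
        kryo.writeObject(out, data);     // serialize: objects -> bytes
        byte[] wire = out.toBytes();     // bytes to be shipped over the network

        Input in = new Input(wire);
        ArrayList<?> copy = kryo.readObject(in, ArrayList.class); // deserialize: bytes -> objects
        System.out.println(copy);
    }
}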
Note that we only report numbers measured on our machines; results may differ on other hardware configurations.
Contents
We pack all related files into a single zipped archive with the following contents.
README. A file containing instructions for artifact evaluation.
ZCOT-jdk. The source code of a modified OpenJDK that supports ZCOT.
Meta-server. The source code of the metadata server used in ZCOT.
Micro. Scripts and jars used for the micro-benchmark.
Spark. Since the code size of Spark is quite large, we provide an executable Spark binary, slightly modified to evaluate ZCOT.
Naos-jdk. A slightly modified version of Naos’ OpenJDK to compare with ZCOT.
Hosting
Currently our code is not ready to be open-sourced. Nevertheless, you can contact us to obtain the artifact.
Requirements
Hardware requirements. We evaluate ZCOT on four nodes connected by 100 Gbit/s Mellanox ConnectX-5 NICs. The NIC bandwidth has a significant impact on ZCOT’s performance.
Software requirements. The operating system used on our machines is Ubuntu 16.04.2, but newer versions are also acceptable. Note that huge pages should be enabled to run ZCOT. Dependencies for installing OpenJDK are listed in the README file.
References
[1]
Lada A Adamic and Natalie Glance. The Political Blogosphere and
the 2004 US Election: Divided they Blog. In Proceedings of the 3rd
International Workshop on Link Discovery, pages 36–43. ACM, 2005.
[2]
Emmanuel Amaro, Christopher Branner-Augmon, Zhihong Luo, Amy
Ousterhout, Marcos K Aguilera, Aurojit Panda, Sylvia Ratnasamy, and
Scott Shenker. Can far memory improve job throughput? In Proceed-
ings of the Fifteenth European Conference on Computer Systems, pages
1–16, 2020.
[3]
Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu,
Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin,
Ali Ghodsi, et al. Spark sql: Relational data processing in spark. In
Proceedings of the 2015 ACM SIGMOD international conference on
management of data, pages 1383–1394, 2015.
[4]
Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan.
Group formation in large social networks: membership, growth, and
evolution. In Proceedings of the 12th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 44–54,
2006.
[5]
Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for
exotic particles in high-energy physics with deep learning. Nature
communications, 5(1):1–9, 2014.
[6]
Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl,
Seif Haridi, and Kostas Tzoumas. Apache flink: Stream and batch
processing in a single engine. Bulletin of the IEEE Computer Society
Technical Committee on Data Engineering, 36(4), 2015.
[7]
John B Carter, John K Bennett, and Willy Zwaenepoel. Implementation
and performance of munin. In Proceedings of the thirteenth ACM
symposium on Operating systems principles, pages 152–164, 1991.
[8]
David Detlefs, Christine H. Flood, Steve Heller, and Tony Printezis.
Garbage-first garbage collection. In Proceedings of the 4th Interna-
tional Symposium on Memory Management, ISMM 2004, Vancouver,
BC, Canada, October 24-25, 2004, pages 37–48. ACM, 2004.
[9] Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 401–414, 2014.
[10]
Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[11]
Lu Fang, Khanh Nguyen, Guoqing Xu, Brian Demsky, and Shan Lu.
Interruptible tasks: Treating memory pressure as interrupts for highly
scalable data-parallel programs. In Proceedings of the 25th Symposium
on Operating Systems Principles, pages 394–409, 2015.
[12] Ionel Gog, Jana Giceva, Malte Schwarzkopf, Kapil Vaswani, Dimitrios Vytiniotis, Ganesan Ramalingam, Manuel Costa, Derek G Murray, Steven Hand, and Michael Isard. Broom: Sweeping out garbage collection from big data systems. In 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015.
[13] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 599–613, 2014.
[14] Mark S Gordon, D Anoushe Jamshidi, Scott Mahlke, Z Morley Mao, and Xu Chen. COMET: Code offload by migrating execution transparently. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 93–106, 2012.
[15] Apache Hadoop. Hadoop, 2009.
[16]
Jaeyoung Jang, Sung Jun Jung, Sunmin Jeong, Jun Heo, Hoon Shin,
Tae Jun Ham, and Jae W Lee. A specialized architecture for ob-
ject serialization with applications to big data analytics. In 2020
ACM/IEEE 47th Annual International Symposium on Computer Ar-
chitecture (ISCA), pages 322–334. IEEE, 2020.
[17]
Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W.
Mahoney. Statistical properties of community structure in large social
and information networks. In Proc. Int. World Wide Web Conf., pages
695–704, 2008.
[18]
Haoyu Li, Mingyu Wu, and Haibo Chen. Analysis and optimizations
of java full garbage collection. In Proceedings of the 9th Asia-Pacific
Workshop on Systems, pages 1–7, 2018.
[19]
Haoyu Li, Mingyu Wu, Binyu Zang, and Haibo Chen. Scissorgc: scal-
able and efficient compaction for java full garbage collection. In Pro-
ceedings of the 15th ACM SIGPLAN/SIGOPS International Conference
on Virtual Execution Environments, pages 108–121, 2019.
[20]
Kai Li and Paul Hudak. Memory coherence in shared virtual memory
systems. ACM Transactions on Computer Systems (TOCS), 7(4):321–
359, 1989.
[21]
Matchy JM Ma, Cho-Li Wang, and Francis CM Lau. Jessica: Java-
enabled single-system-image computing architecture. Journal of Par-
allel and Distributed Computing, 60(10):1194–1222, 2000.
[22] Martin Maas, Krste Asanović, Tim Harris, and John Kubiatowicz. Taurus: A holistic language runtime system for coordinating distributed managed-language applications. ACM SIGPLAN Notices, 51(4):457–471, 2016.
[23] Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz. Trash day: Coordinating garbage collection in distributed systems. In 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015.
[24]
Christian Navasca, Cheng Cai, Khanh Nguyen, Brian Demsky, Shan
Lu, Miryung Kim, and Guoqing Harry Xu. Gerenuk: thin computation
over big native data using speculative program transformation. In Pro-
ceedings of the 27th ACM Symposium on Operating Systems Principles,
pages 538–553, 2019.
[25] Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. Latency-tolerant software distributed shared memory. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 291–305, 2015.
[26]
Khanh Nguyen, Lu Fang, Christian Navasca, Guoqing Xu, Brian Dem-
sky, and Shan Lu. Skyway: Connecting managed heaps in distributed
big data systems. In Proceedings of the Twenty-Third International
Conference on Architectural Support for Programming Languages and
Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, March 24-
28, 2018, pages 56–69. ACM, 2018.
[27] Khanh Nguyen, Lu Fang, Guoqing Xu, Brian Demsky, Shan Lu, Sanazsadat Alamian, and Onur Mutlu. Yak: A high-performance big-data-friendly garbage collector. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 349–365, 2016.
[28]
Khanh Nguyen, Kai Wang, Yingyi Bu, Lu Fang, Jianfei Hu, and Guo-
qing Xu. Facade: A compiler and runtime for (almost) object-bounded
big data applications. ACM SIGARCH Computer Architecture News,
43(1):675–690, 2015.
[29] OpenJDK. Jep 248: Make g1 the default garbage collector, 2017.
[30] OpenJDK. Jep 310: Application class-data sharing, 2018.
[31]
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.
The pagerank citation ranking: Bringing order to the web. Technical
report, Stanford InfoLab, 1999.
[32]
Arash Pourhabibi, Siddharth Gupta, Hussein Kassir, Mark Sutherland,
Zilu Tian, Mario Paulo Drumond, Babak Falsafi, and Christoph Koch.
Optimus prime: Accelerating data transformation in servers. In Pro-
ceedings of the Twenty-Fifth International Conference on Architectural
Support for Programming Languages and Operating Systems, pages
1203–1216, 2020.
[33]
Daniel J Scales and Monica S Lam. The design and evaluation of a
shared object system for distributed memory machines. In Proceed-
ings of the 1st USENIX conference on Operating Systems Design and
Implementation, pages 9–es, 1994.
[34]
Ioannis Schoinas, Babak Falsafi, Alvin R Lebeck, Steven K Reinhardt,
James R Larus, and David A Wood. Fine-grain access control for
distributed shared memory. In Proceedings of the sixth international
conference on Architectural support for programming languages and
operating systems, pages 297–306, 1994.
[35]
Yizhou Shan, Shin-Yeh Tsai, and Yiying Zhang. Distributed shared
persistent memory. In Proceedings of the 2017 Symposium on Cloud
Computing, pages 323–337, 2017.
[36]
Xuanhua Shi, Zhixiang Ke, Yongluan Zhou, Hai Jin, Lu Lu, Xiong
Zhang, Ligang He, Zhenyu Hu, and Fei Wang. Deca: a garbage collec-
tion optimizer for in-memory data processing. ACM Transactions on
Computer Systems (TOCS), 36(1):1–47, 2019.
[37] Eishay Smith. Jvm-serializers, 2020.
[38] Esoteric Software. Kryo, 2021.
[39]
Konstantin Taranov, Rodrigo Bruno, Gustavo Alonso, and Torsten Hoe-
fler. Naos: Serialization-free RDMA networking in java. In 2021
USENIX Annual Technical Conference (USENIX ATC 21), pages 1–14.
USENIX Association, July 2021.
[40]
Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, and
Steven Swanson. Morpheus: Creating application objects efficiently
for heterogeneous computing. ACM SIGARCH Computer Architecture
News, 44(3):53–65, 2016.
[41] Chenxi Wang, Haoran Ma, Shi Liu, Yuanqi Li, Zhenyuan Ruan, Khanh Nguyen, Michael D Bond, Ravi Netravali, Miryung Kim, and Guoqing Harry Xu. Semeru: A memory-disaggregated managed runtime. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 261–280, 2020.
[42]
Michal Wegiel and Chandra Krintz. Xmem: type-safe, transparent,
shared memory for cross-runtime communication and coordination. In
Proceedings of the 29th ACM SIGPLAN Conference on Programming
Language Design and Implementation, pages 327–338, 2008.
[43]
Weimin Yu and Alan Cox. Java/dsm: A platform for heterogeneous
computing. Concurrency: Practice and Experience, 9(11):1213–1224,
1997.
[44]
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott
Shenker, and Ion Stoica. Spark: Cluster computing with working
sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing,
HotCloud’10, Boston, MA, USA, June 22, 2010, page 95. USENIX
Association, 2010.
[45]
Matthew J Zekauskas, Wayne A Sawdon, and Brian N Bershad. Soft-
ware write detection for a distributed shared memory. In Proceedings
of the 1st USENIX conference on Operating Systems Design and Im-
plementation, pages 8–es, 1994.
[46]
Sizhuo Zhang, Hari Angepat, and Derek Chiou. Hgum: Messaging
framework for hardware accelerators. In 2017 International Conference
on ReConFigurable Computing and FPGAs (ReConFig), pages 1–8.
IEEE, 2017.
[47]
Wenzhang Zhu, Cho-Li Wang, and Francis CM Lau. Jessica2: A dis-
tributed java virtual machine with transparent thread migration support.
In Proceedings. IEEE International Conference on Cluster Computing,
pages 381–388. IEEE, 2002.