Network Working Group Y. Cui
Internet-Draft L. Sun
Intended status: Informational Tsinghua University
Expires: March 13, 2016 September 10, 2015
Internet Storage Sync: Problem Statement
draft-cui-iss-problem-01
Abstract
Internet storage services have become increasingly popular. They
attract a huge number of users and produce a significant share of
Internet traffic. However, most existing Internet storage services
use proprietary sync protocols to achieve data synchronization, and
measurements show that almost all of these protocols are inefficient
and leave room for improvement. This document outlines the problems
caused by inefficient proprietary sync protocols and shows the demand
for an efficient, standard sync protocol.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on March 13, 2016.
Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
Cui & Sun Expires March 13, 2016 [Page 1]
Internet-Draft iss Problems September 2015
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Terminology and Concepts . . . . . . . . . . . . . . . . . . 3
3. Architecture of Internet Storage Service . . . . . . . . . . 4
4. Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.1. Complicated Support for APIs . . . . . . . . . . . . . . 6
4.2. Unavailable Cross-service Sync . . . . . . . . . . . . . 6
4.3. Redundant Similar Clients . . . . . . . . . . . . . . . . 7
4.4. Different Capability Configurations and Implementations . 7
4.5. Challenges in Mobile and Wireless Environments . . . . . 9
4.6. Unsatisfactory Collaborative Work Ability . . . . . . . . 10
5. Advantages of Standard Sync Protocol . . . . . . . . . . . . 11
6. Related Work . . . . . . . . . . . . . . . . . . . . . . . . 12
7. Security Considerations (TBD) . . . . . . . . . . . . . . . . 12
8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 13
9. Informative References . . . . . . . . . . . . . . . . . . . 13
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 13
1. Introduction
Internet storage services provide a way for users to synchronize
local files or folders with remote servers. They enable users to
back up and share their local data on the Internet and make it
possible for users to access, retrieve and modify their synchronized
data via multiple terminals. In recent years, the explosion in the
popularity of Internet storage services has attracted more and more
users and also providers (e.g. Google, Microsoft and Amazon)
offering cheaper and larger storage space. Dropbox, typically
considered the market leader, has announced that it has more than
400 million registered users. It is therefore not surprising that
Internet storage services account for a significant share of Internet
traffic, and this share is expected to keep growing.
Existing Internet storage services employ data synchronization (sync)
to retrieve files from, and upload local files to, the remote
servers. A sync protocol between client and server is required to
achieve that. Almost all existing Internet storage services use
their own proprietary sync protocols. Having various proprietary
sync protocols impedes the development of Internet storage services,
since it degrades the user experience when users want to use
multiple services or wish to share local files with users of
other services. Furthermore, having proprietary protocols
increases the complexity of developing a new Internet storage service
for newcomers to this market.
Previous work shows that most existing sync protocols employed by
different Internet storage services are inefficient: they often
waste limited bandwidth and introduce extra traffic. Such
inefficiency is even more problematic in mobile and wireless
environments. For example, if wireless connectivity is interrupted
while a user is uploading pictures from a mobile phone, the
synchronization fails but has still consumed a large amount of
traffic and battery. The unsatisfactory performance caused by the
limitations of immature sync protocols and poor system designs has
become a critical problem in the development of Internet storage
services.
To address the problems mentioned above, an open and standard storage
sync protocol is required. In addition, this standard sync protocol
is expected to be more efficient: it should accelerate the sync
process, reduce unnecessary traffic and perform better in mobile and
wireless environments.
This document outlines the problems that arise in existing Internet
storage services with inefficient proprietary sync protocols.
Section 2 lists the terminology and related concepts of Internet
storage services. Section 3 introduces the architecture of existing
Internet storage services. Section 4 describes the main problems and
issues that need to be considered. Section 5 explains the advantages
of using an open and standard sync protocol. Section 6 identifies the
differences between Internet Storage Sync (ISS) and related work in
the IETF (i.e. rsync over WebDAV).
2. Terminology and Concepts
Data synchronization (sync): The most important operation of Internet
storage services; it is more than remote file transfer. It allows
the client to automatically propagate changes to the files
stored on the remote servers, and the client is promptly notified
of changes to a local file.
Client: An application which is installed at the user side (i.e. on
multiple terminals). It enables users to access and experience
Internet storage service.
Control server: The entity that takes the responsibility of
authenticating users, managing metadata information and also
notifying changes to the client. It stores authentication and
metadata information of users.
Data storage server: The entity that stores the synchronized files of
users.
Control data: The control information exchanged with the control
server to carry out the data sync process. Typical control data
includes metadata (e.g. hashes of chunks) and authentication
information.
Content data: The original data of a user's local file, often in the
form of small chunks.
Sync protocol: A communication protocol between client and remote
servers to achieve data synchronization. It contains a control flow
and a data flow. Existing sync protocols are generally built on
HTTP/HTTPS.
o Control flow: This flow is for client and control server to
exchange control data.
o Data flow: This flow is for transmitting content data between
client and data storage servers.
Sync efficiency: A performance metric that indicates how fast the
changes can be synchronized to the Internet with the lowest traffic
overhead.
Useful capabilities to improve sync efficiency:
o Chunking: Split a large file into small chunks.
o Bundling: Transmit multiple small chunks as a single big chunk.
o Deduplication: Avoid retransmission of existing content on the
Internet.
o Delta-encoding: Only synchronize modified data.
o Compression: Compress data before transmission.
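As a rough illustration of how three of these capabilities can be combined, the following Python sketch applies fixed-size chunking, hash-based deduplication and compression to one modified file. It is not taken from any existing service: the function names, the 4-byte chunk size and the use of SHA-256 and zlib are assumptions chosen only to keep the example small.

```python
import hashlib
import zlib


def split_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Chunking: split data into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def plan_upload(chunks: list[bytes], known_hashes: set[str]) -> list[tuple[str, bytes]]:
    """Deduplication + compression: return (hash, compressed chunk) pairs
    only for chunks whose hash is not already known to the server."""
    to_send = []
    seen = set(known_hashes)
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in seen:
            continue  # content already stored remotely (or already queued)
        seen.add(digest)
        to_send.append((digest, zlib.compress(chunk)))
    return to_send


# Hypothetical 4-byte chunks keep the example readable; the services
# discussed in this document use much larger chunks.
old_file = b"AAAABBBBCCCC"
new_file = b"AAAAXXXXCCCC"
server_hashes = {hashlib.sha256(c).hexdigest() for c in split_chunks(old_file, 4)}
upload = plan_upload(split_chunks(new_file, 4), server_hashes)
```

In this sketch only the chunk that actually changed ends up in `upload`; the two unchanged chunks are recognized by their hashes and skipped.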
3. Architecture of Internet Storage Service
The architecture of most Internet storage services is composed of
three major components: client, control server and data storage
server. The overall architecture is shown in Figure 1.
                 * * * * * * * * * * * * * * * * *
             *               INTERNET                *
           *  +------------+      +------------+       *
   +----------|  Control   |      | +------------+     *
   |       *  |   server   |      | |Data storage|===========+
   |       *  +------------+      +-|  servers   |     *     |
   |         *                      +------------+   *       |
   |           * * * * * * * * * * * * * * * * * * *         |
   |                                                         |
Control Flow                                             Data Flow
   |                                                         |
   |                      +--------+                         |
   +----------------------| Client |=========================+
                          +--------+

                           Figure 1
With the help of the sync protocol, the three components can
communicate with each other. The control server is responsible for
storing all the control data, including authentication information
and metadata. Once changes are made to synchronized files, the
control server notifies the clients. The other type of data,
content data, is stored in the form of chunks on the data storage
servers, which have no knowledge of the sources, users or
relationships among data chunks. That is to say, one user file
may be split and stored on several different data storage servers.
These two kinds of servers are separate logical entities and are
usually deployed in different locations. Every time the client
synchronizes a local file to the Internet, it needs to exchange both
control data with the control server and content data with the data
storage servers.
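The separation of control data and content data described above can be sketched with a small in-memory model. Everything here is hypothetical: the class names, the placement rule that maps a chunk hash to a storage server, and the 4-byte chunk size are illustrative assumptions, and no real service is claimed to work this way.

```python
import hashlib


class ControlServer:
    """Holds metadata only: which chunk hashes make up each file."""

    def __init__(self):
        self.metadata = {}  # file name -> ordered list of chunk hashes

    def commit(self, name, chunk_hashes):
        self.metadata[name] = list(chunk_hashes)

    def lookup(self, name):
        return list(self.metadata[name])


class DataStorageServer:
    """Holds opaque chunks keyed by hash; knows nothing about files or users."""

    def __init__(self):
        self.chunks = {}

    def put(self, digest, chunk):
        self.chunks[digest] = chunk

    def get(self, digest):
        return self.chunks[digest]


def pick_server(digest, storage_servers):
    # Hypothetical placement rule: derive the server index from the hash.
    return storage_servers[int(digest, 16) % len(storage_servers)]


def upload(name, data, control, storage_servers, chunk_size=4):
    hashes = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        pick_server(digest, storage_servers).put(digest, chunk)  # data flow
        hashes.append(digest)
    control.commit(name, hashes)  # control flow: register the chunk list


def download(name, control, storage_servers):
    hashes = control.lookup(name)  # control flow
    return b"".join(pick_server(d, storage_servers).get(d) for d in hashes)  # data flow


control = ControlServer()
servers = [DataStorageServer(), DataStorageServer()]
upload("notes.txt", b"hello internet storage sync", control, servers)
restored = download("notes.txt", control, servers)
```

The control server never sees file content, and each storage server sees only hash-addressed chunks; the file can be reassembled only by combining both flows.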
4. Problems
Existing popular Internet storage services, including Dropbox,
OneDrive and GoogleDrive, use their own proprietary sync protocols
to achieve data synchronization. The use of different proprietary
protocols is generally considered detrimental to the development of
Internet services. Moreover, previous work and measurements have
revealed that the sync efficiency of existing Internet storage
services is poor. This section describes current problems of
Internet storage services caused by their sync protocols. We
summarize six specific problems, which we hope the IETF community
will address.
4.1. Complicated Support for APIs
Popular Internet storage services provide APIs to encourage
developers to build third party clients. The APIs allow user programs
to access the service provider's servers and synchronize local files
with those servers. These APIs can also include advanced features
or functions to make the client work better. Different providers
offer different APIs to developers, and these APIs commonly differ
in style and features. Typically, the service provider needs to
provide different sets of APIs for different platforms (e.g. Windows
or Android) and update them frequently.
Developers, in turn, need to learn the provided APIs in order
to design and implement their own clients. Supporting multiple
Internet storage services is an obvious advantage for a third party
client, and there are already some successful third party clients
that do so (e.g. ExpanDrive [ExpanDrive], IFTTT [IFTTT]). However,
it is not easy for developers to learn and apply so many different
APIs to develop and maintain their third party clients.
In summary, both providers and developers suffer to some extent from
the burden of supporting many different APIs.
4.2. Unavailable Cross-service Sync
Sharing is one of the most important functions provided by Internet
storage services. With this function, files on the Internet
can be easily synchronized and manipulated by different people and
groups. Anyone who is permitted to read and download a file is
able to modify it and upload new versions of it to the Internet.
However, this sharing function works well only within a single
service. That is to say, users of the same Internet storage
service can easily share files and coordinate operations on them.
Sharing among different Internet storage services, in contrast, is
incomplete because sync among different services is not available.
The cause is that different services use different proprietary sync
protocols. For example, if the shared files are stored on a
Dropbox server, a GoogleDrive client cannot retrieve or upload them
through Dropbox's sync protocol, since it does not implement that
protocol. Nor can it use GoogleDrive's own sync protocol to
retrieve or upload files on a Dropbox server.
Currently, if a Dropbox user wishes to share a file with a
GoogleDrive user, he has to fall back on a basic HTTP connection.
The Dropbox user sends an HTTP link to the file to the GoogleDrive
user. After clicking on that link, the GoogleDrive user can
download the file through HTTP. However, the only thing that the
GoogleDrive user can do with the shared file is to read and
download it. He cannot modify and update the shared file, since
Dropbox and GoogleDrive use two different proprietary sync
protocols.
The existing sharing function among different services is therefore
incomplete (i.e. only download is available) and falls far short of
users' expectations. To achieve a complete and useful sharing
function, sync among different services should be available.
4.3. Redundant Similar Clients
The emergence of more and more Internet storage services provides
users with a wide range of choices for storing their local files
remotely. As with other Internet applications, users are not
restricted to only one of those services. In fact, they tend to have
multiple accounts with different Internet storage services and
use them simultaneously. One important reason is that users
are always pursuing better functionality. For example, Dropbox is
better at file processing, OneDrive is better at interoperability
and compatibility with Microsoft Office, while GoogleDrive handles
mail attachments better. To obtain all the desired functions and
features, a simple way is to register for and use all the
desired Internet storage services. Furthermore, people may simply
need multiple Internet storage services for larger storage space and
higher reliability.
However, using different Internet storage services means that the
user has to install multiple similar client applications, since
almost all commercial Internet storage services have their own
proprietary sync protocols and corresponding client applications.
Installing and running multiple client applications degrades the
user experience and also increases the complexity of syncing files
with different providers' servers on the Internet. For instance,
users have to repeat the same operations in order to upload the same
file to their different service accounts.
4.4. Different Capability Configurations and Implementations
Data synchronization is not a simple remote file transfer process; it
can implement several capabilities to optimize data storage usage
and speed up data transmission. There exist five well-known
capabilities that are employed by Internet storage services to
improve the sync efficiency: chunking, bundling, deduplication,
delta-encoding and compression.
However, the investigation in [Benchmarking] shows that different
Internet storage services have different capability configurations
and implementations, and most existing Internet storage services do
not implement all five capabilities in their sync processes.
The lack of such capabilities does affect sync efficiency. For
example, when a user wishes to synchronize multiple small files to
the Internet, bundling is a useful capability for reducing the sync
time. If bundling is not implemented, the user will suffer from
the TCP slow start effect, since there will be a new connection for
each small file. Bundling small files together effectively reduces
the number of TCP connections, so that the overall sync time and
traffic overhead can be significantly decreased. Further measurement
details and conclusions for other capabilities can be found in
[Benchmarking]. Table 1 shows the capability implementations
of four popular Internet storage services (i.e. Dropbox,
GoogleDrive, OneDrive and Seafile) on Windows OS.
+----------------+-------------+-------------+-------------+-------------+
| Capabilities | Dropbox | GoogleDrive | OneDrive | Seafile |
| | | | | |
+----------------+-------------+-------------+-------------+-------------+
| Chunking | 4MB | 8MB | Variable | Variable |
+----------------+-------------+-------------+-------------+-------------+
| Bundling | Yes | No | No | No |
+----------------+-------------+-------------+-------------+-------------+
| Deduplication | Yes | No | No | Yes |
+----------------+-------------+-------------+-------------+-------------+
| Delta-encoding | Yes | No | No | No |
+----------------+-------------+-------------+-------------+-------------+
| Compression | Yes | Yes | No | No |
+----------------+-------------+-------------+-------------+-------------+
Table 1
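The effect of bundling on the number of transfers, discussed before Table 1, can be illustrated with a small sketch. The greedy grouping rule, the function name and the workload (one hundred 10 KB files, a 4 MB bundle limit) are assumptions made up for illustration; they are not measurements and do not describe any particular service.

```python
def bundle(file_sizes, bundle_limit):
    """Greedily group files, in order, into bundles of at most bundle_limit
    bytes. A file larger than the limit gets a bundle of its own."""
    bundles, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > bundle_limit:
            bundles.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bundles.append(current)
    return bundles


small_files = [10_000] * 100                 # made-up workload: 100 files of 10 KB
bundles = bundle(small_files, 4_000_000)     # assumed 4 MB bundle limit

transfers_without_bundling = len(small_files)  # one transfer per file
transfers_with_bundling = len(bundles)         # one transfer per bundle
```

If each transfer uses a new connection, as described above for services without bundling, the number of connections (and slow start phases) drops from one per file to one per bundle.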
Measurements and study in [QuickSync] reveal that the sync efficiency
of current Internet storage services still has plenty of room for
improvement, since the services do not implement the key
capabilities and the sync protocol correctly. The remainder of this
subsection lists a few specific problems.
Chunking is the most widely implemented capability. It simplifies
transmission recovery when the synchronization of a large file is
interrupted. Different implementations of chunking use different
chunking schemes (i.e. dynamic chunking or static chunking) and chunk
sizes. Typically, a smaller chunk size and a dynamic chunking scheme
(e.g. Content Defined Chunking) are better for detecting and
eliminating redundancy. However, the ability to detect more
redundancy does not always translate into better sync efficiency,
since it introduces more computation overhead. A trade-off between
computation time and transmission time needs to be considered to
achieve effective chunking. A better chunking strategy may be
network-aware, meaning that the sync process selects an appropriate
chunking strategy according to the current network conditions.
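To illustrate why dynamic chunking detects more redundancy than static chunking, the following sketch compares the two after a single byte is inserted at the front of a file. It is a deliberately simplified form of Content Defined Chunking: the rolling hash is just the sum of the last WINDOW bytes, there are no minimum or maximum chunk sizes, and the constants and the seeded pseudo-random test data are assumptions chosen for illustration.

```python
import random

WINDOW = 16    # bytes covered by the rolling hash (assumed value)
DIVISOR = 64   # a cut is declared when hash % DIVISOR == 0 (assumed value)


def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut after every position whose trailing WINDOW-byte sum is a
    multiple of DIVISOR, so cut points depend only on nearby content."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # slide the window forward by one byte
        if i + 1 >= WINDOW and rolling % DIVISOR == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Static chunking: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20_000))
edited = b"!" + original  # one byte inserted at the front

shared_fixed = len(set(fixed_chunks(original)) & set(fixed_chunks(edited)))
shared_cdc = len(set(cdc_chunks(original)) & set(cdc_chunks(edited)))
```

With static chunking the insertion shifts every chunk boundary, so the edited file shares essentially no chunks with the original. With content-defined boundaries, every chunk after the first one reappears unchanged in the edited file, at the cost of computing the rolling hash over every byte.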
Delta-encoding is an algorithm for achieving incremental sync, in
which only modified data is transmitted. It is hard to implement:
among existing commercial Internet storage services, only Dropbox
supports this capability. Moreover, measurement results from
[QuickSync] show that incremental sync does not work in all
cases. For some typical sync workloads, incremental sync results in
sync traffic 10 times larger than the size of the modification. An
improved delta-encoding algorithm is needed that makes incremental
sync effective in various scenarios.
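The basic idea of delta-encoding can be sketched as follows. This is a simplified block-matching scheme in the spirit of rsync, not the algorithm of any existing service: it looks up a strong hash at every byte offset instead of using a cheap rolling checksum, and the 4-byte block size, the operation names and the helper functions are all assumptions made for illustration.

```python
import hashlib

BLOCK = 4  # block size in bytes (tiny, for illustration only)


def signatures(old: bytes) -> dict[bytes, int]:
    """Map the hash of every full BLOCK-sized block of `old` to its offset."""
    sigs = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        sigs.setdefault(hashlib.sha256(old[off:off + BLOCK]).digest(), off)
    return sigs


def make_delta(new: bytes, sigs: dict[bytes, int]) -> list[tuple]:
    """Encode `new` as ('copy', offset) and ('data', literal bytes) operations."""
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        block = new[i:i + BLOCK]
        off = sigs.get(hashlib.sha256(block).digest()) if len(block) == BLOCK else None
        if off is None:
            literal.append(new[i])  # no matching old block starts here
            i += 1
        else:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", off))  # reuse a block the receiver already has
            i += BLOCK
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def apply_delta(old: bytes, ops: list[tuple]) -> bytes:
    """Rebuild the new file from the old file and the delta operations."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg:arg + BLOCK] if kind == "copy" else arg
    return bytes(out)


old = b"AAAABBBBCCCCDDDD"
new = b"AAAAxBBBBCCCCDDDD"  # one byte inserted after the first block
delta = make_delta(new, signatures(old))
literal_bytes = sum(len(arg) for kind, arg in delta if kind == "data")
```

Here only the single inserted byte travels as literal data; the rest of the file is expressed as references to blocks the receiver already holds. The sketch also hints at why the scheme is costly: the sender must examine every byte offset of the new file.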
The application-layer acknowledgement mechanism is another critical
feature that has an impact on sync time and efficiency. Most
existing Internet storage services employ a sequential
acknowledgement mechanism, in which the next chunk may be
transmitted only after the previous chunk's acknowledgement has been
received. As a result, users usually suffer from high sync latency
when synchronizing many small files in a high RTT environment. A
delayed acknowledgement mechanism enables the client to send and
pipeline chunks without waiting for previous acknowledgements, which
markedly reduces the sync time.
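A back-of-the-envelope model makes the difference between the two acknowledgement mechanisms concrete. The model below is a simplification introduced here for illustration: it ignores TCP dynamics, server processing time and any limit on the number of chunks in flight, and the example numbers (100 chunks, 200 ms RTT, 10 ms transmission time per chunk) are made up.

```python
def sequential_sync_time(num_chunks, rtt, tx_time):
    """Each chunk is sent only after the previous acknowledgement arrives."""
    return num_chunks * (tx_time + rtt)


def pipelined_sync_time(num_chunks, rtt, tx_time):
    """Chunks are sent back to back; only the final acknowledgement is
    waited for."""
    return num_chunks * tx_time + rtt


# Made-up scenario: 100 small chunks, 200 ms RTT, 10 ms to transmit a chunk.
sequential = sequential_sync_time(100, rtt=0.2, tx_time=0.01)
pipelined = pipelined_sync_time(100, rtt=0.2, tx_time=0.01)
```

Under these assumptions the sequential mechanism pays one RTT per chunk, so its sync time grows with the product of chunk count and RTT, whereas the pipelined mechanism pays the RTT only once.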
4.5. Challenges in Mobile and Wireless Environments
The increasing number of mobile terminals introduces the requirement
of synchronizing data on any device via any connectivity at any time
and anywhere. A change made to the data on the desktop is
required to be automatically transferred to the user's mobile phone
or other mobile devices. Based on the measurements in
[Look_at_Mobile_Cloud], current mobile Internet storage services do
not achieve satisfactory sync efficiency. The root causes are
twofold:
First of all, mobile devices have limited storage and computation
ability, so it is hard to implement all five useful
capabilities discussed previously on a mobile client (Table 2 shows
the capability implementations on Android OS). The
measurement results in [Look_at_Mobile_Cloud] show that none of the
existing mobile Internet storage services implement all five key
capabilities. In fact, only very few of the capabilities can be
found in a mobile Internet storage client. That explains why most
Internet storage services waste limited bandwidth, produce a large
amount of useless traffic and suffer long sync times in the mobile
environment. How to implement all the desired capabilities with
lower storage and computation requirements is a critical problem
that needs to be addressed.
+----------------+-------------+-------------+-------------+-------------+
| Capabilities | Dropbox | GoogleDrive | OneDrive | Seafile |
| | | | | |
+----------------+-------------+-------------+-------------+-------------+
| Chunking | 4MB | 260K | 1MB | No |
+----------------+-------------+-------------+-------------+-------------+
| Bundling | No | No | No | No |
+----------------+-------------+-------------+-------------+-------------+
| Deduplication | Yes | No | No | No |
+----------------+-------------+-------------+-------------+-------------+
| Delta-encoding | No | No | No | No |
+----------------+-------------+-------------+-------------+-------------+
| Compression | No | No | No | No |
+----------------+-------------+-------------+-------------+-------------+
Table 2
Secondly, wireless connectivity is not very stable due to the nature
of wireless signals. Its limited bandwidth, higher packet loss and
other drawbacks make incremental sync even more important. Full-file
sync is not a wise choice in wireless conditions, since users may
suffer frequent sync failures and heavy traffic.
[Look_at_Mobile_Cloud] points out two challenges that account for the
complexity and difficulty of implementing incremental sync in a
practical mobile Internet storage service. The first is that many
existing Internet storage services are built on top of RESTful
infrastructure, which means that data can only be accessed at
the file level. The second is that most delta-encoding algorithms
work at file granularity. Both conflict with the architecture of
Internet storage services, which splits a file into small chunks and
distributes them across different servers.
4.6. Unsatisfactory Collaborative Work Ability
With the popularity of Internet storage services, collaborative work
is becoming an important feature of such services. This feature is
especially important and provides convenience for a team or an
organization since participants could easily edit and retrieve the
target file on the Internet. Currently, such collaborative work
ability is still unsatisfactory: some common operations may lead
to unexpected results.
For example, parallel updates from different end users may result in
a version conflict. If two or more users are editing the same file,
it is hard to update the file correctly. How to ensure
correctness without sacrificing the user experience is a significant
problem (a simple way to avoid conflicts is to allow only one user
to modify the document at a time, but this degrades the user
experience). In other words, conflict management should be
transparent to the end users.
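The version conflict described above can be made concrete with a minimal sketch of conflict detection. This is only one common approach (a version counter checked on every update); the class and exception names are hypothetical, and the sketch covers detection only, not the transparent conflict resolution that this section calls for.

```python
class ConflictError(Exception):
    """Raised when an update was made against an outdated version."""


class VersionedFile:
    def __init__(self, content=b""):
        self.version = 0
        self.content = content

    def update(self, base_version, new_content):
        """Accept the update only if it was made against the current version."""
        if base_version != self.version:
            raise ConflictError(
                f"update based on version {base_version}, current is {self.version}"
            )
        self.content = new_content
        self.version += 1
        return self.version


doc = VersionedFile(b"draft")
alice_base = bob_base = doc.version        # both users start from version 0
doc.update(alice_base, b"draft + Alice")   # Alice's update is accepted
try:
    doc.update(bob_base, b"draft + Bob")   # Bob's update is based on a stale version
    bob_conflict = False
except ConflictError:
    bob_conflict = True
```

Detecting the conflict is the easy part; deciding what to do with Bob's edits without involving the users is the open problem.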
5. Advantages of Standard Sync Protocol
An open and standard sync protocol between client and server can
effectively address the problems mentioned above. The sync protocol
consists of two types of flows: control flow and data flow. Control
flow is between the client and the control server. It is intended
for user authentication, metadata management and the active
notification of data changes. Data flow is between the client and
the data storage servers, and is used only for transmitting actual
file data (in the form of numerous chunks). Together, control flow
and data flow enable the whole data synchronization. According to
the analysis of the problems above, the key capabilities should be
supported as options in the sync protocol, and it would be better if
the protocol were network-aware. The rest of this section lists the
advantages of employing an open and standard sync protocol.
First, with a standard sync protocol, a third party
client that supports multiple Internet storage services is easy to
implement, since the APIs provided by different providers would be
unnecessary or at least simplified. This would attract more
people and organizations to develop and implement their own clients
(it may even be possible for a user to implement his own
client). As a result, users no longer need multiple clients for
multiple services, and their user experience is improved.
Furthermore, competition in the (third party) client market would
increase, which benefits the users. They would be able to choose
their clients flexibly, and frequent client updates would give
users more and better features and functions.
Another advantage of having a standard sync protocol is that sync
among different services becomes available, or at least possible to
achieve. If two different services both employ the standard sync
protocol, their users could share files with each other using the
same standard sync protocol (rather than basic HTTP). That is
to say, a user could access, retrieve, modify or upload files of
users of other services.
Using a standard sync protocol also makes it easier to improve
Internet storage services. Compared with the existing proprietary
protocols, a standard sync protocol is fully open and designed by
many contributors. People are welcome to revise and improve the
standard protocol. We believe that both users and providers will
benefit greatly from such a standard sync protocol.
6. Related Work
WebDAV ([RFC4918]) provides an alternative way to sync local data
with remote web servers. It can be regarded as a previous IETF effort
on file collections, authoring and versioning over HTTP. WebDAV
extends HTTP to enable users to collaboratively edit and manage
files on remote servers. It is more suitable for structured data
and suffers from inefficiency. Rsync ([rsync]) is a classic
delta-encoding algorithm that implements incremental sync (i.e. only
changes are synchronized). Applying rsync when using WebDAV (rsync
over WebDAV) offers a good solution that improves sync efficiency
dramatically. Readers may be confused when ISS and rsync over WebDAV
are put side by side and may ask why we propose the ISS work given
the existence of rsync over WebDAV. The following list of
differences between ISS and rsync over WebDAV helps to answer the
question.
o Different system architectures: ISS employs a distributed system
architecture in which the control server and data server are
separated. In rsync over WebDAV, a single web server handles both
control and data storage. A distributed architecture is beneficial
to privacy protection and efficient use of resources.
o Different sync efficiency requirements: ISS is designed for
network storage services, which handle different kinds of
data (including movies, images and documents). ISS can be expected
to handle large data files more frequently, which places a higher
requirement on sync efficiency. Inefficient sync or sync failure
of a large data file is more likely to happen but less
tolerable to the end user.
o Different version control requirements: ISS places higher
requirements on version control. It should be more like git, in
that parallel updates (conflict management) and other features
should be considered.
7. Security Considerations (TBD)
TBD
8. Acknowledgements
The authors would like to thank Barry Leiba, Mark Nottingham, Julian
Reschke, Marc Blanchet, Mike Bishop, Haibing Song, Philip Hallam
Baker, Michiel de Jong, Zeqi Lai and Ted Lemon for their valuable
comments and contributions to this work.
9. Informative References
[Benchmarking]
Drago, I., Bocchi, E., Mellia, M., Slatman, H., and A.
Pras, "Benchmarking Personal Cloud Storage", IMC, 2013.
[ExpanDrive]
"ExpanDrive".
[IFTTT] "IFTTT".
[Inside_Dropbox]
Drago, I., Mellia, M., Munafo, M., Sperotto, A., Sadre,
R., and A. Pras, "Inside Dropbox: Understanding Personal
Cloud Storage Services", IMC, 2012.
[Look_at_Mobile_Cloud]
Cui, Y., Lai, Z., and N. Dai, "A First Look at Mobile
Cloud Storage Services: Architecture, Experimentation and
Challenge", IEEE Network, 2015.
[QuickSync]
Cui, Y., Lai, Z., Wang, X., Dai, N., and C. Miao,
"QuickSync: Improving Synchronization Efficiency for
Mobile Cloud Storage Services", MOBICOM, 2015.
[RFC4918] Dusseault, L., Ed., "HTTP Extensions for Web Distributed
Authoring and Versioning (WebDAV)", RFC 4918,
DOI 10.17487/RFC4918, June 2007.
[rsync] "rsync".
Authors' Addresses
Yong Cui
Tsinghua University
Beijing 100084
P.R.China
Phone: +86-10-6260-3059
Email: yong@csnet1.cs.tsinghua.edu.cn
Linhui Sun
Tsinghua University
Beijing 100084
P.R.China
Phone: +86-10-6278-5822
Email: lh.sunlinh@gmail.com