Version Control of the World Wide Web
Request for Comments: ?
Status of this Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Abstract
A printed book is a static point: you can reference it as a point. But a
web page is a line of points, a linear history of points, and this
behavior is not reflected or accommodated in HTTP 1.1. When one views
a page as a repeat observer, all one can discern is the slope of this
line, using the points one has physically archived (seen before,
saved) or cached in one's resident/ephemeral memory. Other than by way
of personal accountability, there is no comprehensive mechanism for
retroactively addressing or retrieving a web page at a previous
point. The Internet Archive offers one approach, archiving content
at set intervals, but its coverage is limited and its approach to
predicting changes in content is naive.
1. Introduction
This document builds on RFC 2291, specifically addressing resource
versioning and how various versioning systems for the web might work
or look if they were designed to (a) track changes [diffs] to files, (b)
notify [pub/sub] other services of CRUD operations on content, and (c)
expose a mechanism for retroactively accessing a resource at a previous
point in time.
2. Rationale
If a document is worth publishing and reading once, it is because the
author deemed its contents valuable for representing a state of
knowledge at a given time. Because the nature of the world is
inherently stochastic (see the Frame Problem [c]), it is expected
that one will eventually discover one or more of those assumptions or
conclusions to be incorrect (e.g., per Occam's razor, there are many
ways of being wrong and only one way of being right). In the current
process, the end user has no way to introspect these intermediary
learnings and this progress, which are often fundamental to
understanding a solution. In the words of
Richard Feynman: https://www.youtube.com/watch?v=Hb8P9N4KGm8&t=12m34s
3. Approaches
There are centralized and decentralized approaches to addressing
this. Both flavors, AFAIK, boil down to three strategies:
1) Bake into the protocol (e.g. HTTP 1.x) the capability to
passively detect changes (perhaps against a central db, or by
performing potentially expensive diffs upon any request suspected of
modifying a document) and then amalgamate such findings in a master
(possibly distributed) ledger (think blockchain). Under this
approach, the server wouldn't be responsible for maintaining/preserving
diffs, and thus also wouldn't be responsible for serving diffs; the
master ledger / blockchain would therefore have to address (or itself
store) the contents of each diff. In my opinion, Internet Service
Providers (or those we obligate to manage DNS) should be responsible
for such things (assuming no one volunteers), though that would
probably be a bad idea. This approach optimizes coverage and, if done
correctly, includes all sites from day 1 without change to any
individual site. It would conversely impact the speed of HTTP, which I
expect would be unacceptable to the majority, especially search
engines, who value speed as a superior economic variable to
versioning. One way around this is to have HTTP compute the hash of a
resource and then dispatch a 3rd-party service to do the diffing, as in
the sketch below. The problem with this is locking/blocking on the old
resource long enough to capture a diff before the resource is updated.
There are other challenges and complications, such as
under/overfitting. Some resources are produced dynamically, such as a
REST API whose job is to return random test data, and versioning
doesn't make sense for these endpoints. Having HTTP guess based on the
MIME type or other header variables just increases complexity, slows
speed, and still admits edge cases.
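
A minimal sketch of that hash-then-dispatch idea, in Python, assuming a
hypothetical third-party diff service at diff.example.org and an
in-memory snapshot store; none of the names or endpoints here exist in
any current protocol or library.

    # Sketch: a hook on the serving/proxy path that hashes each response body
    # and, when the hash differs from the last one seen, hands both versions
    # to an external diff service so the HTTP path itself never blocks on the
    # (potentially expensive) diff. Endpoint and store are hypothetical.
    import hashlib
    import json
    import urllib.request

    DIFF_SERVICE = "https://diff.example.org/submit"   # hypothetical 3rd-party service
    snapshots = {}  # url -> (sha256 hex digest, body bytes); in-memory for the sketch

    def on_response(url: str, body: bytes) -> None:
        digest = hashlib.sha256(body).hexdigest()
        previous = snapshots.get(url)
        if previous is not None and previous[0] != digest:
            payload = json.dumps({
                "url": url,
                "old_sha256": previous[0],
                "new_sha256": digest,
                "old_body": previous[1].decode("utf-8", "replace"),
                "new_body": body.decode("utf-8", "replace"),
            }).encode("utf-8")
            req = urllib.request.Request(
                DIFF_SERVICE, data=payload,
                headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req, timeout=5)  # fire off the diff job
        snapshots[url] = (digest, body)

Keeping the previous body alongside its hash is what sidesteps the
locking/blocking problem noted above, at the obvious cost of storing a
copy of every resource somewhere.
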
2) A website is personally accountable for (a) tracking changes and
maintaining a history (self-archiving diffs)*, or (b) reporting (e.g. via
exposing a pub/sub or RSS-style interface) to those things which care
to archive it. As we have seen from RSS, many sites (and frameworks
which create sites, e.g. WordPress) participate in such
movements. Given that precedent, this seems to be the slow road of
adoption the web will likely take. I imagine this is the path we will
ultimately take, as it doesn't make much practical sense to overload
protocols by making them more complicated than they need to be. A higher
level protocol, such as the WebDAV described by RFC 2291, or a comparable
opt-in policy, may make the most sense. It's worth noting that in section
4.7 of RFC 2291, changes to HTTP are proposed for DELETE and PUT.
*both accomplished well by git version control, assuming non-binary data; a sketch follows below
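
As a rough illustration of that footnote, the following sketch commits
each new snapshot of a page into a local git repository. The archive
directory, page URL, and fetch step are assumptions made for the
example, and the approach only behaves well for non-binary content.

    # Sketch: self-archiving one page's history with git. Assumes git is on
    # PATH and that the archive directory was already set up with `git init`.
    import subprocess
    import urllib.request
    from pathlib import Path

    ARCHIVE = Path("site-archive")           # hypothetical local repository
    PAGE_URL = "https://example.org/page"    # resource being self-archived

    def snapshot() -> None:
        body = urllib.request.urlopen(PAGE_URL, timeout=10).read()
        (ARCHIVE / "page.html").write_bytes(body)
        subprocess.run(["git", "-C", str(ARCHIVE), "add", "page.html"], check=True)
        # `git diff --cached --quiet` exits non-zero only when something is staged,
        # so commit only when the content actually changed.
        changed = subprocess.run(
            ["git", "-C", str(ARCHIVE), "diff", "--cached", "--quiet"]).returncode != 0
        if changed:
            subprocess.run(["git", "-C", str(ARCHIVE), "commit", "-m",
                            "snapshot of " + PAGE_URL], check=True)

Each commit then becomes a point on the page's timeline; running
git log -p page.html replays the diffs, and the same log could feed an
RSS or pub/sub notification.
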
3) Continue to rely on 3rd-party services, like the Internet Archive,
to make a "best effort" at being accountable for archiving everything
at arbitrary intervals (perhaps informed by a few heuristics, like
website size or estimated rate of change; a toy example of such a
heuristic follows below).
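
To make the heuristic idea concrete, here is a toy scheduler that picks
a re-crawl interval from the fraction of past crawls in which a page had
changed. The formula and numbers are purely illustrative and are not
anything the Internet Archive actually uses.

    # Toy heuristic: choose a re-crawl interval from the observed fraction of
    # past crawls in which the page had changed. Purely illustrative numbers.
    def crawl_interval_days(changes_seen: int, crawls_done: int,
                            min_days: float = 1.0, max_days: float = 90.0) -> float:
        if crawls_done == 0:
            return min_days                       # no history yet: look again soon
        change_rate = changes_seen / crawls_done  # 0.0 (static) .. 1.0 (volatile)
        # Volatile pages get short intervals; static pages drift toward max_days.
        return max(min_days, max_days * (1.0 - change_rate))

    print(crawl_interval_days(changes_seen=8, crawls_done=10))   # ~18 days
    print(crawl_interval_days(changes_seen=0, crawls_done=10))   # 90.0 days
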
4. TL;DR
Most web services do not have the ability to return a resource as it
was at a specific point in time (other than the present). This is
something which should be supported by / baked into the protocol layer
of the web (HTTP). Wikipedia's permanent links [d] are an existing,
application-level example of addressing a resource at a fixed point in
its history.
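
To illustrate what protocol-level support could look like from the
client side, the sketch below sends a request that names the desired
point in time. The Accept-Datetime and Memento-Datetime headers are
borrowed from the Memento framework (RFC 7089), which already specifies
this kind of time-based access but is not widely deployed; the server in
the example is an ordinary one and will simply ignore the header.

    # Sketch: a point-in-time request as a client might issue it if the
    # protocol layer supported versioning. The Accept-Datetime request header
    # and Memento-Datetime response header come from RFC 7089 (Memento); the
    # server below is ordinary and will just ignore them.
    import urllib.request

    req = urllib.request.Request(
        "https://example.org/article",
        headers={"Accept-Datetime": "Tue, 11 Sep 2001 08:30:00 GMT"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        # A version-aware server would return the representation closest to the
        # requested datetime and echo the chosen time back in Memento-Datetime.
        print(resp.headers.get("Memento-Datetime"))
        print(resp.read()[:200])

A server that kept its own history (as in approach 2) could answer such
a request directly, rather than deferring to a third-party archive.
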
5. Sources
[a] http://www.xmpp.org/extensions/xep-0060.html
[b] https://tools.ietf.org/html/rfc7231
[c] https://en.wikipedia.org/wiki/Frame_problem
[d] https://en.wikipedia.org/wiki/Help:Permanent_link