The interesting part: it's cheaper than AWS Glacier ($4 per TB per month) and slightly more expensive than AWS Glacier Deep Archive ($0.99 per TB per month), but the data is available immediately, not in hours like Glacier, where you have to pay a hefty premium for faster access to the data.
Interesting: unlike Glacier, this is significantly cheaper than Backblaze B2, meaning I might have to reconsider how I do my backups again. Any good backup tools supporting this type of service?
I rely on Restic at the moment, which seems to need fast read access to data, but its incremental snapshotting is great. It'd be ideal if I could find something like that which supports these "cold storage" solutions.
One thing I do consider a real value add for AWS Glacier, though, is their native support for offline media import/export. I.e., you can just send them a hard drive of your own for data loading, and pay to get a hard drive back out as well. As gigabit (or faster) class WAN slowly spreads this will someday become unnecessary, but right now, in many, many places, a company could easily have terabytes to back up but 10/1 ADSL as its best available connection. Even with faster connections, aggressive data caps are sadly not infrequent. Whether it's for initial load, ongoing use, or faster recovery, sometimes there is still nothing like a multi-TB drive or two in the mail.
There are 3rd parties that will do it for you (Iron Mountain is at least one), but that's an extra cost and Google takes no responsibility for it. I assume this is an example of a place where Amazon is able to leverage its holistic business, with a cloud service that can also take advantage of its physical logistics system. Google's service here is significantly cheaper and has some nice features, but even if it's not worth a $4 vs. $1.23 premium for Amazon, I could definitely see continuing to pay Amazon some premium (say $2 vs. $1.23) for that alone anywhere with limited high-speed WAN availability.
We also have a Transfer Appliance [1], that comes in two sizes (100T and just under 500T). We don’t currently support shipping one filled up with your data for recovery/export though.
Backblaze also offers that option. You can mail them up to 8 TB on an external HDD and have it loaded into their system for $190, or up to 256 GB on a USB stick for $100. [1]
You can also request a "B2 Fireball" [2] from them. It's basically a small array that they mail to you for $550 with 70 TB of storage. You fill it up and send it back to them within the month, and they'll load the data into your account.
For comparison, Amazon supports up to 16 TB in their basic service, with an $80 flat handling fee per storage device and then $2.50 per data-loading hour. Since they support 2.5"/3.5" SATA and external eSATA & USB 2.0/3.0 interfaces, and it's a pure sequential transfer, it's not much trouble to get at least close to maximum sequential throughput, which even for decent spinning rust should allow a good half TB an hour at least. I've never tried an SSD, so I'm not sure if they can saturate 6 Gbps, but as even a 32-hour transfer of 16 TB would only be another +$80, it may not be generally relevant anyway.
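That cost arithmetic can be sketched quickly. The $80 flat fee and $2.50/hour are the figures quoted above; the 0.5 TB/hour throughput is my assumption for decent spinning rust:

```python
def import_cost(tb, tb_per_hour=0.5, flat_fee=80.0, hourly_fee=2.50):
    """Estimated cost to load `tb` terabytes from one mailed-in device."""
    hours = tb / tb_per_hour
    return flat_fee + hours * hourly_fee

# 16 TB at half a TB/hour is 32 loading hours: $80 flat + $80 in hours
print(import_cost(16))   # 160.0
```

Even doubling the loading time only adds another $80, so throughput barely matters to the bill.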
Amazon's equivalent to the B2 Fireball is "AWS Snowball" (amusingly enough; not sure if there is a bit of fun name riffing between the two here), which has a service fee of $200 per 50 TB device or $250 per 80 TB device, with any onsite days after the first 10 at $15/day.
It's interesting how the pricing mix works on this feature, though. Amazon offers lower potential ingress pricing depending on your use, though notably, if you kept the Snowball a whole month, the pricing would get very close to the Fireball's (+20 days @ $15/day brings the prices to $500/$550 respectively, though the former with 20 TB less capacity and the latter with 10 TB more).
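As a sketch of that comparison (all numbers are the list prices quoted in this thread; the flat $550 Fireball fee is assumed to cover a full month):

```python
def snowball_cost(service_fee, onsite_days, free_days=10, per_day=15):
    """Total Snowball cost: service fee plus per-day charge past the free days."""
    return service_fee + max(0, onsite_days - free_days) * per_day

fireball_cost = 550  # flat for the ~70 TB B2 Fireball, returned within the month

print(snowball_cost(200, 30))  # 50 TB Snowball kept 30 days: 200 + 20*15 = 500
print(snowball_cost(250, 30))  # 80 TB Snowball kept 30 days: 250 + 20*15 = 550
```

Returned inside the free window, the Snowball stays at its base fee, which is where Amazon's ingress pricing can undercut the Fireball.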
Backblaze and Google are both much cheaper to get data out of, though; Amazon's Glacier and descendant services remain very much deep-freeze focused.
It looks like a lower tier than the existing Coldline and Nearline (7x cheaper for storage than the former). Both have a minimum storage period, so this one is likely to have one as well. Coldline and Nearline are more expensive than regular storage when fetching objects, which means ice-cold storage is probably even more expensive when you restore (is it going to be 7x too, keeping the symmetry?).
The concept, idea, and flexibility of Arq is great, ideal even. The amount of control is nice. I wish it were open source.
The actual product is pretty painful when you need to do a recovery, especially if you don't know where the file lived on disk. I haven't tried the newer Arq Cloud Backup destination to see if it improves the search experience.
That said, my experience is from more than a year ago, and I would try it again if they were able to bring their search on par with current consumer backup offerings.
The place where this won't be as cheap as Backblaze is retrieval. Unless Google makes a big change, you'll still have to pay for network egress, which is obscenely priced: https://cloud.google.com/storage/pricing#network-egress
Borg Backup is mostly the same as Restic (regarding dedup / incremental backup) [1] and aggregates data into large chunks.
If you only back up from a single machine, it keeps a local cache of already-backed-up data. This has the large advantage that it basically only needs to push the delta to the remote, without doing any kind of synchronization to check what is already there.
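A minimal sketch of that local-cache idea (fixed-size chunks and plain SHA-256 are simplifications here; real Borg uses content-defined chunking and keyed chunk IDs):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024
seen = set()  # local cache of chunk ids believed to already exist on the remote

def backup(data, upload):
    """Upload only chunks not in the local cache; return bytes actually sent."""
    sent = 0
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        cid = hashlib.sha256(chunk).hexdigest()
        if cid not in seen:
            upload(cid, chunk)   # push the chunk to the remote, keyed by its hash
            seen.add(cid)
            sent += len(chunk)
    return sent
```

A second run over unchanged data sends nothing at all: every chunk id is already in the local cache, so no round-trip to the remote is needed even to check.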
"Borg Backup is mostly the same as Restic (regarding dedup / incremental backup)"
... with one very big difference: you can only point Borg at an SSH host. You can't point Borg at S3 or B2 or Glacier, etc.
rsync.net supports both borg and restic, but even the heavily discounted plans[1] are much more expensive than "Cold Storage" or Glacier, because they are live, random access UNIX filesystems ...
Shameless plug: I built a backup service [1] just for Borg, and the price on the large plan is $5/TB. Not as cheap as "cold storage", but still better than rsync.net and the same as B2.
Also worth pointing out that my storage is calculated after compression and deduplication, so depending on the data, a Borg backup can be much smaller than the actual data.
True, which is kind of weird, because as far as I understand their respective "databases", Borg would be more suited for arbitrary remote storage: it should basically only need an "upload file" command without any interactivity, except for its robustness checks and some additional flexibility (having multiple backup sources, deleting data that is no longer needed).
Restic seems to have been built from the ground up to use the power of an existing filesystem as its database, so it needs remote storage that offers quick interactivity (especially checking for existing files), which makes it impossible to use something like Glacier as a backend.
It's not a problem for me since I just back up to a local drive and (am planning to set up) synchronization to a remote dumb storage.
Does anybody know what the retrieval fees will likely look like? I've been wary of most of the "cloud archival" solutions because while they're cheap to put data into, they seem to charge you a billion dollars to actually retrieve it.
FWIW, this is still an ideal model for backup storage: If your more regular backup model is robust and your network is well-secured, you'll never need retrieval. And if you need it, you need it, and it's justifiable to spend big to save your business.
I'd be confident with periodically testing just little random parts.
For me this is a "last resort backup": it costs little to keep around, and god forbid we ever need it. BUT that means we need to account for the case where we do need it! And if it's going to cost too much, then there's no point in the backup anyway.
I would generally agree. First of all, you're going to test a lot of your restore processes with backups which are closer to home: You should make sure your VMs can all restore from your onsite (or just less icy) backups, for instance. As long as you're confident in that, the only thing you need to test with "ice cold" storage is that you can successfully restore a single VM from it, since you know all of your VMs can be restored.
Same here. As a company you can go "we need this to save our asses, I don't care if it costs $50k in a 4 person company", but personally I kind of do care about the cost for retrieval...
I've been comparing cloud storage prices to hard drive prices for years now. My first thought when seeing the storage prices was "huh, that might actually be worth it", but depending on the retrieval costs, you might still want to roll your own no matter the storage costs. For private use, I am (was?) planning a variant of this as soon as I am finished doing a server migration: https://old.reddit.com/r/DataHoarder/comments/7rjcdn/home_ma...
It's your backup, not your primary system. The odds that more than one drive fails within the same, say, week are probably perfectly acceptable for most people.
It is universal practice within cloud service providers to span redundancies across "fault domains" - basically things which could make failures correlated, like being in the same machine/power strip/datacenter/geo region. If you assume your fault domain analysis is good, then your failures should be independent. Many global outages are the result of a previously-unidentified fault domain, like the Azure certificate issues. Of course past a certain point it becomes unimportant - who cares about your data if an asteroid takes out every datacenter on Earth at once?
Nothing in engineering is 100%. As far as engineering goes 11 9s is pretty much as good as you can get. For comparison, AWS S3 and Glacier are 11 9s durability too.
It's worth bearing in mind the difference between durability and availability. Durability is roughly the chance of losing your data over a given time span (in this case a year), whereas availability is about how reliably you can access the data (and is almost certainly a lot lower than 11 9s). A service can be very durable but have very poor availability.
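Toy numbers make the distinction concrete. Suppose (both rates invented purely for illustration) each of three independent replicas loses data with probability 1e-4 per year, but all reads go through a single frontend that is down 0.1% of the time:

```python
p_replica_loss = 1e-4    # assumed annual loss probability per replica
p_frontend_down = 1e-3   # assumed fraction of time the lone frontend is down

# Data is gone only if all three independent replicas die:
durability = 1 - p_replica_loss ** 3

# But a single frontend caps availability no matter how many replicas exist:
availability = 1 - p_frontend_down

print(durability)    # ~twelve nines of durability from replication alone
print(availability)  # only three nines of availability
```

Replication multiplies the nines of durability, while availability stays pinned at whatever the weakest serving path offers.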
"What is the meaning of the claim about "99.999999999% annual durability"?"
It has no meaning whatsoever. Someone on the marketing side of the team decided that was a "competitive" number to present, outwards, and someone in engineering was tasked with, working backward from that number, coming up with some plausible calculation that resulted in it.
In the real world, they, like Azure and Amazon, will have single point in time outages that will wipe that out for a year or more.
Here is what an honest assessment looks like:[1]
"Historically (updated April, 2019) we have maintained 99.95% or better availability. It is typical for our storage arrays to have 100+ day uptimes and entire years have passed without interruption to particular offsite storage arrays."
...
"In the event of a conflict between data integrity and uptime, rsync.net will ALWAYS choose data integrity."
> In the real world, they, like Azure and Amazon, will have single point in time outages that will wipe that out for a year or more.
An outage affects availability, but as long as it's not permanent it doesn't affect durability. For example, if I add a new backup provider that stores data on-premise I've added a (nearly) independent data store. This substantially decreases my risk of losing my data unrecoverably (increases durability) but if I don't set up any sort of automatic failover I'm still at risk for substantial outages (no practical increase in availability).
> Someone on the marketing side of the team decided that was a "competitive" number to present, outwards, and someone in engineering was tasked with, working backward from that number, coming up with some plausible calculation that resulted in it.
I would be incredibly surprised if that happened. That's not the way I've seen anyone work here.
(Disclosure: I work at Google, though not in Cloud)
You are mixing availability (access at any given moment) with durability (not losing data). From the FAQ:
Cloud Storage is designed for 99.999999999% (11 9's) annual durability, which is appropriate for even primary storage and business-critical applications. This high durability level is achieved through erasure coding that stores data pieces redundantly across multiple devices located in multiple availability zones.
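For intuition, the simplest instance of erasure coding is single XOR parity (RAID-5 style). Real systems like GCS use Reed-Solomon codes across many devices and zones, but the rebuild-from-survivors idea is the same. A toy sketch:

```python
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(pieces):
    """Append one XOR parity piece to a list of equal-length data pieces."""
    return pieces + [reduce(xor, pieces)]

def recover(stored):
    """Rebuild the single missing piece (marked None) from the survivors."""
    missing = stored.index(None)
    stored[missing] = reduce(xor, [p for p in stored if p is not None])
    return stored[:-1]  # drop the parity piece, return the data pieces
```

Losing any one of the stored pieces, including the parity itself, is survivable; tolerating more simultaneous losses takes more parity pieces, which is where Reed-Solomon comes in.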
Disclaimer: I work at GCP, although not in GCS specifically.
(In a nutshell, pack some hot data with a lot of cold data on many large drives, then put a Flash-based cache in front of them to get long tail performance predictability back.)
There was also a talk about the low level storage service and the performance isolation work that allows it to mix batch and latency-sensitive traffic on the same drive, but it doesn't seem to have been recorded: http://www.pdl.cmu.edu/SDI/2012/101112.html
Gory details are in the patents, 9781054, 9262093 and 8612990, which I'm not linking directly, because your lawyers might not approve. There's even a follow up, 10257111. It's so new, from two days ago, that Google Patents can't find it, while Justia can.
You have to keep the data in that class for a certain period of time; that's the drawback. You can access the data at that price as long as you commit to keeping it there for a long time.
So I suspect this is not fully cold storage; that's why they can retrieve the data faster. It seems more like an economics hack (a longer commitment to keep the data allows them to buy and operate the storage hardware/software at a cost that can be amortized against those commitments).
It's a pretty good price, but assuming you are storing 8 TB and you get your own drive, the drive would pay for itself in about 14 months... so you would basically get the next 4 years for free if you are willing to manage it...
Will that storage have "11 9's annual durability" and be stored in multiple locations?
Let's say that you only need to write to it once and have two secure locations available for free; that would still mean you need two drives, which would pay for themselves in 28 months.
Sure it's "cheaper" but it's far from being as good and the price difference isn't that big.
Google offers some interesting services, but their API is always so awfully complicated and cumbersome that I've given up entirely trying to use anything.
While one major use of something like this would be backups, how does one handle these backup sets with respect to GDPR requests? The window to respond is 30 days, so keeping backups longer than, say, 25 days seems cumbersome. You would need hot access to the sets to load them up and delete the data.
You don't keep a single copy of each key, but store enough redundant copies to get the proper number of nines. Preferably that's redundant geographically, in terms of storage technology, and in write frequency.
The important part is just that the keys don't end up in long term cold storage. Either it's only retained for a short period (e.g. tape backups that get rotated after two weeks), or it supports live deletion.
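A sketch of that second option, often called crypto-shredding: per-user keys live in hot, deletable storage while the encrypted backups sit untouched in cold storage, so erasing the key makes the archived ciphertext unrecoverable without ever rewriting the archive. (XOR with a random one-time key stands in for real authenticated encryption here, purely for illustration.)

```python
import secrets

key_store = {}  # hot, mutable store of per-user keys (never archived)

def archive(user_id, data):
    """Encrypt a user's data; the ciphertext can go to cold storage."""
    key = secrets.token_bytes(len(data))
    key_store[user_id] = key
    return bytes(d ^ k for d, k in zip(data, key))

def restore(user_id, ciphertext):
    """Decrypt; raises KeyError if the user's key has been shredded."""
    key = key_store[user_id]
    return bytes(c ^ k for c, k in zip(ciphertext, key))

def gdpr_erase(user_id):
    """Handle an erasure request without touching the cold archive."""
    del key_store[user_id]
```

The 30-day clock then only has to cover deleting a small key, not thawing and rewriting terabytes of archived sets.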
I was also looking for that, the only piece of info about that was:
Unlike tape and other glacially slow equivalents, we have taken an approach that eliminates the need for a separate retrieval process and provides immediate, low-latency access to your content. Access and management are performed via the same consistent set of APIs used by our other storage classes, with full integration into object lifecycle management so that you can tier cold objects down to optimize your total cost of ownership.
It seems to have the same pricing as the other storage classes: no fees for accessing the files within the same region, and the typical bandwidth fees if the backups are downloaded to somewhere else.
Google has burned people so many times by shuttering products with little to no warning that I'd be hesitant to trust them with my long-term data storage.
Eh... for consumer stuff, sure, and perhaps even for new/experimental GCP features. But this is storage, a core function, on GCP, an enterprise service with actual contracts and SLAs attached.
Just don't think of it as something you'll ever want to restore unless the building burns down and you've lost everything.
Glacier's restore costs had a lot of fees in my one experience. We could have bought several RAID units for the price of a fast restore. If you asked for it back over a long period of time, the price dropped dramatically.