This Summer in Redox: A Complete Rewrite of the Kernel (redox-os.org)
175 points by snvzz on Nov 1, 2016 | hide | past | favorite | 45 comments


Is there a documentation reference for "scheme-centric capabilities"? Most people who use the term "capabilities" use it incorrectly, or their capabilities have the wrong properties which destroys some of their benefits, so I'm curious what redox is doing here.

Skimming through the code in the fork [1], it looks like they place restrictions on sending capabilities in a misguided attempt to "enhance security". But this compromises the property that actually makes capabilities so much better: unrestricted delegation leads to the principle of least authority via fine-grained permissions. The fact that capabilities can be passed arbitrarily leads you naturally to defining fine-grained authorizations so that eventual abuse can be minimized, detected, and revoked.

If you have other "security" mechanisms restricting delegation, then you can get sloppy and start aggregating permissions into larger groups, thus making inevitable vulnerabilities much worse than they otherwise would be.

[1] https://github.com/ticki/kernel/blob/master/kernel/capabilit...


> But this compromises the property that actually makes capabilities so much better: unrestricted delegation leads to the principle of least authority via fine-grained permissions. The fact that capabilities can be passed arbitrarily leads you naturally to defining fine-grained authorizations so that eventual abuse can be minimized, detected, and revoked.

Really? I would have assumed that letting anyone send capabilities to anyone without going through the kernel was a highway towards privilege escalation.

Do you have documentation on this topic? I'm pretty interested – in particular since I'm hoping to contribute to this fork.


It depends on what you mean by "capabilities".

My guess is that they consider capabilities to just be a fine-grained permission system. In which case they've lost the battle for a fundamentally improved security system before they've begun. In particular they can't solve the confused deputy problem. (What Linux calls capabilities aren't, for exactly this reason.)

The confused deputy problem happens when a process given a permission for one purpose can be tricked into using that permission for something unrelated. If you attach permissions to processes, then this will happen, because "I have permission to do this" is essentially a global variable.

What you should be thinking is that a capability is a handle, much like a filehandle or an object. You do things by making calls to that handle. If you don't have the handle, you can't make the call. If you have multiple very similar handles, the one you make the call on matters. That handle ties the attempt to take the action to whether you should have permission for it.

Now if you've got handles, you can do something very clever, which is create new handles out of old ones. So, for example, I can take my permissions to read the whole filesystem, and construct a handle that can read from only one directory. I can then pass this new fine grained permission to a process. If that is the only permission it was given, it can never read from anything except the directory. No privilege escalation is possible, ever. You want to be able to create arbitrary capabilities like this and pass them around. The result is that programs get launched not with, "You can do anything I can do" but with, "Here is everything that you will be ever able to do." And this can be verified without ever looking at the source code.
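
As a sketch of that attenuation pattern, here's a toy Rust version; the types `FsReadCap` and `DirReadCap` and their methods are invented for illustration, not anything from Redox or seL4:

```rust
// Invented example: deriving a narrow capability from a broad one.
struct FsReadCap; // authority to read the whole filesystem

impl FsReadCap {
    // Mint a new, narrower handle covering only one directory.
    fn restrict_to(&self, dir: &str) -> DirReadCap {
        DirReadCap { dir: dir.to_string() }
    }
}

struct DirReadCap {
    dir: String,
}

impl DirReadCap {
    // Reads succeed only for paths under this capability's directory.
    fn read(&self, path: &str) -> Result<String, &'static str> {
        if path.starts_with(&self.dir) {
            Ok(format!("contents of {}", path))
        } else {
            Err("no capability for this path")
        }
    }
}

fn main() {
    let root = FsReadCap;
    let scoped = root.restrict_to("/var/log/");
    assert!(scoped.read("/var/log/syslog").is_ok());
    assert!(scoped.read("/etc/passwd").is_err());
}
```

The key point is that `DirReadCap` is a first-class handle in its own right: a process launched holding only it can read that one directory and nothing else, and you can verify that without reading the process's source.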


> Really? I would have assumed that letting anyone send capabilities to anyone without going through the kernel was a highway towards privilege escalation.

"Unrestricted" in my post doesn't mean "you can send to anyone", it means "no restrictions on passing capabilities beyond that defined by the capability model".

The capability security model requires that you can only send capabilities via your other capabilities. So if you have a capability to another process, then you can send any of your other capabilities to that process. The Redox design appears to violate this.

The capability model is actually simple and surprisingly common. Take any memory-safe object-oriented language, and remove static mutable state and the file system and network APIs (all sources of "ambient authority"). An object reference is now a capability. Passing an object reference via method parameters is a capability send, and you can only pass around references you actually have, and only to objects to which you have references. That's capability security.
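
That analogy can be made concrete in plain Rust (all names here are invented): with no globals and no ambient filesystem or network access, a function's possible side effects are exactly the references it's handed.

```rust
// Invented example: an object reference as a capability.
struct Logger {
    lines: Vec<String>,
}

impl Logger {
    fn log(&mut self, msg: &str) {
        self.lines.push(msg.to_string());
    }
}

// `run_task` receives only a Logger reference, so appending log lines is
// the only side effect it can possibly have -- readable from its
// signature alone, without inspecting the body.
fn run_task(log: &mut Logger) {
    log.log("task ran");
}

fn main() {
    let mut logger = Logger { lines: Vec::new() };
    run_task(&mut logger);
    assert_eq!(logger.lines, vec!["task ran"]);
}
```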

You can look at any context and see precisely what side-effects can occur simply by analyzing the set of object references available in that context. It's bad design to define functions that take a huge number of parameters, but it's also bad security. Capability security aligns maintainability concerns with security concerns thus yielding POLA.

Edit: as for documentation, I'm not sure if you want OS documentation, or general capability documentation. The seL4 and EROS papers are good for capabilities in operating systems, the E and Joe-E programming languages are good references for capabilities in general.


I am actually pretty familiar with (some instances of) the capability model, in particular process algebra-style capabilities, which pretty much don't limit sending capabilities, but also don't need to deal with Unix-style `fork`, for instance. I was just wondering if you had specific problems in mind.

In this case, how would you modify the design? Getting rid of `Kind`, I assume?

If so, how would you enable/disable passing capabilities? By having a meta-capability "I can pass capability foo"? That's pretty much how I would do it in a process algebra, but that doesn't look very usable in practice.


> If so, how would you enable/disable passing capabilities? By having a meta-capability "I can pass capability foo"?

I would first question why you think you need to disable capability passing at all, which was the main thrust of my post [1]. What specific problem are you trying to solve by restricting capability passing?

Perhaps Unix-style `fork` has some undesirable semantics, but I don't see how restricting capability delegation would solve them. POSIX sandboxes built on capabilities have been done a few times, originally in KeyKOS and then EROS. Perhaps the most accessible introduction for those coming from a UNIX background is Capsicum [2].

[1] Yes, "kind" classifies capabilities into those that can be passed around without restriction and those that can't, so I'm questioning the utility of that distinction

[2] https://www.usenix.org/legacy/event/sec10/tech/full_papers/W...


> I would first question why you think you need to disable capability passing at all, which was the main thrust of my post [1]. What specific problem are you trying to solve by restricting capability passing?

Right now, my specific problems #1 (resp. #2) are understanding what you (resp. ticki) have in mind :)

But I can also think of problems such as: 1/ limiting privilege escalation; 2/ somehow transitively revoking privileges that have been obtained through a compromised process.

Not sure if `Kind` has any effect on either, but I'll spend a few days thinking about these issues. I'd be happy to chat with you if you're interested (I'm Yoric on irc.mozilla.org).


Revocation is a solved problem for pure capabilities [1,2]. Privilege escalation is typically not so important in capability systems because processes run with least privileges. There is a pattern whereby authority can be increased if you're granted a special type of capability -- the operation is called "rights amplification" in capability parlance, but these capabilities are closely held by the trusted base.
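
A minimal sketch of the classic "caretaker" revocation pattern (all types invented here): instead of handing out the capability itself, the grantor hands out a forwarder it can later switch off, which transitively revokes everyone downstream.

```rust
use std::cell::Cell;
use std::rc::Rc;

struct Resource;

impl Resource {
    fn use_it(&self) -> &'static str {
        "ok"
    }
}

// The caretaker forwards calls to the real resource until revoked.
struct Caretaker {
    target: Rc<Resource>,
    revoked: Rc<Cell<bool>>,
}

impl Caretaker {
    fn use_it(&self) -> Result<&'static str, &'static str> {
        if self.revoked.get() {
            Err("revoked")
        } else {
            Ok(self.target.use_it())
        }
    }
}

fn main() {
    let resource = Rc::new(Resource);
    let revoked = Rc::new(Cell::new(false));
    let forwarder = Caretaker { target: resource, revoked: revoked.clone() };
    assert_eq!(forwarder.use_it(), Ok("ok"));
    revoked.set(true); // grantor flips the switch...
    assert_eq!(forwarder.use_it(), Err("revoked")); // ...everyone downstream loses access
}
```

Because anyone the forwarder was delegated to (directly or transitively) goes through the same caretaker, one flag flip cuts off the whole delegation subtree.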

Finally, the reason restricting delegation doesn't generally work is that a process can just proxy access to the undelegable capability, which means you haven't actually attenuated authority in any meaningful way. At best, you've merely reduced the bandwidth, so overall performance suffers, and you've also sacrificed compositional reasoning.
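
A toy illustration of that proxying argument, with invented types: even if `SecretFileCap` were marked undelegable by the kernel, its holder can expose an object that forwards reads, so the recipient ends up with the same authority anyway.

```rust
trait ReadCap {
    fn read(&self) -> String;
}

// Imagine the kernel forbids passing this capability to other processes.
struct SecretFileCap;

impl ReadCap for SecretFileCap {
    fn read(&self) -> String {
        "secret".to_string()
    }
}

// The holder hands out a proxy instead; callers of the proxy get the
// same authority, just with an extra hop of latency.
struct Proxy<'a> {
    inner: &'a SecretFileCap,
}

impl<'a> ReadCap for Proxy<'a> {
    fn read(&self) -> String {
        self.inner.read()
    }
}

fn main() {
    let cap = SecretFileCap;
    let proxy = Proxy { inner: &cap };
    // The same data flows either way; only the bandwidth differs.
    assert_eq!(proxy.read(), cap.read());
}
```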

Anyway, that's why I initially asked for any design documents for Redox's capability system, so I encourage ticki and any others working on this to write a rough draft of the ideas. I know plenty of capability people who would happily provide feedback.

[1] https://www.cs.ucsb.edu/~chris/teaching/cs290/doc/eros-sosp9...

[2] https://sel4.systems/Info/Docs/seL4-manual-2.0.0.pdf


I applaud lessons taken from Liedtke/L4, a major step in the right direction. Now I can look at that project seriously, rather than consider it yet another pre-Liedtke mistake.

I'm not so sure, however, about concurrency in the kernel. The locks and added complexity certainly aren't going to help keep WCET low and bounded.

NICTA's SSRG did study this: https://ts.data61.csiro.au/publications/nictaabstracts/Peter...


I think the paper is a bit unclear about how one can come to that conclusion, since concerning the Redis benchmark they state: "The figure shows that throughput is independent of the kernel variant. Further investigation reveals that overall system throughput is limited by the network bandwidth, and that all cores have significant idle time."

If you are network-limited, you might not _really_ be testing in-kernel locking in the general case. There might be something about the kernel that allows them to generalise even with low load, but I can't see anything obvious that would make a certain kernel immune to high-load-induced issues. Then again, I'm largely ignorant about L4 et al.


> I can't see anything obvious that would make a certain kernel immune to high load induced issues, but then I'm largely ignorant about L4 et al.

The typical system call in a modern microkernel costs on the order of hundreds of cycles. Fine-grained locking would by itself probably double this cost for almost no gain. Just waiting the hundreds of cycles is likely to be just as good as a fine-grained lock in such kernels, and the complexity savings are significant.


I was mildly in shock when I looked into L4 message-passing times [1]: 250-300 cycles on x86_64 processors, and ~150-300 on ARM.

As somebody interested in concurrency, I wonder what is being done internally to pull off this speed while maintaining cross-thread consistency in an SMT-type system? As far as I know, even an x86_64 memory fence will require upwards of 2,000 cycles.

I understand you create a notification object to pass a message, but don't you have to collect these notification objects at some point?

Edit 1:

I'm reading the manual [2], but I'm just becoming more confused. If you are passing register files, swapping stacks normally takes ~3ns even with small stacks. Or is the 300 cycles just the send time?

[1] https://ts.data61.csiro.au/publications/nictaabstracts/8988....

[2] https://sel4.systems/Info/Docs/seL4-manual-2.0.0.pdf


Notification objects are only for specific receive patterns IIRC. Simple message sends occur via endpoints, which I believe in seL4 you can think of sort of like a file descriptor, ie. each thread has a kernel-controlled address space of endpoints it can index to send messages.

The cycle count is for a one-way IPC. This roughly consists of entering kernel mode (~50-100 cycles), switch to kernel stack, finding the endpoint specified by system call (roughly: thread_control_block.endpoints[registerX], let's say ~10 cycles), switch to thread's address space (~100 cycles, maybe more for TLB flushes), switch to new thread's stack, return to user mode (~50-100 cycles). This is the fast path in which the whole message is passed in registers. The only memory that's touched are the original and target thread control blocks, so it's quite fast at ~300 cycles for x86_64.
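
Just to tally those rough per-step estimates (these are the numbers above, taking the midpoint of each range; they're illustrations, not measurements of any real kernel):

```rust
// Back-of-envelope tally of the fast-path steps described above.
fn fast_path_estimate() -> u32 {
    let steps = [
        ("enter kernel mode", 75u32),  // ~50-100 cycles
        ("endpoint lookup", 10),       // thread_control_block.endpoints[registerX]
        ("address-space switch", 100), // more if TLB flushes are needed
        ("return to user mode", 75),   // ~50-100 cycles
    ];
    steps.iter().map(|&(_, cycles)| cycles).sum()
}

fn main() {
    // Consistent with the "~300 cycles on x86_64" ballpark above.
    println!("estimated fast-path cost: ~{} cycles", fast_path_estimate());
}
```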

These are all rough estimates of course, so don't quote me. Longer message sends require copying buffers which may also incur page mapping overheads and page faults. A page fault would abort the message send, notify the disk process to load the missing pages, and the send is retried once the page is mapped back into memory.

Edit: as noted in the L4 review paper you linked to, the majority of message sends are very small, and most simply pass all data in registers, so the fast path is the average cost.


Out of curiosity, do you have a reference to "lessons taken from Liedtke/L4"? Out of the blue, I have no clue what it means :)


I'm thinking mostly about the principle of minimality (what can be out of the kernel should be). But there's some more contributions from L4.

The ACM did a talk on those insights 20 years later: https://www.youtube.com/watch?v=RdoaFc5-1Rk

The associated NICTA SSRG paper: https://ts.data61.csiro.au/publications/nictaabstracts/Elphi...

A more recent NICTA SSRG paper: https://ts.data61.csiro.au/publications/nictaabstracts/Heise...


Yeah, I wish they had teamed up with Robigalia in order to build on seL4 or at least translated that 1:1 to Rust, foregoing the verified nature of seL4.


I wrote on their forum about why they "shouldn't" use seL4, and definitely shouldn't port it to Rust:

https://discourse.redox-os.org/t/why-not-port-sel4-microkern...

If the goal is trustworthy, reliable software, then you should use seL4. If the goal is to yell about how great Rust is, you could port seL4 to Rust.


Fully support your point of view.

What we always see with programming languages is that when shortcuts are taken to reach our goals faster, they usually give ammunition to remarks of the sort "language X cannot do Y".

Apparently most lack vision and need to be proven wrong.

So I hope they manage to just use Rust and Assembly.

EDIT: Improved the sentence


> foregoing the verified nature of seL4

If anything, seL4 is the fastest L4 out there, proving that they can be fast and proven at the same time.

So, why would anyone want to do that?


Because Redox's aim is to be as little non-Rust as possible. I agree it'd be ill-advised to rewrite seL4 rather than use it: even though it's been a while since a revision was verified, it's been battle-tested, is continually improved, and is small enough to trust.


We really need to have a mainstream desktop microkernel operating system because:

1. no need anymore for kernel bypass mechanisms (too many to name for linux, btw)

2. decoupling of drivers from the kernel

3. instead of waiting for a backport of a new syscall to an LTS kernel, something like that will be just part of your system's stdlib

4. extremely reduced need to reboot a machine

5. driver faults can be recovered from gracefully, just like in Erlang's supervisor scheme

6. seL4's capability system allows for a safer system overall, with less hacks

Seeing how even Intel is introducing a message-passing network as core counts grow in their mainstream CPU designs, we should be past the general objections to a message-passing design. They call it QMD, and it's something they have had in their Xeon Phi line for some time for obvious reasons, just as countless other general-purpose or special-purpose chip designs have had for a long time.

I wonder if one could get funding for an effort like Robigalia to reach production quality and have Servo running on it. Given the feature set of Servo, it's a safe bet that the required features would provide a reasonably complete system to replace your Linux machine.

The bigger struggle would be getting hardware vendors to contribute to FreeBSD and Linux drivers to also provide some level of support for a non-C code base. Though, moving up to that abstraction and language level should make it easier to write working hardware drivers.


> I wonder if one could get funding for an effort like Robigalia to reach production quality and have Servo running on it. Given the feature set of Servo, it's a safe bet that the required features would provide a reasonably complete system to replace your Linux machine.

If I got funding, Robigalia would definitely progress at least 10x faster than it does currently :) If anyone is interested in that, email me: corey@octayn.net...

As an aside, Robigalia has been picking up steam. The first developer preview should be out by Robigalia 2017 (April 25). Design documents and prototypes should be out by December 26 (the project's anniversary).


Oh, and if you extend it with multi-kernel functionality, scaling to a very large number of cores/chips will also be solved in a fundamentally cleaner way. I haven't looked into it, but there's an seL4 branch that implements multi-kernel support. If you don't know what this is, it's simply a way to have one kernel instance per processor, making scalability easier. Fujitsu has a Linux variant that achieves the same, and I think it's a goal of Barrelfish as well.

Similar solutions exist in certain Linux and DragonFly BSD subsystems where, say, the network stack uses one scheduler or queue per core, removing the need for synchronization in a coarse way. Then, of course, you have to be smarter about scheduling things, but if you bind certain hardware bits to cores long-term, it should avoid accidental overhead from random migration. Language runtime VMs do the same for green threads, with work-stealing designs and schedulers pinned to cores. None of this is bleeding-edge research material; we just have to use what's been invented.


I'm not sure that you need a kernel per core. However, it would be nice for the microkernel to provide reliable per-CPU data structures to userspace, for example via per-CPU replicated receive capabilities. That way, a capability send from a client program would get routed to the same CPU's capability in the server. That would provide some very cache-friendly, superfast machine-local IPC.
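
A userspace-level sketch of that routing idea, with invented types; a real kernel would hold per-core endpoint replicas in its own structures rather than a `Vec`, but the routing rule is the same.

```rust
// Invented example: a receive endpoint replicated once per CPU.
struct Endpoint {
    queue: Vec<String>,
}

struct ReplicatedEndpoint {
    replicas: Vec<Endpoint>,
}

impl ReplicatedEndpoint {
    fn new(ncpus: usize) -> Self {
        ReplicatedEndpoint {
            replicas: (0..ncpus).map(|_| Endpoint { queue: Vec::new() }).collect(),
        }
    }

    // Route the message to the replica on the sender's CPU, so the
    // message never leaves that core's cache.
    fn send(&mut self, sender_cpu: usize, msg: &str) {
        self.replicas[sender_cpu].queue.push(msg.to_string());
    }
}

fn main() {
    let mut ep = ReplicatedEndpoint::new(4);
    ep.send(2, "hello");
    assert_eq!(ep.replicas[2].queue, vec!["hello"]);
    assert!(ep.replicas[0].queue.is_empty());
}
```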


Completely eliminating shared-memory multiprocessing in userspace seems like a non-starter to me. Message passing is great, but there are plenty of cases where it's slower by many orders of magnitude. A good example is sparse matrix solvers; there's no way to get any kind of acceptable performance if you can't share the underlying data.


Google is developing an open source microkernel based OS code named Fuchsia. It's for x86, ARM and ARM64.

https://en.wikipedia.org/wiki/Google_Fuchsia

https://fuchsia.googlesource.com/magenta/+/master/README.md


From the top of my head, a few differences:

* Fuchsia is C (for the actual microkernel) and C++ (for userspace) / Redox is Rust.

* Fuchsia's main current target is smartphones & laptops / Redox's main current target is Rust.

* Fuchsia is very much funded / Redox is pretty grassroots.

* Fuchsia supports any language for app development but favors Dart / Redox supports any language for app development but favors Rust.

Keep in mind that both projects are pretty young and that my points above involve some dose of mind-reading, so caveat lector.


Genode OS framework lets you use Linux drivers with L4 kernels, including seL4.


The master branch of seL4 always passes verification, for the verified platform: ARMv7-A on a Sabre Lite.

The x86_64 verification is coming out soon, according to the roadmap.


> In order to continue the development in a reasonable speed, we stop conforming to the specification of ZFS. As such, we can no longer call it a ZFS implementation.

They have their valid reasons, but it's a shame, because an implementation of the very complex ZFS filesystem in Rust would have been a great thing to have. Not to mention the ability to mount redox zfs images with illumos or FreeBSD or ZOL.


TFS will be mountable on any OS with FUSE - Linux, BSD, illumos, and OS X


Well, there's the problem of testing the implementation on every OS.


I'm also disheartened by the description of TFS's layering. It seems like they go with block level encryption instead of something with authentication.


The kernel, from this limited bit of info, is almost an exokernel, but it's missing the most important bits: an async-ish syscall API and proper capabilities. Batching is a nice mitigation but not ideal; an ideal way to mitigate context switches would be a data-driven API that lets the kernel decide (based on info the kernel has access to). With both a data-driven API and an exokernel in a proper language, you could apply optimizations across the entire operating system.


Will the TFS also support native encryption like Apple's APFS?


I just talked to Ticki and he said he would support that.

As a personal note: is it really the right place to support encryption? I seem to remember that APFS's encryption had a pretty bad reputation in terms of actual security.


> I seem to remember that APFS's encryption had a pretty bad reputation in terms of actual security.

Based on what, exactly? APFS actually has really good security. You can encrypt on a per-file or per-directory basis. Each per-file key can be encrypted at different levels (with the shared key or with per-user keys, so permission bypasses are useless).
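
A toy sketch of that key-wrapping idea (using XOR as a stand-in for real encryption; all names invented): the per-file key is wrapped separately under each authorized user's key, so holding a different user's key recovers nothing useful.

```rust
// Toy XOR "cipher" standing in for real key wrapping -- illustration only.
fn xor(data: &[u8], key: &[u8]) -> Vec<u8> {
    data.iter().zip(key.iter().cycle()).map(|(d, k)| d ^ k).collect()
}

fn main() {
    let file_key = b"per-file-key";
    let user_a_key = b"user-a-secret";
    // Wrap the file key under user A's key only.
    let wrapped_for_a = xor(file_key, user_a_key);
    // User A can unwrap it and recover the file key...
    assert_eq!(xor(&wrapped_for_a, user_a_key), file_key.to_vec());
    // ...while user B's key yields garbage, not the file key.
    let user_b_key = b"user-b-secret";
    assert_ne!(xor(&wrapped_for_a, user_b_key), file_key.to_vec());
}
```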


As in I'm pretty sure that in a previous job, I received semi-official government-issued guidelines: « Don't use it, we know how to crack it, and we're pretty sure that we're not the only ones. If you value your data, use TrueCrypt instead. »

It was pretty thin on details, and the stuff may have been fixed in the meantime, but here you go.


Maybe there's a misunderstanding here, but how could you have gotten advice about APFS in a "previous job", when APFS was only announced a few months ago (and doesn't even have a stable release yet)?


Are you sure you don't mean CoreStorage's encryption? APFS isn't even finished yet.


Ah, I'm probably confusing, then. Sorry for the noise.


That's actually a really bad form of encryption, as per-file encryption leaks information about individual files.


All the alternatives I've seen lose authentication, so it's at least not so clear cut.


Are you maybe thinking of something else? APFS is officially "experimental" right now. Allegedly it'll be released in 2017. https://en.wikipedia.org/wiki/Apple_File_System


The server would be interesting if it implemented libvirt virtual network and virtual hard disk drivers so it could run on cloud hosts.

Does it?



