%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Copyright (c) 2011, ETH Zurich.
% This file is distributed under the terms in the attached LICENSE file.
% If you do not find this file, copies can be found by writing to:
% ETH Zurich D-INFK, Universitaetstr. 6, CH-8092 Zurich. Attn: Systems Group.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[a4paper,twoside]{report} % for a report (default)

\usepackage{bftn} % You need this
\usepackage[pdftex]{graphicx}
\usepackage{color,listings,ctable}

\title{Capability Management in Barrelfish}   % title of report
\author{Akhilesh Singhania, Ihor Kuz}         % author
\tnnumber{013}                                % give the number of the tech report
\tnkey{Capability Management}                 % Short title, will appear in footer

% \date{Month Year} % Not needed - will be taken from version history

\newcommand{\note}[1]{[\textcolor{red}{\textit{#1}}]}
\lstdefinelanguage{Mackerel}{
  morekeywords={datatype,device,register,regtype,constants,type,at,
  morecomment=[s]{/*}{*/},
  showstringspaces=false,

\newcommand{\noarginvocation}[1]{\paragraph{#1 invocation}}
\newcounter{invocArgCnt}
\newenvironment{invocation}[1]{%
  \begin{list}{\emph{Argument~\arabic{invocArgCnt}:}}{%
    \usecounter{invocArgCnt}%
    \setlength{\rightmargin}{\leftmargin}%
    \setlength{\itemsep}{0ex}%
  \renewcommand{\arg}{\item}
% Include version history first
\begin{versionhistory}
\vhEntry{1.0}{08.03.2011}{AS}{Initial version}

% \intro{Abstract}              % Insert abstract here
% \intro{Acknowledgements}      % Uncomment (if needed) for acknowledgements
% \tableofcontents              % Uncomment (if needed) for final draft
% \listoffigures                % Uncomment (if needed) for final draft
% \listoftables                 % Uncomment (if needed) for final draft
\chapter{Introduction}

This document discusses the state of capability management in the
Barrelfish operating system.

Chapter \ref{chap:known_issues} lists the currently known issues with
capability management and chapter \ref{chap:type_system} discusses the
type system.

Chapter \ref{chap:current_state} discusses the current state of the
implementation in Barrelfish, chapter \ref{chap:db} discusses
different approaches for maintaining a multicore mapping database of
capabilities, chapter \ref{chap:solutions} discusses the requirements
for a correct solution and presents four different solutions,
chapter \ref{chap:implementation} discusses some Barrelfish-specific
challenges in implementing the solutions, and chapter \ref{chap:nyd}
highlights issues not yet covered in this document.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Known Issues}\label{chap:known_issues}
\begin{itemize}
\item Kernel operations should not take longer than $O(1)$. Some
  capability operations are not constant time but can take $O(n)$,
  where $n$ is the size of the mapping database. Spending so much time
  in the kernel is detrimental to good scheduling because interrupts
  are disabled while in the kernel. All non-constant-time operations
  should be punted to the monitor.

\item When the last copy of an LMP capability, or of a frame
  capability backing a UMP channel, is deleted, should it initiate a
  connection tear-down?

\item We have no mechanism for tracking frames mapped into VNodes, so
  that if either is deleted, the corresponding entries can be properly
  cleaned up.

\item Multicore capability management. We have an implementation on
  the tip. It is not enabled and has some shortcomings. Not all
  shortcomings are known; however, some missing details are
  known. As discussed in chapter~\ref{chap:solutions}, all remote
  relations including descendants, copies, and ancestors must be
  tracked. The implementation only tracks ancestors.

\item Fragmentation of memory. Example: a RAM capability of 4GB is
  retyped into two capabilities of 2GB each. One of the 2GB
  capabilities is deleted. The only way to get a capability to that
  2GB back is to also delete the other 2GB capability and then retype
  the 4GB capability again.

\item The mapping database is maintained as a doubly linked
  list. Implementing Boolean operations like ``has\_descendants'' and
  ``has\_copies'' requires walking the database and testing type
  specific data structures. We feel that these are not efficient or
  robust. If we move the mapping database into the monitor, we think
  that this should at worst become a performance issue and not a
  correctness one.

\item IO space can be treated in the same way as physical memory. So
  instead of using mint operations to update ranges, we use retype
  operations. This is useful as different IO capabilities will have
  the ancestor, descendant, and copy relationships rather than just a
  copy relationship. The former is richer and can offer improved
  functionality.

\item When a new capability is introduced in a core via cross-core
  send, the kernel must walk the entire mapping database to find the
  appropriate place to insert the capability. The operation is linear
  time. This operation must either happen in the monitor or must be
  made more efficient. If we move the mapping database into the
  monitor, we think that this should at worst become a performance
  issue and not a correctness one.

\item Simon and I had envisioned a memory server mechanism where it
  would delete all RAM capabilities after handing them out. This
  allows Simon to implement his version of domain shutdown (discussed
  in a supporting document). In a nutshell, in his system, the spawn
  daemon saves all user dispatcher capabilities and revokes them to
  kill a user dispatcher and repossess its memory. However, there is a
  subtle problem with this mechanism. A malicious dispatcher can
  create loops in its CSpace. E.g.\ the root CNode contains a
  capability to cnode1, cnode1 contains a capability to cnode2, and
  cnode2 contains another capability to cnode1. When a dispatcher does
  this and the OS tries to kill it, the root CNode will be deleted,
  and all capabilities within it as well. When the OS deletes cnode1,
  it will check if it has copies, which it does (in cnode2), so the
  contents of cnode1 will not be deleted. This leaks memory in the
  system.

  To avoid this problem, the OS must maintain ancestors of all
  capabilities that are handed out to user applications; to kill a
  user application, it must revoke the ancestor capabilities.
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{type_system.tex}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Current State}\label{chap:current_state}

This chapter covers how capabilities are stored and what happens on
capability invocation.
For security reasons, capabilities are stored in kernel-space and
users are given pointers to them. Each capability is stored in two
separate databases:

\begin{itemize}
\item Each dispatcher has an associated capability space that holds
  all the capabilities it has. The capability space of a dispatcher
  is implemented using the CNode type capability. Each dispatcher is
  associated with a ``root CNode'' that contains all capabilities the
  dispatcher holds.

\item Each core has a mapping database that holds all the capabilities
  on the core. The mapping database is implemented using a tree of the
  capabilities. As discussed later in the chapter, the mapping
  database stores the capabilities in a particular order to facilitate
  different capability operations.
\end{itemize}
\section{Capability invocation}
When a dispatcher invokes a capability, it passes the kernel the
address of the capability in its CSpace that it wishes to invoke. The
kernel locates the capability starting from the dispatcher's root
CNode (it walks the capability space), verifies that the requested
operation can indeed be performed on the specified capability, and
then performs the operation.
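The walk from the root CNode can be sketched as follows. This is an
illustrative model only, not the Barrelfish implementation: the name
\verb|cspace_lookup|, the slot layout, and the use of child-CNode
pointers instead of full capabilities are all hypothetical
simplifications.

```c
#include <stdint.h>
#include <stddef.h>

// Simplified model of a CNode: an array of 2^bits slots, where each
// slot is either empty (NULL) or refers to a child CNode.  Real slots
// hold capabilities of any type; we keep only what the walk needs.
struct cnode {
    uint8_t bits;             // log2 of the number of slots
    struct cnode **slots;     // slot table (NULL entries are empty)
};

// Resolve `cptr` by consuming the topmost `bits` of the address at
// each CNode, starting from the dispatcher's root CNode, as the
// kernel does when locating an invoked capability.
static struct cnode *cspace_lookup(struct cnode *root, uint32_t cptr,
                                   int valid_bits)
{
    struct cnode *cn = root;
    while (cn != NULL && valid_bits > 0) {
        valid_bits -= cn->bits;
        uint32_t slot = (cptr >> valid_bits) & ((1u << cn->bits) - 1);
        cn = cn->slots[slot];
    }
    return cn;   // NULL if the walk hit an empty slot
}
```

A two-level CSpace then resolves an address by consuming one bit at
the root and two bits at the next level.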
\section{Data structures}
Capabilities in Barrelfish are represented by the following data
structures (elided fields are shown as ``...''):

\begin{verbatim}
struct mdbnode {
    struct cte *left, *right;  // Links to the mapping database
};

struct CNode {
    paddr_t cnode;             // Base address of CNode
    uint8_t bits;              // Size in number of bits
    ...
};

union capability_u {           // Union of all types of capabilities
    struct CNode cnode;
    ...
};

struct capability {
    enum objtype type;         // Type of capability
    union capability_u u;      // Union of the capability
};

struct cte {
    struct capability cap;     ///< The actual capability
    struct mdbnode mdbnode;    ///< MDB node for the cap
};
\end{verbatim}

A capability, \verb|cte|, consists of the actual capability represented by the
``capability'' structure and an entry in the mapping database represented by
the ``mdbnode'' structure. The capability structure contains the type-specific
information and the mdbnode contains pointers for the tree representing the
mapping database.
Capabilities can be looked up in two ways:

\begin{itemize}
\item All capabilities on a core are stored in a mapping database. It
  is possible to reach any capability on the core by traversing from
  any other capability on the core.

\item Capabilities are also stored in the CNode type capability. The
  area of memory identified by the CNode structure is actually an
  array of capabilities. Starting from the ``root CNode'' of a
  dispatcher, it is only possible to reach the capabilities that the
  dispatcher holds.
\end{itemize}
\section{Terminology}

This section discusses some terminology to facilitate the discussion
of capability management.
\subsection{Copies}

A capability X is a copy of a capability Y if:

\begin{itemize}
\item X was copied from Y,
\item or Y was copied from X,
\item or X was copied from Z and Z was copied from Y.
\end{itemize}
\subsection{Descendants}

A capability X is a descendant of a capability Y if:

\begin{itemize}
\item X was retyped from Y,
\item or X is a descendant of Y1 and Y1 is a copy of Y,
\item or X is a descendant of Z and Z is a descendant of Y,
\item or X is a copy of X1 and X1 is a descendant of Y.
\end{itemize}
\subsection{Ancestor}

A is an ancestor of B if B is a descendant of A.
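The recursive definitions above can be exercised with a small toy
model. The following C sketch is purely illustrative: the
\verb|copied_from| and \verb|retyped_from| records and both helper
functions are hypothetical stand-ins for the type-specific machinery,
chosen only to mirror the rules stated above.

```c
#include <stdbool.h>

// Toy model: each capability (an index) records what it was copied
// from or retyped from; -1 means "nothing".
#define NCAPS 8
static int copied_from[NCAPS];
static int retyped_from[NCAPS];

// Follow copy links to a canonical original.  Copies of copies are
// copies, so two capabilities are copies iff they share this root.
static int copy_root(int x)
{
    while (copied_from[x] != -1)
        x = copied_from[x];
    return x;
}

static bool is_copy(int x, int y)
{
    return copy_root(x) == copy_root(y);
}

// X is a descendant of Y if following retype links upward from (the
// original of) X ever reaches a copy of Y; this mirrors the recursive
// rules: descent through retypes, closed under copies on both sides.
static bool is_descendant(int x, int y)
{
    int r = retyped_from[copy_root(x)];
    while (r != -1) {
        if (is_copy(r, y))
            return true;
        r = retyped_from[copy_root(r)];
    }
    return false;
}
```

For example, a copy of a capability retyped from a copy of Y is still
a descendant of Y under these rules.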
\section{CNode invocations}

Most capabilities have type-specific invocations. Operations on the
CNode capability modify the capability space of the system. We
discuss how these operations are implemented for a single-core
system.

\note{Invocations on other capability types will probably also modify
  the capability space but alas we don't know how those will work yet.}
\subsection{Retype}

Retyping a capability creates one or more descendants of the
capability. This operation fails if the capability already has
descendants. The descendants are inserted into a CNode specified by
the operation and into the mapping database right after the retyped
capability.

When a dispatcher issues the retype invocation, the kernel must
traverse the mapping database to ensure that the capability has no
descendants, create the descendant capabilities, and insert them into
the specified CNode and into the mapping database.
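The check-then-insert flow of retype can be sketched as follows. This
is a deliberately simplified model, not the kernel code: the mapping
database is reduced to the doubly linked list mentioned in the known
issues, and the names \verb|mdb_entry|, \verb|has_descendants|, and
\verb|retype| are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

// Simplified mapping-database entry: descendants sit immediately
// after their parent in the list, flagged as such.
struct mdb_entry {
    struct mdb_entry *prev, *next;
    bool is_descendant_of_prev;
};

static bool has_descendants(struct mdb_entry *cap)
{
    return cap->next != NULL && cap->next->is_descendant_of_prev;
}

// Retype fails (returns false, changing nothing) if the capability
// already has descendants; otherwise the new descendant is linked
// into the database right after the retyped capability.
static bool retype(struct mdb_entry *cap, struct mdb_entry *child)
{
    if (has_descendants(cap))
        return false;
    child->is_descendant_of_prev = true;
    child->prev = cap;
    child->next = cap->next;       // insert right after the parent
    if (cap->next != NULL)
        cap->next->prev = child;
    cap->next = child;
    return true;
}
```

A second retype of the same capability then fails, as required.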
\subsection{Copy}

Copying a capability creates a new copy of it. The kernel walks the
capability space to find the capability to be copied, creates the
copy, and inserts it into the specified CNode and the mapping
database.
\subsection{Delete}

Delete removes the specified capability from the CNode in which it
resides and from the mapping database. This operation cannot fail.

The kernel first walks the capability space to locate the capability
to delete. It then walks the mapping database to check whether copies
of the deleted capability still exist. If no copies are found, it
performs certain operations based on the capability type.
\subsection{Revoke}

Revoking a capability calls delete on all copies and descendants of
it. When the operation returns, the capability will not have any
copies or descendants.

The kernel walks the capability space to find the specified
capability, uses the mapping database to find all copies and
descendants of the specified capability, and deletes them.
\subsection{Looking up local copies and descendants}

Due to the capability ordering used by the mapping database, copies
are located adjacent to a capability and descendants immediately
thereafter. Therefore, it is easy to look up all related copies of a
capability on the same core. This facilitates revocation by looking up
all copies and descendants, retypes by checking for existing
descendants, and deletes by checking for copies.
The following pseudo-code looks up all descendants and copies of a
capability, given the existence of type-specific is\_copy and
is\_descendant functions.
\begin{verbatim}
// Traverse forwards
mdbnode *walk = successor(cap);
while (walk != NULL) {
    // Check if descendant
    if (is_descendant(cap, walk)) {
        // Found a descendant
    } else if (is_copy(cap, walk)) {
        // Found a copy
    } else {
        // Cap is not a descendant or copy
        break;
    }
    walk = successor(walk);
}

// Traverse backwards
walk = predecessor(cap);
while (walk != NULL) {
    // Predecessors cannot be descendants
    if (is_copy(cap, walk)) {
        // Found a copy
    } else {
        break;
    }
    walk = predecessor(walk);
}
\end{verbatim}
\section{Multicore extensions}

The model above only works for a single-core system. We have already
extended it to work on multiple cores. Here we discuss these
extensions.

\subsection{Cross-core transfer}

This is a special operation available only to the monitors. It sends
a capability from one core to another.
\subsection{Missing operations}

When a capability is sent to another core, no state is maintained
about it. When checking for capability relations in capability
operations, only the current core is checked. Therefore these
operations are not implemented correctly right now.
In this chapter, we presented the background and the current state of
capability management in Barrelfish. We can now discuss different
designs for multicore capability management.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Maintaining the database}\label{chap:db}
We consider the following approaches for managing the mapping
database:

\note{Use diagrams to better illustrate the discussion below.}
\begin{itemize}
\item \textbf{No partition:} The mapping database and capability space
  for all cores is maintained on a single centralized
  coordinator. Accessing either the capability space or the mapping
  database on any core requires communication with the coordinator.

\item \textbf{Partitioned data structure:} The mapping database for
  all cores is maintained on a single centralized coordinator and the
  capability space is maintained on the local cores. This implies that
  the cte structure is split between the coordinator and the local
  cores. Cores can access the capability space locally and have to
  message the coordinator to access the mapping database. Note that
  the split ``capability'' and ``mdbnode'' structures need to maintain
  references to each other.

\item \textbf{Centrally replicated space:} The mapping database and
  the capability space are replicated between a single coordinator and
  the local cores. This implies that two copies of each ``cte''
  structure exist in the system, one on the local core and one on the
  coordinator. Both the local core and the coordinator can access the
  mapping database and the capability space.

\item \textbf{Partitioned space:} The mapping database and the
  capability space are maintained on the local cores. The entire
  ``cte'' structure can be accessed locally but capability operations
  will require coordination and communication with remote cores.

\item \textbf{Minimal replication:} The database is partitioned
  between local cores as in the above approach and additionally, for
  each capability, the cores maintain a cache of its remote relations.
  This is discussed in more detail in section \ref{sec:cache}.
\end{itemize}
We qualitatively compare the above approaches and explain why
partitioned space and minimal replication are the best.
\section{Comparison}\label{sec:comparison}
We compare the above approaches in this section.

\subsection{Efficiency of capability invocations}
When a dispatcher invokes a capability, the kernel has to look it up
in the capability space starting from the dispatcher's ``root'' CNode.
If the capability space is not local, then the core has to message the
coordinator to perform the look up. A round trip with the coordinator
is more expensive than a local look up and can become a scalability
bottleneck if we use a single coordinator.

The no partition approach does not maintain a local capability space
and therefore may suffer from poor performance. All other approaches
maintain the capability space locally.
\subsection{Local operations}
Approaches that enable more purely local operations will perform
better as they reduce the amount of cross-core communication. In the
no partition, partitioned data structure, and replicated space
approaches, no operation can be performed locally. In the partitioned
and minimal replication approaches, certain operations such as copying
capabilities are purely local.
\begin{table}
\centering
\begin{tabular}{|l|c|c|}
\hline
 & Cap invocation & Local operations \\
\hline
No partition               & - & - \\ \hline
Partitioned data structure & + & - \\ \hline
Centrally replicated space & + & - \\ \hline
Partitioned space          & + & + \\ \hline
Minimal replication        & + & + \\ \hline
\end{tabular}
\caption{\label{t:summary}Summary of the different approaches.}
\end{table}
\subsection{Discussion}
Table \ref{t:summary} summarizes our comparison of the five
approaches. A (+) indicates that the approach performs relatively well
on the given metric and a (-) indicates that the approach performs
relatively poorly. Based on the results, we choose to implement the
partitioned space and minimal replication approaches.
\section{Caching}\label{sec:cache}
The minimal replication approach is the partitioned approach augmented
with a cache of the state of remote relations. Capabilities without
remote relations are marked as such, and no remote communication is
required when performing operations on them.
When tracking remote relations, three types of relations must be
tracked: copies, descendants, and ancestors. Tracking of remote copies
and descendants is required so that the revoke, retype, and delete
operations can be implemented correctly. Capabilities must also track
their remote ancestors so that, if they are deleted, the remote
ancestors can be informed and can update the state of their remote
descendants.
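The three kinds of remote relations lend themselves to a compact flag
representation. The following C sketch is a hypothetical illustration
of the single-bit variant described below: the enum values, the
reduced \verb|mdbnode|, and the two predicate names are not taken from
the implementation.

```c
#include <stdint.h>
#include <stdbool.h>

// One flag per kind of remote relation a capability may have.
enum remote_rel {
    REMOTE_COPIES      = 1 << 0,
    REMOTE_DESCENDANTS = 1 << 1,
    REMOTE_ANCESTORS   = 1 << 2,
};

struct mdbnode {
    uint8_t remote_rels;   // OR of enum remote_rel flags
};

// Fast path: deleting the last local copy only needs remote
// communication if remote copies might exist.
static bool delete_needs_remote_check(const struct mdbnode *n)
{
    return (n->remote_rels & REMOTE_COPIES) != 0;
}

// Retype must rule out descendants anywhere, so remote copies (whose
// descendants are also descendants) and remote descendants both force
// a remote check.
static bool retype_needs_remote_check(const struct mdbnode *n)
{
    return (n->remote_rels & (REMOTE_COPIES | REMOTE_DESCENDANTS)) != 0;
}
```

A capability with all three flags clear can be operated on entirely
locally.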
\subsection{How to maintain the cache?}

\note{Once I discuss total order broadcast with caching, this
  discussion will be revamped.}
There are two ways of maintaining the cache:

\begin{itemize}
\item \textbf{Single bit:} A bit each for the three types of remote
  relations is used. The bits merely indicate the presence of remote
  relations but provide no further information, such as which cores
  have the remote relations.

\item \textbf{List:} A list each for the three types of remote
  relations is used. The list contains exact information about which
  cores the remote relations exist on. The core-mask technique of
  Uhlig et al.~[?] can be used to maintain the lists with fixed space
  requirements.
\end{itemize}
\subsubsection{Comparison}
\textbf{Informing existing relations:} When a capability that already
has remote relations is sent to another core, with the single-bit
approach, none of the existing cores with relations have to be
informed, but with the list approach, cores with existing relations
must be informed so they can update their lists.
With both approaches, when the last local copy of a capability is
deleted, its remote relations should be informed. This is required in
the list approach so that the cores can update their lists, and in the
single-bit approach so that the cores can determine whether the last
remote relation has been deleted and the capability can be marked as
no longer having the remote relation.
\textbf{Number of cores communicated:} With the single-bit approach,
all cores in the system must be contacted, while with the list
approach, just the cores of interest must be contacted.
\textbf{Discussion:} The single-bit approach will probably be more
efficient if the system has few cores and the capabilities that have
remote relations tend to have them on many cores. The list approach
will perform better if there are many cores in the system and a
capability typically has relations on only a few cores.
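The list variant with a fixed-size core mask (one bit per core, as in
the core-mask technique of Uhlig et al.) can be sketched as follows.
The names and the 64-core limit are hypothetical choices for the
sketch.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_CORES 64

// Fixed-space "list": one bit per core that holds a remote relation.
struct coremask {
    uint64_t bits;
};

static void mask_add(struct coremask *m, int core)    { m->bits |=  1ull << core; }
static void mask_remove(struct coremask *m, int core) { m->bits &= ~(1ull << core); }
static bool mask_member(const struct coremask *m, int core)
{
    return (m->bits >> core) & 1;
}

// With the list approach only the cores of interest are contacted:
// enumerate the set cores into `out` (capacity n), return the count.
static int cores_to_contact(const struct coremask *m, int out[], int n)
{
    int cnt = 0;
    for (int core = 0; core < MAX_CORES && cnt < n; core++)
        if (mask_member(m, core))
            out[cnt++] = core;
    return cnt;
}
```

Unlike a true list, the mask never grows, which is what gives the
fixed space requirement; the trade-off is the fixed upper bound on the
number of cores.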
\subsection{Where to maintain the cache?}\label{subsec:where}
Imagine that there exists a capability with only local relations, and
then cross-core transfer is applied to it. Should we only mark that
capability as having a remote copy, or should we mark all impacted
capabilities accordingly? This section discusses and compares these
two approaches.
\subsubsection{Mark all}
When cross-core transfer is applied to a capability, it is marked as
having remote copies and all its local relations are also marked as
having the appropriate remote relations. Its local copies will be
marked as having remote copies, its local descendants will be marked
as having remote ancestors, and its local ancestors will be marked as
having remote descendants.

All capabilities maintain the complete state of their remote relations
at all times.
\subsubsection{Mark just individual capability}
When cross-core transfer is applied to a capability, it is marked as
having remote copies and none of its local relations are updated.

Capabilities do not maintain the complete state of their remote
relations. Their local relations must be looked up to build the
complete state of their remote relations.
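Building the complete state in this approach amounts to OR-ing the
marks found on all local relations. The sketch below is hypothetical
(flat array instead of the mapping database, invented names) and only
illustrates why every local relation must be visited.

```c
#include <stdint.h>

// Hypothetical mark: set only on the capability that was actually
// transferred, as in "mark just individual capability".
#define R_COPIES 1u

struct cap {
    uint8_t remote_rels;
};

// The full remote state of a group of local copies is the OR of the
// marks on every member; skipping any member can miss a relation.
static uint8_t build_remote_state(struct cap *local_copies[], int n)
{
    uint8_t rels = 0;
    for (int i = 0; i < n; i++)
        rels |= local_copies[i]->remote_rels;
    return rels;
}
```

Scanning only a subset of the local relations can report "no remote
relations" incorrectly, which is exactly the cost this approach trades
against cheaper cross-core transfers.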
\subsubsection{Comparison}
Both approaches have their specific strengths and weaknesses,
discussed below.

\textbf{Cost of cross-core transferring:} When applying cross-core
transfer to a capability, in the mark all approach all local relations
must be updated, whereas in the mark just individual capability
approach the local relations do not have to be updated. However, the
cross-core transfer operation needs to send full information about the
state of local relations so that the newly created remote capability
can set up its cache properly. Gathering this information requires
accessing all local relations anyway, so the mark all approach
represents only a small increase in the work that must be performed.
\textbf{Cost of building state of remote relations:} Any capability
operation that requires information on the current state of remote
relations will have to build the full state of remote relations for
the capability. This is trivial in the mark all approach, as each
capability maintains the full state. In the mark just individual
capability approach, all local relations must be accessed to build the
full state.
\textbf{Discussion:} Given that the state of remote relations is
trivially constructed in the mark all approach and the cost of
cross-core transfer is only marginally higher than the bare minimum
cost already required, we conclude that the mark all approach is
superior to the mark just individual capability approach.
\subsection{Summary}\label{subsec:cache:summary}
In summary, we have come up with two approaches for the type of cache,
single-bit and list, and we have concluded that the mark all approach
is superior for maintaining the cache.

Maintaining a cache will require the following adjustments to the
capability operations presented above.
\begin{itemize}
\item When cross-core transfer is applied to a capability, it is
  marked as having remote copies. The operation sends information on
  the state of the capability's local and remote relations.

\item When a capability is created by the cross-core receive
  operation, it incorporates the information about relations sent with
  the cross-core transfer operation.

\item When a copy of a capability is marked as having remote
  relations, the capability is marked as having the same remote
  relations.

\item When a descendant of a capability is marked as having remote
  relations, the capability is also marked as having remote relations
  based on the following rules:
  \begin{itemize}
  \item If the descendant has a remote copy, then the capability has a
    remote descendant.
  \item If the descendant has a remote descendant, then the capability
    has a remote descendant.
  \item If the descendant has a remote ancestor, then the capability
    either has a remote copy or a remote ancestor depending on the
    outcome of the type-specific is\_descendant function.
  \end{itemize}

\item When an ancestor of a capability is marked as having remote
  relations, the capability is also marked as having remote relations
  based on the following rules:
  \begin{itemize}
  \item If the ancestor has a remote copy, then the capability has a
    remote ancestor.
  \item If the ancestor has a remote descendant, then the capability
    either has a remote copy or a remote descendant depending on the
    outcome of the type-specific is\_descendant function.
  \item If the ancestor has a remote ancestor, then the capability has
    remote ancestors.
  \end{itemize}

\item When a capability is retyped:
  \begin{itemize}
  \item Its remote copies are marked as having remote descendants.
  \item The descendants are marked as having remote ancestors if the
    capability has remote copies.
  \end{itemize}

\item When a capability is copied, the copy is marked as having the
  same remote relations that the capability has.

\item When the last copy of a capability on the core is deleted, its
  remote copies, descendants, and ancestors are informed.
\end{itemize}
We now discuss some implications of maintaining a cache.
\textbf{Drawbacks:} Caching introduces the following drawbacks that
the non-caching approach does not suffer from.

\begin{itemize}
\item The application dispatchers must first try to perform the
  capability operations locally and, if that fails, then communicate
  with the monitors. If no caching were used, the application
  dispatchers could always directly communicate with the monitor,
  saving one superfluous trip into the kernel.

\item As discussed above, caching requires additional overhead for
  keeping the cache consistent.

\item Caching increases the space requirement for maintaining the
  capabilities.
\end{itemize}

Caching can work if typically few capabilities are shared between
cores and can hurt if many capabilities are shared.
\textbf{Decide dynamically:} It is clear that application scenarios
will dictate whether caching helps and, if it does, which type of
caching helps. To support this, Barrelfish can allow applications to
specify which type of caching is best suited for them, or the OS can
gather appropriate metrics on application execution and dynamically
update the type of caching used based on them. For this, we need to
identify the cross-over point where one approach becomes preferable to
the other.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Solutions}\label{chap:solutions}
In this chapter, we discuss mechanisms for implementing the capability
operations. In section \ref{sec:challenges}, we first discuss the
challenges in correctly implementing the operations, showing how
executing multiple related operations in parallel can lead to
conflicts and how an intuitive solution based on acknowledgments
fails to resolve the conflicts. We then present the requirements for a
correct solution in section \ref{sec:requirements} and compare four
different correct solutions.
\section{Challenges}\label{sec:challenges}
We first discuss the conflicts that can arise if two or more related
operations execute in parallel, and then discuss an intuitive but
incorrect solution of using acknowledgments to resolve the conflicts.

\subsection{Conflicts}\label{subsec:conflicts}
Performing overlapping operations on related capabilities can lead to
conflicts. This section discusses the different types of conflicts
that can arise.
\textbf{Conflicts between two retypes, deletes, or revokes:} If two
different cores are trying to retype, delete, or revoke related
capabilities, then they can conflict. If two different cores try to
retype two copies of a capability, the correct behavior is for one to
succeed and for one to fail. If two cores try to delete the last two
copies of a capability, the correct behavior is for one to just delete
locally and for the other to perform type-specific operations required
when deleting the last copy of a capability. If two cores try to
revoke the same capability, one should succeed, the other should fail
and the capability it was trying to revoke should also be deleted.

\textbf{Conflict between revoke and cross-core transfer:} If a core is
trying to revoke a capability, a copy or descendant of which another
core is trying to transfer, the revoke operation should only finish
when the in-flight capability is also deleted.
\textbf{Conflict between revoke and retype:} If a core is trying to
revoke a capability, a copy or descendant of which another core is
trying to retype, then either the retype should succeed and the new
capabilities be deleted before the revoke finishes, or the retype
should fail.

\subsection{Non-conflicts}
This section discusses why it is safe to perform other operations in
parallel.
\textbf{Delete and revoke:} Delete deletes at least one capability and
potentially more. If a revoke for a related capability is issued at
the same time, they will overlap and potentially try to perform
redundant operations, but these are safe.

\textbf{Delete and transfer:} Delete will delete one capability and
perform a check for copies. The in-flight capability already has a
copy that can satisfy the requirements of delete. If that capability
is also being deleted, it is the same as two deletes overlapping,
which indeed is a conflict.
\textbf{Retype and delete:} A delete will not conflict with retypes
because it checks for the existence of copies, which a successful
retype cannot possibly create. A retype will not conflict with deletes
because no matter when it performs the check for descendants, it will
either see descendants that are about to be deleted or not see them at
all; either outcome is correct.

\textbf{Retype and transfer:} Retype cannot conflict with transfers
because the capability being transferred is a copy of an existing
capability, which does not change the existing descendant
relationships.
\subsection{Incorrect solution}
Since the essential issue with the cross-core transfer operation is
that in-flight capabilities do not exist anywhere, the sending core
can maintain a list of in-flight capabilities, which are garbage
collected when the receiver acknowledges the reception of the
capability. This allows the sending core to include the in-flight
capabilities in the search for copies and descendants. As shown below,
this intuitive solution is actually incorrect.

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{acks-problem.pdf}
\caption{Problem with using acknowledgments}\label{fig:acks-problem}
\end{figure}
Consider the scenario shown in figure \ref{fig:acks-problem}. The
system consists of three cores \{A, B, C\}; A is trying to revoke a
capability and B is trying to send a copy of that capability to C. A
sends a revoke request to B and C and waits for their acknowledgments.
B sends the capability to C and adds it to its list of in-flight
capabilities. C sees the revoke request first, performs the operation
locally, and replies to A. It then sees the capability sent from B,
inserts it locally, and sends an acknowledgment to B. B sees the
acknowledgment from C first and garbage collects its list of in-flight
capabilities; only then does it see the revoke request from A, perform
the revoke locally, and reply to A. A receives replies from both B and
C, concluding the revoke operation. The revoke operation has finished,
yet C still holds a copy of the revoked capability.
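This interleaving can be replayed with a few lines of C (an
illustrative simulation only, not Barrelfish code; the cores, messages,
and names such as \texttt{replay} are invented for this example):

\begin{lstlisting}[language=C]
#include <stdbool.h>

enum { A, B, C, NCORES };

static bool has_cap[NCORES] = { true, true, false };
static bool b_in_flight_list = false; /* B's in-flight cap list */
static int  acks_for_A = 0;

static void revoke_locally(int core) { has_cap[core] = false; }

static void replay(void)
{
    revoke_locally(A);        /* A revokes locally, broadcasts a
                                 revoke request to B and C        */
    b_in_flight_list = true;  /* B sends its copy to C, records it
                                 as in-flight                     */
    revoke_locally(C);        /* C delivers A's request first ... */
    acks_for_A++;             /* ... and acknowledges A           */
    has_cap[C] = true;        /* C then delivers the cap from B
                                 and acknowledges B               */
    b_in_flight_list = false; /* B delivers C's ack first and
                                 garbage collects its list        */
    revoke_locally(B);        /* only now does B deliver A's
                                 revoke request                   */
    acks_for_A++;             /* B acknowledges A                 */
}
\end{lstlisting}

After \texttt{replay()}, A has received both acknowledgments and
considers the revoke complete, yet \texttt{has\_cap[C]} is still true.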
\section{Requirements from a correct solution}\label{sec:requirements}
A correct solution must enforce the following transaction-like and
ordering properties.
\subsection{Transactional properties}
\textbf{Atomic:} If one part of a capability operation fails, then the
entire operation must fail. For example, in a retype operation, the
descendant capabilities can be created in parallel with the check for
existing descendants. If existing descendants are found, then the
newly created descendants must be removed.
\textbf{Consistent:} Every capability operation must leave the system
in a consistent state. For example, if caching is used and an
operation deletes the last copy of a capability on a core, then all
its remote relations must be informed and updated before the operation
finishes.
\textbf{Isolation:} Data modified during a capability operation must
not be accessible to other operations until the operation finishes.
For example, if a retype operation has eagerly created descendants
while checking for existing descendants in parallel, no other
capability operation may access the eagerly created descendants until
the retype operation finishes.
\subsection{Ordering}
Section \ref{subsec:conflicts} discusses the two types of conflicts
that can arise when multiple cores try to perform operations on
related capabilities at the same time. Correctness can be ensured if
the individual steps (messages) of the operations are executed in some
well-defined order. We provide some background on message ordering
before discussing the actual ordering requirements.
\textbf{Background on ordering:} Four types of ordering are
possible. We discuss them from the cheapest to the most expensive.

\begin{itemize}
\item No order: Messages can be delivered in any order. This cannot
  happen on Barrelfish.
\item Single Source FIFO (SSF): Messages from the same sender are
  delivered in the order they were sent. This is what Barrelfish
  currently provides by default.
\item Causal: Messages are delivered according to the partial order
  given by the happens-before relationship. Vector clocks [?] can be
  used to provide causal delivery of messages.
\item Total order: All receivers receive messages in the same
  order. Note that any ordering function can be used to order the
  messages. Further, total order does not imply causal order, but if
  the ordering function respects SSF, then total order does imply
  causal order.
\end{itemize}
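To make the causal option concrete, the following is a minimal
vector-clock sketch in C (illustrative only; Barrelfish's channels do
not currently provide this, and all names are invented). A message is
deliverable once it is the next message from its sender and every
message it causally depends on has already been delivered.

\begin{lstlisting}[language=C]
#define NCORES 3

struct vclock { int c[NCORES]; };
struct msg    { int sender; struct vclock vc; };

/* Sender: increment own entry and stamp the message. */
static struct msg vc_send(struct vclock *vc, int sender)
{
    struct msg m;
    vc->c[sender]++;
    m.sender = sender;
    m.vc = *vc;
    return m;
}

/* Receiver: deliverable iff this is the next message from its
   sender and it depends on nothing we have not yet seen. */
static int vc_deliverable(const struct vclock *vc, const struct msg *m)
{
    for (int k = 0; k < NCORES; k++) {
        if (k == m->sender && m->vc.c[k] != vc->c[k] + 1) return 0;
        if (k != m->sender && m->vc.c[k] >  vc->c[k])     return 0;
    }
    return 1;
}

/* Merge the message's clock into the receiver's on delivery. */
static void vc_deliver(struct vclock *vc, const struct msg *m)
{
    for (int k = 0; k < NCORES; k++)
        if (m->vc.c[k] > vc->c[k]) vc->c[k] = m->vc.c[k];
}
\end{lstlisting}

A receiver simply buffers any message for which
\texttt{vc\_deliverable} returns false until the missing messages
arrive.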
\subsection{Conflicts with each other}

\begin{figure}[ht]
\centering
\includegraphics[width=0.75\textwidth]{causal-problem.pdf}
\caption{Problem with using causal order
  delivery}\label{fig:causal-problem}
\end{figure}
The minimum ordering required for resolving conflicts between the
retype, delete, and revoke operations is total order. Figure
\ref{fig:causal-problem} illustrates why causal ordering does not
suffice. Nodes \{A, B, C, D\} each hold a copy of the same capability,
which \{A, D\} wish to retype at the same time. B receives the message
from A before the message from D, and C receives D's message before
A's. Even though the messages were delivered in causal order, C will
refuse A and B will refuse D, allowing neither retype operation to
succeed. Similar examples can be constructed for the delete and revoke
operations.

Total order delivery will deliver the messages to B and C in the same
order, allowing one and only one retype to succeed.
\subsection{Conflict between revoke and transfer}
Figure \ref{fig:acks-problem} illustrates the conflict between the
revoke and cross-core transfer operations. C sees the message from A
before the message from B. Hence, any message C sends to B causally
depends upon the message from A. If this dependency were reflected in
the message C sends to B, then B could delay handling that message
until it sees the revoke request from A. When B receives the revoke
request from A, it will realize that A is trying to revoke an
in-flight capability and take appropriate actions to resolve the
issue. This delaying of the message can be guaranteed by enforcing
causal delivery of messages.
\subsection{Conflict between revoke and retype}
The minimum ordering requirement is total order. With a less strict
ordering, different cores might see the revoke and retype requests in
different orders. The retyping core might delete the source capability
but end up creating descendants later.
\section{Solutions}\label{sec:solutions}

We will discuss and compare the following four solutions.

\begin{itemize}
\item Locks
\item Total order broadcast
\item Two phase commit
\item Sequencer
\end{itemize}
\subsection{Locks}\label{subsec:locks}
Locks can be used to enforce the above transactional and ordering
properties. Critical sections can be used to resolve the conflicts,
ensuring that \emph{time-of-check-to-time-of-use} is not violated.

Below, we provide pseudo-code showing how the capability operations
can be implemented when using a centralized lock and not caching
information about remote relations.
\textbf{Copying a capability:} This operation is safe to perform
without holding the lock.

\begin{lstlisting}
cap_copy(struct cte *cte) {
    create_copy(cte);
    return_success_to_app();
}
\end{lstlisting}
\textbf{Retyping a capability:} Holding the lock is required to
prevent multiple cores from trying to retype copies of the same
capability. Acquiring the lock is a blocking call, during which the
capability might have been deleted, so it is important to check that
the capability still exists after the lock is acquired.

\begin{lstlisting}
cap_retype(struct cte *cte) {
    errval_t err = success;
    acquire_lock();
    if (cap_exists(cte) == false) {
        err = fail;
    } else if (has_descendants(cte) == true) {
        err = fail;
    } else {
        create_descendants(cte);
    }
    release_lock();
    return err;
}
\end{lstlisting}
\textbf{Deleting a capability:} If the capability has local copies,
the operation is purely local. Otherwise, holding the lock is required
to ensure that if the last copy of the capability in the system is
being deleted, the type-specific cleanup operations are executed.

\begin{lstlisting}
cap_delete(struct cte *cte) {
    if (!requires_typed_ops(cte)) {
        remove_cap(cte);
        return success;
    }
    if (has_local_copies(cte)) {
        remove_cap(cte);
        return success;
    }
    acquire_lock();
    errval_t err = success;
    if (!cap_exists(cte)) {
        err = fail;
    } else {
        if (!has_copies(cte)) {
            type_specific_ops(cte);
        }
        remove_cap(cte);
    }
    release_lock();
    return err;
}
\end{lstlisting}
\textbf{Revoking a capability:} Holding the lock is required to ensure
that if multiple cores are trying to revoke related capabilities, only
one succeeds. This is why the code below checks that the capability
still exists after acquiring the lock.

\begin{lstlisting}
cap_revoke(struct cte *cte) {
    errval_t err = success;
    acquire_lock();
    if (!cap_exists(cte)) {
        err = fail;
    } else {
        while (has_descendants(cte)) {
            struct cte *dest = get_next_descendant(cte);
            remove_cap(dest);
        }
        while (has_copies(cte)) {
            struct cte *dest = get_next_copy(cte);
            remove_cap(dest);
        }
    }
    release_lock();
    return err;
}
\end{lstlisting}
\textbf{Cross-core transferring a capability:} Holding the lock is
required to prevent conflicts between the above capability operations
and the cross-core transfer operation. The sender acquires the lock
and sends the capability. The receiver creates the capability and
releases the lock.

\begin{lstlisting}
cap_send(struct cte *cte, coreid_t dest) {
    acquire_lock();
    if (cap_exists(cte) == false) {
        release_lock();
        return fail;
    }
    send_cap(cte, dest);
}

cap_receive(struct cte *cte) {
    create_cap(cte);
    release_lock();
    return_success_to_app();
}
\end{lstlisting}
\subsubsection{Caching remote relations}
If remote relations are cached, then the cache has to be kept
consistent when capability operations change the state of relations.
Below we describe the modifications to the above pseudo-code that are
required to keep the cache consistent. Note that our cache can be seen
as eager replication. As discussed in section \ref{subsec:where}, we
use the mark-all approach to maintain the cache.
\textbf{Copying a capability:} When the new copy is created, its cache
of remote relations is set equal to the cache of the capability it was
copied from.
\textbf{Retyping a capability:} If the capability has no remote copies
and no remote descendants, the operation is purely local and the lock
is not required. If the operation is not local, then the lock is
acquired, the checks for the existence of the capability and for
descendants are made, the capability is retyped, and the remote
relations are updated based on the rules presented in section
\ref{subsec:cache:summary}.
\begin{lstlisting}
cap_retype(struct cte *cte) {
    if (has_local_descendants(cte)) {
        return fail;
    }
    if (!has_remote_copies(cte) && !has_remote_descendants(cte)) {
        create_descendants(cte);
        return success;
    }
    if (has_remote_descendants(cte)) {
        return fail;
    }
    acquire_lock();
    errval_t err = success;
    if (!cap_exists(cte)) {
        err = fail;
    } else if (has_remote_descendants(cte)) {
        err = fail;
    } else if (has_local_descendants(cte)) {
        err = fail;
    } else {
        create_descendants(cte);
        update_relations(cte);
    }
    release_lock();
    return err;
}
\end{lstlisting}
\textbf{Deleting a capability:} If the capability has local copies or
no remote relations, the operation is purely local. Otherwise, the
lock is required and the remote relations must be updated.

\begin{lstlisting}
cap_delete(struct cte *cte) {
    if (!requires_typed_ops(cte)) {
        remove_cap(cte);
        return success;
    }
    if (has_local_copies(cte)) {
        remove_cap(cte);
        return success;
    }
    if (!has_remote_relations(cte)) {
        type_specific_ops(cte);
        remove_cap(cte);
        return success;
    }
    acquire_lock();
    errval_t err = success;
    if (!cap_exists(cte)) {
        err = fail;
    } else {
        if (!has_copies(cte)) {
            type_specific_ops(cte);
        }
        update_remote_relations(cte);
        remove_cap(cte);
    }
    release_lock();
    return err;
}
\end{lstlisting}
\textbf{Revoking a capability:} If the capability has no remote
relations, the operation is purely local. Otherwise, the lock is
required.

\begin{lstlisting}
cap_revoke(struct cte *cte) {
    if (!has_remote_relations(cte)) {
        while (has_descendants(cte)) {
            struct cte *dest = get_next_descendant(cte);
            remove_cap(dest);
        }
        while (has_copies(cte)) {
            struct cte *dest = get_next_copy(cte);
            remove_cap(dest);
        }
        return success;
    }
    acquire_lock();
    errval_t err = success;
    if (!cap_exists(cte)) {
        err = fail;
    } else {
        while (has_descendants(cte)) {
            struct cte *dest = get_next_descendant(cte);
            remove_cap(dest);
        }
        while (has_copies(cte)) {
            struct cte *dest = get_next_copy(cte);
            remove_cap(dest);
        }
    }
    release_lock();
    return err;
}
\end{lstlisting}
\textbf{Cross-core transferring a capability:} The remote relations
cache on the local and remote copies of the capability is updated as
presented in section \ref{subsec:cache:summary}.
\subsubsection{Multiple locks}

If a single lock is used for the entire system, only one capability
operation can be performed at any given moment, limiting the potential
for parallelism. By using different locks for unrelated capabilities,
multiple capability operations can proceed in parallel. Using multiple
locks increases the space requirements but also improves parallelism.

\note{Multiple locks are not straight-forward. Ancestors reference
  more memory than descendants do.}

\note{Can a copy lock for delete, a descendant lock for retype, and
  both for revoke work?}
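One conceivable partitioning (purely a sketch, not part of the current
design; \texttt{lock\_index} and \texttt{NLOCKS} are invented names)
hashes the base address of a memory region to one of a fixed array of
locks. Since copies and descendants cover addresses inside their
ancestor's region, the hash would have to be taken over the root
ancestor's base address rather than the capability's own for related
capabilities to map to the same lock, which illustrates why multiple
locks are not straightforward.

\begin{lstlisting}[language=C]
#include <stdint.h>

#define NLOCKS 64

/* Map the ROOT ancestor's region base to one of NLOCKS locks, so
   that all related capabilities contend on the same lock. */
static unsigned lock_index(uintptr_t ancestor_base)
{
    /* Fibonacci hashing: multiply by floor(2^64 / golden ratio)
       and keep the top 6 bits (NLOCKS == 64). */
    return (unsigned)((ancestor_base * 11400714819323198485ULL) >> 58);
}
\end{lstlisting}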
\subsection{Total order broadcast}
\note{Use a single data structure for all pending cap operations.}

The required transactional and ordering guarantees can be provided by
ensuring that messages for related capabilities are delivered in the
same order on all cores. Note that causal order delivery resolves the
conflicts between the retype, delete, and revoke operations and the
cross-core transfer operation, but not the conflicts among the retype,
delete, and revoke operations themselves.
Below we present pseudo-code for how the capability operations can be
implemented when not caching information about remote relations.

\textbf{Copying a capability:} This operation is safe to perform
without any ordering requirements.
\textbf{Deleting a capability:} If the capability does not require
type-specific operations or has local copies, the operation succeeds
immediately. Otherwise, local state is created and a message is
broadcast to all cores.

\begin{lstlisting}
cap_delete(struct cte *cte) {
    if (!requires_typed_ops(cte) || has_local_copies(cte)) {
        remove_cap(cte);
        return_success_to_app();
        return;
    }
    struct delete_data *data = malloc(sizeof(struct delete_data));
    delete_data_initialize(data, cte);
    outstanding_deletes_list->add(data);
    send_delete_request_bcast(TOTAL_ORDER, data);
}
\end{lstlisting}
If a core receives a delete request broadcast and it was the sender of
the message, it removes the capability and returns. If the core was
not the sender of the message, it replies with the state of the local
copies of the specified capability.

\begin{lstlisting}
delete_request_bcast(coreid_t from, struct delete_request *data) {
    if (from == my_core_id) {
        remove_cap(data->cte);
        return;
    }
    send_delete_reply(from, data, has_local_copies(data->cte));
}
\end{lstlisting}
Finally, when a core receives a reply to its delete request: if a
remote copy was found, no type-specific cleanup is required and the
operation finishes. If all replies have been aggregated and no copies
were found, then the type-specific cleanups are performed and the
operation finishes.

\begin{lstlisting}
delete_reply(coreid_t from, struct delete_request *data,
             bool has_copies) {
    struct cte *my_data = outstanding_deletes_list->get(data);
    if (!my_data) {
        return;
    }
    if (has_copies) {
        outstanding_deletes_list->remove(my_data);
        return_success_to_app();
        return;
    }
    increment_replies(my_data);
    if (seen_all_replies(my_data)) {
        type_specific_ops(my_data->cte);
        outstanding_deletes_list->remove(my_data);
        return_success_to_app();
    }
}
\end{lstlisting}
Since the deleting cores remove their capability when they receive
their own broadcast, when multiple cores try to delete copies of the
same capability concurrently, the last core's broadcast will find no
remaining copies while the earlier cores' broadcasts will still find
copies.

\note{Malloc probably means unbounded memory requirement.}
\textbf{Retyping a capability:} If the capability has local
descendants, then the operation fails. Otherwise, a retype request is
broadcast to all cores.

\begin{lstlisting}
cap_retype(struct cte *cte) {
    if (has_local_descendants(cte)) {
        return_failure_to_app();
        return;
    }
    struct retype_data *data = malloc(sizeof(struct retype_data));
    retype_data_initialize(data, cte);
    outstanding_retypes_list->add(data);
    send_retype_request_bcast(TOTAL_ORDER, data);
}
\end{lstlisting}
When a core receives a retype request broadcast and it was the sender
of the message, one of two things may have happened: either the core
had already received a broadcast from another core trying to retype
the same capability, in which case its own operation has failed and
the state for the request has already been removed, or the core can
succeed in retyping the capability, and it updates its state
accordingly. If the core was not the sender of the broadcast, it sends
a reply to the sender with the state of its local descendants. The
core then checks whether it has an outstanding retype request of its
own for the capability. If it does and its own request has not been
delivered yet, its retype fails: an error is sent to the application
and the state is garbage collected.

\begin{lstlisting}
retype_request_bcast(coreid_t from, struct retype_request *data) {
    if (from == my_core_id) {
        struct cte *my_data = outstanding_retypes_list->get(data);
        if (!my_data) {
            return; // operation has already failed
        }
        my_data->can_succeed_flag = true;
        return;
    }
    send_retype_reply(from, data, has_local_descendants(data->cte));
    struct cte *my_data = outstanding_retypes_list->get(data);
    if (!my_data) {
        return;
    }
    if (!my_data->can_succeed_flag) {
        outstanding_retypes_list->remove(my_data);
        return_failure_to_app();
    }
}
\end{lstlisting}
When a core receives a reply to its retype request broadcast, the
operation may already have failed, in which case the reply is ignored.
If the operation has not been canceled yet and the reply indicates no
descendants, the operation can still succeed. Once all replies have
been seen, the retype operation succeeds.

\begin{lstlisting}
retype_reply(struct retype_request *data, bool has_descendants) {
    struct cte *my_data = outstanding_retypes_list->get(data);
    if (!my_data) {
        return;
    }
    if (has_descendants) {
        outstanding_retypes_list->remove(my_data);
        return_failure_to_app();
        return;
    }
    increment_replies(my_data);
    if (seen_all_replies(my_data)) {
        create_descendants(my_data->cte);
        outstanding_retypes_list->remove(my_data);
        return_success_to_app();
    }
}
\end{lstlisting}
This resolves conflicts between two retypes. Conflicts between retype
and revoke are discussed below.

\note{Malloc probably means unbounded memory requirement.}
\textbf{Revoking a capability:} The core initializes some local state
and then broadcasts a revoke request to all cores in the system.

\begin{lstlisting}
cap_revoke(struct cte *cte) {
    struct revoke_data *data = malloc(sizeof(struct revoke_data));
    revoke_data_initialize(data, cte);
    outstanding_revokes_list->add(data);
    send_revoke_request_bcast(TOTAL_ORDER, data);
}
\end{lstlisting}
When a core receives a revoke request broadcast: if the core was
trying to retype a related capability, it fails that retype operation.
If it is not trying to revoke the capability itself, it simply revokes
the capability locally and sends a reply. If the core is trying to
revoke the same capability but was not the sender of the broadcast,
then this core's own revoke operation fails.

\begin{lstlisting}
revoke_request_bcast(coreid_t from, struct revoke_request *data) {
    if (related_retype_in_progress(data->cte)) {
        outstanding_retypes_list->remove(data);
        return_failure_to_app();
    }

    struct cte *my_data = outstanding_revokes_list->get(data);
    if (!my_data) {
        revoke_locally(data->cte);
        send_revoke_reply(from, data);
        return;
    }
    if (from != my_core_id && !my_data->success_flag) {
        outstanding_revokes_list->remove(my_data);
        send_revoke_reply(from, data);
        return_failure_to_app();
        return;
    }
    my_data->success_flag = true;
}
\end{lstlisting}
When a core receives replies to its broadcast, it aggregates them
until it has heard back from all cores in the system and then returns
success to the application.

\begin{lstlisting}
revoke_reply(struct revoke_request *data) {
    struct cte *my_data = outstanding_revokes_list->get(data);
    if (!my_data) {
        return;
    }
    increment_replies(my_data);
    if (seen_all_replies(my_data) &&
        !my_data->sent_transfer_cap_delete_flag) {
        outstanding_revokes_list->remove(my_data);
        return_success_to_app();
    }
}
\end{lstlisting}
\textbf{Cross-core transfer:} When sending a capability to another
core, the sending core creates local state and broadcasts the send
message to all cores in the system.

\begin{lstlisting}
cap_transfer(struct cte *cte, coreid_t to) {
    struct transfer_data *data = malloc(sizeof(struct transfer_data));
    transfer_data_initialize(data, cte);
    outstanding_transfers_list->add(data);
    send_transfer_request_bcast(TOTAL_ORDER, data, to);
}
\end{lstlisting}
When a core receives the capability transfer broadcast, it checks it
against the state of the capability operations currently in progress
and takes appropriate actions if they are related.

If a revoke operation is in progress that is trying to revoke a copy
or an ancestor of the capability being transferred, the broadcast
indicates that the system does in fact have more copies or descendants
of the capability being revoked, which must be deleted. If the
receiver of the broadcast is also the recipient of the capability, it
does not create it and returns an error to the sender of the
capability; otherwise, the receiver sends a message to the core to
which the capability is being transferred, requesting it to delete the
capability.
\begin{lstlisting}
transfer_request_bcast(coreid_t from, struct transfer_data *data,
                       coreid_t to) {
    struct cte *my_data;

    my_data = outstanding_revokes_list->get(data);
    if (my_data) { // a related revoke is in progress
        if (to == my_core_id) {
            send_transfer_reply(from, FAILURE_REVOKED, data);
            return;
        }
        my_data->sent_transfer_cap_delete_flag = true;
        send_delete_transferred_cap_request(to, data);
        return;
    }

    if (to != my_core_id) {
        return;
    }

    my_data = pending_delete_for_transferred_cap_list->get(data);
    if (my_data) { // a delete request arrived before the capability
        pending_delete_for_transferred_cap_list->remove(my_data);
        send_delete_transferred_cap_reply(my_data->from);
        send_transfer_reply(from, FAILURE_REVOKED, data);
        return;
    }

    cap_create(data->cte);
    send_transfer_reply(from, SUCCESS, data);
}
\end{lstlisting}
When the sender of the capability gets a reply from the receiver, it
forwards the error code to the application and garbage collects its
local state.

\begin{lstlisting}
transfer_reply(errval_t err, struct transfer_request *data) {
    struct cte *my_data = outstanding_transfers_list->get(data);
    outstanding_transfers_list->remove(my_data);
    return_err_to_app(my_data->app, err);
}
\end{lstlisting}
If a revoke was in progress during the transfer of the capability, the
revoking core sends a message to the receiver of the transferred
capability asking it to delete it. When the receiver receives this
message, it may or may not have received the capability yet. If it has
already received the capability, it deletes it; otherwise it
establishes some state so that the capability is deleted when it is
later received.

\begin{lstlisting}
delete_transferred_cap_request(coreid_t from, struct data *data) {
    if (cap_exists(data->cte)) {
        remove_cap(data->cte);
        send_delete_transferred_cap_reply(from);
        return;
    }
    struct pending_delete_for_transferred_cap *my_data =
        malloc(sizeof(struct pending_delete_for_transferred_cap));
    pending_delete_for_transferred_cap_initialize(my_data, from, data);
    pending_delete_for_transferred_cap_list->add(my_data);
}

delete_transferred_cap_reply(struct data *data) {
    struct cte *my_data = outstanding_revokes_list->get(data);
    if (!my_data) {
        return;
    }
    my_data->sent_transfer_cap_delete_flag = false;
    if (seen_all_replies(my_data)) {
        outstanding_revokes_list->remove(my_data);
        return_success_to_app();
    }
}
\end{lstlisting}
\subsection{Two phase commit}
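This approach is not yet worked out. As a purely hypothetical sketch
(all helper names are invented, and this is not part of the current
design), a two-phase-commit style revoke would have a coordinator
first ask every core to lock the affected local copies and vote, and
only commit the deletions once all cores have voted yes:

\begin{lstlisting}
// Hypothetical sketch only.
cap_revoke_2pc(struct cte *cte) {
    send_prepare_bcast(cte);    // phase 1: cores lock local copies
                                // and descendants, then vote
    if (all_voted_yes(cte)) {
        send_commit_bcast(cte); // phase 2: delete the locked copies
    } else {
        send_abort_bcast(cte);  // unlock; the operation fails
    }
}
\end{lstlisting}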
\subsection{Sequencer}

A sequencer can be used to order whole operations rather than
individual messages.
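As a hypothetical sketch of how this could look (all names are
invented; this is not part of the current design), each core would
submit an operation to a well-known sequencer core, which assigns
globally unique, monotonically increasing sequence numbers; every core
then executes operations strictly in sequence-number order, buffering
any operation that arrives early:

\begin{lstlisting}
// Hypothetical sketch only.
cap_op_submit(struct cap_op *op) {
    seqno_t n = sequencer_get_seqno(op); // contact sequencer core
    broadcast_op(n, op);
}

op_receive(seqno_t n, struct cap_op *op) {
    buffer_op(n, op);
    while (buffered_op_exists(next_expected_seqno)) {
        execute_op(get_buffered_op(next_expected_seqno));
        next_expected_seqno++;
    }
}
\end{lstlisting}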
\subsection{Comparison}
\note{Compare the approaches.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Implementation details}\label{chap:implementation}

\begin{itemize}
\item Using RPC between app and monitor
\item Using RPC between monitors
\item Everything having ancestors on memserv
\item Optimization: monitor caches root cnodes
\item How to put enum objtype in interface file?
\item Two types of ifdefs: type of caching (none, list, or bits) and
  \ldots
\end{itemize}
\section{Performing capability operations}
If caching is not used, then the application knows for which
operations it must contact the monitor, and it does so directly
without first asking the kernel to try the operation.

If caching is used, then the application should first try to perform
the operation via the kernel. If the capability has remote relations,
the kernel returns an appropriate error to the application, which then
contacts the monitor.
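With caching, the dispatch logic on the application side might look as
follows (a sketch only; the error code and helper names are invented,
not the actual Barrelfish interfaces):

\begin{lstlisting}
// Sketch, assuming caching is used.
cap_op(struct cte *cte) {
    errval_t err = invoke_kernel_op(cte);  // fast path, purely local
    if (err == ERR_HAS_REMOTE_RELATIONS) {
        err = monitor_rpc_op(cte);         // slow path via monitor
    }
    return err;
}
\end{lstlisting}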
\section{Sharing mdb between kernel and monitor}
When the application wants the monitor to perform an operation for it,
it passes the monitor its root CNode and all the required parameters.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Not Yet Discussed}\label{chap:nyd}

Things that I know that I need to discuss.

\begin{itemize}
\item The OS does not guarantee the order in which operations will be
  performed. The user must enforce any ordering herself.
\item Partitioned approach with caching is eager replication.
  Consider \ldots
\end{itemize}
\bibliographystyle{plain}
\bibliography{defs,barrelfish}