Difference between revisions of "User:Schuetzm/scope"

From D Wiki
Jump to: navigation, search
(Owner tracking)
(Owner deduction)
Line 169: Line 169:
  
 
As an optimization, only the owner with the shortest lifetime needs to be kept.
 
As an optimization, only the owner with the shortest lifetime needs to be kept.
 
=== Owner deduction ===
 
 
The compiler sometimes infers the type of a variable automatically. This happens when the type is not specified in a variable declaration, when a functions return type is unspecified, and for templates (IFTI). All three of them work by determining the type of an expression: the right-hand side of an initialization, the returned value(s) in a function, or a value that is passed to a function.
 
 
Type deduction does never add '''scope''' or an owner by itself. It always starts with the exact same type of the expression. Only if '''scope''' is used in the declaration, owner information is added to the variable. (This works in the same way as '''const'''.) In this case, it will also add the owners of the source expression:
 
 
<source lang="D">
 
int a        = 42;
 
/* owner deduction */
 
scope c      = &a;          // scope!a(int*)
 
scope!b d    = &a;          // scope!(a,b)(int*)
 
/* no owner deduction */
 
auto b        = &a;          // int*
 
scope(int*) e = &a;          // scope(int*)
 
                            // alternatively, add owner `a` here
 
 
/* analogously for functions */
 
auto foo(int* a) {          // int*
 
    return a;
 
}
 
scope foo(int* a) {          // scope!a(int*)
 
    return a;
 
}
 
scope foo(int* a, int* b) {  // scope!(a,b)(int*)
 
    if(...)
 
        return a;
 
    else
 
        return b;
 
}
 
</source>
 
 
The last example wouldn't compile according to the current rules for return type deduction, because the types of all return statements need to match exactly<ref>http://dlang.org/function#auto-functions</ref>. The same can be expected to apply to owner deduction, analogously. If possible, this restriction should be lifted to permit the above code. If this isn't deemed acceptable for performance reasons (the compiler only needs to look at the first return in a function to determine the return type and owner), it's probably still not too much trouble for the user to specify only the scope explicitly (without the actual type).
 
 
No further '''scope''' attribute and owner inference in templates is done. It might be considered as an extension, but it needs to be specified carefully depending on how much flow control analysis we want to require compliant compilers to implement.
 
  
 
=== Transitivity ===
 
=== Transitivity ===

Revision as of 18:02, 23 October 2014

What's this about?

Taking addresses of local variables is currently not allowed in @safe code. [1] This is, however, a very broad restriction, that disallows many useful idioms. Examples include slicing of local arrays, passing local structures by reference for efficiency reasons, out parameters, and allocators with limited lifetimes. Additionally, GC avoidance techniques like reference counting and unique/owned objects need some kind of borrowing to work efficiently, to avoid the costs of reference incrementing/decrementing or move semantics, while at the same time still being provably memory safe.

The language already designates the scope concept for that purpose, but as of today it is unimplemented, and the original concept is generally seen as insufficient. This proposal intends to extend the design of scope and define it's semantics to be usable for the above-mentioned purposes.

Lifetimes

An important concept is that of the lifetime. Any object exists for a specific period of time, with a well defined beginning and end point: from the point it is created (constructed), to the point it is released. A reference that points to an object whose lifetime has ended is a dangling reference. Use of such references can cause all kinds of errors, and must therefore be prevented.

Because the lifetimes of actual manually managed objects are complex and unpredictable, a different concept of lifetime is hereby introduced, that only applies to named variables and is based purely on their lexical scope and order of declaration. By the following rules, a hierarchy of lifetimes is defined:

  • A variable's lifetime starts at the point of its declaration, and ends with the lexical scope it is defined in.
  • An (rvalue) expression's lifetime is temporary; it lives till the end of the statement that it appears in. (FIXME: provide a reference)
  • The lifetime of A is higher than that of B, if A appears in a higher scope than B, or if both appear in the same scope, but A comes lexically before B. This matches the order of destruction of local variables.
  • The lifetime of a function parameter is higher than that of that function's local variables, but lower than any variables in higher scopes. (FIXME: relative lifetimes among function parameters == order of destruction)

Ownership and borrowing

Taking a reference to a variable is called borrowing. The original variable is called the owner of the reference.

In @safe code, taking a reference to a local variable will be allowed, but the resulting reference will contain information about the owner which is used by the compiler to decide whether assigning the reference to a particular variable is permissible or not. Assignment in this context means copying the reference, be it by assignment to another variable, constructing another value from it, passing it to a function, returning it from a function, using it in a closure, or by throwing it as an exception. In general, scoped references may only be assigned to variables with lower (= shorter) lifetimes than their lifetime owner.

These restrictions apply only to reference values, i.e. pointers, slices, classes, and ref or out parameters. Non-reference values may be freely copied, because no memory-safety issues can arise. If the contain scoped variables though, these need to be treated as such, of course. (Alternatively, treating references and non-references alike with respect to scope can be considered, too.)

Bare scope

This is the "normal", usual syntax, which will be used in most cases. Examples:

@safe:

int global_var;

void bar(scope(int*) input);

void foo() {
    scope(int*) a;
    a = &global_var;       // OK, `global_var` has higher lifetime than `a`
    scope b = &global_var; // OK, type deduction
    int c;

    if(...) {
        scope x = a;       // OK, copy of reference,`x` has shorter lifetime than `a`
        scope y = &c;      // OK, borrowing
        int z;
        b = &z;            // ERROR: `b` will outlive `z`
        int* d = a;        // ERROR: `d` is unscoped, but `a` is scoped
    }

    bar(a);                // OK, scoped reference is passed to scoped parameter
    bar(&c);               // OK, borrowing
    int* e;
    a = e;                 // OK, implicit conversion to '''scope'''
}

Because we don't know anything about the owner of a bare scoped reference (except that it lives longer than the reference), we have to assume its lifetime to be the same as the reference itself. The owner likely has a longer lifetime, but we need to assume the worst case in order to be on the safe side. It can therefore effectively be treated as "self-owned".

scope with owner(s)

Bare scope is already quite powerful, but for certain things, it still cannot be used. Typical examples are haystack-needle type functions that are passed slices as input (the haystack and the needle), and return a slice of the haystack:

string findSubstring(scope(string) haystack, scope(string) needle) {
    // ... do the actual searching ...
    return haystack[found .. $];  // ERROR: needs to return `string`, not `scope(string)`
}

Changing the return value to `scope(string)` doesn't help, because it is a temporary that could not be stored anywhere, as temporaries' lifetimes are restricted to the current statement. Therefore, returning a bare scope(T) from a function doesn't make sense, and is disallowed, in order to avoid having to specify its behaviour.

The solution is to specify the owner explicity:

scope!haystack(string) findSubstring(scope(string) haystack, scope(string) needle) {
    // ... do the actual searching ...
    return haystack[found .. $];  // OK
}

The signature of findSubstring says: "This function will return a reference to the same object that is referenced by the parameter haystack."

Multiple owners can be specified, too:

scope!(a, b)(string) chooseStringAtRandom(scope(string) a, scope(string) b) {
    return random() % 2 == 0 ? a : b;
}

scope!(identifier1, identifier2, ...) means that assignment (with the meaning as above) from any of the listed identifiers is allowed, and also from anything that specifies all or a subset of the listed identifiers as its owners.

From the caller's point of view:

@safe:

void foo() {
    string[13] text = "Hello, world!";
    scope(string) a;
    a = findSubstring(text, "world");               // OK
    global_string = findSubstring(text, "world");   // ERROR
    string[5] s1 = "Hello", s2 = "world";
    scope(string) b = chooseStringAtRandom(s1, s2); // OK
}

What happens at the call site is that the compiler matches up the owners in the parameter list that refer to other parameters, with the owners actual arguments. `scope!haystack` therefore gets (internally) turned into `scope!text`, as `text` is the variable that gets passed as the parameter `haystack`. When there are multiple owners in the return value, the compiler chooses the one whose matching argument has the shortest lifetime, because it needs to assume the worst case. In our example, this would be `scope!s2`, because `s2`, while living in the same lexical scope as `s1`, is declared later, and is therefore destroyed earlier. For this purpose, `this` in methods is treated as a parameter, and can be referred to by `scope!this`.

ref and out parameters work analogously to return values.

Owners can only be explicitly specified in function signatures. This determines which values they may be assigned inside the function. The lifetimes of parameters are ordered among each other: ref and out have higher lifetimes than value parameters, and the parameters in general have a specified order in which they are destroyed when the function returns, which determines their relative lifetimes.

If an aggregate member is declared is scope, its owner is implicitly the aggregate itself. Only values whose owners are known not to outlive the aggregate may be assigned to it.

Everywhere else, only bare scope can be used.

Implicit conversion and casting

Non-scoped references are implicitly convertible to scope. Also, any scoped reference is implicitly convertible to a scoped reference with an owner with a shorter lifetime. As described above, these checks are done on assignment of a scoped reference.

Because scope is a type modifier, it can be cast away in the normal manner, analogously to const, immutable and shared. In order to avoid accidentally stripping other type modifiers at the same time, a helper function `std.typecons.unscoped()` should be provided.

Owner tracking

All reference expressions (not only scoped ones) internally have a set of owners. This owner is propagated through the various operations that can be applied to a reference:

Operation Description Owners
&a taking a reference only a itself
*a dereferencing owners specified in a', none if it's not a reference
a[n] indexing owners specified in a', none if it's not a reference
a + n pointer arithmetic owners of a
a[], a[n .. m] slicing owners of a
x ? a : b ternary operator union of owners of a and b
cast(T) a casting owners of a
a anything else owners of a

This effectively tracks the origin of a pointer in an expression. The internal owner set is, however, only used for verifying the lifetimes on assignment. After a value has been assigned, it's owner information is dropped. This is in analogy to uniqueness tracking and value range propagation, which work on an expression level, too.

As an optimization, only the owner with the shortest lifetime needs to be kept.

Transitivity

scope is not transitive. This means that references reachable through a scoped reference are not automatically scoped. There are multiple reasons for this:

  • Ownership is usually defined by the implementer of a type. For example, a structure or class is written in a way that either assumes that it owns an embedded reference, and therefore needs to manage it by itself (be it via manual memory management, a unique type, reference counting, or the garbage collector), or it doesn't, and leaves management to its users. It is therefore not really possible for a type's user to change that decision. What the user can decide, however, is the ownership of instances of the type itself. Therefore, transitivity doesn't make any sense.
  • Besides that, it would probably be really complicated to even define the exact semantics, let alone to implement them. For example, it might involves unions of disjunct lifetimes.
  • There is also no known great advantage of making scope transitive. This is in contrast to const and shared, where we can prove many interesting properties about functions and types because of transitivity.
  • Even if such an advantage should be discovered in the future, introspection will make it possible to detect whether a given type is actually transitively scoped, simply by following all the referenced types.

Overloading

Functions can be overloaded on scope. They will occupy an additional level of matching between exact match and match with conversion to const[2] (FIXME: to be debated). In @safe code, the scoped version are however to be preferred. This is to allow for providing overloads of existing functions in Phobos that can either return scoped or non-scoped references in a backward compatible way. Ideally, there would be support for overloading based on the scope of the returned values, but this is probably too complicated to implement.

It is debatable whether there needs to be more fine-grained matching among scoped versions, depending on the lifetimes of the individual owners. To simplify implementation, such a feature is probably expendable.

Optional enhancements

scope!(const ...)

An interesting extension of the concept of borrowing is constant borrowing: as long as there are borrowed references to a variable, that variable is treated as const. Casually spoken, "while you've given something away, you can not change it". The syntax would be as the above scope with owners, but with some of the identifiers prefixed by the keyword const.

To show why this can be useful, consider the as of yet unsolved problem of the so called transient ranges. [3][4] These are ranges which return references to data that can change when popFront() is called on them. The most well-known instance of those is std.stdio.File.byLine, which reuses the same internal buffer for each line that is read.

To my knowledge, only unsatisfying ad-hoc workarounds like byLineCopy and manually duping the individual lines have been proposed as of yet. But consider this:

struct ByLineImpl(Char, Terminator) {
private:
    Char[] line;
    // ...

public:
    // - return value must not outlive `this` (i.e. the range)
    // - as long as the return value exists, `this` will be const
    @property scope!(const this)(Char[]) front() const {
        return line;
    }
    void popFront() { // not `const`, of course
        // ...
    }
    // ...
}

void main() {
    alias Line = const(char)[];
    auto byline = stdin.byLine();
    foreach(line; byline) {
        write(line); // OK, `write` takes its parameters as scope
        // (assuming the widespread use of scope throughout Phobos)
    }
    Line[] lines;
    foreach(line; byline) {
        lines ~= line;
        // ERROR: cannot assign owned `line` to unowned `lines`
    }
    // let's try to work around it:
    scope!(byline)(Line)[] olines;
    foreach(line; byline) {
        olines ~= line;
        // ERROR: cannot assign `line` to `lines`, because owners don't match (const vs. non-const)
    }
    // next try:
    scope!(const byline)(Line)[] clines;
    foreach(line; byline) {     // ERROR: `byline` is const
        clines ~= line;
    }
    // => nope, won't work
    // another example, to show how it works:
    auto tmp = byline.front;    // OK
    // `byline` is const as long as `tmp` exists
    write(byline.front);        // OK, `front` is const
    byline.popFront();          // ERROR: `byline` is const
}

Here, the compiler matches up `this` with `byline` as described above. `byline` is then treated as const for the lifetime of the variables that refer to the individual lines. In other words, as long as a variable with the owner `const ident` exists, `ident` will be const.

Const borrowing is also a requirement to implement safe borrowing for types that implement move semantics, because moving can also occur across different levels of the ownership hierarchy that is used for borrowing.

Automatic borrowing for pure functions

In many cases, bare scope suffices, and scope with owner is not needed. There is, however, an opportunity to allow borrowing without any explicit annotations at all, not even bare scope. When we're dealing with pure functions, we can sometimes guarantee that a borrowed reference cannot escape the function. pure functions cannot access global variables directly, so the only way a reference can come out of a pure function is by its return value or by one of its parameters. By looking at the function signature, we can in many cases proof that this cannot happen:

void foo(int[] p) pure;         // the function has no opportunity to keep a reference to `p`
int bar(int[] p) pure;          // returns an `int` but that's a value type, so it's ok
int[] baz(const(int)[] p) pure; // the return type is not `const` and thus cannot come from `p`

This is a nice opportunity to reduce the need for explicit annotations, which are seen as burdensome by the users. Because purity is inferred for template functions, the opportunities for application of this technique are increased.

Allow references to locals in @safe code

Currently, @safe code is not allowed to take the addresses of local variables or parameters. This restriction could be lifted. Instead, the address operator & would return a scoped expression with its operand as the owner. The scope attribute then guarantees that the resulting pointer will not be escaped or outlive its owner.

@safe functions can the use local buffers to avoid GC allocation (or manual memory management), without having to resort to @trusted code.

This change will not break existing code, because it only allows certain operations that were previously disallowed.

Array literals

For safety reasons, array literals are usually allocated on the GC by the compiler. This happens implicitly, and is a hard-to-detect source of GC allocations. With scope, the compiler can safely place such literals onto the stack, so that no GC memory allocation is necessary. If they aren't mutable and their values are know at compile time, they can also be placed into a read-only section of the executable.

Additionally, array literals can then be used together with @nogc.

scope inference in templates

For template functions (and maybe nested functions) the compiler might try to infer the scope of their arguments, including this. To avoid having to apply control flow analysis, a conservative method can be applied, that only checks whether a particular parameters can ever be assigned (or passed) as non-scope.

Care must however be taken because this change might break code when the return value or ref parameters are inferred as scoped.

scope for non-references

The concept of borrowing may be of use for resources other than memory. For example, a file handle can only be used while it is open. With the proposal as described here, this can only be achieved by borrowing a reference to the file handle, which implies an unnecessary indirection (file handles are simple integer values on most platforms, and can simply be copied).

The proposal can probably be modified without trouble to allow scoped values, too. This might even simplify the implementation, because then references don't need special treatment.

Another application would be an efficient implementation of reference counting.

rvalue references with scope ref

Because temporaries exist only for the duration of the statement they appear in, it is currently not allowed pass them by reference, because doing so would be unsafe. Using scope ref or in ref, it will be possible to pass them safely. This makes it possible for functions to accept lvalues and rvalues alike without resorting to auto ref, which again reduces template bloat.

(There are already several other proposals related to rvalue references: DIP25, DIP35, DIP36, DIP38, DIP39.)

Applications

Garbage collection has many well-known disadvantages, including unpredictable pauses (which are unacceptable in code with low-latency requirements), higher memory usage (unused memory is not released immediately), and unreliable destruction of objects (destructors aren't guaranteed to be called and are restricted in what they are allowed to do). Therefore, developers use various techniques to avoid the GC, many of which however have high performance overhead, or are inherently unsafe.

scope allows these techniques to be used safely and efficiently, by abstracting away the underlying owner and memory management strategy, similar to how const makes it possible to write functions that work independently of the actual mutability of their arguments. This makes it unnecessary to use templates in many cases, which would otherwise be required in order to support the different wrapper types that would have to be used, and thereby reduces the template bloat problem.

Owned types

These types are often wrappers around other types, and they "own" the memory that their object lives in. An example is std.typecons.Unique. When a variable of this type goes out of scope, it calls the contained type's destructor and releases its memory.

For this to be safe, the wrapper needs to make sure that it owns the only reference to its managed object. This is achieved by allocating and constructing the object inside the wrapper's constructor, destroying and releasing it when it goes out of scope, disallow copying, and to make sure that no reference to the objects can ever escape. (Sometimes there is a way to construct the wrapper from an existing object, but the user is then responsible for ensuring uniqueness.)

The restriction that no reference may escape is often impossible to guarantee, especially if the managed objects has methods. One possible workaround is move semantics, where the ownership of the managed object is transferred to a different instance of the wrapper, and the original instance is left in an unusable state (e.g. nulling the pointer to the object). It is however very tedious and expensive to move objects back and forth for each function call, and it requires the functions to have special support for it, i.e. it cannot be used with most of the functions in the standard library or other general-purpose libraries.

With scope, it is possible to implement most of the requirements safely and efficiently:

struct Owned(T)
if(is(T == class)) // only for classes, for simplicity
{
private:

    T _owned;

public:

    this(Args...)(Args args) {
        // instead of using `new`, the object could also be emplaced
        this._owned = new T(args);
    }

    @disable this(this);

    ~this() {
        if(this._owned)
            delete this._owned;
        this._owned = null;
    }

    private @property scope!this(T) borrow() {
        return this._owned;
    }
    alias borrow this;
}

In this example, Owned!T implicitly converts to its payload type because of the alias this, but will then be scoped to the wrapper variable. This means, that it cannot be stored in variables that live longer than the wrapper, and that it's also not possible to call methods on it that aren't explicitly marked as scope. This guarantees that no methods can escape references either.

Reference counting

This is similar in requirements to owned types, except that the wrappers can be copied, in which case a reference count is incremented. When the wrapper goes out of scope, it decrements the reference counter, and only destroys and releases its object if the reference counter drops to 0. An example is std.typecons.RefCounted.

Reference counted objects can be passed to other functions, but this has additional costs: Again, the functions need to accept the wrapper type in the first place, and the reference counter needs to be incremented and decremented. This is expensive, because the wrapper has twice the size (it needs an additional pointer to the reference count), which can cause spilling of registers, the reference count often resides in a different location in memory than the managed object itself (bad for the cache), and the decrementing also needs to check whether the object needs to be released, which is a conditional branch that can stall the processor's pipeline. For shared objects, atomic operations are required, which are expensive, too. In addition, reference counting needs to be exception safe: If the called function can throw, the call needs to be wrapped in an exception handler to ensure the reference is properly decremented. This, and the fact that the decrementing needs to happen at all will usually prevent certain optimizations, e.g. tail-call optimization, elision of stack frames, etc.

However, in many cases this cost is completely unnecessary, as most functions that are called will not keep a reference to the object around after they return. To detect this situation, the compiler would however have to do complex flow analysis, usually also whole program optimization. With scope, it is again easy to achieve efficient passing by implementing implicit borrowing as above.

In some cases, a function may want to make a copy, and therefore needs to accept the wrapper struct, instead of borrowing the payload. However, for passing that wrapper, the refcount still does not need to be incremented. To achieve that, the RC wrapper itself can be borrowed. We can then define scoped overloads of its postblit and destructor that don't update the reference count. Incomplete example, showing just the important bits (requires scope with value types):

struct RC(T) {
private:
    T _payload;

public:
    ~this() {
        // decrement refcount, destroy if necessary
    }

    this(scope this) {
        // increment refcount
    }

    ~this() scope {
        // do NOT decrement refcount
    }

    this(scope this) scope {
        // do NOT increment refcount
    }

    scope!(const this) T borrow() {
        return _payload;
    }

    alias borrow this;
}

Region allocators

A region allocator allocates memory that is to be used only in limited region of a program. When the region is left, all memory can be released efficiently at once. It is evident that again, no references to the allocated objects may outlive the region itself.

struct RegionAllocator {
private:
    // ...

public:
    @disable this(this);

    ~this() {
        // destroy objects and release all memory
    }

    scope!this(T*) make(T, Args...)(Args args) {
        // allocate memory, construct an object, and return it
    }
}

The objects returned by the region allocator can not be stored in a variable that outlives the allocator itself. Therefore, no dangling pointers can be created.

GC zones

For achieving good GC performance, it is essential to restrict the amount of memory a GC has to scan, as well as the number of threads it has to stop. Ownership annotations might be utilized to provide information about the places where the GC has to look for pointers, although it isn't clear whether they are sufficient for this purpose.

Move semantics

Move semantics is useful in combination with unique and reference-counted types. It allows transfer of ownership of the wrapper's payload to a different wrapper, either by assigning to an existing wrapper, or by constructing a new one from it. Because the new owner can be shorter lived than the old one, it reduces the actual lifetime of the payload. The concept of borrowing proposed herein cannot model this, because it is based on the syntactical hierarchy. (The underlying cause is that move semantics is actually a kind of manual ownership management.)

Example:

struct Owned(T) if(is(T == class)) {
private:
    T payload;

public:
    this(@unique T object) {
        payload = object;
    }

    ~this() {
        if(payload)
            delete payload;
    }

    @disable this(this);

    @unique T release() {
        T tmp = payload;
        payload = null;
        return tmp;
    }

    scope!this(inout(T)) borrow() inout {
        return payload;
    }

    alias borrow this;
}

Owned!MyClass my_object = new MyClass();
scope MyClass my_borrowed_object = my_object;
{
    Owned!MyClass my_other_object = my_object.release();
    // the original object has been moved into `my_other_object`
}   // `my_other_object` goes out of scope and destroys its payload
my_borrowed_object.do_something(); // whoops

(I'm using the hypothetical @unique here to signify that a parameter may only accept a unique expression, or that a function returns one. This is to make the example implementation nicer, and is not really necessary to show my point.)

As can be seen, this can produce dangling references. However, with const borrowing, this can be avoided:

struct Owned(T) if(is(T == class)) {
    // as above, but with const borrowing
    scope!(const this)(inout(T)) borrow() inout {
        return payload;
    }
}

Owned!MyClass my_object = new MyClass();
scope MyClass my_borrowed_object = my_object;
{
    // ERROR: cannot call non-const `release()` on const `my_object`
    Owned!MyClass my_other_object = my_object.release();
}

Implementation

Compiler

Implementation of this feature is possible without doing flow control or interprocedural analysis. All checks that need to be made can be done locally at the assignment or call site. The same is true for matching owners to parameters.

Considerations for the implementation

  • When multiple owners are present, the resulting lifetime is their intersection (= minimum, because the owners are in a hierarchy). During evaluation of an expression, these owners are known, so the compiler may (as an optimization) propagate only the owner with the shortest lifetime through the expression. This will simplify the implementation and make it more efficient. In function signatures however, the owners may refer to parameters, so multiple owners need to be supported there. (This suggestion as such is incompatible with the scope!(const ...) extension, but it is still valid for non-const owners.)
  • To make handling of bare-scoped and owner-scoped values more uniform, the former can be treated as having themselves as their owner.

Phobos & libraries

For scope to be maximally useful, the functions in the standard library need to be annotated with scope wherever necessary. As much of Phobos consists of templates, this may however not be as much work as it seems at first glance, because attribute inference can do this automatically in many cases (especially because purity is already inferred for templates). For the remaining cases, owner deduction can help, too.

There is a certain potential for breakage when function are changed to return a scoped value, that previous returned an unscoped one. This value will not be assignable to non-scoped variables. Because of the participation of scope in overloading, there can be additional overloads that return scoped values. However, the functions that are affected are unsafe and may well be worth deprecating. (The language might also support this by allowing assignment of scoped return values to non-scoped variables for a transition period, but printing a deprecation warning.)

References