Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge recent dev to main #2063

Merged
merged 6,092 commits into from
Mar 20, 2024
Merged

Merge recent dev to main #2063

merged 6,092 commits into from
Mar 20, 2024

Conversation

bsdjhb
Copy link
Collaborator

@bsdjhb bsdjhb commented Mar 20, 2024

This stops just before the ABI change for many arguments but is post all of the FreeBSD merges. It includes all of the async revocation changes, though not the commit to enable it by default. More recent c18n changes in dev are not yet in this merge.

kevans91 and others added 30 commits March 14, 2024 10:09
Previously we used a mix of perror(3) + exit(3) and err(3); standardize
on the latter instead.  This does remove one free() in an error path,
because we're decidedly leaking a lot more than just the loader name
there (loader handle, vcpu, vmctx...) anyways.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D43331
Only try sending more data on pure ACKs when there is
more data available in the send buffer.

In the case of a retransmitted SYN not being sent due to
an internal error, the snd_una/snd_nxt accounting could
be off, leading to a panic. Pulling snd_nxt up to snd_una
prevents this from happening.

Reported by:           [email protected]
Reviewed by:           cc, tuexen, #transport
MFC after:             1 week
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43343
PRR state was not properly reset on subsequent ECN CE
events. Clean up after local transmission failures too.

Reviewed by:           tuexen, cc, #transport
MFC after:             3 days
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43170
Reviewed by:	markj, melifaro
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D43368
PR:		253069
MFC after:	1 week
Sponsored by:	Chelsio Communications
Reviewed by:	brooks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D43371
This matches what Linux does.

Reviewed by:		melifaro, tuexen
Differential Revision:	https://reviews.freebsd.org/D43366
This is a quick plug to fix panic with Netlink which has protocol specific
buffers.  Note that PF_UNIX/SOCK_DGRAM, which also has its own buffers,
avoids the panic due to being SOCK_DGRAM.  A correct but more complicated
fix that needs to be done is to merge pr_shutdown, pr_flush and dom_dispose
into one protocol method that may call sorflush for generic sockets or do
their own stuff for protocol which has own buffers.

Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43367
Reported-by: [email protected]
In the case of hostbase_fd, this is infact a bug fix; we have a seek
callback that the host: filesystem may use in loader, and we really
don't have a good excuse to break it.

bootfd-derived fds will only be used with fdlopen(3) and rtld doesn't
seem to need pread / lseek at all for it today, but there's no reason to
break if it finds a good reason to later.

Suggested by:	markj
CC/CXX/CPP/LD may all have arguments supplied in various circumstances,
which break the logic here.  We only need to determine which of these
tools we're expecting to invoke from PATH, which just requires
examination of the first word.  Limit our scope to exactly that.

Patch suggested by:	jrtc27
Reviewed by:		imp, jrtc27
Differential Revision:	https://reviews.freebsd.org/D43372
I forgot to do it when making the commit, so hat-tip to asomers@

Reported by:	asomers@
Fixes:		fbbdfa2 (nfsv4(4): mention the nfsv4_server_only..)
MFC with:	fbbdfa2
In belatedly fixing a mistake made in fbbdfa2, I noticed that igor
and mandoc -Tlint had a few more things to say.

As such, I'm reflowing a few lines and fixing a contraction.

MFC with:	fbbdfa2
The change of argument for sizeof() (from a type to an object) is to be
consistent with the change done for the malloc() code just above in the
preceding commit touching this file.

Consider bit flags as integers and test whether they are set with an
explicit comparison with 0.

Use an explicit flag value (PTHREAD_SCOPE_SYSTEM) in place of a variable
that has this value at point of substitution.

All other changes are straightforward.

Suggested by:           kib
Reviewed by:            kib
Approved by:            emaste (mentor)
MFC after:              2 weeks
Differential Revision:  https://reviews.freebsd.org/D43327
Using calloc() instead of malloc() is useless here since the allocated
memory is to be wholly crushed by the memcpy() call that follows.

Suggested by:           kib
Reviewed by:            emaste, kib
Approved by:            emaste (mentor)
MFC after:              2 weeks
Differential Revision:  https://reviews.freebsd.org/D43328
Issue data claim command after every chunk of data programmed,
so we can reuse the SVC buffer for the next chunk.

Tested on Terasic DE10 Pro.

Sponsored by: UKRI
The standard is somewhat unclear, but on the balance, I believe that the
phrase “the rest of the input line” should be interpreted to mean the
rest of the input line including the terminating newline if and only if
there is one.  This means the current implementation is incorrect on two
points:

- First, it suppresses the previous line's newline in the '1' case.

- Second, it unconditionally emits a newline at the end of the output
  for non-empty input, even if the input did not end with a newline.

Resolve this by rewriting the main loop.  Instead of special-casing the
first line and then assuming that every line ends with a newline, we
remember how each line ends and emit that either at the beginning of
the next line or at the end of the file except in the one case ('+')
where the standard explicitly says not to.

While here, try to reduce diff to upstream a little and update their
RCS tag to reflect the fact that while we've diverged significantly
from them, we've incorporated all their changes.  Remove the useless
second RCS tag.

We also update the tests to account for the change in interpretation
of the '1' case and add a test case for unterminated input.

MFC after:	1 week
Sponsored by:	Klara, Inc.
Reviewed by:	kevans
Differential Revision:	https://reviews.freebsd.org/D43326
This allows writing setup scripts that contain lines starting with
"#!", e.g., a shebang when creating a shell script using cat:

    #!/bin/sh
    echo "Populate rc.local"
    cat >/etc/rc.local<<EOF
    #!/bin/sh
    echo booted | logger -s -t 'example'
    EOF

Prevent accidentally running a setup script left behind by a
previous invocation of bsdinstall.

Reviewed by:	imp, jrtc27
Differential Revision:	https://reviews.freebsd.org/D43350
Existing powerpc kernels include additional sections beyond .dynamic
in the PT_DYNAMIC segment.  Relax the requirement for an exact size
match of the section and segment for PowerPC files as a workaround.

Reported by:	jrtc27
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D43123
These loops already handled a NULL return from mbufq_dequeue when the
queue was empty, so remove a redundant check of mbufq_len before
dequeueing.

Reviewed by:	bz
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43336
Complement to the existing mbufq_full

Reviewed by:	bz
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43337
Reviewed by:	bz, emaste
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43338
…ffer

memdesc_alloc_ext_mbufs constructs a chain of external (M_EXT or
M_EXTPG) mbufs backed by a data buffer described by a memory
descriptor.

Since memory descriptors are not an actual buffer just a description
of a buffer, the caller is required to supply a couple of helper
routines to manage allocation of the raw mbufs and associating them
with a reference to the underlying buffer.

Reviewed by:	markj
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D42933
In particular, don't reserve resources added by drivers via other
means (e.g. acpi_bus_alloc_gas which calls bus_alloc_resource
right after adding the resource).

The intention of reserved resources is to ensure that a resource range
that a bus driver knows is assigned to a device is reserved by the
system even if no driver is attached to the device.  This prevents
other "wildcard" resource requests from conflicting with these
resources.  For ACPI, the only resources the bus driver knows about
for unattached devices are the resources returned from _CRS.  All of
these resources are already reserved now via acpi_reserve_resources
called from acpi_probe_children.

As such, remove the logic from acpi_set_resource to try to reserve
resources when they are set.  This permits RF_SHAREABLE to work with
acpi_bus_alloc_gas without requiring hacks like the current one for
CPU device resources in acpi_set_resource.

Reported by:	gallatin (RF_SHAREABLE not working)
Diagnosed by:	jrtc27
…sind=1)

This applies the fix to powerpc's pmap as was done in commit aa3bcaa
and d0941ed for amd64 and riscv pmaps, respectively.

Reported by:    Jenkins
Reviewed by:	bojan.novkovic_fer.hr, markj
Fixes: e407849
Differential Revision:	https://reviews.freebsd.org/D43339
Refactoring of argument list to nl_send_one() led to derefercing
wrong union member.  Rename nl_send_one() to a more generic name,
isolate anew nl_send_one() as the callback only for the normal
writer and provide correct argument to nl_send() from nl_send_group().

Fixes:	ff5ad90
Work around vendors who use the same address for multiple
ReadAckRegisters in their ACPI HEST table.  This
allows apei to attach cleanly on Ampere Altra servers.
Note the issue is not specific to Ampere, I've run into
it with at least one other vendor (whose server is not
yet released).

Sponsored by: Netflix
Reviewed by: jhb
The logic analyzer in the T6 CIM block has a different capture size than
previous chips.

MFC after:	1 week
Sponsored by:	Chelsio Communications
The global hw.em.rx_process_limit knob has been replaced by the device-
specific dev.IF.N.iflib.rx_budget along with the conversion to iflib(4).

While at it, remove the - besides initialization of tx_process_limit -
unused {r,t}x_process_limit members.
markjdb and others added 26 commits March 15, 2024 12:52
If in the future it becomes useful to plumb flags through all of these layers,
we can add a field to struct vm_cheri_revoke_cookie.

No functional change intended.
- Sync the syscall names and parameters with reality.
- Use style(9) for code examples.
- Change the wording in a couple of places where I found the text a bit
  unclear.
- Remove the $FreeBSD$ tag.
Most of these flags had no effect, and they weren't documented and
weren't used by any code in the base system.  Remove them so that it's a
bit easier to decide how async caprevoke scanning will fit into the
revocation state machine.

Avoid shuffling the values to avoid binary compatibility problems.
To handle CHERI_REVOKE_IGNORE_START, we use the kernel's notion of the
current epoch rather than userspace's.  Be sure to acquire the VM map lock
first though, to avoid races which make the code harder to reason about.
I suspect that this was done to handle the possibility of td != curthread,
since otherwise the vmspace stays live by virtue of being used by the
currently executing thread.  However, in practice, td == curthread, and
some parts of the revocation scan assume that it is operating on the current
vmspace, so acquiring a reference is not sufficient.

Thus, simply stop acquiring a reference.  This gives us less work to do in
cleanup paths.
As archaic as it may seem, I'm quite used to being able to find a function
definition with a search for "^free".
quarantine_flush() both invokes cheri_revoke() and pushes formerly
quarantined memory into the backing allocator.  With an asynchronous
revocation scan, we need to defer the latter operation, so split it out into
a separate function.

Use this opportunity to deindent the inner loop and reduce ifdefery by
inverting a test.

No functional change intended.
The new probes give us an easy way to compare time spent in cheri_revoke()
with time spent flushing quarantined memory.
mips is gone.  No functional change intended.
We should never be operating on a system map, so assert that every time
kern_cheri_revoke() is called.
No functional change intended.
While the intention behind it is good, it is not currently needed and
complicates control flow in the implementation of asynchronous revocation
passes in kern_cheri_revoke(), which is already too complicated.  Let's
remove it for now.

No functional change intended.
This helps keep caprevoke state machine logic internal to
kern_cheri_revoke.c.

No functional change intended.
Previously, cheri_revoke() was always synchronous.  Because it scans the
entire mapped address space of the process, it can take a very long time to
complete.  This work is foisted onto an unsuspecting thread which happens to
call into the allocator at the wrong time, so can result in undesirable
stalls.  Thus it becomes desirable to support asynchronous revocation.

Asynchronous revocation can be implemented in userspace, but this has some
complications:
- Creating threads from libc is somewhat tricky.
- We may wish to support multiple processes sharing an address space, in
  which case it's easier to keep track of async revocation state in the
  kernel.
- Before the vmspace scan, we have to stop the process anyway, and it
  doesn't make a lot of sense to offload that to a separate thread in a
  single-threaded process.

Asynchonous revocation in the kernel has the downside of not handling the
work of flushing previously quarantined arenas and updating the bitmap.
This work can be substantial.  However, we can easily rate-limit it, and
it's still much less expensive than scanning in general.

One other downside of this approach is increased lock contention, since the
caprevoke scan holds the VM map lock for long periods.  Though, this is
otherwise already a problem in multi-threaded processes.

This implementation adds a new cheri_revoke(2) flag, CHERI_REVOKE_ASYNC,
which is mutually exclusive with CHERI_REVOKE_LAST_PASS.  When set, and no
epoch is open, it behaves like cheri_revoke() with no flags set up until it
comes time to scan the vmspace.  Then, a request is submitted to a kernel
worker process, which calls vm_cheri_revoke_pass() on the submitter's
behalf.  A subsequent call to cheri_revoke(2) with CHERI_REVOKE_ASYNC set
will check the status of the scan and update the epoch accordingly if the
scan was successful.  If it was not successful, a new request is submitted
in the hopes that the failure was transient.

Some new fields are added to the VM map:
- vm_cheri_async_revoke_st maintains the current asynchronous revocation
  state.  It borrows the states from cheri_revoke_state_t; comments in
  revoke.h explain the state machine.
- vm_cheri_async_revoke_status stores the return value of
  vm_cheri_revoke_pass().  Its value is valid only when the async state is
  CHERI_REVOKE_ST_CLOSING.
- vm_cheri_async_revoke_shadow stores a capability for the userspace shadow
  bitmap, since the worker process cannot easily derive this capability.
All are synchronized by the VM map lock.
Add a new environment variable, _RUNTIME_REVOCATION_ASYNC_REVOKE, which,
when set, enables the use of asynchronous revocation scans in the kernel.
By default it is not set.

When asynchronous revocation is configured, MRS maintains at least two
application quarantine arenas, plus a list of available arenas (i.e., those
whose contents have been revoked and drained) and a list of arenas currently
undergoing revocation.  When the active application arena triggers
revocation, it is swapped with a free arena and
cheri_revoke(CHERI_REVOKE_ASYNC) is called to begin revocation.  Subsequent
calls poll the current revocation state and conditionally drain arenas as
they become eligible.

malloc_revoke() is implemented synchronously - it still needs to know
whether MRS is running in async mode or not in order to figure out which
arenas to revoke.

This scheme is simple but has some caveats which need thought:
- There is currently no backpressure, i.e., an application can in principle
  free memory faster than the revoker can scan, and thus it's possible for
  the application to accumulate an unbounded amount of quarantined memory.
  We should consider the conditions under which we might want to fall back
  to a blocking cheri_revoke() system call.
- Application threads are still responsible for flushing previously
  quarantined memory.  This can take a long time.  We could perhaps divide
  this into smaller amounts of work.
- snmalloc integration isn't implemented yet, as I don't know how it works.
  In particular, I'm not sure when to call snmalloc_flush_message_queue().
- Once all of the applications arenas "fill up", threads will acquire
  the app quarantine mutex upon each call to malloc() etc., which of
  course would scale poorly.  We can make this condition harder to reach
  by providing more arenas, effectively increasing the amount of memory
  that we're allowed to keep in quarantine.  Since mrs_free() already
  has to acquire the mutex upon each call, this problem is not the
  highest priority.
Following the example of other VM subsystems, put some counters under vm.stats
so that it's easy to tell at a glance whether, for example, vm_fault() has
returned errors.  We'll likely add more counters over time.  This is pretty
rudimentary but doesn't add much overhead and can be helpful when digging into
sources of anomalous behaviour.
The concern which motivated this check is the possibility of a map entry
upgrading its protections to include READ_CAP, and thus getting missed
by the revoker.  However, the revocation scan only skips entries with
READ_CAP in the max protections.

In practice, vm_map_protect() is still blocked during a scan because it
waits for the map to be unbusied.
Akin to other revocation policy sysctls, this one sets a flag which is
used by MRS to decide whether to use asynchronous revocation.  In
non-setuid/gid processes, the _RUNTIME_REVOCATION_SYNC_REVOKE and
_RUNTIME_REVOCATION_ASYNC_REVOKE environment variables can be used to
override the global policy.
It is unnecessary, and allows us to avoid having to revoke capabilities
in the would-be-mapped pages.
This catches a bug that arose in an early implementation of the
asynchronous revocation mode.
@bsdjhb bsdjhb requested review from brooksdavis and gvnn3 March 20, 2024 17:26
Copy link
Member

@brooksdavis brooksdavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a reasonable point to merge

@bsdjhb bsdjhb merged commit 9322a79 into CTSRD-CHERI:main Mar 20, 2024
33 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.