The update descriptor is used to store in flat memory a large chunk of
staging data used to update descriptor sets through templates. It
provides a push interface to easily insert descriptors following the
current pipeline. The order used in the descriptor update template has
to be implicitly followed. We can catch bugs here using validation
layers.
The stream buffer before this commit once it was full (no more bytes to
write before looping) waiting for all previous operations to finish.
This was a temporary solution and had a noticeable performance penalty
in performance (from what a profiler showed).
To avoid this mark with fences usages of the stream buffer and once it
loops wait for them to be signaled. On average this will never wait.
Each fence knows where its usage finishes, resulting in a non-paged
stream buffer.
On the other side, the buffer cache is reimplemented using the generic
buffer cache. It makes use of the staging buffer pool and the new
stream buffer.
* Allocate memory in discrete exponentially increasing chunks until the
128 MiB threshold. Allocations larger thant that increase linearly by
256 MiB (depending on the required size). This allows to use small
allocations for small resources.
* Move memory maps to a RAII abstraction. To optimize for debugging
tools (like RenderDoc) users will map/unmap on usage. If this ever
becomes a noticeable overhead (from my profiling it doesn't) we can
transparently move to persistent memory maps without harming the API,
getting optimal performance for both gameplay and debugging.
* Improve messages on exceptional situations.
* Fix typos "requeriments" -> "requirements".
* Small style changes.
MakeCurrent is a costly (according to Nsight's profiler it takes a tenth
of a millisecond to complete), and we don't have a reason to call it
because:
- Qt no longer signals a warning if it's not called
- yuzu no longer supports macOS
Create a large descriptor pool where we allocate all our descriptors
from. It has to be wide enough to support any pipeline, hence its large
numbers.
If the descritor pool is filled, we allocate more memory at that moment.
This way we can take advantage of permissive drivers like Nvidia's that
allocate more descriptors than what the spec requires.
This commit introduces a mechanism by which shader IR code can be
amended and extended. This useful for track algorithms where certain
information can derived from before the track such as indexes to array
samplers.
This function is called rarely and blocks quite often for a long time.
So don't waste power and let the CPU sleep.
This might also increase the performance as the other cores might be allowed to clock higher.
The job of this abstraction is to provide staging buffers for temporary
operations. Think of image uploads or buffer uploads to device memory.
It automatically deletes unused buffers.
This object's job is to contain an image and manage its transitions.
Since Nvidia hardware doesn't know what a transition is but Vulkan
requires them anyway, we have to state track image subresources
individually.
To avoid the overhead of tracking each subresource in images with many
subresources (think of cubemap arrays with several mipmaps), this commit
tracks when subresources have diverged. As long as this doesn't happen
we can check the state of the first subresource (that will be shared
with all subresources) and update accordingly.
Image transitions are deferred to the scheduler command buffer.
The intention behind this hasheable structure is to describe the state
of fixed function pipeline state that gets compiled to a single graphics
pipeline state object. This is all dynamic state in OpenGL but Vulkan
wants it in an immutable state, even if hardware can edit it freely.
In this commit the structure is defined in an optimized state (it uses
booleans, has paddings and many data entries that can be packed to
single integers). This is intentional as an initial implementation that
is easier to debug, implement and review. It will be optimized in later
stages, or it might change if Vulkan gets more dynamic states.
This commit adds a series of HLE methods for handling 3D textures in
general. This helps games that generate 3D textures on every frame and
may reduce loading times for certain games.
Remove false commentary. Not dividing by 4 the size of shared memory is
not a hack; it describes the number of integers, not bytes.
While we are at it sort the generated code to put preprocessor lines on
the top.
ExprCondCode visit implements the generic Visit. Use this instead of
that one.
As an intended side effect this fixes unwritten memory usages in cases
when a negation of a condition code is used.
This allows us to put VKFenceWatch inside a std::vector without storing
it in heap. On move we have to signal the fences where the new protected
resource is, adding some overhead.
VK_NV_device_diagnostic_checkpoints allows us to push data to a Vulkan
queue and then query it even after a device loss. This allows us to push
the current pipeline object and see what was the call that killed the
device.
Some games write from fragment shaders to an unexistant framebuffer
attachment or they don't write to one when it exists in the framebuffer.
Fix this by skipping writes or adding zeroes.
RASTERIZE_ENABLE is the opposite of GL_RASTERIZER_DISCARD. Implement it
naturally using this.
NVN games expect rasterize to be enabled by default, reflect that in our
initial GPU state.
LDG can load single bytes instead of full integers or packs of integers.
These have the advantage of loading bytes that are not aligned to 4
bytes.
To emulate these this commit gets the byte being referenced (by doing
"address & 3" and then using that to extract the byte from the loaded
integer:
result = bitfieldExtract(loaded_integer, (address % 4) * 8, 8)
I2F's byte selector is used to choose what bytes to convert to float.
e.g. if the input is 0xaabbccdd and the selector is ".B3" it will
convert 0xaa. The default (when it's not shown in nvdisasm) is ".B0", in
that example the default would convert 0xdd to float.
When a image format mismatches we were inserting zeroes to the texture
itself. This was not handling cases were the mismatch uses less
coordinates than the guest shader code. Address that by resizing the
vector.
These shaders are used to specify code that is not dynamically generated
in the Vulkan backend. Instead of packing it inside the build system,
it's manually built and copied to the C++ file to avoid adding
unnecessary build time dependencies.
quad_array should be dropped in the future since it can be emulated with
a memory pool generated from the CPU.
Add an extra argument to query device capabilities in the future. The
intention behind this is to use native quads, quad strips, line loops
and polygons if these are released for Vulkan.
The OpenGL spec defines GL_CLAMP's formula similarly to CLAMP_TO_EDGE
and CLAMP_TO_BORDER depending on the filter mode used. It doesn't
exactly behave like this, but it's the closest we can get with what
Vulkan offers without emulating it by injecting shader code.
Introduce a worker thread approach for delegating Vulkan work derived
from dxvk's approach. https://github.com/doitsujin/dxvk
Now that the scheduler is what handles all Vulkan work related to
command streaming, store state tracking in itself. This way we can know
when to reupload Vulkan dynamic state to the queue (since this one is
invalidated between command buffers unlike NVN). We can also store the
renderpass state and graphics pipeline bound to avoid redundant binds
and renderpass begins/ends.
* Kernel: Correct behavior of Address Arbiter threads.
This corrects arbitration threads to behave just like in Horizon OS.
They are added into a container and released according to what priority
they had when added. Horizon OS does not reorder them if their priority
changes.
* Kernel: Address Feedback.
Previously we naively checked for "Intel" in GL_VENDOR, but this
includes both Intel's proprietary driver and the mesa driver. Re-enable
compute shaders for mesa.
Add missing new-line. This caused shaders using local memory and shared
memory to inject a preprocessor GLSL line after an expression (resulting
in invalid code).
It looked like this:
shared uint smem[8];#define LOCAL_MEMORY_SIZE 16
It should look like this (addressed by this commit):
shared uint smem[8];
\#define LOCAL_MEMORY_SIZE 16
Update Sirit and its usage in vk_shader_decompiler. Highlights:
- Implement tessellation shaders
- Implement geometry shaders
- Implement some missing features
- Use native half float instructions when available.
- Setup more features and requirements.
- Improve logging for missing features.
- Collect telemetry parameters.
- Add queries for more image formats.
- Query push constants limits.
- Optionally enable some extensions.
Over the course of the changes to the kernel code, a few includes are no
longer necessary, particularly with the change over to std::shared_ptr
from Boost's intrusive_ptr.
These are fairly trivial to implement, we can just do nothing. This also
provides a spot for us to potentially dump out any relevant info in the
future (e.g. for debugging purposes with homebrew, etc).
While we're at it, we can also correct the names of both of these
supervisor calls.
This commit corrects an error in which a Core could remain with an
exclusive state after running, leaving space for possible race
conditions between changing cores.
Some texture views were being created out of bounds (with more layers or
mipmaps than what the original texture has). This is because of a
miscalculation in mipmap bounding. end_layer and end_mipmap are out of
bounds (e.g. layer 6 in a cubemap), there's no need to add one more
there.
Fixes OpenGL errors and Vulkan crashes on Splatoon 2.
Pack color attachment enumerations into a single u32. To determine the
number of buffers, the highest color attachment with a shared pointer
that doesn't point to null is used.
Now that literally every other API function is converted over to the
Memory class, we can just move the file-local page table into the Memory
implementation class, finally getting rid of global state within the
memory code.
Now that everything else is migrated over, this is essentially just code
relocation and conversion of a global accessor to the class member
variable.
All that remains is to migrate over the page table.
The Write functions are used slightly less than the Read functions,
which make these a bit nicer to move over.
The only adjustments we really need to make here are to Dynarmic's
exclusive monitor instance. We need to keep a reference to the currently
active memory instance to perform exclusive read/write operations.
With all of the trivial parts of the memory interface moved over, we can
get right into moving over the bits that are used.
Note that this does require the use of GetInstance from the global
system instance to be used within hle_ipc.cpp and the gdbstub. This is
fine for the time being, as they both already rely on the global system
instance in other functions. These will be removed in a change directed
at both of these respectively.
For now, it's sufficient, as it still accomplishes the goal of
de-globalizing the memory code.
Amends a few interfaces to be able to handle the migration over to the
new Memory class by passing the class by reference as a function
parameter where necessary.
Notably, within the filesystem services, this eliminates two ReadBlock()
calls by using the helper functions of HLERequestContext to do that for
us.
These will eventually be migrated into the main Memory class, but for
now, we put them in an anonymous namespace, so that the other functions
that use them, can be migrated over separately.
A fairly straightforward migration. These member functions can just be
mostly moved verbatim with minor changes. We already have the necessary
plumbing in places that they're used.
IsKernelVirtualAddress() can remain a non-member function, since it
doesn't rely on class state in any form.
Migrates all of the direct mapping facilities over to the new memory
class. In the process, this also obsoletes the need for memory_setup.h,
so we can remove it entirely from the project.
Currently, the main memory management code is one of the remaining
places where we have global state. The next series of changes will aim
to rectify this.
This change simply introduces the main skeleton of the class that will
contain all the necessary state.
* core_timing: Use better reference tracking for EventType.
- Moves ownership of the event to the caller, ensuring we don't fire events for destroyed objects.
- Removes need for unique names - we won't be using this for save states anyways.
The heuristic to detect AMD's driver was not working properly since it
also included Intel. Instead of using heuristics to detect it, compare
the GL_VENDOR string.
We relies on UNREACHABLE's noreturn attribute to eliminate parent's "no return value" warning. However, this was wrapped in a `if(!false)` block, which compilers may not unfold to recognize the noreturn nature.
SSBOs and other resources are limited per pipeline on Intel and AMD.
Heuristically reserve resources per stage having in mind the reported
OpenGL limits.
The current shared memory size seems to be smaller than what the game
actually uses. This makes Nvidia's driver consistently blow up; in the
case of FE3H it made it explode on Qt's SwapBuffers while SDL2 worked
just fine. For now keep this hack since it's still progress over the
previous hardcoded shared memory size.
Drop the usage of ARB_compute_variable_group_size and specialize compute
shaders instead. This permits compute to run on AMD and Intel
proprietary drivers.
Some games like "Fire Emblem: Three Houses" bind 2D textures to offsets
used by instructions of 1D textures. To handle the discrepancy this
commit uses the the texture type from the binding and modifies the
emitted code IR to build a valid backend expression.
E.g.: Bound texture is 2D and instruction is 1D, the emitted IR samples
a 2D texture in the coordinate ivec2(X, 0).
This commit ensures cond var threads act exactly as they do in the real
console. The original implementation uses an RBTree and the behavior of
cond var threads is that at the same priority level they act like a
FIFO.
This commit aims to redo the full setup of invalid textures and
guarantee correct behavior across backends in the case of finding one by
using black dummy textures that match the target of the expected
texture.
While DEPBAR is stubbed it doesn't change anything from our end. Shading
languages handle what this instruction does implicitly. We are not
getting anything out fo this log except noise.
Nvidia has sane default output values for varyings, but the other
vendors don't apply these. To properly emulate this we would have to
analyze the shader header. For the time being, apply the same default
Nvidia applies so we get the same behaviour on non-Nvidia drivers.
This commit corrects the behavior of cancel synchronization when the
thread is running/ready and ensures the next wait is cancelled as it's
suppose to.
format_lookup_table: Drop bitfields
format_lookup_table: Use std::array for definition table
format_lookup_table: Include <limits> instead of <numeric>
Use a large flat array to look up texture formats. This allows us to
properly implement formats with different component types. It should
also be faster.
Abstracted ComponentType was not being used in a meaningful way.
This commit drops its usage.
There is one place where it was being used to test compatibility between
two cached surfaces, but this one is implied in the pixel format.
Removing the component type test doesn't change the behaviour.
Maintains implementation parity between QueryApplicationPlayStatistics
and QueryApplicationPlayStatisticsByUid.
These function the same behaviorally underneath the hood, with the only
difference being that one allows specifying a UID.
This properly handles unicode-based paths on Windows, while opening a
raw stream doesn't out-of-the-box.
Prevents file creation from potentially failing on Windows PCs that make
use of unicode characters in their save paths (e.g. writing to a user's
AppData folder, where the user has a name with non-ASCII characters).
Since the introduction of this library, numerous improvements have been
made. Notably, many of the warnings we would get by simply including the
library header have now been fixed. This makes it much easier to make
conversion warning an error.
Uncovered a bug within Thread's SetCoreAndAffinityMask() where an
unsigned variable (ideal_core) was being compared against "< 0", which
would always be a false condition.
We can also get rid of an unused function (GetNextProcessorId) which contained a sign
mismatch warning.
Quite frequently there have been cases where code has been merged into
the core that produces warning. In order to prevent this from occurring,
we can make the compiler flag these cases and allow our CI to flag down
any code that would generate these warnings.
This is beneficial given silent conversions from signed/unsigned can
result in logic bugs. This forces one writing changes to be explicit
about when signedness conversions are desirable, rather than leaving it
up to readers' interpretation.
Currently the codebase isn't in a state where it will build successfully
with this change applied, but this will be addressed in subsequent
follow-up changes. This set of changes will focus on making it build
properly with these changes for MSVC as a starting point for basic
coverage.
`boost::make_iterator_range` is available when `boost/range/iterator_range.hpp` is included.
Also include `boost/icl/interval_map.hpp` and `boost/icl/interval_set.hpp`.
Avoids potential allocations due to the usage of std::string on strings
that we know at compile time. Most of these might fit in SSO, but it
adds complexity that can be easily avoided with string views.
Emulates negative y viewports with ARB_clip_control. This allows us to
more easily emulated pipelines with tessellation and/or geometry shader
stages. It also avoids corrupting games with transform feedbacks and
negative viewports (gl_Position.y was being modified).
Update src/video_core/shader/control_flow.cpp
Co-Authored-By: Mat M. <mathew1800@gmail.com>
Update src/video_core/shader/control_flow.cpp
Co-Authored-By: Mat M. <mathew1800@gmail.com>
Update src/video_core/shader/control_flow.cpp
Co-Authored-By: Mat M. <mathew1800@gmail.com>
Update src/video_core/shader/control_flow.cpp
Co-Authored-By: Mat M. <mathew1800@gmail.com>
Update src/video_core/shader/control_flow.cpp
Co-Authored-By: Mat M. <mathew1800@gmail.com>
Update src/video_core/shader/control_flow.cpp
Co-Authored-By: Mat M. <mathew1800@gmail.com>
- This does not actually seem to exist in the real kernel - games reset these automatically.
# Conflicts:
# src/core/hle/service/am/applets/applets.cpp
# src/core/hle/service/filesystem/fsp_srv.cpp
Nvidia's OpenGL driver maps gl(Named)BufferSubData with some requirements
to a fast. This path has an extra memcpy but updates the buffer without
orphaning or waiting for previous calls. It can be seen as a better
model for "push constants" that can upload a whole UBO instead of 256
bytes.
This path has some requirements established here:
http://on-demand.gputechconf.com/gtc/2014/presentations/S4379-opengl-44-scene-rendering-techniques.pdf#page=24
Instead of using the stream buffer, this commits moves constant buffers
uploads to calls of glNamedBufferSubData and from my testing it brings a
performance improvement. This is disabled when the vendor is not Nvidia
since it brings performance regressions.
Originally on the last commit I thought TLD4 acted the same as TLD4S and
didn't have a mask. It actually does have a component mask. This commit
corrects that.
This commit fixes an issue where not all 4 results of tld4 were being
written, the color component was defaulted to red, among other things.
It also implements the bindless variant.
Bindless textures were using u64 to pack the buffer and offset from
where they come from. Drop this in favor of separated entries in the
struct.
Remove the usage of std::set in favor of std::list (it's not std::vector
to avoid reference invalidations) for samplers and images.
Greatly shrinks the amount of generated code for GetDecodeTable().
Collapses an assembly output of 9000+ lines down to ~3621 with Clang,
and 6513 down to ~2616 with GCC, given it's now allowed to construct all
the entries as a sequence of constant data.
Given the overall size of the maps are very small, we can use arrays of
pairs here instead of always heap allocating a new map every time the
functions are called. Given the small size of the maps, the difference
in container lookups are negligible, especially given the entries are
already sorted.
After further hardware investigation, it appears that some games, perhaps those more lazily coded, will not call EnsureSaveData, meaning that they expect the normal (current) save to be automatically made. Additionally, some games do not create a cache or temporary save before use.
In these 3 specific instances, the save is created automatically for the game if it doesn't exist.
TLD4S always outputs 4 values, the previous code checked a component
mask and omitted those values that weren't part of it. This commit
corrects that and makes sure all 4 values are set.
Ignore global memory operations instead of invoking undefined behaviour
when constant buffer tracking fails and we are blasting through asserts,
ignore the operation.
In the case of LDG this means filling the destination registers with
zeroes; for STG this means ignore the instruction as a whole.
The default behaviour is still to abort execution on failure.
The returned string is simply a substring of our constexpr tabs
string_view, so we can just use a string_view here as well, since the
original string_view is guaranteed to always exist.
Now the function is fully non-allocating.
While not an issue, it does prevent fallthrough from occurring if
anything is ever added after this case (unlikely to occur, but this
turns a trivial "should not cause issues" into a definite "won't cause
issues).
While a map is an OK way to do lookups (and usually recommended in most
cases), this is a map that lives for the entire duration of the program
and only deallocates its contents when the program terminates.
Given the total size of the map is quite small, we can simply use a
std::array of pairs and utilize std::find_if to perform the same
behavior without loss of performance.
This eliminates a static constructor and places the data into the
read-only segment.
While we're at it, we can also handle malformed inputs instead of
directly dereferencing the resulting iterator.
Normaly OpenGL does not care if the areas exceed the texture regions but
other backends such as Vulkan do care about the limits of this areas.
This PR crops the areas of the blit in order that they don't surpass the
limits of the textures. This should help Vulkan and faulty OpenGL
drivers
This can be trivially fixed by making the input size a size_t.
CFGRebuildState's constructor parameter is already a std::size_t, so
this just makes the size type fully conform with it.
This allows the function to be completely non-allocating for inputs of
all sizes (i.e. there's no heap cost for an input to convert to a
std::string_view).
This only ever queries if the type exists within the variant, but
doesn't actually do anything with the return value. We can just use
std::holds_alternative for this use case.
MetaImage contains a std::vector, so copying here could result in
unnecessary reallocations. Given the operation lives throughout the
entire scope, this is safe to do.
Amends the doxygen comments so that they properly resolve. While we're
at it, we can correct some typos and fix up some of the comments'
formatting in order to make them slightly nicer to read.
Makes the header more general for other potential algorithms in the
future. While we're at it, include a missing <functional> include to
satisfy the use of std::less.
In case of redundant yields, the scheduler will now idle the core for
it's timeslice, in order to avoid continuously yielding the same thing
over and over.