# The reason why Rusticl behaves so weirdly with USE_HOST_PTR

```
[18:21] == rusticluser [~oftc-webi@2803:1500:c00:eb3:c450:9864:8f21:f2fb] has joined #rusticl
[18:22] Hey guys, I have questions about the implementation of clEnqueueMapBuffer/clEnqueueUnmapMemObject in Rusticl.
[18:22] This webpage says I should ping karolherbst
[18:22] https://docs.mesa3d.org/rusticl.html
[18:23] I am finding some very odd behaviour on the Raspberry Pi 5, when using the v3d GPU via Rusticl
[18:24] (Gimme a bit to write up my questions)
[18:25] == pbrobinson [~pbrobinso@2001:8b0:fb11:2681:e9:f8b:31b:f797] has joined #rusticl
[18:29] Here's a dump of the output from running `RUSTICL_ENABLE=v3d clinfo` on my Raspberry Pi 5: https://gist.github.com/latentPrion/9843ff5b98f21b20b9f6d5bce43006b3
[18:30] Of particular note is that it says that the V3D GPU has a unified memory architecture with the main ARM CPU complex:
[18:30] > Unified memory for Host and Device Yes
[18:32] Because all of my target platforms seem to have unified memory with the CL GPUs, I decided that I would aim to optimize my program by using CL_MEM_USE_HOST_PTR, and avoiding using clEnqueueRead/WriteBuffer. I have indeed got it working on both the RPi5 and on my x86 laptop, but some of the things that were required to get it working on the RPi5+Rusticl implementation are a bit odd, and I wanted to confirm whether these behaviours and apparent eccentricities are
[18:32] intentional
[18:34] Here is my code, for your perusal and reference.
[18:34] https://gist.github.com/latentPrion/d9fb3f0604a957d2055786a118072482
[18:36] So: the long and short of it is: I have an input buffer (called "assemblyBuffer") that was filled with data by io_uring. I create an OpenCL buffer for assemblyBuffer, using CL_MEM_USE_HOST_PTR. I then want to pass this assemblyBuffer into an OpenCL kernel.
[18:37] The OpenCL kernel doesn't see the data that was written into the buffer unless I use CL_MAP_WRITE_INVALIDATE_REGION. I can understand the reasoning behind this, if the reasoning is that the cache invalidation op is performed on the GPU side.
[18:38] That makes sense because the GPU's caches may hold stale data that prevent it from seeing the data I put into the HOST_PTR buffer. So the need to invalidate the GPU's caches makes perfect sense and I'm not complaining about this.
[18:39] It's the next bit that is a bit confusing to me, and which I suspect is a bug in Rusticl or the Mesa driver behind it.
[18:40] I have a 2nd buffer, called the "collateBuffer", which is distinct from the "assemblyBuffer". I run a 2nd kernel after the first kernel, which takes the assemblyBuffer as input, and produces its output into the collationBuffer.
[18:42] Now, since the 1st kernel wrote its output data into the assemblyBuffer, this should mean that the GPU's caches should be up to date with the data that was just written into the assemblyBuffer by the 1st kernel -- because it was the GPU itself which wrote that data into the assemblyBuffer
[18:43] Yet, for some reason, I'm still required to remap the assemblyBuffer with CL_MAP_WRITE_INVALIDATE_REGION when I want to run the 2nd kernel.
[18:43] 1. I have not modified the assemblyBuffer's data at all on the host CPU. The data in the assemblyBuffer is exactly what was written into it by the 1st kernel when it was running on the GPU.
[18:44] 2. The 2nd kernel doesn't write into, or modify the assemblyBuffer at all in any way. The 2nd kernel uses the assemblyBuffer as input *ONLY*.
[18:44]
[18:46] I guess my question is: why am I required to first map and unmap the assemblyBuffer as CL_MAP_WRITE_INVALIDATE_REGION before the GPU can see the contents of the assemblyBuffer, even though the GPU itself just wrote that data into it, and the GPU's caches should be in sync with it?
[18:47] (You can see the remapping with CL_MAP_WRITE_INVALIDATE_REGION for the 2nd kernel's execution here: https://gist.github.com/latentPrion/d9fb3f0604a957d2055786a118072482#file-openclcollatingandmeshingengine-cpp-L343)
[18:48] Technically, I should be able to just map it as CL_MAP_WRITE without needing to specify INVALIDATE_REGION -- am I incorrect?
[18:49] Basically what you see in that pasted gist is what is required to get this to work on the RPi5, so any decisions you see in the code are constrained by either (1) Rusticl, (2) Mesa drivers, (3) the RPi5 hardware
[18:51] I downloaded the Mesa source code and asked Cursor to scan it and find out what's going on (I don't know Rust, so I can't read the code myself very well) and Cursor says that there's an intermediate layer of "shadow buffering" implemented by Rusticl between the host and GPU
[18:52] And that this intermediate shadow buffering layer is the source of the unexpected behaviours
[19:11] rusticluser: launching kernels on mapped buffers is undefined behavior
[19:14] though not sure if that's what you run into, just sounded like it
[19:17] karolherbst: Yea, but I don't keep them mapped -- notice that I map and then immediately unmap
[19:18] Literally: mapBuffer(); unmapBuffer() back to back lol -- good pointer though
[19:18] I'm a bit confused by the code, how do you verify that the GPU is or isn't reading the correct data?
[19:18] or do you access it through the host pointer directly?
[19:19] karolherbst: I check using printf() (OpenCL 1.2 extension) inside of the running kernel, and also I check the resulting output after the kernel has been executed
[19:19] ahh
[19:19] Would you like to see the kernels? They're just clutter for your headspace, but maybe they might give you some kind of information I don't know about
[19:19] USE_HOST_PTR is a bit weird, because it doesn't guarantee coherency
[19:20] Yea -- I can understand that: the real thing that a developer who's using USE_HOST_PTR wants from the underlying implementation is something like this workflow:
[19:22] (1) clEnqueueMapBuffer(CL_MAP_WRITE) => /* (2) I write stuff into the buffer */ => (3) clEnqueueUnmapMemObject() /* At this point, during the unmap operation, the CL implementation is expected to write-back the host CPU's caches to main memory, and then invalidate the GPU's caches so that the GPU can see the writes that were stored to main memory */
[19:23] And for the read-side, the workflow that the developer intuitively expects is:
[19:25] (1) clEnqueueMapBuffer(CL_MAP_READ) /* This mapping call should cause the GPU to write-back to main memory, and should cause the host CPU to invalidate its caches so it can see what was written by the GPU */ => (2) /* I read the stuff from the buffer */ => (3) clEnqueueUnmapMemObject() /* No special maintenance required here */
[19:26] right.. I think it's potentially also an issue with the rpi driver. It's not really well tested, so random bugs could always exist there. Might want to verify that your application behaves correctly on other GPUs
[19:27] Yea -- I only have this RPi5 as an ARM testbed, sadly. The other test machine I have is this shitty Intel Core i5 laptop with an Intel HD GPU. The Intel HD GPU doesn't require any mapping/unmapping of any kind -- the cache coherency domain seems to fully cover the GPU on the Intel laptop
[19:28] Idk, maybe it's a bug, maybe it's not -- I guess I was checking to see if the behaviour I was seeing was intentional and I just didn't properly understand the memory/execution model of OpenCL; or whether it's actually a bug somewhere in the underlying implementation's stack
[19:28] the intel is the only driver that ever added support for actually mapping host memory into the GPU when it's not page aligned
[19:29] Ah -- my HOST_PTRs are aligned to _SC_PAGE_SIZE
[19:29] I don't think the rpi driver supports mapping host memory at all
[19:29] :(
[19:29] yeah...
[19:30] not sure if it's because of missing kernel interfaces or what's the reason there
[19:30] How can I check and see? I have no understanding of GFX drivers and I hear they're a real domain-specific kind of mess to read;
[19:30] At minimum, which "module" in the mesa code provides the RPi5 opencl driver/support?
[19:31] well I know that it doesn't support it on the mesa side, but I haven't checked if there is in theory a kernel interface for it or not
[19:31] `src/gallium/drivers/v3d/` is the driver inside mesa
[19:31] *nod*, thanks
[19:31] Is this worth filing a ticket/issue for?
[19:32] _not_ sure. Maybe if there is a strong interest to also implement the GL/vulkan features allowing for mapping host memory
[19:33] Alright -- I'll just keep my eye on it and if it becomes an unmanageable problem, I'll file a ticket and probably also try to add the support myself
[19:34] These new LLMs really enable you to extend yourself into new domains and contribute to stuff you otherwise wouldn't have the time/insight to be able to, so if it really becomes unmanageable, I'll probably be able to just fix it and submit a patch
[19:34] though it should still work in theory, so not really sure what's going wrong there
[19:34] It should lol -- the purpose of the clEnqueueMap/Unmap calls isn't to actually "map" anything -- it's purely to manage the cache synchronization between the host CPU complex and the GPU
[19:35] but I'd verify if your application behaves as expected on other hardware/drivers as well, maybe even on discrete GPUs
[19:35] AFAICT, it's probably just a bug in the cache management
[19:35] It definitely won't work on a GPU that doesn't have shared memory because the design is explicitly for USE_HOST_PTR
[19:36] then it's broken also for shared memory systems
[19:36] Hmmm -- could you elaborate on that?
[19:37] USE_HOST_PTR doesn't really allow for different use cases as it doesn't really guarantee anything except that the pointer returned by mapBuffer matches the host pointer
[19:37] and that's all the additional guarantee it gives you
[19:38] you still have to use it as if it wouldn't be a host ptr allocation, because synchronization points are the same as with non host ptr allocations
[19:38] Yes, indeed: but it's also explicitly different from CL_MEM_ALLOC_HOST_PTR, I think? The difference is that CL_MEM_ALLOC_HOST_PTR is likely to be mapping in device MMIO registers
[19:38] alloc host ptr just means that the allocation is done in host memory instead of VRAM
[19:38] maybe
[19:38] it's just a hint
[19:39] like it uses GART infrastructure and the GPU just accesses memory over PCIe (if a discrete GPU)
[19:39] for unified memory GPU it shouldn't make any difference
[19:39] I'm sorry -- am I wrong? CL_MEM_ALLOC_HOST_PTR means only that the buffer returned will be *ACCESSIBLE* by the host. This means that the buffer could be MMIO mapped registers, or some other such memory range
[19:40] It doesn't actually mean that the buffer is allocated from host mem
[19:40] It just means that the buffer will be *ACCESSIBLE* from host mem, __POTENTIALLY__ without a copy
[19:40] it has nothing to do with access
[19:41] https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/clCreateBuffer.html:
[19:41] sure, but it means something else
[19:41] > This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory. CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR are mutually exclusive.
[19:41] Ah ok lol
[19:41] like you can't access the memory allocation either way directly, because you have to map
[19:42] though CL_MEM_ALLOC_HOST_PTR is more of a "please don't use VRAM, so that reading out the memory on the host is quick"
[19:42] It seems like the reason why they say that ALLOC_HOST_PTR and USE_HOST_PTR are mutually exclusive is *precisely because* ALLOC_HOST_PTR is not guaranteed to be allocated within host memory lol
[19:42] well.. you have no control over what address the mapping will have
[19:43] USE_HOST_PTR already uses host memory, so alloc_host_ptr is meaningless
[19:43] I am fairly certain that MEM_ALLOC_HOST_PTR means, "You may use VRAM if you wish, but ensure that it's a portion of your internal VRAM that can be exposed and mapped as MMIO. You may also use host RAM if you wish -- both are fine"
[19:43] [19:43] USE_HOST_PTR already uses host memory, so alloc_host_ptr is meaningless
[19:43] ^ Absolutely correct
[19:43] Wait whoa no
[19:44] VRAM can always be mapped into host memory, it's just slow
[19:44] and you have to fight with PCI bar sizes
[19:44] though you can also set different caching hints etc..
[19:45] When I say "VRAM" here, I was mimicking your language, but a more accurate term would be "device memory" because there's no guarantee that the OpenCL device is indeed a GPU, or that it exposes all of its global, local or private memory in an MMIO or host-accessible fashion lol
[19:45] Ok errm, I don't think arguing over this will go very far lol
[19:46] But I really appreciate your pointers -- I'll look for another test board
[19:46] Really appreciate your time -- I know this is a volunteer effort on your part
[19:47] Heheh. Pointers.
[19:47] == rusticluser [~oftc-webi@2803:1500:c00:eb3:c450:9864:8f21:f2fb]
[19:47] == realname : OFTC WebIRC Client
[19:47] == channels : #rusticl
[19:47] == server : weber.oftc.net [Newark, NJ, USA]
[19:47] == realhost : [ip: actually using host]
[19:47] == idle : 0 days 0 hours 1 minutes 20 seconds [connected: Wed Nov 12 18:21:37 2025]
[19:47] == End of WHOIS
[19:49] yeah anyway.. on the rpi5 driver might as well not use use_host_ptr because rusticl will have to copy things around to fake host_ptr support anyway. So might as well then not use it. But I also wanted to implement more optimized map/unmap paths for single device context with unified memory, because atm it's assuming worst case and isn't really
[19:49] optimized very well anyway
[19:49] but those optimizations will also paper over correctness issues
[19:51] though I'm also not convinced that the emulation code is 100% correct...
[19:52] there _might_ be a bug if the mapping has different accesses, but I never found anything that ran into issues here
[19:54] you could run with `RUSTICL_DEBUG=memory` and see if the prints make any sense. It should tell when the memory content is migrated and moved around
[20:00] karolherbst: Ah that's awesome info, thanks
[20:01] It would be really useful to have an explicit confirmation of whether I'm actually getting zero-copy
```
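
The exchange at [19:28] and [19:29] turns on whether the host pointer handed to CL_MEM_USE_HOST_PTR is page aligned. Below is a minimal sketch in C of one way to obtain such a buffer, assuming a POSIX system. The helper names `round_up_to_page`, `alloc_page_aligned` and `is_page_aligned` are hypothetical and do not come from the gist or from Mesa. `_SC_PAGESIZE` is the POSIX spelling of the `_SC_PAGE_SIZE` mentioned in the log. Note that alignment alone does not imply zero-copy: per the log, the v3d driver is believed not to map host memory at all, so Rusticl may still copy.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round `size` up to the next multiple of `page`.
 * `page` must be a non-zero power of two. */
static size_t round_up_to_page(size_t size, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return (size + page - 1) & ~(page - 1);
}

/* Hypothetical helper: allocate a buffer that starts on a page boundary
 * and whose length is a whole number of pages, so it could be passed as
 * the host_ptr of a CL_MEM_USE_HOST_PTR buffer.
 * Returns NULL on failure; release the buffer with free(). */
static void *alloc_page_aligned(size_t size, size_t *padded_size)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    size_t padded = round_up_to_page(size, (size_t)page);
    void *ptr = NULL;
    if (posix_memalign(&ptr, (size_t)page, padded) != 0)
        return NULL;

    if (padded_size != NULL)
        *padded_size = padded;
    return ptr;
}

/* Returns non-zero when `ptr` sits on a page boundary. */
static int is_page_aligned(const void *ptr)
{
    long page = sysconf(_SC_PAGESIZE);
    return page > 0 && ((uintptr_t)ptr % (uintptr_t)page) == 0;
}
```

A caller would pass the returned pointer and the padded size to clCreateBuffer with CL_MEM_USE_HOST_PTR, then check the `RUSTICL_DEBUG=memory` output suggested at [19:54] to see whether the contents are still being migrated.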