nbohr1more Posted November 27, 2017 Crash is resolved. No performance benefits noticed in any of the pathological scenes, though. I'm probably less draw-batch bound than just plain CPU bound on the skinning calcs? I didn't test with shadow maps yet, but I don't expect the story to change there from my previous observations. Edit: r_shadows 2 (shadow maps) == crash: idVertexCache::Position: bad vertexCache_t 0. Spoke too soon. Penny Dreadful 2 is now crashing with the same vertex cache position error. Please visit TDM's IndieDB site and help promote the mod: http://www.indiedb.com/mods/the-dark-mod (Yeah, shameless promotion... but traffic is traffic folks...)
duzenko Posted November 27, 2017 Moved this to a non-default cvar, will revisit later. Thanks for the testing.
cabalistic Posted November 27, 2017 Author Hey, no sweat. To be honest, I'm not exactly an expert on this stuff, either. I'm just reading through BFG code in parallel to TDM code and trying to adapt as best as I can. Unfortunately, I didn't have too much time in the past few weeks, so the vertexcache adjustment is in limbo a bit. It's working in principle (and is a bit faster), but there's still the blocker found by nbohr1more and two other issues I haven't been able to properly track down yet. I'm hoping to do that over the Christmas holidays, ideally. That being said, I think we should be careful about moving more stuff into the backend. Even though the frontend is the current bottleneck, we don't have that much headroom on the backend, either. So instead we should try to reduce load in the frontend, either by making its operations simpler or by adding additional parallelisation. BFG does both, and I'm hoping that with the vertexcache adjustment the extra parallelisation should become simpler. If you are not in a hurry, I would suggest waiting a bit until I've had time to finalise the vertexcache. Any quick fix in the meantime is going to make merging and adjusting it more difficult. How are things with BFG-style vertex cache anyway? Been quiet on that front for a while. I have no idea how to approach it with this linked list that controls it. However, we all know that the current version is CPU-limited (read: frontend-limited). I'm just doing what I can to lift the weight off the frontend. We certainly cannot control the video driver when it comes to vertex uploads, so our only options are to map the entire VB to CPU address space at frame start or to move vertex uploads to the backend. I'm merely doing the one I know how to do. I wish I knew how to do mapping the right way, but I don't. I can't see any fps gain so far, but it paves the way to worker jobs for e.g. tangent derive, etc.
(Yes, I know we'd be better off with GPU skinning, but again I don't know how to approach that ATM). EDIT: There is of course an option of moving vertex uploads to yet another thread along with derive tangents, etc. I kinda like that even more because it's easier to control a single thread than N workers. Back to the benchmark: I can see CPU mostly sitting in shadow calc (the open space in Rightful Property).
duzenko Posted November 29, 2017 I added some queuing in vertex cache _temporarily_ until you sort out the BFG style issues. This is for me to analyze frontend bottlenecks.
cabalistic Posted December 9, 2017 Author @nbohr1more: I think I finally tracked down the decal issue. The decals used memory for vertices and indexes that wasn't 16-byte aligned. However, the BFG-style vertex caches expect the data to be passed in 16-byte aligned pointers, so that led to a crash. There is one other issue I'm currently trying to track down. Occasionally, the game hangs and begins to allocate insane amounts of memory until it finally cannot allocate any more memory and crashes. I don't know what's causing it yet, but it does not appear to happen on trunk, so it's probably related to my changes in some way. In any case, perhaps you could give the updated version another spin and verify the decal issue is gone for you, too? https://github.com/fholger/thedarkmod/tree/vertexcache
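To make the alignment requirement above concrete, here is a minimal C++ sketch of checking a pointer for 16-byte alignment and of staging unaligned data into aligned storage before handing it on. All names (`IsAligned16`, `AlignUp16`, `AlignedStaging`) are hypothetical and are not taken from the TDM or BFG sources; the actual fix in the branch may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: true if ptr sits on a 16-byte boundary.
inline bool IsAligned16(const void* ptr) {
    return (reinterpret_cast<std::uintptr_t>(ptr) & 15u) == 0;
}

// Hypothetical helper: round a byte count up to the next multiple of 16.
inline std::size_t AlignUp16(std::size_t bytes) {
    return (bytes + 15u) & ~static_cast<std::size_t>(15u);
}

// One 16-byte, 16-byte-aligned chunk of raw storage.
struct alignas(16) Block16 {
    unsigned char bytes[16];
};

// Hypothetical staging buffer: copies possibly unaligned source data into
// storage that is guaranteed to start on a 16-byte boundary.
class AlignedStaging {
public:
    // Returns a 16-byte aligned pointer to a copy of the first numBytes of src.
    // The pointer stays valid until the next call to Stage().
    const void* Stage(const void* src, std::size_t numBytes) {
        blocks_.resize(AlignUp16(numBytes) / 16);
        if (numBytes > 0) {
            std::memcpy(blocks_.data(), src, numBytes);
        }
        return blocks_.data();
    }

private:
    std::vector<Block16> blocks_;
};
```

A caller would test `IsAligned16(ptr)` first and only pay for the extra copy through `AlignedStaging::Stage` when the check fails. Allocating the decal data aligned in the first place avoids the copy altogether.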
duzenko Posted December 9, 2017 @stgatilov removed idheap in trunk recently, so try to merge that and see if it helps with memory allocation.
cabalistic Posted December 9, 2017 Author I already rebased on trunk. The issue I described above is a curious infinite loop when trying to bind an image that's not loaded. It's trying to load a compressed version of that image, during which it eventually calls bind again, which leads into the endless loop. Each call allocates some memory, until eventually there's none left. I'm currently trying to figure out why this only appears to happen on my branch and what's the right way to break it. Edited December 9, 2017 by cabalistic
cabalistic Posted December 9, 2017 Author So, essentially the root of the problem is that it's trying to load an image on the frontend, which in turn calls glGenTextures. However, since in my branch I removed the GL context on the frontend, the call fails (silently) and leads into that endless loop. The image load itself is triggered by a (static) model load during animation processing. I'll need to understand why that happens; the BFG style assumes that all static models are generated during level load. So either I missed something during porting, or The Dark Mod does something that doesn't fit into this picture.
nbohr1more Posted December 10, 2017 Nice! I gave it a whirl... Closemouthed Shadows no longer crashes. Briarwood Manor and Rightful Property both encountered "R_StaticAlloc failed on 87528 bytes" errors after some amount of play. I presume this is related to the infinite loop bug you've been discussing? When these were working I saw some nice FPS boosts. Briarwood Manor went from 35 FPS to 42. Rightful Property (with shadow mapping) went from 74 to 80.
cabalistic Posted December 10, 2017 Author Yep, that's the infinite loop bug. In Briarwood Manor, what happens is essentially this: eventually, one of the guards outside decides to take a leak and enters a urinating animation. This, in turn, triggers some material to be parsed, which then tries to load an image and generate a new texture (calling glGenTextures). As GL calls are no longer supposed to be possible in the frontend, it crashes. I could reenable the GL context for the frontend, of course. However, I would like to understand why this particular image was not preloaded during level load. My rudimentary understanding of the engine is that we would normally preload all the images that we expect to need. Does someone have any deeper insights into resource loading? Is this a potential oversight, or is there a reason that some resources should not be preloaded? Perhaps in this particular instance it's just well-hidden in some obscure animation that we just don't know about at level load?
duzenko Posted December 10, 2017 Maybe add some kind of texture load queue that will be processed at some point by the backend? Surely it will cause a visual glitch, but for one frame only. We already have one-frame glitches with FBO, so we can live with that. Maybe if we do the loading at the right point, we can go without glitches here. I am very excited about the context-less frontend because it paves the way for more CPU threads and less driver sync overhead. On the other hand, with shadow maps and no interaction tri culling I am already backend-limited in Rightful Property...
cabalistic Posted December 10, 2017 Author Yeah, I'm currently trying a simple conditional check in idImage::ActuallyLoadImage which just returns if it's accessed from the frontend thread. Image loading should then be retriggered from the backend when it actually needs the image. Since this seems to be a pretty rare case, I'm hopeful this won't have too many downsides. Getting rid of the context would indeed be great to further parallelize the frontend. Although, if we really are already approaching backend bottlenecks, I should probably compare how backend processing differs in BFG. Perhaps there are a few quick wins there, too? I'll test your scenario to see how my PC handles it.
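The early-return idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchImage`, `tl_isBackendThread`, `LoadImageSketch` and `BindImageSketch` are hypothetical names, the real `idImage::ActuallyLoadImage` has a different signature, and how the engine actually identifies the frontend thread is not shown in the thread.

```cpp
#include <cassert>

// Hypothetical stand-in for an engine image; not the real idImage.
struct SketchImage {
    bool loaded = false;
    int  glUploads = 0;   // counts how often a texture would have been created
};

// Assumption: every thread sets this flag once at startup, and only the
// backend thread (the one owning the GL context) sets it to true.
thread_local bool tl_isBackendThread = false;

// Returns true if the image is loaded when the call returns.
inline bool LoadImageSketch(SketchImage& img) {
    if (img.loaded) {
        return true;
    }
    if (!tl_isBackendThread) {
        // No GL context here: do nothing instead of issuing GL calls.
        // The load is retried when the backend first binds the image.
        return false;
    }
    ++img.glUploads;      // real code would create and fill the GL texture here
    img.loaded = true;
    return true;
}

// Backend-side bind: loads lazily if an earlier frontend request was skipped.
inline bool BindImageSketch(SketchImage& img) {
    assert(tl_isBackendThread && "bind is backend-only in this sketch");
    return LoadImageSketch(img);
}
```

The trade-off is the one mentioned in the posts: the upload happens in the frame where the backend first needs the image, so a late-discovered texture can cost a hitch at that point, but no GL call is ever issued from a thread without a context.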
cabalistic Posted December 10, 2017 Author Alright, pushed the aforementioned fix to https://github.com/fholger/thedarkmod/tree/vertexcache. It's really only that one image, as far as I can tell, so this should be fine. Please give it a spin and let me know if there are any other issues I might have missed. Edited December 10, 2017 by cabalistic
duzenko Posted December 10, 2017 I don't expect much difference, except d3 levels were designed for 2004 hardware and probably hand-tuned AND don't usually have forced ambient like we do. TDM maps are orders of magnitude more complex in both draw count and poly count.
cabalistic Posted December 10, 2017 Author You're probably right, but it's worth a look, anyway. In Rightful Property, I also approach parity between frontend and backend. Would have to optimize both of them to get any further gains... Not so sure if the backend is GPU limited, though. I'll have to do some more profiling.
duzenko Posted December 10, 2017 Yeah, r_showsmp prints only dots, but Afterburner shows GPU usage around 80%. Could be driver overhead. One thing we could do with the backend: since we now have this big single vertex buffer, we might want to no longer set up the five vertex attribs per surface like we do now. Instead, pass the calculated surface data offset in the big buffer to glDrawElements? Also, maybe cache uniforms on the client, but the driver might already be doing that...
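One way to read the suggestion above is sketched below: set up the attribute pointers once for the whole shared buffer and describe each surface only by offsets. The structs and `MakeDrawParams` are hypothetical, and the sketch assumes the offsets would be fed to a call such as `glDrawElementsBaseVertex` (OpenGL 3.2 or `ARB_draw_elements_base_vertex`), which may or may not be what was meant or what BFG does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical placement of one surface inside the shared vertex/index buffers.
struct SurfaceRange {
    std::size_t   vertexByteOffset;  // where the surface's vertices start
    std::size_t   indexByteOffset;   // where the surface's indices start
    std::uint32_t indexCount;        // number of indices to draw
};

// Values a per-surface draw call would need once the attribute pointers
// have been set up a single time for the whole buffer.
struct DrawParams {
    std::int32_t  baseVertex;        // added by the GPU to every fetched index
    std::size_t   indexByteOffset;   // passed as the "indices" offset
    std::uint32_t indexCount;
};

// Converts byte offsets into draw parameters.
// Assumption: every surface starts at a multiple of the vertex stride.
// That does not follow automatically from 16-byte aligned allocations.
inline DrawParams MakeDrawParams(const SurfaceRange& s, std::size_t vertexStride) {
    assert(vertexStride > 0);
    assert(s.vertexByteOffset % vertexStride == 0);
    DrawParams p;
    p.baseVertex      = static_cast<std::int32_t>(s.vertexByteOffset / vertexStride);
    p.indexByteOffset = s.indexByteOffset;
    p.indexCount      = s.indexCount;
    return p;
}
```

With these values, the per-surface work shrinks to a single draw call instead of five attribute-pointer updates plus a draw. Whether the driver overhead actually drops would still have to be measured.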
cabalistic Posted December 10, 2017 Author So, Afterburner says ~50% (or less) GPU usage in Rightful Property. To be fair, though, I have a reasonably powerful GPU (GTX 1070) and fairly low settings. Profiling does not reveal any obvious optimization potential in the backend path. The significant portion of CPU time is already taken by the Nvidia GL driver, which means that we would have to optimize/reduce the number of GL calls to get any significant gain. I'm not entirely certain I understand what you are suggesting, but I think BFG does indeed do something like that, so it might be worth a try.
cabalistic Posted December 10, 2017 Author HOLY SHIT! I just found the most ridiculous bottleneck in the backend, thanks to nSight! It's glGetError. I'm serious. All those GL_CheckErrors() calls are incredibly costly. I just commented out the entire implementation of GL_CheckErrors, and in one of my go-to bottleneck scenes, the backend rendering dropped from taking 8 ms down to 2 ms, as reported by r_logSmpTimings! Didn't increase the framerate, because the frontend is still blocking, but now parallelizing the frontend is actually going to be worthwhile. We should probably hide the implementation of GL_CheckErrors behind either a cvar or a compiler flag... This is absurd. I wonder if this is an Nvidia issue? Would be interested to hear if it has a similar impact with your Intel chip.
duzenko Posted December 10, 2017 We have r_ignoreglerrors, should convert it from bool to int. It kinda makes sense and goes along with the "one-way flow" concept the OpenGL drivers seem to follow recently. Send commands one way, and don't try to ask the driver for any results unless absolutely necessary... From my experience, out of the three vendors AMD does the worst job when it comes to caching commands and background execution. Intel is better, but it's usually limited by hardware. nVidia has better (faster) OpenGL drivers than AMD and better hardware than Intel. I may have time tomorrow evening to test on Intel. I also have an AMD 460 at work, but it's paired with a 4.5GHz CPU and is usually limited by fillrate (in 1440p).
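A rough sketch of gating the error check behind a setting, as discussed in the two posts above. Everything here is hypothetical: `ErrorCheckConfig`, its `ignoreLevel` field and `CheckErrorsSketch` are illustration names, not the real `GL_CheckErrors` or `r_ignoreglerrors`, and the driver query is passed in as a callable so the example compiles without GL headers.

```cpp
#include <cassert>

// Stand-ins for GLenum and GL_NO_ERROR (which is defined as 0).
using ErrorCode = unsigned int;
constexpr ErrorCode kNoError = 0;

// Hypothetical integer setting in the spirit of "convert it from bool to int":
// 0 = always query the driver, anything above 0 = never query it.
struct ErrorCheckConfig {
    int ignoreLevel = 1;
};

// Drains pending errors through getError (e.g. a wrapper around glGetError)
// and returns how many were found. When checks are disabled, the driver is
// not touched at all, which is where the reported saving would come from.
template <typename GetErrorFn>
int CheckErrorsSketch(const ErrorCheckConfig& cfg, GetErrorFn&& getError) {
    if (cfg.ignoreLevel > 0) {
        return 0;   // skip the synchronous round trip to the driver
    }
    int count = 0;
    // Bound the loop so a misbehaving driver cannot keep us here forever.
    for (int i = 0; i < 10; ++i) {
        if (getError() == kNoError) {
            break;
        }
        ++count;    // real code would log the error code here
    }
    return count;
}
```

A compile-time switch (for example wrapping the body in a debug-only `#ifdef`) would remove even the branch, at the cost of needing a rebuild to turn checks back on. The runtime setting is the more convenient option when chasing a rendering bug on someone else's machine.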
cabalistic Posted December 10, 2017 Author In any case, we are now back to the frontend being the bottleneck, which is good. I did some experiments today trying to parallelize R_AddModelSurfaces (which in my profiling is the largest time consumer). However, all my attempts have led to exactly zero performance improvements. I don't know why; might be cache congestion or generally memory access. I read that one of BFG's design changes was to store less in memory and (re-)compute it when needed. I guess we have a lot of work ahead of us...
duzenko Posted December 10, 2017 Try shadow maps on and interaction tri culling off. Where are you testing this BTW?
cabalistic Posted December 10, 2017 Author Which cvar is interaction tri culling? Shadow maps don't seem to make much of a difference to the frame rate. I'm currently testing with Rightful Property, Briarwood Manor and A New Job.
duzenko Posted December 10, 2017 The one added in the last commit: r_useInteractionTriCulling. The two culling functions take a lot of time for me in RP. One is shadow volume calc. Another is interaction tri culling. Both are toggleable now.
Diego Posted December 11, 2017 I recently got an oculus, now I'm super excited about this! Couldn't make it work though, moved all the files and all, nothing happened. But don't waste time helping me! I'm here just to share enthusiasm and encourage the work.
cabalistic Posted December 11, 2017 Author Thanks Diego. I'm afraid a truly playable version is still far away, but I'll do my best. If you still want to try and get it to work, remember you'll need SteamVR running (no native Oculus support atm) and pay attention to the readme and the entry to autoexec.cfg mentioned there. @duzenko: I tested briefly with the interaction tri culling, but for me it does not seem to make a discernible difference. It didn't show up in my profiling before, either. It's curious that there's such a difference in what requires processing time... Also, I encountered a couple of crashes that seem to be related to shadow mapping. I guess I'll need to double-check that the vertexcache changes don't break shadow mapping.