64-bit patch for the Dark Mod

SteveL · August 15, 2015

@motorsep: I'd try the compiler switches that NagaHuntress has already suggested. That looks almost impossible to track down any further. It's one bad result in a billion calculations that use epsilons, and it took you a while to rep it in the vid so it'd be very hard if not downright impossible to isolate the bug in the code.

NagaHuntress · August 15, 2015

@NagaHuntress: Sorry, was at work. Here is the test case: https://drive.google.com/open?id=0BwE6dxM0O2PsQUNncWw0aGFuZms

I haven't had much time to investigate this yet, but I tried that test map and managed to get stuck within seconds. Looking at with the debugger indicates that the code in idPhysics_Player::CheckGround() thinks the player is on a slope. While on a slope the player is consider to be not walking, so is moving by air move.

I still need to delve deeper into this, but I suspect that acceleration of the 'slope' is being canceled by geometry or ground friction, and acceleration for the player control (which while 'air moving' is a tenth of walking) is being cancel by ground friction. The net effect is the player becomes locked in place.

@motorsep: I'd try the compiler switches that NagaHuntress has already suggested. That looks almost impossible to track down any further. It's one bad result in a billion calculations that use epsilons, and it took you a while to rep it in the vid so it'd be very hard if not downright impossible to isolate the bug in the code.

A test case is a test case. I managed to replicate this bug while using my build with 387 FPU instructions, so it was worth trying. Plus even if I didn't replicate it then, I would have given it a go with an SSE FPU build, as fixing these single precision only floating point bugs is something that should be done eventually.

SteveL · August 15, 2015

A test case is a test case. I managed to replicate this bug while using my build with 387 FPU instructions, so it was worth trying.

Good luck, I'll be delighted if you find a fix. My pessimism was only at the thought that it could be a tuning-of-constants thing rather than a logic error that you can properly fix.

Plus even if I didn't replicate it then, I would have given it a go with an SSE FPU build, as fixing these single precision only floating point bugs is something that should be done eventually.

We'd like to make all floating point precision explicit of course, so we're not constrained on compiler setup. Any idea how we could track down all the places in the code that rely on greater intermediate precision? Or any other way round it? The resulting bugs are hard to rep reliably and tough to pin down to a single line of code. We'd not find them all by pre-release play testing.

NagaHuntress · August 15, 2015

I haven't had much time to investigate this yet, but I tried that test map and managed to get stuck within seconds. Looking at with the debugger indicates that the code in idPhysics_Player::CheckGround() thinks the player is on a slope.

I have looked a bit more into this and have found the root cause of getting stuck in geomtry on the test map. When it finds that two edges have come into contact with each other it tries to make a normal vector for the plane those two line segments form. It then tries to make sure this normal vector is pointing in the right direction, and flipping it if it's not. It does this by testing if one of the vertexes one of the edges are on the right side of the plane. I suspect this trick works fairly often or at least there are other vertex and edge contacts to compensate when it doesn't, but if they're aren't the player can get stuck in geometry.

The attached patch contains a couple of ways to fix this bug. The first method (which is enabled) makes sure the normal is pointed in the towards the direction collision test is coming from. The second method makes sure the normal is pointed in the same direction as the normal for the polygon that is stationary. Both methods seem to work equally as well and I wasn't able to replicate getting stuck again with either.

I suspect there are other oddities going on with the test map, as when I walk along the ridge in the map (the edge where the two 1024x1024 textures meet) I sometimes "stumble". This stumble is just a small loss in momentum and the footstep pace resetting. I'm not sure if it's due to geometry problems, or if some of the maths doesn't quite work right at such large map coordinates. I might look more into it when I get time.

tdm--edge-contact-fix.txt

motorsep · August 16, 2015

@NagaHuntress: Thanks a bunch - the fix works! The only issue, as you stated, it stuttering. I don't think it happens in FPS view (maybe I simply didn't notice as I didn't walk long enough on the edge), but it's very obvious in TPS view:

(I used second method in the build I was testing). Hopefully you can figure this one out too

I believe it was never an issue in Doom 3 simply because maps were smaller, and were mostly made of basic brushes - no slopes and such, just flat floors.

I am thinking that since animations are played via script, the game still "thinks" player is on the ground and then momentarily in the air, and they again on the ground. So it switched anims crazy fast, and thus we see stuttering. I might be totally wrong, and even if I am not, I have no clue how to fix that.

Edited August 16, 2015 by motorsep

NagaHuntress · August 16, 2015

@NagaHuntress: Thanks a bunch - the fix works!

I've done some more testing and found that I could get in the Thief's Den when using the polygon plane method. I switched to the direction of movement method I didn't get stuck. (The point I got stuck on was on top of Creep's house, right before crossing over the peak of the roof top and falling off of the map.)

The only issue, as you stated, it stuttering. I don't think it happens in FPS view (maybe I simply didn't notice as I didn't walk long enough on the edge), but it's very obvious in TPS view:

(I used second method in the build I was testing). Hopefully you can figure this one out too

I believe it was never an issue in Doom 3 simply because maps were smaller, and were mostly made of basic brushes - no slopes and such, just flat floors.

I think this bug manifests where there are only (or mostly) edge on edge collisions. The only time you see that normally is when you have a topside edge, like in the ridge of the test map.

I am thinking that since animations are played via script, the game still "thinks" player is on the ground and then momentarily in the air, and they again on the ground. So it switched anims crazy fast, and thus we see stuttering. I might be totally wrong, and even if I am not, I have no clue how to fix that.

I haven't debugged it deeper yet, but that does seem to be a likely cause. The other possibility that come to mind is that the collision seems momentarily like a wall, so it tries to come to a stop. Another thing to try is the same geometry, but moving closer to the origin, and see if movement glitches while moving on the ridge.

NagaHuntress · August 16, 2015

I am thinking that since animations are played via script, the game still "thinks" player is on the ground and then momentarily in the air, and they again on the ground. So it switched anims crazy fast, and thus we see stuttering. I might be totally wrong, and even if I am not, I have no clue how to fix that.

Okay, I've looked at it a bit more, and certainly stemming from it jumping back and forth between thinking it's on the ground and in the air.

I've fixed this behaviour in the test map by changing "CONTACT_EPSILON" to "CONTACT_EPSILON * 4" in "game/physics/Physics_Base.cpp". This makes it test a bit furter downwards for gravity based contacts on the ground. I've play-tested the Thief's Den with this change and have not observed any problems due to it.

I'm not sure if it's the proper fix, as it might only patch over this test case but fail again at even larger coordinates, or under different geometry, and it's possibly it has undiscovered sideffects. Plus the magic "* 4" needs needs to be removed, and to do so it needs to be decided if constant CONTACT_EPSILON should be multiplied by 4 and affect all other related uses of it, or if a seperate constant should be used instead for that particular instance.

motorsep · August 16, 2015

I've done some more testing and found that I could get in the Thief's Den when using the polygon plane method. I switched to the direction of movement method I didn't get stuck. (The point I got stuck on was on top of Creep's house, right before crossing over the peak of the roof top and falling off of the map.)

Is the direction of movement method in second (disabled in your patch) #elif block ?

I also wonder if this fix affects non-player characters and AI (I haven't tested Doom 3 AI against that test case :blush: ).

NagaHuntress · August 16, 2015

Is the direction of movement method in second (disabled in your patch) #elif block ?

It's the one that's enabled in the patch ("// make sure the collision plane faces the direction of the trace").

I also wonder if this fix affects non-player characters and AI (I haven't tested Doom 3 AI against that test case ).

I haven't witnessed any problems with AI in regular FMs with these changes, but I haven't tried testing anything against the ridge. I suspect that the AI would have been vulnerable to original problem of getting stuck. I'm not sure if they would have stumbled, as I think player movement is handled by a different class from AI movement, which might not react to these glitches.

motorsep · August 16, 2015

Tested and indeed, it all works like a charm! Great job you've done, NagaHuntress, thank you!

I am gonna see about trying to test AI/NPC and physics objects on that edge. Hopefully none will get stuck

NagaHuntress · August 17, 2015

Tested and indeed, it all works like a charm! Great job you've done, NagaHuntress, thank you!

I've gone and tested that map with everything moved to around 0 in the X and Y coordinates (Z remains unchanged), and have witnessed no problems, even after reverting both the CONTACT_EPSILON and the edge normal fixes described previously. This does indicate that the problem ultimately stems from distortions introduced by such large coordinates and the limited storage precision used.

I am gonna see about trying to test AI/NPC and physics objects on that edge. Hopefully none will get stuck

I'm more worried about them falling off that small platform.

motorsep · August 17, 2015

This does indicate that the problem ultimately stems from distortions introduced by such large coordinates and the limited storage precision used.

What would be the fix for this?

NagaHuntress · August 17, 2015

Good luck, I'll be delighted if you find a fix. My pessimism was only at the thought that it could be a tuning-of-constants thing rather than a logic error that you can properly fix.

Well, as you can see above, the fixes used were a bit of logic and a bit of constant tuning. However, I think the proper fix is adjusting the coordinate space to be closer to zero when doing a collision trace, but such a fix is likely to mean extra calculation steps, and will require touching a lot of code to make sure it's done right. Plus in the back of my mind, I have a nagging thought that there maybe lingering artifacts due to distorted normals and plane equations being calculated at such extreme coordinates.

We'd like to make all floating point precision explicit of course, so we're not constrained on compiler setup. Any idea how we could track down all the places in the code that rely on greater intermediate precision? Or any other way round it? The resulting bugs are hard to rep reliably and tough to pin down to a single line of code. We'd not find them all by pre-release play testing.

I've been thinking on this and been trying to figure out a good way to solve it.

The simplest method would be to admit "defeat" and just convert almost all floats to doubles. It means extra memory being used for storage and a possible performance penalty due to extra data shovel, but the calculation speed should be largely unchanged, as it already uses lond doubles to due calculations.

An alternative would be to create typedefs like:

typedef float tdmFloat;
typedef double tdmDouble;

Which would replace regular usage of float and double in the code. These typedef would used in release code, but in development code they can be substituted with something like:

class tdmFloat {
protected:
	float value;
public:
	tdmIntermediateFloat operator*tdmFloat &a);
/*rest of operator overloads go here*/
}
class tdmIntermediateFloat {
protected:
	long double value;
public:
	tdmIntermediateFloat operator*tdmFloat &a);
	tdmIntermediateFloat operator*tdmIntermediateFloat &a);
/*rest of operator overloads go here*/
}
class tdmDouble {
protected:
	double value;
public:
	tdmIntermediateFloat operator*tdmFloat &a);
	tdmIntermediateFloat operator*tdmIntermediateFloat &a);
	tdmIntermediateFloat operator*tdmDouble &a);
	tdmIntermediateFloat operator*tdmIntermediateDouble &a);
/*rest of operator overloads go here*/
}
class tdmIntermediateDouble {
protected:
	long double value;
public:
/*operator overloads go here*/
}

With this automated cancellation detection could be implemented like so:

tdmIntermediateFloat tdmIntermediateFloat::operation- (tdmIntermediateFloat &a) {
	tdmIntermediateFloat r;
	int expv, expa, expr;
	/*Extract the exponent component of the input float.*/
	frexp(value, &expv);
	frexp(a.value, &expa);
	/*Compute the result.*/
	r.value = value - a.value;
	/*Extract the exponent component of the result.*/
	frexp(r.value, &expr);
	/*Check if too many bits were cancelled in the add/subtract.*/
	assert(r.value != 0 && ((expv > expa) ? expv : expa) - 20 < expr)
	/*Return the result.*/
	return r;
}

This would allow the checking for cancellation anywhere tdmFloat and kin are used without the need to add special logic or checks where they're considered.

Advantages:

Can be used by just add tdmFloat where needed.
No change in logic to use but may need add function/macro calls to cast too and from float and tdmFloat.

Disadvantages:

The classes tdmFloat and kin need to be written and tested.
Lots of floats will need to be changed to tdmFloat.
It might be that some sections expect cancellation normally, and excpetion mechanisms would need to be added for them.
An obvious performance penalty when used, but should be restricted to development builds.

If it's possible to get away with storing extra data in the classes, it might be possible to add meta data to help detect cases where desired precision is lost or degraded.

Beyond the above suggestions, I'm still thinking about the problem.

Tels · August 17, 2015

@NagaHuntress: Sorry, was at work. Here is the test case: https://drive.google.com/open?id=0BwE6dxM0O2PsQUNncWw0aGFuZms

This is a video of me getting stuck:

Heh, that video looks like many failures I had in maps of mine, mostly in case of raps and floors that are not exactly flat. Often one would get stuck or fall through the floor.

motorsep · August 17, 2015

@NagaHuntress: I recall we tried a few years back (with Doom 3 engine) to move to double floats and performance penalty was very harsh (and I don't recall if it solved the issue of getting stuck)

I wonder how CryEngine solves that issue with its massive outdoor levels. I've heard of a trick where player is always at 0 0 0 of the world, and the world moves around, unlike in conventional game engines, where the world is static, and player moves across the world.

SteveL · August 17, 2015

Beyond the above suggestions, I'm still thinking about the problem.

That'd help us catch all examples where a float gets truncated, but wouldn't tell us which ones are significant. We probably have no way to know which are significant unless we see bugs caused by them.

ID_INLINE idVec3 &idVec3::Cross( const idVec3 &a, const idVec3 &b ) {  
  x = a.y * b.z - a.z * b.y;  
  y = a.z * b.x - a.x * b.z;  
  z = a.x * b.y - a.y * b.x;  
  return *this;  
}

That's the function that my example came from. The tiny difference in outcome quickly multiples into something significant because the cross product is used to define a plane that then gets fit through a distant point.

Thinking about it, only a small subset of float calculations could ever suffer from loss of 80-bit intermediate precision. Only those that (1) have multiple steps in the same line, and (2) involve a subtraction. I'm not 100% sure about (2), but I should think you could add or multiply as many floats as you like, and as long as you don't compare them, the loss of intermediate precision won't be significant. A subtraction is effectively a comparison, and that's what's giving us the different outcomes with lower-precision intermediates.

NagaHuntress · August 17, 2015

@NagaHuntress: I recall we tried a few years back (with Doom 3 engine) to move to double floats and performance penalty was very harsh (and I don't recall if it solved the issue of getting stuck)

Interesting. I wouldn't have expected the performance penalty to of been sever, as it's already working double precision already but just reducing its results down to floats when done. I suppose the extra data transfers from the larger data types are causing a performance hit.

I wonder how CryEngine solves that issue with its massive outdoor levels. I've heard of a trick where player is always at 0 0 0 of the world, and the world moves around, unlike in conventional game engines, where the world is static, and player moves across the world.

Moving the whole world around the player strikes me as a rather processor intensive way to do things (unless they're referring to moving it in blocks, which is not quite the same as being at 0,0,0 all the time). I imagine what's more likely happening is that collision and other localised geomtry operations are moved to around 0,0,0 to extract the collision, visibility, etc. data and then displace it back to it's world coordinates.

That'd help us catch all examples where a float gets truncated, but wouldn't tell us which ones are significant. We probably have no way to know which are significant unless we see bugs caused by them.

My approach would be to assume they're all significant until proven otherwise. However that means potentially wading through and suppressing a lot of false positives before the interesting ones make themselves known. The other problem is that as the idea stands, it only catches cancellation failures.

ID_INLINE idVec3 &idVec3::Cross( const idVec3 &a, const idVec3 &b ) {  
  x = a.y * b.z - a.z * b.y;  
  y = a.z * b.x - a.x * b.z;  
  z = a.x * b.y - a.y * b.x;  
  return *this;  
} 
That's the function that my example came from. The tiny difference in outcome quickly multiples into something significant because the cross product is used to define a plane that then gets fit through a distant point.

Thinking about it, only a small subset of float calculations could ever suffer from loss of 80-bit intermediate precision. Only those that (1) have multiple steps in the same line, and (2) involve a subtraction. I'm not 100% sure about (2), but I should think you could add or multiply as many floats as you like, and as long as you don't compare them, the loss of intermediate precision won't be significant. A subtraction is effectively a comparison, and that's what's giving us the different outcomes with lower-precision intermediates.

I've found an interesting article that discusses how MSVC (and GCC) handles intermediate precision. https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/

When you subtract two floating point number of about the same value, you can run into the problem where they cancel significant digits. If both share 20 significant bits, then a subtract will cancel those 20 bits, leaving you with the remaing bits in the mantissa. For regular floats that's 4 bits (24 - 4) for doubles it would be 33 bits (53 - 20). The use of double for intermediate calculations does stave off the problems with cancellation by giving you more digits you can safely lose, as witnessed in your example.

Cancellation issues aside, the getting stuck problem witnessed in motorsep's example I suspect is due to quantization, which is brought on by operating at such large coordinates, which tie up most of a float's significant bits. The way to solve that is either move the coordinates to near 0,0,0 and do collision detection there, so most of it's bit are available for collision detection, before moving it back; or to bump up the important data types to double so nothing important is lost in the process.

motorsep · August 17, 2015

The way to solve that is either move the coordinates to near 0,0,0 and do collision detection there, so most of it's bit are available for collision detection, before moving it back; or to bump up the important data types to double so nothing important is lost in the process.

Would you only need to increase precision for collisions part of the code or for the entire codebase ?

NagaHuntress · August 18, 2015

Would you only need to increase precision for collisions part of the code or for the entire codebase ?

Ideally you'd increase precision for just the collision, but the difficulty is keeping all the extra precision isolated to just that section without lots of re-engineering. You might get improvement by changing idPluecker to use doubles and its use seems to be isolated to just collision detection. Some floats in cm/CollisionModel_* files could be modified to use doubles instead of float. However, it's possible that won't be enough and you'll need to start changing Vec3 classes, which can have carry on effects to the rest of the code base, or creating a "DoubleVec3" and add extra code to convert to and from Vec3 in collision handling.

motorsep · August 18, 2015

From what I understood, renderer uses same math classes as the rest of the code. If you would switch everything to doubles, there will be issues with rendering. But perhaps I misunderstood that.

NagaHuntress · August 18, 2015

From what I understood, renderer uses same math classes as the rest of the code. If you would switch everything to doubles, there will be issues with rendering. But perhaps I misunderstood that.

I haven't looked closely at how the renderer is structured yet, but ultimately things will need to be converted to single precision floats by the time it hits OpenGL. Unique structures that are used for just the render can be kept as float, but if they use Vec3 and other similar structures then those uses will need to be replaced with a float only equivalent.

SteveL · August 18, 2015

idVec3 exposes its internal data through its ToFloatPtr() method. That gets passed to openGL a lot, so clearly the code wasn't designed with portability in mind. It's pretty much locked down to the original compiler architecture.

The map size issue isn't a problem for TDM, where our biggest map extends less than 10k units from the origin.

@motorsep do you get the same issues with patches in your huge maps, or is it just brushes? Patches use vertex coords instead of plane-normal format, so they should be unaffected by distance precision issues, but it might all be the same once they've been converted to collision models.

motorsep · August 18, 2015

@motorsep do you get the same issues with patches in your huge maps, or is it just brushes? Patches use vertex coords instead of plane-normal format, so they should be unaffected by distance precision issues, but it might all be the same once they've been converted to collision models.

I don't use patches a lot and I never made terrain with patches. However, player was getting stuck in the mesh made terrain. So I am guessing it's safe to say patches would have same issues as mesh/model.

motorsep · August 18, 2015

@NagaHuntress: Why is the fix you already implemented is not enough? It seems to be working fine (I am still testing large 120k x 120k terrain, but so far so good)

SteveL · August 18, 2015

I've found an interesting article that discusses how MSVC (and GCC) handles intermediate precision. https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/

That's the same article that I discussed with grayman and greebo when we were first discussing the bugs in our MSVC2013 port back in January. Except that the article's had more info added since then. It didn't have any mention of _controlfp_s or FLT_EVAL_METHOD at the time. I suspect those could fix our 2013 compile without needing us to rewrite any code. I don't know whether those switches can help us with our 64-bit build in MSVC -- the chart in the article suggests not -- so we might still have to bite the bullet and track down the sensitive lines of code. That'd be something I can at least help with... I don't know enough about compilers and OS environments to try to tackle it from the other end.

64-bit patch for the Dark Mod

Recommended Posts

Link to comment

Share on other sites

Top Posters In This Topic

Popular Days

Top Posters In This Topic

Popular Days

Popular Posts

NagaHuntress

NagaHuntress

NagaHuntress

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recent Status Updates