The Dark Mod Forums

1 hour ago, stgatilov said:

UPDATE: Given that the backend spends most of its time generating machine code for template instantiations, I bet the real slowdown could be as much as 2.5-3 times. If someone remembers when Eigen was integrated, it can be checked directly.

I think 356dfb4e05f1ff6ea7f570376e6a2b4692ad581a was the commit that didn't have Eigen yet. I merged it in the commit right after that.


I think some more context is needed here.

Is the build noticeably slow (in terms of wall clock time), and impacting development, or is this purely a theoretical concern based on profiling data?

Assuming that the data correctly identifies Eigen template compilation as being slow, does this affect all kinds of change, or does the slowdown only happen when you change something fundamental like the Matrix4 header?

I'd be perfectly happy to use pre-compiled headers to avoid compilation cost (especially with the maths classes which don't change very much), but I have no experience with this technique (I think it's largely a Windows thing).


20 minutes ago, OrbWeaver said:

Is the build noticeably slow (in terms of wall clock time), and impacting development, or is this purely a theoretical concern based on profiling data?

Given that I do not work with DR on a daily basis, for me all DR-related concerns are purely theoretical.
Compared to some bad projects at my day job (hint: better not to combine templates with automatic code generation) it is fast anyway.
Do not consider this post as a complaint, better treat it as an interesting piece of information. It is up to you what to do with it.

I will measure wall time with and without Eigen at the revision where it was integrated.

Quote

Assuming that the data correctly identifies Eigen template compilation as being slow, does this affect all kinds of change, or does the slowdown only happen when you change something fundamental like the Matrix4 header?

I measured the time of a full clean build.

Of course, more typical incremental builds should be much faster, but:

  1. Since vector math is almost everywhere, it would be correct to expect incremental compilation to become slower by about the same ratio.
  2. Linking time becomes a bigger problem for incremental builds. Many template instantiations make it slower too. One indirect way to estimate it is to look at the total size of .obj files.
Quote

I'd be perfectly happy to use pre-compiled headers to avoid compilation cost (especially with the maths classes which don't change very much), but I have no experience with this technique (I think it's largely a Windows thing).

Precompiled headers will win back at most 10% of the time (at most 3% if only considering Eigen).
The real problem is template instantiations everywhere.

The only way to fix it is:

  1. Bring back simple vector/matrix classes with trivial implementations of simple arithmetic operations directly in the header (they should be inlineable).
  2. Include Eigen headers in only one cpp file and use it to implement whatever complicated operations you need. Expose such operations to headers as non-inlineable methods.

In other words, either don't use Eigen, or set up a compilation firewall between it and the rest of the codebase.
Without a compilation firewall, you won't get rid of the slowdown.
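
As a rough illustration of the two steps above, here is a minimal single-file sketch of such a compilation firewall. Everything in it is hypothetical: `Vector3`, `Matrix3` and `solve` are illustrative names rather than DarkRadiant's actual classes, and the body of `solve` is a hand-written stand-in (Cramer's rule) for what would be an Eigen call in the one .cpp file that includes the Eigen headers.

```cpp
#include <cassert>

// ---- Header part (hypothetical Vector3.h, included everywhere) ----
struct Vector3
{
    double x, y, z;
};

struct Matrix3
{
    double m[3][3]; // row-major
};

// Trivial operations stay inline in the header so every caller can inline them.
inline Vector3 operator+(const Vector3& a, const Vector3& b)
{
    return { a.x + b.x, a.y + b.y, a.z + b.z };
}

inline double dot(const Vector3& a, const Vector3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// A heavier operation is only declared here. No Eigen type appears in the
// signature, so includers of this header never see Eigen at all.
Vector3 solve(const Matrix3& a, const Vector3& b);

// ---- Implementation part (hypothetical Vector3.cpp) ----
// In a real firewall this would be the only file that includes Eigen and it
// would delegate to it. The code below is a plain stand-in so that the sketch
// compiles without Eigen installed.
namespace
{

double det3(const Matrix3& a)
{
    return a.m[0][0] * (a.m[1][1] * a.m[2][2] - a.m[1][2] * a.m[2][1])
         - a.m[0][1] * (a.m[1][0] * a.m[2][2] - a.m[1][2] * a.m[2][0])
         + a.m[0][2] * (a.m[1][0] * a.m[2][1] - a.m[1][1] * a.m[2][0]);
}

// Returns a copy of the matrix with one column replaced by the vector b.
Matrix3 withColumn(Matrix3 a, int column, const Vector3& b)
{
    a.m[0][column] = b.x;
    a.m[1][column] = b.y;
    a.m[2][column] = b.z;
    return a;
}

} // namespace

// Solves a * x = b using Cramer's rule.
Vector3 solve(const Matrix3& a, const Vector3& b)
{
    const double d = det3(a);
    assert(d != 0.0 && "matrix must be invertible");
    return { det3(withColumn(a, 0, b)) / d,
             det3(withColumn(a, 1, b)) / d,
             det3(withColumn(a, 2, b)) / d };
}
```

Callers pay one real function call for `solve`, while `operator+` and `dot` remain visible to the optimizer in every translation unit.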


25 minutes ago, stgatilov said:

Given that I do not work with DR on a daily basis, for me all DR-related concerns are purely theoretical.
Compared to some bad projects at my day job (hint: better not to combine templates with automatic code generation) it is fast anyway.
Do not consider this post as a complaint, better treat it as an interesting piece of information. It is up to you what to do with it.

Fair enough. Thanks for going to the trouble of producing the analysis.

25 minutes ago, stgatilov said:

Precompiled headers will win back at most 10% of the time (at most 3% if only considering Eigen).

The real problem is template instantiations everywhere.

Presumably the problem is the complexity of the templates, rather than the mere existence of templates? After all our original Vector classes were already templated on the element type, although we only ever instantiated them with <double>. Eigen's templates are considerably more complicated with their MatrixBase, DenseBase and other helper parent classes.

25 minutes ago, stgatilov said:

The only way to fix it is:

  1. Bring back simple vector/matrix classes with trivial implementations of simple arithmetic operations directly in the header (they should be inlineable).

It's odd that there isn't a way to tell the compiler "instantiate this complex template once, then use it everywhere else as a simple inlined class". I'm pretty sure that once all the helper templates are processed, the Eigen code must reduce to the same sequence of basic multiplications; you'd think there would be a way to get to the end point without having to deal with the semantics of template parsing each and every time (which is what I assume causes slow compilation).

25 minutes ago, stgatilov said:
  1. Include Eigen headers in only one cpp file and use it to implement whatever complicated operations you need. Expose such operations to headers as non-inlineable methods.

I suppose this would trade compilation speed for application speed (since you'd need actual function calls instead of inlined code for even simple operations) so would probably be a pessimisation from the user perspective.


5 minutes ago, stgatilov said:

Measured it with timewatch.
Full build took 2:41 before Eigen, and took 3:15 after Eigen.
I guess I should dig deeper what the numbers mean 🥺

That is actually impressively fast. I think it might even be faster than my Linux build.

My guess is that confusion probably arises from two things:

  • CPU time versus physical time: 600s of CPU time could be 100s of wall time on 6 processor cores in parallel.
  • Overlapping parallel processes incorrectly interpreted as being summed together: a 600s process feeding data into a 500s process might result in a wall time of 600s rather than 1100s.

1 hour ago, OrbWeaver said:

Presumably the problem is the complexity of the templates, rather than the mere existence of templates? After all our original Vector classes were already templated on the element type, although we only ever instantiated them with <double>. Eigen's templates are considerably more complicated with their MatrixBase, DenseBase and other helper parent classes.

Yes, simple templates usually don't cause much trouble.
But Eigen is most likely designed for large matrices, and all the complexity is worth it when you deal with 500 x 500 matrices.

Quote

It's odd that there isn't a way to tell the compiler "instantiate this complex template once, then use it everywhere else as a simple inlined class". I'm pretty sure that once all the helper templates are processed, the Eigen code must reduce to the same sequence of basic multiplications; you'd think there would be a way to get to the end point without having to deal with the semantics of template parsing each and every time (which is what I assume causes slow compilation).

Instantiated templates are not reparsed today, although MSVC did it for many years.

However, every call site where template code is inlined has to be compiled again and again; there is nothing to reuse there.
Also, the templates have to be recompiled in every translation unit, because that's what the separate compilation model requires.

It is possible to instantiate template code only once using extern template. However, you lose inlining this way too.
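
For readers who have not used it, here is a minimal sketch of what `extern template` looks like. `BasicVector3` is a hypothetical class used only for illustration, and the header and .cpp parts are shown in one file for brevity. The member functions are deliberately defined out of line, since those are the ones whose implicit instantiation the declaration suppresses.

```cpp
#include <cassert>

// ---- Header part (hypothetical BasicVector3.h) ----
template<typename T>
struct BasicVector3
{
    T x, y, z;

    T lengthSquared() const;      // defined out of line below
    T component(int index) const; // defined out of line below
};

template<typename T>
T BasicVector3<T>::lengthSquared() const
{
    return x * x + y * y + z * z;
}

template<typename T>
T BasicVector3<T>::component(int index) const
{
    assert(index >= 0 && index < 3);
    return index == 0 ? x : (index == 1 ? y : z);
}

// Explicit instantiation declaration: translation units that include this
// header are told not to instantiate the members of BasicVector3<double>
// themselves, because some other translation unit provides them.
extern template struct BasicVector3<double>;

// ---- Implementation part (exactly one .cpp file) ----
// Explicit instantiation definition: the single place where code for the
// members of BasicVector3<double> is generated.
template struct BasicVector3<double>;
```

A call such as `BasicVector3<double>{1.0, 2.0, 2.0}.lengthSquared()` in another translation unit then resolves at link time to the one instantiated copy, which is where the loss of inlining comes from.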

Quote

I suppose this would trade compilation speed for application speed (since you'd need actual function calls instead of inlined code for even simple operations) so would probably be a pessimisation from the user perspective.

You don't need inlining for SVD or for decomposing a matrix into translate + scale + rotate. These operations are slow anyway, so the lack of inlining won't be noticeable.
Inlining is very important for trivial things like adding, multiplying, dot products, etc. Just write three additions in the header and everything will be OK. Don't use Eigen for that.


1 hour ago, OrbWeaver said:

That is actually impressively fast. I think it might even be faster than my Linux build.

VC++ is not a slow compiler, I think it's actually doing pretty good. When I switched from MinGW to VC++ my life became a lot easier. Build times went down even more (by almost one order of magnitude, iirc) when I added the precompiled headers to the heavier projects, like the DarkRadiant main binary, the S/R and Objectives plugins, and the Scripting plugin. It really pays off, you can literally see how it chews through the compilation units much faster than before.

Precompiled headers are possible in gcc too, and we should see the same difference when we manage to add it to the CMakeLists. The Linux compilations in my VMs are awfully slow compared to what I'm used to in Windows.


1 hour ago, greebo said:

Precompiled headers are possible in gcc too, and we should see the same difference when we manage to add it to the CMakeLists. The Linux compilations in my VMs are awfully slow compared to what I'm used to in Windows.

Perhaps PCH in GCC are simply worse.

A precompiled header in MSVC is implemented as a memory dump of the compiler, taken at the end of processing the header.
When it is used, this saved state is simply loaded from disk (most likely memory-mapped) and processing continues from that point.

That's a rather barbaric approach, and it does not work for C++ modules (which act like modular PCH), but it should be perfect in terms of performance.


1 hour ago, stgatilov said:

Perhaps PCH in GCC are simply worse.

We don't currently have PCH at all on Linux, although in theory it should be possible with CMake 3.16 (which has a new directive specifically to support it).


Getting this to work would pay off big time. I recall doing that once for the TDM source code in Linux, and it was a huge improvement there too. But we were using Scons back then which caused the build to always think it was out of date and one had to recompile everything even if just changing a non-header file - understandably annoying, even with PCH.


I have inspected the revision just before Eigen was added the same way I did with master.

As expected, parsing takes 128s instead of 172s, and template instantiation takes 121s instead of 605s.
However, exclusive duration for C1DLL goes down from 785s to 514s and CPU time goes down from 608s to 473s. The expected -500s difference is not here. Moreover, the version without Eigen has 5% difference between CPU time and exclusive duration, while the version with Eigen has 20% difference.

I suspect that some data in the story cannot be added together, e.g. time for template instantiations.
And now there is the question of what took half of the frontend time before Eigen, given that Parsing and Template Instantiations summed together take only half of it.

Most likely I'm doing something wrong.
I already have an answer, now I need to find the question 😁


5 hours ago, greebo said:

Getting this to work would pay off big time. I recall doing that once for the TDM source code in Linux, and it was a huge improvement there too. But we were using Scons back then which caused the build to always think it was out of date and one had to recompile everything even if just changing a non-header file - understandably annoying, even with PCH.

Well, the good news is that this turned out to be really easy to set up. It's one line of CMake which Just Works, although I wrapped it in a CMAKE_VERSION check to make sure the build won't break for those who don't have CMake >= 3.16.

The not so good news is that this only seems to shave about 5 seconds off the compile time, from 3:44 down to 3:39. Perhaps it delivers more benefit when you're doing an incremental re-compilation after some code change.


On 6/9/2021 at 2:30 PM, greebo said:

I recall doing that once for the TDM source code in Linux, and it was a huge improvement there too. But we were using Scons back then which caused the build to always think it was out of date and one had to recompile everything even if just changing a non-header file - understandably annoying, even with PCH.

This reminded me about the time when I first got involved with TDM (Dec 2016), in the era of TDM 2.04/2.05-beta, building under Linux natively (i.e. no VM).

It should be noted that the incredibly annoying issue with SCons rebuilding everything all the time was easily fixed with essentially a 1-line tweak.

Reading my notes from back then, I see that when building "from scratch", use of PCH nicely lowered the build time (from 16m39s without PCH to 9m01s with PCH), but only for the 1st build.  Subsequent builds' speed improvement with PCH would obviously depend somewhat on what changed in the code, but the improvement was never anywhere near as dramatic, often showing no measurable improvement.

Fixing that SCons issue saved me about 9 minutes every single time I'd build TDM.  Using PCH saved me a little less than 8 minutes, but only on the very 1st build and never much again.


Some news about Eigen and build time.

I realized rather quickly why my original analysis was wrong.
The "Duration" shown for template instantiations in WPA is inclusive, so summing them up is a bad idea. It is especially bad when templates are deeply nested, which is the case for Eigen.

It took me quite some time to find a way of computing the total impact of Eigen.
I had to implement a custom analyzer in my fork.
Of course, this is not very reliable, because who knows what I did wrong 😌


Anyway, here is what it reports for the latest rev:

Microsoft (R) Visual C++ (R) Performance Analyzer DEVELOPER VERSION
Total time for parsing files matching "*Eigen*":
  CPU Time:      79.565226 /  96.567721 / 110.676782
  Duration:     448.373768 / 1112.570639 / 1168.585713
Total time for template instantiations matching "*Eigen*":
  CPU Time:     169.516602 / 169.516602 / 170.677942
  Duration:     122.637069 / 188.706707 / 190.002096

This time I did not limit parallelization, so Duration should be ignored.
Every line shows 3 numbers. The first two show the total exclusive time spent on Eigen headers/templates. They are computed in slightly different ways, and for some stupid reason produce different results 😥 The last number shows the total inclusive time spent on the topmost Eigen headers/templates.
If an Eigen template instantiation internally causes the instantiation of a non-Eigen template, then the last number includes the time for that child template, while the first two numbers don't. That's the difference.
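
The inclusive/exclusive distinction, and why summing inclusive durations over nested instantiations overcounts, can be sketched with a small tree of timed nodes. This is only an illustration: `TimedNode` and the helper functions are hypothetical and are not taken from the analyzer or from WPA.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical timing record: one header parse or one template instantiation.
struct TimedNode
{
    std::string name;
    double inclusiveSeconds;         // this node plus everything nested inside it
    std::vector<TimedNode> children; // nested parses/instantiations
};

// Exclusive time: inclusive time minus the inclusive time of direct children.
double exclusiveSeconds(const TimedNode& node)
{
    double childTotal = 0.0;
    for (const TimedNode& child : node.children)
        childTotal += child.inclusiveSeconds;

    assert(childTotal <= node.inclusiveSeconds && "children cannot outlast their parent");
    return node.inclusiveSeconds - childTotal;
}

// Naive total: adds up inclusive time of every node, so nested work is
// counted once per ancestor. This is the mistake described above.
double sumInclusive(const TimedNode& node)
{
    double total = node.inclusiveSeconds;
    for (const TimedNode& child : node.children)
        total += sumInclusive(child);
    return total;
}

// Correct total: adds up exclusive time of every node, which equals the
// inclusive time of the root.
double sumExclusive(const TimedNode& node)
{
    double total = exclusiveSeconds(node);
    for (const TimedNode& child : node.children)
        total += sumExclusive(child);
    return total;
}
```

With made-up numbers, a 6s instantiation containing a 3s one that in turn contains a 1s one gives 10s when inclusive times are summed but only 6s when exclusive times are summed; the deeper the nesting, the larger the gap.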

Here we see that about 110s is spent on parsing Eigen headers, and about 170s is spent on instantiating its templates. That's roughly 300s in total.


Here is a comparison of overall stats between the latest revision and the pre-Eigen revision.

  • Full wall time: 164s -> 184s
  • Total CPU time: 840s -> 1270s
  • CPU time of C1DLL (frontend): 650s -> 1052s

So, the frontend now takes about 400s more CPU time, which increases the total CPU time by 50%.
The wall time increases only by 12%.
Perhaps that's because the build is not perfectly parallelized, and the old version has more idle time. Basically, the increased CPU load has filled some of the idle time.

Here is what the CPU usage plot looks like:
[Image: DrBuildCpuLoad.png]

Maybe the wall time will catch up with CPU time in future, maybe not...

Also, since my tool reports 300s spent on Eigen in the new version (instead of 400s), it would be more correct to say that adding Eigen increased CPU time by 33% (instead of 50%).


  • 1 month later...
On 6/9/2021 at 12:30 PM, stgatilov said:

Most likely there is another problem with Eigen: very slow Debug performance.

I have no idea what are typical CPU-intensive workloads in DR, but my impression was that they rarely have numeric nature. If this is true, then slow Debug performance is not a problem (and potentially faster Release performance is not a benefit).

It seems that the "Brush::buildWindings" (which was slowing down patch-based hot reload) has become slower in Debug build.

According to the CPU Usage tool, "update map" previously took 0.5 seconds on Bakery Job; now it takes 1.5 seconds.
It looks like Eigen takes most of the added time.
Of course, performance won't differ in Release.

I wonder: how do you debug real-scale maps with such horrible performance?
If Visual Studio is a major developer use case, maybe add some sort of "fast debug" configuration?

P.S. I remember how @duzenko used Release build for debugging and added "#pragma optimize off" all over TDM code (and sometimes forgot to remove it) before Debug With Inlines received a performance bump.


I don't know how it maps to Visual Studio projects, but on Linux I frequently use:

-g -O0 — true debug mode, maximum debug information for identifying actual bugs

-g -O1 — optimised "profiling build", decent performance but still with debug information

-O3 — release mode, no debug information, maximum performance

I understand Visual Studio does have the concept of multiple targets, so perhaps an "optimised debug" configuration could be added to the projects?


Oh... I remember cmake suddenly asking me for something called Eigen after a git pull a while ago. Thankfully that's a vanilla package in Manjaro OS, I simply installed the library and everything worked great again (no related issues I could spot). This makes sense now that the improvements are explained.

Silly yet legitimate question: Would the engine also using this lib for similar calculations result in a performance improvement?

Edited by MirceaKitsune

1 hour ago, MirceaKitsune said:

Silly yet legitimate question: Would the engine also using this lib for similar calculations result in a performance improvement?

My bet is NO 😁
But nobody tried to measure it yet.

There is no need to disable debug information in full release build.
In fact, it must be retained if you want to analyze crash/core dumps from users.

There are many options affecting debug performance in Visual Studio:
  https://dirtyhandscoding.github.io/posts/fast-debug-in-visual-c.html


Decided to compare performance before and after Eigen:

  • Before: 356dfb4e05f1ff6ea7f570376e6a2b4692ad581a "#5581: Fix lights not interacting due to a left-over comment disabling the RENDER_BLEND flag"
  • After: 806f8d2aad9ef7a3d354e2681b71a7417be75e34 "Merge remote-tracking branch 'remotes/orbweaver/master'"

I built Release x64 and started DarkRadiant and TheDarkMod without a debugger attached.
Then I opened the Behind Closed Doors map.

I measured three things:

  • map load timing, as reported in DarkRadiant console.
  • FPS while slightly moving camera in DR preview near start.
  • Time for empty "Update entities now" when connected to TDM (I added ScopeTimer in exportMap scope).
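
For anyone who wants to repeat the third measurement, a scope timer of the kind mentioned above can be sketched in a few lines. This is a hypothetical stand-in built on `std::chrono`, not the actual ScopeTimer class used in the test; the name `SketchScopeTimer` and its interface are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical RAII timer: starts on construction and writes the elapsed
// time in seconds to the given variable when it goes out of scope.
class SketchScopeTimer
{
public:
    explicit SketchScopeTimer(double* elapsedSeconds) :
        _elapsedSeconds(elapsedSeconds),
        _start(std::chrono::steady_clock::now())
    {
        assert(_elapsedSeconds != nullptr);
    }

    ~SketchScopeTimer()
    {
        // steady_clock is monotonic, so the difference is never negative.
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - _start;
        *_elapsedSeconds = elapsed.count();
    }

    // Non-copyable: a copy would report the same scope twice.
    SketchScopeTimer(const SketchScopeTimer&) = delete;
    SketchScopeTimer& operator=(const SketchScopeTimer&) = delete;

private:
    double* _elapsedSeconds;
    std::chrono::steady_clock::time_point _start;
};
```

Declaring one instance at the top of the function being measured is enough; the destructor records the time on every exit path, including early returns.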

Results:

  • Before: 13.05-13.09 map load, 41-43 FPS, 0.85-0.87 update map
  • After: 13.35-13.41 map load, 38-40 FPS, 0.88-0.90 update map

I had to run everything many times and look at the last results. I also made sure TDM was not playing when I measured the first two numbers.
To me it looks like a measurable difference, but a small one. It would be great if someone measures this on a different machine.

Independent measurement is even more important because I sort of expected such results 😁

It would be very surprising to see a big difference, since most of the tasks done by DR are not limited by math.
And I was not surprised to see that Eigen is slower: while it can sometimes do 2 operations in 1 instruction, it has to perform a lot of shuffles all the time, which easily negates the potential benefits. Plus, increased pressure on the optimizer should result in less inlining and fewer optimizations.

 


The topic of build time was left in a somewhat unfinished state (here).

I have measured the time for a clean single-threaded build; here are the results:

  • Debug x64: 534 sec -> 782 sec (+46%)
  • Release x64: 544 sec -> 778 sec (+43%)

The results with fully parallel build are closer to each other: 161 sec -> 195 sec (+21%).
However, I think single-threaded results are most important to consider.

It seems there was some sort of bubble in the DarkRadiant build, and when Eigen was added, some of its cost filled the bubble, so the wall time increase was smaller than the increase in computation. However:

  1. We can improve parallelism (e.g. remove dependencies between projects) and remove the bubble; then the after/before Eigen ratio will become closer to the single-threaded results.
  2. The project will grow, more code will be added (including Eigen-using code). The bubble can fill up by itself, and at that moment the ratio will be closer to single-threaded too.
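
The percentages quoted for the single-threaded and parallel builds can be re-derived from the raw timings with a one-line helper. `percentIncrease` is a hypothetical name, and rounding to the nearest whole percent is an assumption about how the figures were rounded.

```cpp
#include <cassert>
#include <cmath>

// Relative increase from `before` to `after`, in percent, rounded to the
// nearest integer.
long percentIncrease(double before, double after)
{
    assert(before > 0.0);
    return std::lround((after - before) / before * 100.0);
}
```

Feeding in the timings from this post, `percentIncrease(534, 782)`, `percentIncrease(544, 778)` and `percentIncrease(161, 195)` reproduce the +46%, +43% and +21% figures.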

 


I'm happy to do some tests on my Linux machine at some point. I'm not too fussed about compile times unless they are becoming excruciatingly slow, but if there's actually a measurable performance degradation that affects users that is of course more concerning.
