I've been out of that area of research for quite some time now, but to me, the findings of that paper are nothing new. It has always been known that vergence is the primary driver of accommodation, and that defocus drives accommodation as well. It was never claimed that defocus alone drives accommodation. So the XR community was wrong right from the start.
All stereoscopic content drives vergence. If your stereoscopic stimulus is located 1 m behind the display, your eyes will converge perfectly on that point, and they will try to accommodate to that distance. Then your brain realizes "Wait a minute?! When I try to accommodate to that distance, everything becomes blurry? What is going on?", and there you have that darned unstable system again. How can you solve it? You pretty much mentioned all possible solutions already, except the first one in this list:
Move the focal distance so far away that you don't get defocus cues. --> This is the hyperfocal distance, and at that distance we sadly no longer perceive any vergence cues, i.e., you see a 2D display!
Varifocal distance with perfect eye tracking. --> Make vergence distance and focus distance match, and combine that with artificial defocus.
Multifocal or lightfield / holographic displays.
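
To put a rough number on the mismatch described above, here is a minimal sketch that expresses the vergence distance and the display's focal distance in diopters and takes the difference. The function names, the 63 mm interpupillary distance, and the 2 m display focal distance are all assumptions chosen for illustration, not values from the paper.

```python
import math


def vergence_angle_deg(distance_m, ipd_m=0.063):
    """Angle between the two lines of sight when fixating a point straight ahead.

    ipd_m is an assumed interpupillary distance (63 mm).
    """
    return math.degrees(2.0 * math.atan((ipd_m / 2.0) / distance_m))


def diopters(distance_m):
    """Optical demand for a given distance, in diopters (1/m)."""
    return 1.0 / distance_m


def va_conflict_diopters(vergence_distance_m, focal_distance_m):
    """Mismatch between where the eyes converge and where the image is sharp."""
    return abs(diopters(vergence_distance_m) - diopters(focal_distance_m))


# Assumed setup: display focal plane at 2 m, stimulus rendered 1 m behind it.
display_focal_m = 2.0
stimulus_m = display_focal_m + 1.0

fixed_focus_conflict = va_conflict_diopters(stimulus_m, display_focal_m)

# Varifocal case: the focal plane is moved to the vergence distance.
varifocal_conflict = va_conflict_diopters(stimulus_m, stimulus_m)

print(f"vergence angle at stimulus: {vergence_angle_deg(stimulus_m):.2f} deg")
print(f"fixed-focus conflict: {fixed_focus_conflict:.3f} D")
print(f"varifocal conflict:   {varifocal_conflict:.3f} D")
```

With these assumed numbers the eyes converge at 3 m (about 0.33 D) while the image is only sharp at 2 m (0.5 D), so the two cues disagree by roughly 0.17 D. In the varifocal case the difference goes to zero, which is the whole point of matching vergence distance and focus distance.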