The Dark Mod Forums

Bulk fixing modwiki links in the wiki -- good idea?


SteveL


There are a lot of broken external links in the wiki that used to point to very useful modwiki (RIP) articles. I just fixed one to point to the modwiki replication at archive.org.

 

Should I fix them all? I went through a spell years ago of making bots for Wikipedia and I still have my library code. A script could cycle through all links to modwiki, generate the archive.org equivalent, test it, then fix the link if the page returns ok. A delay would make sure the wiki isn't overloaded by queued changes.
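Roughly what I have in mind, as a minimal sketch. The modwiki.net URL pattern is an assumption on my part, and reading/writing the wiki pages themselves is left out; this only shows the rewrite-test-replace cycle with a throttling delay:

```python
# Minimal sketch of the link-fixing pass. The modwiki.net pattern is an
# assumption; fetching and saving the wiki pages is left to the bot library.
import re
import time
import urllib.error
import urllib.request

WAYBACK_PREFIX = "https://web.archive.org/web/20121001000000/"
MODWIKI_LINK = re.compile(r"http://(?:www\.)?modwiki\.net/wiki/\S+")

def archived_ok(url):
    """True if archive.org serves the page without an error status."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False

def fix_links(page_text):
    """Swap each modwiki link for its archive.org equivalent if that works."""
    def repl(match):
        old = match.group(0)
        new = WAYBACK_PREFIX + old
        time.sleep(5)  # delay so neither our wiki nor archive.org gets flooded
        return new if archived_ok(new) else old
    return MODWIKI_LINK.sub(repl, page_text)
```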

 

Are there any problems with the idea of linking to archive.org? I know nothing about their business model, but at first glance it looks expensive and unprofitable to me, so I'm nervous it might vanish one day. Perhaps we should download all the articles while there's a chance.


I'd be a fan of that process.

 

Archive.org keeps investing in infrastructure thanks to its supporters. I can't imagine it's profitable, but nobody expects such an endeavour to be. They did a big replication of their servers and data to another continent a while ago. If they go, all that knowledge is lost, but it's effectively lost to us already unless a bot can replicate those pages here. Either way, I'd certainly revise those links.

"The measure of a man's character is what he would do if he knew he never would be found out."

- Baron Thomas Babington Macauley

Link to comment
Share on other sites

Perhaps we should download all the articles while there's a chance.

 

I don't have time to do it, but I would definitely support it.


Should I fix them all? I went through a spell years ago of making bots for Wikipedia and I still have my library code. A script could cycle through all links to modwiki, generate the archive.org equivalent, test it, then fix the link if the page returns ok. A delay would make sure the wiki isn't overloaded by queued changes.

That would be very much appreciated.


Let's do both. I have wget creating a mirror of the archived site -- hopefully that's working as intended, it was fiddly to set up -- and I'll work on a way to fix the links. I'll have to dust off some tools.

 

Uploading the archived site back to a wiki isn't that easy. What got archived wasn't the wiki markup; it was a rendering of the site in HTML. It should be possible to strip out the excess and then maybe convert the HTML back to wiki markup. Failing that, wget will restructure all the internal links to use relative paths, so the output could just be uploaded somewhere as a standalone site.


My update for tonight: attempt #2 to replicate the site is running. I had to make my own webcrawler in the end. Tools like wget can't cope with archive.org's versioning. The capture date ends up in the URL, and the links in each page are adjusted to give you the closest version to the capture date of the page you are looking at. So if you try to use a standard webcrawler, you end up with all versions of all pages, which is way too much.

 

My crawler also filters out non-main-namespace pages like User: and Image:. I'll use wget when this finishes to get the images and other embedded resources. I'll also have to fix the internal links in these pages with a script instead of using wget.
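For the curious, the core of it is just building the Wayback URL yourself with a pinned capture date instead of following the rewritten links, plus a namespace filter. A rough sketch, using the domain and date discussed above (the helper names are made up):

```python
# Rough sketch of the crawler's fetch step: pin every request to one capture
# date and skip non-main-namespace titles. Names here are illustrative only.
import urllib.request

CAPTURE_DATE = "20121001"              # pinned capture date
SKIP_PREFIXES = ("User:", "Image:")    # non-main namespaces to ignore

def snapshot_url(title):
    """Wayback URL for a modwiki article at the pinned capture date."""
    return ("https://web.archive.org/web/%s/http://www.modwiki.net/wiki/%s"
            % (CAPTURE_DATE, title))

def fetch_article(title):
    """Return the archived HTML, or None for pages we don't want."""
    if title.startswith(SKIP_PREFIXES):
        return None
    with urllib.request.urlopen(snapshot_url(title), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```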

 

I've not made any effort to disguise my spider, but I've got it running slower than I reckon the vast majority of bots will run, so with luck it won't get squashed.


I currently have 1157 pages downloaded and counting. I might have triggered a bot stop last night: I checked it once during the night and found it had stopped at 970 items with error 503 "Service Unavailable", and when I kicked it off again it only did 9 more. This morning I've changed IP address, slowed it down further, randomized the wait time, added a long pause after failed fetches, given it one retry in the event of random HTTP errors, and kicked it off again. It's been going half an hour or so now without problems.
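For reference, the politeness settings amount to something like this; the numbers are just the ballpark I'm using, not anything official:

```python
# Sketch of the throttling described above: a randomized wait between
# requests, a long pause after a failure, and a single retry.
import random
import time
import urllib.error
import urllib.request

BASE_WAIT = 10        # seconds between requests -- deliberately slow
FAIL_PAUSE = 300      # long pause after a failed fetch
RETRIES = 1           # one retry on random HTTP errors

def polite_get(url):
    for attempt in range(RETRIES + 1):
        time.sleep(BASE_WAIT + random.uniform(0, BASE_WAIT))
        try:
            with urllib.request.urlopen(url, timeout=60) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code in (429, 503):        # throttled / bot-stopped
                time.sleep(FAIL_PAUSE)
            if attempt == RETRIES:
                raise
    return None
```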

 

The pages I've downloaded so far have the next 3k links in them, so we have other options. I could download that list in a random order, then parse the new files for more links and repeat. It wouldn't look so obviously like a recursive crawler doing a depth-first search of the site. Let's see how it goes. I'll keep plugging away at it.

 

EDIT: 10:20 update: No more stoppages to report. Now on 1978 pages retrieved, with another 658 pages discovered but not yet checked.

 

EDIT: 12:50: Still no errors. Looks like the speed is ok now. Scores are 3013 visited / 1048 discovered.


Just be aware that modwiki was taken down by its owner due to the continual influx of spam submissions. I know there is a lot of useful content there, but I'm not sure there are over 2000 pages of it; some of that might be cached spam.


Early October 2012 was the date that archive.org started to pick up and archive "page not found" errors in place of modwiki articles, so the crawler I have running is requesting the page versions from Oct 1st, and if that results in an archived "page not found", it requests the version current at July 1 2012.
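In other words, the fallback is just two hard-coded dates tried in order, with a crude check for an archived "missing page" capture. A sketch, where the exact wording of that check is an assumption:

```python
# Sketch of the two-date fallback. The "missing page" test is a stand-in
# for however those archived error pages are actually recognised.
import urllib.error
import urllib.request

FALLBACK_DATES = ("20121001", "20120701")

def is_archived_404(html):
    return b"There is currently no text in this page" in html

def fetch_with_fallback(title):
    for date in FALLBACK_DATES:
        url = ("https://web.archive.org/web/%s/http://www.modwiki.net/wiki/%s"
               % (date, title))
        try:
            with urllib.request.urlopen(url, timeout=60) as resp:
                html = resp.read()
        except urllib.error.URLError:
            continue
        if not is_archived_404(html):
            return html
    return None
```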

 

I haven't spotted any obvious spam in the output, but yes, there *are* lots of pages we won't want in the TDM wiki. There are thousands of pages describing entities that don't exist in TDM, like shuttle doors and rocket barrels, and lots of it is related to Quake4 and other games. Modwiki doesn't seem to have a "Doom3" category, so I'm downloading all of it. The main page in the last archive said there were 9k articles. Archive.org reckons there were 16k, but that'll include user pages, category lists, images, and other stuff that I'm not downloading on this sweep, which is to recover the text, formatting, and links only.

 

We haven't discussed what to do with the downloaded site yet. I was going to spark off that conversation when I know how many pages we're dealing with and how big it is after stripping out the bloat. As soon as it finishes I'll upload a zip of the site to a cloud drive and link it here, but I do plan to do more processing on it. 20 KB of each page is archive.org's history toolbar, and inside that wrapper you also have the standard MediaWiki navigation stuff. So I'll extract the content, download the embedded media, and repair the internal links, then upload another zip that could be a working site if someone chooses to put it up.
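The stripping itself shouldn't need anything clever. Assuming the article body sits in MediaWiki's usual content div (the div ids below are from the old default MonoBook skin, and are an assumption on my part), something as blunt as this gets most of the way:

```python
# Blunt sketch of the content extraction: keep everything between the
# MediaWiki content div and the footer, dropping the Wayback toolbar and
# site chrome before it. The div ids assume the old default MonoBook skin.
def extract_body(html):
    start = html.find('<div id="content"')
    end = html.find('<div id="footer"', start)
    if start == -1 or end == -1:
        return html        # markers not found: fall back to the whole page
    return html[start:end]
```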

 

If we decide to add some of it to the TDM wiki, I'd suggest giving all pages a "Modwiki:" prefix (in effect creating a separate namespace for the modwiki pages), marking them read-only, and putting a template at the top saying where it came from and when it was captured. That's important because some of it will duplicate content already on the wiki. Even then, the content would have to be filtered by humans first. If people decide that's the way we want to handle it, then we could split up the archive between volunteers, who could decide whether to keep or throw away each page.

 

We could just as well decide not to do anything with it, except to have one or two more people from this forum download and save my zip file so that copies of the site exist somewhere other than archive.org. About 20% of the non-red links in the original modwiki no longer exist on archive.org. Fortunately, the ones I've looked at appear to be very obscure articles on specific entities. Perhaps they have a pruning policy for content that doesn't get visited (or that didn't get visited on the original site), in which case it'll be good to have extra copies lying round.

 

I still plan to fix the links on our wiki to point to archive.org, so I expect that'll be the main way these pages get seen.


I agree, except I wouldn't mark them read-only: then fixes can't readily be made and info can't be enhanced, clarified, or annotated, while the nature of a wiki is to change info quickly and keep a history of how it was before just a click away. (If they are marked read-only, folks are likely to create an additional addendum page and edit the pages that link to the original to direct people to the applicable revisions.)

"The measure of a man's character is what he would do if he knew he never would be found out."

- Baron Thomas Babington Macauley

Link to comment
Share on other sites

My thinking was that all new or improved content should be directed to TDM wiki articles, to avoid the danger of branching page versions. The template at the top of the Modwiki: pages could encourage people to copy and paste anything useful into the appropriate article instead of linking to it. But that might rarely happen of course given that we're all busy on maps :) The effect would be the same if we just upload the site as it is to a normal static web page -- which might be preferable to adding lots to the wiki. I take your point about addendum pages, which are also to be avoided.


The extraction of modwiki is still running. 5672 pages recovered. No problems since I added a pause+retry after random http errors. It'll soon be done I think. We were down to 32 unexplored pages half an hour ago, but now we're back up to 300-odd. Either way, it's now trending downwards instead of upwards. I'll report back shortly with some counts of different page types. Most modwiki pages have a category in brackets as part of the page name in addition to their proper category. I guess we can ignore the masses of _(entity) articles like Reaction_moveto_shuttledock_(entity) because TDM has its own completely different set of entities. On the other hand, func_entities might be useful. There are plenty of (Quake_4) articles we can ignore. I'll report back with counts.
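The counts themselves are easy to pull out of the saved file names. Something like this does the tallying (the folder name is a placeholder):

```python
# Tally the saved pages by the "(...)" suffix in the page title.
# "modwiki_pages" is a placeholder folder name.
import os
import re
from collections import Counter

SUFFIX = re.compile(r"_\(([^)]+)\)$")

def bracket_category(title):
    match = SUFFIX.search(title)
    return match.group(1) if match else "Nothing in brackets"

counts = Counter(bracket_category(os.path.splitext(name)[0])
                 for name in os.listdir("modwiki_pages"))
for category, n in counts.most_common():
    print(n, category)
```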

 

Topic 2: Fixing links in the TDM wiki

 

Turns out there aren't many links from the TDM wiki to modwiki: 38 links, which lead to 24 distinct articles. So we don't need a bot. I'll fix these individually, by uploading the articles or merging their content into the existing articles.

 

* Blend_(Material_stage_keyword)

* Brush_carving

* CVars_(Doom_3)

* Category:CVars

* Cube_maps

* Dmap_(console_command)

* Func_damagable_(entity)

* GUI_scripting

* Light

* LightCarve

* Maya_to_MD5

* Models_vs_brushes

* Move_(script_event)

* Optimising_maps

* ROQ_(file_format)

* Rotate_(Material_stage_keyword)

* SCRIPT_(file_format)

* Script_events_(Doom_3)

* Scroll_(Material_stage_keyword)

* Set_(GUI_command)

* Sound_(keywords)

* Spectrum_(Material_global_keyword)

* Texturing

* Visportal


Thanks! The extraction is finished. 5796 pages found in exploring the site, of which 993 were not archived.

 

Here's how the pages break down by brackets in the page title:

4134 (entity)

628 (cvar)

411 (script_event)

98 (Nothing in brackets)

91 (GUI_item_property)

72 (Material_global_keyword)

65 (console_command)

64 (file_format)

51 (Material_stage_keyword)

26 (GUI_item_event)

25 (Sound_shader_keyword)

23 (GUI_command)

16 (folder)

16 (decl)

14 (GUI_item_type)

13 (Image_program_function)

11 (Doom_3)

10 (Quake_4)

6 (Prey)

6 (ETQW)

3 (scripting)

2 (class)

2 (Resurrection_of_Evil)

2 (Part_2)

2 (Part_1)

2 (Materials)

1 (tutorial)

1 (tool_overview)

1 (keywords)

 

As promised, I'll load that lot to a cloud drive right now and post a link here so there's at least one copy out there. I'll add another to it when I've stripped out all the excess HTML and fixed the links and images.

 

EDIT: Here's the link to the raw text content: https://www.amazon.c...YSPIgzToRdoh2gY Updated version below


I've updated the archive with a hundred or so images that I was able to recover. That's about half the image links I could find in the downloaded pages, and of those, that's all that archive.org has. I discovered they provide an API specifically for bots to find out what the last good date for a page is, so I used that this time instead of multiple fetch attempts with hard-coded dates.
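That API is the Wayback "availability" endpoint; a minimal lookup looks something like this (the example article is only for illustration):

```python
# Query the Wayback availability API for the closest good capture of a page.
import json
import urllib.parse
import urllib.request

def closest_capture(page_url, timestamp="20121001"):
    query = urllib.parse.urlencode({"url": page_url, "timestamp": timestamp})
    api = "https://archive.org/wayback/available?" + query
    with urllib.request.urlopen(api, timeout=30) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# e.g. closest_capture("http://www.modwiki.net/wiki/Visportal")
```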

 

If anyone thinks there should have been a lot more than 220 images on modwiki, let me know. I didn't know the site, and I can't think of any way to check whether I found them all or not.

 

Still to do: Fix the links and separate the content from the excess html.

 

There'd be a lot of human decision-making to be done before merging this into our wiki, so I suggest we shelve that plan unless people really want to do it. In the meantime I'll make a simple front page with indexes based on page name, category extracted from the page name, and MediaWiki category. The internal and external links will work, and it could be uploaded as a website if someone wanted to do that, but it'll also work in the browser from a folder on a hard drive. Searching page names will be easy using Ctrl+F in the browser, but I don't know how to set up a content search within the packaged site, so unless someone tells me it's simple, content searching will have to be done through normal OS file-search methods or (if it becomes a website) getting Google to index it.

 

https://www.amazon.c...kT2clKFnYzt89q4


If anyone thinks there should have been a lot more than 220 images on modwiki, let me know. I didn't know the site, and I can't think of any way to check whether I found them all or not.

 

It wasn't a very image-heavy site.


  • 3 weeks later...

Finally got this into a working state and fit to be shared. Bugfixes only from now on! It now works as a website, either from a hard drive (which is how I'm using it) or from a server if anyone wanted to upload it. It's not pretty, but it has all the content and the layout, and it works.

 

Changes since last update:

  • Meaningful content of pages extracted from the masses of excess HTML from Archive.org + MediaWiki
  • ~300 extra pages added that got missed first time round due to the hard-coded capture dates in the webcrawler
  • Standardized brackets, spaces, etc. in page titles / file names / links (see the sketch after this list). Archive.org often has both "(category)" and "%28category%29" variants of a page.
  • Eliminated empty pages and greyed out their links
  • Fixed page links to recovered pages so they use the right relative paths
  • Fixed page-internal links (e.g. from the index table at the top of a page)
  • Removed links to wiki Image: pages
  • Added a "retrieved from" link to the bottom of every page so people can get at the archive.org version, in case they want a larger version of an image or in case I broke anything on my version of the page
  • Fixed images so they work, where the image could be recovered
  • Fixed external links, leaving them pointing to archive.org replications where possible so that dead links to forum posts still work
  • Disabled MediaWiki red links
  • Fixed non-main-namespace links to point to archive.org if the page exists, otherwise disabled them
  • Rebuilt the Category pages from scratch; archive.org only captured the first 200 links for each
  • Added an index.html landing page and various index pages linked from it
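On the title standardization point above, the normalization is basically just decoding percent-escapes and tidying separators so the two variants collapse to one file name. A small sketch:

```python
# Sketch of the title normalization: decode percent-escapes so
# "%28category%29" and "(category)" become the same name, then tidy
# spaces/underscores into one separator style.
from urllib.parse import unquote

def normalize_title(title):
    title = unquote(title)            # "Func_damagable_%28entity%29" -> "...(entity)"
    title = title.replace(" ", "_")
    while "__" in title:
        title = title.replace("__", "_")
    return title
```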

Here's the link to the archive. To run it as a website from your hard drive, unpack the archive somewhere. It'll drop 6k files into a folder called "modwiki_archive". Create a shortcut to modwiki_archive/index.html then it should run in your browser like any website.

 

https://www.amazon.c...YQ6MhNrkzsTm9FU

