> I gave up on dealing with link rot years ago. If I come across an old post with non-functioning links, I may just find a new resource, link to The Wayback Machine or (if I'm getting some spammer trying to get me to fix a broken link by linking to their black-hat-SEO laden spamfarm) remove the link outright. I don't think it's worth the time to fix old links, given the average lifespan of a website is 2½ years and trying to automate the detection of link rot is a fool's errand (a page that goes 404 or does not respond is easy - now handle the case where it's a new company running the site and all old links still go to a page, but not the page that you linked to). I'm also beginning to think it's not worth linking at all, but old habits die hard.
After about nine years of writing, I've concluded something similar: my existing reactive approach (https://www.gwern.net/Archiving-URLs) is not going to scale either with expanding content or over time. Fixing individual links is OK if you have only a few or aren't going to be around too long, but as you approach tens of thousands of links over decades, the dead links build up.
So the solution I am going to implement soon is taking a tool like ArchiveBox or SingleFile and hosting my own copies of (most) external links, so they will be cached shortly after linking and can't break. The bandwidth and space will be somewhat expensive, but it'll save me and my readers a ton in the long run.
I stopped blogging a couple of years in because I started running out of new things to talk about. I started back up about 5 years later and had the same experience all over again. I've seen other bloggers and webcomic artists struggle with the same issue.
The mistake I vowed to correct if I started a third time was this feeling that if I'd already written a couple of pages on a topic, I should be done with it. People change. Tech changes. I shouldn't feel guilty about 'retreading' something I said a couple of years ago. I have new information.
Which is to say, rather than forever going back and updating old entries, it might be more productive to revisit the material you still have the strongest feelings about. Talk about what has changed, and what hasn't.
> So the solution I am going to implement soon is taking a tool like ArchiveBox or SingleFile and hosting my own copies of (most) external links, so they will be cached shortly after linking and can't break. The bandwidth and space will be somewhat expensive, but it'll save me and my readers a ton in the long run.
Wouldn't a more ideal solution be archiving via a variety of external (or internal, I guess) sources the first time a link appears on your site, and then after a year automatically switching all links to archived versions? This would kill link rot in its tracks while preserving a lot of the value of links for the people you're linking to, and would cost less in bandwidth given the curve of access on old content.
> Wouldn't a more ideal solution be archiving via a variety of external (or internal, I guess) sources the first time a link appears on your site, and then after a year automatically switching all links to archived versions?
Yes, by 'shortly' I meant something like 90 days. In my experience, most pages won't change too much 90 days after I add them (I'm thinking particularly of social-media-like things), but it's also rare for something to die that quickly. 365+ days, however, would be perilously long. My main concern is balancing between delaying the snapshot so long that the link dies (thus recreating the manual linkrot-repair problem I'm trying to avoid) and snapshotting so eagerly that I archive a version which isn't finished and would mislead a reader.
(I also went through all my domains and created a whitelist of domains that my experience suggests are trustworthy or where local mirrors wouldn't be relevant. For example, you obviously don't really need to worry about arXiv links breaking, or about English Wikipedia pages disappearing - barring the occasional deletionist rampage.)
> This would kill link rot in its tracks while preserving a lot of the value of links for the people you're linking to, and would cost less in bandwidth given the curve of access on old content.
My traffic patterns are different from a blog, so it wouldn't.
Always glad to hear your perspective on this. I am working on my own archival system for href.cool - I've already lost thewoodcutter.com and humdrum.life. (Been going for one year.)
My approach right now is to verify all of the links weekly by comparing last week's title tag on each page to this week's title tag. I've had to tweak this a bit - for PDF links or for pages with dynamically generated titles, I can opt to use the 'Content-Length' header or a meta tag instead. (Of course, the old title can be spoofed, so I'm going to improve my algorithm over time.)
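A minimal sketch of that kind of check, assuming Python with the requests library (the fingerprint logic here is illustrative, not href.cool's actual code):

    import re
    import requests

    def page_fingerprint(url):
        # Return a value for this URL that should be stable week to week.
        resp = requests.get(url, timeout=30)
        if "html" in resp.headers.get("Content-Type", ""):
            m = re.search(r"<title[^>]*>(.*?)</title>", resp.text,
                          re.IGNORECASE | re.DOTALL)
            if m:
                return m.group(1).strip()
        # Fallback for PDFs and dynamically generated titles: the
        # Content-Length header is a rough proxy for "unchanged".
        return resp.headers.get("Content-Length", "")

    # rotted = page_fingerprint(url) != last_week_fingerprint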
I wish I had a way of seeding sites with you. I imagine we have some crossover in interests, and I would also love to contribute bandwidth - as would some of your other readers, I'm sure.
It turns out I also have ~20 years of blog content. There are gaps in the middle from various incidents of data loss (some of the lost posts were probably more useful than others), but beyond that I've tried to maintain my internal links as much as possible. At this point I have redirects in place to support link structures going back across two blog migrations (from a Drupal install to a custom Django blog to a much less custom Jekyll site). That's about where I feel my responsibility for link rot ends.

I've done the best I can so that if for some reason you find a /node/somerandomnumber link somewhere, you still get redirected to whatever blog post it leads to, even though I haven't used Drupal in over a decade. I've not been perfect in maintaining such redirects (there were some complex CMS sections I once had in Drupal that I never bothered to migrate or that became entirely new things, plus the aforementioned data losses from before that Drupal blog and from whatever I'd managed to migrate into it from the other Drupal blog and the crazier custom blogs that preceded it, some from back before 'blog' was even a word), but I've tried my best.

The onus the web places on site admins is that we all collectively try our best. It's not my responsibility to worry about link rot outward from my blog posts. I'll still lament it, as sometimes there are great losses, but I'm not the one who broke that link contract.
When I get a spammer asking me to fix a broken link I sometimes cry about whatever the web lost that they are bringing to my attention, but generally my only response is to suggest to the spammer that they file a Pull Request to my blog so I can more properly code review the suggested change. (Unsurprisingly, no spammer has actually bothered to take me up on that offer. It's unlikely I'd actually merge such a change in, but I'd love to see one try.)
It may not be your 'responsibility', whatever that means, and it would be nice if there was less link rot, but the fact remains: you provided those links because you thought they were relevant & useful for the reader; many of them are going to break; are you going to fix them, or not?
You have decided you do not care enough about your writings or your readers to invest the effort to fix them. That is your decision, and I don't know enough about you, your writings, or your readers to criticize it.
I was not criticizing your decision, simply trying to offer a differing viewpoint. I briefly considered doing something similar to the path you are traveling down, but realized that I was happier taking a different path.
I provided those links because I thought they were relevant and useful to the reader at the time. I can't fight time and I can't fight entropy for all of the web, or even just my own tiny corner. I salute you for trying. I set my border at the edge of the domain names I control, because I know I have responsibilities there and I can, in good conscience, end them there; otherwise I'd feel so much guilt for how the web has shifted and changed in 20+ years of posting webpages to it.
My blog captures moments in time, and just as I don't go back and fix rotten opinions that haven't aged well in some of them, I generally don't go back and fix broken links. I would hope that anyone exploring my past archives would give past me the benefit of the doubt and contemplate such archives from the context in which they were written and the very different person that wrote them and sometimes the very different web that they were posted to.
I've been meaning to set up some kind of basic crawl & archive system forever. Ideally I'd like to output something replayable and analyzable like a WARC or HAR, but also spit out a PDF, which I think chrome headless should be able to do. Right now I just print to file if I want to "save" something. But your write-up is very thorough; that is basically my situation.
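(For the PDF part, headless Chrome can indeed do this from the command line - something along these lines works with recent Chrome/Chromium builds:)

    chromium --headless --disable-gpu --print-to-pdf=page.pdf https://example.com/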
There's so much good, unique content on YouTube too, which is almost impossible to properly archive due to size; my subscriptions alone would probably be over 1TB.
I'd like to write a basic tool that would take a PDF, e.g. of a book, and output a directory of PDF snapshots of all the web links in the book - basically a full reference snapshot that could be stored alongside the book. Not sure if it would work well or result in a reasonable size.
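A rough sketch of the idea, assuming pdfminer.six for text extraction and headless Chromium for the snapshots (both are assumptions, and regexing extracted text will miss URLs that wrap across lines in the PDF - reading the PDF's link annotations would be more robust):

    import re
    import subprocess
    from pathlib import Path

    from pdfminer.high_level import extract_text

    def snapshot_references(book_pdf, out_dir="refs"):
        # Pull all visible URLs out of the book's text.
        text = extract_text(book_pdf)
        urls = sorted(set(re.findall(r"https?://[^\s<>\"')\]]+", text)))
        out = Path(out_dir)
        out.mkdir(exist_ok=True)
        # Print each link to its own PDF next to the book.
        for i, url in enumerate(urls):
            target = out / f"{i:04d}.pdf"
            subprocess.run(["chromium", "--headless", "--disable-gpu",
                            f"--print-to-pdf={target}", url], check=False)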
> Ideally I'd like to output something replayable and analyzable like a WARC or HAR, but also spit out a PDF, which I think chrome headless should be able to do.
ArchiveBox does WARCs and PDFs, and does embedded media; it's easy to use - you can point it at a newline-delimited textfile of URLs and it'll process them.
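If memory serves, the invocation is along these lines (check the ArchiveBox docs for your version):

    archivebox init
    archivebox add < urls.txt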
I'm not sure how it handles YouTube - whether it shells out to something like youtube-dl or not... But really, 1TB is not all that much. You can get 8TB internal HDDs for like $200 now.
> My PageRank is still high enough to get requests from people trying to leach off from it.
One of his guesses is that it's because he doesn't try to game the PageRank system. I hope he's right.
I've argued with other people that the best way to score well on Google is to create the best website you can and forget about SEO games. The specific example we were talking about is recipe websites. They all seem to have pointless essays about the first time the author tried ossobuco when they were an exchange student or something else irrelevant. The theory is that the essay is for Google and not for the poor schmuck trying to make dinner.
The essays at the top of recipes are about creating "romance", and are often intended for people as much as (or more than) for SEO. They are meant to create empathy from the recipe reader toward the writer and establish certain bona fides, most of which are inconsequential to the recipe itself but help separate one recipe website and/or one recipe author/collector/distributor from another.

It's actually something that goes way back in recipe books: for centuries, classic, well-respected recipe books have put work into an introduction chapter describing the author, their passions, and how they approach recipes. It helps sell recipe books and helps aspiring and/or home chefs see things like "oh, this author is just like me" or "if this person can make this recipe, so can I". It's just that on the internet, when you are looking for one specific recipe, that "introduction chapter" has to move into every single recipe, because people aren't going to pick up your website or blog like a book; they are going to go straight to a single page.

Though even that isn't entirely unique in the world of cookbooks: there are also old traditions of narrative recipe books that treat the cookbook as a diary of sorts, laying out the author's discovery of and interest in every single recipe. Sometimes people want a story to go with your dry lists of ingredients and otherwise boring step-by-step directions.
It emphasizes the art in cooking - that it isn't just boring "food chemistry", but a way for people to connect to each other. Even if it didn't help with SEO, there are probably lots of recipe bloggers who would do it anyway, because it shows passion and love for the art, and gives the reader narrative hooks to explore who they are, their creativity, and/or their other interests outside the kitchen.
What works in one medium isn't necessarily the best choice for a different medium. Books and web pages have different constraints and use cases. Everything you've described sounds like the experience I want from a great cookbook. It's something I will browse and spend time with. On the web I just want information. I got there with a specific search and I'm not there to learn about the author.
The essays probably wouldn't bother me if they put the recipe first. If you have a great story about why sage is your favorite herb, put it after the recipe. I don't care, but I suppose somebody might.
I agree with you. I wish I didn't. I enjoy the romantic side that WorldMaker points to, but the fact is that when I'm looking up a recipe on the web I'm just trying to make it. All I want discussed is whether it's authentic or (say) adapted for American tastes, possible substitutions, etc.
I was thinking that for a while too, but I've started to swing back towards caring about how a post frames the recipe. There are infinite variations of the same basic recipes, so how do you choose between any given two of them? By the people and the art around them: where the recipe came from, what the author was trying to capture, thought processes that might lead me along to what I'm hoping to capture in my own cooking, etc. There is no objective evaluation of what makes a "good recipe"; it is all subjective.
The UX balance of that is obviously tough, and probably more such posts should have a "skip to the recipe" button. Sometimes it seems like a technical-expertise problem: a lot of these recipe blogs run on basic (or worse) off-the-shelf CMSes and some of the ugliest blog platforms humanity has built, and there isn't enough flexibility or training to do something that simple in those platforms.
Also, some of them legitimately have audiences that read every post as a blog, and the day-to-day audience isn't intended to be "I'm in a hurry, I just need this one recipe today", but "follow my stories and I include recipes to go along with them". It's sometimes tough to make both audiences happy, and that day-to-day audience is more regular and likely the better one to focus on for the most sustainable revenue/readership.
Really interesting read. With the current trend of revitalizing personal sites/blogs, I hope to see more of them stick around for even 10+ years. Everyone should have their own personal piece of the web.
I admit I'm mostly amused/fascinated by the list at the end of the feed types the blog supports, ranked by popularity. Gopher being more popular than Atom is a quirk specific to this blog - it's not like many other sites these days support it, right? - but it's the quite new JSONFeed being at the top that most surprised me. Possibly also a quirk specific to this blog, but I'm not as sure of that as I am about Gopher.
My impression is that because the same feed-discovery process finds traditional RSS, Atom, and JSONFeed alike, it's sometimes difficult to know which one your reader application favors and/or is using. I'd heard offhand that many of the big web-based readers had started favoring JSONFeed, but this is the first evidence I've seen of it. It's also why I assume Atom is so low on the list: those same readers would likely prefer Atom to RSS but have switched to JSONFeed by now, while RSS has the long tail of longevity and legacy reading software.
It could also just be that, since it is apparently request count (rather than unique origins or the like), JSONFeed tools simply request the feed more often than the others.
I must confess I have probably spent more time trying out blogging software than actually blogging. Reading that the OP keeps his entries as HTML has given me food for thought. I am using Hugo, but not being a front-end developer, I battled so much to keep my theme simple (most probably due to my lack of knowledge, time, and application) that I find myself hesitant to blog again. Moral of the story: use what you understand.
Love it. I've been blogging for 16 years myself. Find it a wonderful way to share knowledge and see myself changing over time. I've tried keeping a diary but it never works for me.
I've been using the same blogging infrastructure for fifteen years. I write HTML, then run an offline script to create a static website (RSS feed, tag indexes, extracts for the home page). I rewrote most of it once, switching from parsing the HTML with regexps to using an actual HTML parser, and taking the time to clean up some especially ugly parts.
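The parser-based version of the extract step looks roughly like this (a sketch using BeautifulSoup, not the actual script):

    from bs4 import BeautifulSoup

    def post_metadata(path):
        # Pull the title and a home-page extract from one post's HTML.
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        first_p = soup.find("p")
        extract = first_p.get_text(strip=True) if first_p else ""
        return {"title": title, "extract": extract}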
I've been using the same WordPress installation for seven years now. I thought it was longer, but I guess I was mentally counting the years before that, when I just had a modified version of a gallery script (Singapore) showing my art and no actual text blog. I'll probably be using it for years to come.
Does what I need it to, works for the amount of traffic I get, no need to change.
At one point I downloaded my site's Wayback Machine archive, curious what I could recover. It didn't feel like enough: particularly in the periods where I lost the most blog posts, my front page mostly had only synopses or excerpts, and the full posts were on follow-up pages that the Wayback Machine didn't archive.
Reading the article, it says he hasn't been using the same software, because he's rewritten all of it since then, much of it multiple times. The only thing that's the same is the content and the file format.
"You have an ax. You replace the handle. Later, you replace the head. When did it stop being the same ax?"