This Page is Designed to Last

shasheene · on Dec 19, 2019

There's no reason why a web browser bookmark action doesn't automatically create a WARC (web archive) format.

Heck, with the cost of storage so low, recording every webpage you ever visit in searchable format is also very realistic. Imagine having the last 30 years of web browsing history saved on your local machine. This would especially be useful when in research mode and deep diving a topic.

[1] https://github.com/machawk1/warcreate

[2] https://github.com/machawk1/wail

[3] https://github.com/internetarchive/warcprox

EDIT: I forgot to mention https://github.com/webrecorder/webrecorder (the best general purpose web recorder application I have used during my previous research into archiving personal web usage)

tempestn · on Dec 20, 2019

This was what made me convert from bookmarking to clipping pages into Evernote around 6-7 years ago. I realized I had this huge archive of reference bookmarks that were almost useless because 1) I could rarely find what I was looking for, if I even remembered I'd bookmarked something in the first place, and 2) if I did, it was likely gone anyway. With Evernote I can full text search anything I've clipped in the past (and also add additional notes or keywords to ease in finding or add reference info).

Since starting with replacing bookmarks, I've moved other forms of reference info in there, and now have a whole GTD setup there as well, which is extremely handy since I can search in one place for reference info and personal tasks (past and future). Only downside is I'm dependent on Evernote, but hopefully it manages to stick around in some form for a good while, and if it ever doesn't, I expect I'll be able to migrate to something similar.

charlesdaniels · on Dec 20, 2019

Shout out to https://joplinapp.org/

I was an Evernote user when I was on macOS. When I switched to Linux, a proper web clipper was something I really missed. I'm now on Joplin and it does everything I used to use Evernote for and then some.

It even has vim bindings now!

As far as longevity goes, I think they got their archive / backup format right - it's just a tarball with markdown in it.

dragonsh · on Dec 20, 2019

No need of proprietary code and apps why not build it into browsers. I have seen Firefox and Chrome can download web pages. So it will be nicer if they can download the bookmarked pages and store in a local html, css, image folder. I think it's pretty easy to achieve.

Also people need to move away from those esoteric reactjs, angular, vuejs and plethora of CMS as API or static site generators relying on some js framework which won't last even 2-3 years. Use a static site generator which can generate a plain html, like static site generators built on pandoc, python docutils or similar.

Personally I like restructuredText as the preferred format for content as its a complete specification and plain text. So the only thing in this article I will change is that content can also be in rst format and then generate html from it. Markdown is not a specification as each site implements their own markdown directives unlike restructuredtext specification and most of the parsers and tooling are little different from each other.

anyzen · on Dec 20, 2019

> Personally I like restructuredText as the preferred format for content as its a complete specification and plain text.

I have used rst intensively on a project. A few years later, I would be hard pressed to write anything in ti and would need to start with a Quick Start tutorial. With all its faults, Markdown is simple enough that it can be (and is) used anywhere, so there is no danger of me forgetting its syntax (even if it wasn't much simpler to star with).

So personally I would prefer md over rst anytime.

dbtx · on Dec 20, 2019

> Markdown is not a specification

Not by that name... https://commonmark.org/

dragonsh · on Dec 20, 2019

It is stil not a specification like restructuredText[1]. Also wikiMarkup (which really started this markdown) is different from GitHub markdown, which is different from other markdown editors. Also many sites use their own markdown versions.

If you are in restructuredText world there is one specification and all implementation adhere to it, be it pandoc, sphinx, pelican, nikola. The beauty of it is that it has extension mechanisms which provides enough room for each tool to develop it. But markup can be parsed by any tool.

[1] https://docutils.sourceforge.io/docs/ref/rst/restructuredtex...

m463 · on Dec 21, 2019

I don't know why markdown is so popular other than maybe "it was easy to get running" or "works for me".

It's better than "designed by a committee" standards, but it lacks elegance or maybe craftsmanship.

setr · on Dec 21, 2019

Because its inherently appealing, close to what you wanted intuitively, and if you're only dealing with a single implementation of it, works fairly well.

You don't really get bit by its lack of a standard and extensibility until after you've bought in.

It's essentially designed by the opposite of a committee -- rather than including everything but the kitchen-sink, it contains support for almost no usecases except the one. Which is very appealing, when you only have the one usecase.

dragonsh · on Dec 21, 2019

Well rst is better than markdown from day one. The only reason it became famous is thanks to wikimarkup.

So markdown needs to thank the popularity of Wikipedia for its success, as rst did not have any application like Wikipedia. But still rst is used widely enough with its killer Sphinx, readthedocs and now its kind of de-facto documentation writing markup in Python and many open source software world.

petepete · on Dec 21, 2019

Because you can teach someone markdown in five minutes. And even if they don't know all the ins and outs, the basics are pretty foolproof (paragraphs, headings, bold and italic).

eitland · on Dec 20, 2019

> No need of proprietary code and apps

Joplin is free and open source.

pinehqcom · on Dec 20, 2019

This... I just found a plugin for the static site generator Pelican that is 7 years old that still works. After running Pelican you get plain HTML that can be hosted anywhere. I like Netlify, but other options like GitHub pages are also great. The author recommends not putting on GitHub Pages because they haven't found a working business model and might not be here in the future. But... GitHub has been taken over by Microsoft which is most likely not going bankrupt soon and Microsoft loves their backward compatibility so I am confident they won't screw GitHub up too much.

dragonsh · on Dec 20, 2019

You can say the same about geocities when it was acquired by yahoo. But it didn't last and then now it's happening with yahoo-groups. So I am not hopeful if GitHub becomes a liability microsoft will keep it.

freetonik · on Dec 22, 2019

Yahoo! And Microsoft have very different business models. One is intently more sustainable (selling software and services).

charlesdaniels · on Dec 20, 2019

> No need of proprietary code and apps

Joplin is open source, which is a big part of the sell to me. It definitely isn't the best of all possible note taking systems that could ever exist, but it's the best open source one I've found so far, and I don't have time to write a better one at the moment.

> why not build it into browsers. I have seen Firefox and Chrome can download web pages. So it will be nicer if they can download the bookmarked pages and store in a local html, css, image folder. I think it's pretty easy to achieve

This is solving a different problem though. WARC/MHT and other solutions can do this. Joplin is more of a note taking system that allows ingesting content from the web into one's own local notebook, which is relevant to what the GP post was talking about - Evernote.

However, it would seem that "the modern web" is the now popular standard. 10 years ago it might have been Flash or Java web applets or whatever. Now it's JS. I'm not convinced that JS is any better than what it has replaced. However, people keep paying developers to write them, so presumably someone likes them.

> Also people need to move away from those esoteric reactjs, angular, vuejs and plethora of CMS as API or static site generators relying on some js framework which won't last even 2-3 years. Use a static site generator which can generate a plain html, like static site generators built on pandoc, python docutils or similar.

Agreed, but that's also not a problem that Joplin, Evernote, or any other such tool is going to be able to solve. Unless you are complaining that Joplin is an Electron app? That's my biggest issue with it personally. It runs well enough, but is definitely the heaviest application I use regularly, which is a little sad for a note taking program. On the other hand, I haven't found a better open source replacement for _Evernote_. There are lots of other open source note-taking programs though.

> Personally I like restructuredText as the preferred format for content as its a complete specification and plain text. So the only thing in this article I will change is that content can also be in rst format and then generate html from it. Markdown is not a specification as each site implements their own markdown directives unlike restructuredtext specification and most of the parsers and tooling are little different from each other.

reST is indeed very nice. At one point, I kept my personal notes as a Sphinx wiki with everything stored in reST. I found this to be less ergonomic than Evernote/Joplin, although in principle it could do all the same things that Joplin can do, and then some.

techslave · on Dec 21, 2019

> No need of proprietary code and apps why not build it into browsers.

Safari does this. pages added to the reading list archive the content for offline reading.

celeritascelery · on Dec 20, 2019

Joplin is open source.

q-base · on Dec 20, 2019

Thanks a lot for the recommendation! I have been a little annoyed with Evernote not having an app for Ubuntu, which I recently started using quite heavily. So this looks very interesting!

m-p-3 · on Dec 20, 2019

The developer behind it is doing some awesome stuff so I decided to sponsor him on GitHub.

erikbye · on Dec 20, 2019

> Only downside is I'm dependent on Evernote, but hopefully it manages to stick around in some form for a good while, and if it ever doesn't, I expect I'll be able to migrate to something similar.

I have used Evernote and OneNote, but have finally, after a long interim period, resorted to using only markdown.

I have a "Notes" root folder and organize section groups and sections in subfolders. VSCode (or Emacs), with some tweaks, shortcuts, and extensions, provides a good-enough markdown editing experience. Like an extension that allows you to paste in images, storing it in a resources folder in the note's current location (yes, I see small problems with this down the road when re-organizing, but nothing that can't be handled).

For Firefox, I use the markdown-clipper extension the few times I would like to copy a whole article, it works well enough. Or I copy/paste what I need; mostly, I take my own summarized notes.

For syncing, I use Nextcloud, which also makes the notes available both for reading and editing on Android and iOS (I use both).

Up until very recently, I used Joplin, which also uses markdown, but there were two things I could not live with: it does not store the markdown files with a readable filename, e.g., its title, and being tied to a specific editor.

If you are mostly clipping and not writing your own notes, I can imagine my setup won't work well, or be very efficient.

I want to use a format that has longevity, and storing in a format that I cannot grep is out of the question.

sharps1 · on Dec 20, 2019

Thanks for pointing out markdown-clipper!

https://addons.mozilla.org/en-US/firefox/addon/markdown-clip...

unicornporn · on Dec 20, 2019

https://archivebox.io/

You bookmark in Pinboard or Delicious and ArchiveBox saves the page. Handy.

Abishek_Muthian · on Dec 20, 2019

>if I even remembered I'd bookmarked something in the first place

I had recently participated in a discussion on the problem of forgetting bookmarks[1].

Copying my workflow from there,

1. If the entire content should be easily viewed, then store via pocket extension.

2. If a partial content should be easily viewed i.e. some snippet with link to entire source, then store in notes (Apple).

3. If the content seem useful in the future, but it is okay to forget it; then I store it in the browser bookmarks.

But, my workflow doesn't address the problem raised by Mr. Jeff Huang; if Pocket app or notes disappear so goes my archives. I think self hosted archive as mentioned by the parent is the way to go, but I don't think it's a seamless solution to a common web browser user.

[1]https://needgap.com/problems/57-i-forget-my-web-bookmarks-qu...

iudqnolq · on Dec 20, 2019

My solution for a small subset of the forgetting problem:

I frequently see something and want to try it out the next time I want to do something else. So I emulate User Agent strings and append lots of "like [common thing I search for a lot]" to the bookmark. When I start typing into the search bar for those other things I'll be reminded of the bookmark.

For example, since file.io is semi-deprecated I decided to try out 0x0.st . But I kept forgetting when I actually needed to transfer a file, so I made a bookmark titled "0x0.st Like file io".

As a side note, I have a similar bash function called mean2use that I use to define aliases that wrap a command and ask me if I'd like to do it another way instead or if I'm sure I want to use the command. I've found this is a nice way to retrain my habits.

Abishek_Muthian · on Dec 21, 2019

That was useful, can you add this to the original needgap thread I linked?

Disclaimer: needgap is a problem validation platform I built.

pqs · on Dec 20, 2019

I'm glad you mention Evernote. I also use it for this, and also for many other purposes.

It is true that it is propietary software but it is worth mentioning that all the content can be exported as an .enex file, which is xml.

So, the data can be easily exported.

jval43 · on Dec 20, 2019

>easily exported

Have you actually looked at such an xml: https://gist.github.com/evernotegists/6116886

Exported sure, it's all there. But importing that into your new favorite notes application is not going to be trivial, especially not for regular users.

That's why I've decided to stick mostly to regular files in a filesystem.

WaltPurvis · on Dec 20, 2019

Presumably "regular users" will not be individually writing XML parsing code to convert the notes. The developers of their "new favorite notes application" will do it (and if they can't be bothered, maybe it shouldn't be your "new favorite notes application").

Joplin, for example, can import notes exported from Evernote. It's just a menu option that even regular users should have no trouble employing.

WalterBright · on Dec 20, 2019

I store bookmarks (i.e. URLs) in a simple .txt file. My text editor lets me click on them to bring them up in a browser.

> Only downside is I'm dependent on Evernote

No special software nor database required.

tempestn · on Dec 20, 2019

What benefit does that provide compared to regular browser bookmarks? It doesn't seem to address either of the issues I mentioned.

WalterBright · on Dec 20, 2019

1. it's independent of the browser

2. it works with any browser

3. I can move it to any machine and use it

4. It is not transmitted to the browser vendor

5. Being a text file, it is under my control

6. I can back it up

7. I don't need some database manager to access it

8. I can add notes and anything else to the file

9. It's stupid simple

duckmysick · on Dec 20, 2019

What happens when the website itself becomes unavailable?

larrywright · on Dec 20, 2019

For me, this is the problem that Evernote solves - it saves the entire content of the page, images, text, and clickable links.

WalterBright · on Dec 20, 2019

The link stops working.

petra · on Dec 20, 2019

I find Evernote's search isn't that good, at least in the free version. Often trying to remember keywords and using Google is faster.

I know about DevonThink, i read good recommendations. But it's IOS/Mac only.

Any Evernote alternative for Win/Android with great search ?

strig · on Dec 20, 2019

Another commenter suggested https://joplinapp.org/, it has a nice search feature and has apps for most platforms.

_57jb · on Dec 20, 2019

You just opened up a world for me I hadn't thought about!

So simple! Thank you!

tempestn · on Dec 20, 2019

You're welcome! If you're interested in getting into GTD in Evernote (which I highly recommend), I wrote a blog post a while back about my setup: https://www.tempestblog.com/2017/08/16/how-i-stay-organized-...

BlackEspresso · on Dec 21, 2019

Nice article but http://www.thesecretweapon.org/ isnt reachable anymore. The Page didnt Last...

tempestn · on Dec 21, 2019

Oh the irony. I'll update my link to an archive post or reproduce the important parts. Thanks for pointing that out!

Edit: doesn't appear to be down, they're just using a self-signed cert.

BlackEspresso · on Dec 23, 2019

hmm, your right. i tried "continue with insecure certificate" yesterday but maxbe i was to impatient

tuxracer · on Dec 20, 2019

Is there a good end-to-end encrypted alternative to Evernote?

fnord77 · on Dec 21, 2019

first I heard of web clipping. Looks like OneNote has web-clipper extensions, too. This is so great.

romwell · on Dec 19, 2019

>There's no reason why a web browser bookmark action doesn't automatically create a WARC (web archive) format.

Indeed.

And I still remember the modem days where I would download entire websites because the ISP charged by the hour, and I'd read them offline to save money.

JoeSmithson · on Dec 20, 2019

I can't put my finger on it but this has a sort of Dickensian quality to me.

I think this says something kind of profound about information and capitalism and whatnot.

thomasz · on Dec 20, 2019

No, it hasn't. Technology just wasn't there back then which caused significant cost per time unit, which makes it only fair to charge per time unit.

majewsky · on Dec 20, 2019

Yeah. In the dialup days, layer 1 and 2 of a home internet connection was a long-running phone call between your own modem and a modem of an ISP. You payed via your phone bill, for the duration of the call.

alentist · on Dec 20, 2019

It says scarce resources are pricier. Welcome to the real world!

jakeogh · on Dec 20, 2019

Personal wayback machins should be standard computing kit. I have had one since around 2013. Very bare bones demo: https://bpaste.net/show/3FBH6 it does much more than that. file:// is supported for example, so you can recursively import a folder tree, and re-export it later if you wanted to.

Or in some random script: "from iridb import iri_index" "data = iri_index.get('https://some/url')" I'm skipping lots, you can ref by hash, url, url+timestamp. It hands back a fh, you dont know if the data you are reffing even fits in memory. Extensive caching, all the iri/url quirks, punycode, PSL etc.

Some random pdf in ~/Downloads, "import doc.pdf" and dmenu pops up, you type a tag, hit enter and the pdf disappears into the hash tree, tagged, and you never need to remember where you put it. Later on you only need to remember part of the tag, and a tag is just a sequence of unicode words.

Chunks are on my github (jakeogh/uhashfs, it's heka-outdated dont use it yet), I'll be posting the full thing sometime soonish.

bloopernova · on Dec 20, 2019

I actually this week asked the author of SingleFile if he could implement a save-on-bookmark feature for SingleFile, and he was amenable:

https://github.com/gildas-lormeau/SingleFile

https://github.com/gildas-lormeau/SingleFile/issues/320

j88439h84 · on Dec 20, 2019

Nice. FYI there's also SingleFileZ

> SingleFileZ is a fork of SingleFile that allows you to save a webpage as a self-extracting HTML file. This HTML file is also a valid ZIP file which contains the resources (images, fonts, stylesheets and frames) of the saved page.

https://github.com/gildas-lormeau/SingleFileZ

gildas · on Dec 20, 2019

I'll implement the "save bookmark page" feature in both extensions :)

johnpowell · on Dec 20, 2019

Whoa. I just installed SingleFileZ for FF and it is working great. Before I was using wget and that was clunky. This is working great since I can just toss a single file up on my server and we are good to go. Thanks for this!

bloopernova · on Dec 20, 2019

oh, hello! It's funny how this subject has popped up again.

gildas · on Dec 20, 2019

Hi! I think it confirms that there's a real interest in this feature.

SirYandi · on Dec 21, 2019

Off topic, but could I ask how you knew your software was being talked about? Did you just happen by or have you some monitoring agent looking for mentions? Just curious

gildas · on Dec 30, 2019

Sorry, I didn't see your question. I check the posts on HN very regularly. The title of the post made me think that people might have been talking about SingleFile. Sometimes, friends of mine tell me someone on the Internets is talking about SingleFile :). I also sometimes use the integrated search engine.

_fbpt · on Dec 20, 2019

WorldBrain's Memex (https://addons.mozilla.org/en-US/firefox/addon/worldbrain/) has an option to perform a full-text index (not archive) of bookmarks, or pages you visited for 5 seconds (default) down to 1 second (no option to index all pages). It stores this stuff into a giant Local Storage (etc) database, which Firefox implements as a sqlite file.

steveklabnik · on Dec 20, 2019

https://www.gwern.net/Archiving-URLs describes extracting brower history to create an archive via a batch job.

cosarara · on Dec 20, 2019

Firefox actually purges history automatically. For instance, the oldest history I have on this browser right now is from January 2018. I found about this the hard way.

nayuki · on Dec 20, 2019

I noticed this behavior in Firefox too. So I started writing personal Python scripts to scrape FF's SQLite database where it stores all the browsing history information.

kccqzy · on Dec 20, 2019

Safari does the same, even though I tell it to never clear browsing history.

Avamander · on Dec 20, 2019

I think Chrome(ium) does as well. Very annoying tbh.

heavyset_go · on Dec 20, 2019

Chrome was the first browser I encountered that deletes history without being instructed to.

Thorrez · on Dec 20, 2019

It looks like Firefox has been doing it since 2010[1]. I wonder how long Chrome has been doing it, since launch, 2008? Here's a Chrome bug discussing it[2].

[1] https://web.archive.org/web/20151229082536/http://blog.bonar...

[2] https://bugs.chromium.org/p/chromium/issues/detail?id=500239

6510 · on Dec 20, 2019

Mosaic had full text history search.

pabs3 · on Dec 21, 2019

You can increase the retention period to centuries via about:config.

dbtx · on Dec 20, 2019

I have this problem. Some bits of history are gone except from old backups of profile directories and profiles where I've already set places.history.expiration.max_pages to some absurdly high number.

I need to do a handful of experiments to see exactly how this interacts with Sync, even though I've (foolishly) already synced the important profiles. I'd hope that the cloud copies of the places database just keeps growing, but in any case, I'd rather combine them all offline anyway.

gwern · on Dec 20, 2019

Even if you set the setting, how can you be sure that it won't be reset on an upgrade or that you'll remember to set it if you need a new profile (perhaps your old one becomes buggy, crufty, corrupt, or all three)? I thought I had all my history retained until one day I couldn't find a website I knew I had visited years ago, and took a closer look at my history and was very unpleasantly surprised... What happened? I'll never know, but my suspicion is that Firefox reset the history retention setting at some point along the way. If you do any web dev, you know Firefox occasionally backstabs you and changes on updates. The only way to be sure over a decade-plus is to regularly export to a safe text file where the Mozilla devs can't mess with it. I can't undo my history loss, but I do know I have lost little history since.

dbtx · on Dec 20, 2019

I can't be sure. When I say 'combine them all offline', I mean using something like [1] which refuses to do anything for me because the Waterfox database version is a rather old Firefox version, and that seems to expect all the db's versions to be up-to-date and equal, which seems pointless. #include <sqlite3.h> was my next step-- only I don't walk very well, so that didn't happen "yet". Or I'm lazy, or distracted, or depressed, or something. When I recently got tired of realizing a thing was on the other machine, I bit the bullet and synced them, if only to see how well that worked.

Anyway, thanks for the guide.

[1] https://github.com/crazy-max/firefox-history-merger

BiteCode_dev · on Dec 20, 2019

I like the idea, but wanted to know how realistic it would be so I made a quick and dirty Python script to download all my bookmarks. If you want to make the same experiment, you can get it from here: https://gist.github.com/ksamuel/fb3af1345626cb66c474f38d8f03...

It requires Python 3.8 (just the stdlib) and wget.

I have 3633 bookmarks, for a total of 1.5 Go unziped, 1.0 Go zipped (and we know we can get more from better algo and using links to files with the same checksum like for JS and css deps).

This seems acceptable IMO, espacially since I used to consider myself a heavy bookmarker and I was stunned by how few I actually had and how little disk they occupied. Here are the types of the files:

   31396 text/plain
   3034 application/octet-stream
   1316 text/x-c++
   1123 text/x-po
    865 text/x-python
    384 text/html
    227 application/gzip
    218 inode/x-empty
    178 text/x-pascal
    113 image/png
     44 application/zlib
     29 text/x-c
     28 text/x-shellscript
     14 application/xml
     13 application/x-dosexec
     12 text/troff
      5 text/x-makefile
      4 text/x-asm
      3 application/zip
      2 image/jpeg
      2 image/gif
      2 application/x-elc
      1 text/x-ruby
      1 text/x-diff
      1 text/rtf
      1 image/x-xcf
      1 image/x-icon
      1 image/svg+xml
      1 application/x-shockwave-flash
      1 application/x-mach-binary
      1 application/x-executable
      1 application/x-dbf
      1 application/pdf

It should probably be opt-in though, like a double click on the "save as bookmark icon" to download the entire file, and the star becomes a different color. Mobile phones, chrome books and raspy may not want to use the spaces, not to mention there are some bookmark content that you don't want your OS to index, and show you preview of in every search.

But it would be fantastic: by doing this experiment I noticed that many bookmarks were 404 now, and I will never get their content back. Beside, searching bookmark, and referencing them is a huge pain.

So definitely something I wish mozilla would consider.

asdasdasdasdwd · on Dec 20, 2019

> definitely something I wish mozilla would consider

There used to be this neat little extension called Read It Later that let you do just that. Bookmark and save it so you could read it when you were offline or the page disappeared. Later they changed their name and much later Mozilla bought it and added it to Firefox without a way to opt out. It was renamed to Pocket.

BiteCode_dev · on Dec 20, 2019

Pocket is not integrated with your bookmarks. For offline consultation, you need a separate app. Of course this app is not available on Linux, where you have to get some community provided tools.

Bookmark integration would mean one software, with the same UI, on every platform, and only one listing for your whole archive system.

ryanfox · on Dec 20, 2019

I’ve been building an application to do this, except for everything on your computer! It’s called APSE[0], short for A Personal Search Engine.

[0] https://apse.io

hk__2 · on Dec 20, 2019

Having to pay $15/month ($180/yr!) to be able to search stuff on my own computer for years seems awfully expensive. I'd rather depend on some simple open-source piece of software that I can understand and maintain if necessary.

tasogare · on Dec 20, 2019

Yeah, the sheer idea of paying a subscription for software that is running on my computer to index local resources is crazy. This kind of software should be should sold as one-time buy license.

abraae · on Dec 20, 2019

Decades ago there was an amazing piece of software from lotus when I worked there called magellan. I remember the first time I saw someone search, and find results in text documents, spreadsheets and many other of the common formats of the day.

That was in 1989 and today I mostly search my computer using find and grep commands, since that's what just keeps working.

nukst · on Dec 20, 2019

I should try adopting find and grep, but on Windows I'm currently using this and I'm very happy with it: https://www.voidtools.com/downloads/

Multicomp · on Dec 21, 2019

I use Void Tools Everything to find files by name, and AstroGrep for finding information in them.

vxNsr · on Dec 20, 2019

Yup, I could see paying $180 one time for something like this. but at $180 a year for a self hosted product... that's just very steep.

crucialfelix · on Dec 20, 2019

Google used to have a native Mac extension like a launch bar. Command space ... Enter search all local files. It was really fast

lloydde · on Dec 20, 2019

macOS Spotlight became good enough

https://en.m.wikipedia.org/wiki/Spotlight_(software)

crucialfelix · on Dec 20, 2019

Well I used LaunchBar and then Quicksilver for many years. Spotlight has never been as nice and hackable as those.

pvg · on Dec 20, 2019

You can ask Safari to do that by enabling 'Reading List: Save articles for offline reading automatically'. It's not WARC but it is an offline archive. The shortcut is cmd-shift-D which is almost the same as the bookmark one. It's also the only way I know of to get Safari to show you bookmarks in reverse chronological order. And it syncs to iOS devices.

This could be done in better and more specialized ways, one problem is browser extension APIs don't provide very good access to the browser's webpage-saving features.

chaostheory · on Dec 20, 2019

This problem has been solved a long time ago if you use Pinboard.

https://pinboard.in/

Just pay the yearly subscription so pinboard can cache your bookmarks.

akho · on Dec 20, 2019

I do not see how using a web thing is a solution to web things going away.

L_Rahman · on Dec 20, 2019

Pinboard happens to be a web service run along the same principles as the article we're discussing.

The bus factor is high, but I suspect that Maciej has a plan that'll let us download our archive even if he does get grabbed by the mainland Chinese government let alone a forecasted going out of business action.

chaostheory · on Dec 20, 2019

Then what do you propose the answer is? The blog post just proposes using “web things” differently

akho · on Dec 25, 2019

For archiving web pages? ArchiveBox is ok.

fphilipe · on Dec 20, 2019

I've contemplated upgrading my Pinboard account many times. Finally bit the bullet.

cambalache · on Dec 20, 2019

Until pinboard goes OOB

chaostheory · on Dec 21, 2019

Pinboard is a profitable online equivalent to a mom and pop shop. It’s sustainable and its founder isn’t chasing growth at all costs. It also has a cult following, so OOB is highly unlikely.

simongr3dal · on Dec 20, 2019

Does pinboard caching work with sites that are behind a login or paywall?

L_Rahman · on Dec 20, 2019

It does not. See FAQ here: https://pinboard.in/upgrade/

larrywright · on Dec 20, 2019

This is the advantage to Evernote. Since it’s a browser extension, it has access to anything you have access to.

majewsky · on Dec 20, 2019

The downside is, since it’s a browser extension, it has access to anything you have access to.

larrywright · on Dec 20, 2019

Agreed that there’s a tradeoff. I don’t think there’s really an alternative solution though.

soulofmischief · on Dec 20, 2019

I use the Zotero extension for this feature.

https://www.zotero.org/

_fb63 · on Dec 20, 2019

came here to say this. zotero also saves metadata as an extra. as long as it is used from within a browser.

arjie · on Dec 19, 2019

In practice how is this different from MHTML? I think most browsers have built-in support for MHTML so it should be possible to build that part easily.

Laforet · on Dec 20, 2019

The state of mhtml support is fairly pathetic at the moment. Firefox broke mhtml compatibility with the quantum overhaul. Chrome's mht support had been a hit and miss over the years, sometimes removing the GUI option entirely and requiring one to manually launch the browser with a special tag to enable it. The only browser with a history of consistent mhtml support happens to be....Internet Explorer, followed by a bunch of even more obscure vendors that nobody really uses.

I am currently dealing with the problem of parsing large mht files (several megabytes and up). A regular web browser would hang and crash upon opening these files and most ready made tools I could find struggle with the number of embedded images. It's very much a neglected format with very little support in 2019.

est31 · on Dec 19, 2019

According to the MHTML entry on Wikipedia, Chrome requires an extension, Firefox doesn't support it, and only Internet Explorer supports MHTML.

scrollaway · on Dec 20, 2019

I mean, it's not any worse than WARC support…

est31 · on Dec 20, 2019

Maybe MAFF's are best as they use compression instead of base64 encoding: https://en.wikipedia.org/wiki/Mozilla_Archive_Format

gildas · on Dec 20, 2019

Or SingleFileZ files which can be viewed without installing any extension https://github.com/gildas-lormeau/SingleFileZ.

Edit: it can also auto-save pages, like SingleFile.

mrspuratic · on Dec 20, 2019

I used the excellent unMHT plugin for Firefox, but it got dropped some time ago, failed to meet "enhanced security requirements" :(

I still keep an old ESR with this plugin for archiving, and accessing MHTs.

reificator · on Dec 20, 2019

This is what started me clipping everything to OneNote instead of bookmarking. Unfortunately, it becomes difficult to maintain, the formatting is off, things subtly break, pages clipped on mobile use different fonts for god knows what reason, some content is discarded silently because the clipper deems it's not part of the main article, I could go on.

It's better than nothing but it's also increasingly frustrating to deal with.

franga2000 · on Dec 20, 2019

I've actually been saving every page I visit for a good two years now and it has barely caused in dent in my NAS storage space. As usual though, I wrote a crappy extension and Python script to do that because I never bothered to look online. Thanks for introducing me to WarcProxy - I'll probably be making the switch very soon.

boring_twenties · on Dec 20, 2019

That would be even more useful if a search warrant is ever executed on my house.

Not useful to me personally, but useful to someone!

pjmlp · on Dec 20, 2019

I always save pages instead of only bookmarking them.

Most websites I used to visit during demoscene high days, are now gone.

btrettel · on Dec 20, 2019

Years ago I used the old Firefox addon Shelve to automatically archive the vast majority of web pages I visited.

http://shelve.sourceforge.net/

The main disadvantage was disk space. This is particularly true when some pages are 10 MB or larger. I would periodically prune the archive directory for this reason.

I stopped using Shelve when I started running out of disk space, and now I can't use Shelve because the addon is no longer supported. The author of Shelve has some suggestions for software with similar functionality:

https://groups.google.com/forum/#!topic/shelve-firefox-addon...

dbtx · on Dec 20, 2019

It installs and loads in Waterfox, but of course it still hasn't been touched in 3.5 years.

I used Scrapbook in the past (which also still works) but I usually just save random things in ~/webpages/ since (apparently) 2011. The earliest is a copy of the landing page at bigthink.com. Of course now almost every link is broken, excluding social media buttons, About Us, Contact Us, RSS, Privacy, Terms of Use, Login, and the header logo pointing at the same page.

sneak · on Dec 20, 2019

It would be nice if we had browsers that were actually user-agents that allow full pluggable customizability for all cookie, header, UI, request, and history behavior. Then, this would just be a plugin that anyone could install.

Polylactic_acid · on Dec 20, 2019

And then people install garbage extensions that break the browser and people think "Wow, this firefox browser is so buggy and slow" and switch to chrome. And then your extensions break with every single browser update because they are tampering with internal code.

Everyone is free to fork a browser and apply any changes they want. Allowing extensions to change anything at all essentially is the same as forking and merging your changes with upstream every update.

sneak · on Dec 20, 2019

Somehow that doesn’t seem to happen with editors; we have a ton of them and they are very customizable.

I guess “with a reasonably stable hook api” was supposed to be implicit in my statement.

fouc · on Dec 19, 2019

How is viewing the WARC after? Is it the same quality as archive.is/ or archive.org/ ?

cookiecaper · on Dec 20, 2019

It's a pain, like most of the WARC ecosystem. It's been several months since I dug in, so maybe it's had some spitshine the last little bit, but I usually end up using combinations of wpull, grab-site, and a smattering of other utilities to reliably capture a page/set of pages, and have had to make some quick hacks as well as manually merging in some PRs to get things to work with Python3. Once I have the WARC, I typically end up using warcat to extract the contents into a local directory and explore that way.

WARC as a format seems promising, but at least last I checked, open-source tooling to make it a pleasant and/or transparent experience is not really there, and worse, at least as of several months ago, doesn't really seem actively worked on. Definitely an area you'd expect to be further along.

edraferi · on Dec 19, 2019

Pretty good, depending on the tooling. I’m having good luck with https://webrecorder.io/ and their related open-source tools.

shakna · on Dec 20, 2019

archive.org use WARC if I remember correctly, and offer guides on creating and reading the format.

kccqzy · on Dec 20, 2019

Safari used to (and still) do this automatically but in a limited way. In the browsing history view (Command Y), you can search visited pages by its content, and this is extremely useful. But there's no way(†) to tell Safari to display that saved content. If you revisit a URL in the history, Safari fetches it again, losing the original saved content.

(†): short of direct plist manipulation

saagarjha · on Dec 20, 2019

Does it actually? I thought history was stored in a SQLite database and only kept the URL and page title.

railmeat · on Dec 31, 2019

This comment misses the point.

The point is to make a webpage that lasts. So people can link to it and get the page. That means making a maintainable webpage and a url that does not change.

It is great that you can archive every page you visit for yourself, but that is not the same as making a lasting web.

Lets make something that others can use too.

3xblah · on Dec 20, 2019

Better than a bookmark action would be a commandline option, similar to Firefox's -screenshot, which will work without starting X11. Something like -archive:warc

tripzilch · on Dec 20, 2019

Does this also strip the megabytes of superfluous tracking JS? It's probably what'll be the bulk of the size on the modern web, and I don't feel any particular need to store it.

(I believe that for historical purposes, enough complaining about ads and tracking will survive that future historians can easily deduce the existence of this practice)

crazypython · on Dec 20, 2019

I use DEVONThink on my mac, which has web archiving, full-text search, and auto-categorization.

3xblah · on Dec 20, 2019

"Imagine having the last 30 years of web browsing history saved on your local machine."

I believe the name for that experience was/is "Microsoft Windows".

temporaryvector · on Dec 20, 2019

>Heck, with the cost of storage so low, recording every webpage you ever visit in searchable format is also very realistic.

I tend to do that, I also save a lot of scientific papers, ebooks and personal notes. I've found that doing so does not help me at all. The main problem I have is that when I need to look something up (an article, a book, a bit of info) I reach for google first, usually end up finding the answer and go to save it, only to find that I had already found the answer beforehand (and perhaps already made clarifying notes to go along with it) and then forgot about it.

This, and not dead links, is the fundamental problem with bookmarks for me. Not only bookmarks, it extends to my physical notes and pretty much everything I do. If I haven't actively worked on something for a couple of months, I forget all about it and when I come back to it I usually have to start from scratch until I (hopefully) refresh my memory. Some of it is also usually outdated information.

I think this is a big, unsolved problem and I'm not even sure how to go about starting to solve it. I can envision some form of AI-powered research assistant, but only in abstract terms. I can't envision how it would actually work to make my life better or easier. It would need to be something that would help blur the line between things I know and things that are on my computer somehow. If I think of my brain like it has RAM and cache, things I'm working on right now are in the cache and things I've worked on recently or work on a lot are in RAM, but what's for me lacking is a way to easily move knowledge from my brain-RAM to long term storage and then move that knowledge back into working memory faster than I can do so now. I'm not even talking about brain uploading or mind-machine interfaces, but just something that can remind me of things I already know but forgot about faster than I can do so by myself.

I am convinced that figuring out how to do this will lead to the next leap in technological development speed and efficiency. Not quite the singularity that transhumanists like to talk about, but a substantial advancement.

hikarudo · on Dec 20, 2019

I have exactly the same problem.

What I've found is that I need to spend more time deciding what is important, and less time consuming frivolous information. That's hardly a technology problem.

For things I really don't want to forget, I'm using Anki [0], a Spaced Repetition System (SRS). Anki is supremely good at knowing when you're about to forget an item and prompting you to review it.

Spaced practice and retrieval practice, both of which are used in SRS, are two learning techniques for which there is ample evidence that they actually work [1].

You still need to decide what is worth remembering, but that's something technology can't help with, I think.

[0] https://apps.ankiweb.net/

[1] https://www.learningscientists.org/

dredmorbius · on Dec 20, 2019

Yes so much this.

There are a few issues to consider:

- Any comprehensive archive of your activity is itself going to be a tremendously "interesting" resource for others -- advertisers, law enforcement, business adversaries, and the like. Baking in strong crypto and privacy protections from the start would be exceedingly strongly advised.

- That's also an excellent reason to have this outside the browser, by default, or otherwise sandboxed.

- Back when I was foolish enough to think that making suggestions to Browser Monopoly #1 was remotely useful, I pointed out that the ability to search within the set of pages I currently have open or have visited would be immensely useful. It's (generally) a smaller set than the entire Web, and comprises a set of at least putatively known, familiar, and/or vetted references. I may as well have been writing in Linear A.

- Context of references matters a lot to me. A reason I have a huge number of tabs open, in Firefox, using Tree-Style Tabs, is that the arrangement and relationships between tabs (and windows) is itself significant information. This is of course entirely lost in traditional bookmarks.

- A classification language for categorising documents would be useful. I've been looking at various of these, including the Library of Congress Subject Headings. A way of automatically mapping 1-6 high-probability subjects to a given reference would be good, as well as, of course, tools for mapping between these.

- I've an increasing difference of opinion with the Internet Archive over both the utility and ultimately advisability of saving Web content in precisely the format originally published. Often this is fragile and idiosyncratic. Upconverting to a standardised representation -- say, a strictly semantic, minimal-complexity HTML5, Markdown, or LaTeX, is often superior. Both have their place.

On that last, I've been continuing to play with the suggestion a few days ago for a simplified Washington Post article scrubber, and now have a suite of simple scripts which read both WashPo articles and the homepage, fetching links from the homepage for local viewing. These tend to reduce the total page size to about 3-5% of the original, are easier to read than the source, and are much more robust.

I'm reading HN at the moment from w3m (which means I've got vim as my comment editor, yay!), and have found that passing the source to pandoc and regenerating HTML from that (scrubbing some elements) is actually much preferable, for the homepage. Discussion pages are ... more difficult to process, and the default view in w3m is unpleasant, though vaguely usable.

Upshot: saving a WARC strictly for archival purposes is probably useful, but generating useful formats as noted above would be generally preferable in addition.

With the increasing untenability of mainstream Web design and practices, a Rococco catastrophe of mainstream browsers, the emergence of lightweight and alternative browsers and user-agents (though many based on mainstream rendering engines), the tyranny of the minimum viable user attacking any level of online informational access beyond simple push-stream based consumption, and more, it seems that at the very least there's a strongly favourable environment to rethinking what the Web is and what access methods it should support. Peaks in technological complexity tend to lead to a recapitulation phase in which former, simpler, ideas are resurrected, returned to, and become the basis of further development.

mark242 · on Dec 19, 2019

I fundamentally agree with the principle -- that pages should be designed to survive a long time -- however the steps the author lays out I completely disagree with.

"The more libraries incorporated into the website, the more fragile it becomes" is just fundamentally untrue in a world where you're self-hosting all of your scripts.

"Prefer one page over several" is diametrically opposed to the hypertext model. Please don't do this.

"Stick with the 13 web safe fonts" assumes that operating systems won't change. There used to be 3 web safe fonts. Use whatever typography you want, so long as you self host the woff files.

"Eliminate the broken URL risk" by... signing up for two monitoring services? Why?

I think this list of suggestions does a great disservice to people who just want to be able to post their thoughts somewhere. There's an assumption here that you'll need to be technically capable in order to create a page "designed to last" and frankly that is not what the internet is about. Yes, Geocities went away. Yes, Twitter and Facebook and even HN will go away. But the answer sure as hell isn't "I teach my students to push websites to Heroku, and publish portfolios on Wix" because that is setting up technical gatekeeping that is completely unnecessary.

est31 · on Dec 20, 2019

> "The more libraries incorporated into the website, the more fragile it becomes" is just fundamentally untrue in a world where you're self-hosting all of your scripts.

There are more problems though. older library versions might be vulnerable to XSS attacks, or use features removed by browsers in the future for security reasons (eval?). Or you might want to change something involving how you use the API but the docs are long gone. Generally, libraries imply complexity and when it comes to reliability, complexity will always be your enemy.

rodw · on Dec 20, 2019

Also unless you're very diligent about semantic markup and separation of content, presentation, interaction logic, the more complicated s site is the more difficult it is to port.

I have run into this problem trying to migrate very old web pages or blog posts off of SaaS sites that are shutting down or just decaying. It's not just that complicated sites make it difficult to extract the content in the first place; it's difficult to publish that content on another site in a high-fidelity, and sometimes even readable, way.

The hard part isn't keeping the old site (page) running (although that's not always easy either). The hard part is when you want to do something _else_ with that content -- more complicated means less (easily) flexible.

meesles · on Dec 20, 2019

I didn't perceive the author to be doing any technical gatekeeping, quite the opposite. I feel like their article was targeted at people like me or others who use stuff like Hugo/Jekyll, or those who use free website builders or use large frameworks for simple websites.

I agree a couple of the points seem out of place (the monitoring service one made me laugh. visiting my website is the first thing I do after uploading a new page), but the intent of this article I wholeheartedly agree with:

Reduce dependencies, use 'dumb' solutions, and do a little ritualistic upkeep of your website to keep it around for a decade or more. The things you propose are the norm and the reason nothing sticks around, IMO.

TeMPOraL · on Dec 20, 2019

> (the monitoring service one made me laugh. visiting my website is the first thing I do after uploading a new page)

I think what you want is not just monitoring your internal links, but also external ones - if a page you linked to in your article starts 404-ing or otherwise changes significantly, it's something you'd likely want to know about. That said, just like preferring GoAccess over Google Analytics, it's something I'd like to have running locally somewhere (on my server, or even on my desktop), instead of having to sign up to some third-party service.

strenholme · on Dec 20, 2019

> "Stick with the 13 web safe fonts" assumes that operating systems won't change. There used to be 3 web safe fonts. Use whatever typography you want, so long as you self host the woff files.

Indeed. 10 years ago, “font-family: Georgia, Serif” was guaranteed to work and look the same on pretty much all computers out there. Windows had all of the “web core” fonts (Georgia, Verdana, Trebuchet, Arial, even Comic Sans). Macintosh computers had all of the “web core” fonts. Even most Linux computers had them because it was legal to mirror, download, and install the files Microsoft distributed to make the fonts widely available.

In the last decade, Android has become a big player, and the above font stack with Georgia will look more like Bitstream Vera than it looks like Georgia on Android.

The only way to have a website have the same typography across computers and phones here in the soon-to-be 2020s is to supply the .woff files. Locally (because Google Webfonts might be offline some day). Either via base 64 in CSS or via multiple files; I prefer base 64 in CSS because sites are more responsive loading a single big text file than 4 or 5 webfont files. Not .woff2: Internet Explorer never got .woff2 support, and we can’t do try-woff2-then-woff CSS if using inline base64.

Even with very aggressive subsetting, and using the Zopfli TTF-to-WOFF converter to make the woff files as small as possible, this requires a single 116 kilobyte file to be loaded with my pages. But, it allows my entire website to look the same everywhere, and it allows my content to be viewed using 100% open source fonts.

Then again, for CJK (Asian scripts), webfonts become a good deal bigger; it takes about 10 megabytes for a good Chinese font. In that case, I don’t think it’s practical to include a .woff file; better to accept some variance in how the font will look from system to system.

Edit In terms of having a 10-year website, my website has been online for over 22 years. The trick is to treat webpages as <header with pointers to CSS><main content with reasonably simple HTML><footer closing all of the tags opened in the header> and to use scripts which convert text in to the fairly simple HTML my website uses for body content (the scripts can change, as long as the resulting HTML is reasonably constant). CSS makes it easy for me to tweak the look and fonts without having to change the HTML of every single page on my site, but as the site gets older, I am slowing decreasing how much I change how it looks.

BeatLeJuce · on Dec 20, 2019

I disagree with keeping fonts inline in the page. It means an additional 100kb per page at the very least. Which adds up very quickly. Remember that most of the world still doesn't have broadband (including yourself if you're using roaming services abroad). It also means extremely redundant information is transmitted when people watch more than one page on your site.

theandrewbailey · on Dec 20, 2019

Using base64 fonts in your stylesheet isn't a big deal when you aggressively cache and compress your CSS.

strenholme · on Dec 20, 2019

Exactly. It’s a CSS (not HTML) file with all the inline fonts in that file, with a long cache time, so all of the website’s fonts are loaded once for site visitors.

rimliu · on Dec 20, 2019

  > "Prefer one page over several" is diametrically opposed to
  > the hypertext model.

No, it is not. No need to split the article into five pages when it can be on one. Unless you want to inflate your clicks, that is.

rsync · on Dec 20, 2019

""Prefer one page over several" is diametrically opposed to the hypertext model. Please don't do this."

I think I agree with you here, in that much of the power of hypertext lies in the hierarchical "tree" model.

And yet, I think it has not been used properly up to this point ...

I hesitate to post this as this is not quite finished[1], but here goes - this is something called an "Iceberg Article":

http://john.kozubik.com/pub/IcebergArticle/tip.html

... wherein the main article content is, as the article suggests, a single, self-contained page.

And yet ... that is just the "tip" - underneath is:

" ... at least one, but possibly many, supporting documents, resources and services. The minimum requirement is simply an expanded form of the tip (the "bummock"), complete with references, notes and links. Other resources and services that might lie under the surface are a wiki, a changelog, a software repository, additional supporting articles and reference pages and even a discussion forum."

[1] Neither wiki nor forum exist yet, but the bummock does...

SamBam · on Dec 20, 2019

I don't see how that's fundamentally different from a Wikipedia article, which is basically

(1) a single, self-contained page, (2) that is just the "tip", and (3) linked within it is all the stuff mentioned

That site has various opinions about the "tip" being uncluttered of links, etc, but that's just an opinion (and one I disagree with).

rsync · on Dec 20, 2019

The thinking here is that the "tip" is <= to a single page.

Wikipedia articles can be quite long (and justifiably so) - perhaps scrolling many pages.

The "tip" of an "Iceberg Article" is ".. a single page of writing ..."

Perhaps confusing because I don't mean a "single (web)page" I mean, an actual single page.

hinkley · on Dec 19, 2019

It's not entirely wrong, if the page is held static but the browser continues to be upgraded.

If you're worried about fonts changing out from under a site you should surely also be worried about bitrot in, say, jQuery.

falcolas · on Dec 19, 2019

Or not bitrot, but ever-changing browser APIs.

bepvte · on Dec 20, 2019

When was the last browser change that broke things like simple news websites????

falcolas · on Dec 20, 2019

The phrases "simple" and "news websites" don't combine well these days. Even the NPR website downloads 13.2 MB of content over 91 individual requests, and takes just over 3.6 seconds to load (6.5 to finish).

- CSS Stylesheets: 3

- Animated gifs: 1

- Individual JS files: 11 (around 2MB of JS decompressed (but not un-minimized))

- Asynchronous Requests: 14 (and counting)

And that's with uBlock Origin blocking 12 different ad requests.

That's not simple in any form. So, the possibility of something on this page breaking? High. There's a lot of surface area for things to break over time. And that's not counting what happens when the NPR's internal APIs change for those asynchronous requests.

epicide · on Dec 20, 2019

I know NPR was just an example, but they do actually have a text-only version that I've found really useful: https://text.npr.org

wopian · on Dec 22, 2019

If the site is being served over HTTP/2 then the 11 separate JS files is a good thing compared to a single 2MB JS file.

-

In my case it also has the added benefit of being able to cache JS for a long(er) period of time, with users only having to download maybe 0-30kb of JS when only 1 component is updated instead of invalidating the entire JS served (Way under 1MB however)

notatoad · on Dec 20, 2019

>Use whatever typography you want, so long as you self host the woff files.

or use Google Web fonts, and set let last option in your font-family to be "serif" or "sans-serif" to let an appropriate typeface be used if your third-party font is unreachable. That's the beauty of text, the content should still be readable even if your desired font is unavailable.

TheRealPomax · on Dec 20, 2019

Google Web Fonts are not an "or", here. Fonts have disappeared from it, and there is no reason to not expect Google to, at some point in the future, go: "you know what, this costs too much without any substantial return." And now it's just another killedbygoogle.com product. Just like images, self-hosting woff/woff2 should be step 1.

shapov · on Dec 20, 2019

Fonts disappearing is not a big issue that will ultimately render your page useless. If the font is gone, the look of the page is slightly affected, but the content of the page remains. It's honestly not a big deal at all.

account42 · on Dec 20, 2019

In that case just use sans-serif or a web-safe font and avoid the third-party dependency.

strenholme · on Dec 20, 2019

Here as we enter the 2020s, there are no longer any web safe fonts. Those 1990s Core Fonts for the Web (Verdana, Georgia, Trebuchet, etc.) are no longer universal across all widely used platforms.

IggleSniggle · on Dec 21, 2019

Yeah okay, but the initial suggestion (just specify "sans serif") still holds. Or really, if we're talking about a webpage to last, why do we even care about what font is being used? If you care enough about a font that the glyphs used are important for layout, then obviously you're going to need to include the font. If the specific look of the page is essential to the content conveyed, it seems likely to me you won't be using a standard font anyway.

For typical "the words matter more than how the words look" content...can someone explain to me why we care about including the font?

strenholme · on Dec 21, 2019

There is another thread in this discussion where we discuss this, pointing out that default fonts in browsers tend to be quite ugly.

See here: https://news.ycombinator.com/item?id=21841011

There’s also layout issues caused when replacing a font with another font, unless the metrics are precisely duplicated. There’s a reason RedHat paid a lot of money to have Liberation Sans with the exact same metrics as Arial, Liberation Serif have the same metrics as Times New Roman, and Liberation Mono have the same metrics as Courier New.

zo1 · on Dec 20, 2019

His "or" was to suggest that instead of only self-hosting the font file, you simply use a google one with a "fall-back" that happens to be a super-standard font that won't reasonably disappear from most OSes in the near future. That way, you get a reasonable "best of both".

joppy · on Dec 20, 2019

Google web fonts were a great way to make my site slower. I don’t know if it’s the latency here in Australia or what, but (especially for developing locally) google web fonts were a big headache for having snappy webpages. I took the time one day to produce my own webfont files and self-host those, and the difference in site load speed is like night and day.

inimino · on Dec 20, 2019

And that's where they are not banned. Many pages simply won't load at all in the PRC because someone thought a Google analytics tracker or a hosted library should load before the content (which then never does).

oefrha · on Dec 20, 2019

Actually Google Analytics works fine in PRC.

Google Fonts also isn’t blocked but I recall it being hit-and-miss in terms of responsiveness when I was working on a website that targeted Chinese audience a few years ago. However, I just tried resolving fonts.googleapis.com and fonts.gstatic.com on a Chinese server of mine, and they both resolve to a couple of Beijing IP addresses belonging to AS24424 Beijing Gu Xiang Information Technology Co. Ltd., so it’s probably very much usable now.

catherd · on Dec 20, 2019

Not sure "working fine in PRC" is really something you can say about anything web related.

I do occasional web dev from within China and had to eliminate external references to get manageable page load times. At least from where I work pulling in practically anything from outside the Great Firewall will have a high probability of killing page load time. Anything hosted by Google in particular will often have you staring at your screen for 30 seconds.

inimino · on Dec 21, 2019

Yes, any additional domain you request to has a non-negligible chance of killing the entire connection. To GP, noticing that one request works once from a server (not a home or mobile connection) really means nothing. Every ISP has different and constantly changing failure modes.

ryandrake · on Dec 20, 2019

Or don’t specify any font at all and leave it up to the user’s preference. Why presume you know better than the user?

notatoad · on Dec 20, 2019

It would be great if web browsers had a way to actually indicate the user's preference of typeface, but what we've actually got is the browser's preference, and the browsers almost all have chosen really terrible default typefaces. It's fine to say "just use the default" for Mac users who get a decent default, but then the poor windows users have to suffer through some terrible serif.

The users who actually know how to change the default font also know how to use stylish.

hashmal · on Dec 20, 2019

When you go to a restaurant you let the chef prepare food for you.

Telling him to back off and let you cook because he can't know better than you (his user) would be absurd.

Same thing with design and typography. It requires skill and taste, and hopefully people will be delighted or simply consume the content for what it is, because the design/cooking just reveals that content in a convenient/useful shape.

joveian · on Dec 20, 2019

Most people have the means to cook for themselves without going anywhere and do so at least the vast majority of the time. Even if you do go to a restaurant, they almost always have menus rather than just making one dish for everyone since some people have styles of cooking that they prefer or don't like. People rarely design their own font but rather pick from professionally designed fonts. Additionally, at restaurants people pay for food so incentives are aligned while on the web people generally don't pay for content and any design professional involved is likely an advertiser. I rarely read stuff on the web for a design experience but for the content. I suspect most people would be unhappy with a newspaper that changed fonts for every story or a book that changed fonts every chapter.

Personally, I've been setting my browser to use only DejaVu fonts with a 16pt minimum for years (maybe a decade now) and every time I briefly use a default browser profile I notice the fonts and think not just "this is bad" but "how can people live like this?". Even with the usually minor issues that often appear, setting my own fonts is a way better experience than not doing so. My default experience is much closer to Firefox reader mode than it is to what the page specifies in most cases.

IMO, font speicification should be limited to serif, sans-serif, or monospace and let the user or browser set the actual font. Desingers should not rely on exact sizes of fonts or use custom icon fonts.

bjoli · on Dec 20, 2019

Most fonts picked by designers suck. Plain and simple. I override fonts for most websites I frequent.

hashmal · on Dec 20, 2019

Can you elaborate on why/how they suck? Do you have example links, to set a common ground for the conversation?

I think most fonts that get your attention suck, the best ones are invisible and get you directly to the meaning of text, without getting in the way. So maybe there's a kind of bias (selection or sampling bias?) operating here?

strenholme · on Dec 20, 2019

I can’t speak for the parent poster, but, yes, back in the Myspace days, end users would do really tasteless CSS like Comic Sans or an italic font everywhere. Back then, I told my browser “I don’t care what font they tell you to use, just render it with Verdana”.

These days, people either use their social network’s unchangeable CSS, or they use a Wordpress theme with an attractive and perfectly readable font. Even Merriweather, which I personally don’t care for, is easy enough to read.

The only time I have seen a page use obnoxious fonts in the 2010s is when the LibreSSL webpage used Comic Sans as a joke to highlight that the project could use more money:

https://web.archive.org/web/20140422115234/http://www.libres...

Edit It may be a case that the parent poster likes using a delta hinted font, either Verdana or Georgia, on a low resolution monitor, and doesn’t like the blurry look of an anti-aliased font on a 75dpi screen.

hashmal · on Dec 20, 2019

> back in the Myspace days, end users would do really tasteless CSS like Comic Sans or an italic font everywhere.

Indeed, typography is a skill. Most designers should have it though, which is why I asked more information to OP.

> The only time I have seen a page use obnoxious fonts in the 2010s is when the LibreSSL webpage used Comic Sans as a joke to highlight that the project could use more money

Ah, the infamous Comic Sans. It's a shame because as a typeface on its own, in its category, it is pretty good. Sadly, it's misused all the time in contexts where it's not appropriate at all.

> It may be a case that the parent poster likes using a delta hinted font, either Verdana or Georgia, on a low resolution monitor, and doesn’t like the blurry look of an anti-aliased font on a 75dpi screen.

Without more details we cannot guess. You're right: a lot of things can go wrong and ruin a typeface, regardless of how the characters are designed. Anti-protip: a reliable way to make any font look like shit is to keep the character drawings as they are and mess up the tracking (letter-spacing) and kerning.

strenholme · on Dec 20, 2019

I think one of the reasons Comic Sans got such a bad rep is because it was one of the relatively few available fonts back in the pre-woff “web safe fonts” era of a decade ago. Microsoft should had given us a more general purpose font, such as a nice looking slab serif to fill the gap between the somewhat old-fashioned looking Georgia and the very stylized Trebuchet MS font.

a1369209993 · on Dec 21, 2019

Because they they are not the single system default sans-serif and single system default sans-serif-monospace fonts that all websites MUST use, period, no discussion. As you put it:

> fonts that get your attention suck

If I can tell the difference between your font and the system default font, your font sucks; if I can't tell the difference, what's the damned point?

strenholme · on Dec 22, 2019

> the single system default sans-serif and single system default sans-serif-monospace fonts that all websites MUST use, period, no discussion.

The web standards allow a website to use any WOFF (or WOFF2) font they wish to use. Please see https://www.w3.org/TR/css-fonts-3/

a1369209993 · on Dec 22, 2019

The web standards are wrong. This shouldn't be surprising, since they also allow a website to use javascript and cookies.

strenholme · on Dec 22, 2019

Well, if it makes you feel any better, my website renders just fine on Lynx (no Javascript nor webfonts needed to render the page), complete with me putting section headings in '==Section heading name==', which is only visible in browsers without CSS. Browsers with modern CSS support see the section headings as a larger semibold sans-serif, to contrast with the serif font for body text. [1]

[1] There are some rendering issues with Dillo, with made the mistake of trying to support CSS without going all the way, making sure that http://acid2.acidtests.org renders a smiley face, but even here I made sure the site still can be read.

[2] Also, no cookies used on my website. No ads, no third party fonts, no third party javascript, no tracking cookies, nothing. The economic model is that my website helps me get consulting gigs.

[3] I do agree with the general gist of what you’re trying to say: HTML, Javascript, and CSS have become too complicated for anything but the most highly funded of web browsers to render correctly. Both Opera and Microsoft have given up with trying to make a modern standards compliant browser, because the standards are constantly updating.

a1369209993 · on Dec 23, 2019

> Well, if it makes you feel any better, my website renders just fine on Lynx

It doesn't; I only use lynx when someone tricks apt-get into updating part of my graphics stack (xorg, video dirvers, window manager, etc) and researh is needed to figure out how to forcibly downgrade it, and then only because I can't use a proper browser without a working graphics stack.

> the general gist of what you're trying to say: HTML, Javascript, and CSS have become too complicated for anything but the most highly funded of web browsers to render correctly.

This is subtly but critically wrong; I am saying that it is necessary than web browsers do not render websites 'correctly'. The correct behaviour is to actively refuse to let websites specify hideous fonts, snoop on user viewing activity, or execute arbitrary malware on the local machine.

> Browsers with modern CSS support see [...] the serif font for body text.

My point exactly.

hashmal · on Dec 24, 2019

"not getting your attention" and "can't tell the difference" are not the same thing.

a1369209993 · on Dec 25, 2019

Fair nitpick - "haven't noticed the difference yet" would be more accurate - but I don't see how that changes the argument; if I haven't noticed a difference, what's the point?

chrismorgan · on Dec 20, 2019

The trouble is that the defaults tend not to be the best fonts that are available, and very few users change them. I have changed them myself, but I don’t know of anyone else that has.

For myself, I wish that people would leave Arial, Verdana, Helvetica Neue, Helvetica, &c. out of their sans-serif stack, having only their one preferred font and sans-serif, or better still sans-serif alone; but as a developer I understand exactly why they do it all.

saagarjha · on Dec 20, 2019

Unfortunately, I'm one of those developers :( My font stack is:

  font-family: system-ui, Helvetica, sans-serif;

for prose and

  font-family: ui-monospaced, Menlo, monospace;

for monospaced text. The first being the user's preferred font, the second as a good (IMO?) default that I impose on them, and the third as a full fallback. I'm conflicted on whether this is the right balance between user choice and handling browsers that support nothing.

incompatible · on Dec 20, 2019

: "in a world where you're self-hosting all of your scripts."

Anything self-hosted is already fragile: it will go away when you don't continue to actively maintain it (paying for a domain, keeping a computer connected to the Internet etc.) or when you die.

Normal_gaussian · on Dec 20, 2019

I've outlived and outhosted all of the third party hosts I've used.

incompatible · on Dec 20, 2019

I guess you can last 10 years, which is apparently what "This Page is Designed to Last" aspires to, but what if we have greater ambitions? Like 100 years?

rodw · on Dec 20, 2019

Then I think you need something like archive.org.

You can design and mark-up content that will still be useful and readable in 100 years. You might be able to preserve the presentation logic (CSS-style) for 100 years.

You probably won't be able to preserve the interaction design for 100 years (without a dedicated effort -- that's why they bury computers along with the software in time capsules).

But I think it is optimistic to think that _most_ SasS hosts are going to archive content for 100 years. Preserving digital content is an _active_ process. It takes resources and requires deliberate effort.

Postscript: I'm trying to think of modern companies that would preserve content for 100 years, assuming they make it that for.

Facebook is the only significant current platform that I can even imagine preserving content for 100 years, but even that seems like s stretch. Historians might step in to archive it, but is there real value to Facebook to maintain and publish 50 year old comments on 2.5 billion unremarkable walls?

Twitter won't. Certainly Insta, SnapChat, WhatsApp etc. won't. Flickr probably could do it relatively easily but won't. YouTube maybe, but there's more to store. Something like GitHub maybe?

Normal_gaussian · on Dec 20, 2019

Repo hosts are quite vulnerable to storage abuse as well as simply accruing genuine old content.

I can see a deletion heuristic that considers both account activity and repo activity being deployed within the next 10 years.

However I expect another evolution in SCM in the same timeframe.

rodw · on Dec 20, 2019

Right, look at SourceForge. There's a lot of broken links and/or references to no-longer accessible content in some of the older Apache.org projects too.

Also maybe cvs/svn/git repo generally don't contain content worth preserving for 100 years. There are some historically significant or interesting repos, but for the most part you'll have a bunch of unremarkable (and duplicated) code that may not have run then and certainly won't run now.

TeMPOraL · on Dec 20, 2019

> a bunch of unremarkable (and duplicated) code that may not have run then and certainly won't run now

100 years is a long time, but I do run 20+ years old Common Lisp libraries and expect them to work without modification; I'd be really pissed if they disappeared from the Internet because someone thought that 5 years of inactivity means something doesn't work anymore.

misterdoubt · on Dec 20, 2019

If you have built something worth lasting 100 years, other people will help you ensure that it does. That reduces the concerns in this article considerably.

atoav · on Dec 20, 2019

When I studied media science one of the most lasting experiences I had was a talk with one lady of the viennese film museum (on of the few film museums that store actual films instead of film props).

As a digital native I never gave it a thought, but she told me that there is a collective memory gap in films that have been shot or stored digitally. With stuff that has been stored on film, there was always soem copy in some cellar and they could make a new working copy from whatever they found. With digital technology this became much much harder and costly for them, because it often means cobbling together the last working tape players and maintaining both the machines and the knowledge of how to maintain them. With stuff on harddrives a hundred different codecs that won’t run on just any machine etc this combined to something she called the digital gap.

I had never thought about technology in that way. Nowadays this kind of robustness, archiveability and futureproofing has become a factor that drives many of my decisions when it comes to file formats, software etc. This is one of the main reasons why I dislike relying soly on cloud based solutions for many applications. What if that fancy startup goes south? What happens to my data? Even if they allow me to get it in a readable format, couldn’t I just have avoided that by using something reliable from the start?

I grew to both understand and like the unix mantra of small independent units of organizations — trying as hard as possible not to make software and other things into a interlinked ball of mud that falls apart once of the parts stops working for one or the other reasons. Thinking about how your notes, texts, videos, pictures, software, tools etc. will look in a quasi post apocalyptic scenario can be a healthy exercise.

Twisell · on Dec 20, 2019

On this subject you can dive into the story of the missing "Doctor Who" TV serials.

Some tape of master's were infamously reused to store other contents. Beside the whole archive problem come from the reusable nature and scarcity of the chosen storage. I think I've read something about reusing paper as well in medieval time.

https://en.wikipedia.org/wiki/Doctor_Who_missing_episodes

vlz · on Dec 20, 2019

> I think I've read something about reusing paper as well in medieval time.

This mostly happened with parchment, not paper, but otherwise you are right. It is called a palimpsest.[1] Sometimes the writing under the writing can be reconstructed as happened with the oldest copy of Cicero's Republic.[2]

[1]: https://en.wikipedia.org/wiki/Palimpsest

[2]: https://www.historyofinformation.com/detail.php?entryid=3059

b0rsuk · on Dec 20, 2019

There is an old observation that I found striking at the time:

Newer methods of storing information tend to be progressively easier to write, and progressively less durable.

(The following is not really in chronological order)

You'll never look at stone tablets the same way again. As primitive as they are, their longevity can be amazing. Ancient emperors and tyrants knew what they were doing. Trajan's column from 113 AD is our main source on roman legionary's iconic equipment.

Cuneiform tablets were heavy and awkward, but they were 3D so there was no paint to worry about.

Parchment tends to be more durable than papyrus, and paper. Perhaps the best known among the Dead Sea Scrolls was made out of copper.

Iron Age culture artifacts are harder to find than Bronze Age one, because bronze is more resistant to corrosion.

CD's, especially(?) from home burners, are reported to oxidize after several years. That may still be better than tapes, hard drives and other magnetic media (SSD?) which can be wiped by an EMP pulse. The internet era information storage appears to come with an upkeep cost! Slack practically doesn't archive messages by default. Until Gmail, it was typical for email servers to delete old messages.

People get used to novelty and things being ephemeral. Capitalism supposedly requires low durability goods so people keep buying them, including tools and clothes. Houses are poorly built break down pretty quickly.

I find it amazing people used to decorate their homes, tools, clothes with ornaments, engravings etc. You'd be a fool to do that today, you don't even know how long that thing is going to last.

mmsimanga · on Dec 20, 2019

Interesting analogy. I am having opposite problem. I have a shoe box half filled with miniDV tapes. Camera is long gone. I would like to transfer these to hard drive. Services that offer service digitize your tapes just too expensive for me. Most of the tapes probably just goofing around and there is issue of privacy. With current camcorder I just plug the SD card into computer and copy across.

ant6n · on Dec 20, 2019

Get an old minidv player which can transit the video over FireWire: https://www.quora.com/What-is-the-best-way-to-transfer-mini-...

blackoil · on Dec 20, 2019

Ask Whovians (https://en.wikipedia.org/wiki/Doctor_Who_missing_episodes).

Post internet, most content is globally replicated. Someone somewhere will find time and energy to make an Amiga simulator with exactly correct bugs, to run the program you want. Amount of content lost proportion to amount of content created must have gone down dramatically.

Twisell · on Dec 20, 2019

Sorry for cross interfering with this post timeline ;)

rusk · on Dec 20, 2019

> With digital technology this became much much harder and costly for them, because it often means cobbling together the last working tape players

I think something might be getting lost in translation. Could she have meant “electronic” rather than “digital” (which to me suggests digital media such as DVD etc)

This whole anecdote makes more sense to me with this substitution.

atoav · on Dec 20, 2019

She was refering to both, she said they have similar problems with CDs and even stuff on hard disks, because often the used Video Codecs are hard to get running without the right knowledge and resources, especially because some of the non-consumer-codecs were often also proprietary and sort of made fore specific plattforms, but I don't know too much on that, so take this as speculation.

rusk · on Dec 29, 2019

Yes when you equate it with “both”it all makes more sense! Just that digital as I’m used to the term excludes the analogue stuff like VHS as well

djha-skin · on Dec 21, 2019

Obligatory xkcd entitled "Digital Resource Lifespan" https://www.xkcd.com/1909/