Announcing nanosearch, a Python package for making small search engines

Published on under the Coding category.

In my notes, I wrote down an idea for “tiny search engines.” I wish there were more search engines for small communities that index websites that matter to them. I wish I could make a small search engine of a handful of sites relevant to a particular topic that I can use and share with others.

To enable more search engines, it needs to be easier to make a search engine. I have spent a lot of time thinking about search in the past and have found many rabbit holes into which one can fall. You can spend weeks learning about crawling, indexing, ranking algorithms, data storage, link graphs, and more. There is so much to learn, but you don’t need much to make a small search engine for a few websites.

Introducing nanosearch

I am working on a new Python package called nanosearch (pip install nanosearch). It is designed to create a search index you can use in a few lines of code. I made this tool as part of an exploration I want to do into tools for technical writers. Nanosearch implements primitives that get you a working search engine, without having to set up a database or think about ranking algorithms.

nanosearch accepts either a sitemap, from which it reads every listed URL, or a list of URLs. It then crawls all of those pages and builds a BM25 or TF-IDF search index, depending on which option you choose. nanosearch also calculates a link graph from all the pages on your site and uses the number of inlinks to a page as a ranking factor, so URLs that receive more links are boosted in the results.
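To make the ranking idea concrete, here is a minimal sketch of BM25 scoring combined with an inlink boost. This is not nanosearch's actual implementation: the function names, the whitespace tokeniser, and the log-based boost formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    # Fall back to 1 so that a corpus of empty documents does not divide by zero.
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def count_inlinks(links):
    """links maps each URL to the URLs it links to; returns inlink counts."""
    inlinks = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                inlinks[target] += 1
    return inlinks


def search(query, pages, links, n=5):
    """Rank pages by BM25 score multiplied by an inlink boost."""
    urls = list(pages)
    docs = [pages[url].lower().split() for url in urls]
    scores = bm25_scores(query.lower().split(), docs)
    inlinks = count_inlinks(links)
    ranked = []
    for url, score in zip(urls, scores):
        if score > 0:
            # Assumed boost: log1p dampens the effect of heavily linked pages.
            ranked.append((score * (1 + math.log1p(inlinks[url])), url))
    ranked.sort(reverse=True)
    return [url for _, url in ranked[:n]]
```

With this sketch, two pages that have identical text are ordered by how many other pages link to them, and pages that do not match the query are left out entirely.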

How to use nanosearch

Here is how you can create a nanosearch search engine:


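from nanosearch import NanoSearchBM25

# Build a BM25 index from every URL listed in the sitemap.
engine = NanoSearchBM25().from_sitemap(
    "https://jamesg.blog/sitemap.xml",
    # Keep only the part of each page title before the first "|".
    title_transforms=[lambda x: x.split("|")[0]]
)

# Retrieve the five highest-ranked results for the query.
results = engine.search("aeropress", n=5)

for i, r in enumerate(results):
    print(f"{i + 1}. {r['title']} ({r['url']})")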

The code above creates a NanoSearchBM25 index, then loads all URLs from a sitemap. I have supplied a title transform function that manipulates the title associated with each indexed document. The transform function above removes all text from the first | character onwards, a separator I use on my blog. The code then runs a search for the term aeropress. The search returns:


1. Why I Love the Aeropress  (https://jamesg.blog/2020/10/28/why-i-love-the-aeropress)
2. Aeropress Recipe  (https://jamesg.blog/2020/10/01/aeropress-recipe)
3. An Aeropress glossary  (https://jamesg.blog/2021/02/07/aeropress-glossary)
4. Building a random Aeropress recipe generator for my search engine  (https://jamesg.blog/2021/08/20/random-aeropress-recipes)
5. How to Shake Up Your Aeropress Recipe  (https://jamesg.blog/2021/05/11/shake-up-aeropress)

In a few lines of code, we were able to make a search engine that returns results relevant to a query, with links as a ranking factor to help prioritise documents in the search.
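To show what the title transform does on its own, here is a small sketch. The helper apply_title_transforms and the sample title suffix are hypothetical, and applying the transforms in order is an assumption about how a list of them would behave, not a description of nanosearch's internals.

```python
def apply_title_transforms(title, title_transforms):
    """Apply each transform to the title in turn (assumed behaviour)."""
    for transform in title_transforms:
        title = transform(title)
    return title


# Hypothetical raw page title that uses "|" as a separator.
raw_title = "Why I Love the Aeropress | Example Blog"

# The transform from the example: keep the text before the first "|".
print(repr(apply_title_transforms(raw_title, [lambda x: x.split("|")[0]])))
# 'Why I Love the Aeropress '

# Adding str.strip as a second transform removes the trailing space.
print(repr(apply_title_transforms(raw_title, [lambda x: x.split("|")[0], str.strip])))
# 'Why I Love the Aeropress'
```

Splitting on | leaves the space that preceded the separator, which may be why there are two spaces before each URL in the results above. A second transform such as str.strip would tidy that up.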

Once you have made an index, you can save it for later use. You can save and load an index with the following code:


# Save the index to disk.
engine.to_nanosearch_json("index.json")

# Load the saved index later, without crawling again.
engine = NanoSearchBM25().from_nanosearch_json("index.json")

Next steps

At the moment, nanosearch can only search through one sitemap. I would like to add support for searching through multiple sitemaps, so I could make a search engine of several resources. I can imagine building a search engine that combines several academic journal websites so I can explore papers with a limited search space.
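One possible route to multiple sitemaps, since nanosearch already accepts a list of URLs, is to merge the URLs from several sitemaps before indexing. The sketch below uses only the Python standard library and works on sitemap XML that has already been downloaded. The function names are hypothetical, and passing the merged list to nanosearch is left out because it depends on the package's URL-list API.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap_xml(xml_text):
    """Return every <loc> URL found in a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def merge_sitemaps(xml_documents):
    """Combine URLs from several sitemaps, dropping duplicates, keeping order."""
    seen = set()
    merged = []
    for xml_text in xml_documents:
        for url in urls_from_sitemap_xml(xml_text):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

This handles plain urlset sitemaps only. A sitemap index file, which points to other sitemaps, would need an extra step to follow each nested sitemap.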

I need to think through how nanosearch could be integrated into a web application, and what a template for that would look like. The web application could build an index and serve a search page. In the background, every day, nanosearch could build a new index and save it to the file system. The web application could check for a new search index every day and reload it, allowing the application to run without having to stop.
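The reload step could look something like the sketch below, which swaps in a new index whenever the file on disk changes. The ReloadingIndex class is hypothetical and is not part of nanosearch. It takes any loader function, so the loading call shown earlier could be passed in as a lambda.

```python
import os
import threading


class ReloadingIndex:
    """Hold a search index and reload it when its file changes on disk."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader  # function that takes a path and returns an index
        self._lock = threading.Lock()
        self._mtime = None
        self._engine = None
        self.refresh()

    def refresh(self):
        """Reload the index if the file was modified; return True if reloaded."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._mtime:
            return False
        # Build the new index first, so searches keep using the old one meanwhile.
        engine = self.loader(self.path)
        with self._lock:
            self._engine, self._mtime = engine, mtime
        return True

    @property
    def engine(self):
        with self._lock:
            return self._engine


# Assumed usage with the loading call shown earlier in this post:
# index = ReloadingIndex(
#     "index.json",
#     lambda path: NanoSearchBM25().from_nanosearch_json(path),
# )
# index.refresh()  # call from a daily timer or before handling a request
# results = index.engine.search("aeropress", n=5)
```

Because the new index is fully loaded before it replaces the old one, the web application can keep answering searches during a reload. The background job that builds the index would ideally write to a temporary file and then rename it into place, so the application never reads a half-written file.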
