Implementing a Static Blog Search Client-Side in JS
One of the primary features of the blog on this website is the fact that it is compiled offline with a Python script and then served as a static site by a "dumb" server. The advantage of this is less overhead for me when hosting, as well as the ability to leverage GitHub Pages to act as a CDN. More info about the decisions behind the script is detailed in this post.
However, there was something that the site had been lacking, and that was a way to actually search through the posts that I had written. I had a couple of ideas on how to go about this without adding the overhead of a backend.
Idea #1 Searching Categories Only
My initial intent for the "categories" to the left of every blog post was to generate a separate JSON blob for every unique category, keeping track of which posts are in which category. Posts can already be referred to by their unique post IDs (seen in the URL). So for instance:
"tutorial" : [1, 4, 7],
In my case post IDs are 10 digits, though. So the idea here is that a search can go through the known tags and return the matches in the form of links to blog posts. This approach is okay, but it only covers the tags, which are obviously a small subset of the content in the posts!
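As a rough sketch of what the category lookup could look like on the client, assuming the category blobs have been merged into one object: the names categoryIndex, searchCategory and toLinks, the second post ID, and the "/blog/<id>" URL pattern are all made up for illustration and are not taken from the real site.

```javascript
// Hypothetical category index: category name -> list of post IDs.
// "1234567890" is an invented ID; the URL pattern below is also an assumption.
const categoryIndex = {
  tutorial: ["6629164446", "1234567890"],
  python: ["1234567890"],
};

// Return the post IDs filed under a category, or an empty list if the
// category is unknown. Matching ignores case and surrounding whitespace.
function searchCategory(index, category) {
  const key = category.trim().toLowerCase();
  return Object.prototype.hasOwnProperty.call(index, key) ? index[key] : [];
}

// Turn post IDs into links to the posts.
function toLinks(postIds) {
  return postIds.map((id) => `/blog/${id}`);
}

console.log(toLinks(searchCategory(categoryIndex, "Tutorial")));
```

Anything that is not an exact category name returns nothing, which is the limitation described above.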
Idea #2 Searching Everything
The second idea was to ship the full text of every post in a single JSON blob, keyed by post ID, and have the search scan all of it. So for instance:
"6629164446" : "One of the primary features of the blog on this website is the fact that it is compiled offline with a python script and then served ... ",
The issue with this is twofold: firstly, I'm providing the entirety of the site just for the search page. The more blog posts, the more text this is going to be. The second issue is similar: every search query is going to have to thumb through all the words in all the entries. This is probably never a real issue given the size of my blog, but it still isn't very efficient.
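To make the cost concrete, here is a minimal sketch of such a brute-force search. The posts object, the second entry in it, and the function names are assumptions for illustration; requiring every query word to match is one possible choice, not necessarily what the site would do.

```javascript
// Hypothetical full-text blob: post ID -> entire post text.
// The second entry is invented for illustration.
const posts = {
  "6629164446": "One of the primary features of the blog on this website is the fact that it is compiled offline ...",
  "1234567890": "A tutorial on writing Python scripts",
};

// Split text into lowercase alphanumeric words, dropping punctuation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Scan every post for every query. The work done per query grows with
// the total amount of text on the blog.
function searchFullText(allPosts, query) {
  const queryWords = tokenize(query);
  if (queryWords.length === 0) return [];
  const matches = [];
  for (const [postId, text] of Object.entries(allPosts)) {
    const words = new Set(tokenize(text));
    if (queryWords.every((w) => words.has(w))) matches.push(postId);
  }
  return matches;
}

console.log(searchFullText(posts, "blog features"));
```

Every call re-tokenizes every post, which is the inefficiency the inverted index below is meant to avoid.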
Idea #3 Inverted Index
An inverted index is a search engine technique for speeding up the response time of a query. The term "inverted index" refers to the fact that it's the opposite of a forward index. A forward index is a mapping from an identifier (the post ID) to the set of unique words within that post. An inverted index flips this around: each unique word is a key, and its value is the set of identifiers of the posts that contain that word.
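To make the two structures concrete, here is a small sketch that flips a forward index into an inverted one. The words, the second post ID, and the names forwardIndex and invert are invented for illustration. On the real site the index would presumably be built by the offline Python script; JavaScript is used here only to match the client-side code.

```javascript
// Hypothetical forward index: post ID -> unique words in that post.
const forwardIndex = {
  "6629164446": ["static", "blog", "python"],
  "1234567890": ["python", "tutorial"],
};

// Flip it: word -> list of post IDs containing that word.
// A Map is used so that words like "constructor" cannot collide with
// properties inherited by plain objects.
function invert(forward) {
  const inverted = new Map();
  for (const [postId, words] of Object.entries(forward)) {
    for (const word of new Set(words)) {
      if (!inverted.has(word)) inverted.set(word, []);
      inverted.get(word).push(postId);
    }
  }
  return inverted;
}

const invertedIndex = invert(forwardIndex);

// JSON-friendly form, suitable for writing out as a static file.
console.log(JSON.stringify(Object.fromEntries(invertedIndex)));
```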
So now, it's easy to break the search query into unique words and then look those words up in that data structure. This results in a set of blog post IDs that match for each word in the query. It also means that the keyset of this dictionary is the set of unique words in all my blog posts, or in other words my blog's "vocabulary".
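A minimal sketch of that lookup, assuming the index arrives as a plain JSON object. The data and the names searchIndex, splitQuery and search are hypothetical. This version intersects the per-word results so that a post must contain every query word; taking the union instead would be an equally reasonable choice.

```javascript
// Hypothetical inverted index as it might arrive in the JSON blob:
// word -> IDs of the posts containing that word.
const searchIndex = {
  static: ["6629164446"],
  blog: ["6629164446", "1234567890"],
  python: ["1234567890"],
};

// Break a query into unique lowercase words.
function splitQuery(query) {
  return [...new Set(query.toLowerCase().match(/[a-z0-9]+/g) || [])];
}

// Look each word up and intersect the resulting sets of post IDs.
function search(index, query) {
  let result = null;
  for (const word of splitQuery(query)) {
    const ids = Object.prototype.hasOwnProperty.call(index, word) ? index[word] : [];
    result = result === null
      ? new Set(ids)
      : new Set(ids.filter((id) => result.has(id)));
  }
  return result === null ? [] : [...result];
}

// The keys of the index double as the blog's vocabulary.
const vocabulary = Object.keys(searchIndex);

console.log(search(searchIndex, "static blog"));
console.log(vocabulary);
```

Each query word costs one key lookup, regardless of how much text the posts contain.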
I went with the third approach, which you can see in action on the Search page. There's still a bit of overhead in sending the large JSON blob of search data, but I believe the benefits of being a static site outweigh the extra download time. The blob should never greatly exceed the file size of a small image, let alone a large one.
Future upgrades to this mechanism include recording how often each word appears in each post, in order to "rank" the results by how closely they match the query. Also, with the frequencies recorded, it should be possible to make one of those cool word cloud thingies.
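One way this could look, purely as a sketch of an upgrade that does not exist yet: each word maps to per-post occurrence counts, and a post's score is the sum of the counts for the query words it contains. The structure, the numbers, and the names frequencyIndex and rankedSearch are all assumptions.

```javascript
// Hypothetical frequency-tagged index: word -> { postId: occurrences }.
const frequencyIndex = {
  search: { "6629164446": 5, "1234567890": 1 },
  index: { "6629164446": 2 },
  python: { "1234567890": 3 },
};

// Score each post by summing the occurrence counts of the query words
// it contains, then return post IDs ordered from best to worst match.
function rankedSearch(index, query) {
  const scores = new Map();
  const words = new Set(query.toLowerCase().match(/[a-z0-9]+/g) || []);
  for (const word of words) {
    const postings = Object.prototype.hasOwnProperty.call(index, word) ? index[word] : {};
    for (const [postId, count] of Object.entries(postings)) {
      scores.set(postId, (scores.get(postId) || 0) + count);
    }
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([postId]) => postId);
}

console.log(rankedSearch(frequencyIndex, "search index"));
```

Summing the counts for a word across all posts would give the weights needed for a word cloud.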