Improving the DSH search experience

Over the past months I have thought more and more frequently about how to improve the usability of the ~2500 resources I carefully curate here on DSH.

In this post, I detail the design, development and final result of an interactive semantic search and visualization system for my personal knowledge center, built upon modern technologies like marimo, model2vec and datamapplot.

Motivations

I built DSH in late 2020 to give shape and structure(1) to all the interesting data-related links I came across, and I still find it really useful to have a curated collection of online resources that spans my professional and personal interests.(2)

  1. Although I'm probably still feeding it partly to placate a combination of OCD and FOMO.
  2. Even if the idea of deleting everything is starting to tickle me.

The built-in search of mkdocs-material is already a good way to enable full text search1 through DSH and has some good configuration options(1), but in the LLM era - where most information retrieval tasks are performed through natural language and agentic interactions - a search bar without semantic search capabilities seems too limited.

  1. And even some tweaking is available!

Since I wanted to start simple and add complexity little by little, I decided to do some good old experimentation before unleashing a coding assistant with a still vague request such as "please improve the search experience of my static website" (how? where? why?!).

Index creation

First of all, I had to extract all the links from the source Markdown files and put them in a more convenient format for data analysis.

I chose the JSON format and wrote a little parser to build an index.json: nothing fancy here, but I took care to include some simple metadata in the index for each link, such as the category (section here on DSH), the topic (filename where the link has been stored) and the optional section (file subheader) - see the sample below.

[
  {
      "category": "Python",
      "topic": "Documentation",
      "section": "mkdocs",
      "link_name": "Python markdown terminal built for mkdocs",
      "url": "https://github.com/mkdocs-plugins/termynal"
  },
  {
      "category": "Data Science",
      "topic": "Time Series",
      "section": null,
      "link_name": "Pattern mining with `stumpy`",
      "url": "https://towardsdatascience.com/part-8-ab-joins-with-stumpy-af985e12e391"
  },
  {
      "category": "Misc",
      "topic": "Mathematics",
      "section": "Topology",
      "link_name": "Community network detection with Ricci flow and surgery on graphs",
      "url": "https://graphriccicurvature.readthedocs.io/en/latest/tutorial.html"
  }
]
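For reference, here is a minimal sketch of what such a parser could look like, assuming one folder per category, one Markdown file per topic, `##` subheaders as sections and standard `[name](url)` links; paths and naming conventions are illustrative, not the actual DSH layout.

import json
import re
from pathlib import Path

# Standard Markdown links: [name](url)
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def build_index(docs_root: str) -> list[dict]:
    """Walk the Markdown sources and collect one record per link."""
    index = []
    for md_file in Path(docs_root).rglob("*.md"):
        category = md_file.parent.name.replace("-", " ").title()
        topic = md_file.stem.replace("-", " ").title()
        section = None
        for line in md_file.read_text(encoding="utf-8").splitlines():
            if line.startswith("## "):  # subheaders become sections
                section = line.removeprefix("## ").strip()
            for name, url in LINK_RE.findall(line):
                index.append(
                    {
                        "category": category,
                        "topic": topic,
                        "section": section,
                        "link_name": name,
                        "url": url,
                    }
                )
    return index


with open("index.json", "w", encoding="utf-8") as f:
    json.dump(build_index("docs"), f, indent=2, ensure_ascii=False)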

New search design

Embeddings

For a data professional fluent in Python, experimentation nowadays means marimo, so I spun up a new notebook.

To abide by the "start simple" requirement, my first choice for embeddings was TF-IDF, and I opted for the embetter.text.learn_lite_text_embeddings implementation, which conveniently wraps it together with LSA (via TruncatedSVD).

For each element in the index, the actual data passed to the embedding pipeline is the concatenation of category, topic, section and link name, chained together to resemble an actual sentence with the following template: {category} {topic}, {section}: {link_name}. The input sentences corresponding to the sample index entries above are therefore:

Python Documentation, mkdocs: Python markdown terminal built for mkdocs.
Data Science Time Series: Pattern mining with `stumpy`.
Misc Mathematics, Topology: Community network detection with Ricci flow and surgery on graphs.
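As a rough sketch, the template and the baseline pipeline could look like this; I'm showing the plain scikit-learn equivalent of what embetter wraps, with illustrative hyperparameters rather than the exact ones I used.

import json

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def to_sentence(item: dict) -> str:
    """Turn an index entry into a pseudo-sentence for the embedder."""
    head = f"{item['category']} {item['topic']}"
    if item.get("section"):
        head += f", {item['section']}"
    return f"{head}: {item['link_name']}."


with open("index.json", encoding="utf-8") as f:
    index = json.load(f)

sentences = [to_sentence(item) for item in index]

# Baseline embeddings: TF-IDF followed by LSA (TruncatedSVD)
tfidf_lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=100))
baseline_embeddings = tfidf_lsa.fit_transform(sentences)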

After playing a bit with TF-IDF embeddings, I decided to take the chance to experiment with model2vec by MinishLab: I went for their flagship model potion-base-8M, a very small model on disk (~30 MB) stored via safetensors.
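Swapping in model2vec takes only a few lines; a minimal sketch, reusing the pseudo-sentences built above (the Hugging Face model ID is the one mentioned in this post, the rest is illustrative):

from model2vec import StaticModel

# Load the distilled static embedding model (~30 MB on disk)
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# One dense vector per pseudo-sentence
embeddings = model.encode(sentences)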

What is the effect on 2D visualization?

Here is a visual comparison between the basic embeddings and the model-distilled ones, after UMAP projection with the Euclidean metric.

Both embedding models allow some clusters to emerge (e.g. the LLM "island"), but only the latter seems to guarantee a better separation between all the labelled categories.

(Figures: UMAP projections of the TF-IDF + LSA embeddings and of the potion-base-8M embeddings.)

Query syntax

I wanted the new search system to support the following features: excluding terms from the search and restricting the search to given categories.

For the first requirement, I chose a syntax loosely inspired by Google search: using - in a search query penalizes the search results that match the term which follows.

Example

The query time -series returns the most representative results (1) related to "time" but with little or no semantic relationship with the term "series".

  1. With respect to cosine similarity between the query and the index embeddings.

For the second requirement, I implemented a simple filter logic: using # in a search restricts the results to only those belonging to the specified categories.

Example

The query tree #visualization returns results semantically similar to "tree" found in the "data visualization" category.
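Below is a minimal sketch of how such a query could be parsed and scored; the exact penalty scheme (subtracting the similarity with each excluded term to get a "compound similarity") and the category matching are my assumptions, not necessarily the exact formulas used.

import numpy as np


def parse_query(query: str) -> tuple[str, list[str], list[str]]:
    """Split a raw query into positive text, excluded terms and #categories."""
    positive, excluded, categories = [], [], []
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        elif token.startswith("#") and len(token) > 1:
            categories.append(token[1:].lower())
        else:
            positive.append(token)
    return " ".join(positive), excluded, categories


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `a` and each row of `b`."""
    a_norm = np.linalg.norm(a, axis=-1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=-1)
    return (a @ b.T) / (a_norm * b_norm)


def search(query: str, index: list[dict], embeddings: np.ndarray, encode) -> list[dict]:
    """Rank the index entries against the query; `encode` is e.g. model.encode."""
    positive, excluded, categories = parse_query(query)
    scores = cosine_sim(encode([positive]), embeddings)[0]
    for term in excluded:
        # Penalize items that are close to an excluded term ("compound similarity")
        scores = scores - cosine_sim(encode([term]), embeddings)[0]
    results = []
    for i in np.argsort(scores)[::-1]:
        item = index[i]
        if categories and not any(c in item["category"].lower() for c in categories):
            continue
        results.append({**item, "similarity": float(scores[i])})
        if len(results) == 10:
            break
    return results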

Search interface

Given that I was already coding in a marimo notebook, I chose to stay in the same environment to build the search interface, at least for the moment. I implemented a simple search bar with mo.ui.text placed in the marimo sidebar; the top 10 search results are then displayed vertically underneath, each one being a mo.stat implemented as follows:

import marimo as mo
import pandas as pd


def display_stat(item: pd.Series) -> mo.Html:
    """Render a single search result as a markdown link plus a mo.stat card."""
    return mo.vstack(
        [
            # Clickable link name pointing to the original resource
            mo.md(f"[{item['link_name']}]({item['url']})"),
            mo.stat(
                # Caption: category, topic and (optional) section of the link
                caption=", ".join(
                    [
                        item["category"],
                        (item["topic"] or ""),
                        (item["section"] or ""),
                    ]
                ),
                # `terms_to_ignore` comes from another cell (the excluded terms
                # parsed from the query): when present, show the compound score
                value=item[
                    "similarity"
                    if not terms_to_ignore
                    else "compound_similarity"
                ],
                bordered=True,
                # Label: the domain of the linked resource
                label=item["url"].split("://")[-1].split("/")[0],
            ),
        ]
    )
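For completeness, a hedged sketch of how the search bar and the results can be wired together in the notebook; results_df is a hypothetical dataframe holding the ranked results with the fields used by display_stat, and cell boundaries are only hinted at with comments.

import marimo as mo

# One cell: a plain text input placed in the marimo sidebar
query = mo.ui.text(placeholder="Search DSH...", label="Search")
mo.sidebar([query])

# Another cell: stack the top 10 results vertically underneath
mo.vstack([display_stat(row) for _, row in results_df.head(10).iterrows()])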


Visualization

Projection in 2D

Having used BERTopic over the last few years, I chose UMAP to project the high-dimensional embeddings into 2D and prepare them for visualization. As of now, the only hyperparameter customization has been selecting the cosine metric instead of the Euclidean one.
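The projection step itself is tiny; a minimal sketch, with every hyperparameter except the metric left at its default:

import umap

# Project the high-dimensional embeddings to 2D for plotting
reducer = umap.UMAP(n_components=2, metric="cosine")
coords_2d = reducer.fit_transform(embeddings)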

DataMapPlot

DataMapPlot is a powerful tool to create stunning "enhanced scatterplots", both static and interactive, with a specific focus on text embeddings. I went for the interactive version of the plot with the following configuration (sketched in code after the list):

  • the labels for the different levels of resolution are, respectively, category, topic and section (see above)(1);
  • the hover text is the link name, and there is a binding on the on_click event to open the corresponding URL;
  • the in-plot search is enabled and runs against the link_name field;
  • a selection handler can be triggered with Shift+Left Button to perform a lasso selection and obtain 10 samples from the selected region, listed in a popup on the right side.
  1. Labelling each link I add to DSH, despite all the time it takes, now seems to have been a rewarding task!
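A hedged sketch of that configuration is below; parameter names reflect my reading of the datamapplot docs and may need adjusting, and the selection handler is omitted for brevity.

import datamapplot
import numpy as np
import pandas as pd

# Label layers, from coarse to fine: category, topic, section
categories = np.array([item["category"] for item in index])
topics = np.array([item["topic"] or "Unlabelled" for item in index])
sections = np.array([item["section"] or "Unlabelled" for item in index])

plot = datamapplot.create_interactive_plot(
    coords_2d,
    categories,
    topics,
    sections,
    hover_text=[item["link_name"] for item in index],
    extra_point_data=pd.DataFrame({"url": [item["url"] for item in index]}),
    on_click="window.open(`{url}`)",  # open the corresponding URL on click
    enable_search=True,
)
plot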

Search binding

As a last feature, I added a mo.ui.switch to optionally bind the plot to the search results. The logic is simple (a sketch follows the list):

  • if the switch is active, a set of relevant points is selected: those whose relevance versus the search query, i.e. cosine similarity, is at least 50% of the relevance of the most relevant result;
  • this set of points is then used to filter the visualization, displaying only the points which belong to it.
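A sketch of that filtering logic, assuming scores holds the cosine similarities of all points versus the current query; the 50% threshold mirrors the description above.

import numpy as np


def relevant_mask(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep only points whose relevance is at least `threshold` of the best one."""
    return scores >= threshold * scores.max()


# Filter the 2D coordinates (and the matching metadata) before plotting
mask = relevant_mask(scores)
filtered_coords = coords_2d[mask]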

Next steps


(Figure: the new search system layout.)

I'm pretty happy with the final result, but I already have some additions in mind:

  1. experiment with further reducing the size of the MinishLab models;
  2. embed the marimo app directly into DSH, to allow enhanced search online2;
  3. compare the current implementation with a full BERTopic pipeline;
  4. integrate an LLM to further improve search experience.

  1. This implementation is based on lunr.py

  2. Unfortunately, WASM-based embedding isn't a solution as of now, mainly because many of the packages I used aren't available in the Pyodide package list