FTS, Compression and Archiving for text documents
Published: 2023-12-17

This is a blog post seeking feedback and suggestions on the design and implementation of storage, compression, and lookup operations for an archival system. The project is not public yet; I will link it here as soon as it becomes public.

Why am I working on this project? No particular reason; it just seemed like a fun problem to work on. It is inspired by projects such as NYT first said and newsdiff.

General Idea

The eventual idea is to be able to answer queries like:

  1. When was "apple" mentioned for the first time in the NYT?
  2. When were "apple", "mango", and "juice" mentioned together for the first time in the Times of India? (Not necessarily as a phrase, but in the same article.)
  3. When was "apple and mango juice" mentioned as a phrase for the first time in the Times of India?
  4. When were the phrases "apple and mango juice" and "reliance jio" both mentioned for the first time in the Times of India?
  5. Now think of making all these queries across different publications: NYT, Times of India, and others.
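
As a purely illustrative sketch of the intent, query (2) could be expressed like this over the per-word occurrence store described under Databases below, using DuckDB from Python. The `words/` Parquet layout and its columns are my own assumptions, not a settled schema:

```python
import duckdb

# Assumed layout: one row per (word, article) occurrence, one or more Parquet
# files per publication, with columns word, article_id, published_at.
first_mention = duckdb.sql("""
    SELECT article_id, MIN(published_at) AS published_at
    FROM read_parquet('words/times-of-india/*.parquet')
    WHERE word IN ('apple', 'mango', 'juice')
    GROUP BY article_id
    HAVING COUNT(DISTINCT word) = 3   -- all three words in the same article
    ORDER BY published_at
    LIMIT 1
""")
print(first_mention)
```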

What seem to be the main components?

  • Connectors: Independent data fetchers (one per publication/source). These fetch the data from somewhere (an API, an existing data dump, etc.) and store it for us in the desired format.
  • Databases
    • SQLite
      • Cached 1st said store
      • Operations DB
        • This is meant to ensure idempotency of the fetch and processing operations by storing their metadata
    • DuckDB
      • All words
        • To answer queries like (2), we store each occurrence of a word per article in a Parquet file (see the Parquet sketch after this components list).
        • I am assuming the columnar store’s compression will kick in and keep the size small.
        • This is my first time working with such a store, so we’ll see whether my assumption that this is faster than FTS for these queries holds true.
        • This will also not contain stop words etc.
    • Meilisearch
      • While Meilisearch seems like the winner here, I considered and compared other solutions. (See FTS section)
      • If we get FTS to work correctly and efficiently, we might not even need the other two databases. We’ll run some benchmarks and comparisons and decide on that later.
  • Clients
    • CLI: for triggering fetch/process runs and for queries
    • Web: for queries
  • Orchestration
    • I’ve written the basic system that keeps all the components together.
    • I am looking at windmill for the final orchestration, but for now I am focusing on making the data fetching/processing runs idempotent across the different sources/publications.
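
To make the "all words" idea above concrete, here is a minimal sketch of how the per-article word occurrences could be written to Parquet with columnar zstd compression, using pyarrow. The schema, field names, and file layout are assumptions on my part, not the final design:

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

# One row per word occurrence; stop words are assumed to be removed upstream.
occurrences = pa.table({
    "word":         ["apple", "mango", "juice"],
    "article_id":   ["9b2f0c", "9b2f0c", "9b2f0c"],   # file identity (illustrative)
    "published_at": ["1974-05-02", "1974-05-02", "1974-05-02"],
})

# Columnar layout + zstd should keep the repeated words/dates small on disk.
Path("words/times-of-india").mkdir(parents=True, exist_ok=True)
pq.write_table(occurrences, "words/times-of-india/1974.parquet", compression="zstd")
```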

What am I sort of unsure about? (I NEED HELP HERE) 🌟 🆘

So I’ve finally decided that, based on the operational needs (see below), it’ll be best to:

  • Compress each file using zstd (even though each file averages under 30KB, we get about 2x compression)
  • Identity of the file: md5(contents + publication_date + publication_name)
  • Build the FTS index mapping the identity to content + metadata (a sketch of the identity and compression steps follows this list)
  • I need help/a second pair of eyes on the idea of using per-file compression and going without any archiving. Am I missing a way in which:
    • I could achieve greater compression by archiving files and putting them into some kind of buckets (think hash tables),
    • while also enjoying efficient random access into the compressed archive?
    • Put another way, can I have my cake and eat it too?
    • It seems like I cannot, unless I add a cache file server that caches files from S3.
    • See the Compression and Archiving section.
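
A minimal sketch of the identity and per-file compression steps above. The function names and the zstd level are my own placeholder choices:

```python
import hashlib

import zstandard as zstd

def file_identity(contents: bytes, publication_date: str, publication_name: str) -> str:
    # Identity = md5(contents + publication_date + publication_name), as described above.
    h = hashlib.md5()
    h.update(contents)
    h.update(publication_date.encode())
    h.update(publication_name.encode())
    return h.hexdigest()

def compress_article(contents: bytes) -> bytes:
    # Per-file zstd compression; roughly 2x on ~30KB of extracted article text.
    return zstd.ZstdCompressor(level=19).compress(contents)

text = b"Full article text extracted with trafilatura ..."
key = file_identity(text, "1974-05-02", "times-of-india")
blob = compress_article(text)  # stored as e.g. f"{key}.txt.zst" in the publisher's bucket
```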

Operations

Operational needs

  • Everything will be scoped per publisher/source, i.e. there is no global DB; all the metadata and data for each source are isolated by context (tenant = publisher/source).
  • We need to be able to fetch a file by its identity
  • We need to be able to do full text search on the entire corpus of text

Dataset info

We’re considering upper bounds.

  • Size of each item?
    • Identity: hash of contents + date of publication + publication name
    • Each article, after processing with trafilatura, averages around 30KB.
  • Size of the entire dataset?
    • No. of publishers × (years × 365 × 200 articles/day × 30KB)
    • 200 articles per day, 30KB average
    • E.g. per publication for the last 50 years: 50 × 365 × 200 × 30KB ≈ 100GB/publication
  • Fixed or continuous?
    • About ~200 articles added per day, per source. (+6MB/day)
  • Item is immutable or mutable?
  • Item schema/structure?
    • We have a defined structure for each article: text plus metadata.
    • For other operations, like word extraction, we have other schemas.

Storage

Storage Medium

  • Bucket: By publisher/source.
  • Local and Object store.
  • Because an object store is more limiting than a local file system in certain cases, all our operations are constrained by what an object store like S3 will allow.
  • Limitations/Considerations with object store
    • The network will most probably be involved (HTTP calls)
    • We cannot simply get a single file out of an archive: if our file sits inside an archive, we need to download the whole archive and then extract the file we need from it. This could be optimized if we had the archive on disk. As an optimization for some later point in time, we could have a file cache server that pre-fetches archives from the object store and gives us random access to files on demand. In other words, solid compression becomes problematic. (See the fetch-by-identity sketch below.)
    • We cannot use specialized file system operations
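
As a rough sketch of why per-file compression keeps random access trivial on an object store: fetching an article is a single GET by identity, with no archive to download first. boto3 against an S3-compatible endpoint (e.g. garage) is assumed here, as are the endpoint URL and the `.txt.zst` key suffix:

```python
import boto3
import zstandard as zstd

# Any S3-compatible endpoint works; a local garage instance is assumed here.
s3 = boto3.client("s3", endpoint_url="http://localhost:3900")

def fetch_article(bucket: str, identity: str) -> str:
    # One GET per article: no archive download, no range-request tricks needed.
    obj = s3.get_object(Bucket=bucket, Key=f"{identity}.txt.zst")
    return zstd.ZstdDecompressor().decompress(obj["Body"].read()).decode()
```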

Compression and Archiving

  • F = file (e.g. .txt), C = compression (e.g. zstd), A = archive (e.g. tar)
  • Suppose some source has files F1, F2, F3, and F4
  • RA = random access, FS = file system, OS = object store
ID  Approach                      Compression  RA-FS    RA-OS
1   F1, F2, F3, F4                NO           YES      YES
2   C(F1), C(F2), C(F3), C(F4)    OK           YES      YES
3   A(F1,F2), A(F3,F4)            NO           MAYBE    LIMITED
4   C(A(F1,F2)), C(A(F3,F4))      BETTER       LIMITED  LIMITED
5   C(A(F1,F2,F3,F4))             BEST         MAYBE?   NO
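
A quick, unscientific way to put numbers on the Compression column: compare approach 2 (per-file) against an approximation of approach 5 (solid, ignoring tar headers) over a local directory of extracted articles. The path and compression level are placeholders:

```python
from pathlib import Path

import zstandard as zstd

files = sorted(Path("articles").glob("*.txt"))  # hypothetical local dump of one source
cctx = zstd.ZstdCompressor(level=19)

per_file = sum(len(cctx.compress(f.read_bytes())) for f in files)    # approach 2
solid = len(cctx.compress(b"".join(f.read_bytes() for f in files)))  # ~approach 5

print(f"per-file: {per_file} bytes, solid: {solid} bytes")
```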

As mentioned previously, we want to search over this dataset with certain kinds of queries and also make the dataset available for others to do their own analysis.

  • One word search
  • Multi word search
  • FTS + phrase search + sorting by oldest (“when was X said for the first time”); see the Meilisearch sketch below
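
With the Meilisearch setup described in the next section, the phrase + oldest-first query could look roughly like this. The local instance, index name, and field names are assumptions, and `published_at` must be declared sortable (see the indexing sketch further down):

```python
import meilisearch

client = meilisearch.Client("http://localhost:7700", "masterKey")  # hypothetical local instance
index = client.index("articles-times-of-india")                    # assumed per-publication index

# Quoted terms are matched as a phrase; sorting ascending by date and taking one
# hit answers "when was X said for the first time".
result = index.search(
    '"apple and mango juice"',
    {"sort": ["published_at:asc"], "limit": 1},
)
print(result["hits"])
```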

Why Meilisearch for FTS?

As mentioned previously, the files can live locally or on some object store. So in essence, what we actually need is:

  • Full text search over millions of plain text files stored in an object store
  • After the search is successful, we should be able to efficiently return the contents of the file to the user based on the file identifier.

The final solution I’ve decided on for FTS can be summarized as: use a dedicated search index (not a DB’s built-in FTS) that maps content + metadata to the file identifier. Every time we store new data in the object store, we re-populate this index. We’ll need to host the search engine separately; the index itself can live on any storage medium as required. It does not seem really possible to do FTS directly on S3. Additionally, Meilisearch has good language support beyond English, which will be useful if I want to include non-English publishers in the corpus later.
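
A sketch of that re-population step with the Meilisearch Python client; the document id is the file identity, and the other field names are placeholders of mine:

```python
import meilisearch

client = meilisearch.Client("http://localhost:7700", "masterKey")  # hypothetical local instance
index = client.index("articles-times-of-india")

# Needed once so "oldest first" queries can sort on the date field.
index.update_sortable_attributes(["published_at"])

# Run whenever new files land in the object store: index content + metadata
# against the file identity.
index.add_documents(
    [
        {
            "id": "9b2f0c",  # md5 identity (truncated here for illustration)
            "content": "Full article text extracted with trafilatura ...",
            "publication": "times-of-india",
            "published_at": "1974-05-02",
        }
    ],
    primary_key="id",
)
```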

  • FTS comparison

    The following are Meilisearch alternatives I considered and why they would not (relatively speaking) be a fit for my use case. (I have not hammered the final nail in on Meilisearch yet, but it seems like the easiest to get rolling with for now.) More importantly, I’ve not done any performance benchmarking with my dataset, so in a way I am going with Meilisearch purely on vibes.

    • ripgrep: This was my initial idea, but it won’t work when the files live in an object store.
    • AWS specific
      • AWS Athena: Allows SQL on S3, but our data is unstructured.
      • AWS CloudSearch: I don’t fully understand this; I guess it can do what I want, but I am probably going to use garage and don’t want any kind of vendor lock-in.
      • AWS OpenSearch: the Elasticsearch concerns plus the AWS vendor lock-in issue (plus this is a side project).
      • AWS Kendra: Why does AWS have so many products for the same(?) thing again?
      • AWS S3 Select: Works with structured data, IIUC.
    • mickael-kerjean/filestash: Filestash is an amazing project, and the author mentions using it for this use case, but it offers little of what I need and more of what I don’t need for this project. I do really look forward to using Filestash for my home server though.
    • Elasticsearch: It is possible; the process would be the same as with Meilisearch, but Elasticsearch seems like too much for this small side project.
    • PG FTS
      • First, we’re not using PG anywhere in this project, so it automatically gets less preference here, even though it’s amazing.
      • Secondly, our dataset is very big (it would bloat the DB unnecessarily) and we’d like it to be redistributable; keeping all the text data in a DB would require someone to have a PG client to view the data, rather than accessing it directly from the object store.
      • We want some advanced features that are probably possible with PG FTS but would require me to dive deeper into it; Meilisearch has those features OOTB.
      • SQLite FTS is similar in this regard, and I am not considering it for the same reasons.
    • Manticore: Could be a potential candidate
    • Typesense: Could be a potential candidate (RAM-based)
    • quickwit
      • Quickwit is interesting because it allows you to store the search index in S3/Object store.
      • Meilisearch, even though it can create snapshots to S3, cannot do this the way Quickwit does. But for our use case Meilisearch is good enough, as our data is finite and not that big by big-data standards.
    • sonic
      • Disk-based (not RAM-based), i.e. a search query results in a lot of random accesses on disk.
      • Documentation is good but not super extensive
    • zincsearch
      • Written in Go
      • Could be a potential candidate
    • stork and pagefind
      • More suitable for adding FTS to a static website
    • edgesearch
      • Interesting idea and execution
      • Built around cloudflare workers
      • Documentation is not super extensive
    • Toshi
      • Documentation is not super extensive