When elvex ingests a Datasource, the process looks something like this:
Store the raw file (e.g. PDF, DOCX, etc.) in a secure, non-publicly available location.
Launch a job that:
Securely downloads the raw file.
Splits the document into "chunks" (each chunk contains a number of sentences).
Further processes the chunks and stores them in elvex's database.
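To make the chunking step above more concrete, here is a minimal sketch in Python. It is not elvex's actual implementation: the function names, the regex-based sentence splitter, and the `sentences_per_chunk` parameter are all assumptions chosen for illustration, and a production pipeline would likely use a more robust sentence splitter.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split text into sentences (illustrative assumption only).

    Breaks after '.', '!' or '?' when followed by whitespace. This will
    mis-split abbreviations such as "e.g. PDF"; it is only a stand-in.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_sentences(sentences: list[str], sentences_per_chunk: int = 3) -> list[str]:
    """Group consecutive sentences into chunks of a fixed size."""
    if sentences_per_chunk < 1:
        raise ValueError("sentences_per_chunk must be at least 1")
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def chunk_document(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Split a document's extracted text into sentence-based chunks."""
    return chunk_sentences(split_sentences(text), sentences_per_chunk)
```

For example, calling `chunk_document(text, sentences_per_chunk=2)` on a five-sentence document yields three chunks, with the last chunk holding the single leftover sentence.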
Files provided to elvex are never publicly accessible.
Static files are processed and stored once; dynamic datasource files, however, can be periodically synced and re-processed by elvex. Learn more about dynamic datasources.
What information is stored in elvex's database?
To support a broad range of use cases, the "chunks" we store in elvex's database include fields like:
Metadata about the original file the chunk belonged to.
The raw text the chunk was based on (reminder: a chunk is a group of sentences).
An embedding for the chunk (a numeric representation of the chunk which captures its meaning).
Some additional refined fields which are a bit of our secret sauce to make searching fast and "just work".
So, in a way, your data is stored twice: once as a raw file and again in elvex's database as chunks.
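As a rough illustration of the fields listed above, a stored chunk could be modeled like this in Python. The class and field names (`SourceMetadata`, `ChunkRecord`, `file_name`, `content_type`) are hypothetical and do not reflect elvex's real schema, and the additional refined fields are intentionally left out. The `cosine_similarity` helper is included only to show one common way numeric embeddings are compared; it is not a description of how elvex searches.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMetadata:
    """Hypothetical metadata about the file a chunk came from."""
    file_name: str
    content_type: str


@dataclass(frozen=True)
class ChunkRecord:
    """Hypothetical shape of one stored chunk."""
    source: SourceMetadata          # metadata about the original file
    text: str                       # raw text of the chunk (a group of sentences)
    embedding: tuple[float, ...]    # numeric representation of the chunk


def cosine_similarity(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Return the cosine similarity of two equal-length vectors.

    Values near 1.0 mean the vectors point in the same direction.
    Returns 0.0 if either vector has zero length.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In this sketch the embedding values would come from whatever embedding model is in use; the record only stores the resulting numbers alongside the text and file metadata.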