Supported File Types
Raw document types
The upload endpoint supports several raw document types. Vectara extracts text
from these documents and divides it into sections as best it can. This is a
convenient way to index text, but the caller has less control than when
providing the Document proto message directly. The following raw document
types are supported:
- Commonmark / Markdown (md extension).
- PDF/A (pdf).
- Open Office (odt).
- Microsoft Word (doc, docx).
- Microsoft Powerpoint (ppt, pptx).
- Text files (txt).
- HTML files (.html).
- LXML files (.lxml).
- RTF files (.rtf).
- ePUB files (.epub).
- Email files conforming to RFC 822.
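A client may want to check a file's extension against the list above before uploading it. The following is a minimal sketch of such a check; the names RAW_EXTENSIONS and is_supported_raw_type are hypothetical and not part of any Vectara SDK. Email files are not covered, because the list above identifies them by RFC 822 conformance rather than by extension.

```python
from pathlib import Path

# Extensions copied from the raw document list above.
# The set and the helper below are illustrative names, not Vectara APIs.
RAW_EXTENSIONS = {
    "md", "pdf", "odt", "doc", "docx", "ppt", "pptx",
    "txt", "html", "lxml", "rtf", "epub",
}


def is_supported_raw_type(filename: str) -> bool:
    """Return True if the file's extension appears in the raw document list."""
    # Path.suffix yields the final extension with its dot, e.g. ".PDF".
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix in RAW_EXTENSIONS
```

The check is case-insensitive and looks only at the final extension, so a file named report.PDF passes while a file with no extension does not. It says nothing about whether the file's contents are valid for that type.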
Semi-structured documents
In gRPC, the upload endpoint also accepts semi-structured documents that
reflect a Document proto message. These can be sent in the following formats:
- pb: Contains binary serialized Document proto message.
- pbtxt: Contains Document proto message in proto text format.
- json: Contains Document proto message in json text format.
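Before sending a semi-structured file, it can help to confirm that its extension is one of the three formats above and, for json, that the file at least parses. The sketch below uses only the Python standard library; the helper name read_semi_structured is hypothetical. It does not validate the payload against the Document schema, which would require the generated proto classes.

```python
import json
from pathlib import Path

# Format names copied from the list above; the helper is an illustrative name.
SEMI_STRUCTURED_FORMATS = {"pb", "pbtxt", "json"}


def read_semi_structured(path: str) -> bytes:
    """Read a serialized Document file after basic sanity checks."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in SEMI_STRUCTURED_FORMATS:
        raise ValueError(f"unsupported semi-structured format: {ext!r}")
    # Read as bytes so that binary pb payloads are not corrupted by decoding.
    payload = Path(path).read_bytes()
    if ext == "json":
        # Checks only that the file is well-formed JSON, not that it
        # matches the Document proto message.
        json.loads(payload)
    return payload
```

Reading every format as bytes keeps the binary pb case safe and leaves text decoding to whatever client performs the upload. A malformed json file raises json.JSONDecodeError, which is a subclass of ValueError.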
In REST API v2, use the Indexing API v2 endpoint instead.