Native read support for Lance on HF Hub — plus 3 datasets to try (FineWeb-Edu, OpenVid, LAION-1M).
Lance format is now supported on the Hugging Face 🤗 Hub
We’ve been working closely with the Hugging Face team to enable native read support for Lance datasets on the Hub. You can open Lance datasets via hf:// URLs and scan, filter, and project columns directly from Hugging Face storage—without downloading the full dataset or rebuilding metadata locally.
When a dataset includes vector or full-text indexes, you can also run ANN / FTS queries directly against the dataset on the Hub.
🎉 What This Enables
Lance datasets on the Hub can be published as a single reusable artifact that includes metadata, embeddings, indexes, and large binary assets (images, video, audio). Consumers no longer need to stitch together tables, regenerate embeddings, or rebuild indexes just to explore or query a dataset.
You can scan and filter metadata remotely without touching heavy blobs, fetch individual images or videos only when needed, and reuse precomputed embeddings and indexes created by the dataset author. The same dataset works with Hugging Face streaming APIs for lightweight exploration, or with Lance’s native API when you need ANN search, FTS, or direct blob access.
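The remote scan described above can be sketched with Lance's Python API (`pip install pylance`). This is a minimal sketch, not a definitive recipe: the `data/train.lance` path is inferred from the LanceDB convention of storing a table named `train` as `train.lance` under the database directory, and the helper names `hub_uri` and `preview_rows` are hypothetical.

```python
def hub_uri(repo_id: str, path: str) -> str:
    """Build an hf:// URI for a Lance dataset inside a Hub dataset repo."""
    return f"hf://datasets/{repo_id}/{path.lstrip('/')}"


def preview_rows(uri: str, columns: list, row_filter: str = None, limit: int = 5):
    """Scan a remote Lance dataset, reading only the requested columns.

    Projecting a few light columns means heavy blob columns
    (images, video) are not fetched during the scan.
    """
    import lance  # imported lazily so the URI helper works without pylance

    ds = lance.dataset(uri)
    scanner = ds.scanner(columns=columns, filter=row_filter, limit=limit)
    return scanner.to_table()  # returns a pyarrow.Table
```

A call might look like `preview_rows(hub_uri("lance-format/laion-1m", "data/train.lance"), columns=["caption"], row_filter="caption IS NOT NULL")`, where `caption` is an assumed column name; check the dataset's schema (`lance.dataset(uri).schema`) for the real ones.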
🔍 Example: Vector search
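```python
import lancedb

# In LanceDB, you open a database connection, then a table
db = lancedb.connect("hf://datasets/lance-format/laion-1m/data")
tbl = db.open_table("train")

# Placeholder query vector; in practice, use a real 768-dimensional embedding
query_embedding = list(range(768))

# Assumption: run the ANN search with LanceDB's standard query builder
results = tbl.search(query_embedding).limit(5).to_pandas()
```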
OpenVid: Videos stored inline as blobs, with embeddings, metadata, and prebuilt IVF_PQ + FTS indexes. Scan metadata cheaply, then fetch individual videos on demand.
LAION-1M: Inline JPEG bytes with CLIP embeddings and a built-in vector index, ready for ANN search.
💡 What's Next?
We'll be actively adding support for more popular public datasets across domains such as CV, NLP, IR, and robotics in the Lance Format HF organization.
🔗 Learn More
The blog covers the architecture in more detail (OpenDAL integration, blob access, remote scans, index reuse).