Native read support for Lance on HF Hub — plus 3 datasets to try (FineWeb-Edu, OpenVid, LAION-1M).

Lance format is now supported on the Hugging Face 🤗 Hub

We’ve been working closely with the Hugging Face team to enable native read support for Lance datasets on the Hub. You can open Lance datasets via hf:// URLs and scan, filter, and project columns directly from Hugging Face storage—without downloading the full dataset or rebuilding metadata locally.

 

When a dataset includes vector or full-text indexes, you can also run ANN / FTS queries directly against the dataset on the Hub.
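
As a rough illustration of the remote scan, filter, and projection described above, here is a minimal sketch using the lance Python package (pylance). It assumes the LAION-1M table is stored at data/train.lance inside the repo, which is inferred from the LanceDB connection path used in the vector search example rather than confirmed. The helper names lance_uri and preview are hypothetical, and img_emb is the only column name taken from this announcement.

```python
from typing import List, Optional


def lance_uri(repo_id: str, table: str, subdir: str = "data") -> str:
    """Build an hf:// URI for a Lance table in a Hub dataset repo.

    Assumption: tables are stored as <subdir>/<table>.lance in the repo.
    """
    return f"hf://datasets/{repo_id}/{subdir}/{table}.lance"


def preview(uri: str, columns: List[str], row_filter: Optional[str] = None, limit: int = 5):
    """Read a few rows remotely, requesting only the listed columns."""
    import lance  # imported lazily so lance_uri works without pylance installed

    ds = lance.dataset(uri)
    # columns = projection, filter = SQL-style predicate, limit = row cap.
    # Columns that are not requested (e.g. image blobs) are left out of the result.
    return ds.to_table(columns=columns, filter=row_filter, limit=limit)
```

For example, preview(lance_uri("lance-format/laion-1m", "train"), columns=["img_emb"], limit=3) would return a PyArrow table containing just the embedding column for three rows.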

🎉 What This Enables

Lance datasets on the Hub can be published as a single reusable artifact that includes metadata, embeddings, indexes, and large binary assets (images, video, audio). Consumers no longer need to stitch together tables, regenerate embeddings, or rebuild indexes just to explore or query a dataset.

 

You can scan and filter metadata remotely without touching heavy blobs, fetch individual images or videos only when needed, and reuse precomputed embeddings and indexes created by the dataset author. The same dataset works with Hugging Face streaming APIs for lightweight exploration, or with Lance’s native API when you need ANN search, FTS, or direct blob access.
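
For the lightweight exploration path, a sketch with the Hugging Face datasets library might look like the following. It assumes that load_dataset can stream this repo by its ID and that the split is named "train" (matching the table name in the vector search example); neither detail is spelled out in this email. The function names head and peek_laion are hypothetical.

```python
from itertools import islice
from typing import Any, Iterable, List


def head(rows: Iterable[Any], n: int = 3) -> List[Any]:
    """Return the first n items of any iterable, stopping as soon as n are read."""
    return list(islice(rows, n))


def peek_laion(n: int = 3) -> List[Any]:
    """Stream the first n rows of LAION-1M instead of downloading the dataset."""
    from datasets import load_dataset  # lazy import: only needed for the Hub call

    # Assumption: this repo ID and split name resolve through load_dataset.
    stream = load_dataset("lance-format/laion-1m", split="train", streaming=True)
    return head(stream, n)
```

Calling peek_laion() needs network access and the datasets package; head on its own works with any iterable, so it can be tried on a plain list first.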

🔍 Example: Vector search

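import lancedb

# In LanceDB, you open a database connection, then a table
db = lancedb.connect("hf://datasets/lance-format/laion-1m/data")
tbl = db.open_table("train")

# Placeholder 768-dim query vector; in practice, pass the embedding of your query
query_embedding = list(range(768))

# Search the "img_emb" vector column and return the 2 nearest rows as dicts
results = (
    tbl.search(query_embedding, vector_column_name="img_emb")
    .limit(2)
    .to_list()
)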

No local download, no index rebuild.

🤗 Datasets Available Today

FineWeb-Edu (Lance) — 1.5B+ text passages

Cleaned long-form text with metadata and 384-dim embeddings. (ANN / FTS indexes coming; download locally for heavy search workloads.)

 

OpenVid (Lance) — ~938k videos

Videos stored inline as blobs, with embeddings, metadata, and prebuilt IVF_PQ + FTS indexes. Scan metadata cheaply, then fetch individual videos on demand.

 

LAION-1M (Lance) — image-text pairs

Inline JPEG bytes with CLIP embeddings and a built-in vector index, ready for ANN search.

💡 What's Next?

We'll keep adding popular public datasets from domains such as CV, NLP, IR, and robotics to the Lance Format organization on Hugging Face.

🔗 Learn More

The blog covers the architecture in more detail (OpenDAL integration, blob access, remote scans, index reuse).

  • Read the blog →

  • Explore Lance Datasets on Hugging Face Hub →

 

Huge thank you to Quentin Lhoest and the Hugging Face team for their support in making this possible!


LanceDB, 352 Cumberland Street, San Francisco, California 94114
