lance-datafusion as a subproject #4779
-
It's definitely an interesting topic. At Rerun I am currently working on an implementation of Taking off my Rerun hat and putting on my DataFusion PMC hat, I think this is a great idea! We'd love to see more implementations of catalog providers, and especially working to make that a pleasant experience with datafusion-python would be a personal win. What I don't know is what kind of support I can provide. I can share the code I've been putting together so far for the Rerun implementation, where I've been finding a couple of rough spots. I'd also be happy to get on a call (or use the DataFusion weekly community meeting, Wed 11am EST) to discuss.
-
I've recently been thinking about something that parallels this: I'd like to separate the
The advantages of this would be:
Do you think it would be appropriate to combine (2) with your current idea of IMO, I think it would be nice to have the
-
I think views and UDFs are even a separate discussion. Compute functionality (e.g. UDFs) for multimedia types should probably be in standalone projects that only depend on Arrow. They can have a Views are pretty high up the stack and I'm not sure I understand enough yet to really define anything.
-
Recently there has been a lot of development and discussion around adding UDFs, UDTFs, and views to work with Lance datasets for SQL usage. And when we say SQL, we basically mean DataFusion SQL exposed through the Lance interface.
However, doing this at the dataset level is problematic. It is hard to get the ideal SQL behavior because a single dataset has no visibility into what is going on with other datasets. That raises questions about when to register a table, what name to register it under, how to avoid duplicates, how to register a UDTF for all tables, how to create views with deterministic table references, etc.
We have discussed this a few times:
Proposal: a lot of these discussions come from the fact that people want to use Lance with a full SQL experience, and DataFusion is in the best position to be the de facto SQL engine for Lance. In that case, I think we should properly create a lance-datafusion subproject that implements the CatalogProvider in DataFusion using the Lance namespace, just like what we have done in Spark and Ray. What do we think?
cc @wjones127 @timsaucer @westonpace @wojiaodoubao @yanghua @jaystarshot
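To make the catalog-level idea a bit more concrete, here is a rough conceptual sketch in plain Python. It does not use the DataFusion or Lance APIs: `LanceTableRef`, `NamespaceSchema`, `NamespaceCatalog`, and the example paths are all hypothetical names made up for illustration. It only shows why a catalog that sees every table can settle naming, duplicate detection, and deterministic references in one place, which a single dataset cannot do on its own.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LanceTableRef:
    """Hypothetical handle for a Lance dataset; it only stores a location."""
    uri: str


@dataclass
class NamespaceSchema:
    """Plays the role of a schema: a mapping of table name -> dataset handle."""
    tables: Dict[str, LanceTableRef] = field(default_factory=dict)

    def table_names(self) -> List[str]:
        return sorted(self.tables)

    def table(self, name: str) -> Optional[LanceTableRef]:
        return self.tables.get(name)


@dataclass
class NamespaceCatalog:
    """Plays the role of a catalog: it owns all schemas, so it sees all tables."""
    schemas: Dict[str, NamespaceSchema] = field(default_factory=dict)

    def register_table(self, schema: str, name: str, uri: str) -> None:
        # Registration happens once, at the catalog level, so duplicates
        # are caught here instead of by each dataset separately.
        target = self.schemas.setdefault(schema, NamespaceSchema())
        if name in target.tables:
            raise ValueError(f"table {schema}.{name} is already registered")
        target.tables[name] = LanceTableRef(uri)

    def resolve(self, qualified_name: str) -> LanceTableRef:
        # A "schema.table" reference always resolves the same way, which is
        # what a view definition would need in order to be deterministic.
        schema_name, _, table_name = qualified_name.partition(".")
        schema = self.schemas.get(schema_name)
        ref = schema.table(table_name) if schema else None
        if ref is None:
            raise KeyError(qualified_name)
        return ref


catalog = NamespaceCatalog()
catalog.register_table("media", "images", "/data/media/images.lance")
catalog.register_table("media", "captions", "/data/media/captions.lance")
```

In a real lance-datafusion subproject the lookups would presumably be backed by the Lance namespace rather than an in-memory dict, and the classes would implement DataFusion's provider traits. That part is left out here on purpose.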