Sunday, June 15, 2025

Denny’s top session picks for Data + AI Summit 2025

Data + AI Summit 2025 is just a few weeks away! This year, we’re offering our largest selection of sessions ever, with more than 700 to choose from. Register to join us in person in San Francisco or virtually.

With a career rooted in open source, I’ve seen firsthand how open technologies and formats are increasingly central to enterprise strategy. As a long-time contributor to Apache Spark™ and MLflow, a maintainer and committer for Delta Lake and Unity Catalog, and most recently a contributor to Apache Iceberg™, I’ve had the privilege of working alongside some of the brightest minds in the industry.

For this year’s sessions, I’m focusing on the intersection of open source and AI, with a particular interest in multimodal AI. Specifically, how open table formats like Delta Lake and Iceberg, combined with unified governance through Unity Catalog, are powering the next wave of real-time, trusted AI and analytics.

My Top Picks

The upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

Apache Spark™ has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both large cluster deployments and local laptop development (see the sketch after the speaker list). Hear from and ask questions of:

  • Xiao Li, an Engineering Director at Databricks and an Apache Spark committer and PMC member.
  • DB Tsai, an engineering leader on the Databricks Spark team and an Apache Spark Project Management Committee (PMC) member and committer.
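
The session abstract doesn’t include code, but one concrete way to picture the “laptop to cluster” goal is Spark Connect, which already lets the same PySpark code target either a local session or a remote cluster. A minimal sketch, with a placeholder endpoint:

```python
# Local-versus-cluster development with the same PySpark code.
# The remote endpoint below is a placeholder; substitute your own cluster.
from pyspark.sql import SparkSession

# Local development: an in-process Spark session on your laptop.
spark = SparkSession.builder.master("local[*]").appName("laptop-dev").getOrCreate()

# Cluster deployment: identical DataFrame code, pointed at a remote
# Spark Connect endpoint instead of a local master.
# spark = SparkSession.builder.remote("sc://spark-cluster.example.com:15002").getOrCreate()

df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
df.show()
spark.stop()
```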
     

Iceberg Geo Type: Transforming Geospatial Data Management at Scale

Geospatial is becoming more and more important for lakehouse formats. Learn from Jia Yu, Co-founder and Chief Architect of Wherobots Inc., and Szehon Ho, Software Engineer at Databricks, about the latest and greatest around the geospatial data types in Apache Iceberg™.
 

Let’s Save Tons of Money with Cloud-native Data Ingestion!

R. Tyler Croy from Scribd, Delta Lake maintainer and shepherd of delta-rs since its inception, will dive into the cloud-native architecture Scribd has adopted to ingest data from AWS Aurora, SQS, Kinesis Data Firehose, and more. By using off-the-shelf open source tools like kafka-delta-ingest, oxbow, and Airbyte, Scribd has redefined its ingestion architecture to be more event-driven, reliable, and most importantly: cheaper. No jobs needed!
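
For a feel of what Spark-free ingestion looks like, here is a minimal sketch using the deltalake Python package, the bindings for delta-rs, the same core library beneath kafka-delta-ingest and oxbow. The bucket path and event records are hypothetical, and storage credentials are assumed to come from the environment:

```python
# Append-only Delta ingestion without a cluster or JVM.
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# A hypothetical micro-batch of events, e.g. drained from a queue.
events = pa.table({
    "event_id": [1, 2, 3],
    "event_type": ["view", "click", "view"],
})

# Append the batch; the table is created on the first write.
write_deltalake("s3://my-bucket/events", events, mode="append")

# Read it back, again with no Spark job involved.
dt = DeltaTable("s3://my-bucket/events")
print(dt.to_pyarrow_table().num_rows)
```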

This session will dig into the value props of a lakehouse architecture and cost-efficiencies across the Rust/Arrow/Python ecosystems. A few recommended videos to watch beforehand:

 

Daft and Unity Catalog: a multimodal/AI-native lakehouse

Multimodal AI will fundamentally change the landscape, as data is more than just tables. Workflows now often involve documents, images, audio, video, embeddings, URLs, and more.

This session from Jay Chia, Co-founder of Eventual, will show how Daft, a popular multimodal data processing framework, combined with Unity Catalog can help unify authentication, authorization, and data lineage, providing a holistic view of governance.
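
As a taste of the integration, here is a minimal sketch of loading a Unity Catalog-governed Delta table into Daft. The endpoint, token, and table name are placeholders, and the exact surface may differ from what Jay demos:

```python
# Resolve a table through Unity Catalog, then load it as a Daft DataFrame.
import daft
from daft.unity_catalog import UnityCatalog

unity = UnityCatalog(
    endpoint="https://<workspace>.cloud.databricks.com",  # placeholder
    token="<databricks-token>",                           # placeholder
)

# Access control and lineage are handled by the catalog; Daft just
# receives the resolved table and credentials.
table = unity.load_table("my_catalog.my_schema.my_table")
df = daft.read_deltalake(table)
df.show()
```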
 

Bridging Big Data and AI: Empowering PySpark with Lance Format for Multi-Modal AI Data Pipelines

PySpark has long been a cornerstone of big data processing, but the rise of multimodal AI and vector search introduces challenges beyond its traditional capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format.

This session will dive into how the Lance format works and why it is a critical component for multimodal AI data pipelines. Allison Wang, Apache Spark™ committer, and Li Qiu, LanceDB Database Engineer and Alluxio PMC member, will explore how combining Apache Spark (PySpark) and LanceDB allows you to advance multi-modal AI data pipelines.
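
To see why Lance fits this niche, here is a minimal sketch of the format on its own, separate from the Spark integration the session covers; the captions and vectors are made up:

```python
# Lance stores scalar columns and vector embeddings side by side and
# supports nearest-neighbor queries natively.
import lance
import pyarrow as pa

# A tiny table mixing text with fixed-size-list vector embeddings.
table = pa.table({
    "id": [1, 2, 3],
    "caption": ["a cat", "a dog", "a boat"],
    "vector": pa.array(
        [[0.1, 0.2], [0.9, 0.1], [0.4, 0.7]],
        type=pa.list_(pa.float32(), 2),
    ),
})
lance.write_dataset(table, "captions.lance")

# Nearest-neighbor search over the vector column.
ds = lance.dataset("captions.lance")
hits = ds.to_table(nearest={"column": "vector", "q": [0.1, 0.25], "k": 2})
print(hits.to_pydict()["caption"])
```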
 

Streamlining DSPy Development: Track, Debug and Deploy with MLflow

Chen Qian, Senior Software Engineer at Databricks, will show how to integrate MLflow with DSPy to bring full observability to your DSPy development.

You’ll get to see how to track DSPy module calls, evaluations, and optimizers using MLflow’s tracing and autologging capabilities. Combining these two tools makes it easier to debug, iterate on, and understand your DSPy workflows, then deploy your DSPy program end-to-end.
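
A minimal sketch of what that autologging setup looks like; the model name and question are placeholders, and an API key for whichever LM you configure is assumed:

```python
# Trace DSPy module calls in MLflow via autologging.
import dspy
import mlflow

mlflow.dspy.autolog()               # capture DSPy calls as MLflow traces
mlflow.set_experiment("dspy-dev")

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

# A simple chain-of-thought module; each invocation is logged as a trace
# you can inspect in the MLflow UI.
qa = dspy.ChainOfThought("question -> answer")
result = qa(question="What file format does Delta Lake store data in?")
print(result.answer)
```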
 

From Code Completion to Autonomous Software Engineering Agents

Kilian Lieret, Research Software Engineer at Princeton University, was recently a guest on the Data Brew videocast for a fascinating discussion on new tools for evaluating and improving AI in software engineering.

This session is an extension of that conversation, where Kilian will dig into SWE-bench (a benchmarking tool) and SWE-agent (an agent framework), the current frontier of agentic AI for developers, and how to experiment with AI agents.
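
If you want to poke at SWE-bench before the session, the dataset is published on Hugging Face; a minimal sketch (the Lite split is the smaller subset):

```python
# Browse SWE-bench Lite: real GitHub issues paired with the gold patches
# that fixed them, used to benchmark software-engineering agents.
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

example = swebench[0]
print(example["instance_id"])             # e.g. repo + issue identifier
print(example["problem_statement"][:300]) # the issue text the agent sees
```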
 

Composing high-accuracy AI systems with SLMs and mini-agents

The always-amazing Sharon Zhou, CEO and Founder of Lamini, discusses how to utilize small language models (SLMs) and mini-agents to reduce hallucinations using Mixture of Memory Experts (i.e., MoME knows best)!

Find out a little bit more about MoME in this fun Data Brew by Databricks episode featuring Sharon: Mixture of Memory Experts.
 

Beyond the Tradeoff: Differential Privacy in Tabular Data Synthesis

Differential privacy is a critical tool for providing mathematical guarantees around protecting the privacy of the individuals behind the data. This talk by Lipika Ramaswamy of Gretel.ai (now part of NVIDIA) explores the use of Gretel Navigator to generate differentially private synthetic data that maintains high fidelity to the source data and high utility on downstream tasks across heterogeneous datasets.
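
For context, the guarantee in question is the standard definition of (ε, δ)-differential privacy: a randomized mechanism M satisfies it when, for every pair of neighboring datasets (differing in one individual) and every set of outputs S,

```latex
% (epsilon, delta)-differential privacy
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

Smaller ε and δ mean any single individual’s presence has a provably bounded effect on what the mechanism outputs.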

Some good pre-reads on the topic:

Building Knowledge Agents to Automate Document Workflows

One of the biggest promises of LLM agents is automating all knowledge work over unstructured data; we call these “knowledge agents.” Jerry Liu, Founder of LlamaIndex, dives into how to create knowledge agents to automate document workflows. While this can sometimes be complex to implement, Jerry showcases how to make it a simplified flow for a common enterprise process.
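
As a rough illustration (not Jerry’s implementation), here is a minimal document-agent sketch with LlamaIndex; the directory, tool name, and agent style are assumptions, and agent APIs vary across LlamaIndex versions:

```python
# A document agent: index a folder of files, expose the index as a tool,
# and let a ReAct-style agent decide when to query it.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# Index the documents (hypothetical local folder).
docs = SimpleDirectoryReader("./contracts").load_data()
index = VectorStoreIndex.from_documents(docs)

# Wrap the query engine so the agent can call it as a tool.
tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="contract_search",
    description="Answers questions about the contract documents.",
)

agent = ReActAgent.from_tools([tool], verbose=True)
print(agent.chat("Which contracts mention a termination clause?"))
```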
