Generative AI in the Real World: Shreya Shankar on AI for Enterprise Data Processing – O'Reilly

Companies have a lot of data, but most of that data is unstructured text: reports, catalogs, emails, notes, and much more. Without structure, business analysts can't make sense of the data; there's value in the data, but it can't be put to use. AI can be a tool for finding and extracting the structure that's hidden in text. In this episode, Ben and Shreya talk about a new generation of tooling that brings AI to enterprise data processing.

Check out other episodes of this podcast on the O'Reilly learning platform.

About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone's agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

Points of Interest

  • 0:00: Introduction to Shreya Shankar.
  • 0:18: One of the themes of your work is a specific form of data processing. Before we go into tools, what's the problem you're trying to address?
  • 0:52: For decades, organizations have been struggling to make sense of unstructured data. There's a huge amount of text that people have to make sense of, and we didn't have the technology to do it until LLMs came around.
  • 1:38: I've spent the last couple of years building a processing framework for people to manipulate unstructured data with LLMs. How do we extract semantic data?
  • 1:55: The prior art would be using NLP libraries and doing bespoke tasks?
  • 2:12: We've seen two flavors of approach: bespoke code and crowdsourcing. People still do both. But now LLMs can simplify the process.
  • 2:45: The typical task is "I have a large collection of unstructured text and I want to extract as much structure as possible." An extreme would be a knowledge graph; in the middle would be the things that NLP people do. Your data pipelines are designed to do this using LLMs.
  • 3:22: Broadly, the tasks are thematic extraction: I want to extract themes from documents. You can program LLMs to find themes. You need some user steering and guidance about what a theme is, then use the LLM for grouping.
  • 4:04: One of the tools you built is DocETL. What's the typical workflow?
  • 4:19: The idea is to write MapReduce pipelines, where map extracts insights and group does aggregation. Doing this with LLMs means that the map is described by an LLM prompt. Maybe the prompt is "Extract all the pain points and any relevant quotes." Then you can imagine flattening this across all the documents, grouping them by pain point, and another LLM can do the summary to produce a report. DocETL exposes these data processing primitives and orchestrates them to scale up and across task complexity. (A minimal sketch of this pattern appears after this list.)
  • 5:52: What if you want to extract 50 things from a map operation? You shouldn't ask an LLM to do 50 things at once. You should group them and decompose them into subtasks. DocETL does some optimizations to do this. (See the decomposition sketch after this list.)
  • 6:18: The user could be a noncoder and might not be working on the entire pipeline.
  • 7:00: People do that a lot; they might just write a single map operation.
  • 7:16: But the end user you have in mind doesn't even know the terms "map" and "filter."
  • 7:22: That's the goal. Right now, people still have to learn data processing primitives.
  • 7:49: These LLMs are probabilistic; do you also set the expectation with users that they might get different results every time they run the pipeline?
  • 8:16: There are two different types of tasks. One is where you want the LLM to be accurate and there's an exact ground truth (for example, entity extraction). The other type is where you want to offload a creative process to the LLM (for example, "Tell me what's interesting in this data"). People will run that until there are no new insights to be gleaned. When is nondeterminism a problem? How do you engineer systems around it?
  • 9:56: You might also have a data engineering team that uses this and turns PDF files into something like a data warehouse that people can query. In this setting, are you familiar with lakehouse architecture and the notion of the medallion architecture?
  • 10:49: People actually use DocETL to create a table out of PDFs and put it in a relational database. That's the best way to think about how to move forward in the enterprise setting. I've also seen people using these tables in RAG or downstream LLM applications.
  • 11:31: I realize that this is a fast-moving space. To what extent can DocETL leverage other libraries like BAML? It's a domain-specific language that turns prompts into instructions. And there are other things on the data extraction side (for example, getting data from images in PDF files). To what extent can DocETL leverage the best of breed?
  • 12:54: We have plug-ins, and operators as plug-ins. Users can write their own; community members have contributed different plug-ins. We're thinking about native integrations with RAG.
  • 14:01: What are the most common data types?
  • 14:11: PDFs (some people will run OCR on PDFs, so unstructured text), transcripts, JSON-formatted logs. The nice thing is that so much data can be represented as a string.
  • 14:36: So your starting point is strings. So I can have MCP servers that pull data from Confluence and wikis, and you start from there.
  • 14:53: Our datasets are in JSON or CSV format. So imagine a CSV with one or two columns.
  • 15:03: Do you provide users of this tool with diagnostics or evaluation tools?
  • 15:14: This brings me to DocWrangler, which is a specialized IDE for writing DocETL pipelines. You get more observability; it's easier to engineer prompts; we have automatic prompt writing and LLMs that edit prompts. It gets you from zero to a starting pipeline.
  • 16:00: People are now using things like expectations and assertions. Is there an equivalent?
  • 16:13: We have guardrails on LLM-powered operations: We can check for hallucination; we can use LLMs as guardrails or as an LLM-as-judge; we can loop on an operation if it doesn't pass; we can also write pipelines that query an external data source and drop documents that don't meet criteria. (See the guardrail sketch after this list.)
  • 17:16: A separate thing we're figuring out is how to do this in teams.
  • 17:39: If the goal is to onboard noncoders, a lot of this work is going to be on the UX side.
  • 18:03: The DocWrangler project is all about finding the right UX. How do we leverage AI assistance as much as possible? The semantic data processing ecosystem is super new. The user has an intent that's hard to express. There's the semantic pipeline. And there's the actual data: the documents. When you think about building UX, you have to optimize the interaction between all three. Where does AI help? Where does AI not help?
  • 20:06: Everything that we've discussed is in the context of a fast-moving foundation model world. Now we have reasoning models. How do you feel about reasoning models in the context of what you're doing? They're expensive and slower. What advice do you give users of DocETL?
  • 21:03: Reasoning is most helpful in bridging the understanding gap between the user and the initial pipeline they write. A reasoning model can go from a crudely specified intent to a well-specified pipeline. The o1 model is better at this than GPT-4o. But if you already have a well-defined prompt, a reasoning model doesn't give you much leverage.
  • 23:10: I'd imagine that supervised fine-tuning would pay off for a pipeline. Are people using DocETL to generate data for fine-tuning LLMs?
  • 23:36: I haven't seen people doing this, but I'm sure they are. People are running DocETL pipelines on their own LLMs, but I'm not sure how they fine-tune them.
  • 24:09: I always use two or three LLMs and try to get a consensus. The LLM depends on your use case and your data, right? (See the consensus sketch after this list.)
  • 24:46: Absolutely. In our user studies, people say the same thing: The standard pipeline is to use OpenAI or Gemini for extraction, and Claude for content generation and aggregation. Some are using DeepSeek, but we ran the pilot before DeepSeek became popular. I'm sure its use has risen.
  • 25:33: I think you boxed yourself in with the name DocETL; we're seeing multimodal models. As models become more capable, you'll move with the capabilities of the foundation model.
  • 26:05: When we first launched the project, a bunch of people said we should do multimodal: images, audio. But those questions just vanished. More people said, "I just have text problems." We're in the gritty phases of real enterprise use cases, which are text wrangling problems.
  • 26:50: The default is to use text, but there's a lot of nuance in those other modalities, especially video. So I need to ask about related projects.
  • 27:20: I just met the aryn.ai people at a conference. We all share an interest in doing semantic data processing. Many institutions have people building this kind of system. It's interesting to see where we differ. DocETL has a single map operator; other systems have many map operators. So there are interesting implementation differences.
  • 28:58: You're in Berkeley; tell me you're using Ray.
  • 29:06: Everything runs on a single machine right now, but we will scale up with Ray. These LLMs are not cheap, though they're getting cheaper. Gemini is really cheap.
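
Code Sketches

The bullets above reference a few sketches. First, the map-reduce pattern from 4:19, shown as a minimal Python sketch. This is not DocETL's actual API; the model name, prompt wording, and the call_llm helper are assumptions for illustration, and it assumes the model returns bare JSON.

```python
# Minimal sketch of the map -> group -> reduce pattern (4:19).
# Not DocETL's API: model name, prompts, and call_llm are assumptions.
import json
from collections import defaultdict

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def map_extract(doc: str) -> list[dict]:
    """Map step: one LLM prompt per document extracts pain points and quotes."""
    prompt = (
        "Extract all the pain points and any relevant quotes from this document. "
        'Reply with only a JSON list of {"pain_point": ..., "quote": ...} objects.\n\n'
        + doc
    )
    return json.loads(call_llm(prompt))  # assumes the model returns bare JSON

def reduce_summarize(pain_point: str, quotes: list[str]) -> str:
    """Reduce step: another LLM summarizes each group into report text."""
    prompt = f"Summarize these quotes about '{pain_point}':\n" + "\n".join(quotes)
    return call_llm(prompt)

def run_pipeline(documents: list[str]) -> dict[str, str]:
    """Map over documents, flatten, group by pain point, then reduce."""
    groups: dict[str, list[str]] = defaultdict(list)
    for doc in documents:
        for item in map_extract(doc):
            groups[item["pain_point"]].append(item["quote"])
    return {pp: reduce_summarize(pp, qs) for pp, qs in groups.items()}
```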
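
Next, the decomposition idea from 5:52: rather than asking one prompt for 50 attributes, split the schema into smaller subtasks and merge the results. DocETL performs this kind of optimization automatically; the chunk size, field names, and prompt here are illustrative, and call_llm is the helper from the previous sketch.

```python
# Sketch of decomposing a wide extraction into subtasks (5:52).
# Chunk size, field names, and prompt are illustrative assumptions;
# call_llm is the helper from the pipeline sketch above.
import json

FIELDS = [f"field_{i}" for i in range(50)]  # stand-in for 50 real attributes

def extract_fields(doc: str, fields: list[str], chunk_size: int = 10) -> dict:
    result: dict = {}
    for i in range(0, len(fields), chunk_size):
        subset = fields[i : i + chunk_size]
        prompt = (
            f"From the document below, extract these attributes: {', '.join(subset)}. "
            "Reply with only a JSON object containing exactly those keys.\n\n" + doc
        )
        result.update(json.loads(call_llm(prompt)))
    return result
```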
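
The guardrails described at 16:13 can be approximated with an LLM-as-judge that validates each output, plus a retry loop that drops documents that never pass. The judge prompt and retry policy below are assumptions, not DocETL internals; call_llm is again the helper from the first sketch.

```python
# Sketch of an LLM-as-judge guardrail with a retry loop (16:13).
# The judge prompt and retry policy are assumptions, not DocETL internals.
def passes_judge(doc: str, output: str) -> bool:
    """Hallucination check: a second LLM call validates the extraction."""
    verdict = call_llm(
        "Answer only YES or NO. Is every claim in the extraction below "
        f"supported by the document?\n\nDOCUMENT:\n{doc}\n\nEXTRACTION:\n{output}"
    )
    return verdict.strip().upper().startswith("YES")

def guarded_map(doc: str, prompt: str, max_retries: int = 3) -> str | None:
    """Loop on the operation until it passes, or drop the document."""
    for _ in range(max_retries):
        output = call_llm(prompt + "\n\n" + doc)
        if passes_judge(doc, output):
            return output
    return None  # document didn't meet the criteria
```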
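
Finally, a sketch of the multi-LLM consensus practice mentioned at 24:09: ask two or three different models the same question and keep the majority answer. This uses litellm as one way to reach multiple providers; the model identifiers are illustrative and require the matching API keys.

```python
# Sketch of multi-LLM consensus (24:09): query several models, keep the
# majority answer. Uses litellm as one way to reach multiple providers;
# the model identifiers are illustrative and need the matching API keys.
from collections import Counter

from litellm import completion

MODELS = ["gpt-4o-mini", "gemini/gemini-1.5-flash", "claude-3-5-haiku-20241022"]

def consensus(prompt: str) -> str:
    answers = []
    for model in MODELS:
        resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
        answers.append(resp.choices[0].message.content.strip())
    # Majority vote; if all answers differ, the first model's answer wins.
    return Counter(answers).most_common(1)[0][0]
```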
