
# extractStructuredData

<Tip>This function uses AI and incurs costs.</Tip>

<Info>This function has multiple overloads</Info>

<Tabs>
  <Tab title="Extract From Page or Locator">
    Extracts structured data from web pages using AI-powered content analysis.

    This function supports several extraction strategies: HTML parsing, image analysis, and Markdown conversion.
    It can extract from an entire page or from a specific element, with built-in caching and retry mechanisms.
    A separate overload accepts text or image content directly instead of a page source.

    ```typescript theme={null}
    export declare function extractStructuredData(options: {
      source: Page | Locator;
      dataSchema: JsonSchema | z.ZodSchema;
      prompt?: string;
      strategy?: "IMAGE" | "MARKDOWN" | "HTML";
      enableDomMatching?: boolean;
      enableCache?: boolean;
      maxRetries?: number;
      model?: string;
      apiKey?: string;
    }): Promise<any>;
    ```

    Extract data from web pages or specific elements using HTML, IMAGE, or MARKDOWN strategies with DOM matching support.

    ## Features and limitations

    **Features:**

    * **Smart caching:** Hashes inputs and uses [KV Cache](https://docs.intunedhq.com/docs/01-learn/recipes/kv-cache) for persistent storage
    * **DOM matching:** With `enableDomMatching: true`, extracted values are mapped to their DOM elements, so cached results can be reused after minor DOM changes (requires `enableCache`)
    * **Multiple strategies:** HTML, IMAGE, or MARKDOWN based on content type
    * **Flexible models:** Use any up-to-date model from Anthropic, OpenAI, or Google based on your needs

    **Limitations:**

    * **Model variability:** Quality varies by model—experiment to find the best fit
    * **DOM complexity:** Dynamic structures can affect caching and matching
    * **IMAGE strategy constraints:** Can't capture truncated or off-screen content
    * **Schema design:** Complex schemas may reduce accuracy
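
    To make the caching feature above more concrete, here is a minimal sketch of input-hash caching. It is an illustration only: `fnv1a`, `cacheKey`, `cachedExtract`, and the in-memory `Map` are hypothetical stand-ins, and the hash function and storage that `extractStructuredData` actually uses are not documented on this page.

    ```typescript
    // Hypothetical sketch: none of these names are part of @intuned/browser.
    interface CacheKeyInput {
      content: string; // e.g. the serialized HTML or Markdown of the source
      dataSchema: unknown;
      prompt?: string;
      strategy: "IMAGE" | "MARKDOWN" | "HTML";
      model: string;
    }

    // FNV-1a style 32-bit hash over UTF-16 code units, chosen for brevity.
    // Assumption: the library may use a different hash entirely.
    function fnv1a(text: string): string {
      let hash = 0x811c9dc5;
      for (let i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
      }
      return ("00000000" + hash.toString(16)).slice(-8);
    }

    // Serialize the inputs in a fixed order so equal inputs produce equal keys.
    function cacheKey(input: CacheKeyInput): string {
      const serialized = JSON.stringify([
        input.content,
        input.dataSchema,
        input.prompt ?? null,
        input.strategy,
        input.model,
      ]);
      return fnv1a(serialized);
    }

    // In-memory Map standing in for a persistent key-value store.
    const cache = new Map<string, unknown>();

    // Runs `extract` only when no result is stored for the same inputs.
    async function cachedExtract(
      input: CacheKeyInput,
      extract: () => Promise<unknown>
    ): Promise<unknown> {
      const key = cacheKey(input);
      if (cache.has(key)) {
        return cache.get(key);
      }
      const result = await extract();
      cache.set(key, result);
      return result;
    }
    ```

    In this sketch, changing any input (the content, schema, prompt, strategy, or model) produces a different key, so the AI call would run again.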

    ## Examples

    <CodeGroup>
      ```typescript Extract book details theme={null}
      import { extractStructuredData } from "@intuned/browser/ai";
      import { BrowserContext, Page } from "playwright";

      interface Params {}

      export default async function handler(
        params: Params,
        page: Page,
        context: BrowserContext
      ) {
        await page.goto(
          "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
        );
        // This will extract the book details from the page, using the HTML strategy with the gpt-4o model.
        // The dataSchema is a JSON Schema object that defines the structure of the data to extract.
        // You can also use a Zod schema instead of a JSON Schema object.
        const book = await extractStructuredData({
          source: page,
          strategy: "HTML", // The HTML strategy is the default strategy and will be used if no strategy is provided.
          model: "gpt-4o",
          dataSchema: {
            type: "object",
            properties: {
              name: { type: "string" },
              price: { type: "string" },
              description: { type: "string" },
              inStock: { type: "string" },
              rating: { type: "string" },
            },
            required: ["name", "price"],
          },
          prompt: "Extract book details from this page",
          enableCache: true, // The first call uses AI; later calls return the cached result as long as the DOM is unchanged.
          enableDomMatching: true, // Maps the results to their DOM elements. Requires enableCache to be true.
          maxRetries: 3,
        });

        console.log(`Found book: ${book.name} - ${book.price}`);
      }
      ```

      ```typescript Extract all book listings theme={null}
      import { extractStructuredData } from "@intuned/browser/ai";
      import { BrowserContext, Page } from "playwright";

      interface Params {}

      export default async function handler(
        params: Params,
        page: Page,
        context: BrowserContext
      ) {
        await page.goto("https://books.toscrape.com/");
        // This will extract all book listings from the page, using the HTML strategy with the claude-sonnet-4-5 model.
        // The dataSchema is a JSON Schema object that defines the structure of the data to extract.
        // You can also use a Zod schema instead of a JSON Schema object.
        const books = await extractStructuredData({
          source: page,
          strategy: "HTML",
          model: "claude-sonnet-4-5",
          dataSchema: {
            type: "object",
            properties: {
              products: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    title: { type: "string" },
                    price: { type: "string" },
                    availability: { type: "string" },
                  },
                },
              },
            },
          },
          prompt: "Extract all book listings",
          enableCache: false, // Caching is disabled here, so the data is extracted on every call.
        });

        for (const book of books.products) {
          console.log(`${book.title}: ${book.price}`);
        }
      }
      ```
    </CodeGroup>

    ## Arguments

    <ResponseField name="options" type="Object" required>
      Configuration object containing extraction parameters

      <Expandable title="properties" defaultOpen>
        <ResponseField name="options.source" type="Page|Locator" required>
          Playwright Page object to extract data from the entire page or Locator object
          to extract data from a specific element.
        </ResponseField>

        <ResponseField name="options.dataSchema" type="JsonSchema|z.ZodSchema" required>
          Schema defining the structure of the data to extract. Can be a JSON Schema
          object or a Zod schema.
        </ResponseField>

        <ResponseField name="options.prompt" type="string">
          Optional prompt to guide the extraction process and provide more context.
          Defaults to undefined.
        </ResponseField>

        <ResponseField name="options.strategy" type="'HTML'|'IMAGE'|'MARKDOWN'">
          Type of extraction strategy:

          * **"HTML"** (default): Best for text-heavy pages with structured content
          * **"IMAGE"**: Best for visual content, charts, or complex layouts
          * **"MARKDOWN"**: Best for article-style content with semantic structure
        </ResponseField>

        <ResponseField name="options.enableDomMatching" type="boolean">
          Whether to map extraction results to their corresponding DOM elements.
          Requires `enableCache` to be true. Matched results are cached, so later
          extractions can reuse the cached data even after minor DOM changes, which
          improves performance. Defaults to false.
        </ResponseField>

        <ResponseField name="options.enableCache" type="boolean">
          Whether to enable caching of extraction results. Defaults to true.
        </ResponseField>

        <ResponseField name="options.maxRetries" type="number">
          Maximum number of retry attempts on failures. Failures can be validation
          errors, API errors, output errors, etc. Defaults to 3.
        </ResponseField>

        <ResponseField name="options.model" type="string">
          AI model to use for extraction. Defaults to "claude-haiku-4-5-20251001".
        </ResponseField>

        <ResponseField name="options.apiKey" type="string">
          Optional API key for AI extraction (if provided, will not be billed to your
          account). Defaults to undefined.
        </ResponseField>
      </Expandable>
    </ResponseField>
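
    The defaults listed above can be summarized in a short sketch. `resolveSettings` and `ExtractionSettings` are hypothetical names used only for illustration and are not part of `@intuned/browser`. Throwing when DOM matching is requested without caching is also an assumption; this page only states that caching must be enabled for DOM matching to work.

    ```typescript
    type Strategy = "IMAGE" | "MARKDOWN" | "HTML";

    // Subset of the documented options; `source`, `dataSchema`, `prompt`,
    // and `apiKey` are omitted for brevity.
    interface ExtractionSettings {
      strategy?: Strategy;
      enableDomMatching?: boolean;
      enableCache?: boolean;
      maxRetries?: number;
      model?: string;
    }

    // Hypothetical helper that fills in the defaults documented above.
    function resolveSettings(
      settings: ExtractionSettings
    ): Required<ExtractionSettings> {
      const resolved: Required<ExtractionSettings> = {
        strategy: settings.strategy ?? "HTML",
        enableDomMatching: settings.enableDomMatching ?? false,
        enableCache: settings.enableCache ?? true,
        maxRetries: settings.maxRetries ?? 3,
        model: settings.model ?? "claude-haiku-4-5-20251001",
      };
      // Assumption: fail fast on the one documented dependency between options.
      if (resolved.enableDomMatching && !resolved.enableCache) {
        throw new Error("enableDomMatching requires enableCache to be true");
      }
      return resolved;
    }
    ```

    A check like this can catch an invalid option combination before any AI call is made.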
  </Tab>

  <Tab title="Extract From Content">
    Extracts structured data from text or image content using AI-powered content analysis.

    This overload takes content items (text, image buffers, or image URLs) instead of a page source.
    It accepts a single item or an array of items, with built-in caching and retry mechanisms.

    ```typescript theme={null}
    export declare function extractStructuredData(options: {
      content: ContentItem[] | ContentItem;
      dataSchema: JsonSchema | z.ZodSchema;
      prompt?: string;
      maxRetries?: number;
      enableCache?: boolean;
      model?: string;
      apiKey?: string;
    }): Promise<any>;
    ```

    Extract data from text, image buffers, or image URLs without requiring a page source.

    ## Features and limitations

    **Features:**

    * **Smart caching:** Hashes content and uses [KV Cache](https://docs.intunedhq.com/docs/01-learn/recipes/kv-cache) for persistent storage
    * **Multiple content items:** Combine text and images (buffer or URL) in a single extraction
    * **Flexible models:** Use any up-to-date model from Anthropic, OpenAI, or Google based on your needs

    **Limitations:**

    * **Model variability:** Quality varies by model—experiment to find the best fit
    * **Schema design:** Complex schemas may reduce accuracy
    * **Content quality:** Requires meaningful, contextual content for accurate extraction—sparse or ambiguous content produces poor results

    ## Examples

    <CodeGroup>
      ```typescript Basic Text Content Extraction theme={null}
      import { extractStructuredData, TextContentItem } from "@intuned/browser/ai";
      import { BrowserContext, Page } from "playwright";

      interface Params {}

      export default async function handler(
        params: Params,
        page: Page,
        context: BrowserContext
      ) {
        // This will extract the person information from the text, using the gpt-4o model.
        const textContent: TextContentItem = {
          type: "text",
          data: "John Doe, age 30, works as a Software Engineer at Tech Corp",
        };

        const person = await extractStructuredData({
          content: textContent,
          model: "gpt-4o",
          dataSchema: {
            type: "object",
            properties: {
              name: { type: "string" },
              age: { type: "number" },
              occupation: { type: "string" },
              company: { type: "string" },
            },
            required: ["name"],
          },
          prompt: "Extract person information from the text",
        });

        console.log(`Found person: ${person.name}, ${person.age} years old`);
      }
      ```

      ```typescript List Extraction from Text Content theme={null}
      import { extractStructuredData, TextContentItem } from "@intuned/browser/ai";
      import { BrowserContext, Page } from "playwright";

      interface Params {}

      export default async function handler(
        params: Params,
        page: Page,
        context: BrowserContext
      ) {
        const textContent: TextContentItem = {
          type: "text",
          data: "iPhone 15 - $999, Samsung Galaxy - $899, Pixel 8 - $699",
        };

        const products = await extractStructuredData({
          content: textContent,
          model: "gpt-4o",
          dataSchema: {
            type: "object",
            properties: {
              products: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    name: { type: "string" },
                    price: { type: "string" },
                  },
                },
              },
            },
          },
          prompt: "Extract all products",
        });

        for (const product of products.products) {
          console.log(`${product.name}: ${product.price}`);
        }
      }
      ```
    </CodeGroup>

    ## Arguments

    <ResponseField name="options" type="Object" required>
      Configuration object containing extraction parameters

      <Expandable title="properties" defaultOpen>
        <ResponseField name="options.content" type="Array<ContentItem>|ContentItem" required>
          Content to extract data from. Can be a single
          [ContentItem](../type-references/ContentItem) or an array of them.
        </ResponseField>

        <ResponseField name="options.dataSchema" type="JsonSchema|z.ZodSchema" required>
          Schema defining the expected structure of the extracted data. Can be a JSON
          Schema object or a Zod schema.
        </ResponseField>

        <ResponseField name="options.prompt" type="string">
          Optional prompt to guide the extraction process and provide more context.
          Defaults to undefined.
        </ResponseField>

        <ResponseField name="options.maxRetries" type="number">
          Maximum number of retry attempts on failures. Failures can be validation
          errors, API errors, output errors, etc. Defaults to 3.
        </ResponseField>

        <ResponseField name="options.enableCache" type="boolean">
          Whether to enable caching of the extracted data. Defaults to true.
        </ResponseField>

        <ResponseField name="options.model" type="string">
          AI model to use for extraction. Defaults to "claude-haiku-4-5-20251001".
        </ResponseField>

        <ResponseField name="options.apiKey" type="string">
          Optional API key for AI extraction (if provided, will not be billed to your
          account). Defaults to undefined.
        </ResponseField>
      </Expandable>
    </ResponseField>
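
    Because `content` accepts an array, several pieces of text can be passed as separate items. The sketch below builds such an array. `TextItem` is a local mirror of the `{ type: "text", data }` shape used in the examples above, and `toTextItems` is a hypothetical helper, not part of `@intuned/browser`. Image items are left out because their exact shape is not shown on this page.

    ```typescript
    // Local mirror of the text item shape used in the examples above.
    interface TextItem {
      type: "text";
      data: string;
    }

    // Hypothetical helper: builds one text item per non-empty string.
    function toTextItems(chunks: string[]): TextItem[] {
      return chunks
        .map((chunk) => chunk.trim())
        .filter((chunk) => chunk.length > 0)
        .map((chunk) => ({ type: "text" as const, data: chunk }));
    }

    const items = toTextItems([
      "iPhone 15 - $999",
      "   ", // whitespace-only entries are dropped
      "Samsung Galaxy - $899",
    ]);
    // `items` could then be passed as `content` to extractStructuredData.
    ```

    Dropping empty entries first avoids sending content that gives the model nothing to work with.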
  </Tab>
</Tabs>

## Returns: `Promise<any>`

The extracted structured data conforming to the provided schema.
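
Because the return type is `Promise<any>`, the compiler does not check the shape of the result. One option is to narrow the value with a runtime type guard before using it. The following is a minimal sketch: `Book` and `isBook` are hypothetical names that mirror the book-details schema from the first example, and `raw` holds a made-up sample value standing in for the resolved result.

```typescript
// Shape described by the book-details schema: name and price are required.
interface Book {
  name: string;
  price: string;
  description?: string;
  inStock?: string;
  rating?: string;
}

// Hypothetical type guard: checks required and optional string fields.
function isBook(value: unknown): value is Book {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const record = value as Record<string, unknown>;
  const optionalKeys = ["description", "inStock", "rating"];
  return (
    typeof record["name"] === "string" &&
    typeof record["price"] === "string" &&
    optionalKeys.every(
      (key) => record[key] === undefined || typeof record[key] === "string"
    )
  );
}

// `raw` stands in for the value returned by extractStructuredData.
const raw: unknown = { name: "Sample Book", price: "£10.00" };
if (isBook(raw)) {
  // Inside this branch, `raw` is typed as Book.
  console.log(`Found book: ${raw.name} - ${raw.price}`);
}
```

When the schema is defined with Zod, the same narrowing could be done by parsing the result with that schema instead of writing a guard by hand.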
