
Understanding Etsy’s Vast Inventory with LLMs

For more than 20 years, Etsy has been the destination for human creativity online. Our marketplace is home to more than 100 million special items made, handpicked and designed by more than 5 million sellers. These items and the real people behind them are what set us apart. But while the huge variety of Etsy’s inventory is one of our greatest strengths, it also creates fundamental engineering challenges specific to our marketplace.
The challenge: Etsy’s unique inventory
With millions of creative items across thousands of categories – many of which are unique – it’s difficult to accurately capture all possible product attributes, which range from standard attributes like “color” and “material” to niche attributes like “bead hole size” and “slime additives.” The range of possible attributes and their values is so broad that it’s a challenge even to enumerate them, let alone label listings with specific attribute data. And unlike other online retailers that may also have enormous inventories, products on Etsy are listed by third-party sellers and are often handmade or customized, so we have no global SKUs (stock keeping units) and no mappings from SKUs to product attributes.
The listing below is an example of a unique item on Etsy: it has no SKU and no readily accessible product attribute information. At first glance, the item looks like a t-shirt, but it is actually a porcelain sculpture. For niche items like this, seller-provided details become especially critical.

We collect both structured and unstructured data from sellers, and they serve different roles in our marketplace.

Unstructured data comes in the form of free-text descriptions, creative titles, and listing photos. While this content is full of useful product information, it’s harder for machines to interpret consistently and quickly at scale.
Structured data – in the form of product attributes like size and color – is easy for our systems to parse. It powers the buyer experience through tools such as search filters (offered through selectors in the UI) and product-to-product comparison on characteristics of interest (material, price, etc.).
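To make the contrast concrete, here is a minimal sketch (all field names and values hypothetical) of the same listing seen as unstructured seller text versus the structured attributes that systems can filter on directly:

```python
# Unstructured: free text a seller might write. Useful information is present,
# but a machine must parse prose to find it.
unstructured = {
    "title": "Hand-thrown porcelain mug, speckled glaze",
    "description": "Each mug is wheel-thrown ... holds about 12 oz, 3.5 in tall.",
}

# Structured: machine-readable keys with typed values.
structured = {
    "material": "porcelain",
    "capacity_oz": 12.0,
    "height_in": 3.5,
}

def matches(attrs: dict, material: str) -> bool:
    """Exact-match filtering over structured attributes; no text parsing needed."""
    return attrs.get("material") == material

print(matches(structured, "porcelain"))  # True
```

Filtering over the structured form is a constant-time key lookup, which is why structured attributes can power latency-sensitive features like search filters.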

Filters can be seen on the left side after a search query

While Etsy does ask sellers to provide structured data on their listings’ attributes, most fields are not required. This reduces friction in the listing process and gives sellers the flexibility to represent their often unique items accurately.

Sellers can fill in these attributes, or leave them blank
and continue through the listing process

As a result, most sellers provide mostly or exclusively unstructured data in the form of listing titles, descriptions, and photos. Frequently, key information like product dimensions is buried in the listing description or only available in listing photos.

Example listing with dimensions in description

Example of dimensions in a photo

While our powerful search and discovery algorithms can process unstructured data such as that in descriptions and listing photos, passing in long context and images directly to search poses latency concerns. For these algorithms, every millisecond counts as they work to deliver relevant results to buyers as quickly as possible. Spending time filtering through unstructured data for every query is just not feasible.
These constraints led us to a clear conclusion: to fully unlock the potential of all inventory listed on Etsy’s site, unstructured product information needs to be distilled into structured data to power both ML models and buyer experiences.
LLMs present a new opportunity
Before the availability of scalable LLMs, we explored various ML-based solutions to this challenge. Supervised product attribute extraction models had limited efficacy; even if we could enumerate all possible product attributes and values, many of them would be so sparse that traditional classification models would struggle to capture the long tail. Sequence tagging approaches also had difficulty scaling to multiple attributes. Transformer-based question-answering models (e.g., AVEQA, MAVEQA) allowed for generalization to unseen attribute values, but still required large amounts of application-specific training data.
This is where the availability of foundational LLMs presented a transformational opportunity for Etsy. These models have a vast amount of general knowledge from pre-training, can process large context windows quickly and affordably, and can follow instructions given a small number of examples.

With a feasible and performant solution, our next focus was to build a scalable pipeline that could extract attributes across millions of listings while maintaining confidence in the LLM output. This required robust evaluation frameworks/processes that measured quality through various metrics.
Transforming & evaluating unstructured data at scale
Evaluation
When working with LLMs, one of the biggest challenges is evaluating model performance. We needed to ensure that, at scale across our 100M+ listings, the LLMs were consistently and reliably producing accurate, actionable results. To do this, we initially worked with a third-party labeling vendor to collect a large sample of human-annotated data containing attribute annotations for listings across multiple categories. We evaluated performance by comparing LLM inferences to this human-annotated dataset and calculated metrics like precision, recall, and Jaccard index. We used these ground truth metrics as a benchmark for model improvements via prompt and context engineering.
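The comparison described above can be sketched as set-based metrics over extracted attribute values (the example data below is hypothetical, not from the actual evaluation set):

```python
def attribute_metrics(predicted: set, ground_truth: set) -> dict:
    """Precision, recall, and Jaccard index over attribute-value sets."""
    tp = len(predicted & ground_truth)              # values both sources agree on
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    union = predicted | ground_truth
    jaccard = tp / len(union) if union else 1.0     # two empty sets agree trivially
    return {"precision": precision, "recall": recall, "jaccard": jaccard}

llm = {"color:blue", "material:cotton", "size:medium"}
human = {"color:blue", "material:cotton", "size:large"}
print(attribute_metrics(llm, human))
# precision = recall = 2/3; jaccard = 2/4 = 0.5
```

Tracking all three catches different failure modes: precision drops when the model hallucinates attributes, recall drops when it misses them, and Jaccard penalizes both at once.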
Unfortunately, there were several significant drawbacks to relying on human-labeled data. In many cases, we found that human annotators made mistakes, especially when annotating thousands of listings (after all, no one’s perfect). In the example below, a human annotator marked the light fixture as having a ½-inch width, while the LLM correctly extracted 5.5 inches.

Furthermore, human labeling is time-consuming and expensive. To scale attribute inference across thousands of categories, we needed an automated labeling process that did not rely exclusively on human annotation.
Instead, we’ve started using high-performance, state-of-the-art LLMs to generate ground truth labels (often called “silver labels”). Human-in-the-loop is still an essential part of this process: Etsy domain experts review silver labels and iterate on the prompt to ensure high quality results. Once we’re confident in our silver label generation, we produce a larger dataset for evaluating a more scalable LLM. The diagram below shows the updated process for model development.

Inference
The core of our LLM pipeline is context engineering. We’ve worked with partners in product, merchandising, and taxonomy to ensure that the LLM has the right context for attribute extraction, including:

Seller-provided listing data, including listing titles, descriptions, and images
Few-shot examples hand-selected by domain experts
Business logic from Etsy’s product taxonomy
Category-specific extraction rules

Each listing is represented as a JSON string of context information. This context is injected into a series of prompts that extract product attributes in parallel. LLM requests are routed through LiteLLM to different regions, increasing parallelism and removing the dependency on a single cloud location. Finally, LLM responses are parsed into Pydantic dataclasses, which provide both basic type validation and custom validation based on business logic.
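The parse-and-validate step might look like the following standard-library sketch (field names and the positive-width rule are hypothetical; the production pipeline uses Pydantic models, which layer automatic type coercion on top of checks like these):

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class ListingAttributes:
    """Validated attribute record parsed from an LLM response (hypothetical fields)."""
    color: Optional[str] = None
    width_inches: Optional[float] = None

    def __post_init__(self):
        if self.width_inches is not None:
            # Basic type validation: LLMs often return numbers as strings.
            self.width_inches = float(self.width_inches)
            # Custom business-logic validation: a dimension must be positive.
            if self.width_inches <= 0:
                raise ValueError(f"width_inches must be positive, got {self.width_inches}")

def parse_llm_response(raw: str) -> ListingAttributes:
    """Parse the model's JSON output, dropping unknown keys and rejecting bad values."""
    data = json.loads(raw)
    allowed = {"color", "width_inches"}
    return ListingAttributes(**{k: v for k, v in data.items() if k in allowed})

attrs = parse_llm_response('{"color": "sage green", "width_inches": "5.5"}')
print(attrs.width_inches)  # 5.5 (coerced from string)
```

Validation at this boundary means a malformed model response is rejected as a single failed record rather than propagating bad data downstream.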

After this process of inference completes, a post-processing job formats the validated, structured outputs. The data is then exported to filestores, database tables, and our search platform for consumption by partner teams.
Monitoring
Beyond the challenges of evaluating LLM output, inference itself may fail for many reasons: code bugs, permissions issues, transient errors, exceeded quotas, safety filters, and more. Rather than failing the pipeline on any individual error, we log errors and surface error metrics via our observability platform. Our team is alerted if the number of failed inferences exceeds a certain threshold. To support debugging, we log a sample of traces to Honeycomb.
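The fail-soft pattern described above might look like this sketch (function names and the threshold value are hypothetical; the real alerting runs through the observability platform rather than a return flag):

```python
import logging

logger = logging.getLogger("attribute_pipeline")
ERROR_RATE_THRESHOLD = 0.05  # assumed alerting threshold

def run_batch(listings, infer):
    """Run `infer` over listings, logging per-item failures instead of aborting."""
    results, errors = [], 0
    for listing in listings:
        try:
            results.append(infer(listing))
        except Exception as exc:  # transient errors, quota limits, safety filters...
            errors += 1
            logger.warning("inference failed for %s: %s", listing, exc)
    error_rate = errors / len(listings) if listings else 0.0
    alert = error_rate > ERROR_RATE_THRESHOLD  # would page the team in production
    return results, error_rate, alert
```

Catching per-item exceptions keeps one bad listing (or one transient API error) from discarding the work done on millions of others, while the aggregate rate still surfaces systemic problems.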
Even if the error rate is low, it’s possible that model performance has degraded. To track changes in model performance, we added performance evaluation to our pipeline. First, we run LLM inference on a sample of a ground-truth dataset and calculate performance metrics like precision and recall. These metrics are compared to baseline scores from the full ground-truth dataset. If any metrics deviate significantly, the pipeline is terminated. This process allows us to confirm that third-party LLMs are working as expected before we run production-scale inference.
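This pre-flight regression gate can be sketched as follows (baseline scores and the tolerance are made-up values for illustration):

```python
BASELINE = {"precision": 0.92, "recall": 0.88}  # assumed full-dataset baseline
MAX_DROP = 0.05                                  # assumed allowed degradation

def check_regression(sample_metrics: dict) -> None:
    """Raise to terminate the pipeline when any metric degrades beyond tolerance."""
    for name, baseline in BASELINE.items():
        observed = sample_metrics[name]
        if baseline - observed > MAX_DROP:
            raise RuntimeError(
                f"{name} regressed: {observed:.2f} vs baseline {baseline:.2f}"
            )

check_regression({"precision": 0.91, "recall": 0.87})  # within tolerance: proceeds
```

Failing fast on a small sample is cheap insurance: a silent model update or prompt regression is caught before it burns quota on a full production-scale run.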
The combination of tracing, logging, metric tracking, model performance evaluation, and alerting provides a complete understanding of both pipeline health and model performance metrics, enabling us to consistently transform data to power key shopping experiences with confidence at scale.
Looking Forward
Where we’ve applied LLM-generated product attribute data to buyer and seller-facing experiences, we’ve seen promising results. In target categories, we’ve increased the number of listings with complete attribute coverage from 31% to 91%. And earlier this year, we added LLM-inferred attributes to search filters, leading to more engagement from buyers:

Engagement with relevant Search filters increased
Overall post-click conversion rate increased

Most recently, this work culminated in leveraging LLM-inferred color attributes to display color swatches for each listing on the search results page, giving buyers at-a-glance information that helps them find exactly what they want, faster.

What's Next
Our goal is to unlock the full potential of Etsy’s inventory. Product attribute extraction is just one of many ways we’re using LLMs toward that goal as we work to improve the shopping and selling experience on Etsy. Transforming unstructured information is making it easier than ever for buyers to discover exactly what they’re looking for, and easier for sellers to list their unique creations and have them discovered by the shoppers seeking them.

Source: https://www.etsy.com/codeascraft/understanding-etsyas-vast-inventory-with-llms
