We Tried to Build a Fashion AI Agent in 2024 — and Why It Still Doesn't Work Yet

A retrospective on building FashIntel, a chat-based fashion shopping assistant, and what 2026 taught me: this was never an agent problem. It was a retrieval and ranking problem with an LLM on top.

Published: 2026-03-26

My entry point into fashion wasn't retail — it was fine art. Painting. Photography. Learning how to see. That training shaped how I look at clothing, style, and ultimately, how people make decisions when they shop. Over time, what started as a personal passion turned into a curiosity about something bigger: how the fashion retail system actually works.

That curiosity deepened during a series of startup courses at CMU, where I began to analyze the fashion industry more systematically. This article is a reflection on the evolution of that startup project and my exploration of generative AI in the fashion retail space. I'm writing it now because I recently came across Daydream once again, a startup that has raised $50M to build a product strikingly similar to what I had worked on before with my friends. That caught my attention not because the idea is new, but because it reminded me how hard this problem still is, and how far the technology still has to go before it truly works well.

So this is both a project retrospective and a market/technology reflection: what we built and explored in 2024, how I would design the agent differently in 2026, what the challenges were then and are today, and what I learned from it.

Our Fashion AI Agent's Story

Phase 1: The Illusion of Virtual Try-On

At the end of 2023, a classmate brought up the idea of virtual try-on for the ecommerce space. Creating a browser app for consumers to use on fashion ecommerce sites seemed like an interesting project to explore. We believed this service could boost sales for retailers by letting consumers visualize how clothes would fit on their own bodies.

Technically, building a virtual try-on solution had become more achievable: there was an emerging trend of using LoRA (Low-Rank Adaptation) to fine-tune diffusion models that generate "virtual try-on" effects. In addition, the Segment Anything Model (SAM), launched in April 2023, offered a more cost-effective way to isolate body regions for image generation without task-specific training.

However, LoRA-based virtual try-on is fundamentally constrained:

It generates, not simulates: Outputs are visually plausible but lack true understanding of fit, sizing, and fabric behavior.
Low physical accuracy: Draping, tightness, and material properties are approximated, often leading to unrealistic results.
Inconsistent and unreliable details: Logos, patterns, and garment shapes can shift across generations, breaking fidelity.
Struggles with real-world complexity: Occlusion (arms, hair) and layering introduce artifacts and visual errors.

After more conversations with online shoppers, we concluded that virtual try-on might be interesting for UGC and marketing purposes, but it is weak for fashion ecommerce. From a shopper's perspective, these gaps show up clearly:

It doesn't reflect your real body or real conditions: The body changes (weight, proportions, posture), and clothing looks different under different lighting. A generated image can't reliably capture how it will actually fit or appear in your day-to-day life.
It creates confidence without accuracy: The results look realistic enough to trust, but aren't predictive — leading to mismatched expectations and potentially more returns.
It adds friction without reducing uncertainty: Even after using it, a shopper still can't answer the key question: "Will this actually fit and look right on me?" Over time, that limits trust and repeat usage.

Phase 2: Chasing the "AI Shopping Platform" Vision

With less confidence in the virtual try-on route, I noticed another common pain point for shoppers: styling. At the beginning of 2024, I started designing a shopping platform, built in partnership with brands and retailers, that would give consumers styling advice and help them find where to buy, with a chat-based feature as its highlight.

Quote from my slide deck: "StylingAssist aims to create a network of shoppers, e-commerce retailers, and brands with an AI-powered online shopping assistant to enable personalized experience at any place and any time." As I pictured it, the platform's revenue streams would include both affiliate marketing commissions and B2B services for retail partners. After a dedicated business modeling analysis, I realized that this vision was too broad and would require tremendous resources to jump-start.

At this point the story may sound familiar: a few months later, Daydream raked in $50M in seed funding to build an AI-powered search engine suited for e-commerce (TechCrunch). It was very unlikely that a fashion industry newbie could replicate a funding success like Daydream's.

A snapshot of our UI/UX drafts

Phase 3: From Platform to Embedded Agent

In spring 2024, I pivoted the product again during a GTM course, based on comprehensive market research and stakeholder interviews with both consumers and retail professionals, from sales staff to executives. Instead of a one-size-fits-all shopping platform, I redefined it as "a virtual sales agent integrated onto any apparel ecommerce sites, which can provide personalized styling advice to shoppers and help them to quickly discover apparel for purchase".

The chat interface and semantic search became a heated topic in the retail industry that year. Amazon, as a tech leader, had announced Rufus, and the fashion space was moving fast. According to The Business of Fashion's State of Fashion 2025 report, 50% of surveyed executives saw consumer product discovery as the key use for generative AI in 2025. Some retailers were already benefiting from pioneering generative AI in their business (BusinessOfFashion). Online shopping giant Zalando saw an 18 percent year-on-year lift in profitability that year after launching a ChatGPT-powered shopping assistant that helped lower operational costs and grow customer engagement (City AM). Revolve Group had just fully deployed AI-powered contextual search on its website and achieved "significant gains in both add-to-cart rate and conversion rates" with lower operating costs (Glossy). Daniel Wu, the company's SVP of Business Intelligence, shared with us that Revolve was researching conversational shopping as the next step because the team believed it could reduce return rates. His insight aligned with the view of McKinsey's Harreis that greater personalisation can increase conversion rates and reduce returns (BusinessOfFashion).

This time, I did not want the project to stop at the discovery phase. Throughout the summer, I had the fortune to work with a few seasoned engineers to build a prototype called "FashIntel" based on the updated product design. It was designed to be embedded into any fashion retail site as a chat-based shopping assistant. The goal was to help shoppers discover products faster through natural conversation, while also providing lightweight styling guidance similar to what an in-store sales associate might offer.

2024 Prototype Architecture

Because this was an early prototype and we did not yet have production customer data, we first built a web crawler to bootstrap a product catalog from a multi-brand fashion retailer. This gave us a local dataset of 2,000+ dresses, including structured product metadata, product images, and product URLs. That dataset became the foundation for testing a conversational shopping experience end to end.

On top of the crawled catalog, we built a feature enrichment pipeline powered by VLMs. We started by defining a set of fashion attributes that are commonly used by human sales associates and stylists, such as fit, occasion, style, and fabric vibe. We then used ChatGPT's vision-language capabilities to process both the product text and product images, producing two outputs for each item:

A more structured understanding aligned with a fashion-oriented attribute schema, and
A free-form semantic description capturing how a stylist might naturally describe the item.
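The two outputs described above can be sketched as a small schema plus a validation step. This is a minimal illustration, not the prototype's actual code: the class name, field names, allowed `fit` values, and the sample reply are all assumptions made for the example, and the VLM call itself is left out.

```python
import json
from dataclasses import dataclass

# Hypothetical vocabulary. The article names the attribute families
# (fit, occasion, style, fabric vibe) but not their allowed values.
ALLOWED_FITS = {"slim", "regular", "relaxed", "oversized"}


@dataclass
class EnrichedProduct:
    product_id: str
    fit: str
    occasions: list[str]
    style: str
    fabric_vibe: str
    stylist_description: str  # free-form text, as a stylist might phrase it


def parse_vlm_output(product_id: str, raw_json: str) -> EnrichedProduct:
    """Validate a VLM's JSON reply against the schema before storing it."""
    data = json.loads(raw_json)
    fit = str(data.get("fit", "")).strip().lower()
    if fit not in ALLOWED_FITS:
        # Reject out-of-vocabulary values instead of silently storing them.
        raise ValueError(f"unexpected fit value: {fit!r}")
    description = str(data.get("stylist_description", "")).strip()
    if not description:
        raise ValueError("missing stylist_description")
    return EnrichedProduct(
        product_id=product_id,
        fit=fit,
        occasions=[str(o).strip().lower() for o in data.get("occasions", [])],
        style=str(data.get("style", "")).strip().lower(),
        fabric_vibe=str(data.get("fabric_vibe", "")).strip().lower(),
        stylist_description=description,
    )


# Hand-written reply that stands in for a real model response.
sample_reply = json.dumps({
    "fit": "Relaxed",
    "occasions": ["Summer Wedding", "Garden Party"],
    "style": "Romantic",
    "fabric_vibe": "Light and airy",
    "stylist_description": "A floaty midi dress that reads elegant without feeling stiff.",
})
item = parse_vlm_output("dress-001", sample_reply)
print(item.fit, item.occasions)
```

Normalizing and validating at ingestion time is one way to keep the structured attributes consistent enough to filter on, while the free-form description stays untouched for embedding.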

To support semantic retrieval, we combined the original structured product data with the VLM-generated descriptions and passed them through an embedding pipeline. The resulting embeddings were stored in MongoDB, allowing us to search products not just by exact metadata, but by richer textual and visual semantics. This was important because shopper intent is often vague or conversational — for example, "something elegant for a summer wedding" or "a more casual version of this look."
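As a rough sketch of that pipeline, the example below merges structured fields with a stylist-style description, embeds the result, and runs a similarity search. Everything here is a simplification under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the index is an in-memory dict rather than MongoDB, and the three catalog entries are invented for illustration.

```python
import math
import re
from collections import Counter


def build_document(metadata: dict, stylist_description: str) -> str:
    """Merge structured fields and the VLM description into one text to embed."""
    fields = " ".join(f"{key}: {value}" for key, value in sorted(metadata.items()))
    return f"{fields}. {stylist_description}"


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, index: dict, top_k: int = 2) -> list:
    """Return the ids of the top_k items most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(index, key=lambda pid: cosine(query_vec, index[pid]), reverse=True)
    return ranked[:top_k]


# Invented catalog entries: (structured metadata, stylist-style description).
catalog = {
    "dress-001": ({"color": "ivory", "length": "midi"},
                  "An elegant, airy dress for a summer wedding."),
    "dress-002": ({"color": "black", "length": "mini"},
                  "A sharp party dress for a night out."),
    "dress-003": ({"color": "navy", "length": "maxi"},
                  "A structured formal gown for a winter gala."),
}
index = {pid: toy_embed(build_document(meta, desc)) for pid, (meta, desc) in catalog.items()}
results = search("something elegant for a summer wedding", index)
print(results[0])
```

A word-count vector only matches on shared tokens, so it cannot capture the looser semantic matches that motivated using learned embeddings in the first place. It is used here only to keep the example runnable without external services.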

At runtime, each chat session used a RAG-based retrieval loop. The system interpreted the latest conversation context, translated that context into a retrieval query, and ran vector search over the product embeddings to return the most relevant items. The top results were shown inline in the chat experience, and the ranking could be refreshed multiple times within the same session as the shopper refined their preferences, such as asking for something more formal, cheaper, or better suited for a specific occasion.
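The refresh-on-every-turn behavior can be illustrated with a small session object. This is a sketch under assumptions: `ChatSession`, `keyword_search`, and the item texts are hypothetical, the query builder simply joins recent shopper turns where the prototype would have used an LLM to interpret the conversation, and keyword overlap stands in for vector search.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ChatSession:
    """Keeps the running shopper turns and re-runs retrieval after each one."""
    search_fn: Callable[[str], list]
    turns: list = field(default_factory=list)

    def build_query(self, max_turns: int = 3) -> str:
        # Stand-in for the LLM rewrite step: join the most recent shopper
        # turns so refinements stay attached to the original request.
        return " ".join(self.turns[-max_turns:])

    def handle_turn(self, shopper_message: str) -> list:
        self.turns.append(shopper_message)
        return self.search_fn(self.build_query())


# Hypothetical search backend: scores items by keyword overlap with the query.
ITEMS = {
    "dress-001": "airy summer wedding midi dress",
    "dress-002": "formal elegant evening gown",
    "dress-003": "casual cotton sundress",
}


def keyword_search(query: str, top_k: int = 2) -> list:
    words = set(query.lower().split())
    scored = sorted(ITEMS, key=lambda pid: len(words & set(ITEMS[pid].split())), reverse=True)
    return scored[:top_k]


session = ChatSession(search_fn=keyword_search)
first = session.handle_turn("something elegant for a summer wedding")
second = session.handle_turn("more formal evening look please")
print(first, second)
```

In this toy run the second turn moves the evening gown ahead of the midi dress, which is the kind of in-session re-ranking the paragraph describes.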

We also designed the prototype with a two-level memory system. A short-term memory tracked in-session preferences and dialogue context so that retrieval and response generation stayed grounded in the current conversation. In parallel, the system generated a session summary at the end of each interaction and stored it as part of a lightweight long-term user profile, enabling personalization across future sessions and improving the cold-start experience.
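A minimal sketch of the two memory levels might look like the following. The class and function names are illustrative assumptions, and `summarize_session` is a deterministic stand-in for the LLM-written session summary.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    """Short-term memory: preferences stated during the current chat."""
    preferences: dict = field(default_factory=dict)

    def update(self, key: str, value: str) -> None:
        self.preferences[key] = value  # later statements override earlier ones


@dataclass
class UserProfile:
    """Long-term memory: one short summary per past session."""
    user_id: str
    session_summaries: list = field(default_factory=list)

    def cold_start_context(self, last_n: int = 2) -> str:
        # Text that could be prepended to a new session's prompt.
        return " | ".join(self.session_summaries[-last_n:])


def summarize_session(memory: SessionMemory) -> str:
    # Stand-in for an LLM-written summary: a deterministic key=value digest.
    return ", ".join(f"{k}={v}" for k, v in sorted(memory.preferences.items()))


def end_session(profile: UserProfile, memory: SessionMemory) -> None:
    """Persist the session digest into the long-term profile."""
    if memory.preferences:
        profile.session_summaries.append(summarize_session(memory))


profile = UserProfile(user_id="shopper-42")
memory = SessionMemory()
memory.update("occasion", "summer wedding")
memory.update("budget", "under 200")
memory.update("budget", "under 150")  # the shopper tightened the budget
end_session(profile, memory)
print(profile.cold_start_context())
```

Keeping the long-term profile as short summaries rather than full transcripts is one way to bound prompt size when a returning shopper starts a new session.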

Overall, the architecture combined data crawling, multimodal understanding, embeddings, RAG, dynamic ranking, and conversational memory into a single prototype. The result was a practical demonstration of how LLMs and VLMs could power a more interactive and personalized shopping experience for fashion e-commerce, even before real user traffic or merchant integrations were available.

FashIntel's prototype architecture

2025 Roadblocks: What Stopped Us From Further Development

In early 2025, we paused, not because the prototype didn't work, but because the path from "cool demo" to "brand-grade product" turned out to be a strategic and operational commitment. A fashion brand tech leader put it bluntly to me in an interview: a conversational shopping experience becomes a large-scale, high-effort service, and the business must make hard trade-offs on strategy, customer segments, and long-term ownership. Enterprise chatbots often must integrate with core systems (CRM, knowledge bases, commerce stack) and operate under security, governance, and compliance requirements. Brand safety and hallucination were also major risks, especially in luxury, where brands worry that AI can make experiences feel robotic and where service quality directly shapes trust and loyalty.

Technically, we also hit real constraints: a single brand's SKU breadth is often too limited for a chat-based "find exactly what you want" promise, and fashion itself drifts seasonally — meaning model outputs, attributes, and embeddings need ongoing refresh. In practice, keeping embeddings in sync with changing catalogs and trend-driven semantics creates recurring re-embedding and re-indexing costs that can dominate the operational budget.
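One common way to contain the catalog-driven part of that cost is to re-embed only what changed. The sketch below is an assumption about how this could be done, not something we built: it fingerprints each product's embedding inputs and compares them with the fingerprints saved at the previous run. The function names and sample products are hypothetical.

```python
import hashlib
import json


def content_hash(product: dict) -> str:
    """Stable fingerprint of the fields that feed the embedding."""
    canonical = json.dumps(product, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_refresh(catalog: dict, stored_hashes: dict) -> dict:
    """Compare the current catalog with the hashes saved at the last embedding run."""
    to_embed = [pid for pid, product in catalog.items()
                if stored_hashes.get(pid) != content_hash(product)]
    to_delete = [pid for pid in stored_hashes if pid not in catalog]
    return {"embed": sorted(to_embed), "delete": sorted(to_delete)}


# Invented snapshot of the catalog at the previous embedding run.
last_run = {
    "dress-001": {"title": "Ivory midi dress", "price": 180},
    "dress-002": {"title": "Black mini dress", "price": 120},
}
stored = {pid: content_hash(p) for pid, p in last_run.items()}

today = {
    "dress-001": {"title": "Ivory midi dress", "price": 180},  # unchanged
    "dress-003": {"title": "Navy maxi gown", "price": 260},    # new arrival
}
plan = plan_refresh(today, stored)
print(plan)
```

This only helps with catalog churn. When the enrichment prompt, the attribute schema, or the embedding model changes to follow shifting fashion language, every item's representation is stale and a full re-embed is still required.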

Finally, the macro environment didn't help: luxury and fashion entered a clear slowdown in 2025, and marketing budgets stayed tight. When brands are in slowdown mode, they tend to prioritize initiatives with clearer short-term ROI (inventory efficiency, cost control, proven paid media) over channel-disrupting UX experiments.

How We Would Design It In 2026 (Technically): A Shifted Mindset

Special thanks to Xinhao (Jerome) Li and Yanbin Jiang for sharing their thoughts from technical perspectives with me to compose this section.

In 2024, we framed the AI stylist prototype as a conversational sales agent for fashion e-commerce. The idea was straightforward: if a shopper could describe what they wanted in natural language, the system should be able to understand the request, search a large apparel catalog, and surface items that matched both functional constraints and stylistic intent. To make that possible, we stitched together a scraped product catalog, VLM-based product enrichment, embeddings, and a RAG-style retrieval loop.

At the time, this felt like an agent problem. The prototype worked because it gave the catalog a richer semantic layer beyond the merchant metadata. We used a VLM to turn product text and images into a more stylist-like representation: fit, occasion, style, fabric vibe, and a free-form description of how the item should be interpreted. That made the catalog much more searchable in the language real shoppers actually use. A query like "something elegant but not too formal for an outdoor summer wedding" could now be mapped into something more meaningful than a handful of filters.

But looking back from 2026, the most important lesson is that the hard part was never the chat interface itself. It was retrieval and ranking. Once the system had to operate over a few thousand products, the main determinant of quality became whether the right items were even present in the candidate set. If retrieval missed, the model could not rescue the experience no matter how fluent the response sounded. And even when retrieval worked reasonably well, ranking quickly became the true bottleneck. In fashion discovery, many items can be vaguely relevant. The real challenge is deciding which few should be shown first, balancing semantic relevance with price, inventory, diversity, and whatever business objectives sit behind the experience.
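To make the ranking trade-off concrete, here is a small greedy reranker. It is an illustrative sketch only: the `Candidate` fields, the penalty values, and the sample scores are assumptions, and a production system would learn these weights rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    product_id: str
    relevance: float   # semantic score from retrieval, assumed in [0, 1]
    price: float
    in_stock: bool
    style: str         # used here as a crude diversity key


def rank(candidates: list, budget: float, top_k: int = 3,
         over_budget_penalty: float = 0.3, repeat_style_penalty: float = 0.2) -> list:
    """Greedy rerank: drop unavailable items, then pick the best adjusted score each round."""
    pool = [c for c in candidates if c.in_stock]
    chosen = []
    seen_styles = set()
    while pool and len(chosen) < top_k:
        def adjusted(c: Candidate) -> float:
            score = c.relevance
            if c.price > budget:
                score -= over_budget_penalty
            if c.style in seen_styles:
                score -= repeat_style_penalty  # discourage near-duplicates
            return score
        best = max(pool, key=adjusted)
        chosen.append(best)
        seen_styles.add(best.style)
        pool.remove(best)
    return [c.product_id for c in chosen]


# Invented candidates for illustration.
candidates = [
    Candidate("dress-001", 0.90, 150.0, True, "romantic"),
    Candidate("dress-002", 0.88, 140.0, True, "romantic"),
    Candidate("dress-003", 0.80, 160.0, True, "minimal"),
    Candidate("dress-004", 0.95, 120.0, False, "romantic"),  # sold out
    Candidate("dress-005", 0.85, 400.0, True, "boho"),       # over budget
]
print(rank(candidates, budget=200.0))
```

In this run the sold-out item never appears despite having the highest relevance, and the "minimal" dress is promoted above a second "romantic" one, which shows how pure semantic relevance stops being the deciding factor once inventory and diversity enter the score.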

That changes how I would describe the system today. I would no longer call it primarily an AI agent. I would describe it as a conversational retrieval and ranking system, with an LLM acting as the interface layer. The model still matters, but mostly because it helps translate messy user language into structured intent, maintain context over multiple turns, and generate grounded explanations for why certain items are being recommended. The "intelligence" of the product does not come from the model alone. It comes from the quality of the catalog representation, the retrieval stack, the ranking logic, and the memory signals that shape the session.

If I were designing the same product in 2026, I would simplify the architecture around that reality. I would invest less in the idea of a general-purpose agent and more in building a strong commerce retrieval stack:

Catalog quality is the foundation. The catalog ingestion layer would produce a cleaner and more explicitly typed product ontology, which brings consistency to recommendations.
Stronger retrieval layer and learned ranking. Retrieval would likely be hybrid rather than purely embedding-based, combining structured filters, sparse signals, and dense semantic search. In commerce, relevance often depends on both hard constraints and softer semantic intent, so embeddings alone are usually not enough. Retrieval should be iterative rather than strictly single-shot — one-pass RAG is often too shallow for real shopping conversations. A better design is to expose retrieval and ranking as a bounded skill that the LLM can invoke when needed, which allows the system to refine queries, broaden or narrow constraints, and run multiple retrieval passes when the initial results are weak or ambiguous. An LLM can help orchestrate this loop, but the goal is not open-ended autonomy — it is better search coverage and recovery. Ranking would become the center of the system rather than an afterthought. In practice, many products may be broadly relevant; the real challenge is deciding which few to show first. That calls for a multi-stage ranking stack: high-recall candidate generation, learned reranking for stylistic and contextual relevance, and a final business-aware layer that accounts for factors like inventory, price, diversity, and margin.
Memory as signals, not just context. Memory would be treated less as a long prompt and more as a source of durable signals: stable preferences, price sensitivity, favored brands, and other traits that can directly improve ranking.
Grounded generation still matters. The assistant needs to explain why an item matches, compare options, ask clarifying follow-ups, and avoid hallucinating product attributes. Anthropic explicitly frames many successful systems as simple workflows or LLMs using tools in a loop, rather than fully autonomous agents.
Consider interaction with the physical world. Voice is already good enough as the primary interface for real-time human-AI collaboration. Rather than building only for screen-based consumers, the system should also serve as a real-time retrieval layer for sales staff and frontline workers, surfacing information through lightweight wearables and voice so that the person builds trust while the AI handles instant recall and search.
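The hybrid, iterative retrieval idea from the list above can be sketched as follows. This is a simplified illustration under assumptions: `dense_score` is a precomputed placeholder for embedding similarity, the blend weight and relaxation factor are arbitrary, and the excluded colors stand for a durable negative preference carried over from memory.

```python
from dataclasses import dataclass


@dataclass
class Product:
    product_id: str
    price: float
    color: str
    keywords: set        # sparse signal: tokens from title and attributes
    dense_score: float   # placeholder for query-to-item embedding similarity


def hybrid_search(products, query_terms, max_price, excluded_colors, sparse_weight=0.5):
    """Apply hard constraints first, then blend sparse and dense relevance."""
    scored = []
    for p in products:
        if p.price > max_price or p.color in excluded_colors:
            continue  # hard constraints are filters, not soft preferences
        sparse = len(query_terms & p.keywords) / max(len(query_terms), 1)
        score = sparse_weight * sparse + (1 - sparse_weight) * p.dense_score
        scored.append((score, p.product_id))
    scored.sort(reverse=True)
    return [pid for _, pid in scored]


def search_with_recovery(products, query_terms, max_price, excluded_colors,
                         min_results=2, max_passes=3, relax_factor=1.25):
    """Bounded retry loop: widen the price cap when too few items come back."""
    results = []
    passes_used = 0
    for _ in range(max_passes):
        passes_used += 1
        results = hybrid_search(products, query_terms, max_price, excluded_colors)
        if len(results) >= min_results:
            break
        max_price *= relax_factor  # the shopper's explicit dislikes are never relaxed
    return results, passes_used


# Invented products for illustration.
PRODUCTS = [
    Product("dress-001", 210.0, "ivory", {"elegant", "wedding", "midi"}, 0.82),
    Product("dress-002", 150.0, "navy", {"elegant", "wedding", "maxi"}, 0.90),
    Product("dress-003", 140.0, "sage", {"casual", "linen", "midi"}, 0.40),
]
results, passes = search_with_recovery(
    PRODUCTS, {"elegant", "wedding"}, max_price=180.0, excluded_colors={"navy"})
print(results, passes)
```

The loop is deliberately bounded: it relaxes one soft constraint a fixed number of times and then stops, which matches the goal of better coverage and recovery rather than open-ended autonomy. The excluded color stays a hard filter on every pass, so a strongly stated dislike cannot drift back into the results.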

The core shift is conceptual. In 2024, it was natural to view the product as an agent because the interface was conversational and the system appeared to reason over shopper intent. In 2026, I think the better framing is more grounded: for this kind of retail assistant, the problem is "how do we build a retrieval and ranking system that understands fashion well enough to feel like a stylist?" That distinction matters, because it changes where the engineering effort should go.

2026 Reality Check on Our Fashion Retail AI Agent

By 2026, I had stopped thinking about our 2024 prototype as a chat interface and started seeing it for what it actually was: a catalog intelligence and ranking system that had to justify its existence in a low-growth market.

How the 2025 roadblocks persisted

The 2025 blockers didn't disappear. They became clearer.

Selling conversational commerce into fashion B2B is still fundamentally hard, because brands don't see it as a feature. They see it as control over customer intent. Once you frame it that way, resistance is not a temporary hurdle, it is structural.
At the same time, the market itself stopped giving you room to experiment. Fashion remains in low single-digit growth conditions, with leaders describing the environment as "challenging," and cost pressures (tariffs, input costs, volatility) reshaping priorities (McKinsey). Better UX naturally translates into more spending, but anything that adds operational overhead must show measurable lift quickly. In 2026, the bar shifted from "this feels innovative" to "this drives measurable lift." That's a very different game.
The embedding problem followed a similar pattern. On paper, embeddings look cheap. In reality, they behave like a recurring operational tax. Catalogs change constantly, and fashion language shifts even faster. You are not embedding once, you are running a continuous pipeline just to stay relevant. The real question is no longer how to build embeddings, but how much freshness the business can afford before the ROI breaks.

A new challenge in 2026: Fashion is converging

Then there is a change I didn't expect to matter this much: aesthetic convergence. Watching the 2026 Oscars, it was hard to ignore how similar everything looked. Clean lines, neutral palettes, safe silhouettes. TheWrap described it directly as "cool minimalism" reigning on the carpet, with many stars opting for subdued, safe looks. British Vogue had already been asking a related question a year earlier: why does everyone dress the same now? The industry has quietly converged on a shared aesthetic, reinforced by social media loops that reward what already works. Fashion is still beautiful, but it is less differentiated.

That creates a subtle but important problem for AI. A recommendation system depends on meaningful differences between items. When everything becomes "equally good," ranking gets harder, explanations sound generic, and discovery loses its edge. The problem shifts from helping users find something new to helping them feel confident about choosing something at all.

And that ties directly to behavior. Discovery becomes less exciting. At the same time, people are buying less and thinking longer, holding out for better pieces. In that world, the value of an AI agent is not how many items it can surface, but whether it can reduce uncertainty and help someone feel confident enough to choose.

Active players in this area now

Daydream is probably the clearest example. It raised $50 million, launched its beta with a fashion-specific chat interface and a "style passport," and positioned itself as a discovery layer rather than a fulfillment platform, with commission-based revenue as the initial model (TechCrunch). But its revenue disclosure is limited. The monetization story is still tied to whether they can outperform existing discovery — and whether they can survive strong tech leaders breaking into this vertical (e.g. Google shopping agent, OpenAI shopping research) and actually retain users (ModernRetail).

I used Daydream's beta recently, and what stood out to me was not the ambition but the fragility. In one session, after I explicitly told it "I hate dark blue," the product still drifted back toward similar recommendations a few turns later, as if that negative preference had not really stuck. That sounds minor until you realize it breaks the core promise of chat-based shopping. If the system cannot reliably preserve strong feedback even within a single conversation, then the experience stops feeling intelligent and starts feeling slippery. To me, that is the clearest proof that even with abundant funding and an experienced team, fashion recommendation through chat is still extremely hard to optimize into something that delivers reliable and consistent user value.

Amazon's shopping agent, Rufus, has a very different monetization story. Business Insider reported that Amazon internally projected Rufus could indirectly contribute more than $700 million in operating profit in 2025, with value tied to downstream purchases and ads rather than direct user payment for "styling."

What I learned as a founder

The most important founder lesson I take from 2026 is probably this: the issue was always rooted in the shape of the market. The winners in this space are more likely to either own the funnel or improve the funnel.

The middle ground, where a chat product is expected to create durable value on top of fashion discovery alone, is still much harder than it looks. If I were making the call today, I wouldn't bet on a fully conversational agent as the primary interface for a single brand. I would either move down the stack and build catalog intelligence that improves existing systems, or move up and aggregate across merchants where owning intent actually matters.

And in 2026, the market is telling you very clearly what it values: fewer decisions, higher confidence, and systems that fit into how people already shop, not how we wish they would.