
Case Study: How I Built a Generative AI Content Collection Builder

As a PM for a large media company, I have built several personalization and recommendation products that leverage machine learning and AI. In the example below, I'll share a product my team built and walk through my approach to building it.

Assumption

By giving our content team the ability to leverage machine learning models to pick relevant content for the home screen, we will reduce the time it takes to create new collections and improve overall engagement by surfacing more relevant content for the user.

Approach

- Identify what feature(s) would be needed to test this assumption. Include reasons.

- Identify how it functions and looks

- Discuss trade-offs, risks and metrics for success

Features to Build

We will build a tool for the content team that allows them to input a text phrase and generate a set of recommended content that best fits that phrase. We will build an embeddings model from the metadata of the content in our library and use it to power the tool.

Currently, when a member of the content team wants to create a new collection to display on the home screen, they have to come up with their idea and manually search the content library for titles that fit it. They leverage metadata such as genre, keywords, and summaries, as well as their own anecdotal understanding of each show, to identify content that fits. This is a fairly involved and time-consuming process, and it is subject to individual interpretation. For example, is Cops a crime show, a police show, or a reality show?

By building a tool that accesses all of our content and uses an ML model to select titles based on the text input, we can curate collections faster and reduce individual curator bias. That lets us test new content more quickly, show users newer and more relevant content, and drive up content starts and minutes watched.
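I can't share the actual model we built, but to make the embedding step concrete, here is a minimal sketch using an off-the-shelf sentence-embedding model. The model choice and the metadata fields shown are illustrative assumptions, not what we shipped.

```python
# Minimal sketch of embedding content metadata with an off-the-shelf
# sentence-embedding model. The real model and metadata schema are
# proprietary; the model name and fields below are assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

catalog = [
    {"id": "show-001", "title": "Cops", "genre": "Reality",
     "keywords": ["police", "crime"], "summary": "Officers respond to calls."},
    # ... one record per title in the content library
]

def metadata_to_text(item: dict) -> str:
    """Flatten a title's metadata into a single string for embedding."""
    return " | ".join([
        item["title"],
        item["genre"],
        ", ".join(item["keywords"]),
        item["summary"],
    ])

# One vector per title; in the real system these are written to a vector database.
texts = [metadata_to_text(item) for item in catalog]
content_embeddings = model.encode(texts, normalize_embeddings=True)
```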

How Does It Look and Function

User Stories:

I had to look at this from the angle of two different users: 

1. The internal content team that would actually utilize the tool

2. The end user that would benefit from the improved content being shown to them

- As a content curator, I want to be able to generate collections of content based on a concept or theme so that I can design the content on the home screen in line with the current strategy

- As a content curator, I want to be able to leverage machine learning models to generate collections of content so that I can decrease the time it takes for me to deploy a new collection.

- As a viewer, I want content organized in collections so that I can more easily browse and discover content I want to engage with.

- As a viewer, I want the content recommendations in a collection to be reflective of the title of the collection, so that my expectations are met.

Product Requirements

- Users have access to a web app where they can input text and apply some filters to generate a list of recommendation results

- Users of the tool are able to leverage our embeddings models to generate collections of content based on an idea they have

- Users can apply filters for key metadata to fine-tune results

- Users can re-order the ranking of results by key metadata fields (a rough sketch of the filter and re-order logic follows this list)

- Users can input a text phrase to generate recommendation results
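To make the filter and re-order requirements concrete, here is a rough sketch of what that logic could look like once candidates come back from the model. The metadata field names (genre, release_year, score) are hypothetical.

```python
# Rough sketch of the filter / re-order requirements. Assumes each candidate
# returned by the model carries its similarity score plus a few metadata
# fields; the field names here are hypothetical.

def apply_filters(candidates: list[dict], genre: str | None = None,
                  min_year: int | None = None) -> list[dict]:
    """Keep only candidates that match the curator's metadata filters."""
    results = candidates
    if genre is not None:
        results = [c for c in results if c["genre"] == genre]
    if min_year is not None:
        results = [c for c in results if c["release_year"] >= min_year]
    return results

def reorder(candidates: list[dict], sort_by: str = "score") -> list[dict]:
    """Re-rank results by a chosen metadata field (default: model score)."""
    return sorted(candidates, key=lambda c: c[sort_by], reverse=True)
```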

Technical Requirements

I can't share full details here due to confidential and proprietary information, but I will share some insight into what we focused on.

- For the MVP, we will build a web app where the user can input a text phrase and view the results

- The tool will connect to our ML model platform to serve real-time results based on direct text input

- The text input from the user will be converted to a vector embedding and compared against the content embeddings in the vector database, and the closest matches will be returned to the web app (a rough sketch of this flow follows this list)

- We will connect to the CMS to provide details about the results returned, i.e., allow the user to access the details of the content directly from the tool

- We will store the inputs and outputs for later analysis to determine efficacy
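Our ML platform and vector database are proprietary, so here's a rough sketch of the request flow under stand-in assumptions: an in-memory cosine-similarity search in place of the vector database, and a simple log file in place of our input/output storage.

```python
# Rough sketch of the request flow: embed the curator's text phrase, compare
# it against precomputed content embeddings, return the top matches, and log
# the input/output pair for later efficacy analysis. An in-memory NumPy
# search stands in for the real vector database.
import json
import time
import numpy as np

def recommend(query: str, model, content_ids: list[str],
              content_embeddings: np.ndarray, top_k: int = 20) -> list[str]:
    # Embed the query with the same model used for the content metadata.
    q = model.encode([query], normalize_embeddings=True)[0]

    # With normalized vectors, a dot product is the cosine similarity.
    scores = content_embeddings @ q
    top = np.argsort(scores)[::-1][:top_k]
    results = [content_ids[i] for i in top]

    # Store inputs and outputs for later analysis of the tool's efficacy.
    with open("query_log.jsonl", "a") as f:
        f.write(json.dumps({"ts": time.time(), "query": query,
                            "results": results}) + "\n")

    return results
```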

Tradeoffs, Risks and Metrics for Success

This design was our MVP, and we had to leave off some things we thought would be great but that weren't necessary to prove out the value of the product itself. Our goal was to prove that value first and then expand as needed.

 

Tradeoffs for the MVP:

- Ability to access multiple models: Our team has a suite of models that we deploy in different areas; however, we developed one specifically for this use case and wanted to keep the tool confined to it. This gave us more control over delivering the exact value we wanted.

- Ability to deploy the collection directly to the home screen: This would have required an additional integration involving another team, which would have increased the time it took to deliver and test the value, so we opted to save it for later.

 

Risks:

- Lack of feedback (from a data perspective): When evaluating the efficacy of content, we look at impressions, content starts, and minutes watched, i.e., did the user engage with what we showed them. In this case, there's no direct way to know whether the recommendations the tool provided were beneficial to the team, because we don't know which ones eventually made it to the home screen. We mitigate this through regular contact with the content team for anecdotal feedback, and by tracking new collections as they're posted to the home screen.

 

 

Success Metrics:

Our north star metrics center on end-user engagement, and since this tool was built for an internal team in service of end users, we wanted to make sure we kept an eye on that.

- Content starts - we need to compare content starts from collections created using this tool vs. collections created without it to determine uplift

- Minutes watched - we need to compare minutes watched from collections created using this tool vs. collections created without it to determine uplift

- Cosine similarity - evaluates how well the embeddings capture the semantic similarity between the input phrase and the returned content

- Precision & recall - what share of the returned shows are relevant to the query, and what share of the relevant shows in the library are returned

- F1 score - how well are we balancing precision & recall

- Mean reciprocal rank - measures how high the first relevant item appears in the ranked results, averaged across queries (a rough sketch of computing these retrieval metrics offline follows this list)
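These retrieval metrics can be computed offline against a labeled test set. Here's a rough sketch, assuming that for each test query we have the ranked list the tool returned and a hand-labeled set of relevant titles; this is illustrative, not our actual evaluation harness.

```python
# Rough sketch of offline retrieval metrics: precision@k, recall@k, F1, and
# mean reciprocal rank. Assumes a ranked result list and a labeled set of
# relevant titles per test query (both hypothetical here).

def precision_recall_f1(ranked: list[str], relevant: set[str], k: int = 10):
    """Precision@k, recall@k, and their F1 for a single query."""
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant item across test queries."""
    reciprocal_ranks = []
    for ranked, relevant in runs:
        rank = next((i + 1 for i, item in enumerate(ranked) if item in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```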
