
Constructing a Semantic Ebook Search: Scale an Embedding Pipeline with Apache Spark and AWS EMR Serverless | by Eva Revear | Jan, 2024

Using OpenAI’s CLIP model to support natural language search on a collection of 70k book covers

Towards Data Science

In a previous post I did a little PoC to see if I could use OpenAI’s CLIP model to build a semantic book search. It worked surprisingly well, in my opinion, but I couldn’t help wondering if it would be better with more data. The previous version used only about 3.5k books, but there are millions in the Openlibrary data set, and I thought it was worthwhile to try adding more options to the search space.

However, the full dataset is about 40GB, and trying to handle that much data on my little laptop, or even in a Colab notebook, was a bit much, so I had to figure out a pipeline that could manage filtering and embedding a larger data set.

TLDR; Did it improve the search? I think it did! We 15x’ed the data, which gives the search far more to work with. It’s not perfect, but I thought the results were fairly interesting, although I haven’t done a formal accuracy measure.

This was one example I couldn’t get to work no matter how I phrased it in the last iteration, but it works fairly well in the version with more data.

Image by author

If you’re curious, you can try it out in Colab!

Overall, it was an interesting technical journey, with plenty of roadblocks and learning opportunities along the way. The tech stack still consists of the OpenAI CLIP model, but this time I leverage Apache Spark and AWS EMR to run the embedding pipeline.

Image by author

This seemed like a good opportunity to use Spark, since it allows us to parallelize the embedding computation.

I decided to run the pipeline in EMR Serverless, which is a fairly new AWS offering that provides a serverless environment for EMR and manages scaling resources automatically. I felt it would work well for this use case (as opposed to spinning up an EMR on EC2 cluster) because this is a fairly ad-hoc project, I’m paranoid about cluster costs, and initially I was unsure about what resources the job would require. EMR Serverless makes it quite easy to experiment with job parameters.

Below is the full process I went through to get everything up and running. I imagine there are better ways to manage certain steps; this is just what ended up working for me, so if you have ideas or opinions, please do share!

Building an embedding pipeline job with Spark

The initial step was writing the Spark job(s). The full pipeline is broken out into two stages: the first takes in the initial data set and filters for recent fiction (within the last 10 years). This resulted in about 250k books, and around 70k with cover images available to download and embed in the second stage.

First we pull out the relevant columns from the raw data file.

Then we do some general transformation on data types, and filter out everything but English fiction with more than 100 pages.

The second stage grabs the first stage’s output dataset, and runs the images through the CLIP model, downloaded from Hugging Face. The important step here is turning the various functions that we need to apply to the data into Spark UDFs. The main one of interest is get_image_embedding, which takes in the image and returns the embedding.

We register it as a UDF, and call that UDF on the dataset.

Setting up the vector database

As a final, optional, step in the code, we can set up a vector database, in this case Milvus, to load and query from. Note, I didn’t do this as part of the cloud job for this project, as I pickled my embeddings to use without having to keep a cluster up and running indefinitely. However, it’s fairly simple to set up Milvus and load a Spark Dataframe into a collection.

First, create a collection with an index on the image embedding column that the database can use for the search.
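A sketch of that collection setup with pymilvus (the field names, connection details, and index parameters here are my choices; only the 512-dim embedding size comes from CLIP itself):

```python
EMBED_DIM = 512  # CLIP ViT-B/32 image embedding size

INDEX_PARAMS = {
    "metric_type": "IP",      # inner product pairs well with CLIP embeddings
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128},
}


def create_book_collection(name="book_covers"):
    """Create a Milvus collection with a vector index on the embedding field."""
    # Imported lazily so this module loads even where pymilvus isn't installed.
    from pymilvus import (
        Collection, CollectionSchema, DataType, FieldSchema, connections,
    )

    connections.connect(host="localhost", port="19530")
    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("title", DataType.VARCHAR, max_length=512),
        FieldSchema("cover_url", DataType.VARCHAR, max_length=1024),
        FieldSchema("image_embedding", DataType.FLOAT_VECTOR, dim=EMBED_DIM),
    ]
    collection = Collection(name, CollectionSchema(fields))
    # The index on the embedding column is what the search will use.
    collection.create_index("image_embedding", INDEX_PARAMS)
    return collection
```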

Then we can access the collection in the Spark script, and load the embeddings into it from the final Dataframe.
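One simple way to do the load is to collect the final Dataframe to the driver and insert in batches; that is fine at this scale (~70k rows), though a Spark-Milvus connector would be the heavier-duty route. Field names here are illustrative:

```python
def batched(seq, size):
    """Yield successive size-element chunks of a list."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def load_embeddings(collection, embedded_df, batch_size=1000):
    """Insert (title, cover_url, embedding) rows from a Spark Dataframe into Milvus."""
    rows = embedded_df.select("title", "cover_url", "image_embedding").collect()
    for batch in batched(rows, batch_size):
        # Milvus takes column-oriented inserts: one list per field.
        collection.insert([
            [r["title"] for r in batch],
            [r["cover_url"] for r in batch],
            [r["image_embedding"] for r in batch],
        ])
    collection.flush()
```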

Finally, we can simply embed the search text with the same method used in the UDF above, and hit the database with the embeddings. The database does the heavy lifting of figuring out the best matches.
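A hedged sketch of the query side, using CLIP’s text tower and Milvus search (model checkpoint, field names, and search parameters are illustrative assumptions):

```python
def search_books(collection, query, top_k=10):
    """Embed a text query with CLIP and return the closest book covers."""
    # Same lazy-import pattern: the model only loads when a search runs.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    name = "openai/clip-vit-base-patch32"
    model = CLIPModel.from_pretrained(name)
    processor = CLIPProcessor.from_pretrained(name)
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_vec = model.get_text_features(**inputs)[0].tolist()

    # The database does the heavy lifting of finding the nearest embeddings.
    collection.load()
    hits = collection.search(
        data=[text_vec],
        anns_field="image_embedding",
        param={"metric_type": "IP", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["title", "cover_url"],
    )
    return [(hit.entity.get("title"), hit.distance) for hit in hits[0]]
```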

Setting up the pipeline in AWS

Prerequisites

Now there’s a bit of setup to go through in order to run these jobs on EMR Serverless.

As prerequisites we need:

  • An S3 bucket for job scripts, inputs and outputs, and other artifacts that the job needs
  • An IAM role with Read, List, and Write permissions for S3, as well as Read and Write for Glue.
  • A trust policy that allows the EMR jobs to access other AWS services.

There are great descriptions of the roles and permissions policies, as well as a general outline of how to get up and running with EMR Serverless, in the AWS docs here: Getting started with Amazon EMR Serverless

Next we have to set up an EMR Studio: Create an EMR Studio

Accessing the web via an Internet Gateway

Another bit of setup that’s specific to this particular job is that we have to allow the job to reach out to the Internet, which the EMR application is not able to do by default. As we saw in the script, the job needs to access both the images to embed, as well as Hugging Face to download the model configs and weights.

Note: There are likely more efficient ways to handle the model than downloading it to each worker (broadcasting it, storing it somewhere locally in the system, etc.), but in this case, for a single run through the data, this is sufficient.

Anyway, allowing the machine the Spark job is running on to reach out to the Internet requires a VPC with private subnets that have NAT gateways. All of this setup starts with accessing the AWS VPC interface -> Create VPC -> selecting VPC and more -> selecting the option for at least one NAT gateway -> clicking Create VPC.

Image by author

The VPC takes a few minutes to set up. Once that’s done we also need to create a security group in the security group interface, and attach the VPC we just created.

Creating the EMR Serverless application

Now for the EMR Serverless application that will submit the job! Creating and launching an EMR Studio should open a UI that offers a few options, including creating an application. In the create application UI, select Use Custom settings -> Network settings. Here is where the VPC, the two private subnets, and the security group come into play.

Image by author

Building a virtual environment

Finally, the environment doesn’t come with many libraries, so in order to add additional Python dependencies we can either use native Python or create and package a virtual environment: Using Python libraries with EMR Serverless.

I went the second route, and the easiest way to do this is with Docker, since it allows us to build the virtual environment within the Amazon Linux distribution that’s running the EMR jobs (doing it in any other distribution or OS can become incredibly messy).

Another warning: be careful to pick the version of EMR that corresponds to the version of Python that you’re using, and choose package versions accordingly as well.

The Docker process outputs the zipped up virtual environment as pyspark_dependencies.tar.gz, which then goes into the S3 bucket along with the job scripts.
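That build follows the pattern in the AWS docs: create a venv inside Amazon Linux, pack it with venv-pack, and export the tarball. Something along these lines (the pip package list is my guess at what this job needs):

```dockerfile
# Build the venv inside the same Amazon Linux base that runs EMR jobs.
FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# venv-pack bundles the environment; the rest are the job's dependencies.
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install venv-pack torch transformers pillow requests

RUN mkdir /output && venv-pack -o /output/pyspark_dependencies.tar.gz

# Export just the tarball:  DOCKER_BUILDKIT=1 docker build --output . .
FROM scratch AS export
COPY --from=base /output/pyspark_dependencies.tar.gz /
```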

We can then send this packaged environment along with the rest of the Spark job configurations.
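With the AWS CLI, submitting the job with the packaged environment attached looks roughly like this (application ID, role ARN, and bucket names are placeholders; the spark.archives and PYSPARK_PYTHON settings come from the EMR Serverless docs):

```shell
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-bucket/scripts/embed_books.py",
      "sparkSubmitParameters": "--conf spark.archives=s3://my-bucket/pyspark_dependencies.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    }
  }'
```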

Nice! We have the job script, the environment dependencies, gateways, and an EMR application, so we get to submit the job! Not so fast! Now comes the real fun: Spark tuning.

As previously mentioned, EMR Serverless scales automatically to handle our workload, which typically would be great, but I found (obvious in hindsight) that it was unhelpful for this particular use case.

A few tens of thousands of records is by no means “big data”; Spark wants terabytes of data to work through, and I was essentially just sending a few thousand image urls (not even the images themselves). Left to its own devices, EMR Serverless will send the job to one node to work through on a single thread, completely defeating the purpose of parallelization.

Additionally, while embedding jobs take in a relatively small amount of data, they expand it significantly, as the embeddings are quite large (512 dimensions in the case of CLIP). Even if you leave that one node to churn away for a few days, it’ll run out of memory long before it finishes working through the full set of data.

In order to get it to run, I experimented with a few Spark properties so that I could use large machines in the cluster, but split the data into very small partitions so that each core would have just a bit to work through and output:

  • spark.executor.memory: Amount of memory to use per executor process
  • spark.sql.files.maxPartitionBytes: The maximum number of bytes to pack into a single partition when reading files.
  • spark.executor.cores: The number of cores to use on each executor.
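These land in the job’s sparkSubmitParameters. The exact values I used aren’t reproduced here, but the shape is large executors plus a deliberately tiny maxPartitionBytes so each core gets many small slices (numbers below are illustrative, not a recommendation):

```shell
--conf spark.executor.memory=16g \
--conf spark.executor.cores=4 \
--conf spark.sql.files.maxPartitionBytes=1MB
```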

You’ll have to tweak these depending on the particular nature of your data, and embedding still isn’t a speedy process, but it was able to work through my data.

Conclusion

As with my previous post, the results really aren’t perfect, and by no means a substitute for solid book recommendations from other humans! But that being said there were some spot-on answers to a number of my searches, which I thought was pretty cool.

If you want to play around with the app yourself, it’s in Colab, and the full code for the pipeline is on Github!


