In today’s digital age, professionals across all industries must stay updated with upcoming events, conferences, and workshops. However, efficiently finding events that align with one’s interests amidst the vast ocean of online information presents a significant challenge.
This blog introduces an innovative solution to this challenge: a comprehensive application designed to scrape event data from Facebook and analyze the scraped data using MyScale. While MyScale is commonly associated with the RAG tech stack or used as a vector database, its capabilities extend beyond these realms. We will use it for data analysis, leveraging its vector search functionality to analyze semantically similar events and surface better results and insights.
You may have noticed that Grok AI uses the Qdrant vector database as its search engine to retrieve real-time information from X (formerly Twitter) data. You can harness the power of vector databases in the same way with MyScale by integrating it with platforms like Apify and building simple personalized applications that streamline everyday tasks.
So in this blog, let’s develop an application that takes only the name of a city as input and scrapes all related events from Facebook. Subsequently, we will conduct data analysis and semantic search using the advanced SQL vector capabilities of MyScale.
We’ll use several tools, including Apify, MyScale, and OpenAI, to develop this useful application.
- Apify: A popular web scraping and automation platform that significantly streamlines the process of data collection. It lets us scrape data and feed it to LLMs, so we can build applications on top of real-time data.
- MyScale: A SQL vector database that we use to store and process both structured and unstructured data in an optimized way.
- OpenAI: We will use the text-embedding-3-small model from OpenAI to get embeddings of the text, and then save those embeddings in MyScale for data analysis and semantic search.
How To Set Up MyScale and Apify
To start setting up MyScale and Apify, you’ll need to create a new directory and a Python file. You can do this by opening your terminal or command line and entering the following commands:
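The commands below are a minimal sketch: the directory and notebook names are just examples, and the package list reflects the tools used later in this post.

```bash
mkdir event-analysis && cd event-analysis
touch main.ipynb

# Install the libraries used later in this post
pip install apify-client openai clickhouse-connect pandas
```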
Note: We will be working in a Python notebook. Consider every code block a notebook cell.
How To Scrape Data With Apify
Now, we will use the Apify API to scrape event data for New York City using the Facebook Events scraper.
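A minimal sketch of this step with the apify-client Python library is shown below. The actor ID and the input fields are placeholders/assumptions; copy the exact actor ID and input schema from the Facebook Events scraper page in the Apify Console.

```python
from apify_client import ApifyClient

# Initialize the Apify client with your API token
apify_client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Run the Facebook Events scraper for New York City. The actor ID and input
# fields below are placeholders; check the actor's page for the exact schema.
run_input = {
    "searchQueries": ["New York"],
    "maxEvents": 100,
}
run = apify_client.actor("<facebook-events-scraper-actor-id>").call(run_input=run_input)

# Collect the scraped events from the run's default dataset
events = list(apify_client.dataset(run["defaultDatasetId"]).iterate_items())
print(f"Scraped {len(events)} events")
```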
Note: Don’t forget to add your Apify API token in the script above. You can find your API token on the Integrations page in the Apify Console.
Data Pre-Processing
Raw scraped data arrives in various formats. In this script, we bring the event dates into a single format so that filtering by date can be done more efficiently.
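A sketch of this preprocessing with pandas, assuming the scraped items carry their start time in a dateTime field (adjust the field name to whatever your scraped data actually uses):

```python
import pandas as pd

# Convert the scraped events into a DataFrame
df = pd.DataFrame(events)

# Normalize the event dates into a single naive-UTC datetime column.
# The "dateTime" field name is an assumption about the scraper output.
df["event_date"] = pd.to_datetime(df["dateTime"], errors="coerce", utc=True).dt.tz_localize(None)

# Drop rows whose dates could not be parsed so later filtering stays clean
df = df.dropna(subset=["event_date"]).reset_index(drop=True)
```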
Generating Embeddings
To deeply understand and search events, we will generate embeddings from their descriptions using the text-embedding-3-small model. These embeddings capture the semantic essence of each event, helping the application to return better results.
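A sketch of the embedding step using the openai Python client; the "description" column name is an assumption based on the scraper output:

```python
from openai import OpenAI

openai_client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

def get_embedding(text: str) -> list[float]:
    # text-embedding-3-small returns a 1536-dimensional vector
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

# The "description" column name is an assumption; adjust it to your data
df["description_embedding"] = df["description"].astype(str).apply(get_embedding)
```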
Connecting With MyScale
As discussed at the start, we will use MyScale as a vector database for storing and managing the data. Here, we connect to MyScale in preparation for data storage.
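MyScale is ClickHouse-compatible, so we can connect with the clickhouse-connect client. Replace the placeholder host and credentials with the values from your own cluster:

```python
import clickhouse_connect

# Connect to the MyScale cluster; the host, username and password below are
# placeholders for the values on your cluster's connection details page.
myscale_client = clickhouse_connect.get_client(
    host="your-cluster-host.myscale.com",
    port=443,
    username="your-username",
    password="your-password",
)
```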
Note: See Connection Details for more information on how to connect to the MyScale cluster.
Create Tables and Indexes Using MyScale
We now create a table that matches our DataFrame. All the data, including the embeddings, will be stored in this table.
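A sketch of the table definition follows. The column names are assumptions matching the DataFrame built above, and 1536 is the output dimension of text-embedding-3-small:

```python
# Column names are assumptions based on the scraped fields used in this post
myscale_client.command("""
CREATE TABLE IF NOT EXISTS default.events
(
    id UInt64,
    name String,
    description String,
    event_date DateTime,
    location String,
    attendees UInt32,
    interested UInt32,
    organizer String,
    description_embedding Array(Float32),
    CONSTRAINT check_length CHECK length(description_embedding) = 1536
)
ENGINE = MergeTree
ORDER BY id
""")
```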
Storing the Data and Creating an Index in MyScale
In this step, we insert the processed data into MyScale. This involves batch-inserting the data to ensure efficient storage and retrieval, and then building a vector index on the embedding column.
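A sketch of the batch insert followed by creating an MSTG vector index on the embedding column (adjust the column list to the fields actually present in your DataFrame):

```python
# Batch-insert the DataFrame into MyScale; the column list assumes the schema above
df["id"] = range(len(df))  # simple surrogate key for the table's ORDER BY column
columns = ["id", "name", "description", "event_date", "location",
           "attendees", "interested", "organizer", "description_embedding"]

batch_size = 100
for start in range(0, len(df), batch_size):
    batch = df.iloc[start:start + batch_size]
    myscale_client.insert(
        "default.events",
        batch[columns].values.tolist(),
        column_names=columns,
    )

# Build a vector index on the embedding column so similarity search stays fast.
# MSTG is MyScale's default vector index algorithm.
myscale_client.command("""
ALTER TABLE default.events
ADD VECTOR INDEX vec_idx description_embedding TYPE MSTG
""")
```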
Data Analysis Using MyScale
Finally, we use MyScale’s analytical capabilities to perform analysis and enable semantic search. By executing SQL queries, we can analyze events based on topics, locations, and dates. So, let’s try to write some queries.
Simple SQL Query
Let’s first try to get the top 10 results from the table.
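A minimal query cell for this, using the clickhouse-connect client created earlier:

```python
# Fetch ten rows to confirm the data landed in MyScale as expected
results = myscale_client.query(
    "SELECT id, name, event_date, location FROM default.events LIMIT 10"
)
for row in results.result_rows:
    print(row)
```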
Discover Events by Semantic Relevance
Let’s try to find the top 10 upcoming events with a vibe similar to a reference event, such as: “One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady”. This is achieved by comparing the semantic embeddings of event descriptions, ensuring a match in themes and emotions.
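A sketch of this semantic search, reusing the get_embedding helper defined earlier and MyScale’s distance() function:

```python
# Embed the reference event text with the same model used for the table
reference_text = (
    "One of the Longest Running Shows in the Country - Operating since 1974 "
    "...NOW our 50th YEAR !!!Our Schenectady"
)
reference_embedding = get_embedding(reference_text)

# distance() compares stored embeddings with the reference vector;
# ordering by it returns the semantically closest upcoming events first.
query = f"""
SELECT name, description, event_date,
       distance(description_embedding, {reference_embedding}) AS dist
FROM default.events
WHERE event_date > now()
ORDER BY dist
LIMIT 10
"""
results = myscale_client.query(query)
for name, description, event_date, dist in results.result_rows:
    print(name, event_date, dist)
```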
Next, let’s rank the top 10 events by the number of attendees and interested users, highlighting popular events from large city festivals to major conferences. This is ideal for those seeking to join large, energetic gatherings.
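A possible version of this ranking query, assuming the attendees and interested columns defined earlier:

```python
# Rank all stored events by combined attendee and interested counts
query = """
SELECT name, location, event_date,
       attendees + interested AS total_engagement
FROM default.events
ORDER BY total_engagement DESC
LIMIT 10
"""
results = myscale_client.query(query)
for row in results.result_rows:
    print(row)
```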
Combining relevance and popularity, the next query identifies events in New York City similar to a specific event and ranks them by attendance, offering a curated list of events that reflect the city’s vibrant culture and attract local interest.
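A sketch of this combined query, filtering on the location column and re-ranking the ten closest matches by engagement:

```python
# Find the ten events closest in meaning to the reference event within New York,
# then order those matches by how many people are attending or interested.
query = f"""
SELECT name, location, attendees + interested AS total_engagement
FROM (
    SELECT *,
           distance(description_embedding, {reference_embedding}) AS dist
    FROM default.events
    WHERE location LIKE '%New York%'
    ORDER BY dist
    LIMIT 10
)
ORDER BY total_engagement DESC
"""
results = myscale_client.query(query)
for name, location, engagement in results.result_rows:
    print(name, location, engagement)
```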
Finally, let’s rank the top 10 event organizers by the total number of attendees and interested users, highlighting those who excel at creating compelling events and attracting large audiences. This provides useful insights for event planners and attendees interested in top-tier events.
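A sketch of this aggregation, assuming the organizer column from the table above:

```python
# Aggregate engagement per organizer and keep the ten strongest
query = """
SELECT organizer,
       sum(attendees + interested) AS total_engagement
FROM default.events
GROUP BY organizer
ORDER BY total_engagement DESC
LIMIT 10
"""
results = myscale_client.query(query)
for organizer, engagement in results.result_rows:
    print(organizer, engagement)
```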
So far, we have explored MyScale for data analysis, highlighting its capabilities in enhancing our data workflows. Moving forward, we’ll go one step further by implementing Retrieval-Augmented Generation (RAG), a framework that combines an external knowledge base with LLMs. This step will help you understand your data better and find more detailed insights. Next, you’ll see how to use RAG with MyScale, which makes working with data even more interesting and productive.
Conclusion
We have explored the capabilities of MyScale together with the Apify scraper by developing an event analytics application. MyScale has demonstrated high-performance vector search while retaining all the functionality of a SQL database, letting developers run semantic searches with familiar SQL syntax and improved speed and accuracy.