AI Tip: Web site scraping

Mar 19, 2024 | General

Have a lot of web content to train?

With Vizaport and other AI RAG platforms, there are many options for training your customized AI model. You can use many types of documents, from PDFs to Word docs to spreadsheets and more. But web pages often have content that is already approved for public viewing and is often already summarized for a customer. It is therefore common for clients training an AI to first think of pointing it at their web site as the first training source.

In the next image, you’ll find an example of Vizaport’s AI training. Simply drag-and-drop documents of all types. Or give it a URL. 

In our AI tips and tricks, we talk about how web page content may not yield the best accuracy compared with some other structured or unstructured data formats. Yet we know that clients will continue to use the URL extract function. We get it. It’s easy. It’s already there.

[Image: AI Training]

What’s the best way to load web site data?

First, let’s review a few ways to get web site data loaded into an AI training model, along with the pros and cons of each:

  1. Single web page extract: For sites with only a few pages, a URL can be entered into Vizaport and a single page extracted. You control which pages are loaded, and you can delete unnecessary content or pages. It’s helpful… but not for web sites with a lot of content.
  2. Saving pages as individual documents: If you have access to tools that efficiently download and extract web pages, you can save them in a variety of formats and then upload them. Having each page as its own document makes it easy to review and edit, but this can be difficult to maintain for sites with hundreds of pages.
  3. Web scraping into a single document: Perhaps we shouldn’t have used the term “web scraping”, as it can sometimes have a bad connotation, but it’s your web site. You control it. If you have tools that can download content and put it into a single document, this makes it easy to load, and to build connectors over time that keep it in sync. For those on WordPress sites, there is a great WordPress plugin for this, which we walk through in the video below.
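
For readers who are not on WordPress, option 3 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how Vizaport or the plugin in the video works: the names TextExtractor, html_to_text and combine_pages are hypothetical, the example URLs are placeholders, and the pages are assumed to be already downloaded from a site you own. It uses only the Python standard library to strip the markup from each page and join the text into one document, with a "Source:" line per page so each section can be traced back to its URL.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips script, style and noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of one HTML page, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def combine_pages(pages):
    """Join several pages into a single text document.

    pages: dict mapping a page URL to its raw HTML (hypothetical input shape).
    """
    sections = []
    for url, html in pages.items():
        sections.append(f"Source: {url}\n{html_to_text(html)}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    # Placeholder pages standing in for content downloaded from your own site.
    pages = {
        "https://example.com/about": "<h1>About us</h1><p>We build things.</p>",
        "https://example.com/faq": "<h1>FAQ</h1><script>var x = 1;</script><p>Ask away.</p>",
    }
    with open("site_content.txt", "w", encoding="utf-8") as f:
        f.write(combine_pages(pages))
```

The resulting site_content.txt is one file that can be uploaded like any other document. Regenerating that file whenever the site changes is one simple way to keep the training data current.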

Video Example – Plugin to scrape large web sites

Here’s an example video that walks through the process of legally scraping a web site (that you presumably own) to get its data into a single file that can be loaded and maintained in an AI platform like Vizaport. The results are fairly accurate for a broad web site download without customizing data. If you want to test the results yourself, you can try it at the web site shown in the video.

And here’s the link to the plugin referenced in the video, for those who are WordPress users.
