Storing Unstructured Text Data Online Using Your Website

asycd
3 min read · Aug 13, 2024

Is it viable to use your website to house unstructured data for you to access via web scraping methods instead of API calls?

While it may not be the most secure way to store text data, I think it could be a cost-efficient way of accessing your data.

Why?

At Asycd, we have a decent amount of text data we use to provide additional context to our text-based AI tools. This includes Asyra, our creative AI chatbot, and the Theme Explorer V1 (TEV1), our custom-built image generator.

This is particularly useful as a means of providing flexible, real-time prompt data for the TEV1. We can keep the static parts of the image generator offline, since we won’t be changing them much, and keep the prompts, descriptions and related text online. That text is publicly visible, but I doubt the page will be optimised enough to attract traffic.

Whenever we run our image generator, it would scrape the website, create a temporary vector store, and perform RAG on that vector store to aid in generating image descriptions. Pretty simple stuff.
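To make that flow a bit more concrete, here is a minimal sketch of the scrape, store and retrieve steps using only the Python standard library. It is not our production code. The function names, the sample HTML and the choice to only read `<p>` and `<li>` elements are all assumptions for illustration, and the word-count "embedding" is a deliberately crude stand-in for a real embeddings model.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects the text found inside <p> and <li> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <p>/<li> elements we are inside
        self._buffer = []    # text pieces of the current chunk
        self.chunks = []     # finished text chunks

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "li"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("p", "li") and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.chunks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def fetch_html(url):
    """Download a page (hypothetical helper; pass in your own page URL)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_chunks(html):
    """Turn raw HTML into a list of text chunks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def embed(text):
    """Stand-in embedding: a sparse word-count vector, NOT a real model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(chunks):
    """Temporary in-memory 'vector store': a list of (text, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(store, query, k=2):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Made-up sample page so the sketch runs without a network call.
SAMPLE_HTML = """
<html><body>
<h1>Prompts</h1>
<p>A neon city skyline at night in the rain.</p>
<p>A quiet forest path covered in morning fog.</p>
<ul><li>An abstract portrait made of broken glass.</li></ul>
</body></html>
"""

store = build_store(extract_chunks(SAMPLE_HTML))
context = retrieve(store, "neon city at night", k=1)
```

In a real run you would call `fetch_html` on your own page instead of using `SAMPLE_HTML`, swap `embed` for an actual embeddings model, and pass the retrieved chunks to the language model as context for the new image description.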

Advantages and Drawbacks

+ No need to create an API to access our data or store it using a paid cloud service. We also don’t need to store it within the code.

+ Potentially, decreased latency on our image generator since…

+ A mutable data store that can also be customised for commercial use to drive traffic to the website. A large enough corpus of text is likely to attract scrapers as well as general browsers.

  • Cannot add sensitive or confidential data to the data store, since the page is public.
  • Need to re-create the vector database for every session. There may be a workaround for this!
  • Embeddings model cost if embeddings need to be recalculated at runtime every time! There are free options, though, such as the word vectors in spaCy, an open-source NLP library for Python.

This technique would likely not be ideal for large-scale AI applications but it would be worth trying if you can find a workaround for constant recreation of the vector database.

Perhaps instead of scraping at runtime of your application, you scrape and create the vector database at regular daily intervals? That way you store the vector database locally only for a short period of time until it is replaced with an updated database.
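A rough sketch of that daily-refresh idea is below. It assumes the vector database can be serialised to JSON and that `rebuild` is a hypothetical function of your own that scrapes the site and recomputes the embeddings. The file path and the 24-hour interval are also just assumptions.

```python
import json
import os
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed refresh interval: once a day


def is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """True if the cache file is missing or older than max_age seconds."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age


def load_or_rebuild(path, rebuild, max_age=MAX_AGE_SECONDS, now=None):
    """Return the cached store, calling rebuild() only when the cache is stale.

    `rebuild` is a hypothetical callable that scrapes the website and
    returns a JSON-serialisable vector store.
    """
    if is_stale(path, max_age, now):
        store = rebuild()
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(store, f)
        os.replace(tmp_path, path)  # swap in the new file in one step
        return store
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

With something like this, the application calls `load_or_rebuild("store.json", rebuild)` at startup. Most sessions then read the local file, and the scrape and embedding cost is only paid when the file is more than a day old.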

How Does it Work?

The first prerequisite is, of course, a website, ideally with its own domain name so that it is easier to find.

Various website builders allow a certain number of pages and a certain amount of content per page, but this will depend on what your website provider or cloud hosting provider allows. Since I use Squarespace, I can refer to the limits listed here.

There is no quoted limit on the amount of text data that can be included on any given page, which is perfect for our use case.

There really isn’t much to set up. Just start uploading and that is your unstructured data store in operation.

Our Private and Confidential Data

You can find the page containing our data used for RAG on our private and confidential data page. I won’t link it; you will just have to find it.

The data is used to generate new descriptions for our image generator which explains the randomness of the sentences.

Maximising Unstructured Data on Your Website

Reddit is a good example of this practice. The whole website has become a hub for authentic and useful information across various domains.

Storing unstructured text data on your website can significantly enhance SEO, user engagement, and real-time data access. By leveraging rich snippets, content clustering, and interactive features, you can improve visibility and user experience. Real-time data scraping and dynamic content updates ensure your site remains relevant and informative.

Industry-specific applications, such as patient reviews in healthcare and market analysis in finance, showcase the versatility of unstructured data. Implementing these strategies can transform your website into a powerful tool for data-driven insights and enhanced user experiences.

By integrating these techniques, your website can become a hub for valuable insights and innovation.

asycd

generating abstract and innovative digital art using generative AI since 2022-_-creating software and websites since 2024 -_-'always something you can do'