Jul 24, 2023

Summarize Webpages in Ten Lines of Code with Unstructured + LangChain



Have you ever had to read through a pile of documents just to get up to date on a topic? Summarizing documents quickly is one of the tasks you can accomplish with very little effort thanks to our library.

In this post, we will show you how easy it is to summarize the content of webpages using unstructured, langchain and OpenAI.

All the code below can be found in the following Colab notebook.

Getting the info ready

First of all, you’ll need a way to extract or download the content of a web page, and for this purpose we will use the UnstructuredURLLoader class from langchain. Its constructor returns a loader, and after you call .load() you get elements that you can then filter down to only the useful information, removing JS code and irrelevant content from the HTML. So, we define a function generate_document:

from langchain.document_loaders import UnstructuredURLLoader
from langchain.docstore.document import Document
from unstructured.cleaners.core import remove_punctuation, clean, clean_extra_whitespace
from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain

def generate_document(url):
    "Given a URL, return a langchain Document for further processing"
    loader = UnstructuredURLLoader(
        urls=[url],
        mode="elements",
        post_processors=[clean, remove_punctuation, clean_extra_whitespace],
    )
    elements = loader.load()
    selected_elements = [e for e in elements if e.metadata["category"] == "NarrativeText"]
    full_clean = " ".join([e.page_content for e in selected_elements])
    return Document(page_content=full_clean, metadata={"source": url})

We keep only the NarrativeText elements, and make use of the cleaning bricks to delete strange characters and other content that isn’t useful. The last part of the function creates a Document object from langchain to store all the content obtained.
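The filtering step is easy to picture in isolation. The sketch below uses a hypothetical FakeElement stand-in (not part of unstructured) to show how keeping only the NarrativeText category discards navigation links and script fragments:

```python
from dataclasses import dataclass, field

@dataclass
class FakeElement:
    """Toy stand-in for an element returned by loader.load()."""
    page_content: str
    metadata: dict = field(default_factory=dict)

elements = [
    FakeElement("Home | About | Contact", {"category": "Title"}),
    FakeElement("LangChain makes it easy to build LLM apps.", {"category": "NarrativeText"}),
    FakeElement("function(){...}", {"category": "UncategorizedText"}),
]

# Keep only the narrative prose, exactly as generate_document does
selected = [e for e in elements if e.metadata["category"] == "NarrativeText"]
full_clean = " ".join(e.page_content for e in selected)
print(full_clean)  # → LangChain makes it easy to build LLM apps.
```

The menu line and the leftover JavaScript never reach the model, which is what keeps the later API calls small.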

Creating the summarization pipeline

The next step is to create a pipeline for ingesting the documents, splitting them into pieces to feed to a language model, calling the OpenAI API, getting the result, and storing it. Sounds like a lot of work? Absolutely not: this is just a little function thanks to langchain:

def summarize_document(url, model_name):
    "Given a URL, return the summary from an OpenAI model"
    # openai_key is defined elsewhere, e.g. read from an environment variable
    llm = OpenAI(model_name=model_name, temperature=0, openai_api_key=openai_key)
    chain = load_summarize_chain(llm, chain_type="stuff")
    tmp_doc = generate_document(url)
    summary = chain.run([tmp_doc])
    return clean_extra_whitespace(summary)

Essentially, we create an llm object for calling the API with the chosen model, then run the chain over the document we generated to obtain the result from the model.
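The chain_type="stuff" strategy simply stuffs the full text of every document into a single prompt. The toy sketch below, with a hypothetical fake_llm stand-in instead of a real model, illustrates the idea:

```python
def fake_llm(prompt):
    # Hypothetical stand-in for a real LLM call: just report the prompt size
    return f"(summary of a {len(prompt)}-character prompt)"

def stuff_summarize(docs):
    # The "stuff" strategy: concatenate all documents into one prompt
    prompt = "Summarize the following text:\n\n" + " ".join(docs)
    return fake_llm(prompt)

docs = ["First document text.", "Second document text."]
print(stuff_summarize(docs))
```

In practice load_summarize_chain builds the prompt template for you, and other chain types such as map_reduce exist for document sets that exceed the model’s context window.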

And…that’s all! Calling that function with any URL returns the summary of the page.

One of the advantages of our tool is that you can send fewer tokens to the OpenAI API, saving on billing compared to sending the entire HTML content of the page (and sometimes sending everything is impossible anyway, since current models limit the number of tokens you can send them). It is also useful to know that we have numerous partitioning bricks to get your data ready for this kind of task, so you can leverage the full potential of these LLMs with little effort. PDFs, DOCXs, emails…you name it!
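To get a feel for the savings, here is a rough, hypothetical illustration using whitespace-separated words as a crude proxy for tokens (real tokenizers count differently), with Python’s standard html.parser doing a bare-bones version of the extraction:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text nodes, dropping tags, scripts and styles."""
    def __init__(self):
        super().__init__()
        self.skipping = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skipping = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skipping = False
    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())

raw_html = (
    '<html><head><script>var x = 1; function f(){ return x; }</script></head>'
    '<body><nav>Home About Contact</nav>'
    '<p>Only this paragraph is worth summarizing.</p></body></html>'
)

parser = TextExtractor()
parser.feed(raw_html)
text = " ".join(parser.chunks)

print(len(raw_html.split()), "raw words vs", len(text.split()), "extracted words")
```

Even on this tiny page the extracted text is a fraction of the raw markup; on a real page, with its scripts, menus and trackers, the gap is far larger.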

  • You probably want a mechanism to store previously summarized information so you don’t reprocess URLs. For this purpose, the Colab notebook uses cachier.

  • Other providers of LLMs are available; check out the langchain docs for more information.