Harnessing LLMs for Automated Timeline Generation
In our data-driven world, the ability to quickly synthesize and visualize historical information is invaluable. This blog post explores how we can leverage Large Language Models (LLMs) to automatically generate timelines from Wikipedia content, focusing on the core functionality using Python, Pydantic, LangChain, and OpenAI’s GPT models.
The Power of Timelines
Imagine a journalist named Pawan working on a comprehensive article about the history of space exploration. With deadlines looming, Pawan needs to quickly grasp the key events and milestones spanning decades. Enter our LLM-powered timeline generator. Within minutes, Pawan inputs “space exploration” and receives a chronological list of pivotal moments - from Sputnik 1’s launch to the latest Mars rover landing. This timeline not only saves Pawan hours of research but also provides a structured foundation for the article, ensuring no critical events are overlooked.
This scenario illustrates how automated timeline generation can transform raw data into structured, chronological narratives, enhancing research efficiency and comprehension. Now, let’s dive into the implementation of this powerful tool, with a focus on the key libraries and detailed code explanations.
AI tool to create a timeline
The tool lets the user select the LLM of their choice and creates a timeline dashboard for a given topic. The user can also specify the topic name and the number of events to include in the timeline.
The tool is available here. Please experiment with it.
The video below walks you through the tool and the code.
Key Libraries Overview
Before we delve into the implementation, let’s briefly overview two crucial libraries we’ll be using:
Pydantic
Pydantic is a data validation and settings management library using Python type annotations. It enforces type hints at runtime, providing clear and informative error messages when data is invalid. Pydantic models are declarative, making them intuitive to define and use. The library seamlessly integrates with many Python frameworks and can automatically generate JSON schemas. In our project, Pydantic is vital for defining structured data models for our timeline events, ensuring that the data extracted by the LLM adheres to our specified format and types.
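As a taste of what Pydantic gives us, here is a minimal sketch (separate from the timeline code) of a model that coerces types at runtime and raises a descriptive error on bad input:
from datetime import datetime
from pydantic import BaseModel, ValidationError

class Launch(BaseModel):
    name: str
    date: datetime

# An ISO date string is coerced to a datetime automatically
print(Launch(name="Sputnik 1", date="1957-10-04"))

try:
    Launch(name="Sputnik 1", date="not a date")
except ValidationError as e:
    print(e)  # names the offending field and explains why it failed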
LangChain
LangChain is a framework for developing applications powered by language models. It provides a standardized interface for chains, a generic way to combine multiple components like prompts, language models, and other chains. LangChain facilitates prompt management, chat history handling, and seamless integration with various LLM providers. It also offers tools for working with different data sources and memory types. In our timeline generator, LangChain is crucial for creating a structured prompt, managing the interaction with the LLM, and parsing the output into our defined Pydantic models.
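As a flavor of that composition, here is a minimal sketch (assuming OPENAI_API_KEY is set in the environment and langchain-openai is installed; the model name is illustrative):
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Prompt, model, and output parser are piped into a single runnable chain
prompt = PromptTemplate.from_template("Name one milestone in {topic}.")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
print(chain.invoke({"topic": "space exploration"}))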
Implementing Timeline Generation: A Step-by-Step Guide
Now, let’s break down the implementation into detailed steps, providing in-depth explanations of each code snippet.
Step 1: Setting Up the Environment
First, let’s import the necessary libraries:
import os
import pandas as pd
import traceback
from datetime import datetime
from typing import List, Literal
from pydantic import BaseModel, Field
from fasthtml.common import *
from langchain_core.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.utilities.wikipedia import WikipediaAPIWrapper
from langchain_community.tools.wikipedia.tool import WikipediaQueryRun
Let’s break down these imports:
- os: Standard Python library for operating system interactions, used here to set API keys as environment variables.
- pandas: A powerful data manipulation library, used here for creating and managing our timeline dataframe.
- traceback: Standard library module for printing detailed error stack traces.
- datetime: For handling date and time objects in our events.
- typing: Provides support for type hints, which we’ll use in our Pydantic models.
- pydantic: For defining our data models with built-in validation.
- fasthtml.common: Used to build the web dashboard around the generator (not covered in this post).
- langchain_core.prompts: Allows us to create structured prompts for our LLM.
- langchain.output_parsers: Helps in parsing the LLM output into our defined Pydantic models.
- langchain_openai, langchain_google_genai, and langchain_anthropic: Interfaces for OpenAI, Google, and Anthropic LLMs.
- langchain_community.utilities.wikipedia and langchain_community.tools.wikipedia.tool: Provide tools for fetching content from Wikipedia.
Step 2: Defining Data Models with Pydantic
Now, let’s create our data models using Pydantic:
class Event(BaseModel):
    time: datetime = Field(description="When the event occurred")
    description: str = Field(description="A summary of what happened. Not more than 20 words.")
    sentiment: Literal["Positive", "Negative"] = Field(..., description="Categorization of the event sentiment")

class EventResponse(BaseModel):
    events: List[Event] = Field(max_length=20, description="List of events extracted from the context")
Let’s break this down:
- Event class:
  - time: A datetime field representing when the event occurred.
  - description: A string field for a brief summary of the event, limited to 20 words.
  - sentiment: A Literal field that can only be “Positive” or “Negative”, representing the event’s sentiment.
- EventResponse class:
  - events: A list of Event objects, with a maximum of 20 events.
The Field function is used to provide additional metadata and validation rules for each field. This structure ensures that our LLM output will conform to this specific format, making it easier to process and analyze the timeline data.
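To see these rules in action, here is a quick, illustrative check (assuming the models above are already defined) of how the Event model coerces and validates input:
from pydantic import ValidationError  # already available via the pydantic package

event = Event(time="1969-07-20T20:17:00",
              description="Apollo 11 lands on the Moon.",
              sentiment="Positive")
print(event.time.year)  # 1969 - the ISO string was coerced to a datetime

try:
    Event(time="1957-10-04", description="Sputnik 1 launches.", sentiment="Neutral")
except ValidationError as e:
    print(e)  # rejected: sentiment must be 'Positive' or 'Negative'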
Step 3: Setting Up the LLM Chain
Now, let’s set up our LangChain components:
parser = PydanticOutputParser(pydantic_object=EventResponse)

event_extraction_template = """
Extract the time-based information or events from the context and return a list of events with time, event description and event sentiment type, i.e. whether it was a positive or negative event.
The context may contain information about people, organizations or any other entity.
<context>
{context}
</context>
The response must follow the following schema strictly. There will be a penalty for not following the schema.
<schema>
{format_instructions}
</schema>
Must ensure the event belongs to the topic {topic} and try to get at least {numevents} unique events from the context.
Output:
"""

event_prompt = PromptTemplate(
    input_variables=["context", "topic", "numevents"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
    template=event_extraction_template
)
Here’s what’s happening:
- We create a PydanticOutputParser that will parse the LLM’s output into our EventResponse model.
- We define an event_extraction_template. This is a crucial part of our implementation, as it instructs the LLM on how to process the input and format the output. Let’s break it down:
  - It asks the LLM to extract time-based information from the given context.
  - It specifies that the events should include time, description, and sentiment.
  - It emphasizes the need for a unique list of events that belong to the topic.
  - It includes placeholders for the context ({context}), the topic ({topic}), the number of events ({numevents}), and the format instructions ({format_instructions}).
- We create a PromptTemplate using this template. The input_variables specify what will be passed at invocation time (context, topic, and numevents), and partial_variables includes the format_instructions from our parser.
This setup ensures that our LLM will receive clear instructions on how to process the input and format the output according to our Pydantic model.
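If you’re curious what the parser injects into the prompt, you can print it directly:
print(parser.get_format_instructions())
# Prints the JSON schema derived from EventResponse, along with
# instructions telling the LLM to respond in matching JSON.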
Step 4: Configuring the Language Model
Next, we’ll create a function that sets up OpenAI’s GPT, Anthropic’s Claude, or Google’s Gemini language model based on user preference:
def getModel(model, key):
    if model == 'OpenAI Gpt-o':
        os.environ['OPENAI_API_KEY'] = key
        return ChatOpenAI(temperature=0,              # Set to 0 for deterministic output
                          model="gpt-4o-2024-08-06",  # Using the GPT-4o model
                          max_tokens=8000)            # Limit the response length
    elif model == 'Anthropic Claude':
        os.environ['ANTHROPIC_API_KEY'] = key
        return ChatAnthropic(model='claude-3-5-sonnet-20240620')  # Claude 3.5 Sonnet
    else:
        os.environ['GOOGLE_API_KEY'] = key
        return ChatGoogleGenerativeAI(
            model="gemini-1.5-pro",
            temperature=0,
            max_tokens=8000,
            max_retries=2,
        )
This function does the following:
- It takes two parameters: model (to choose between OpenAI, Anthropic, and Google) and key (the API key).
- Depending on the chosen model:
  - For OpenAI, it sets the API key in the environment variables and returns a ChatOpenAI instance. temperature=0 ensures deterministic output, we’re using the “gpt-4o-2024-08-06” model (GPT-4o), and max_tokens=8000 limits the response length.
  - For Anthropic, it sets the API key and returns a ChatAnthropic instance using the Claude 3.5 Sonnet model.
  - For Google, it sets the API key and returns a ChatGoogleGenerativeAI instance using the Gemini 1.5 Pro model, with the same temperature and token limit as the OpenAI configuration.
This setup allows flexibility in choosing the LLM provider while encapsulating the configuration details.
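For example, to configure Claude instead of GPT (the key value below is a placeholder, not a real key):
llm = getModel('Anthropic Claude', 'your-anthropic-api-key')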
Step 5: Fetching Wikipedia Content
For fetching Wikipedia content, we’ll use the WikipediaQueryRun tool:
wikipedia = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
wiki_content = wikipedia.run(topic)
This code:
- Creates a WikipediaQueryRun instance, which is a tool for querying Wikipedia.
- Uses the run method to fetch content for a given topic.
This abstraction simplifies the process of retrieving relevant information from Wikipedia, providing our LLM with rich context for timeline generation.
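You can try the tool on its own with the illustrative snippet below. If the default excerpt is too short for a rich timeline, WikipediaAPIWrapper also accepts a doc_content_chars_max parameter to return more of the article text:
# Fetch more article text than the default and preview it
wikipedia = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper(doc_content_chars_max=8000))
content = wikipedia.run("Space exploration")
print(content[:300])  # preview the first few hundred characters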
Step 6: Generating the Timeline
Now, let’s create the core function that orchestrates the timeline generation process:
def generate_timeline(topic, numevents, llm):
    try:
        wikipedia = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
        wiki_content = wikipedia.run(topic)
        # Compose prompt, model, and parser into a single chain
        chain = event_prompt | llm | parser
        result = chain.invoke({"context": wiki_content,
                               "topic": topic,
                               "numevents": numevents})
        # Convert the parsed events into a chronologically sorted DataFrame
        df = pd.DataFrame([event.dict() for event in result.events])
        df = df.sort_values("time", ascending=True).reset_index(drop=True)
        # Save to CSV
        df.to_csv(f"{topic.replace(' ', '_')}_timeline.csv", index=False)
        print(f"Timeline saved to '{topic.replace(' ', '_')}_timeline.csv'")
        return df
    except Exception as e:
        print(f"Error generating timeline: {str(e)}")
        return None
Let’s break down this function:
- It takes three parameters: topic (the subject for the timeline), numevents (the number of events to aim for), and llm (the configured language model).
- It fetches Wikipedia content for the given topic using WikipediaQueryRun.
- It composes a chain (event_prompt | llm | parser) from our prompt, the provided LLM, and the Pydantic parser.
- It invokes the chain with the Wikipedia content as context, along with the topic and desired number of events, which:
  - Sends the content to the LLM with our structured prompt.
  - Receives the LLM’s response and parses it into our EventResponse model.
- It then:
  - Converts the parsed events into a pandas DataFrame.
  - Sorts the events chronologically and resets the index.
  - Saves the timeline to a CSV file, naming it based on the topic.
  - Returns the DataFrame for further use.
- If any error occurs during fetching, generation, or parsing, it’s caught in the except block, printed, and the function returns None.
This function encapsulates the entire process of timeline generation, from fetching data to processing the LLM output and saving the results.
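As a quick sanity check, here is a small helper (not part of the original tool; it assumes the time column holds datetimes, as our model guarantees) that prints the DataFrame as a readable timeline:
def print_timeline(df):
    # Walk the rows in chronological order (the DataFrame is pre-sorted)
    for _, row in df.iterrows():
        marker = "+" if row["sentiment"] == "Positive" else "-"
        print(f"{row['time']:%Y-%m-%d} [{marker}] {row['description']}")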
Step 7: Using the Timeline Generator
Finally, here’s how we can use our timeline generator:
# Example usage
topic = "Space Exploration"
llm = getModel('OpenAI Gpt-o', 'your-api-key-here')
timeline_df = generate_timeline(topic, 15, llm)

if timeline_df is not None:
    print(timeline_df.head())
else:
    print("Failed to generate timeline.")
This code snippet demonstrates the usage of our timeline generator:
- We define the topic (“Space Exploration” in this case).
- We get an instance of the LLM using our getModel function, specifying OpenAI as the provider and passing the API key.
- We call generate_timeline with our topic, the desired number of events, and the LLM.
- If successful, we print the first few rows of the resulting DataFrame.
- If there was an error, we print a failure message.
This simple interface allows users to easily generate timelines for any topic of interest, leveraging the power of LLMs and Wikipedia data.
Conclusion
By leveraging the power of LLMs, Wikipedia, and Python libraries like Pydantic and LangChain, we’ve created a robust tool for generating timelines on any given topic. This approach demonstrates the potential of AI in research and data synthesis, offering a powerful way to quickly organize and understand historical information.
Whether you’re a journalist like Pawan, a researcher, or a data enthusiast, this tool opens up new possibilities for exploring and understanding chronological data. As LLMs continue to evolve, the potential for even more sophisticated timeline generation and analysis tools is immense. This implementation serves as a starting point, inviting further enhancements such as improved event extraction, sentiment analysis refinement, or integration with visualization libraries for creating interactive timelines.