I have a new passion! It’s called Google Colab, and I think after this quick guide to streamline and facilitate your SEO tasks, you will have it, too 🙂
You may be thinking, "Ana, but I don't know how to program." Let me tell you something: neither do I! Just use common sense, ChatGPT, and Google Colab.
Here’s how to quickly find internal linking opportunities for SEO with Google Colab.
What’s Google Colab?
Google Colab is a tool that allows you to run Python directly in the browser with almost no installation required. Colab is Google's version of the Jupyter Notebook.
In case you have little experience with code, Colab is a free online tool that you can use to edit and run code and perform and visualize data analysis. It is also a digital notebook that you can easily share with your team.
Before I show you the script to optimize the internal linking, there are a few small steps. Here we go!
Step 1. Extract copy with Screaming Frog
First, we must start by extracting the copy of the pages to be analyzed so that the script understands each page’s content and helps us link them based on similarity. For this, I have used Screaming Frog with its custom extraction function.
The idea is to pick one page that serves as a template and locate its main text block. Check the categories, blog posts, or whatever type of page you want to analyze; with the same XPath, you can extract the content of all of them.
Once you have chosen your page, right-click > Inspect > Copy > Copy XPath.
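If you want to sanity-check an XPath before crawling, you can test it against a snippet of your page's HTML with Python's standard library. The `entry-content` class below is a hypothetical example (use the one from your own page), and note that ElementTree only supports a subset of XPath, which is enough for a quick check:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified HTML standing in for one of your pages
page = ET.fromstring(
    '<html><body>'
    '<div class="entry-content"><p>This is the page copy we want to extract.</p></div>'
    '</body></html>'
)

# A simplified XPath; adjust the class name to match your own template
node = page.find('.//div[@class="entry-content"]')
print(''.join(node.itertext()).strip())  # → This is the page copy we want to extract.
```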
Then we open Screaming Frog and paste the XPath in Custom Extraction.


Once the crawl is ready, we download the CSV and rename the file to custom_extraction_full_text.csv.
It is important to rename the CSV because the script expects that exact filename; otherwise, it will not work. You can edit the script and adapt it to your needs, but if you want it to work without changes, remember to rename the file.
Also, you must open the CSV, delete all blank lines, and rename the column that contains the extracted text to "content." The final column structure should be: Address, Status Code, Status, and content.
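If cleaning the export by hand is tedious, the same cleanup can be sketched with pandas. The column name `Extraction 1` is an assumption here; use whatever name Screaming Frog gave the extraction column in your export:

```python
import pandas as pd

# Hypothetical rows mimicking a Screaming Frog export, including one empty extraction
df = pd.DataFrame({
    'Address': ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
    'Status Code': [200, 200, 200],
    'Status': ['OK', 'OK', 'OK'],
    'Extraction 1': ['Text about trail shoes', None, 'Text about hiking boots'],
})

# Rename the extraction column to "content" and drop rows with no text
df = df.rename(columns={'Extraction 1': 'content'}).dropna(subset=['content'])
print(list(df.columns))  # → ['Address', 'Status Code', 'Status', 'content']
print(len(df))           # → 2
```

From there, `df.to_csv('custom_extraction_full_text.csv', index=False)` gives you a file with the structure the script expects.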

Step 2. Create the Python script and upload the CSV to Drive
Once we have downloaded the CSV from Screaming Frog, renamed it, and removed the blank lines, we will upload this file to our Google Drive.
Now it's time to open Google Colab and a new notebook: File > New notebook in Drive > paste the code below.
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the CSV file from Google Drive
# Make sure that this file exists and is in the correct location
file_path = '/content/drive/My Drive/custom_extraction_full_text.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
df.head()

# Function to find internal link opportunities
def find_internal_link_opportunities(df):
    contents = df['content'].tolist()
    vectorizer = TfidfVectorizer().fit_transform(contents)
    vectors = vectorizer.toarray()
    csim = cosine_similarity(vectors)
    return csim

# Assuming the 'content' column contains the text content of the pages
csim = find_internal_link_opportunities(df)

# Function to display internal link opportunities
def display_link_opportunities(df, csim, threshold=0.5):
    for idx, row in df.iterrows():
        similar_indices = [i for i, score in enumerate(csim[idx]) if score > threshold and i != idx]
        if similar_indices:
            print(f"Page: {row['Address']}")
            print("Potential internal links:")
            for i in similar_indices:
                print(f" - {df.iloc[i]['Address']} (Similarity: {csim[idx][i]:.2f})")
            print()

# Display internal link opportunities with a threshold for similarity
display_link_opportunities(df, csim, threshold=0.5)
Now we can run the script. Colab will ask for permission to access Google Drive; you must grant it so that Colab can read the .csv file you uploaded and the code can work.

To execute the script, we click the circular play button, and the result we obtain is the proposed internal linking.

The result shows each page and its potential links, each with a similarity score for that page. The higher the number, the more similar the content.
Do you want to download the internal linking proposal?
If you want to download the internal linking proposal, you only have to add a small piece of code at the end of the script. This makes working with large amounts of data much more convenient.
# Save the link opportunities (not the raw df, which would only re-save the original CSV)
rows = [{'Page': df.iloc[idx]['Address'], 'Link to': df.iloc[i]['Address'], 'Similarity': round(csim[idx][i], 2)}
        for idx in range(len(df)) for i in range(len(df)) if i != idx and csim[idx][i] > 0.5]
pd.DataFrame(rows).to_csv('/content/internal_link_opportunities.csv', index=False)

# Download the CSV file
from google.colab import files
files.download('/content/internal_link_opportunities.csv')
You will have to run the script again to download the CSV.
What is the logic behind this Python script?
The find_internal_link_opportunities function extracts the content of the pages from the content column of the DataFrame.
The TfidfVectorizer converts this content into numeric vectors based on the TF-IDF model.
These vectors are used to calculate the cosine similarity between all pages (how similar their content is).
The result is a similarity matrix (csim), where each value indicates the similarity between two pages.
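To make the logic concrete, here is a tiny, self-contained sketch of the same TF-IDF plus cosine-similarity idea on three made-up "pages" (real page copy would be much longer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy pages: the first two share vocabulary, the third does not
pages = [
    'running shoes for marathon training',
    'best running shoes for beginners',
    'chocolate cake recipe with frosting',
]
csim = cosine_similarity(TfidfVectorizer().fit_transform(pages))

print(round(csim[0][1], 2))  # the two shoe pages score well above zero
print(round(csim[0][2], 2))  # → 0.0, no terms in common
```

Because the shoe pages share terms like "running" and "shoes," they get a high score and would be suggested as linking candidates, while the cake page scores zero against them.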
This function, display_link_opportunities, iterates over each page (each row of the DataFrame).
For each page, it looks for other pages with a cosine similarity score above the threshold (0.5 in this case), excluding the current page.
If it finds similar pages, it prints the URL of the page (from the “Address” column) and the URLs of possible internal links, along with their similarity scores.
As we can see in the image above, the similarity scores shown fall between 0.5 and 1. The 0.5 threshold can be raised (or lowered) by modifying the script. I have used ChatGPT to ask questions, fix bugs, and learn.
Streamlining our SEO processes with AI
Is it perfect? No. But it helps us see more clearly which pages to link to and streamline our SEO processes.
If you found this post helpful, I’d be grateful if you could share it on your social networks. Your comments and feedback are also valuable, so please get in touch with me on X (Twitter). Thanks for taking the time to read!