
Tutorial

How to Scrape Wikipedia Data with Python

Learn how to scrape Wikipedia articles, tables, and infoboxes with Python. Covers the MediaWiki API, BeautifulSoup, and structured data extraction.

Wikipedia is one of the most scraping-friendly sites on the internet. It has no anti-bot protection, offers a free API, and contains structured data across millions of articles. Here is how to extract it efficiently.

Method 1: Wikipedia API (Best Approach)

The MediaWiki API returns structured data without any scraping needed.

import requests

HEADERS = {"User-Agent": "MyScrapingProject/1.0 (contact@example.com)"}

def get_article(title):
    """Get the plain-text content of a Wikipedia article."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": title,
            "prop": "extracts",
            # Note: do NOT pass "exintro": False. MediaWiki boolean flags are
            # enabled by their mere presence, so sending exintro=False would
            # actually turn the flag on and return only the intro. Omit it
            # entirely to get the full article.
            "explaintext": 1,
            "format": "json",
        },
        headers=HEADERS,
    )
    response.raise_for_status()
    pages = response.json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract", "")

text = get_article("Web scraping")
print(text[:500])
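The titles parameter also accepts several titles at once, joined with "|", which cuts the number of round trips when you need a batch of articles. A sketch along the same lines as the function above (per the MediaWiki extracts documentation, exlimit controls how many extracts come back per request; "max" asks for one per requested page):

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def build_batch_params(titles):
    """Build query parameters that fetch plain-text extracts for many titles."""
    return {
        "action": "query",
        "titles": "|".join(titles),   # e.g. "Web scraping|HTML|HTTP"
        "prop": "extracts",
        "explaintext": 1,
        "exlimit": "max",             # one extract per requested page
        "format": "json",
    }

def get_articles(titles):
    """Fetch several articles in a single API round trip."""
    response = requests.get(API_URL, params=build_batch_params(titles))
    response.raise_for_status()
    pages = response.json()["query"]["pages"]
    return {page["title"]: page.get("extract", "") for page in pages.values()}
```

Calling get_articles(["Web scraping", "HTML"]) returns a dict mapping each title to its extract, at the cost of one HTTP request instead of two.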

Method 2: Wikipedia Python Library

pip install wikipedia-api

import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent="MyScrapingProject/1.0 (contact@example.com)",
    language="en"
)

page = wiki.page("Web_scraping")
print(f"Title: {page.title}")
print(f"Summary: {page.summary[:300]}")

# Get all sections
for section in page.sections:
    print(f"Section: {section.title}")
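Note that page.sections only lists the top-level sections; each section object carries its own .sections attribute with its subsections. A short recursive walk, sketched here as a generic helper that works on anything with .title and .sections attributes, prints the whole outline:

```python
def print_section_tree(sections, level=0):
    """Recursively print a nested section outline with indentation."""
    for section in sections:
        print("  " * level + section.title)
        # Each section can contain subsections of its own
        print_section_tree(section.sections, level + 1)
```

Call it as print_section_tree(page.sections) to see subsections indented under their parents.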

Scraping Wikipedia Tables

Wikipedia tables contain highly valuable structured data.

from io import StringIO

import pandas as pd
import requests

# Fetch the page with a descriptive User-Agent, then let pandas parse the tables
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
html = requests.get(url, headers={"User-Agent": "MyScrapingProject/1.0 (contact@example.com)"}).text
tables = pd.read_html(StringIO(html))

# The page may have multiple tables
for i, table in enumerate(tables):
    print(f"Table {i}: {table.shape[0]} rows x {table.shape[1]} columns")
    print(table.head())
    print("---")
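It is worth understanding what read_html actually does before trusting its output on a large page. The same call works on an HTML string wrapped in StringIO, which makes the mechanics easy to see in isolation; a minimal, self-contained sketch with a toy table (the country names are made up):

```python
from io import StringIO

import pandas as pd

# A toy table in the same shape as a Wikipedia ranking table
html = """
<table>
  <tr><th>Country</th><th>GDP (US$ million)</th></tr>
  <tr><td>Examplia</td><td>1,000,000</td></tr>
  <tr><td>Samplestan</td><td>500,000</td></tr>
</table>
"""

# Rows made of <th> cells become the header; thousands separators are stripped
tables = pd.read_html(StringIO(html), thousands=",")
df = tables[0]
print(df.columns.tolist())
print(df["GDP (US$ million)"].sum())
```

Because the thousands separators are parsed away, the GDP column comes back numeric and can be summed or sorted directly.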

Scraping Infoboxes

Infoboxes contain structured key-value data displayed on the right side of articles.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "MyScrapingProject/1.0 (contact@example.com)"}
response = requests.get(
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    headers=headers,
)
soup = BeautifulSoup(response.text, "html.parser")

infobox = soup.find("table", class_="infobox")
if infobox:
    rows = infobox.find_all("tr")
    for row in rows:
        header = row.find("th")
        value = row.find("td")
        if header and value:
            print(f"{header.get_text(strip=True)}: {value.get_text(strip=True)}")
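The same row-walking logic is more reusable packaged as a function that returns a dict instead of printing. A sketch, shown here against a static snippet so the parsing is visible end to end (the snippet's field values are illustrative, not scraped):

```python
from bs4 import BeautifulSoup

def infobox_to_dict(infobox):
    """Turn an infobox <table> into {header: value} pairs."""
    data = {}
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header and value:  # skip title/image rows without a th/td pair
            data[header.get_text(strip=True)] = value.get_text(strip=True)
    return data

# Static snippet in the shape of a Wikipedia infobox
html = """
<table class="infobox">
  <tr><th>Paradigm</th><td>Multi-paradigm</td></tr>
  <tr><th>First appeared</th><td>1991</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
print(infobox_to_dict(soup.find("table", class_="infobox")))
```

The same function drops straight into the scraping code above: infobox_to_dict(soup.find("table", class_="infobox")).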

Bulk Article Extraction

import requests
import time

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyScrapingProject/1.0 (contact@example.com)"}

def get_category_articles(category, limit=50):
    """Get articles in a Wikipedia category, following API continuation."""
    articles = []
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": min(limit, 500),  # the API caps a single request at 500
        "cmtype": "page",
        "format": "json",
    }
    while len(articles) < limit:
        response = requests.get(API_URL, params=params, headers=HEADERS)
        data = response.json()
        for member in data["query"]["categorymembers"]:
            articles.append({"title": member["title"], "pageid": member["pageid"]})
        if "continue" not in data:
            break  # no more members in this category
        params.update(data["continue"])  # carry cmcontinue into the next request
        time.sleep(0.5)  # be polite between paginated requests
    return articles[:limit]

articles = get_category_articles("Programming_languages")
print(f"Found {len(articles)} articles")

Best Practices

  • Use the API. It is faster, more reliable, and returns cleaner data than HTML scraping.
  • Set a User-Agent. Wikipedia requires a descriptive User-Agent string that identifies your project and includes contact information.
  • Respect rate limits. Wikimedia asks API clients to stay below roughly 200 requests per second; for most scraping jobs, serial requests with a short delay are more than enough.
  • Use Wikidata. For structured entity data, Wikidata's SPARQL endpoint is often better than scraping Wikipedia directly.
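
To make the last point concrete, here is a sketch of querying Wikidata's public SPARQL endpoint (query.wikidata.org/sparql). The query asks for items that are instances of (property P31) "programming language" (item Q9143); the User-Agent string is a placeholder you should replace with your own:

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Items that are instances of (P31) "programming language" (Q9143)
QUERY = """
SELECT ?lang ?langLabel WHERE {
  ?lang wdt:P31 wd:Q9143 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

def run_sparql(query):
    """Run a SPARQL query against Wikidata and return the JSON result bindings."""
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "MyScrapingProject/1.0 (contact@example.com)"},
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]
```

Each binding maps variable names to value objects, so a result row's label is read as row["langLabel"]["value"].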