Tutorial
How to Scrape Wikipedia Data with Python
Learn how to scrape Wikipedia articles, tables, and infoboxes with Python. Covers the MediaWiki API, BeautifulSoup, and structured data extraction.
Wikipedia is one of the most scraping-friendly sites on the internet. It has no anti-bot protection, offers a free API, and contains structured data across millions of articles. Here is how to extract it efficiently.
Method 1: Wikipedia API (Best Approach)
The MediaWiki API returns structured data without any scraping needed.
import requests

HEADERS = {"User-Agent": "MyScrapingProject/1.0 (contact@example.com)"}

def get_article(title):
    """Get the full plain-text content of a Wikipedia article."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": title,
            "prop": "extracts",
            # The API treats boolean parameters like checkboxes: any value,
            # even "False", counts as true. Omit "exintro" entirely to get
            # the full article rather than just the introduction.
            "explaintext": True,
            "format": "json",
        },
        headers=HEADERS,
    )
    pages = response.json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract", "")

text = get_article("Web scraping")
print(text[:500])
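If you need several articles at once, the same endpoint accepts up to 50 titles joined with "|". One caveat: the TextExtracts extension generally limits full-article extracts to one page per request, so batching works best with exintro set. A minimal sketch, reusing HEADERS from above (the titles are just for illustration):

titles = ["Web scraping", "HTML", "HTTP"]
response = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "titles": "|".join(titles),  # batch several pages per request
        "prop": "extracts",
        "exintro": True,             # intros only, so batching works
        "explaintext": True,
        "exlimit": "max",
        "format": "json",
    },
    headers=HEADERS,
)
for page in response.json()["query"]["pages"].values():
    print(page["title"], "->", page.get("extract", "")[:80])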
Method 2: Wikipedia Python Library
The wikipedia-api library wraps the same API in a higher-level Python interface.

pip install wikipedia-api
import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent="MyScrapingProject/1.0 (contact@example.com)",
    language="en"
)

page = wiki.page("Web_scraping")
print(f"Title: {page.title}")
print(f"Summary: {page.summary[:300]}")

# List the top-level sections
for section in page.sections:
    print(f"Section: {section.title}")
Scraping Wikipedia Tables
Wikipedia tables contain highly valuable structured data.
import pandas as pd

# pandas can read Wikipedia tables directly
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
tables = pd.read_html(url)

# The page may have multiple tables
for i, table in enumerate(tables):
    print(f"Table {i}: {table.shape[0]} rows x {table.shape[1]} columns")
    print(table.head())
    print("---")
Scraping Infoboxes
Infoboxes contain structured key-value data displayed on the right side of articles.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "MyScrapingProject/1.0 (contact@example.com)"}
response = requests.get(
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    headers=headers,
)
soup = BeautifulSoup(response.text, "html.parser")

infobox = soup.find("table", class_="infobox")
if infobox:
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header and value:
            print(f"{header.get_text(strip=True)}: {value.get_text(strip=True)}")
Bulk Article Extraction
The API's categorymembers list enumerates every page in a category, which makes it easy to build article sets in bulk.
import requests
import time

API_URL = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyScrapingProject/1.0 (contact@example.com)"}

def get_category_articles(category, limit=50):
    """Get up to `limit` articles in a Wikipedia category, following
    the API's continuation tokens across batches."""
    articles = []
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": min(limit, 500),  # 500 is the per-request maximum
        "cmtype": "page",
        "format": "json",
    }
    while len(articles) < limit:
        response = requests.get(API_URL, params=params, headers=HEADERS)
        data = response.json()
        for member in data["query"]["categorymembers"]:
            articles.append({"title": member["title"], "pageid": member["pageid"]})
        if "continue" not in data:
            break  # no more members in this category
        params.update(data["continue"])  # carry cmcontinue into the next batch
        time.sleep(0.5)  # be polite between batches
    return articles[:limit]

articles = get_category_articles("Programming_languages")
print(f"Found {len(articles)} articles")
Best Practices
- Use the API: it is faster, more reliable, and returns cleaner data than HTML scraping.
- Set a User-Agent: Wikipedia requires a descriptive User-Agent string that identifies your project and gives contact information.
- Respect rate limits: make API requests serially rather than in parallel, and pause briefly between bulk requests.
- Use Wikidata: for structured entity data, Wikidata's SPARQL endpoint is even better than scraping Wikipedia directly (see the sketch below).
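As a taste of that last point, here is a minimal SPARQL sketch against https://query.wikidata.org/sparql. It assumes Q9143 is Wikidata's item for "programming language" and P31 is "instance of"; adjust the query for your own entities:

import requests

query = """
SELECT ?lang ?langLabel WHERE {
  ?lang wdt:P31 wd:Q9143 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""
response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "MyScrapingProject/1.0 (contact@example.com)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["langLabel"]["value"])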