Parse HTML faster than beautifulsoup4 via selectolax

Which HTML parser comes to mind first? BeautifulSoup is probably the best known and the most widely used. But there is an alternative library that parses faster than BS4: according to the scraping performance table, selectolax has a significant speed advantage over bs4. Let's walk through basic usage of the library, taking the Famitsu video game sales ranking website as an example.

Pip Install
!pip install selectolax

Loading libraries
from selectolax.parser import HTMLParser
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

Get objects
Here is an example of selecting elements of the page with CSS selectors.
# `tree` is the parsed page, i.e. tree = HTMLParser(html), where html is the page source
title = [node.text() for node in tree.css('div.card-game-sale-rank__title')]
console = [node.text() for node in tree.css('span.icon-console')]
sold_thisweek = [node.text() for node in tree.css('p.card-game-sale-rank__sales-num-past')]
publisher = [node.text() for node in tree.css('p.card-game-sale-rank__publisher')]
msrp = [node.text() for node in tree.css('p.card-game-sale-rank__price')]
rank = [node.text() for node in tree.css('span.icon-ranking')]
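The scraped values are plain strings, while the DataFrame in this post uses numeric columns such as `sold_thiswk_float` and `msrp_float`, whose derivation is not shown. One plausible way to get them, sketched under the assumption that each string holds a single number mixed with thousands separators and unit text. The helper name `to_float` and the sample strings are hypothetical, and the real site may format its values differently.

```python
import re

def to_float(text):
    """Return the first number found in `text` as a float, or None if there is none."""
    # Drop thousands separators, then grab the first integer or decimal number.
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

# Invented sample strings; the actual page text is an assumption here.
sold_thisweek_sample = ["12,345 units", "980 units", "n/a"]
sold_thiswk_float = [to_float(s) for s in sold_thisweek_sample]
print(sold_thiswk_float)
```

Returning `None` for unparseable text keeps the list the same length as the others, which matters when the lists are later combined into one table.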
Then save the retrieved data into a dictionary and turn it into a DataFrame.
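# sold_thiswk_float, msrp_float and duration are derived from the scraped
# values; that conversion step is not shown in this post.
data = {
    "rank": rank,
    "title": title,
    "console": console,
    "sold": sold_thiswk_float,
    "publisher": publisher,
    "msrp": msrp_float,
    "duration": duration,
}
pd.DataFrame(data=data)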

Result