Parse HTML faster than beautifulsoup4 via selectolax

Which HTML parser comes to mind first? BeautifulSoup is probably the most famous and the most widely used. But what if an alternative library runs faster than BS4? According to a scraping performance table, selectolax has a significant speed advantage over bs4. Let's walk through how to use the library, taking the Famitsu video game ranking website as an example.

!pip install selectolax

from selectolax.parser import HTMLParser
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

Loading libraries.

Here is an example of selecting elements of the page with CSS selectors.

# Assumes tree = HTMLParser(html), where html is the source of the ranking page.
# Each list holds the text of every node matching the CSS selector.
title = [node.text() for node in tree.css('div.card-game-sale-rank__title')]
console = [node.text() for node in tree.css('span.icon-console')]
sold_thisweek = [node.text() for node in tree.css('p.card-game-sale-rank__sales-num-past')]
publisher = [node.text() for node in tree.css('p.card-game-sale-rank__publisher')]
msrp = [node.text() for node in tree.css('p.card-game-sale-rank__price')]
rank = [node.text() for node in tree.css('span.icon-ranking')]
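The dictionary built in the next step uses `sold_thiswk_float` and `msrp_float`, but the conversion from the scraped strings is not shown. One way it might be done with `re` is sketched here. The helper `to_float` is hypothetical, and the sample strings are made up to illustrate an assumed format (digits with thousands separators followed by a unit), so check them against the actual page text.

```python
import re

def to_float(text):
    """Return the first number found in `text` as a float, or None if there is none."""
    # Match digits with optional thousands separators and an optional decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

# Made-up sample strings; the real lists come from the selectors in the previous step.
sold_thisweek = ["12,345本", "6,789本"]
msrp = ["7,678円", "5,980円(税込)"]

sold_thiswk_float = [to_float(s) for s in sold_thisweek]
msrp_float = [to_float(s) for s in msrp]
print(sold_thiswk_float, msrp_float)
```

Returning `None` for text without a number keeps every list the same length as its source, which matters when the lists are combined into a table.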

Then, save the retrieved data in a dictionary and convert it to a DataFrame.

data = {
    "rank": rank,
    "title": title,
    "console": console,
    "sold": sold_thiswk_float,
    "publisher": publisher,
    "msrp": msrp_float,
    "duration": duration
}
pd.DataFrame(data=data)
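`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths. That can happen if one selector matches fewer nodes than the others, for instance when an entry on the page lacks a field. A small check before building the DataFrame makes the mismatch easier to spot. The helper `check_column_lengths` and the sample data are hypothetical and only illustrate the idea.

```python
def check_column_lengths(columns):
    """Return each column's length; raise ValueError if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return lengths

# Made-up example in which the "console" selector missed one entry.
sample = {
    "rank": ["1", "2"],
    "title": ["Game A", "Game B"],
    "console": ["Switch"],
}

try:
    check_column_lengths(sample)
except ValueError as err:
    print(err)
```

Calling `check_column_lengths(data)` right before `pd.DataFrame(data=data)` shows which field came up short.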