Parse HTML faster than beautifulsoup4 via selectolax
What is the best HTML parser you can think of? BeautifulSoup is probably the most famous and the most widely used. But what if an alternative library parsed HTML faster than BS4? According to a scraping performance table, selectolax has a significant speed advantage over bs4. Let’s walk through how to use the library, taking the Famitsu video game ranking website as an example.

!pip install selectolax

import pandas as pd

# `tree` is a selectolax HTMLParser built from the ranking page's HTML.
# Each selector matches one node per ranked game; text() extracts its text.
title = [node.text() for node in tree.css('div.card-game-sale-rank__title')]
console = [node.text() for node in tree.css('span.icon-console')]
sold_thisweek = [node.text() for node in tree.css('p.card-game-sale-rank__sales-num-past')]
publisher = [node.text() for node in tree.css('p.card-game-sale-rank__publisher')]
msrp = [node.text() for node in tree.css('p.card-game-sale-rank__price')]
rank = [node.text() for node in tree.css('span.icon-ranking')]

# sold_thiswk_float, msrp_float, and duration are assumed to be derived
# beforehand from the scraped values; that step is not shown here.
data = {
    "rank": rank,
    "title": title,
    "console": console,
    "sold": sold_thiswk_float,
    "publisher": publisher,
    "msrp": msrp_float,
    "duration": duration
}
pd.DataFrame(data=data)
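
The `sold` and `msrp` columns hold numbers, while `node.text()` always returns strings, so a conversion step is needed somewhere in between. One possible way to do it is sketched below. The `to_float` helper and the sample strings are hypothetical, and the real page may format its values differently.

```python
import re


def to_float(text):
    """Return the first number found in text as a float, or None if absent."""
    # Drop thousands separators, then take the first run of digits
    # (with an optional decimal part).
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


# Hypothetical scraped strings; the real page may format values differently.
sold_thisweek = ["12,345 units", "678 units"]
msrp = ["7,678 yen (tax incl.)", "TBA"]

sold_thiswk_float = [to_float(s) for s in sold_thisweek]
msrp_float = [to_float(s) for s in msrp]

print(sold_thiswk_float)
print(msrp_float)
```

Returning `None` for text without digits keeps every list the same length, which matters because `pd.DataFrame` expects equally long columns.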