Parse HTML faster than beautifulsoup4 via selectolax
What is the best HTML parser you can think of? BeautifulSoup is probably the most famous and the most widely used. But what if an alternative library works faster than BS4? Based on a scraping performance table, selectolax has a significant speed advantage over bs4. Let’s walk through how to use the library, taking the Famitsu video game sales ranking website as an example.
!pip install selectolax
Here is an example of selecting elements of the page with CSS selectors, where `tree` holds the parsed page.
title = [node.text() for node in tree.css('div.card-game-sale-rank__title')]
console = [node.text() for node in tree.css('span.icon-console')]
sold_thisweek = [node.text() for node in tree.css('p.card-game-sale-rank__sales-num-past')]
publisher = [node.text() for node in tree.css('p.card-game-sale-rank__publisher')]
msrp = [node.text() for node in tree.css('p.card-game-sale-rank__price')]
rank = [node.text() for node in tree.css('span.icon-ranking')]
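The table built at the end of this post uses `sold_thiswk_float` and `msrp_float`, which suggests the scraped strings are converted to numbers first. That step is not shown here, so the following is only a sketch of one possible conversion. The helper `to_float` and the sample string formats are assumptions for illustration, not the actual text found on the page.

```python
import re

def to_float(text):
    """Hypothetical helper: return the first number in `text` as a float, or None."""
    # Match digits with optional thousands separators and an optional decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

# Assumed sample strings; the real page text may be formatted differently.
sample_sales = ["12,345本", "6,789本"]
sample_prices = ["7,678円(税込)", "n/a"]

sold_thiswk_float = [to_float(s) for s in sample_sales]
msrp_float = [to_float(s) for s in sample_prices]

print(sold_thiswk_float)
print(msrp_float)
```

Returning `None` for strings without a number keeps every list the same length, which matters when the lists are later combined into one table.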
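import pandas as pd

# sold_thiswk_float, msrp_float, and duration are assumed to be lists of the
# same length as the scraped columns, prepared before this step.
data = {
    "rank": rank,
    "title": title,
    "console": console,
    "sold": sold_thiswk_float,
    "publisher": publisher,
    "msrp": msrp_float,
    "duration": duration,
}
# Each key becomes a column; all lists must have the same length.
pd.DataFrame(data=data)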