parse_tr

pyrcs.parser.parse_tr(trs, ths, sep=' / ', as_dataframe=False)

Parse the rows (<tr> elements) of an HTML table that has already been parsed, e.g. by BeautifulSoup.

See also [PT-1].

Parameters

trs (bs4.ResultSet or list) – contents under <tr> tags of a web page
ths (list or bs4.element.Tag) – list of column names (usually under a <th> tag) of a requested table
sep (str or None) – separator that replaces the one in the raw data, defaults to ' / '
as_dataframe (bool) – whether to return the parsed data in tabular form (as a pandas.DataFrame) rather than as a list of lists, defaults to False

Returns

the parsed rows of the requested table: a list of lists, each holding one row, or a data frame if as_dataframe=True

Return type

pandas.DataFrame or List[list]

Example:

>>> from pyrcs.parser import parse_tr
>>> import requests
>>> import bs4

>>> example_url = 'http://www.railwaycodes.org.uk/elrs/elra.shtm'
>>> source = requests.get(example_url)
>>> parsed_text = bs4.BeautifulSoup(markup=source.content, features='html.parser')
>>> ths_dat = [th.text for th in parsed_text.find_all('th')]
>>> trs_dat = parsed_text.find_all(name='tr')

>>> tables_list = parse_tr(trs=trs_dat, ths=ths_dat)  # returns a list of lists

>>> type(tables_list)
list
>>> len(tables_list) // 100
1
>>> tables_list[0]
['AAL',
 'Ashendon and Aynho Line',
 '0.00 - 18.29',
 'Ashendon Junction',
 'Now NAJ3']
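
The example above needs network access and the live page. As a rough offline illustration of what row parsing involves, the sketch below collects the <td> text of each <tr> in a small inline table. It is not the pyrcs implementation: it uses the standard library's html.parser instead of bs4, the RowCollector class and the table contents are made up for illustration, and treating sep as the string that joins text fragments inside one cell (for example, around a <br>) is an assumption about how the separator is applied.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Hypothetical helper: collect <td> text, grouped by <tr> row."""

    def __init__(self, sep=' / '):
        super().__init__()
        self.sep = sep      # joins text fragments found within one cell
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep only non-blank text that appears inside a <td> cell.
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(self.sep.join(self._cell))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # rows holding only <th> cells end up empty; skip them
                self.rows.append(self._row)
            self._row = None


# Made-up table contents, loosely shaped like the page used above.
html = """
<table>
  <tr><th>ELR</th><th>Line name</th><th>Notes</th></tr>
  <tr><td>AAA</td><td>First line</td><td>Note one<br>Note two</td></tr>
  <tr><td>BBB</td><td>Second line</td><td></td></tr>
</table>
"""

collector = RowCollector(sep=' / ')
collector.feed(html)
collector.close()
rows = collector.rows
print(rows)
```

Each inner list corresponds to one table row, in the same shape as tables_list[0] above. A cell split by <br> becomes a single string joined with sep, and an empty cell becomes an empty string.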