Forums  > Software  > web scraping  
     

dgn2


Total Posts: 2072
Joined: May 2004
 
Posted: 2020-08-08 22:07
This isn't really the right place for this question, but can anyone explain to me why Selenium does not let me 'see' the "Download Holdings (Excel)" link on the following page?

https://api.kurtosys.io/tools/ksys317/fundProfile/ZAG?locale=en-CA#holdings

or:

https://www.bmo.com/gam/ca/advisor/products/etfs?fundUrl=/fundProfile/ZAG!hash!holdings#fundUrl=%2FfundProfile%2FZAG%23holdings

How can I find that "Download Holdings (Excel)" link using Python?


...WARNING: I am an optimal f'er

dgn2


Total Posts: 2072
Joined: May 2004
 
Posted: 2020-08-09 01:33
I think I need to switch to the frame for the holdings, but I can't seem to find the selector. I think something like this should work, but I can't get it right:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

sleep_time = 60

fund_holdings_url = 'https://api.kurtosys.io/tools/ksys317/fundProfile/ZAG?locale=en-CA#holdings'

driver = webdriver.Chrome('C:/Users/dgn2/chromedriver/chromedriver.exe')

driver.get(fund_holdings_url)
# give the page's JavaScript time to render before reading the DOM
time.sleep(sleep_time)
# switch into the holdings iframe by its element id before grabbing the source
driver.switch_to.frame(driver.find_element_by_id('fundEtfFrame'))
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
driver.quit()

Anyone have any ideas?

Thanks

...WARNING: I am an optimal f'er

WayconKidd


Total Posts: 112
Joined: Mar 2010
 
Posted: 2020-08-09 02:19
Not sure what's going on there. The markup Selenium sees (driver.page_source) looks nothing like what curl gets if I run the request from a terminal -- the latter does contain the <a> tag that needs to be clicked, but that's not very useful given it's not _in_ the browser, and it's all JavaScript anyway.

Why not do this (quick and dirty but you get the idea):

import json
import requests

# This endpoint (visible in the browser's network traffic) returns every
# fund entity for the client, keyed by internal entity id.
url = ('https://api.kurtosys.io/entity/get'
       '?_api_key=EA87F964-5384-484A-954B-18C394A23126'
       '&_user_token=80AC692F-1E69-47FF-B0DC-91BC94DF29C2'
       '&entitytypeid=4617&getallforclient=true&maxresults=1000&admindata=0')
d = json.loads(requests.get(url).text)['data']['value']

# walk the entities until one of the identifiers matches the ticker,
# then use that entity id to build the holdings download link
lookingFor = 'ZAG'
found = None
for entityId, entry in d.items():
    for idInfo in entry['identifiers'].values():
        if idInfo['identifier'] == lookingFor:
            found = entityId
            break
    if found is not None:
        break

if found is not None:
    print('download link: https://api.kurtosys.io/tools/ksys317/api/download/csv/holding/%s?fileName=%s_Holdings.csv&locale=en-CA' % (found, lookingFor))

dgn2


Total Posts: 2072
Joined: May 2004
 
Posted: 2020-08-09 03:24
Thanks! I will do that - and if I figure out what is going on with the frame I will come back and post.

...WARNING: I am an optimal f'er

dgn2


Total Posts: 2072
Joined: May 2004
 
Posted: 2020-08-09 17:26
I was actually scraping all of the BMO ETF holdings and historical NAVs/index values to do a quick comparison against some overlap in iShares products and that JSON has all of the fund IDs that I needed to build the URLs for files I wanted to download.

@WayconKidd: How did you figure out how to get that JSON (i.e., what process did you use to figure out how to use the API)? It seems like a much more robust way to scrape, but I have no idea how you did it.

...WARNING: I am an optimal f'er

prikolno


Total Posts: 67
Joined: Jul 2018
 
Posted: 2020-08-09 22:59
In Chrome: right click > "Inspect" > Network tab > Reload the page.

There you will see the problem you're experiencing. There are 87 requests when I hit the page. The static HTML page you're requesting (the first request in the list) doesn't contain the data for that iframe element on the initial paint; instead it contains a script that runs after the page loads, which in turn makes a private API call to fetch that data and render it dynamically.

I use this whenever a site has a private API for fetching data, e.g. if you want to save a video from a streaming site to disk but there's no download button.

It takes a bit of guesswork, but with some experience you can quickly ignore most of the static assets (*.jpg). If you scroll down you will see it hits the `get?_api_key=` request, which looks like it might fetch data. To confirm, click the "Response" tab in the right panel to see the data returned.

Sometimes it's not so obvious and you still have to set your cookies with a library like cookiejar for the request to work correctly - you can click the "Headers" tab to reverse engineer this part of the private API.
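
For example, here's a minimal sketch of replaying a request you found this way (the endpoint is the one from WayconKidd's post; the two headers are just illustrative placeholders - copy whatever the "Headers" tab shows the browser actually sent, since the required set is site-specific):

import requests

session = requests.Session()
# headers copied from the "Headers" tab in DevTools; these particular
# values are examples, not known requirements of this endpoint
session.headers.update({
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'application/json',
})

url = ('https://api.kurtosys.io/entity/get'
       '?_api_key=EA87F964-5384-484A-954B-18C394A23126'
       '&_user_token=80AC692F-1E69-47FF-B0DC-91BC94DF29C2'
       '&entitytypeid=4617&getallforclient=true&maxresults=1000&admindata=0')

resp = session.get(url)
resp.raise_for_status()
data = resp.json()['data']['value']
print('%d entities returned' % len(data))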

dgn2


Total Posts: 2072
Joined: May 2004
 
Posted: 2020-08-10 13:02
@prikolno: Thanks - that is extremely helpful! I will definitely try this approach to explore the private API call.

...WARNING: I am an optimal f'er

prikolno


Total Posts: 67
Joined: Jul 2018
 
Posted: 2020-08-10 14:34
No problem. Have fun. Also, I agree with @WayconKidd: you shouldn't use Selenium for web scraping; it's more of an automation tool for UI testing, like Cypress. requests + bs4 should be sufficient.
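
Something like this is usually all you need when the content is actually in the server-rendered HTML (for this particular page it isn't, which is why you go through the private API instead - the URL here is just the fund profile page from your post):

import requests
from bs4 import BeautifulSoup

# fetch the static HTML and list the links in it; this only finds links
# present in the server-rendered markup, not ones injected by JavaScript
# after the initial load
resp = requests.get('https://api.kurtosys.io/tools/ksys317/fundProfile/ZAG?locale=en-CA')
soup = BeautifulSoup(resp.text, 'html.parser')
for a in soup.find_all('a', href=True):
    print(a['href'])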

WayconKidd


Total Posts: 112
Joined: Mar 2010
 
Posted: 2020-08-10 21:06
@dgn2:

I did basically what prikolno described with the Network tab in Chrome's DevTools. I was watching it to see whether the page was doing any additional data loading, and once I realized the download link contained one unique identifier, I went looking for that identifier in the other requests.

dgn2


Total Posts: 2072
Joined: May 2004
 
Posted: 2020-08-11 02:08
It used to be - say five years ago - that any time a site's JavaScript made web scraping difficult, I could use Selenium and get what I needed very quickly. These days that only seems to work for me about 30% of the time. I have long used historical index constituents and ETF holdings to define the instrument universes I feed my PA trading systems, and a lot of that data has been scraped over the years, so I guess I have to invest some time in upgrading my rudimentary skills. Thanks to each of you for taking the time to help me out.

...WARNING: I am an optimal f'er

contango_and_cash


Total Posts: 122
Joined: Sep 2015
 
Posted: 2020-09-09 13:22
And, just to add, within requests you should learn to use sessions and cookies.

I've seen cookiejar used:
https://stackoverflow.com/questions/6878418/putting-a-cookie-in-a-cookiejar
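
A minimal sketch of the Session pattern (the URLs are placeholders, not a real site):

import requests

# a Session keeps cookies across requests, so a cookie set by an
# earlier response is sent automatically on later requests
with requests.Session() as s:
    s.get('https://example.com/login')         # hypothetical: server sets a cookie here
    print(s.cookies.get_dict())                # inspect what was set
    r = s.get('https://example.com/api/data')  # the cookie is sent back automatically
    print(r.status_code)

# if you captured cookies from the browser instead, you can inject them:
# s.cookies.set('session_id', '<value copied from DevTools>')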
