 dgn2
|
|
Total Posts: 2077 |
Joined: May 2004 |
|
|
This isn't really the right place for this question, but can anyone explain to me why Selenium does not allow me to 'see' the "Download Holdings (Excel)" link on the following page?
https://api.kurtosys.io/tools/ksys317/fundProfile/ZAG?locale=en-CA#holdings
or:
https://www.bmo.com/gam/ca/advisor/products/etfs?fundUrl=/fundProfile/ZAG!hash!holdings#fundUrl=%2FfundProfile%2FZAG%23holdings
How can I find that "Download Holdings (Excel)" link using Python?
|
...WARNING: I am an optimal f'er |
|
|
 |
 dgn2
|
|
Total Posts: 2077 |
Joined: May 2004 |
|
|
I think I need to switch to the frame that holds the holdings table, but I can't seem to find the right selector. Something like this should work, but I can't get it right:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

sleep_time = 60
fund_holdings_url = 'https://api.kurtosys.io/tools/ksys317/fundProfile/ZAG?locale=en-CA#holdings'

driver = webdriver.Chrome('C:/Users/dgn2/chromedriver/chromedriver.exe')
driver.get(fund_holdings_url)
time.sleep(sleep_time)  # give the page's scripts time to render the holdings
# note: find_element_by_id takes a bare id, not a CSS selector like "#fundEtfFrame"
driver.switch_to.frame(driver.find_element_by_id('fundEtfFrame'))
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
driver.quit()
Anyone have any ideas?
Thanks |
...WARNING: I am an optimal f'er |
|
 |
|
 WayconKidd
|
Not sure what's going on there. The markup Selenium sees (driver.page_source) looks nothing like what curl gets if I run the request from a terminal -- the latter does contain the a tag that needs to be clicked, but that's not very useful given it's not _in_ the browser, and it's Javascripty.
Why not do this (quick and dirty but you get the idea):
import json

import requests

d = json.loads(requests.get('https://api.kurtosys.io/entity/get?_api_key=EA87F964-5384-484A-954B-18C394A23126&_user_token=80AC692F-1E69-47FF-B0DC-91BC94DF29C2&entitytypeid=4617&getallforclient=true&maxresults=1000&admindata=0').text)['data']['value']

lookingFor = 'ZAG'
found = None
for id, entry in d.items():
    for idInfo in entry['identifiers'].values():
        if idInfo['identifier'] == lookingFor:
            found = id
            break
    if found is not None:
        break

if found is not None:
    print('download link: https://api.kurtosys.io/tools/ksys317/api/download/csv/holding/%s?fileName=%s_Holdings.csv&locale=en-CA' % (found, lookingFor))
|
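And if you want to grab the file in the same script rather than pasting the link into a browser, something like this should work (a quick sketch; it assumes the CSV endpoint answers a plain GET, and the output filename is arbitrary):

import requests

# 'found' and 'lookingFor' as in the snippet above; the URL format is the one printed there
csv_url = ('https://api.kurtosys.io/tools/ksys317/api/download/csv/holding/%s'
           '?fileName=%s_Holdings.csv&locale=en-CA' % (found, lookingFor))

resp = requests.get(csv_url)
resp.raise_for_status()
with open('%s_Holdings.csv' % lookingFor, 'wb') as f:
    f.write(resp.content)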
|
|
|
 |
 dgn2
|
|
Total Posts: 2077 |
Joined: May 2004 |
|
|
Thanks! I will do that - and if I figure out what is going on with the frame I will come back and post. |
...WARNING: I am an optimal f'er |
|
 |
 dgn2
|
|
Total Posts: 2077 |
Joined: May 2004 |
|
|
I was actually scraping all of the BMO ETF holdings and historical NAVs/index values to do a quick comparison against some overlap in iShares products and that JSON has all of the fund IDs that I needed to build the URLs for files I wanted to download.
@WayconKidd: How did you figure out how to get that JSON (i.e., what process did you use to figure out how to use the API)? Seems like a much more robust way to scrape, but I have no idea how you did that. |
...WARNING: I am an optimal f'er |
|
|
 |
 prikolno
|
|
Total Posts: 90 |
Joined: Jul 2018 |
|
|
In Chrome: right click > "Inspect" > Network tab > Reload the page.
There you will see the problem you're experiencing: there are 87 requests when I hit the page. The static HTML page you're requesting (corresponding to the first request) does not have any data in that iframe element on initial paint; the page contains a script that runs afterwards, which in turn makes a private API call to fetch that data and render it dynamically.
I use this whenever a site has a private API for fetching data, e.g. if you want to save a video from a streaming site to disk but there's no download button.
It takes a bit of guesswork, but with some experience you can quickly ignore most of the static assets (*.jpg). If you scroll down you will see it hits the `get?_api_key=` request: this looks like it might fetch data. To confirm, on the right panel, click the "Response" tab to see the data returned.
Sometimes it's not so obvious and you still have to set your cookies with a library like cookiejar for the request to work correctly - you can click the "Headers" tab to reverse engineer this part of the private API. |
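For instance, once you've found the request in the Network tab, you can replay it outside the browser. A rough sketch using the query parameters from the captured URL posted above (the User-Agent header is a made-up example of the kind of thing you might need to copy from the "Headers" tab):

import requests

# query parameters taken from the captured `get?_api_key=` URL above
params = {
    '_api_key': 'EA87F964-5384-484A-954B-18C394A23126',
    '_user_token': '80AC692F-1E69-47FF-B0DC-91BC94DF29C2',
    'entitytypeid': '4617',
    'getallforclient': 'true',
    'maxresults': '1000',
    'admindata': '0',
}
# illustrative only - some sites also check for a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get('https://api.kurtosys.io/entity/get', params=params, headers=headers)
resp.raise_for_status()
data = resp.json()['data']['value']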
|
|
 |
 dgn2
|
|
Total Posts: 2077 |
Joined: May 2004 |
|
|
@prikolno: Thanks - that is extremely helpful! I will definitely try this approach to explore the private API call. |
...WARNING: I am an optimal f'er |
|
|
 |
 prikolno
|
|
Total Posts: 90 |
Joined: Jul 2018 |
|
|
No problem. Have fun. Also I agree with @WayconKidd: you shouldn't use Selenium for web scraping; it's more of an automation tool for UI testing, like Cypress. requests + bs4 should be sufficient. |
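For most pages the pattern is just this (a minimal sketch; as discussed above, it only helps when the data is actually in the static HTML rather than rendered by script):

import requests
from bs4 import BeautifulSoup

# fetch the static HTML and parse it
html = requests.get('https://api.kurtosys.io/tools/ksys317/fundProfile/ZAG?locale=en-CA').text
soup = BeautifulSoup(html, 'html.parser')

# e.g. list every link present in the static markup
for a in soup.find_all('a', href=True):
    print(a['href'])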
|
|
 |
|
 WayconKidd
|
@dgn2:
I did basically what prikolno described with the Network tab inside Chrome's DevTools. I was watching it to see whether the page was doing any additional data loading; once I realized the download link contained one unique identifier, I went looking for that identifier in the other requests. |
|
|
|
 |
 dgn2
|
|
Total Posts: 2077 |
Joined: May 2004 |
|
|
It used to be - say five years ago - that any time a site's js made web scraping difficult, I could use Selenium and get what I needed very quickly. These days this only seems to work for me about 30% of the time. For a long time I have used historical index constituents and ETF holdings to define the instrument universes I feed my PA trading systems, and a lot of that data has been scraped over the years, so I guess I have to invest some time in upgrading my rudimentary skills. Thanks to each of you for taking the time to help me out. |
...WARNING: I am an optimal f'er |
|
 |
|
And, just to add, within requests you should learn to use sessions and cookies.
I've seen cookiejar used: https://stackoverflow.com/questions/6878418/putting-a-cookie-in-a-cookiejar
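A minimal sketch of the Session version (a Session carries cookies between requests automatically, so you usually don't need to touch cookiejar directly):

import requests

session = requests.Session()

# the first request stores whatever cookies the site sets
session.get('https://www.bmo.com/gam/ca/advisor/products/etfs')
print(session.cookies.get_dict())

# later requests on the same session send those cookies back automatically
resp = session.get('https://api.kurtosys.io/tools/ksys317/fundProfile/ZAG?locale=en-CA')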
|
|
|
|
 |