Michael F. Stemper
2024-03-11 14:11:24 UTC
Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
download some pages. A simplified version, showing the problem, is:
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden
Username/Password Authentication Failed.
$
Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.
Despite this, I tried telling wget to provide empty username and
password, with no observable change in results.
On a purely cargo-cult basis, I tried some different user agent
strings, with no effect.
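A side note on the quoting in the script above: in POSIX shells, single quotes prevent parameter expansion, so -U '$uas' would hand wget the literal text $uas rather than the Chrome user-agent value. Whether that has anything to do with the 401 is not established here. The sketch below only demonstrates the quoting difference, using a short hypothetical value in place of the full string.

```shell
# Hypothetical short value, standing in for the full user-agent string.
uas="Mozilla/5.0 (example)"

# Single quotes suppress expansion: the literal text $uas is kept.
single='$uas'
# Double quotes expand the variable and keep the value as one word.
double="$uas"

printf 'single-quoted: %s\n' "$single"
printf 'double-quoted: %s\n' "$double"
```

If the quoting matters, the corresponding change in the script would be to write -U "$uas" so that wget receives the intended string.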
I searched on "401 HTTP Forbidden", only to find that there does
not appear to be such an error. There are "401 Unauthorized" and
"403 Forbidden", but no such cross-breed.
I looked briefly at the page source (in Firefox), but without a
top-level design document, couldn't make head or tail of it.
Does anybody have any suggestions on how to fix my problem and
again automatically download this, and neighboring, pages?
--
Michael F. Stemper
Indians scattered on dawn's highway bleeding;
Ghosts crowd the young child's fragile eggshell mind.