Discussion:
Unable to wget some pages
Michael F. Stemper
2024-03-11 14:11:24 UTC
Permalink
Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
download some pages. A simplified version, showing the problem, is:

$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden

Username/Password Authentication Failed.
$

Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.

Despite this, I tried telling wget to provide empty username and
password, with no observable change in results.

On a purely cargo-cult basis, I tried some different user agent
strings, with no effect.

I searched on "401 HTTP Forbidden", only to find that there does
not appear to be such an error. There is "401 Unauthorized", and
"403 Forbidden", but no such cross-breed.

I looked briefly at the page source (in Firefox), but without a
top-level design document, couldn't make head or tail of it.

Does anybody have any suggestions on how to fix my problem and
again automatically download this, and neighboring, pages?
--
Michael F. Stemper
Indians scattered on dawn's highway bleeding;
Ghosts crowd the young child's fragile eggshell mind.
Dan Purgert
2024-03-11 14:27:09 UTC
Permalink
Post by Michael F. Stemper
Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden
Username/Password Authentication Failed.
$
Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.
Looks like the page *does* have a login button / javascript thing
"somewhere" (at least I can see it when I open the page in lynx here).
I'd imagine either

(1) wget is respecting some robots.txt somewhere OR
(2) wget is following that login link for some reason

The "401" is the error code. The "HTTP Forbidden" is (for lack of a
better word) "custom text" they're supplying. I've done similar where a
HTTP upload process sends back "200 OK, Got it!" as a proof-of-sanity
when scripting things with expect.
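To expand on that: an HTTP status line has the shape "HTTP-version SP
status-code SP reason-phrase", and the reason phrase is free-form text the
server chooses, which is why "401 HTTP Forbidden" can show up even though
it matches no standard name. A small shell sketch splitting a status line
(the example line mirrors the wget output quoted above):

```shell
#!/bin/sh
# Split an HTTP status line into its numeric code and its
# free-form reason phrase. Only the code is standardized.
status_line='HTTP/1.1 401 HTTP Forbidden'

code=$(printf '%s\n' "$status_line" | awk '{print $2}')
reason=$(printf '%s\n' "$status_line" | cut -d' ' -f3-)

echo "$code"    # 401
echo "$reason"  # HTTP Forbidden
```

Scripts should branch on the numeric code and treat the phrase as
display-only text.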
--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
Josef Möllers
2024-03-11 16:08:59 UTC
Permalink
Post by Dan Purgert
Post by Michael F. Stemper
Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden
Username/Password Authentication Failed.
$
Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.
Looks like the page *does* have a login button / javascript thing
"somewhere" (at least I can see it when I open the page in lynx here).
I'd imagine either
(1) wget is respecting some robots.txt somewhere OR
(2) wget is following that login link for some reason
The "401" is the error code. The "HTTP Forbidden" is (for lack of a
better word) "custom text" they're supplying. I've done similar where a
HTTP upload process sends back "200 OK, Got it!" as a proof-of-sanity
when scripting things with expect.
Besides that ... is it on purpose that $uas is between single quotes, so
it won't get expanded? Double quotes are required because the user agent
string has blanks (and parentheses), but single quotes are definitely
wrong here!
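A quick demonstration of the difference, detached from the site itself
(the variable name matches the script above; the value is shortened
for illustration):

```shell
#!/bin/sh
# Single quotes suppress expansion: the literal four characters $uas
# are passed. Double quotes expand the variable while keeping its
# blanks and parentheses together as one argument.
uas="Mozilla/5.0 (test)"

single=$(echo '$uas')   # the literal string: $uas
double=$(echo "$uas")   # the value: Mozilla/5.0 (test)

echo "$single"
echo "$double"
```

So with the single quotes, wget has been sending the server a User-Agent
of literally "$uas" all along.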

Josef "2cts" Möllers
Michael F. Stemper
2024-03-11 19:01:44 UTC
Permalink
Post by Michael F. Stemper
Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
Besides that ... is it on purpose that $uas is between single quotes, so it won't get expanded? Double quotes are required because the user agent string has blanks (and parentheses), but single quotes are definitely wrong here!
Interesting. All this time, I've been sending a User Agent string
of $uas, and it's worked.

If I recall my thinking from six years ago, I had used single quotes
because I used double quotes in the definition of the variable. Now
that I look back, that was pretty obviously wrong, since the *value*
of the variable doesn't have any quotes in it.

However, changing from single to double didn't help. (I'm guessing
that you didn't expect that it would.)

Single versus double quotes always get me tangled up.
--
Michael F. Stemper
There's no "me" in "team". There's no "us" in "team", either.
Michael F. Stemper
2024-03-11 19:05:01 UTC
Permalink
Post by Dan Purgert
Post by Michael F. Stemper
Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden
Username/Password Authentication Failed.
$
Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.
Looks like the page *does* have a login button / javascript thing
"somewhere" (at least I can see it when I open the page in lynx here).
I've never installed lynx. Is it capable of running as a background
process, e.g., via crontab?
Post by Dan Purgert
I'd imagine either
(1) wget is respecting some robots.txt somewhere OR
(2) wget is following that login link for some reason
Any ideas how I could test for, or prevent, either of these?
--
Michael F. Stemper
There's no "me" in "team". There's no "us" in "team", either.
Dan Purgert
2024-03-12 09:24:24 UTC
Permalink
Post by Michael F. Stemper
Post by Dan Purgert
Post by Michael F. Stemper
Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23-- https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden
Username/Password Authentication Failed.
$
Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.
Looks like the page *does* have a login button / javascript thing
"somewhere" (at least I can see it when I open the page in lynx here).
I've never installed lynx. Is it capable of running as a background
process, e.g., via crontab?
Not that I'm aware of, sorry.
Post by Michael F. Stemper
Post by Dan Purgert
I'd imagine either
(1) wget is respecting some robots.txt somewhere OR
(2) wget is following that login link for some reason
Any ideas how I could test for, or prevent, either of these?
Potentially adding "-e robots=off" will avoid #1. More verbosity (-v) or
turning on headers (-S?) may help for both as well.

But both of these were a bit of a stab in the dark.
--
|_|O|_|
|_|_|O| Github: https://github.com/dpurgert
|O|O|O| PGP: DDAB 23FB 19FA 7D85 1CC1 E067 6D65 70E5 4CE7 2860
Michael F. Stemper
2024-03-12 14:14:22 UTC
Permalink
Post by Dan Purgert
Post by Michael F. Stemper
Post by Dan Purgert
Post by Michael F. Stemper
Late last week, a script that I have used for several years suddenly
Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.
Looks like the page *does* have a login button / javascript thing
"somewhere" (at least I can see it when I open the page in lynx here).
I'd imagine either
(1) wget is respecting some robots.txt somewhere OR
(2) wget is following that login link for some reason
Any ideas how I could test for, or prevent, either of these?
Potentially adding "-e robots=off" will avoid #1. More verbosity (-v) or
turning on headers (-S?) may help for both as well.
No joy from robots=off, and wget's man page says that -v is the default.

But, I just tried with curl, and think that I've found a clue. Included
in what it downloaded was:
"Please enable JS and disable any ad blocker"

I'm not sure if it's possible for wget to fake having javascript, but
it seems as if that's the next place to look.
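In the meantime, the script could at least detect the block instead of
saving the interstitial as if it were data. A hedged sketch; is_js_wall
is a hypothetical helper keyed to the exact string curl returned above:

```shell
#!/bin/sh
# Return success if the downloaded content looks like the
# "enable JS / disable ad blocker" interstitial rather than
# the real page.
is_js_wall() {
  case $1 in
    *"Please enable JS"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (untested against the live site):
#   page=$(curl -sS -A "$uas" "https://www.marketwatch.com/investing/index/spx")
#   is_js_wall "$page" && echo "blocked: site wants JavaScript enabled"
```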
--
Michael F. Stemper
This sentence no verb.
Paul
2024-03-11 16:46:50 UTC
Permalink
Post by Michael F. Stemper
Late last week, a script that I have used for several years suddenly
stopped working. Investigation showed that wget was failing to
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23--  https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden
Username/Password Authentication Failed.
$
Looking at the error message, one might think that this page/site
requires user login credentials. However, the same URL works just
fine in Firefox, with no login requested or required.
Despite this, I tried telling wget to provide empty username and
password, with no observable change in results.
On a purely cargo-cult basis, I tried some different user agent
strings, with no effect.
I searched on "401 HTTP Forbidden", only to find that there does
not appear to be such an error. There is "401 Unauthorized", and
"403 Forbidden", but no such cross-breed.
I looked briefly at the page source (in Firefox), but without a
top-level design document, couldn't make head or tail of it.
Does anybody have any suggestions on how to fix my problem and
again automatically download this, and neighboring, pages?
Almost like there's a mix-up somewhere between https:// and
http:// in the operation, with the website denying http:// access.

Maybe at some point the website used to redirect the http://
attempt to https:// for you, and maybe it's not doing that
any more?

Or perhaps wget has developed a defect related to that aspect.

Paul
Michael F. Stemper
2024-03-11 18:51:49 UTC
Permalink
Post by Paul
Post by Michael F. Stemper
$ cat ic
uas="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
wget "https://www.marketwatch.com/investing/index/spx" -U '$uas' -O SP500
$ . ./ic
--2024-03-11 08:52:23--  https://www.marketwatch.com/investing/index/spx
Resolving www.marketwatch.com (www.marketwatch.com)... 13.227.37.29, 13.227.37.70, 13.227.37.8, ...
Connecting to www.marketwatch.com (www.marketwatch.com)|13.227.37.29|:443... connected.
HTTP request sent, awaiting response... 401 HTTP Forbidden
Username/Password Authentication Failed.
$
Almost like there's a mix-up somewhere between https:// and
http:// in the operation, with the website denying http:// access.
Maybe at some point the website used to redirect the http://
attempt to https:// for you, and maybe it's not doing that
any more?
Sorry, but I don't follow this. The URL that I show above is
https:// not http:// and that's also what the output of wget
shows as the URL.

What is the source for your suspicion that it's really doing
http:// under the covers?
--
Michael F. Stemper
There's no "me" in "team". There's no "us" in "team", either.