How to crawl a site behind basic authentication (CredentialStore/HttpAuthenticationCredential ends up with 401) #662

@danijanos

Dear Heritrix3 Community,

Thank you for this great tool! I am using version 3.10.0 and would appreciate help with the following issue.

I need to crawl the previous version of a site that has undergone a major upgrade. The developers moved the old site to a domain that sits behind HTTP basic authentication. (Every outgoing request includes an Authorization header, which supplies the basic authentication credentials as the Base64-encoded username and password granted by the site administrators.)
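
For reference, the header value described above can be reproduced with a few lines of code. This is only an illustrative Python sketch, not Heritrix code: the function name `basic_auth_header` is hypothetical, and the login and password are the placeholder values used in this report.

```python
import base64


def basic_auth_header(login: str, password: str) -> str:
    """Return the value of the Authorization header for HTTP basic auth.

    Hypothetical helper for illustration: the credentials are joined with
    a colon, encoded as UTF-8 (an assumption; servers may expect another
    charset), Base64-encoded, and prefixed with the "Basic" scheme.
    """
    raw = f"{login}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Placeholder credentials, not real ones.
    print("Authorization:", basic_auth_header("myloginname", "passwordformyloginname"))
```

Comparing this value with the Authorization header of a request that succeeds outside the crawler is one way to confirm that the login and password themselves are correct.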

I configured the job as described in the docs, so the crawl configuration has these two beans for basic authentication:

<bean id="credentialStore" class="org.archive.modules.credential.CredentialStore">
   <property name="credentials">
     <map>
       <entry key="OLDSiteLoginCredential" value-ref="OLDSiteLoginCredential"/>
     </map>
   </property>
</bean>

<bean id="OLDSiteLoginCredential" class="org.archive.modules.credential.HttpAuthenticationCredential">
   <property name="domain" value="https://old.site.edu:443"/>
   <property name="realm" value="oldsiterealm"/>
   <property name="login" value="myloginname"/>
   <property name="password" value="passwordformyloginname"/>
</bean>

But every time I build and launch the job, it finishes right away after only the DNS lookup and two 401 responses, one for the main page URL and one for robots.txt:

401        381 https://old.site.edu/ - - text/html #001
401        381 https://old.site.edu/robots.txt P https://old.site.edu/ text/html #001
1          51  dns:old.site.edu P https://old.site.edu/ text/dns #001

Could you please help me identify what I am doing wrong here, or point me to the right way to configure this?
Thanks a lot!
