del-packURLs is a web security automation tool designed to enhance information disclosure vulnerability discovery, particularly in bug hunting scenarios. It streamlines the process of extracting potentially sensitive files and information. The tool leverages the Wayback Machine CDX API to retrieve archived URLs for a given target domain, filtering for specific file extensions (e.g., apk, dll, exe, json, txt, pdf, zip, etc.). A key feature is its '200 History Mode', which identifies when these files were accessible with a 200 OK status code, addressing the challenge of locating resources that are currently unavailable (404 Page Not Found). This automation aims to improve efficiency for bug hunters by providing direct terminal access to this historical data, which fits the terminal-centric workflow of Linux users. Furthermore, the tool can integrate with AI models (Gemini, Claude, GPT) to provide intelligent suggestions on potentially sensitive PDF files. The tool also supports concurrency.
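To make the '200 History Mode' idea concrete, here is a minimal Go sketch of the kind of CDX query the tool relies on. This is an illustration, not del-packURLs' actual implementation: the query parameters (`fl`, `filter`, `collapse`) follow the public Wayback CDX server API, while the target domain and extension list are placeholders.

```go
// Minimal sketch (not del-packURLs' actual code): query the Wayback Machine
// CDX API for a target domain, keep only captures that returned 200 OK, and
// filter client-side for a few interesting file extensions.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	target := "example.com" // placeholder target domain
	exts := []string{".pdf", ".json", ".zip", ".txt"}

	q := url.Values{}
	q.Set("url", target+"/*")
	q.Set("fl", "timestamp,original,statuscode") // fields to return
	q.Set("filter", "statuscode:200")            // only captures that were 200 OK
	q.Set("collapse", "urlkey")                  // de-duplicate repeated captures of the same URL

	resp, err := http.Get("https://web.archive.org/cdx/search/cdx?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Each line looks like: "20200101123456 https://example.com/backup.zip 200".
	// The timestamp tells you when the file was last seen live, even if the
	// original URL 404s today.
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		for _, ext := range exts {
			if strings.HasSuffix(strings.ToLower(fields[1]), ext) {
				fmt.Println(line)
				break
			}
		}
	}
}
```

Any hit can then be opened as an archived copy at `https://web.archive.org/web/<timestamp>/<original URL>`.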
The main reason for developing this was so that pentesters could efficiently perform as many information disclosure vulnerability-finding tasks as possible from the terminal. When I watched Lossec's video (or rather, I saw this reel first, then watched Lossec's video a few months later, and even observed some bug hunters), I noticed that most people perform these tasks manually instead of using the terminal. What kind of Linux user abandons the terminal to work manually? I also saw that when they encounter a '404 Page not found' error, they manually go to each link and enter it into the web archive to see when it was live (200 Status OK). With so many links, the user won't know which one is good and sensitive, so I developed a solution that fetches from the terminal, shows live archived links in the terminal itself, and provides AI recommendations of sensitive PDFs. I know that people can fetch using the curl command, but I used Golang to make it a bit faster. One thing to note here is that if the internet speed is fast while fetching with curl and slow while using Go, curl's result will come sooner, even though Go has good performance. But I thought, why use curl when I have the standard library? Performance + Fast Internet Speed 🗿
- Git clone the repo:
  `git clone https://github.com/gigachad80/del-packURLs`
- Go to the del-packURLs directory and give execute permission to `main.go`, or you can directly build from source (`go build del-packURLs.go`).
- Run the command `./del-packURLs`. Please note that you can either use the whole syntax, like `./del-packURLs -domain example.com` plus the rest of the flags, or just type `./del-packURLs` and it'll ask for the domain & extension. Enter your target domain/URL and flags and run it.
- For help or the menu guide, enter `./del-packURLs -h`
| grep-backURLs | del-packURLs |
|---|---|
| Uses keyword from keyword.txt to find sensitive data | Uses Wayback CDX API and pre-defined keywords to find sensitive files |
| Finds all URLs | Finds only files |
| Does not use AI | Uses AI models like Gemini, Claude, GPT to suggest sensitive PDFs for analysis |
Fun fact: I developed both of them. 🤓 My repo for grep-backURLs: Repo link
- Add `-load` flag in syntax.
- Update README.md with demo syntax to use.
- Add `Back to Main Menu` functionality.
- Add more keywords in `sort-keywords.py` for sensitive docs.
- Support for VirusTotal & AlienVault to fetch URLs, just like the CDX API.
- Not sure, but if possible, I'll integrate AI file analysis (for image, text, pdf, etc.)
Note
- PDF Suggestion & Analysis: AI will only recommend sensitive PDFs, while PyMuPDF analyzes them.
- Decode URLs: Check line 219 of the Go file if you need decoded URLs to fetch (a tiny decoding sketch follows these notes).
- Requirements txt: pip installs all AI models by default, so if you want to use a single AI model, install only that one.
- Modify prompt: Check line 106 of `ai-suggestor.py` to modify the prompt for suggestions.
- Python: It uses `python` for Windows & `python3` for Linux.
- AI testing: I have only tested Gemini so far, because ChatGPT and Claude's API keys are not free, that's why.
- Starting Download: Script shows "Downloading: [URL]" first.
- File Not Found: "Not Found (404): [URL]" means URL is broken/removed.
- Download Error: "Error downloading [URL]: [error details]" indicates a network issue.
- PDF Processing Error: "Error processing PDF [URL]: [error]" means the file isn't a valid PDF, or it is a PDF that the PyMuPDF library can't analyse.
- No Keywords: "No sensitive keywords found in: [URL]" means PDF text lacks defined terms.
- Keywords Found: "Found keywords: [keywords] in [URL]" means terms were detected in PDF.
- Keyword Found Color: Green output indicates keywords were successfully found.
- Error Colors: Red output signals download or processing errors.
- Ctrl+C with Concurrency: Ctrl+C will not stop immediately with concurrency (esp. with `sort-keywords.py`); it will finish processing & analyzing all PDFs first.
- Output File: Sensitive URLs with keywords are saved to `sorted-keywords.txt`.
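As referenced in the Decode URLs note above, here is a tiny Go illustration of what decoding an archived URL means. It is not the tool's actual code, and the sample URL is made up.

```go
// Tiny illustration (not del-packURLs' code) of URL decoding: percent-encoded
// characters in an archived URL are turned back into their literal form
// before fetching.
package main

import (
	"fmt"
	"net/url"
)

func main() {
	raw := "https://example.com/files/annual%20report.pdf" // %20 is an encoded space
	decoded, err := url.PathUnescape(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(decoded) // https://example.com/files/annual report.pdf
}
```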
Tip
- Concurrency Impact: Concurrency (the "yes" flag) can speed up checks; see the worker-pool sketch after these tips for how such concurrent checks typically work.
- Use grep for sorting Found Keyword(s) from the `sorted-keywords.txt` file.
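For context, here is a simplified Go worker-pool sketch of how concurrent URL status checks are commonly done. It is illustrative only, not del-packURLs' actual code; the worker count and sample URLs are arbitrary, and in practice the URLs would come from the CDX results.

```go
// Simplified worker-pool sketch of concurrent URL status checks in Go.
// Illustrative only; not the tool's exact implementation.
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	urls := []string{ // placeholder input
		"https://example.com/backup.zip",
		"https://example.com/config.json",
	}

	jobs := make(chan string)
	var wg sync.WaitGroup

	const workers = 5 // number of concurrent checkers
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				resp, err := http.Head(u) // cheap status-only check
				if err != nil {
					fmt.Println("error:", u, err)
					continue
				}
				resp.Body.Close()
				fmt.Println(resp.StatusCode, u)
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}
```

Because the workers drain the whole job channel before the program exits, an interrupt mid-run behaves much like the Ctrl+C note above: in-flight work tends to finish before everything stops.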
First, I decided to use both the Web Archive CDX API and Waybackpack (one for fetching and one for showing the 200 status of archived URLs). However, after trying a lot, Waybackpack didn't work. Then, one day, an idea suddenly came to me: why not just do it normally using the CDX API, which would show timestamps, status codes, and URLs? After modifying it a bit, it easily showed all the archived URLs that once had a 200 OK status code but are currently 404. So, even though I didn't end up using Waybackpack, it was my initial approach: 'pack' refers to Waybackpack, and 'del' refers to deleted (404 Page Not Found). That's why I named it del-packURLs.
⌚ Total time taken in development, testing, trying different approaches & variations, debugging, and even writing the README:
Approx 18 hr 10 min
I extend my sincere gratitude to both IHA org. & CoffinXP for creating the video. This project simply wouldn't exist if they hadn't created it.
- Lossec aka CoffinXP and his video for inspiration.
- IHA 089 for IG Reel
📧 Email: [email protected]
Licensed under GNU General Public License v3.0
🕒 Last Updated: April 4, 2025