Commit c9eff0f

feat: add sourcegraph query.
1 parent 705057c commit c9eff0f

4 files changed, +104 -3 lines changed

.gitignore

+3
@@ -1,2 +1,5 @@
 .env
 .idea
+.task
+data/*.jsonl
+data/*.csv

README.md

+51-2
@@ -1,7 +1,16 @@
 # non-npm-package-json-files
 Get a collection of package.json files for non-NPM packages
 
-# Requirements
+## Why?
+
+We needed `package.json` files from real projects that aren't packages published to NPM. While
+NPM can tell you the absolute usage of NPM packages in terms of download numbers, we were
+interested in the set of dependencies that people were using together in a given project.
+
+More details on how we used a sample of these package.json files to run a simulation for StackAid:
+[StackAid in Beta](https://www.stackaid.us/blog/stackaid-in-beta)
+
+## Requirements
 
 * [Brew](https://brew.sh) (MacOSX)
 * [Task](https://taskfile.dev)
@@ -16,7 +25,7 @@ On MacOS:
 brew install go-task/tap/go-task && task brew:requirements
 ```
 
-# Get Sourcegraph Access Token
+## Get a Sourcegraph access token
 
 Use the `src` CLI to see if you're authenticated:
 ```shell
@@ -34,3 +43,43 @@ SRC_ACCESS_TOKEN=<your access token>
 Once configured correctly, rerun `src:login` task to confirm your
 configuration.
 
+## Query Sourcegraph
+
+To query for all package.json files on GitHub that aren't in `node_modules` or directories such
+as `test`, `fixture` or `examples`:
+
+```shell
+task src:query
+```
+
+The command takes about 1 minute and returns just over 1M results. The results are written to
+`./data/src_github_results.jsonl` and should look like this:
+
+```json lines
+{"type":"path","path":"package.json","repository":"freeCodeCamp/freeCodeCamp","branches":[""],"commit":"382717cce4ea5593eb623ba5ef0bd47c534411d1"}
+{"type":"path","path":"web/package.json","repository":"freeCodeCamp/freeCodeCamp","branches":[""],"commit":"382717cce4ea5593eb623ba5ef0bd47c534411d1"}
+{"type":"path","path":"curriculum/package.json","repository":"freeCodeCamp/freeCodeCamp","branches":[""],"commit":"382717cce4ea5593eb623ba5ef0bd47c534411d1"}
+{"type":"path","path":"tools/crowdin/package.json","repository":"freeCodeCamp/freeCodeCamp","branches":[""],"commit":"382717cce4ea5593eb623ba5ef0bd47c534411d1"}
+{"type":"path","path":"tools/scripts/seed/package.json","repository":"freeCodeCamp/freeCodeCamp","branches":[""],"commit":"382717cce4ea5593eb623ba5ef0bd47c534411d1"}
+```
+
+To convert the file to a CSV:
+
+```shell
+task src:query:csv
+```
+
+The results will be in `./data/src_github_results.csv` and should look like this:
+
+```csv
+repo,commit_sha,path
+freeCodeCamp/freeCodeCamp,382717cce4ea5593eb623ba5ef0bd47c534411d1,package.json
+freeCodeCamp/freeCodeCamp,382717cce4ea5593eb623ba5ef0bd47c534411d1,web/package.json
+freeCodeCamp/freeCodeCamp,382717cce4ea5593eb623ba5ef0bd47c534411d1,curriculum/package.json
+freeCodeCamp/freeCodeCamp,382717cce4ea5593eb623ba5ef0bd47c534411d1,tools/crowdin/package.json
+freeCodeCamp/freeCodeCamp,382717cce4ea5593eb623ba5ef0bd47c534411d1,tools/scripts/seed/package.json
+```
+
+## Debug Sourcegraph query
+
+Try the [query](https://sourcegraph.com/search?q=context:global+file:%28%5E%7C/%29package.json%24+fork:no+-file:%28%5E%7C/%29%5C.+-file:%28%5E%7C/%29%28node_modules%7Ctest%7Ctests%7Cfixture%7Cfixtures%7Cexamples%29/+count:all+archived:no+-file:%28%5E%7C/%29vendor/+&patternType=standard) on Sourcegraph!
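
Review note: the debug link in the README is a percent-encoded form of the search terms. The sketch below shows one way such a link could be built from the `SRC_QUERY` terms in `Taskfile.yaml`, so the README link and the task query are less likely to drift apart. The function name `build_search_url` is hypothetical, and the URL layout (`context:global`, `patternType=standard`) is copied from the README link rather than taken from Sourcegraph documentation. The resulting URL is not claimed to be byte-identical to the README link, which lists the terms in a different order.

```python
from urllib.parse import quote

# Query terms copied from the SRC_QUERY variable in Taskfile.yaml.
SRC_QUERY_TERMS = [
    r"file:(^|/)package.json$",
    "fork:no",
    "archived:no",
    r"-file:(^|/)\.",
    r"-file:(^|/)(node_modules|test|tests|fixture|fixtures|examples|vendor)/",
    "count:all",
]


def build_search_url(terms, base="https://sourcegraph.com/search"):
    """Build a search URL in the same style as the README debug link.

    Assumption: the URL shape is inferred from the README link only.
    """
    # Percent-encode each term, keeping ':' and '/' readable, and join with '+'.
    encoded = "+".join(quote(term, safe=":/") for term in terms)
    return f"{base}?q=context:global+{encoded}&patternType=standard"


if __name__ == "__main__":
    print(build_search_url(SRC_QUERY_TERMS))
```

Running the script prints a link that can be pasted into a browser to inspect the query interactively.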

Taskfile.yaml

+50-1
@@ -2,11 +2,17 @@
 
 version: '3'
 
+vars:
+  DATA_DIR: ./data
+  SRC_RESULTS_JSONL: "{{.DATA_DIR}}/src_github_results.jsonl"
+  SRC_RESULTS_CSV: "{{.DATA_DIR}}/src_github_results.csv"
+
 dotenv:
   - .env
 
 tasks:
   brew:requirements:
+    desc: Install required utilities.
     cmds:
       - |-
         brew install \
@@ -16,5 +22,48 @@ tasks:
           sourcegraph/src-cli/src-cli \
           sqlite \
           xsv
+
   src:login:
-    - src login
+    desc: Test Sourcegraph CLI authentication.
+    cmds:
+      - src login
+
+  src:query:
+    desc: Query Sourcegraph for package.json files
+    summary: |
+      Query Sourcegraph for package.json files.
+
+      The Sourcegraph query asks for all package.json files, excluding files found in directories
+      such as node_modules, test, fixture, and examples. The returned results are filtered to keep
+      only GitHub repositories, and the repository field is reformatted to drop the host prefix.
+    cmds:
+      - |-
+        src search -stream -json '{{ .SRC_QUERY }}' \
+          | jq -c 'select(.type == "path") | select(.repository | test("^github.com"))' \
+          | jq -c '.repository = (.repository | sub("github.com/"; ""))' \
+          > {{ .SRC_RESULTS_JSONL }}
+    vars:
+      SRC_QUERY: >-
+        file:(^|/)package.json$
+        fork:no
+        archived:no
+        -file:(^|/)\.
+        -file:(^|/)(node_modules|test|tests|fixture|fixtures|examples|vendor)/
+        count:all
+    generates:
+      - "{{ .SRC_RESULTS_JSONL }}"
+
+  src:query:csv:
+    desc: Convert Sourcegraph query results into a CSV.
+    summary: |
+      Convert Sourcegraph query results into a CSV.
+    cmds:
+      - echo "repo,commit_sha,path" > {{ .SRC_RESULTS_CSV }}
+      - |-
+        jq -r '[.repository, .commit, .path] | @csv' {{ .SRC_RESULTS_JSONL }} \
+          | xsv fmt \
+          >> {{ .SRC_RESULTS_CSV }}
+    sources:
+      - "{{ .SRC_RESULTS_JSONL }}"
+    generates:
+      - "{{ .SRC_RESULTS_CSV }}"
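
Review note: the `src:query` and `src:query:csv` tasks chain three `jq` filters and `xsv fmt`. As a reading aid, here is a minimal Python sketch of what that pipeline appears to do: keep `path` results, keep GitHub repositories, strip the `github.com/` host prefix, and write `repo,commit_sha,path` rows. The function names are hypothetical and this is not the commit's implementation. One deliberate difference: the sketch matches the literal prefix `github.com/`, whereas the `jq` regex `^github.com` leaves the dot unescaped and does not require the trailing slash.

```python
import csv
import io
import json

GITHUB_PREFIX = "github.com/"


def filter_github_paths(lines):
    """Yield 'path' results hosted on GitHub, with the host prefix removed."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the stream
        record = json.loads(line)
        if record.get("type") != "path":
            continue  # mirrors: select(.type == "path")
        repo = record.get("repository", "")
        if not repo.startswith(GITHUB_PREFIX):
            continue  # stricter than the jq test("^github.com")
        record["repository"] = repo[len(GITHUB_PREFIX):]
        yield record


def to_csv(records):
    """Render records as CSV text with the header used by src:query:csv."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["repo", "commit_sha", "path"])
    for record in records:
        writer.writerow([record["repository"], record["commit"], record["path"]])
    return out.getvalue()


if __name__ == "__main__":
    sample = [
        '{"type":"path","path":"package.json",'
        '"repository":"github.com/freeCodeCamp/freeCodeCamp",'
        '"branches":[""],"commit":"382717cce4ea5593eb623ba5ef0bd47c534411d1"}',
    ]
    print(to_csv(filter_github_paths(sample)), end="")
```

The sample record is the first README example with the `github.com/` prefix restored, which is what the `jq` filter implies the raw `src search` output contains.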

data/.gitkeep

Whitespace-only changes.
