teamcity: GCE nightlies can be canceled by unstable perf artifact uploads #169950

@williamchoe3

Summary

GCE roachtest nightlies are failing during TeamCity artifact publication in roachtest. Publication happens after the cancel signal is sent, i.e. after the TC build has already started to exit because of a timeout. Example: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/21338895?expandBuildDeploymentsSection=false&hideTestsFromDependencies=false&hideProblemsFromDependencies=false&expandBuildTestsSection=false&expandBuildChangesSection=true&expandBuildProblemsSection=false

Canceled:
Stopping build because of repeated remote command failures: "Publishing 3276 files to ''" command failed 6 times: Failed to send one of [/home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpcc/multiregion/survive=region/chaos=true/run_1/1.

From the log

[04:55:08]W:	 [Publishing artifacts] Recoverable problem publishing artifacts (will retry): Failed to send one of [/home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpcc/multiregion/survive=region/chaos=true/run_1/1.perf/profiles/merged.allocs.pb.gz
...
/home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpcc/multiregion/survive=region/chaos=true/run_1/1.perf/profiles/merged.allocs.pb.gz]: file size on disk changed. Make sure no process updates files during upload.
[04:55:45]W:	 [Publishing artifacts] Artifacts publishing has been interrupted
[04:55:46] : Build canceled

The error message seems to imply that the artifacts are still being written while TeamCity is uploading them, although it could also be a problem with the file itself or something related. Either way, TC retries the upload 6 times and then cancels the build. The cancellation causes a retrigger (see https://cockroachlabs.enterprise.slack.com/archives/C03SG8QKYRJ/p1778187977692439?thread_ts=1778184313.601179&cid=C03SG8QKYRJ), but the retrigger should be resolved once the setting discussed there is changed.
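One way to test the "still being written" theory would be to snapshot the artifacts directory twice and see which files differ. The sketch below is only an illustration: `snapshot`, `changed_between`, and `wait_until_stable` are hypothetical helper names, not part of roachtest or TeamCity, and the polling interval is an arbitrary assumption.

```python
import os
import time
from pathlib import Path


def snapshot(root):
    """Map each regular file under root to its (size, mtime_ns)."""
    seen = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.stat()
            except FileNotFoundError:
                continue  # file vanished mid-walk; treat it as absent
            seen[str(path)] = (st.st_size, st.st_mtime_ns)
    return seen


def changed_between(before, after):
    """Return sorted paths that were added, removed, or modified."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))


def wait_until_stable(root, interval_s=5.0, max_rounds=12):
    """Poll root until two consecutive snapshots match.

    Returns [] once stable, or the paths still changing after max_rounds.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    prev = snapshot(root)
    changed = []
    for _ in range(max_rounds):
        time.sleep(interval_s)
        cur = snapshot(root)
        changed = changed_between(prev, cur)
        if not changed:
            return []
        prev = cur
    return changed
```

Running something like `wait_until_stable("artifacts")` on the agent just before publication would list any files still changing. Matching snapshots only show that nothing changed during one interval; they do not prove the writer has finished.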

Initially observed on GCE on 26.2; most likely occurring on all branches. Three arbitrarily chosen failing TC builds that were canceled and retried by TeamCity were each trying to upload a different test's artifacts, e.g.

artifacts/tpcc/multiregion/survive=region/chaos=true/run_1/1.perf/profiles/merged.allocs.pb.gz
artifacts/tpcc/multiregion/survive=zone/chaos=true/run_1/1.perf/profiles/merged.allocs.pb.gz
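
To compare offending paths across more builds, a small script could split each path into test name and file name. This is a rough sketch: the layout `artifacts/<test>/run_<n>/<node>.perf/profiles/<file>` is an assumption inferred only from the two paths above, and `parse_artifact` and `summarize` are hypothetical names.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the example paths in this issue:
#   artifacts/<test name>/run_<n>/<node>.perf/profiles/<file>
ARTIFACT_RE = re.compile(
    r"artifacts/(?P<test>.+?)/run_(?P<run>\d+)/"
    r"(?P<node>\d+)\.perf/profiles/(?P<file>[^/\]\s]+)"
)


def parse_artifact(path):
    """Return the path's components as a dict, or None if it does not match."""
    m = ARTIFACT_RE.search(path)
    return m.groupdict() if m else None


def summarize(paths):
    """Return (sorted test names, Counter of offending file names)."""
    tests = set()
    by_file = Counter()
    for p in paths:
        info = parse_artifact(p)
        if info is None:
            continue  # not a perf profile path under the assumed layout
        tests.add(info["test"])
        by_file[info["file"]] += 1
    return sorted(tests), by_file
```

Feeding it the paths from the "file size on disk changed" warnings of several builds would show whether the failures always involve the same file name across different tests.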

While the retrigger of these jobs should be resolved by adjusting the setting described in the Slack thread above, we still need to investigate why TC is failing to upload these artifacts in the first place. Alternative storage solutions such as GCS may need to be considered as well.

Metadata

Assignees

No one assigned

    Labels

    A-testeng-infra
    C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    T-testeng: TestEng Team
    branch-master: Failures and bugs on the master branch.
