|
|
Description
Change task timeout limit to 3 days
As ChromeOS CTS/GTS qualification suite may take at most 2 days to
complete, change the swarming client task timeout limit from 1 day to 3
days.
Relevant discussion:
https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=footer#!msg/chromeos-infra-discuss/AOQP2CD90oE/sc3D1jm_AAAJ
R=vadimsh@chromium.org
BUG=746327
Review-Url: https://codereview.chromium.org/2984773002
Committed: https://github.com/luci/luci-py/commit/710ff0d7f30b1e8813d2b15865fbed61e7deb80c
Patch Set 1 #
Total comments: 1
Patch Set 2 : fix error message/task timeout to 7 days/task expiration to 14 days #
Patch Set 3 : fix error message #
Messages
Total messages: 30 (13 generated)
Description was changed from ========== Change desk timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 3 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG= ========== to ========== Change task timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 3 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG= ==========
ddmail@google.com changed reviewers: + vadim sh. ihf@chromium.org - vadimsh@chromium.org
ddmail@google.com changed reviewers: + ihf@chromium.org, vadimsh@chromium.org - vadim sh. ihf@chromium.org
The CQ bit was checked by ddmail@google.com
The CQ bit was unchecked by ddmail@google.com
I think we need to increase the timeout for Chrome OS CTS to 7 days. The reason is we have jobs that can run 24-48h and we want to average out the work over the course of a week.
pwang@chromium.org changed reviewers: + maruel@chromium.org - ddmail@google.com
Hi Vadim, Ilja, Marc Please review a changelist from pwang: https://codereview.chromium.org/2984773002
Description was changed from ========== Change task timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 3 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG= ========== to ========== Change task timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 3 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG=746327 ==========
Hi, if there is a use case, I'm fine to extend the validity range.

That said, keep in mind there is a risk that we'll break runs when updating, as we generally wait a day for breaking changes. So while it'll generally work, it may fail on breaking Swarming bot API changes.

You forgot to update doc/User-Guide.md.

https://codereview.chromium.org/2984773002/diff/1/appengine/swarming/server/t...
File appengine/swarming/server/task_request.py (right):

https://codereview.chromium.org/2984773002/diff/1/appengine/swarming/server/t...
appengine/swarming/server/task_request.py:233: '%s (%ds) must be 0 or between %ds and one day' %
update
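For readers following the thread, the kind of range check behind the error message on line 233 can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the actual task_request.py code: the function name `validate_timeout`, the constant names, and the 30-second lower bound are all hypothetical and chosen only for illustration. Only the format of the message and the 3-day upper bound come from this review.

```python
# Hypothetical names and lower bound; not the real luci-py implementation.
_ONE_DAY_SECS = 24 * 60 * 60
_MIN_TIMEOUT_SECS = 30  # assumed lower bound, for illustration only
_MAX_TIMEOUT_SECS = 3 * _ONE_DAY_SECS  # the limit this CL raises from 1 day


def validate_timeout(name, value):
    """Returns value if it is 0 or within the allowed range, else raises."""
    if value != 0 and not (_MIN_TIMEOUT_SECS <= value <= _MAX_TIMEOUT_SECS):
        # The message has to change together with the limit, which is what
        # the inline review comment asks for.
        raise ValueError(
            '%s (%ds) must be 0 or between %ds and three days' %
            (name, value, _MIN_TIMEOUT_SECS))
    return value
```

With these assumed bounds, a 2-day timeout passes and a 4-day timeout is rejected with a message that names the new limit.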
On 2017/07/21 14:35:56, M-A Ruel wrote:
> Hi, if there is a use case, I'm fine to extend the validity range.
>
> That said, keep in mind there is a risk that we'll break runs when updating, as
> we generally wait a day for breaking changes. So while it'll generally work, it
> may fail on breaking Swarming bot API changes.

Understood. How often do you expect such a breaking change to happen? If it was about once a month, I think we would not care much. But if you expect them weekly then we should probably look at a different solution.
On 2017/07/21 17:03:07, ilja wrote:
> On 2017/07/21 14:35:56, M-A Ruel wrote:
> > Hi, if there is a use case, I'm fine to extend the validity range.
> >
> > That said, keep in mind there is a risk that we'll break runs when updating, as
> > we generally wait a day for breaking changes. So while it'll generally work, it
> > may fail on breaking Swarming bot API changes.
>
> Understood. How often do you expect such a breaking change to happen? If it was
> about once a month, I think we would not care much. But if you expect them
> weekly then we should probably look at a different solution.

~once a year.
Can "ChromeOS CTS/GTS qualification suite" be sharded somehow so instead of one 2-day task it is N shorter tasks? Very long lived tasks are not a good fit for Swarming :(
On 2017/07/21 18:20:16, Vadim Sh. wrote:
> Can "ChromeOS CTS/GTS qualification suite" be sharded somehow so instead of one
> 2-day task it is N shorter tasks?

We can, but it is very inefficient (and the longest tasks might still run O(12h)). For more background, we would like to submit low priority jobs from Chrome OS builders to the test lab for every single build. This is a massive amount of work and we can't afford the sharding overhead.

> Very long lived tasks are not a good fit for Swarming :(

Can you explain which problem it causes you? AFAIK all that is happening here is a pass through of jobs into the Chrome OS scheduling system with immediate exit from the builders. (So I think you shouldn't notice much of this.) But I am not very familiar with the sharding/builder code.
Description was changed from ========== Change task timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 3 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG=746327 ========== to ========== Change task timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 7 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG=746327 ==========
Description was changed from ========== Change task timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 7 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG=746327 ========== to ========== Change task timeout limit to 7 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 7 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG=746327 ==========
Oh, if it's only for the proxy part, that's not so bad then... For some reason I thought it would run other portions of the LUCI stack on the bot (e.g. log collector).

The main problem is that a running task essentially "locks" APIs and implementations for the duration of its run. E.g. if we want to make a backward-incompatible change to an API, we usually wait 1 day for all old tasks (that may be using the old API) to terminate. With unbounded limits we will either need to wait a lot longer (unlikely we will), or break long-running tasks by changing server-side APIs underneath their feet.

Like M-A said, we don't do changes like this often, but they do happen. And the more chunks of the LUCI stack the task is using, the greater the risk (since there are more backend services involved).
On 2017/07/21 18:41:48, ilja wrote:
> On 2017/07/21 18:20:16, Vadim Sh. wrote:
> > Can "ChromeOS CTS/GTS qualification suite" be sharded somehow so instead of one
> > 2-day task it is N shorter tasks?
>
> We can, but it is very inefficient (and the longest tasks might still run
> O(12h)). For more background, we would like to submit low priority jobs from
> Chrome OS builders to the test lab for every single build. This is a massive
> amount of work and we can't afford the sharding overhead.
>
> > Very long lived tasks are not a good fit for Swarming :(
>
> Can you explain which problem it causes you? AFAIK all that is happening here is
> a pass through of jobs into the Chrome OS scheduling system with immediate exit
> from the builders. (So I think you shouldn't notice much of this.) But I am not
> very familiar with the sharding/builder code.

We highly value latency over efficiency. We calculate efficiency in terms of overhead in % over throughput. So what % of inefficiency are you thinking about?

Let's say, just for the purpose of illustration:
- constant overhead setup cost of each shard is ~1h
- running as one task takes 49 hours, one hour of "setup time" and 48 hours of processing, an overhead of 2%.

Let's propose:
- running as 5 shards gives you ~5h of overhead with results within ~12h +/- sharding imbalance, for a total cost of workers of 53 hours; 10% of overhead.

Even 20% of overhead seems a totally acceptable trade-off to get results 4x faster.

Basically, we are questioning the trade-offs you are making on your infrastructure, as one of the explicit goals of the infrastructure is to trade off throughput to lower latency.

This assumes SWEh >> hardware cost.

We want to assert that it's a use case you can't get off without before allowing this.
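The illustrative numbers in this message can be reproduced with a few lines of arithmetic. The following is only a sketch under the message's own simplifying assumptions (a fixed ~1h setup cost per shard and perfectly even shards); the function name `sharding_cost` is made up for this example and is not part of Swarming.

```python
def sharding_cost(work_hours, setup_hours, shards):
    """Returns (total_worker_hours, overhead_fraction, latency_hours).

    Assumes the work splits evenly and every shard pays the same setup cost,
    which is the simplification used in the message above.
    """
    total = work_hours + setup_hours * shards
    overhead = float(setup_hours * shards) / total
    latency = setup_hours + float(work_hours) / shards
    return total, overhead, latency


# 48h of processing with ~1h of setup per shard, as one task and as 5 shards.
for n in (1, 5):
    total, overhead, latency = sharding_cost(48, 1, n)
    print('%d shard(s): %dh of workers, %.1f%% overhead, results in ~%.1fh' %
          (n, total, overhead * 100, latency))
```

Under these assumptions the single task costs 49 worker hours at roughly 2% overhead, and the 5-shard split costs 53 worker hours at a bit under 10% overhead, consistent with the figures quoted in the message. Real shards are uneven, which is the objection raised in the reply.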
On 2017/07/21 19:11:10, M-A Ruel wrote:
> We highly value latency over efficiency. We calculate efficiency in terms of
> overhead in % over throughput. So what % of inefficiency are you thinking about?
> Let's say, just for the purpose of illustration:
> - constant overhead setup cost of each shard is ~1h
> - running as one task takes 49 hours, one hour of "setup time" and 48 hours of
> processing, an overhead of 2%.

As I pointed out we are unable to shard into equal pieces. What we have is sharding into 154 wildly variable chunks ranging from seconds to 6h (soon 12h due to doubling ABIs). Each shard has about 10 minutes overhead, in other words there is 1540 minutes overhead or 26h for a 24h run. I am sure you will agree that this is expensive and that it is a reasonable choice to accept twice the latency for twice the throughput.

> Let's propose:
> - running as 5 shards gives you ~5h of overhead with results within ~12h +/-
> sharding imbalance, for a total cost of workers of 53 hours; 10% of overhead.
> Even 20% of overhead seems a totally acceptable trade-off to get results 4x faster.
>
> Basically, we are questioning the trade-offs you are making on your
> infrastructure, as one of the explicit goals of the infrastructure is to trade
> off throughput to lower latency.

We can meet and I can explain our tradeoffs to you. But as explained this is a passthrough and should not affect your infra.

> This assumes SWEh >> hardware cost.
>
> We want to assert that it's a use case you can't get off without before allowing
> this.

I will create my own cron job scheduling these jobs if buildbot won't support our use case.
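The 154-shard figures in this reply can be checked the same way. This is just the arithmetic from the message written out; the constant names are invented for the example, and the values (154 shards, about 10 minutes of overhead each, a 24h run) are taken from the message as stated.

```python
SHARDS = 154                 # number of chunks quoted in the message
OVERHEAD_MIN_PER_SHARD = 10  # approximate per-shard overhead, in minutes
RUN_HOURS = 24               # length of the run the overhead is compared to

overhead_minutes = SHARDS * OVERHEAD_MIN_PER_SHARD
overhead_hours = overhead_minutes / 60.0
# Overhead relative to the useful run time; above 1.0 means the setup cost
# exceeds the run itself.
ratio = overhead_hours / RUN_HOURS

print('%d min of overhead = %.1fh, %.2fx the %dh run' %
      (overhead_minutes, overhead_hours, ratio, RUN_HOURS))
```

The overhead comes out at 1540 minutes, a little under 26 hours, so with this sharding the setup cost is about as large as the run itself. That is the basis for the "twice the latency for twice the throughput" argument.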
I mean, I'm fine with increasing the delays to 3 days, but not to 7 days, which you did in the second patchset. Please revert timeout limit to 3 days and expiration limit to 1 day. The expiration is orthogonal to your issue.

Looking at sharding the test sequence more has many more advantages than just the Swarming specific issues.

You didn't address my other comment about doc update. Please do so.
On 2017/07/21 21:34:37, M-A Ruel wrote:
> I mean, I'm fine with increasing the delays to 3 days, but not to 7 days, which
> you did in the second patchset. Please revert timeout limit to 3 days and
> expiration limit to 1 day. The expiration is orthogonal to your issue.
>
> Looking at sharding the test sequence more has many more advantages than just
> the Swarming specific issues.
>
> You didn't address my other comment about doc update. Please do so.

I checked doc/User-Guide.md, but I can't find anything mentioning the maximum timeout of a task request in the "Request" section. Is it possible for you to point me to the right place?
Description was changed from ========== Change task timeout limit to 7 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 7 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG=746327 ========== to ========== Change task timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 3 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG=746327 ==========
On 2017/07/21 23:07:31, pwang1 wrote:
> On 2017/07/21 21:34:37, M-A Ruel wrote:
> > I mean, I'm fine with increasing the delays to 3 days, but not to 7 days, which
> > you did in the second patchset. Please revert timeout limit to 3 days and
> > expiration limit to 1 day. The expiration is orthogonal to your issue.
> >
> > Looking at sharding the test sequence more has many more advantages than just
> > the Swarming specific issues.
> >
> > You didn't address my other comment about doc update. Please do so.
>
> I checked doc/User-Guide.md, but I can't find anything mentioning the maximum
> timeout of a task request in the "Request" section.
> Is it possible for you to point me to the right place?

Wow, sorry I totally hallucinated. lgtm with patchset #3.
The CQ bit was checked by pwang@chromium.org
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 40001, "attempt_start_ts": 1500916271664520, "parent_rev": "f59af786b5b9f1571698333c0b064231ca07d144", "commit_rev": "710ff0d7f30b1e8813d2b15865fbed61e7deb80c"}
Message was sent while issue was closed.
Description was changed from ========== Change task timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 3 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG=746327 ========== to ========== Change task timeout limit to 3 days As ChromeOS CTS/GTS qualification suite may take at most 2 days to complete, change the swarming client task timeout limit from 1 day to 3 days. Relavent discussion: https://groups.google.com/a/google.com/forum/?utm_medium=email&utm_source=foo... R=vadimsh@chromium.org BUG=746327 Review-Url: https://codereview.chromium.org/2984773002 Committed: https://github.com/luci/luci-py/commit/710ff0d7f30b1e8813d2b15865fbed61e7deb80c ==========
Committed patchset #3 (id:40001) as https://github.com/luci/luci-py/commit/710ff0d7f30b1e8813d2b15865fbed61e7deb80c
Thank you!
On 2017/07/24 17:16:53, commit-bot: I haz the power wrote:
> Committed patchset #3 (id:40001) as
> https://github.com/luci/luci-py/commit/710ff0d7f30b1e8813d2b15865fbed61e7deb80c

Thanks all. The proxy seems to be working and the CTS suite gets executed. |