Created: 3 years, 8 months ago by Maks Orlovich
Modified: 3 years, 2 months ago
Reviewers: pasko
CC: chromium-reviews, cbentzel+watch_chromium.org, gavinp+disk_chromium.org, net-reviews_chromium.org
Target Ref: refs/heads/master
Project: chromium
Visibility: Public.
Description
This includes stream 0 all the time, and stream 1 if available from a prefetch. This change does regress some zero-length-read type cases, which previously were sync even when non-idle, in order to reduce code duplication; that may be worth reconsidering and undoing.
This also includes a microbenchmark to measure some of our bookkeeping overhead, which is also improved a bit by this, but it's not the sort of number (on the order of half a microsecond) to get overly excited about.
BUG=293014
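For a sense of the shape such a microbenchmark might take (this is not the benchmark added in the CL; the names are made up and it uses only the standard library rather than Chromium's perf-test infrastructure), a rough standalone sketch would time many back-to-back reads of an already-open entry and report the mean per-call cost:

// Hypothetical sketch of a read-dispatch microbenchmark; not the CL's code.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

// Trivial stand-in for an open cache entry with stream 0 already in memory.
struct FakeEntry {
  std::vector<char> stream0 = std::vector<char>(256, 'x');
  int ReadStream0(int offset, char* buf, int len) {
    int n = std::min<int>(len, static_cast<int>(stream0.size()) - offset);
    std::memcpy(buf, stream0.data() + offset, n);
    return n;
  }
};

int main() {
  constexpr int kIterations = 1000000;
  FakeEntry entry;
  char buf[256];
  long long sink = 0;  // Keeps the loop from being optimized away entirely.

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIterations; ++i)
    sink += entry.ReadStream0(0, buf, static_cast<int>(sizeof(buf)));  // Call under test.
  auto end = std::chrono::steady_clock::now();

  double total_us =
      std::chrono::duration<double, std::micro>(end - start).count();
  std::printf("mean per read: %.4f us (sink=%lld)\n", total_us / kIterations,
              sink);
  return 0;
}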
Patch Set 1 #
Patch Set 2 : git cl try #
Patch Set 3 : Benchmark for simple reads; meant to help evaluate any changes to early dispatch logic in general. #
Total comments: 1
Patch Set 4 : Experimental: upload a bit of an alternative approach for discussion #
Total comments: 2
Patch Set 5 : Resurrect it, and update for the preread changes. #
Patch Set 6 : A bunch of refactors and cleanups. A buggy bunch I want to read a combined diff for to help spot th… #
Patch Set 7 : Fix a couple of silly mistakes, restore some comments at proper spots #
Patch Set 8 : Use the proper limit for the barrier. #
Messages
Total messages: 50 (30 generated)
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
On 2017/04/06 19:37:14, commit-bot: I haz the power wrote:
> Dry run: Try jobs failed on following builders:
> mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)

Right. Needs some of the range changing that's only really done in ReadDataInternal. Hmm.
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
> > Right. Needs some of the range changing that's only really done in
> > ReadDataInternal. Hmm.

So v2 is probably correct, but I don't like the code duplication, both what this adds and what already exists --- the checks for out-of-bound reads are basically identical, modulo the callback invocation/log method. (And I am not sure guarding them with just pending_operations_.empty() in ReadData is sufficient --- what if a write is running?) It may be worth thinking about whether any refactor along the lines of https://bugs.chromium.org/p/chromium/issues/detail?id=293014 can both reduce the duplication and permit consistently fast-pathing to execution without marshaling and demarshaling things into the queue needlessly... The whole thing feels rather heavy, what with copying callbacks and the like, and std::queue being quite a memory hog.
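For context on the fast-path shape being discussed, the sketch below is a simplified, hypothetical illustration, not the actual SimpleEntryImpl code: the class, the DoReadIdle() helper, and the std::function callback type are stand-ins for ReadDataInternal() and Chromium's callback machinery. It only shows the guard-and-dispatch structure: serve the read inline when the entry is idle, otherwise enqueue it.

// Hypothetical sketch of the quick-path guard; names are made up and this is
// not the real SimpleEntryImpl.
#include <functional>
#include <queue>
#include <utility>

class EntrySketch {
 public:
  enum class State { kReady, kIoPending, kFailure };
  static constexpr int kErrIoPending = -1;

  int ReadData(int stream, int offset, char* buf, int buf_len,
               std::function<void(int)> callback) {
    // Quick path: nothing queued or running and the entry is healthy, so run
    // the read inline and return the result synchronously instead of bouncing
    // through the queue and the event loop.
    if (pending_operations_.empty() && state_ == State::kReady)
      return DoReadIdle(stream, offset, buf, buf_len);

    // Slow path: package the operation; the queue drains it later and the
    // callback reports the result asynchronously.
    pending_operations_.push(
        [this, stream, offset, buf, buf_len, cb = std::move(callback)]() {
          cb(DoReadIdle(stream, offset, buf, buf_len));
        });
    return kErrIoPending;
  }

 private:
  // Stand-in for ReadDataInternal(): a real entry would copy from the
  // in-memory stream 0 buffer or dispatch to the worker pool here.
  int DoReadIdle(int /*stream*/, int /*offset*/, char* /*buf*/, int buf_len) {
    return buf_len;
  }

  State state_ = State::kReady;
  std::queue<std::function<void()>> pending_operations_;
};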
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from
==========
Implement quick path for stream 0 read. Hmm. This avoids reporting on parallelism stats, but that's probably OK, since this doesn't hit the disk anyway. Need to figure out a good way of measuring it, too.

It does seem to help on an unmodified DiskCachePerfTest.SimpleCacheBackendPerformance: "Read disk cache headers only (warm)" metric, but it's not really a great benchmark, and is quite noisy. Perhaps microbenchmarking with one key would be appropriate.

BUG=
==========
to
==========
Implement quick path for stream 0 read when nothing else is going on.

The biggest effect of this is that it avoids going back to event loop before invoking the callback. There are also CPU savings, going something like 0.74us -> 0.21us (or 0.66us -> 0.17us) for the call itself, not counting the delayed callback cost. These of course are pretty minor when there are ms of I/O involved, but I may need to chip at every bit. To measure this, another microbenchmark was added.

The benchmark itself is part of motivation for the change: there are some things that seem worth improving in the SimpleEntryImpl -> SimpleSynchronousEntry interaction, such as the RAM-munching std::queue object, the amount of packing/unpacking we do when the queue is usually empty in the first place, and some of the logic duplication between FooInternal() methods and some of the quick paths in corresponding Foo (which this makes worse). Obviously all of those would need to be measured, and that's a tricky task.

BUG=293014
==========
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
morlovich@chromium.org changed reviewers: + pasko@chromium.org
Come to think of it, this might also not produce one of the netlog entries, and might change the order in which things execute...
Avoiding a delay from a posttask sounds handy, and it happens every time the entry is opened. Given that the entry queue is empty anyway, there should be no reordering problem in simplecache. I don't think layers above will be confused (since the memory cache can do the same).

The netlog is important, however. If you find a non-intrusive way to keep the netlog the same, I'd be OK with that.

The microbenchmark, while good for estimating a few properties right now, sounds like potential overkill to maintain. I might be wrong, I did not read it.

https://codereview.chromium.org/2807433002/diff/40001/net/disk_cache/simple/s...
File net/disk_cache/simple/simple_entry_impl.cc (left):

https://codereview.chromium.org/2807433002/diff/40001/net/disk_cache/simple/s...
net/disk_cache/simple/simple_entry_impl.cc:392: // TODO(felipeg): Optimization: Add support for truly parallel read
fyi: this one did not show much opportunity on the metrics (SimpleCache.Http.WriteDependencyType, SimpleCache.Http.ReadIsParallelizable, SimpleCache.Http.EntryOperationsPending), feel free to remove
On 2017/04/07 16:13:39, pasko wrote:
> Avoiding a delay from a posttask sounds handy, and it happens every time the entry is opened. Given that the entry queue is empty anyway, there should be no reordering problem in simplecache.

Oh, right, the RunNextIfNeeded() will be a no-op.

> I don't think layers above will be confused (since the memory cache can do the same).
>
> The netlog is important, however. If you find a non-intrusive way to keep the netlog the same, I'd be OK with that.

It'd basically be copying a few more lines from ReadDataInternal; though at this point I may just try step 2: calling ReadDataInternal directly (with an extra bool telling it to return rather than invoke the callback) from ReadData when pending_operations_.size() == 0 && state_ == STATE_READY, and removing some of the duplication in the handling of invalid-range cases.

Longer-term I am wondering if I could just store a callback to ReadDataInternal when we are not idle --- SimpleEntryOperation is copied by value a bunch ATM, and has a scoped_refptr<net::IOBuffer> buf_; I can mostly get to that incrementally by just adding a callback state to SimpleEntryOperation to start with.

> The microbenchmark, while good for estimating a few properties right now, sounds like potential overkill to maintain. I might be wrong, I did not read it.
>
> https://codereview.chromium.org/2807433002/diff/40001/net/disk_cache/simple/s...
> File net/disk_cache/simple/simple_entry_impl.cc (left):
>
> https://codereview.chromium.org/2807433002/diff/40001/net/disk_cache/simple/s...
> net/disk_cache/simple/simple_entry_impl.cc:392: // TODO(felipeg): Optimization: Add support for truly parallel read
> fyi: this one did not show much opportunity on the metrics (SimpleCache.Http.WriteDependencyType, SimpleCache.Http.ReadIsParallelizable, SimpleCache.Http.EntryOperationsPending), feel free to remove

Yeah, though it was helpful in determining that the queue length is typically one. I am strongly tempted to remove all the stuff that computes parallelism, too, since it's a lot of complexity and introduces trickiness in changing how queuing works, due to executing_operation_.
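To make the "extra bool" idea above concrete, here is a rough, speculative sketch of the shape (made-up names, a std::function callback, and an out-parameter instead of a bool, which is equivalent in spirit; this is not the real ReadDataInternal() signature): one shared internal implementation that either hands the result back for a synchronous return on the idle fast path, or reports it through the callback on the queued path.

// Speculative sketch only: one internal read implementation shared by the
// idle fast path (synchronous return) and the queued path (callback).
#include <functional>
#include <utility>

class EntryInternalsSketch {
 public:
  using CompletionCallback = std::function<void(int)>;

  // Idle fast path: the caller is still on the stack, so just return.
  int ReadDataSync(int stream, int offset, char* buf, int buf_len) {
    int result = 0;
    ReadDataInternal(stream, offset, buf, buf_len, CompletionCallback(),
                     /*sync_result=*/&result);
    return result;
  }

  // Queued path: the result is delivered through the callback later.
  void ReadDataQueued(int stream, int offset, char* buf, int buf_len,
                      CompletionCallback callback) {
    ReadDataInternal(stream, offset, buf, buf_len, std::move(callback),
                     /*sync_result=*/nullptr);
  }

 private:
  // Shared implementation: range checks, netlog, and bookkeeping would live
  // here once instead of being duplicated in the fast path.
  void ReadDataInternal(int stream, int offset, char* buf, int buf_len,
                        CompletionCallback callback, int* sync_result) {
    int result = DoRead(stream, offset, buf, buf_len);
    if (sync_result)
      *sync_result = result;  // Fast path: caller returns this directly.
    else
      callback(result);       // Queued path: report asynchronously.
  }

  int DoRead(int /*stream*/, int /*offset*/, char* /*buf*/, int buf_len) {
    return buf_len;  // Stand-in for the actual read.
  }
};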
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
A side thought: it came up during the talk that HttpCache* might start accessing headers and payload in parallel. Thinking some more, does it make any sense to serialize stream 0 vs. stream 1 ops, and shouldn't we be able to do all stream 0 stuff immediately as long as the entry is healthy? Hmm, I guess failure conditions are where things may show up?
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2017/04/07 19:21:50, Maks Orlovich wrote:
> A side thought: it came up during the talk that HttpCache* might start accessing headers and payload in parallel. Thinking some more, does it make any sense to serialize stream 0 vs. stream 1 ops, and shouldn't we be able to do all stream 0 stuff immediately as long as the entry is healthy? Hmm, I guess failure conditions are where things may show up?

I can only see something breaking if it writes the new version of the headers and reads back the old version (without waiting for the write completion), which would be a misuse of the API, and I am having a hard time imagining where this edge case would be useful. Hence, probably safe? We already write stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?
> I can only see something breaking if it writes the new version of the headers and reads back the old version (without waiting for the write completion)

Hmm, we could end up with new headers but no new data being visible, though. That sounds potentially bad? I don't know how much HttpCache cares. Sadly none of this stuff seems specified in the disk_cache::Backend docs... I should probably ask Shivani about how much HttpCacheTransaction relies on this. (Of course, that also leaves the other clients, but those might just be simple enough to straight up inspect.)

> which would be a misuse of the API and I am having a hard time imagining where this edge case would be useful. Hence, probably safe? We already write stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?

Well, the question is whether we can drop the "queue is empty" condition for stream 0.

Anyway, I would really appreciate feedback on this CL proper: I think at its current scope it can help see if those thread ping-pongs matter much --- I think the one for CRC computation is avoidable as well (cleanly, but far more complexly than this).
https://codereview.chromium.org/2807433002/diff/60001/net/disk_cache/simple/s...
File net/disk_cache/simple/simple_entry_impl.cc (left):

https://codereview.chromium.org/2807433002/diff/60001/net/disk_cache/simple/s...
net/disk_cache/simple/simple_entry_impl.cc:837: RecordReadResult(cache_type_, READ_RESULT_BAD_STATE);
I should probably double-check the condition for this one...
On 2017/04/12 14:38:41, Maks Orlovich wrote:
> > I can only see something breaking if it writes the new version of the headers and reads back the old version (without waiting for the write completion)
>
> Hmm, we could end up with new headers but no new data being visible, though. That sounds potentially bad?

Since all the entry ops are queued, this can only happen after a browser crash, right? I think it's unlikely to be a problem, and if the world is able to observe such a problem at some point, we can mitigate it: temporarily corrupt the entry every time we write stream0 and un-corrupt it after the next data write.

> I don't know how much HttpCache cares. Sadly none of this stuff seems specified in the disk_cache::Backend docs... I should probably ask Shivani about how much HttpCacheTransaction relies on this.
>
> (Of course, that also leaves the other clients, but those might just be simple enough to straight up inspect.)

I think other clients don't use stream0 (appcache - not sure, anyone still using it? :)

> > which would be a misuse of the API and I am having a hard time imagining where this edge case would be useful. Hence, probably safe? We already write stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?
>
> Well, the question is whether we can drop the "queue is empty" condition for stream 0.

Isn't it always empty when stream0 is read? I think our most common client has to ReadData(stream0) after OpenEntry() before it can think of reading other streams.

> Anyway, I would really appreciate feedback on this CL proper: I think at its current scope it can help see if those thread ping-pongs matter much

How do you want to see if it helps? Would there be a study parameter in the next change, or would you compare histograms from different channels? (spoiler: the latter does not work)

> --- I think the one for CRC computation is avoidable as well (cleanly, but far more complexly than this).

I won't bother optimizing the CRC computation. I believe the effect is small.
> Since all the entry ops are queued, this can only happen after a browser crash, right? I think it's unlikely to be a problem, and if the world is able to observe such a problem at some point, we can mitigate it: temporarily corrupt the entry every time we write stream0 and un-corrupt it after the next data write.

Well, again, the question is whether we can avoid caring about the queue for stream 0. The disk-state-on-crash issue is indeed already handled largely as you suggest, by having a bogus footer. That's another thing that may be possible to speed up a bit, by combining some of the length fiddling into writes. (Sadly it'd be a bit on the hacky and fiddly side, especially since Windows doesn't seem to have a real replacement for writev...) (Stream 0 is only written at Close, though.)

> I think other clients don't use stream0 (appcache - not sure, anyone still using it? :)
>
> > > which would be a misuse of the API and I am having a hard time imagining where this edge case would be useful. Hence, probably safe? We already write stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?
> >
> > Well, the question is whether we can drop the "queue is empty" condition for stream 0.
>
> Isn't it always empty when stream0 is read? I think our most common client has to ReadData(stream0) after OpenEntry() before it can think of reading other streams.

Oh, right, background: Shivani mentioned during my talk that HttpCacheTransaction may try accessing both streams in parallel, for writes, I think. I should probably pin down the details.

> > Anyway, I would really appreciate feedback on this CL proper: I think at its current scope it can help see if those thread ping-pongs matter much
>
> How do you want to see if it helps? Would there be a study parameter in the next change, or would you compare histograms from different channels? (spoiler: the latter does not work)

I would hope it would be big enough to see in trendlines on some of the cache latency metrics, but that's probably naive for anything that's not an order-of-magnitude change to something that's measured more directly.

Studies are just too slow when I am trying to understand a problem more than to confirm a solution :(

> > --- I think the one for CRC computation is avoidable as well (cleanly, but far more complexly than this).
>
> I won't bother optimizing the CRC computation. I believe the effect is small.

Well, it's one more roundtrip around the event loop on read before the final callback, I think?
On 2017/04/12 15:28:05, Maks Orlovich wrote:
> > Since all the entry ops are queued, this can only happen after a browser crash, right? I think it's unlikely to be a problem, and if the world is able to observe such a problem at some point, we can mitigate it: temporarily corrupt the entry every time we write stream0 and un-corrupt it after the next data write.
>
> Well, again, the question is whether we can avoid caring about the queue for stream 0.

I am a little lost now. I thought we were talking about a potential speedup. If reading stream 0 always has an empty queue - mission accomplished, an extra check for entry queue size will ensure correctness at low complexity, and the speedup is there.

> > > > stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?
> > >
> > > Well, the question is whether we can drop the "queue is empty" condition for stream 0.
> >
> > Isn't it always empty when stream0 is read? I think our most common client has to ReadData(stream0) after OpenEntry() before it can think of reading other streams.
>
> Oh, right, background: Shivani mentioned during my talk that HttpCacheTransaction may try accessing both streams in parallel, for writes, I think. I should probably pin down the details.

HttpCacheTransaction can only request things in parallel if the disk cache returns asynchronously. So if it asks for stream0 first, and the backend responds right away (as this change suggests), then there would be no parallelism.

> > > Anyway, I would really appreciate feedback on this CL proper: I think at its current scope it can help see if those thread ping-pongs matter much
> >
> > How do you want to see if it helps? Would there be a study parameter in the next change, or would you compare histograms from different channels? (spoiler: the latter does not work)
>
> I would hope it would be big enough to see in trendlines on some of the cache latency metrics, but that's probably naive for anything that's not an order-of-magnitude change to something that's measured more directly.
>
> Studies are just too slow when I am trying to understand a problem more than to confirm a solution :(

OK, I see no issues with the code, it should be good for benchmarking. If the benchmark does not demonstrate any effect, it would be simpler to not commit it. Would you agree to such a plan?

> > > --- I think the one for CRC computation is avoidable as well (cleanly, but far more complexly than this).
> >
> > I won't bother optimizing the CRC computation. I believe the effect is small.
>
> Well, it's one more roundtrip around the event loop on read before the final callback, I think?

Maybe something changed, but I thought we just maintain the crc as we read (if the incoming reads are sequential); maybe there is an extra roundtrip accidentally, but I don't remember it. The data is available in the entry at the last read anyway, and we can return an error without posting to the loop.
https://codereview.chromium.org/2807433002/diff/60001/net/disk_cache/simple/s...
File net/disk_cache/simple/simple_entry_impl.cc (left):

https://codereview.chromium.org/2807433002/diff/60001/net/disk_cache/simple/s...
net/disk_cache/simple/simple_entry_impl.cc:837: RecordReadResult(cache_type_, READ_RESULT_BAD_STATE);
On 2017/04/12 14:55:49, Maks Orlovich wrote:
> I should probably double-check the condition for this one...

this won't appear on the benchmark, so why bother?
> I am a little lost now. I thought we were talking about a potential speedup. If reading stream 0 always has an empty queue - mission accomplished, an extra check for entry queue size will ensure correctness at low complexity, and the speedup is there.

Let's drop this for now; it may be clearer/less confusing if I just make a proposal CL rather than discussing a potential CL N + 2 or something when asking for review of CL N --- this CL isn't intended to change any ordering constraints, after all, just to shave some cycles and an event-loop roundtrip.

> OK, I see no issues with the code, it should be good for benchmarking. If the benchmark does not demonstrate any effect, it would be simpler to not commit it. Would you agree to such a plan?

It does demonstrate it --- see the CL description --- but in terms of CPU only. It also avoids ~5us or so for the loop + callback portion, on top of the numbers given.

> Maybe something changed, but I thought we just maintain the crc as we read (if the incoming reads are sequential); maybe there is an extra roundtrip accidentally, but I don't remember it. The data is available in the entry at the last read anyway, and we can return an error without posting to the loop.

That's correct, except for what happens at the very end: once SimpleEntryImpl finishes the last sequential ReadData it does a PostTaskAndReply to SimpleSynchronousEntry::CheckEOFRecord, which then reads the footer and verifies the passed-in CRC. Whether the extra round trip matters, I can't say. (Might also be an opportunity to combine two reads there, if that matters on OS X, and if seek + readv would be faster.)
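For background on the scheme being described (the CRC is folded in as each sequential read completes, leaving only a footer comparison at the end), here is a minimal, hypothetical sketch of that pattern. It is not the SimpleSynchronousEntry code; the names are made up, and the only real API used is zlib's crc32().

// Minimal illustration of maintaining a CRC across sequential reads and
// checking it against a stored footer value on the read that reaches EOF.
#include <zlib.h>

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

struct StreamWithFooter {
  std::vector<uint8_t> data;
  uint32_t footer_crc;  // CRC of `data`, written when the entry was stored.
};

class SequentialReader {
 public:
  explicit SequentialReader(const StreamWithFooter& stream) : stream_(stream) {}

  // Returns bytes read, or -1 on a failed final CRC check.
  int Read(uint8_t* buf, size_t len) {
    size_t n = std::min(len, stream_.data.size() - offset_);
    std::memcpy(buf, stream_.data.data() + offset_, n);
    // Fold this chunk into the running CRC as we go, so no extra pass over
    // the data is needed at the end.
    running_crc_ = crc32(running_crc_, buf, static_cast<uInt>(n));
    offset_ += n;
    // On the read that reaches EOF, compare against the stored footer CRC
    // (the step SimpleCache currently does via CheckEOFRecord).
    if (offset_ == stream_.data.size() && running_crc_ != stream_.footer_crc)
      return -1;
    return static_cast<int>(n);
  }

 private:
  const StreamWithFooter& stream_;
  size_t offset_ = 0;
  uLong running_crc_ = crc32(0L, Z_NULL, 0);  // zlib's initial CRC value.
};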
On 2017/04/12 16:49:54, Maks Orlovich wrote:
> > OK, I see no issues with the code, it should be good for benchmarking. If the benchmark does not demonstrate any effect, it would be simpler to not commit it. Would you agree to such a plan?
>
> It does demonstrate it --- see the CL description --- but in terms of CPU only. It also avoids ~5us or so for the loop + callback portion, on top of the numbers given.

Oh, sorry, I did not notice :(

My threshold for useful CPU time savings on the IO thread is 5ms for a heavy page load (200 resources). A bigger saving does not automatically mean usefulness, but it would be an interesting topic to talk about, as it could matter more at higher percentiles. For example, I recently argued against a saving of 20ms (but that was a bigger complication).

So if this is about a .5us saving per read, then looking at our size distribution (Net.HttpJob.PrefilterBytesRead.Cache) and the fact that buffers are 32KiB, we would have around 230 reads per page load, which gives a number that is 50 times smaller than my threshold. Is this about right?

If so, then I am not sure why we should review it carefully and fix the potential fallout of edge cases on Stable.

Feel free to start a discussion about useful IO thread performance improvements on loading-dev@; I know other people may have different opinions than mine.
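As a rough check of the arithmetic above, using only the estimates already given in this thread (not measurements): 230 reads per page load x 0.5 us saved per read is about 115 us, and 5 ms / 115 us is roughly 43, so the estimated saving is indeed on the order of fifty times below the stated threshold.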
> My threshold for useful CPU time savings on the IO thread is 5ms for a heavy page load (200 resources). A bigger saving does not automatically mean usefulness, but it would be an interesting topic to talk about, as it could matter more at higher percentiles. For example, I recently argued against a saving of 20ms (but that was a bigger complication).
>
> So if this is about a .5us saving per read, then looking at our size distribution (Net.HttpJob.PrefilterBytesRead.Cache) and the fact that buffers are 32KiB, we would have around 230 reads per page load, which gives a number that is 50 times smaller than my threshold. Is this about right?
>
> If so, then I am not sure why we should review it carefully and fix the potential fallout of edge cases on Stable.
>
> Feel free to start a discussion about useful IO thread performance improvements on loading-dev@; I know other people may have different opinions than mine.

Thanks for sharing your rule; I don't really have a good sense for how small changes tend to affect the overall system[1]. Do you feel avoiding going back to the event loop to post a result isn't usually worth it either?

Also, how do you feel about SimpleCache using somewhere in the 1-4% range of IO thread CPU? (Might be somewhat overestimated by tracing, though, come to think of it.)

[1] It makes me feel better about potentially adding like 0.02us on Windows to save like 4K per entry on Android :)
On 2017/04/13 13:16:26, Maks Orlovich wrote:
> > My threshold for useful CPU time savings on the IO thread is 5ms for a heavy page load (200 resources). A bigger saving does not automatically mean usefulness, but it would be an interesting topic to talk about, as it could matter more at higher percentiles. For example, I recently argued against a saving of 20ms (but that was a bigger complication).
> >
> > So if this is about a .5us saving per read, then looking at our size distribution (Net.HttpJob.PrefilterBytesRead.Cache) and the fact that buffers are 32KiB, we would have around 230 reads per page load, which gives a number that is 50 times smaller than my threshold. Is this about right?
> >
> > If so, then I am not sure why we should review it carefully and fix the potential fallout of edge cases on Stable.
> >
> > Feel free to start a discussion about useful IO thread performance improvements on loading-dev@; I know other people may have different opinions than mine.
>
> Thanks for sharing your rule; I don't really have a good sense for how small changes tend to affect the overall system[1]. Do you feel avoiding going back to the event loop to post a result isn't usually worth it either?
>
> Also, how do you feel about SimpleCache using somewhere in the 1-4% range of IO thread CPU? (Might be somewhat overestimated by tracing, though, come to think of it.)

If it's chunked enough and the IO thread stays responsive, even 20% CPU won't worry me much, since the cache is so coupled with networking. I think the queuing is a bigger problem (what bmaurer@ talked about at BlinkOn7 - the prelude.js being over-queued, slides 23-24 in [1]). The problem is less pronounced now, as we throttle H/2 (ask jkarlin@ for details), but it is still a problem, and changes like this one can potentially help; but the metric should be something like time-to-first-bytes-in-renderer, and IO thread CPU time is hard to project onto it.

[1] https://www.dropbox.com/s/er04jbxu510qkge/blinkon%20jan2017.pdf
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
FYI: Josh expressed interest in having reads return synchronously when discussing the prefetch change, so I'll be reviving this, but looks like it'll need to be re-uploaded to the new review system.
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from
==========
Implement quick path for stream 0 read when nothing else is going on.

The biggest effect of this is that it avoids going back to event loop before invoking the callback. There are also CPU savings, going something like 0.74us -> 0.21us (or 0.66us -> 0.17us) for the call itself, not counting the delayed callback cost. These of course are pretty minor when there are ms of I/O involved, but I may need to chip at every bit. To measure this, another microbenchmark was added.

The benchmark itself is part of motivation for the change: there are some things that seem worth improving in the SimpleEntryImpl -> SimpleSynchronousEntry interaction, such as the RAM-munching std::queue object, the amount of packing/unpacking we do when the queue is usually empty in the first place, and some of the logic duplication between FooInternal() methods and some of the quick paths in corresponding Foo (which this makes worse). Obviously all of those would need to be measured, and that's a tricky task.

BUG=293014
==========
to
==========
This includes stream 0 all the time, and stream 1 if available from a prefetch. This change does regress some zero-length-read type cases from non-idle being sync to reduce code duplication, which may be worth reconsidering and undoing.

This also includes a microbenchmark to measure some of our bookkeeping overhead, which is also improved a bit by this, but it's not the sort of numbers (order of half a microsecond) to get overly excited about.

BUG=293014
==========
Dry run: This issue passed the CQ dry run.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2017/09/22 14:51:11, Maks Orlovich wrote:
> FYI: Josh expressed interest in having reads return synchronously when discussing the prefetch change, so I'll be reviving this, but looks like it'll need to be re-uploaded to the new review system.

Moved to: https://chromium-review.googlesource.com/#/c/chromium/src/+/682139