Created: 3 years, 8 months ago by Maks Orlovich
Modified: 3 years, 2 months ago
Reviewers: pasko
CC: chromium-reviews, cbentzel+watch_chromium.org, gavinp+disk_chromium.org, net-reviews_chromium.org
Target Ref: refs/heads/master
Project: chromium
Visibility: Public.
Description
This includes stream 0 all the time, and stream 1 if available from a prefetch. This change does regress some zero-length-read type cases, which previously were sync even when non-idle, in order to reduce code duplication; that may be worth reconsidering and undoing.
This also includes a microbenchmark to measure some of our bookkeeping overhead, which is also improved a bit by this, but it's not the sort of number (on the order of half a microsecond) to get overly excited about.
BUG=293014
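For a sense of the shape such a microbenchmark might take (this is not the benchmark added in the CL; the names are made up and it uses only the standard library rather than Chromium's perf-test infrastructure), a rough standalone sketch would time many back-to-back reads of an already-open entry and report the mean per-call cost:

// Hypothetical sketch of a read-dispatch microbenchmark; not the CL's code.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

// Trivial stand-in for an open cache entry with stream 0 already in memory.
struct FakeEntry {
  std::vector<char> stream0 = std::vector<char>(256, 'x');
  int ReadStream0(int offset, char* buf, int len) {
    int n = std::min<int>(len, static_cast<int>(stream0.size()) - offset);
    std::memcpy(buf, stream0.data() + offset, n);
    return n;
  }
};

int main() {
  constexpr int kIterations = 1000000;
  FakeEntry entry;
  char buf[256];
  long long sink = 0;  // Keeps the loop from being optimized away entirely.

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIterations; ++i)
    sink += entry.ReadStream0(0, buf, static_cast<int>(sizeof(buf)));  // Call under test.
  auto end = std::chrono::steady_clock::now();

  double total_us =
      std::chrono::duration<double, std::micro>(end - start).count();
  std::printf("mean per read: %.4f us (sink=%lld)\n", total_us / kIterations,
              sink);
  return 0;
}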
Patch Set 1 #
Patch Set 2 : git cl try #
Patch Set 3 : Benchmark for simple reads; meant to help evaluate any changes to early dispatch logic in general. #
Total comments: 1
Patch Set 4 : Experimental: upload a bit of an alternative approach for discussion #
Total comments: 2
Patch Set 5 : Resurrect it, and update for the preread changes. #
Patch Set 6 : A bunch of refactors and cleanups. A buggy bunch I want to read a combined diff for to help spot th… #
Patch Set 7 : Fix a couple of silly mistakes, restore some comments at proper spots #
Patch Set 8 : Use the proper limit for the barrier. #
Messages
Total messages: 50 (30 generated)
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
On 2017/04/06 19:37:14, commit-bot: I haz the power wrote:
> Dry run: Try jobs failed on following builders:
> mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)

Right. Needs some of the range changing that's only really done in ReadDataInternal. Hmm.
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
> > Right. Needs some of the range changing that's only really done in
> > ReadDataInternal. Hmm.

So v2 is probably correct, but I don't like the code duplication, both what this adds and what already exists --- the checks for out-of-bound reads are basically identical, modulo the callback invocation/log method. (And I am not sure guarding them with just pending_operations_.empty() in ReadData is sufficient --- what if a write is running?) It may be worth thinking about whether any refactor along the lines of https://bugs.chromium.org/p/chromium/issues/detail?id=293014 can both reduce the duplication and permit consistently fast-pathing to execution without marshaling and demarshaling things into the queue needlessly... The whole thing feels rather heavy, what with copying callbacks and the like, and std::queue being quite a memory hog.
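For context on the fast-path shape being discussed, the sketch below is a simplified, hypothetical illustration, not the actual SimpleEntryImpl code: the class, the DoReadIdle() helper, and the std::function callback type are stand-ins for ReadDataInternal() and Chromium's callback machinery. It only shows the guard-and-dispatch structure: serve the read inline when the entry is idle, otherwise enqueue it.

// Hypothetical sketch of the quick-path guard; names are made up and this is
// not the real SimpleEntryImpl.
#include <functional>
#include <queue>
#include <utility>

class EntrySketch {
 public:
  enum class State { kReady, kIoPending, kFailure };
  static constexpr int kErrIoPending = -1;

  int ReadData(int stream, int offset, char* buf, int buf_len,
               std::function<void(int)> callback) {
    // Quick path: nothing queued or running and the entry is healthy, so run
    // the read inline and return the result synchronously instead of bouncing
    // through the queue and the event loop.
    if (pending_operations_.empty() && state_ == State::kReady)
      return DoReadIdle(stream, offset, buf, buf_len);

    // Slow path: package the operation; the queue drains it later and the
    // callback reports the result asynchronously.
    pending_operations_.push(
        [this, stream, offset, buf, buf_len, cb = std::move(callback)]() {
          cb(DoReadIdle(stream, offset, buf, buf_len));
        });
    return kErrIoPending;
  }

 private:
  // Stand-in for ReadDataInternal(): a real entry would copy from the
  // in-memory stream 0 buffer or dispatch to the worker pool here.
  int DoReadIdle(int /*stream*/, int /*offset*/, char* /*buf*/, int buf_len) {
    return buf_len;
  }

  State state_ = State::kReady;
  std::queue<std::function<void()>> pending_operations_;
};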
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from
==========
Implement quick path for stream 0 read. Hmm. This avoids reporting on parallelism stats, but that's probably OK, since this doesn't hit the disk anyway. Need to figure out a good way of measuring it, too.

It does seem to help on an unmodified DiskCachePerfTest.SimpleCacheBackendPerformance: "Read disk cache headers only (warm)" metric, but it's not really a great benchmark, and is quite noisy. Perhaps microbenchmarking with one key would be appropriate.

BUG=
==========
to
==========
Implement quick path for stream 0 read when nothing else is going on.

The biggest effect of this is that it avoids going back to event loop before invoking the callback. There are also CPU savings, going something like 0.74us -> 0.21us (or 0.66us -> 0.17us) for the call itself, not counting the delayed callback cost. These of course are pretty minor when there are ms of I/O involved, but I may need to chip at every bit. To measure this, another microbenchmark was added.

The benchmark itself is part of motivation for the change: there are some things that seem worth improving in the SimpleEntryImpl -> SimpleSynchronousEntry interaction, such as the RAM-munching std::queue object, the amount of packing/unpacking we do when the queue is usually empty in the first place, and some of the logic duplication between FooInternal() methods and some of the quick paths in corresponding Foo (which this makes worse). Obviously all of those would need to be measured, and that's a tricky task.

BUG=293014
==========
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
morlovich@chromium.org changed reviewers: + pasko@chromium.org
Come to think of it, this might also not produce one of the netlog entries, and might change the order in which things execute...
Avoiding a delay from a posttask sounds handy, and it happens every time the entry is opened. Given that the entry queue is empty anyway, there should be no reordering problem in simplecache. I don't think layers above will be confused (since the memory cache can do the same).

The netlog is important, however. If you find a non-intrusive way to keep the netlog the same, I'd be OK with that.

The microbenchmark, while good for estimating a few properties right now, sounds like potential overkill to maintain. I might be wrong, I did not read it.

https://codereview.chromium.org/2807433002/diff/40001/net/disk_cache/simple/s...
File net/disk_cache/simple/simple_entry_impl.cc (left):

https://codereview.chromium.org/2807433002/diff/40001/net/disk_cache/simple/s...
net/disk_cache/simple/simple_entry_impl.cc:392: // TODO(felipeg): Optimization: Add support for truly parallel read
fyi: this one did not show much opportunity on the metrics (SimpleCache.Http.WriteDependencyType, SimpleCache.Http.ReadIsParallelizable, SimpleCache.Http.EntryOperationsPending), feel free to remove
On 2017/04/07 16:13:39, pasko wrote:
> Avoiding a delay from a posttask sounds handy, and it happens every time the entry is opened. Given that the entry queue is empty anyway, there should be no reordering problem in simplecache.

Oh, right, the RunNextIfNeeded() will be a no-op.

> I don't think layers above will be confused (since the memory cache can do the same).
>
> The netlog is important, however. If you find a non-intrusive way to keep the netlog the same, I'd be OK with that.

It'd basically be copying a few more lines from ReadDataInternal; though at this point I may just try step 2: calling ReadDataInternal directly (with an extra bool telling it to return rather than invoke the callback) from ReadData when pending_operations_.size() == 0 && state_ == STATE_READY, and removing some of the duplication in the handling of invalid-range cases.

Longer-term I am wondering if I could just store a callback to ReadDataInternal when we are not idle --- SimpleEntryOperation is copied by value a bunch ATM, and has a scoped_refptr<net::IOBuffer> buf_; I can mostly get to that incrementally by just adding a callback state to SimpleEntryOperation to start with.

> The microbenchmark, while good for estimating a few properties right now, sounds like potential overkill to maintain. I might be wrong, I did not read it.
>
> https://codereview.chromium.org/2807433002/diff/40001/net/disk_cache/simple/s...
> File net/disk_cache/simple/simple_entry_impl.cc (left):
>
> https://codereview.chromium.org/2807433002/diff/40001/net/disk_cache/simple/s...
> net/disk_cache/simple/simple_entry_impl.cc:392: // TODO(felipeg): Optimization: Add support for truly parallel read
> fyi: this one did not show much opportunity on the metrics (SimpleCache.Http.WriteDependencyType, SimpleCache.Http.ReadIsParallelizable, SimpleCache.Http.EntryOperationsPending), feel free to remove

Yeah, though it was helpful in determining that the queue length is typically one. I am strongly tempted to remove all the stuff that computes parallelism, too, since it's a lot of complexity and introduces trickiness in changing how queuing works, due to executing_operation_.
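To make the "extra bool" idea above concrete, here is a rough, speculative sketch of the shape (made-up names, a std::function callback, and an out-parameter instead of a bool, which is equivalent in spirit; this is not the real ReadDataInternal() signature): one shared internal implementation that either hands the result back for a synchronous return on the idle fast path, or reports it through the callback on the queued path.

// Speculative sketch only: one internal read implementation shared by the
// idle fast path (synchronous return) and the queued path (callback).
#include <functional>
#include <utility>

class EntryInternalsSketch {
 public:
  using CompletionCallback = std::function<void(int)>;

  // Idle fast path: the caller is still on the stack, so just return.
  int ReadDataSync(int stream, int offset, char* buf, int buf_len) {
    int result = 0;
    ReadDataInternal(stream, offset, buf, buf_len, CompletionCallback(),
                     /*sync_result=*/&result);
    return result;
  }

  // Queued path: the result is delivered through the callback later.
  void ReadDataQueued(int stream, int offset, char* buf, int buf_len,
                      CompletionCallback callback) {
    ReadDataInternal(stream, offset, buf, buf_len, std::move(callback),
                     /*sync_result=*/nullptr);
  }

 private:
  // Shared implementation: range checks, netlog, and bookkeeping would live
  // here once instead of being duplicated in the fast path.
  void ReadDataInternal(int stream, int offset, char* buf, int buf_len,
                        CompletionCallback callback, int* sync_result) {
    int result = DoRead(stream, offset, buf, buf_len);
    if (sync_result)
      *sync_result = result;  // Fast path: caller returns this directly.
    else
      callback(result);       // Queued path: report asynchronously.
  }

  int DoRead(int /*stream*/, int /*offset*/, char* /*buf*/, int buf_len) {
    return buf_len;  // Stand-in for the actual read.
  }
};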
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
A side thought: it came up during the talk that HttpCache* might start accessing headers and payload in parallel. Thinking some more, does it make any sense to serialize stream 0 vs. stream 1 ops, and shouldn't we be able to do all stream 0 stuff immediately as long as the entry is healthy? Hmm, I guess failure conditions are where things may show up?
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2017/04/07 19:21:50, Maks Orlovich wrote:
> A side thought: it came up during the talk that HttpCache* might start accessing headers and payload in parallel. Thinking some more, does it make any sense to serialize stream 0 vs. stream 1 ops, and shouldn't we be able to do all stream 0 stuff immediately as long as the entry is healthy? Hmm, I guess failure conditions are where things may show up?

I can only see something breaking if it writes the new version of the headers and reads back the old version (without waiting for the write completion), which would be a misuse of the API, and I am having a hard time imagining where this edge case would be useful. Hence, probably safe? We already write stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?
> I can only see something breaking if it writes the new version of the headers and reads back the old version (without waiting for the write completion)

Hmm, we could end up with new headers but no new data being visible, though. That sounds potentially bad? I don't know how much HttpCache cares. Sadly none of this stuff seems specified in the disk_cache::Backend docs... I should probably ask Shivani about how much HttpCacheTransaction relies on this. (Of course, that also leaves the other clients, but those might just be simple enough to straight up inspect.)

> which would be a misuse of the API and I am having a hard time imagining where this edge case would be useful. Hence, probably safe? We already write stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?

Well, the question is whether we can drop the "queue is empty" condition for stream 0.

Anyway, I would really appreciate feedback on this CL proper: I think at its current scope it can help see if those thread ping-pongs matter much --- I think the one for CRC computation is avoidable as well (cleanly, but far more complexly than this).
https://codereview.chromium.org/2807433002/diff/60001/net/disk_cache/simple/s...
File net/disk_cache/simple/simple_entry_impl.cc (left):

https://codereview.chromium.org/2807433002/diff/60001/net/disk_cache/simple/s...
net/disk_cache/simple/simple_entry_impl.cc:837: RecordReadResult(cache_type_, READ_RESULT_BAD_STATE);
I should probably double-check the condition for this one...
On 2017/04/12 14:38:41, Maks Orlovich wrote:
> > I can only see something breaking if it writes the new version of the headers and reads back the old version (without waiting for the write completion)
>
> Hmm, we could end up with new headers but no new data being visible, though. That sounds potentially bad?

Since all the entry ops are queued, this can only happen after a browser crash, right? I think it's unlikely to be a problem, and if the world is able to observe such a problem at some point, we can mitigate it: temporarily corrupt the entry every time we write stream0 and un-corrupt it after the next data write.

> I don't know how much HttpCache cares. Sadly none of this stuff seems specified in the disk_cache::Backend docs... I should probably ask Shivani about how much HttpCacheTransaction relies on this.
>
> (Of course, that also leaves the other clients, but those might just be simple enough to straight up inspect.)

I think other clients don't use stream0 (appcache - not sure, anyone still using it? :)

> > which would be a misuse of the API and I am having a hard time imagining where this edge case would be useful. Hence, probably safe? We already write stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?
>
> Well, the question is whether we can drop the "queue is empty" condition for stream 0.

Isn't it always empty when stream0 is read? I think our most common client has to ReadData(stream0) after OpenEntry() before it can think of reading other streams.

> Anyway, I would really appreciate feedback on this CL proper: I think at its current scope it can help see if those thread ping-pongs matter much

How do you want to see if it helps? Would there be a study parameter in the next change, or would you compare histograms from different channels? (spoiler: the latter does not work)

> --- I think the one for CRC computation is avoidable as well (cleanly, but far more complexly than this).

I won't bother optimizing the CRC computation. I believe the effect is small.
> Since all the entry ops are queued, this can only happen after a browser crash, right? I think it's unlikely to be a problem, and if the world is able to observe such a problem at some point, we can mitigate it: temporarily corrupt the entry every time we write stream0 and un-corrupt it after the next data write.

Well, again, the question is whether we can avoid caring about the queue for stream 0. The disk-state-on-crash issue is indeed already handled largely as you suggest, by having a bogus footer. That's another thing that may be possible to speed up a bit, by combining some of the length fiddling into writes. (Sadly it'd be a bit on the hacky and fiddly side, especially since Windows doesn't seem to have a real replacement for writev...) (Stream 0 is only written at Close, though.)

> I think other clients don't use stream0 (appcache - not sure, anyone still using it? :)
>
> > > which would be a misuse of the API and I am having a hard time imagining where this edge case would be useful. Hence, probably safe? We already write stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?
> >
> > Well, the question is whether we can drop the "queue is empty" condition for stream 0.
>
> Isn't it always empty when stream0 is read? I think our most common client has to ReadData(stream0) after OpenEntry() before it can think of reading other streams.

Oh, right, background: Shivani mentioned during my talk that HttpCacheTransaction may try accessing both streams in parallel, for writes, I think. I should probably pin down the details.

> > Anyway, I would really appreciate feedback on this CL proper: I think at its current scope it can help see if those thread ping-pongs matter much
>
> How do you want to see if it helps? Would there be a study parameter in the next change, or would you compare histograms from different channels? (spoiler: the latter does not work)

I would hope it would be big enough to see in trendlines on some of the cache latency metrics, but that's probably naive for anything that's not an order-of-magnitude change to something that's measured more directly.

Studies are just too slow when I am trying to understand a problem more than to confirm a solution :(

> > --- I think the one for CRC computation is avoidable as well (cleanly, but far more complexly than this).
>
> I won't bother optimizing the CRC computation. I believe the effect is small.

Well, it's one more roundtrip around the event loop on read before the final callback, I think?
On 2017/04/12 15:28:05, Maks Orlovich wrote:
> > Since all the entry ops are queued, this can only happen after a browser crash, right? I think it's unlikely to be a problem, and if the world is able to observe such a problem at some point, we can mitigate it: temporarily corrupt the entry every time we write stream0 and un-corrupt it after the next data write.
>
> Well, again, the question is whether we can avoid caring about the queue for stream 0.

I am a little lost now. I thought we were talking about a potential speedup. If reading stream 0 always has an empty queue - mission accomplished, an extra check for entry queue size will ensure correctness at low complexity, and the speedup is there.

> > > > stream0 synchronously to memory without queuing (when the queue is empty). Same spirit, let's do it?
> > >
> > > Well, the question is whether we can drop the "queue is empty" condition for stream 0.
> >
> > Isn't it always empty when stream0 is read? I think our most common client has to ReadData(stream0) after OpenEntry() before it can think of reading other streams.
>
> Oh, right, background: Shivani mentioned during my talk that HttpCacheTransaction may try accessing both streams in parallel, for writes, I think. I should probably pin down the details.

HttpCacheTransaction can only request things in parallel if the disk cache returns asynchronously. So if it asks for stream0 first, and the backend responds right away (as this change suggests), then there would be no parallelism.

> > > Anyway, I would really appreciate feedback on this CL proper: I think at its current scope it can help see if those thread ping-pongs matter much
> >
> > How do you want to see if it helps? Would there be a study parameter in the next change, or would you compare histograms from different channels? (spoiler: the latter does not work)
>
> I would hope it would be big enough to see in trendlines on some of the cache latency metrics, but that's probably naive for anything that's not an order-of-magnitude change to something that's measured more directly.
>
> Studies are just too slow when I am trying to understand a problem more than to confirm a solution :(

OK, I see no issues with the code, it should be good for benchmarking. If the benchmark does not demonstrate any effect, it would be simpler to not commit it. Would you agree to such a plan?

> > > --- I think the one for CRC computation is avoidable as well (cleanly, but far more complexly than this).
> >
> > I won't bother optimizing the CRC computation. I believe the effect is small.
>
> Well, it's one more roundtrip around the event loop on read before the final callback, I think?

Maybe something changed, but I thought we just maintain the crc as we read (if the incoming reads are sequential); maybe there is an extra roundtrip accidentally, but I don't remember it. The data is available in the entry at the last read anyway, and we can return an error without posting to the loop.
https://codereview.chromium.org/2807433002/diff/60001/net/disk_cache/simple/s...
File net/disk_cache/simple/simple_entry_impl.cc (left):

https://codereview.chromium.org/2807433002/diff/60001/net/disk_cache/simple/s...
net/disk_cache/simple/simple_entry_impl.cc:837: RecordReadResult(cache_type_, READ_RESULT_BAD_STATE);
On 2017/04/12 14:55:49, Maks Orlovich wrote:
> I should probably double-check the condition for this one...

this won't appear on the benchmark, so why bother?
> I am a little lost now. I thought we were talking about a potential speedup. If reading stream 0 always has an empty queue - mission accomplished, an extra check for entry queue size will ensure correctness at low complexity, and the speedup is there.

Let's drop this for now; it may be clearer/less confusing if I just make a proposal CL rather than discussing a potential CL N + 2 or something when asking for review of CL N --- this CL isn't intended to change any ordering constraints, after all, just to shave some cycles and an event-loop roundtrip.

> OK, I see no issues with the code, it should be good for benchmarking. If the benchmark does not demonstrate any effect, it would be simpler to not commit it. Would you agree to such a plan?

It does demonstrate it --- see the CL description --- but in terms of CPU only. It also avoids ~5us or so for the loop + callback portion, on top of the numbers given.

> Maybe something changed, but I thought we just maintain the crc as we read (if the incoming reads are sequential); maybe there is an extra roundtrip accidentally, but I don't remember it. The data is available in the entry at the last read anyway, and we can return an error without posting to the loop.

That's correct, except for what happens at the very end: once SimpleEntryImpl finishes the last sequential ReadData it does a PostTaskAndReply to SimpleSynchronousEntry::CheckEOFRecord, which then reads the footer and verifies the passed-in CRC. Whether the extra round trip matters, I can't say. (Might also be an opportunity to combine two reads there, if that matters on OS X, and if seek + readv would be faster.)
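For background on the scheme being described (the CRC is folded in as each sequential read completes, leaving only a footer comparison at the end), here is a minimal, hypothetical sketch of that pattern. It is not the SimpleSynchronousEntry code; the names are made up, and the only real API used is zlib's crc32().

// Minimal illustration of maintaining a CRC across sequential reads and
// checking it against a stored footer value on the read that reaches EOF.
#include <zlib.h>

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

struct StreamWithFooter {
  std::vector<uint8_t> data;
  uint32_t footer_crc;  // CRC of `data`, written when the entry was stored.
};

class SequentialReader {
 public:
  explicit SequentialReader(const StreamWithFooter& stream) : stream_(stream) {}

  // Returns bytes read, or -1 on a failed final CRC check.
  int Read(uint8_t* buf, size_t len) {
    size_t n = std::min(len, stream_.data.size() - offset_);
    std::memcpy(buf, stream_.data.data() + offset_, n);
    // Fold this chunk into the running CRC as we go, so no extra pass over
    // the data is needed at the end.
    running_crc_ = crc32(running_crc_, buf, static_cast<uInt>(n));
    offset_ += n;
    // On the read that reaches EOF, compare against the stored footer CRC
    // (the step SimpleCache currently does via CheckEOFRecord).
    if (offset_ == stream_.data.size() && running_crc_ != stream_.footer_crc)
      return -1;
    return static_cast<int>(n);
  }

 private:
  const StreamWithFooter& stream_;
  size_t offset_ = 0;
  uLong running_crc_ = crc32(0L, Z_NULL, 0);  // zlib's initial CRC value.
};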
On 2017/04/12 16:49:54, Maks Orlovich wrote:
> > OK, I see no issues with the code, it should be good for benchmarking. If the benchmark does not demonstrate any effect, it would be simpler to not commit it. Would you agree to such a plan?
>
> It does demonstrate it --- see the CL description --- but in terms of CPU only. It also avoids ~5us or so for the loop + callback portion, on top of the numbers given.

Oh, sorry, I did not notice :(

My threshold for useful CPU time savings on the IO thread is 5ms for a heavy page load (200 resources). A bigger saving does not automatically mean usefulness, but it would be an interesting topic to talk about, as it could matter more at higher percentiles. For example, I recently argued against a saving of 20ms (but that was a bigger complication).

So if this is about a .5us saving per read, then looking at our size distribution (Net.HttpJob.PrefilterBytesRead.Cache) and the fact that buffers are 32KiB, we would have around 230 reads per page load, which gives a number that is 50 times smaller than my threshold. Is this about right?

If so, then I am not sure why we should review it carefully and fix the potential fallout of edge cases on Stable.

Feel free to start a discussion about useful IO thread performance improvements on loading-dev@; I know other people may have different opinions than mine.
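As a rough check of the arithmetic above, using only the estimates already given in this thread (not measurements): 230 reads per page load x 0.5 us saved per read is about 115 us, and 5 ms / 115 us is roughly 43, so the estimated saving is indeed on the order of fifty times below the stated threshold.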
> My threshold for useful CPU time savings on the IO thread is 5ms for a heavy page load (200 resources). A bigger saving does not automatically mean usefulness, but it would be an interesting topic to talk about, as it could matter more at higher percentiles. For example, I recently argued against a saving of 20ms (but that was a bigger complication).
>
> So if this is about a .5us saving per read, then looking at our size distribution (Net.HttpJob.PrefilterBytesRead.Cache) and the fact that buffers are 32KiB, we would have around 230 reads per page load, which gives a number that is 50 times smaller than my threshold. Is this about right?
>
> If so, then I am not sure why we should review it carefully and fix the potential fallout of edge cases on Stable.
>
> Feel free to start a discussion about useful IO thread performance improvements on loading-dev@; I know other people may have different opinions than mine.

Thanks for sharing your rule; I don't really have a good sense for how small changes tend to affect the overall system[1]. Do you feel avoiding going back to the event loop to post a result isn't usually worth it either?

Also, how do you feel about SimpleCache using somewhere in the 1-4% range of IO thread CPU? (Might be somewhat overestimated by tracing, though, come to think of it.)

[1] It makes me feel better about potentially adding like 0.02us on Windows to save like 4K per entry on Android :)
On 2017/04/13 13:16:26, Maks Orlovich wrote:
> > My threshold for useful CPU time savings on the IO thread is 5ms for a heavy page load (200 resources). A bigger saving does not automatically mean usefulness, but it would be an interesting topic to talk about, as it could matter more at higher percentiles. For example, I recently argued against a saving of 20ms (but that was a bigger complication).
> >
> > So if this is about a .5us saving per read, then looking at our size distribution (Net.HttpJob.PrefilterBytesRead.Cache) and the fact that buffers are 32KiB, we would have around 230 reads per page load, which gives a number that is 50 times smaller than my threshold. Is this about right?
> >
> > If so, then I am not sure why we should review it carefully and fix the potential fallout of edge cases on Stable.
> >
> > Feel free to start a discussion about useful IO thread performance improvements on loading-dev@; I know other people may have different opinions than mine.
>
> Thanks for sharing your rule; I don't really have a good sense for how small changes tend to affect the overall system[1]. Do you feel avoiding going back to the event loop to post a result isn't usually worth it either?
>
> Also, how do you feel about SimpleCache using somewhere in the 1-4% range of IO thread CPU? (Might be somewhat overestimated by tracing, though, come to think of it.)

If it's chunked enough and the IO thread stays responsive, even 20% CPU won't worry me much, since the cache is so coupled with networking. I think the queuing is a bigger problem (what bmaurer@ talked about at BlinkOn7 - the prelude.js being over-queued, slides 23-24 in [1]). The problem is less pronounced now, as we throttle H/2 (ask jkarlin@ for details), but it is still a problem, and changes like this one can potentially help; but the metric should be something like time-to-first-bytes-in-renderer, and IO thread CPU time is hard to project onto it.

[1] https://www.dropbox.com/s/er04jbxu510qkge/blinkon%20jan2017.pdf
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
FYI: Josh expressed interest in having reads return synchronously when discussing the prefetch change, so I'll be reviving this, but looks like it'll need to be re-uploaded to the new review system.
The CQ bit was checked by morlovich@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from
==========
Implement quick path for stream 0 read when nothing else is going on.

The biggest effect of this is that it avoids going back to event loop before invoking the callback. There are also CPU savings, going something like 0.74us -> 0.21us (or 0.66us -> 0.17us) for the call itself, not counting the delayed callback cost. These of course are pretty minor when there are ms of I/O involved, but I may need to chip at every bit. To measure this, another microbenchmark was added.

The benchmark itself is part of motivation for the change: there are some things that seem worth improving in the SimpleEntryImpl -> SimpleSynchronousEntry interaction, such as the RAM-munching std::queue object, the amount of packing/unpacking we do when the queue is usually empty in the first place, and some of the logic duplication between FooInternal() methods and some of the quick paths in corresponding Foo (which this makes worse). Obviously all of those would need to be measured, and that's a tricky task.

BUG=293014
==========
to
==========
This includes stream 0 all the time, and stream 1 if available from a prefetch. This change does regress some zero-length-read type cases from non-idle being sync to reduce code duplication, which may be worth reconsidering and undoing.

This also includes a microbenchmark to measure some of our bookkeeping overhead, which is also improved a bit by this, but it's not the sort of numbers (order of half a microsecond) to get overly excited about.

BUG=293014
==========
Dry run: This issue passed the CQ dry run.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2017/09/22 14:51:11, Maks Orlovich wrote:
> FYI: Josh expressed interest in having reads return synchronously when discussing the prefetch change, so I'll be reviving this, but looks like it'll need to be re-uploaded to the new review system.

Moved to: https://chromium-review.googlesource.com/#/c/chromium/src/+/682139