Recent comments posted to this site:
I've thought up a way to solve this problem; it's in external remote querying transition.
There is a difference between a WHEREIS that for some reason itself hit the network, and a single network connection in PREPARE. The language was really talking about the former, which would make whereis on a large number of files painful. Not saying it wouldn't be better to avoid the latter too; if the user is only running whereis on 1 file, the overhead is just as bad.
Hmm, there is that "long running network connections" section that encourages using PREPARE that way. I think the idea was to make it as simple as possible to implement an external remote. All of git-annex's built-in remotes defer anything like that until it's needed.
In a way the real problem here is that WHEREIS is something most remotes will never need to implement, but it's queried of all of them. If only the few remotes that implement it needed to avoid network connections in PREPARE, that would not be much trouble to do.
PREPARE-LOCAL would need to be a protocol extension, so special remotes would have to be modified to request it, and those that are not modified would still have the overhead. Would that be any more likely to happen/easier to do than modifying all special remotes to defer network connections until needed?
"Note that users expect git annex whereis to run fast, without eg, network access"
Currently, git-annex spins up a remote process for every git annex whereis command that involves a file present on the remote (w/o chunking & encryption). As most remotes establish their network connection during the PREPARE phase, the command is slowed down, especially with a bad internet connection. So I propose an extension, PREPARE-LOCAL, that tells the remote to get all necessary config information but skip the networking.
Alternatively, the remotes could wait to establish a network connection until the first transfer command is sent, but I think something like PREPARE-LOCAL would be the cleaner solution.
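For illustration, here is a minimal sketch of the alternative mentioned above: an external remote that answers PREPARE and WHEREIS locally and only connects when the first TRANSFER arrives. The request and reply strings are the ones used by the external special remote protocol; the LazyRemote class, the open_connection factory, and the store/retrieve methods on the connection are hypothetical names made up for this example, not part of git-annex or any existing remote.

```python
class LazyRemote:
    """Hypothetical external remote that defers its network connection."""

    def __init__(self, open_connection):
        # open_connection is an assumed zero-argument factory that performs
        # the (slow) network setup and returns a connection object.
        self._open_connection = open_connection
        self._conn = None

    def _connection(self):
        # Connect on first use only; later calls reuse the same connection.
        if self._conn is None:
            self._conn = self._open_connection()
        return self._conn

    def handle(self, line):
        """Return the reply for one request line sent by git-annex."""
        parts = line.split(" ", 3)
        cmd = parts[0]
        if cmd == "PREPARE":
            # Local setup only (reading config and so on); no network access.
            return "PREPARE-SUCCESS"
        if cmd == "WHEREIS":
            # This sketch has nothing to report, so it answers without
            # touching the network.
            return "WHEREIS-FAILURE"
        if cmd == "TRANSFER" and len(parts) == 4:
            direction, key, path = parts[1], parts[2], parts[3]
            try:
                conn = self._connection()  # the first network use happens here
                if direction == "STORE":
                    conn.store(key, path)      # hypothetical method
                else:
                    conn.retrieve(key, path)   # hypothetical method
            except Exception as e:
                return f"TRANSFER-FAILURE {direction} {key} {e}"
            return f"TRANSFER-SUCCESS {direction} {key}"
        return "UNSUPPORTED-REQUEST"
```

With this shape, a git annex whereis run that only sends PREPARE and WHEREIS never calls open_connection, so the cost of connecting is paid only by commands that actually transfer data.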
After switching many internal types to ByteString.
(Note that stack build --profile built this with -O, not -O2, so it's not as fast as it ought to be, but the cost centers are probably fairly accurate still.)
Mon Jan 14 17:17 2019 Time and Allocation Profiling Report (Final)
git-annex +RTS -p -RTS find
total time = 3.07 secs (3074 ticks @ 1000 us, 1 processor)
total alloc = 1,880,855,184 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
inAnnex'.checkindirect Annex.Content Annex/Content.hs:(106,9)-(121,39) 31.6 41.0
splitc Utility.Split Utility/Split.hs:(24,1)-(26,25) 4.8 5.2
keyFile' Annex.Locations Annex/Locations.hs:(518,1)-(524,30) 4.5 5.2
encodeW8 Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:(189,1)-(191,70) 3.1 3.6
>>=.\ Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(146,9)-(147,44) 2.5 0.7
_encodeFilePath Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:(111,1)-(114,49) 2.5 2.7
>>=.\.succ' Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76 2.2 0.2
fileKey' Annex.Locations Annex/Locations.hs:(532,1)-(541,41) 2.2 1.5
getState Annex Annex.hs:(254,1)-(257,27) 2.1 0.4
getAnnexLinkTarget'.probesymlink Annex.Link Annex/Link.hs:77:9-62 1.9 2.5
w82s Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:217:1-15 1.8 5.2
keyPath Annex.Locations Annex/Locations.hs:(551,1)-(553,23) 1.7 3.5
keyFile'.esc Annex.Locations Annex/Locations.hs:(520,9)-(524,30) 1.6 4.7
fileKey'.go Annex.Locations Annex/Locations.hs:535:9-63 1.6 1.6
s2w8 Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:214:1-15 1.3 3.5
withPtr Basement.Block.Base Basement/Block/Base.hs:(395,1)-(404,31) 1.3 0.5
parseLinkTarget Annex.Link Annex/Link.hs:(247,1)-(255,25) 1.2 3.8
parseKeyVariety Types.Key Types/Key.hs:(135,1)-(184,41) 1.2 0.0
assertLocal Git Git.hs:(123,1)-(129,28) 0.8 1.6
decodeBS' Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:151:1-31 0.6 2.4
Notice that the percent of time inAnnex' went up from 14.1% to 31.6%. That and getAnnexLinkTarget are the meat of the IO, so it's good for them to get a higher percent of the CPU, to the extent they're IO bound. It seems like getAnnexLinkTarget also lost a lot of non-IO overhead.
There are still some overheads from conversion to and from ByteString, but the above does seem like a good improvement.
Mon Jan 14 17:56 2019 Time and Allocation Profiling Report (Final)
git-annex +RTS -p -RTS find --not --in web
total time = 7.62 secs (7622 ticks @ 1000 us, 1 processor)
total alloc = 1,908,064,368 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
catObjectDetails.\ Git.CatFile Git/CatFile.hs:(83,88)-(91,97) 6.5 3.8
catchDefaultIO Utility.Exception Utility/Exception.hs:57:1-53 6.4 2.5
parseResp Git.CatFile Git/CatFile.hs:(141,1)-(152,28) 4.9 5.0
MAIN MAIN <built-in> 4.6 0.4
>>=.\ Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(146,9)-(147,44) 4.6 1.7
>>=.\.succ' Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76 4.1 0.7
getState Annex Annex.hs:(254,1)-(257,27) 2.7 1.1
simplifyPath Utility.Path Utility/Path.hs:(38,1)-(50,48) 2.6 6.8
splitc Utility.Split Utility/Split.hs:(24,1)-(26,25) 2.5 5.1
keyFile' Annex.Locations Annex/Locations.hs:(518,1)-(524,30) 1.7 5.2
getAnnexLinkTarget'.probesymlink Annex.Link Annex/Link.hs:77:9-62 1.7 2.6
journalFile Annex.Journal Annex/Journal.hs:(107,1)-(112,33) 1.7 5.7
catches Control.Monad.Catch src/Control/Monad/Catch.hs:(795,1)-(799,76) 1.7 2.7
query.send Git.CatFile Git/CatFile.hs:137:9-32 1.6 0.5
delEntry Utility.Env Utility/Env.hs:(57,1)-(60,48) 1.6 0.8
encodeW8 Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:(189,1)-(191,70) 1.4 3.5
query Git.CatFile Git/CatFile.hs:(130,1)-(138,26) 1.3 0.0
Notice that allocations dropped by 1/3rd!
Otherwise, not a large change here.
The documentation says:
"Note that setting annex.thin only has any effect on systems that support hard links. It is supported on Windows, but not on FAT filesystems."
Having read that, I thought I'd be able to use annex.thin with NTFS, but it doesn't work. I'd state clearly that annex.thin does not work on NTFS either.
Thanks
The page about unlocked files says:
setting annex.thin only has any effect on systems that support hard links. It is supported on Windows, but not on FAT filesystems.
The obvious reason is that it's not something that's often used or that has had much demand for being sped up. In particular, it's written as a call to unannex each annexed file, and that runs git rm --cached once per file, which can be slow.
But there are also non-obvious things that it may need to do that can be slow. For example, if two files in the git repo point to the same git-annex object, it has to make a copy of the object to one of the worktree files.
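To make the shared-object case concrete, here is a rough sketch of the bookkeeping involved. It is not how git-annex implements this; plan_unannex and the "move"/"copy" labels are hypothetical, and the only thing taken from the comment above is that a second worktree file pointing at the same object needs its own copy.

```python
def plan_unannex(file_to_key):
    """Decide, per worktree file, whether it can take over the annexed
    object or needs a fresh copy because another file already took it.

    file_to_key maps worktree paths to annex keys (illustrative input).
    """
    taken = set()  # keys whose object has already been handed to a file
    plan = []
    for path, key in file_to_key.items():
        if key in taken:
            # Another file already received this object, so this one
            # needs a full copy of the content, which is the slow part.
            plan.append(("copy", path))
        else:
            taken.add(key)
            plan.append(("move", path))
    return plan
```

For example, plan_unannex({"a": "K1", "b": "K1", "c": "K2"}) marks "a" and "c" as moves and "b" as a copy, so the extra cost grows with the number of duplicate files and the size of their content.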