Recent comments posted to this site:


I've thought up a way to solve this problem; it's described in external remote querying transition.

Comment by joey Thu Jan 17 15:49:05 2019

There is a difference between a WHEREIS that for some reason itself hit the network, and a single network connection in PREPARE. The language was really talking about the former, which would make whereis on a large number of files painful. Not saying it wouldn't be better to avoid the latter too; if the user is only running whereis on 1 file the overhead is just as bad.

Hmm, there is that "long running network connections" section that encourages using PREPARE that way. I think the idea was to make it as simple as possible to implement an external remote. All of git-annex's built-in remotes defer anything like that until it's needed.

In a way the real problem here is that WHEREIS is something most remotes will never need to implement, but it's queried of all of them. If only the few remotes that implement it needed to avoid network connections in PREPARE, that would not be much trouble to do.

PREPARE-LOCAL would need to be a protocol extension, so special remotes would have to be modified to request it, and those that are not modified would still have the overhead. Would that be any more likely to happen/easier to do than modifying all special remotes to defer network connections until needed?
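
A rough sketch of what requesting such an extension might look like from the remote's side, in Python. It assumes the remote answers an EXTENSIONS line with the offered keywords it also supports, which is a simplification chosen for this sketch. PREPARE-LOCAL is only the keyword proposed in this thread, not something git-annex implements, and `SUPPORTED` and `reply_to_extensions` are made-up names.

```python
# Keywords this hypothetical remote understands. PREPARE-LOCAL is the
# extension proposed in this thread; it is not part of git-annex.
SUPPORTED = {"INFO", "PREPARE-LOCAL"}


def reply_to_extensions(line, supported=SUPPORTED):
    """Answer an 'EXTENSIONS ...' line with the keywords both sides know."""
    words = line.split()
    if not words or words[0] != "EXTENSIONS":
        return "UNSUPPORTED-REQUEST"
    offered = words[1:]
    # Keep the order git-annex used, drop anything this remote lacks.
    common = [ext for ext in offered if ext in supported]
    return " ".join(["EXTENSIONS"] + common)


if __name__ == "__main__":
    print(reply_to_extensions("EXTENSIONS INFO ASYNC PREPARE-LOCAL"))
```

A remote that was never modified would not list PREPARE-LOCAL in its reply, so it would still be sent a plain PREPARE and keep the overhead.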

Comment by joey Wed Jan 16 18:24:08 2019

"Note that users expect git annex whereis to run fast, without eg, network access"

Currently, git-annex spins up a remote process for every git annex whereis command that involves a file present on the remote (without chunking and encryption). As most remotes establish their network connection during the PREPARE phase, the command is slowed down, especially on a bad internet connection. So I propose an extension, PREPARE-LOCAL, that tells the remote to get all necessary config information but skip the networking.

Alternatively, the remotes could wait to establish a network connection until the first transfer command is sent, but I think something like PREPARE-LOCAL would be the cleaner solution.
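
To illustrate that alternative, here is a minimal Python sketch of a remote that answers PREPARE without networking and only connects on the first TRANSFER. The message names are the ones I understand the external special remote protocol to use; `LazyConnection`, `handle` and the `connect` callable are hypothetical names for this example, and a real remote would also read config in PREPARE and actually move data.

```python
class LazyConnection:
    """Defers an expensive connect() call until the connection is needed."""

    def __init__(self, connect):
        self._connect = connect  # hypothetical callable that opens the connection
        self._conn = None
        self.connect_count = 0   # only here to make the behaviour observable

    def get(self):
        if self._conn is None:
            self._conn = self._connect()
            self.connect_count += 1
        return self._conn


def handle(line, lazy):
    """Reply to one protocol line; only TRANSFER touches the network."""
    words = line.split(" ", 3)
    cmd = words[0]
    if cmd == "PREPARE":
        # No networking here, so whereis stays fast.
        return "PREPARE-SUCCESS"
    if cmd == "WHEREIS":
        # This sketch has no location information to report.
        return "WHEREIS-FAILURE"
    if cmd == "TRANSFER" and len(words) >= 3:
        direction, key = words[1], words[2]
        lazy.get()  # first use opens the connection, later uses reuse it
        return f"TRANSFER-SUCCESS {direction} {key}"
    return "UNSUPPORTED-REQUEST"
```

With this shape, a run that only sends PREPARE and WHEREIS never calls `connect` at all.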

Comment by lykos Tue Jan 15 15:47:39 2019

Profiling results after switching many internal types to ByteString:

(Note that stack build --profile built this with -O, not -O2, so it's not as fast as it ought to be, but the cost centers are probably fairly accurate still.)

        Mon Jan 14 17:17 2019 Time and Allocation Profiling Report  (Final)

           git-annex +RTS -p -RTS find

        total time  =        3.07 secs   (3074 ticks @ 1000 us, 1 processor)
        total alloc = 1,880,855,184 bytes  (excludes profiling overheads)

COST CENTRE                      MODULE                         SRC                                                 %time %alloc

inAnnex'.checkindirect           Annex.Content                  Annex/Content.hs:(106,9)-(121,39)                    31.6   41.0
splitc                           Utility.Split                  Utility/Split.hs:(24,1)-(26,25)                       4.8    5.2
keyFile'                         Annex.Locations                Annex/Locations.hs:(518,1)-(524,30)                   4.5    5.2
encodeW8                         Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:(189,1)-(191,70)        3.1    3.6
>>=.\                            Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(146,9)-(147,44)    2.5    0.7
_encodeFilePath                  Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:(111,1)-(114,49)        2.5    2.7
>>=.\.succ'                      Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76           2.2    0.2
fileKey'                         Annex.Locations                Annex/Locations.hs:(532,1)-(541,41)                   2.2    1.5
getState                         Annex                          Annex.hs:(254,1)-(257,27)                             2.1    0.4
getAnnexLinkTarget'.probesymlink Annex.Link                     Annex/Link.hs:77:9-62                                 1.9    2.5
w82s                             Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:217:1-15                1.8    5.2
keyPath                          Annex.Locations                Annex/Locations.hs:(551,1)-(553,23)                   1.7    3.5
keyFile'.esc                     Annex.Locations                Annex/Locations.hs:(520,9)-(524,30)                   1.6    4.7
fileKey'.go                      Annex.Locations                Annex/Locations.hs:535:9-63                           1.6    1.6
s2w8                             Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:214:1-15                1.3    3.5
withPtr                          Basement.Block.Base            Basement/Block/Base.hs:(395,1)-(404,31)               1.3    0.5
parseLinkTarget                  Annex.Link                     Annex/Link.hs:(247,1)-(255,25)                        1.2    3.8
parseKeyVariety                  Types.Key                      Types/Key.hs:(135,1)-(184,41)                         1.2    0.0
assertLocal                      Git                            Git.hs:(123,1)-(129,28)                               0.8    1.6
decodeBS'                        Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:151:1-31                0.6    2.4

Notice that the percentage of time spent in inAnnex' went up from 14.1% to 31.6%. That and getAnnexLinkTarget are the meat of the IO, so it's good for them to get a higher percentage of the CPU, to the extent they're IO bound. It seems like getAnnexLinkTarget also lost a lot of non-IO overhead.

There are still some overheads from conversion to and from ByteString, but the above does seem like a good improvement.

        Mon Jan 14 17:56 2019 Time and Allocation Profiling Report  (Final)

           git-annex +RTS -p -RTS find --not --in web

        total time  =        7.62 secs   (7622 ticks @ 1000 us, 1 processor)
        total alloc = 1,908,064,368 bytes  (excludes profiling overheads)

COST CENTRE                      MODULE                         SRC                                                 %time %alloc

catObjectDetails.\               Git.CatFile                    Git/CatFile.hs:(83,88)-(91,97)                        6.5    3.8
catchDefaultIO                   Utility.Exception              Utility/Exception.hs:57:1-53                          6.4    2.5
parseResp                        Git.CatFile                    Git/CatFile.hs:(141,1)-(152,28)                       4.9    5.0
MAIN                             MAIN                           <built-in>                                            4.6    0.4
>>=.\                            Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(146,9)-(147,44)    4.6    1.7
>>=.\.succ'                      Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76           4.1    0.7
getState                         Annex                          Annex.hs:(254,1)-(257,27)                             2.7    1.1
simplifyPath                     Utility.Path                   Utility/Path.hs:(38,1)-(50,48)                        2.6    6.8
splitc                           Utility.Split                  Utility/Split.hs:(24,1)-(26,25)                       2.5    5.1
keyFile'                         Annex.Locations                Annex/Locations.hs:(518,1)-(524,30)                   1.7    5.2
getAnnexLinkTarget'.probesymlink Annex.Link                     Annex/Link.hs:77:9-62                                 1.7    2.6
journalFile                      Annex.Journal                  Annex/Journal.hs:(107,1)-(112,33)                     1.7    5.7
catches                          Control.Monad.Catch            src/Control/Monad/Catch.hs:(795,1)-(799,76)           1.7    2.7
query.send                       Git.CatFile                    Git/CatFile.hs:137:9-32                               1.6    0.5
delEntry                         Utility.Env                    Utility/Env.hs:(57,1)-(60,48)                         1.6    0.8
encodeW8                         Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:(189,1)-(191,70)        1.4    3.5
query                            Git.CatFile                    Git/CatFile.hs:(130,1)-(138,26)                       1.3    0.0

Notice that allocations dropped by 1/3rd!

Otherwise, not a large change here.

Comment by joey Mon Jan 14 21:18:17 2019

@joey I'm obviously missing something here, why would a shorter way to write that only be useful for direct mode? I don't understand what the connection is between direct mode and wanting to specify whether this is a "regular git" file or an annexed file (except that direct mode is not supported in v7)? I thought it was considered supported to have a mix of both large binary files and text files? Even if some text files are large, I think I want to add them as files whose content is tracked by git, so I think I want to choose 'by hand' -- is that not really supported / considered a bad idea for some reason?

Comment by timeless-ventricle Sun Jan 6 12:24:49 2019

The documentation says:

"Note that setting annex.thin only has any effect on systems that support hard links. It is supported on Windows, but not on FAT filesystems."

Having read that, I expected to be able to use annex.thin on NTFS, but it doesn't work. I'd state clearly that annex.thin does not work on NTFS either.

Thanks

Comment by colin.brosseau Thu Jan 3 18:04:58 2019

The page about unlocked files says:

setting annex.thin only has any effect on systems that support hard links. It is supported on Windows, but not on FAT filesystems.

Comment by chocolate.camera Tue Jan 1 18:17:21 2019

The obvious reason is that it's not something often used or that has had much demand for being sped up. In particular, it's written as a call to unannex each annexed file, and that runs git rm --cached once per file, which can be slow.

But there are also non-obvious things that it may need to do that can be slow. For example, if two files in the git repo point to the same git-annex object, it has to make a copy of the object to one of the worktree files.
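
To make the duplicate case concrete, here is a small Python sketch that groups worktree files by the key they point to and separates the files that could take the object itself from the ones that would need a copy. This is not how git-annex implements it; `plan_unannex` and its input mapping are hypothetical, and which file gets the original is an arbitrary choice made here.

```python
from collections import defaultdict


def plan_unannex(links):
    """links maps worktree file -> key. Returns (moves, copies).

    The first file (in sorted order) for each key can take the object;
    every further file pointing at the same key needs its own copy.
    """
    by_key = defaultdict(list)
    for path, key in sorted(links.items()):
        by_key[key].append(path)
    moves, copies = [], []
    for key, paths in by_key.items():
        moves.append((key, paths[0]))
        copies.extend((key, path) for path in paths[1:])
    return moves, copies
```

Each entry in `copies` is extra IO on top of the per-file work, which is one reason a repository with many duplicate files can be slower than its file count suggests.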

Comment by joey Tue Jan 1 16:26:57 2019