[question] What kind of HTTP request method is used for file crawling? #20

Open
sho-suzuki opened this issue Feb 13, 2018 · 9 comments

@sho-suzuki

plugin version

1.3.1

gitbucket version

4.20

what is the problem

In a proxy environment, I can't get content from files, but I can get issues and wikis.
fess-crawler.log is as follows:

# file crawling log
2018-02-13 18:12:32,511 [5DFNjmEBO7Desvq7XhyO-1] INFO  Get a content from http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge
2018-02-13 18:12:35,028 [5DFNjmEBO7Desvq7XhyO-1] WARN  Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e
org.codelibs.fess.crawler.exception.CrawlingAccessException: Failed to parse http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true
        at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:184) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeFileContent(GitBucketDataStoreImpl.java:291) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.lambda$storeData$4713(GitBucketDataStoreImpl.java:134) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:441) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeData(GitBucketDataStoreImpl.java:124) [classes/:?]
        at org.codelibs.fess.ds.impl.AbstractDataStoreImpl.store(AbstractDataStoreImpl.java:106) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.process(DataIndexHelper.java:236) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.run(DataIndexHelper.java:222) [classes/:?]
Caused by: org.codelibs.fess.crawler.exception.MultipleCrawlingAccessException: 
Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true;
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true; 
 Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): 
 http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/hoge?ref=b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e&large_file=true
        at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:95) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:148) ~[classes/:?]
        ... 9 more

# issue crawl log
2018-02-13 18:43:02,794 [5DFNjmEBO7Desvq7XhyO-1] INFO  Get a content from http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/issues/17

On Linux, both requests seem to return the same result.

# file request
curl http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/contents/README.md
{"message":"Requires authentication"}
# issue request
curl http://gitbucket:8080/gitbucket/api/v3/repos/name/repo/issues/21
{"message":"Requires authentication"}

I think it may be a problem with the proxy settings (the proxy may be discarding the file requests).
I would like to know what HTTP request the file crawl API sends.

Thanks.

@keiichiw
Contributor

keiichiw commented Feb 13, 2018

I'm not sure that your problem is caused by the proxy, but could you try the following command?

$ curl -H "Authorization: token <token>" "http://localhost:8080/gitbucket/api/v3/repos/<user name>/<repository name>/contents/<file name>?ref=<commit hash>&large_file=true"

The value <token> is the one generated by GitBucket here.

The value <commit hash> is b7d5e8b5fba9a7927ff2b5106066e790ad2ced4e in your case.
It can be obtained by:

$ curl -H "Authorization: token <token>" "http://localhost:8080/gitbucket/api/v3/repos/<user name>/<repository name>/git/refs/heads/master"

If you want to learn more about how Fess gets files, see GitBucketDataStoreImpl.java.
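
The curl commands above send a plain GET (curl's default when no method or request body is given) with an `Authorization: token ...` header and the `ref` and `large_file=true` query parameters. A minimal Python sketch of the same request, built but not sent. The function name `build_contents_request` and the sample values in the usage note are illustrative assumptions, not part of Fess or GitBucket:

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_contents_request(base_url, owner, repo, path, ref, token):
    """Build (but do not send) a GET request for one file's contents.

    Mirrors the curl command in this thread: same URL layout, same
    query parameters, same Authorization header. Hypothetical helper.
    """
    query = urlencode({"ref": ref, "large_file": "true"})
    url = "{}/api/v3/repos/{}/{}/contents/{}?{}".format(
        base_url.rstrip("/"),
        quote(owner),
        quote(repo),
        quote(path),  # "/" is kept as-is, so nested paths still work
        query,
    )
    return Request(url, headers={"Authorization": "token " + token}, method="GET")
```

For example, `build_contents_request("http://localhost:8080/gitbucket", "user", "repo", "hoge", "<commit hash>", "<token>")` yields a request object whose `full_url` and headers can be inspected before passing it to `urllib.request.urlopen`.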

@sho-suzuki
Author

Thanks @kw-udon.
I got a response when I ran the command you suggested.

# curl -H "Authorization: token 284530a64e55176f9ed9*********" "http://gitbucket:8080/gitbucket/api/v3/repos/root/name/contents/hoge?ref=efcd9adbec49f73f762b7b2127153593024e4bea&large_file=true"

{"type":"file","name":"hoge","path":"hoge","sha":"efcd9adbec49f73f762b7b2127153593024e4bea","content":"IyBBcHAgYXJ0aWZhY3RzCi9fYnVpbGQKLLmV4cw==","encoding":"base64","download_url":"http://gitbucket:8080/gitbucket/api/v3/repos/root/name/raw/efcd9adbec49f73f762b7b2127153593024e4bea/hoge"}

So the proxy neither discarded nor refused the request.
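
The response above reports `"encoding":"base64"`, so the `content` field has to be decoded before it can be read. A minimal sketch, assuming the body is JSON shaped like the response above and the file is UTF-8 text. The helper name `decode_contents_response` is hypothetical, and any payload passed to it should be a complete base64 string (the one pasted above appears to have been shortened, since its length is not a multiple of four):

```python
import base64
import json


def decode_contents_response(body):
    """Return the decoded text of a contents API response body.

    Assumes the JSON layout shown in this thread: an "encoding" field
    set to "base64" and a "content" field holding the encoded file.
    """
    doc = json.loads(body)
    if doc.get("encoding") != "base64":
        raise ValueError("unexpected encoding: {!r}".format(doc.get("encoding")))
    return base64.b64decode(doc["content"]).decode("utf-8")
```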

@keiichiw
Contributor

A MultipleCrawlingAccessException occurred in your log file, but I don't know what can raise this exception.
Do you have any idea @marevol?

@marevol
Contributor

marevol commented Feb 14, 2018

Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)):

The cause is the line above. It's a network problem.
I think the problem is a proxy setting or something similar.
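
"Connection refused" generally means the TCP connection attempt itself was rejected, before any HTTP request was sent. A minimal sketch for checking this from the machine running the crawler, assuming a direct connection with no proxy involved. The helper name `probe_tcp` is an illustrative assumption:

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a plain TCP connect and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or something in between) actively rejected the connect.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # DNS failure, unreachable network, and so on.
        return "error: {}".format(exc)
```

Calling `probe_tcp("gitbucket", 8080)` on the Fess host would show whether a direct connection to the address from the log is possible at all. A "refused" result there, combined with a working curl through the proxy, would suggest the crawler is bypassing the proxy for that request.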

@sho-suzuki
Author

@marevol @kw-udon There is only one crawler that crawls gitbucket.
How can I get detailed logs of the HTTP requests (the equivalent of the curl requests above) that are sent when crawling starts?

@marevol
Contributor

marevol commented Feb 15, 2018

@sho-suzuki
Author

@marevol Thanks!
I changed the crawler log level from info to debug. fess-crawler.log is as follows.

  • fess-crawler.log
2018-02-15 14:15:37,744 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Accessing http://gitbucket:8080/gitbucket/api/v3/repos/user/repo/contents/hoge?ref=37cce0819cdf0a357e0b5e9bc373030dbfa84cd6&large_file=true
2018-02-15 14:15:37,745 [5DFNjmEBO7Desvq7XhyO-1] DEBUG CookieSpec selected: default
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection request: [route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 0 of 20; total allocated: 0 of 200]
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection leased: [id: 1][route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 1 of 20; total allocated: 1 of 200]
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Opening connection {}->http://gitbucket:8080
2018-02-15 14:15:37,746 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connecting to gitbucket/IP:8080
2018-02-15 14:15:37,747 [5DFNjmEBO7Desvq7XhyO-1] DEBUG http-outgoing-1: Shutdown connection
2018-02-15 14:15:37,747 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection discarded
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection released: [id: 1][route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 0 of 20; total allocated: 0 of 200]
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Cancelling request execution
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Failed to access to http://gitbucket:8080/gitbucket/api/v3/repos/user/repo/contents/hoge?ref=37cce0819cdf0a357e0b5e9bc373030dbfa84cd6&large_file=true
org.codelibs.fess.crawler.exception.CrawlingAccessException: Connection time out(Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)): http://gitbucket:8080/gitbucket/api/v3/repos/user/repo/contents/hoge?ref=37cce0819cdf0a357e0b5e9bc373030dbfa84cd6&large_file=true
        at org.codelibs.fess.crawler.client.http.HcHttpClient.processHttpMethod(HcHttpClient.java:820) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.http.HcHttpClient.doHttpMethod(HcHttpClient.java:623) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.http.HcHttpClient.doGet(HcHttpClient.java:582) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.AbstractCrawlerClient.execute(AbstractCrawlerClient.java:142) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.FaultTolerantClient.execute(FaultTolerantClient.java:67) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.helper.DocumentHelper.processRequest(DocumentHelper.java:148) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeFileContent(GitBucketDataStoreImpl.java:291) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.lambda$storeData$4713(GitBucketDataStoreImpl.java:134) ~[classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:441) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.crawlFileContents(GitBucketDataStoreImpl.java:447) [classes/:?]
        at org.codelibs.fess.ds.impl.GitBucketDataStoreImpl.storeData(GitBucketDataStoreImpl.java:124) [classes/:?]
        at org.codelibs.fess.ds.impl.AbstractDataStoreImpl.store(AbstractDataStoreImpl.java:106) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.process(DataIndexHelper.java:236) [classes/:?]
        at org.codelibs.fess.helper.DataIndexHelper$DataCrawlingThread.run(DataIndexHelper.java:222) [classes/:?]
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to gitbucket:8080 [gitbucket/IP] failed: Connection refused (Connection refused)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.4.jar:4.5.4]
2018-02-15 14:15:37,747 [5DFNjmEBO7Desvq7XhyO-1] DEBUG http-outgoing-1: Shutdown connection
2018-02-15 14:15:37,747 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection discarded
2018-02-15 14:15:37,748 [5DFNjmEBO7Desvq7XhyO-1] DEBUG Connection released: [id: 1][route: {}->http://gitbucket:8080][total kept alive: 0; route allocated: 0 of 2
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_161]
        at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_161]
        at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.4.jar:4.5.4]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.4.jar:4.5.4]
        at org.codelibs.fess.crawler.client.http.HcHttpClient.executeHttpClient(HcHttpClient.java:852) ~[fess-crawler-2.0.1.jar:?]
        at org.codelibs.fess.crawler.client.http.HcHttpClient.processHttpMethod(HcHttpClient.java:660) ~[fess-crawler-2.0.1.jar:?]
        ... 13 more
...
2018-02-15 14:15:42,103 [CoreLib-TimeoutManager] DEBUG Closing expired connections
2018-02-15 14:15:42,105 [CoreLib-TimeoutManager] DEBUG Closing connections idle longer than 60000 MILLISECONDS

From this log, the connection appears to be dropped because of a connection timeout or a refused connection.
I also changed GitBucket's logback-setting.xml as shown below, but no application log entries were found.

  • logback-setting.xml
<configuration debug="true" scan="true" scanPeriod="60 seconds"> 
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- encoders are  by default assigned the type
         ch.qos.logback.classic.encoder.PatternLayoutEncoder -->
  
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>INFO</level>
        </filter>
        <encoder>
            <pattern> %date %-4relative [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
 
    <appender name="ROLLING" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <!-- encoders are  by default assigned the type
         ch.qos.logback.classic.encoder.PatternLayoutEncoder -->
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <!-- rollover daily and compress-->
            <fileNamePattern>/gitbucket/log/gitbucket-%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
            <!-- compressed logs are kept for 30 days and then deleted -->
            <maxHistory>30</maxHistory>
            <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
                <maxFileSize>25MB</maxFileSize>
            </timeBasedFileNamingAndTriggeringPolicy>
        </rollingPolicy>
 
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>INFO</level>
        </filter>
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} %-4relative [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
 
    <root level="DEBUG">
        <appender-ref ref="STDOUT"/>
        <appender-ref ref="ROLLING"/>
    </root>
</configuration>

Any ideas?

@marevol
Contributor

marevol commented Feb 19, 2018

Did you configure proxy settings?
See codelibs/fess#1066

@sho-suzuki
Author

@marevol Yes, I configured the proxy settings in fess_config.properties:

http.proxy.host=proxy_IP
http.proxy.port=proxy_port
http.proxy.username=
http.proxy.password=
  • My proxy does not authenticate users.
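
One way to check whether a proxy was actually used for the failing file request is the `route:` field in the DEBUG log earlier in this thread. A minimal sketch, assuming Apache HttpClient 4.x prints a route as `{flags}->proxy->target`, with the proxy hop absent for a direct connection. The helper name `parse_route` and the proxied sample in the usage note are illustrative assumptions:

```python
import re

ROUTE_FIELD = re.compile(r"\[route: ([^\]]*)\]")


def parse_route(log_line):
    """Extract proxy hops and target host from an HttpClient DEBUG line.

    Returns {"proxies": [...], "target": "..."} or None if the line
    has no recognizable route field.
    """
    match = ROUTE_FIELD.search(log_line)
    if match is None:
        return None
    # Everything after "}->" is the hop list: zero or more proxies, then the target.
    _, sep, hops = match.group(1).partition("}->")
    if not sep:
        return None
    hosts = hops.split("->")
    return {"proxies": hosts[:-1], "target": hosts[-1]}
```

Applied to the `Connection request: [route: {}->http://gitbucket:8080]...` line from the log above, this returns an empty `proxies` list, which would be consistent with the file request going directly to `gitbucket:8080` rather than through the configured proxy. A line such as `[route: {}->http://proxy.example:3128->http://gitbucket:8080]` (a made-up example) would list one proxy hop.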
