http://bugzilla.meego.com/show_bug.cgi?id=1007
Summary: nightly testing + valgrind results
Classification: MeeGo Projects
Product: SyncEvolution
Version: unspecified
Platform: Netbook
OS/Version: IA
Status: ASSIGNED
Severity: enhancement
Priority: High
Component: SyncEvolution
AssignedTo: syncevolution-bugs(a)meego.bugs
ReportedBy: jingke.zhang(a)intel.com
QAContact: jingke.zhang(a)intel.com
CC: syncevolution-bugs(a)meego.bugs,
syncevolution-default-bugs(a)meego.bugs
Estimated Hours: 0.0
This is from
http://bugzilla.moblin.org/show_bug.cgi?id=6457
Description From pohly 2009-09-27 13:32:12 PST
While running some routine valgrind checks of the dbus-api branch I noticed a
regression imported from "master" (invalid memory access to result of
SyncSourceBackendDebug()). The code went into "master" earlier last week. The
nightly testing should have found this problem, because syncs should be run
under valgrind control.
Do we still do that? If we do, did valgrind find the problem? It should be
reported at the end of the client-test run to stdout/stderr and result in a
non-zero return code of client-test.
I suspect that since the migration to the HTML format for test results this
particular aspect of testing is no longer included in the output and thus might
have been missed last week.
Yongsheng, can you check this?
------- Comment #1 From yongsheng zhu 2009-09-27 18:04:59 PST -------
ok, we'll investigate it.
------- Comment #2 From yongsheng zhu 2009-09-27 18:16:16 PST -------
funambol: CLIENT_TEST_SERVER=funambol
CLIENT_TEST_SOURCES=vcard21,ical20,itodo20,text
CLIENT_TEST_SKIP=Client::Sync::ical20::Retry,Client::Sync::ical20::Suspend,Client::Sync::ical20::Resend,Client::Sync::vcard21::Retry,Client::Sync::vcard21::Suspend,Client::Sync::vcard21::Resend,Client::Sync::itodo20::Retry,Client::Sync::itodo20::Suspend,Client::Sync::itodo20::Resend,Client::Sync::text::Retry,Client::Sync::text::Suspend,Client::Sync::text::Resend,Client::Sync::vcard21_ical20_itodo20_text::Retry,Client::Sync::vcard21_ical20_itodo20_text::Suspend,Client::Sync::vcard21_ical20_itodo20_text::Resend,Client::Sync::text_itodo20_ical20_vcard21::Retry,Client::Sync::text_itodo20_ical20_vcard21::Suspend,Client::Sync::text_itodo20_ical20_vcard21::Resend
CLIENT_TEST_XML=1 CLIENT_TEST_MAX_ITEMSIZE=2048 CLIENT_TEST_DELAY=10
CLIENT_TEST_FAILURES= CLIENT_TEST_COMPARE_LOG=T CLIENT_TEST_RESEND_TIMEOUT=5
CLIENT_TEST_INTERRUPT_AT=1 SYNC_EVOLUTION_EVO_CALENDAR_DELAY=1
CLIENT_TEST_ALARM=1200 CLIENT_TEST_LOG=
CLIENT_TEST_EVOLUTION_PREFIX=file:///work/runtests/head/databases schroot -p -c
testing-amd64 -- /work/runtests/dbus-session.sh /work/runtests/valgrindcheck.sh
env LD_LIBRARY_PATH=build-synthesis/src/.libs ./client-test Client::Sync:
failed (return code 256)
Do you mean this return code 256? If yes, we'll add it soon.
------- Comment #3 From pohly 2009-09-27 22:51:59 PST -------
(In reply to comment #2)
> Do you mean this return code 256? If yes, we'll add it soon.
Yes, please. Any non-zero return indicates an error.
To detect whether valgrind ran and reported errors, you would have to scan the
text output.
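As an illustration, a minimal scan could look like this (a Python sketch; the sample text and the regular expression are assumptions about valgrind's output format, not code from the harness):

```python
import re

# Valgrind prefixes each of its report lines with "==<pid>==". If such lines
# show up in the captured test output, the failure came from valgrind rather
# than from an ordinary CPPUnit test failure.
sample_output = (
    "Client::Sync::vcard21::testItems: OK\n"
    "==12345== 100 bytes in 1 blocks are definitely lost\n"
)
valgrind_lines = [line for line in sample_output.splitlines()
                  if re.match(r"^==[0-9]+==", line)]
print("valgrind reported errors" if valgrind_lines else "clean")
```

This is the same pattern (`^==[0-9][0-9]*==`) that valgrindcheck.sh greps for in its log file.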
------- Comment #4 From yongsheng zhu 2009-09-28 19:02:39 PST -------
the fix is done and should go into effect today.
------- Comment #5 From pohly 2009-09-29 13:04:41 PST -------
(In reply to comment #3)
> To detect whether valgrind ran and reported errors, you would have to scan the
> text output.
(In reply to comment #4)
> the fix is done and should go into effect today.
How do you detect whether valgrind reported an error? The output from today's
test says "Valgrind: failed (return code 256)" for some servers, for example
Memotoo. But the output for valgrind doesn't show anything from valgrind:
http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing...
If there was an error from valgrind, can this be made available as a link in
the report? Either link to the output.txt or some reduced version of it with
just the valgrind reports.
Speaking of the report and links, today's report has some inconsistent counts
(negative number for skipped, zero total), no information about which tests
failed for a server, and the copies of the result files have permission issues
on runtests.syncevolution.org:
http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing...
=> permission denied
------- Comment #6 From yongsheng zhu 2009-09-29 18:15:21 PST -------
(In reply to comment #5)
> (In reply to comment #3)
> > To detect whether valgrind ran and reported errors, you would have to scan the
> > text output.
>
> (In reply to comment #4)
> > the fix is done and should go into effect today.
>
> How do you detect whether valgrind reported an error? The output from today's
> test says "Valgrind: failed (return code 256)" for some servers, for example
> Memotoo. But the output for valgrind doesn't show anything from valgrind:
>
> http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing...
Here is the result from this text output:
schroot -p -c testing-amd64 -- /work/runtests/dbus-session.sh
/work/runtests/valgrindcheck.sh env LD_LIBRARY_PATH=build-synthesis/src/.libs
./client-test Client::Sync: failed (return code 256)
> If there was an error from valgrind, can this be made available as a link in
> the report? Either link to the output.txt or some reduced version of it with
> just the valgrind reports.
I'll add it.
> Speaking of the report and links, today's report has some inconsistent counts
> (negative number for skipped, zero total), no information about which tests
> failed for a server and the copy of the result files have permission issues on
> runtests.syncevolution.org:
> http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing...
> => permission denied
Strange, I could access it. The negative number is because Congwu made some
changes to collecting all cases; we'll work on solving it. Thanks.
------- Comment #7 From pohly 2009-09-29 23:49:28 PST -------
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #3)
> > > To detect whether valgrind ran and reported errors, you would have to scan the
> > > text output.
> >
> > (In reply to comment #4)
> > > the fix is done and should go into effect today.
> >
> > How do you detect whether valgrind reported an error? The output from today's
> > test says "Valgrind: failed (return code 256)" for some servers, for example
> > Memotoo. But the output for valgrind doesn't show anything from valgrind:
> >
> > http://runtests.syncevolution.org/2009-09-28-22-00/head-evolution-testing...
>
> Here is the result from this text output:
> schroot -p -c testing-amd64 -- /work/runtests/dbus-session.sh
> /work/runtests/valgrindcheck.sh env LD_LIBRARY_PATH=build-synthesis/src/.libs
> ./client-test Client::Sync: failed (return code 256)
True, but the 256 return code wasn't caused by valgrindcheck.sh. There is no
error output from valgrind.
My apologies for a probably misleading comment earlier: 256 indicates that
there was a problem, which *might* have been an error found by valgrind.
Scanning the error output is necessary to determine whether it really was
valgrind or just some regular CPPUnit test failure.
Perhaps this can be changed: valgrindcheck.sh could return RET=100 if it found
something *and* override the return code even if the original application also
failed. Here's the patch:
--- a/test/valgrindcheck.sh
+++ b/test/valgrindcheck.sh
@@ -27,15 +27,13 @@ killvalgrind () {
( set -x; env GLIBCXX_FORCE_NEW=1 G_SLICE=always-malloc G_DEBUG=gc-friendly valgrind $VALGRIND_ARGS --l
RET=$?
-if [ $RET == 0 ]; then
- # give other valgrind instances some time to settle down, then kill them
- sleep 1
- killvalgrind -15
- sleep 5
- killvalgrind -9
- if grep '^==[0-9][0-9]*==' $LOGFILE >/dev/null; then
- RET=1
- fi
+# give other valgrind instances some time to settle down, then kill them
+sleep 1
+killvalgrind -15
+sleep 5
+killvalgrind -9
+if grep '^==[0-9][0-9]*==' $LOGFILE >/dev/null; then
+ RET=100
fi
exit $RET
------- Comment #8 From yongsheng zhu 2009-09-30 00:42:15 PST -------
> --- a/test/valgrindcheck.sh
> +++ b/test/valgrindcheck.sh
> @@ -27,15 +27,13 @@ killvalgrind () {
> ( set -x; env GLIBCXX_FORCE_NEW=1 G_SLICE=always-malloc G_DEBUG=gc-friendly valgrind $VALGRIND_ARGS --l
> RET=$?
> -if [ $RET == 0 ]; then
> - # give other valgrind instances some time to settle down, then kill them
> - sleep 1
> - killvalgrind -15
> - sleep 5
> - killvalgrind -9
> - if grep '^==[0-9][0-9]*==' $LOGFILE >/dev/null; then
> - RET=1
> - fi
> +# give other valgrind instances some time to settle down, then kill them
> +sleep 1
> +killvalgrind -15
> +sleep 5
> +killvalgrind -9
> +if grep '^==[0-9][0-9]*==' $LOGFILE >/dev/null; then
> + RET=100
> fi
> exit $RET
ok, I'll add this and check the return code. thanks
------- Comment #9 From yongsheng zhu 2009-10-12 22:03:19 PST -------
> ok, I'll add this and check the return code. thanks
I have submitted the changes. waiting for test results.
------- Comment #10 From yongsheng zhu 2009-10-14 20:05:41 PST -------
(In reply to comment #9)
> > ok, I'll add this and check the return code. thanks
>
> I have submitted the changes. waiting for test results.
These should be done according to test results.
------- Comment #11 From pohly 2009-10-16 09:16:07 PST -------
(In reply to comment #10)
> (In reply to comment #9)
> > > ok, I'll add this and check the return code. thanks
> >
> > I have submitted the changes. waiting for test results.
>
> These should be done according to test results.
The test results only showed "valgrind ok"; they never showed valgrind errors
being found. So I set out to trigger that by adding an intentional memory leak
to SyncContext::sync().
I found several problems:
- valgrind is run on the libtool shell wrapper with --trace-children=no,
so it never checked our code; fixed
- in ~tester/results/2009-10-16-17-33/head-evolution-testing/6-scheduleworld/output.txt
  one can see that valgrind found something, but the return code isn't 100
I suspect that the return code has to be decoded to distinguish integer error
codes from "killed by signal" results. Yongsheng, can you check this?
To reproduce this problem more or less quickly I used:
diff --git a/src/syncevo/SyncContext.cpp b/src/syncevo/SyncContext.cpp
index 499f54b..aa791f7 100644
--- a/src/syncevo/SyncContext.cpp
+++ b/src/syncevo/SyncContext.cpp
@@ -1462,6 +1462,9 @@ SyncMLStatus SyncContext::sync(SyncReport *report)
 {
     SyncMLStatus status = STATUS_OK;
+    // intentional memory leak
+    malloc(100);
+
     if (!exists()) {
         SE_LOG_ERROR(NULL, NULL, "No configuration for server \"%s\" found.",
                      m_server.c_str());
         throwError("cannot proceed without configuration");
and
runtests testing scheduleworld sanity
The malloc() has been removed again and is not needed anymore, because valgrind
finds problems in some of the non-SyncEvolution processes that we end up
tracing. Either we write suppressions for these problems or we must configure
valgrind to only trace our binary (might be hard).
------- Comment #12 From yongsheng zhu 2009-10-18 22:44:51 PST -------
> I suspect that the return code has to be decoded to distinguish integer error
> codes from "killed by signal" results. Yongsheng, can you check this?
On Unix, the return value of os.system() is the exit status of the process,
encoded in the format specified for os.wait().
For os.wait(), the return value is a 16-bit number whose low byte is the signal
number that killed the process, and whose high byte is the exit status (if the
signal number is zero).
Thus the return codes printed in the result are always multiples of 256.
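As a sketch of the decoding (Python, assuming the harness goes through os.system(); the `exit 100` command merely stands in for valgrindcheck.sh signalling a valgrind finding):

```python
import os

# os.system() returns the raw wait() status: exit code in the high byte,
# killing signal in the low byte. Exit status 100 therefore shows up as
# 25600 (100 << 8) unless it is decoded first.
status = os.system("exit 100")

if os.WIFSIGNALED(status):
    # process was killed: report the signal number
    print("killed by signal", os.WTERMSIG(status))
else:
    # normal termination: report the real exit status
    print("exit code", os.WEXITSTATUS(status))
```

With this decoding in place the harness can reliably check for the special value 100 instead of comparing against raw multiples of 256.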
------- Comment #13 From yongsheng zhu 2009-10-19 00:29:05 PST -------
(In reply to comment #12)
> > I suspect that the return code has to be decoded to distinguish integer error
> > codes from "killed by signal" results. Yongsheng, can you check this?
>
> On Unix, the return value of os.system() is the exit status of the process,
> encoded in the format specified for os.wait().
> For os.wait(), the return value is a 16-bit number whose low byte is the signal
> number that killed the process, and whose high byte is the exit status (if the
> signal number is zero).
> Thus the return codes printed in the result are always multiples of 256.
The patch for this issue was also committed on the 'nightly' branch in the
directory '~tester/syncevolution-testing/syncevolution'.
------- Comment #14 From yongsheng zhu 2009-10-19 00:33:05 PST -------
> Either we write suppressions for these problems or we must configure valgrind
> to only trace our binary (might be hard).
We would have to add *too many* suppressions if we go that route. But it seems
even harder to me to configure valgrind to only trace our binary.
------- Comment #15 From yongsheng zhu 2009-10-19 18:20:26 PST -------
> I found several problems:
> - valgrind is run on the libtool shell wrapper with --trace-children=no,
>   so it never checked our code; fixed
> - in
A problem is that the nightly test cannot be completed on the same day it
starts, because valgrind makes the applications run slowly.
------- Comment #16 From pohly 2009-10-20 00:53:17 PST -------
(In reply to comment #14)
> > Either we write suppressions for these problems or we must configure
> > valgrind to only trace our binary (might be hard).
>
> We have to add *too many* suppressions if using suppressions.
How many errors did you find? Looking at tonight's "scheduleworld" run I see
one error for /usr/lib/bonobo-activation/bonobo-activation-server and one for
an unknown binary. Everything else is in SyncEvolution. "Everything" luckily
isn't that much: one leak related to libical.
This doesn't look that bad.
What is more annoying is that multiple processes writing into the same valgrind
file seem to add nul bytes to the file. This is visible when viewing with
"less" and makes me wonder whether the output is really complete.
So, here are two ideas how we can avoid this:
* compile the "testing" binaries with --disable-shared --enable-static to
avoid the libtool wrapper script, then stop using --trace-children=yes
* run the bonobo-activation-server in advance, so that valgrind doesn't
catch it
I'll have a look at this.
(In reply to comment #15)
> > I found several problems:
> > - valgrind is run on the libtool shell wrapper with --trace-children=no,
> >   so it never checked our code; fixed
> > - in
>
> A problem is that the nightly test cannot be completed on the same day it
> starts, because valgrind makes the applications run slowly.
Is that a real problem for us or hypothetical? Tonight's run hasn't completed
yet, but that isn't unusual. valgrind is indeed a problem for CPU-bound code,
but we are mostly limited by IO performance and the obligatory sleep()s that we
need to avoid overloading servers and for the time-based change tracking.
------- Comment #17 From yongsheng zhu 2009-10-20 18:25:47 PST -------
> Is that a real problem for us or hypothetical? Tonight's run hasn't completed
> yet, but that isn't unusual. valgrind is indeed a problem for CPU-bound code,
> but we are mostly limited by IO performance and the obligatory sleep()s that we
> need to avoid overloading servers and for the time-based change tracking.
We have had such a case: Friday's head-evolution-testing run was only completed
on Sunday morning. The most important concern is that overlapping runs would
share some resources, such as the valgrind output file
(/work/runtests/valgrind.out). Two processes reading/writing it concurrently
could make the valgrind check behave unexpectedly.
------- Comment #18 From yongsheng zhu 2009-10-20 18:37:39 PST -------
> How many errors did you find? Looking at tonight's "scheduleworld" run I see
> one error for /usr/lib/bonobo-activation/bonobo-activation-server and one for
> an unknown binary. Everything else is in SyncEvolution. "Everything" luckily
> isn't that much: one leak related to libical.
> This doesn't look that bad.
Yes, you are right. Each 'interoperability server' has several errors, some of
them the same. So I'll add them to the suppression file if they are not sync
memory errors:
scheduleworld: 4
synthesis:11
funambol: 6
zyb:7
google: 7
mobical: 14
memotoo:12
------- Comment #19 From yongsheng zhu 2009-10-20 20:17:52 PST -------
> Yes, you are right. Each 'interoperability server' has several errors, some of
> them the same. So I'll add them to the suppression file if they are not sync
> memory errors:
>
> scheduleworld: 4
> synthesis: 11
> funambol: 6
> zyb: 7
> google: 7
> mobical: 14
> memotoo: 12
I've added the memory errors of other libraries to evo.supp. There are 3 memory
errors from synthesis; bug #7121 has been opened to track them.
------- Comment #20 From pohly 2009-10-21 03:17:35 PST -------
(In reply to comment #19)
> Yes, you are right. Each 'interoperability server' has several errors, some of
> them the same. So I'll add them to the suppression file if they are not sync
> memory errors:
>
> scheduleworld: 4
> synthesis: 11
> funambol: 6
> zyb: 7
> google: 7
> mobical: 14
> memotoo: 12
>
> I've added the memory errors of other libraries to evo.supp.
Careful here, please. For example, you added this one here:
+# ==30565== 6,564 bytes in 144 blocks are definitely lost in loss record 48 of 72
+# ==30565==    at 0x4C2391E: malloc (vg_replace_malloc.c:207)
+# ==30565==    by 0x5B1FA21: strdup (in /lib/libc-2.9.so)
+# ==30565==    by 0xD9E080C: icalparameter_new_from_value_string (in /usr/lib/libical.so.0.43.0)
+# ==30565==    by 0xD9F6AF0: icalparser_add_line (in /usr/lib/libical.so.0.43.0)
+# ==30565==    by 0xD9F7088: icalparser_parse (in /usr/lib/libical.so.0.43.0)
+# ==30565==    by 0xD9F7230: icalparser_parse_string (in /usr/lib/libical.so.0.43.0)
+# ==30565==    by 0xA2CB224: SyncEvo::EvolutionCalendarSource::insertItem(std::string const&, std::stri
The suppression for it contains the full stack. That has the effect that the
slightest change in how we call the function will no longer match the
suppression, causing the error to pop up again. Other calls into
icalparser_parse_string() are not matched by the suppression either.
When adding a new suppression, think carefully about which stack entries to
include in the match. In the example above,
malloc
strdup
icalparameter_new_from_value_string
would be much too broad. Including SyncEvolution as the caller is too narrow.
The right approach IMHO would be to identify the leak with
icalparser_parse_string() as the top-level call. Because it does not return the
string which is leaked here, we can be reasonably sure that our code is not at
fault.
A similar problem occurs when specifying object files:
obj:/usr/lib/libgnome-2.so.0.2600.0
This won't match anymore when the library version or install path changes. It's
better to use wildcards:
obj:*libgnome-2*
As before, use common sense to choose between "too limited" and "too broad".
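Combining both rules, a suppression along the following lines should match the libical leak regardless of which code path reaches the parser. This is an untested sketch: the suppression name is made up, and the '...' frame wildcard requires a valgrind version that supports it:

```
{
   libical-parse-string-leak
   Memcheck:Leak
   fun:malloc
   ...
   fun:icalparser_parse_string
}
```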
All of the new suppressions should be reviewed taking that into account.
However, perhaps we don't need them at all. I'm about to try out compiling with
--disable-shared for valgrind testing.
------- Comment #21 From pohly 2009-10-21 05:58:17 PST -------
(In reply to comment #16)
> * compile the "testing" binaries with --disable-shared --enable-static to
>   avoid the libtool wrapper script, then stop using --trace-children=yes
I've made that change. I'm going to run the full test once to check how slow it
really is. The cron job is disabled and needs to be reactivated.
------- Comment #22 From yongsheng zhu 2009-10-21 18:24:41 PST -------
> The suppression for it contains the full stack. That has the effect that the
> slightest change in how we call the function will no longer match the
> suppression, causing the error to pop up again. Other calls into
> icalparser_parse_string() are not matched by the suppression either.
>
> When adding a new suppression, think carefully about which stack entries to
> include in the match. In the example above,
> malloc
> strdup
> icalparameter_new_from_value_string
> would be much too broad. Including SyncEvolution as the caller is too narrow.
> The right approach IMHO would be to identify the leak with
> icalparser_parse_string() as the top-level call. Because it does not return the
> string which is leaked here, we can be reasonably sure that our code is not at
> fault.
>
> A similar problem occurs when specifying object files:
> obj:/usr/lib/libgnome-2.so.0.2600.0
> This won't match anymore when the library version or install path changes. It's
> better to use wildcards:
> obj:*libgnome-2*
> As before, use common sense to choose between "too limited" and "too broad".
Yes, reasonable. I'll refine the suppressions.
------- Comment #23 From yongsheng zhu 2009-10-21 18:28:40 PST -------
> However, perhaps we don't need them at all. I'm about to try out compiling
> with --disable-shared for valgrind testing.
OK, I'll hold off on fixing up the suppressions and wait for your experimental
results.
------- Comment #24 From pohly 2009-10-23 04:35:01 PST -------
(In reply to comment #21)
> (In reply to comment #16)
> > * compile the "testing" binaries with --disable-shared --enable-static to
> >   avoid the libtool wrapper script, then stop using --trace-children=yes
>
> I've made that change. I'm going to run the full test once to check how slow it
> really is.
The whole run completed in 9 hours. This is fast enough to not overlap, but to
be sure I added a check to runtests.
It seems that someone has already re-enabled the cron job.
(In reply to comment #23)
> > However, perhaps we don't need them at all. I'm about to try out compiling
> > with --disable-shared for valgrind testing.
>
> OK, I'll hold off on fixing up the suppressions and wait for your experimental
> results.
I removed them from the "nightly" branch. They are in bug6547-valgrind, so if
we find that we need them, we can merge and refine them.
------- Comment #25 From pohly 2009-12-09 08:39:35 PST -------
(In reply to comment #24)
> (In reply to comment #21)
> > (In reply to comment #16)
> > > * compile the "testing" binaries with --disable-shared --enable-static to
> > >   avoid the libtool wrapper script, then stop using --trace-children=yes
> >
> > I've made that change. I'm going to run the full test once to check how slow it
> > really is.
>
> The whole run completed in 9 hours. This is fast enough to not overlap, but to
> be sure I added a check to runtests.
I'm not sure whether I ran the right binary at that time.
When rechecking this issue I found that an intentional memory leak was not
reported. Looking further, I found that this was because valgrind was invoked
like this:
/work/runtests/dbus-session.sh /work/runtests/valgrindcheck.sh env
LD_LIBRARY_PATH=build-synthesis/src/.libs ./client-test
Client::Sync::vcard30::testItems
In other words, valgrind was checking the "env" command and, due to not tracing
children, not client-test.
I've changed the order. Expect "Valgrind: failed" errors. I'll rebase on the
latest upstream libsynthesis, then we'll need to work on fixing or suppressing
issues.
--
Configure bugmail:
http://bugzilla.meego.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
You are watching someone on the CC list of the bug.